* [PATCH 00 of 16] OOM related fixes
From: Andrea Arcangeli @ 2007-06-08 20:02 UTC
To: linux-mm
Hello everyone,

this is a set of fixes done in the context of a rather evil workload: many
tasks reading large files from NFS in parallel, with big read buffers, until
the system goes OOM. Most of these fixes turned out to be required to fix the
customer workload on top of an older SLES kernel. The forward port of the
fixes has already been tested successfully on similar evil workloads.
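
For reference, the reader side of such a workload can be approximated with a
few lines of C along these lines (illustration only: the file path, task
count and buffer size below are placeholders, not the values used in the
actual test):

	#include <fcntl.h>
	#include <stdlib.h>
	#include <sys/wait.h>
	#include <unistd.h>

	#define NTASKS    64                      /* parallel readers */
	#define BUFSIZE   (8 << 20)               /* 8MB read buffer per task */
	#define FILE_PATH "/mnt/nfs/bigfile"      /* placeholder NFS path */

	static void reader(void)
	{
		char *buf = malloc(BUFSIZE);
		int fd = open(FILE_PATH, O_RDONLY);
		ssize_t n;

		if (!buf || fd < 0)
			exit(1);
		for (;;) {                        /* keep re-reading until the box goes oom */
			lseek(fd, 0, SEEK_SET);
			while ((n = read(fd, buf, BUFSIZE)) > 0)
				;
		}
	}

	int main(void)
	{
		int i;

		for (i = 0; i < NTASKS; i++)
			if (fork() == 0)
				reader();
		while (wait(NULL) > 0)
			;
		return 0;
	}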
mainline vanilla running a somewhat simulated workload:
Jun 8 06:06:56 kvm kernel: Out of memory: Killed process 3282 (klauncher).
Jun 8 06:17:35 kvm kernel: Out of memory: kill process 3002 (qmgr) score 11225 or a child
Jun 8 06:17:35 kvm kernel: Out of memory: kill process 3001 (pickup) score 11216 or a child
Jun 8 06:17:35 kvm kernel: Out of memory: kill process 2186 (hald) score 11004 or a child
Jun 8 06:17:35 kvm kernel: Out of memory: kill process 3515 (bash) score 9447 or a child
Jun 8 06:17:35 kvm kernel: Out of memory: kill process 2186 (hald) score 8558 or a child
Jun 8 06:17:35 kvm kernel: Out of memory: kill process 2142 (dbus-daemon) score 5591 or a child
Jun 8 06:17:35 kvm kernel: Out of memory: kill process 3549 (recursive_readd) score 4597 or a child
Jun 8 06:17:43 kvm kernel: Out of memory: kill process 3591 (pickup) score 9756 or a child
Jun 8 06:17:43 kvm kernel: Out of memory: kill process 2204 (hald-addon-acpi) score 4121 or a child
Jun 8 06:17:43 kvm kernel: Out of memory: kill process 3515 (bash) score 3808 or a child
Jun 8 06:17:45 kvm kernel: Out of memory: kill process 3555 (recursive_readd) score 2330 or a child
Jun 8 06:17:53 kvm kernel: Out of memory: kill process 3554 (recursive_readd) score 2605 or a child
Jun 8 06:18:00 kvm kernel: Out of memory: kill process 3170 (nscd) score 1985 or a child
Jun 8 06:18:00 kvm kernel: Out of memory: kill process 3187 (nscd) score 1985 or a child
Jun 8 06:18:00 kvm kernel: Out of memory: kill process 3188 (nscd) score 1985 or a child
Jun 8 06:18:00 kvm kernel: Out of memory: kill process 2855 (portmap) score 1965 or a child
Jun 8 06:18:00 kvm kernel: Out of memory: kill process 3551 (recursive_readd) score 859 or a child
[ eventually it deadlocks and stops killing new tasks ]
mainline + fixes running the same simulated workload:
Jun 8 13:35:32 kvm kernel: Out of memory: kill process 3494 (recursive_readd) score 3822 or a child
Jun 8 13:35:33 kvm kernel: Out of memory: kill process 3494 (recursive_readd) score 3822 or a child
Jun 8 13:35:33 kvm kernel: Out of memory: kill process 3494 (recursive_readd) score 3822 or a child
Jun 8 13:37:33 kvm kernel: Out of memory: kill process 3505 (recursive_readd) score 622 or a child
Jun 8 13:37:34 kvm kernel: Out of memory: kill process 3510 (recursive_readd) score 418 or a child
Jun 8 13:37:36 kvm kernel: Out of memory: kill process 3535 (recursive_readd) score 377 or a child
Jun 8 13:37:36 kvm kernel: Out of memory: kill process 3498 (recursive_readd) score 370 or a child
Jun 8 13:37:36 kvm kernel: Out of memory: kill process 3516 (recursive_readd) score 364 or a child
Jun 8 13:37:36 kvm kernel: Out of memory: kill process 3515 (recursive_readd) score 357 or a child
Jun 8 13:40:49 kvm kernel: Out of memory: kill process 3537 (recursive_readd) score 2391 or a child
Jun 8 13:40:50 kvm kernel: Out of memory: kill process 3537 (recursive_readd) score 2391 or a child
Jun 8 13:40:50 kvm kernel: Out of memory: kill process 3537 (recursive_readd) score 2391 or a child
Jun 8 13:40:50 kvm kernel: Out of memory: kill process 3537 (recursive_readd) score 2391 or a child
Jun 8 13:40:50 kvm kernel: Out of memory: kill process 3537 (recursive_readd) score 2391 or a child
Jun 8 13:40:50 kvm kernel: Out of memory: kill process 3537 (recursive_readd) score 2391 or a child
Jun 8 13:40:50 kvm kernel: Out of memory: kill process 3537 (recursive_readd) score 2391 or a child
Jun 8 13:40:50 kvm kernel: Out of memory: kill process 3537 (recursive_readd) score 2391 or a child
Jun 8 13:40:50 kvm kernel: Out of memory: kill process 3537 (recursive_readd) score 2391 or a child
Jun 8 13:40:50 kvm kernel: Out of memory: kill process 3537 (recursive_readd) score 2391 or a child
Jun 8 13:40:51 kvm kernel: Out of memory: kill process 3537 (recursive_readd) score 2391 or a child
Jun 8 13:40:51 kvm kernel: Out of memory: kill process 3537 (recursive_readd) score 2391 or a child
Jun 8 13:40:51 kvm kernel: Out of memory: kill process 3537 (recursive_readd) score 2391 or a child
Jun 8 13:40:51 kvm kernel: Out of memory: kill process 3537 (recursive_readd) score 2391 or a child
Jun 8 13:41:55 kvm kernel: Out of memory: kill process 3558 (recursive_readd) score 356 or a child
Jun 8 13:41:56 kvm kernel: Out of memory: kill process 3578 (recursive_readd) score 355 or a child
Jun 8 13:41:56 kvm kernel: Out of memory: kill process 3577 (recursive_readd) score 350 or a child
Jun 8 13:41:56 kvm kernel: Out of memory: kill process 3572 (recursive_readd) score 347 or a child
Jun 8 13:41:56 kvm kernel: Out of memory: kill process 3568 (recursive_readd) score 346 or a child
The oom deadlock detection triggers a couple of times against the PG_locked
deadlock:
Jun 8 13:51:19 kvm kernel: Killed process 3504 (recursive_readd)
Jun 8 13:51:19 kvm kernel: detected probable OOM deadlock, so killing another task
Jun 8 13:51:19 kvm kernel: Out of memory: kill process 3532 (recursive_readd) score 1225 or a child
Example stack trace of a task killed with TIF_MEMDIE (I did not literally
verify that this particular task had TIF_MEMDIE set, but the trace is the
same as the one verified earlier):
recursive_rea D ffff810001056418 0 3548 3544 (NOTLB)
ffff81000e57dba8 0000000000000082 ffff8100010af5e8 ffff8100148df730
ffff81001ff3ea10 0000000000bd2e1b ffff8100148df908 0000000000000046
ffff81001fd5f170 ffffffff8031c36d ffff81001fd5f170 ffff810001056418
Call Trace:
[<ffffffff8031c36d>] __generic_unplug_device+0x13/0x24
[<ffffffff80244163>] sync_page+0x0/0x40
[<ffffffff804cdf5b>] io_schedule+0xf/0x17
[<ffffffff8024419e>] sync_page+0x3b/0x40
[<ffffffff804ce162>] __wait_on_bit_lock+0x36/0x65
[<ffffffff80244150>] __lock_page+0x5e/0x64
[<ffffffff802321f1>] wake_bit_function+0x0/0x23
[<ffffffff802440c0>] find_get_page+0xe/0x40
[<ffffffff80244a33>] do_generic_mapping_read+0x200/0x450
[<ffffffff80243f26>] file_read_actor+0x0/0x11d
[<ffffffff80247fd4>] get_page_from_freelist+0x2d3/0x36e
[<ffffffff802464d0>] generic_file_aio_read+0x11d/0x159
[<ffffffff80260bdc>] do_sync_read+0xc9/0x10c
[<ffffffff80252adb>] vma_merge+0x10c/0x195
[<ffffffff802321c3>] autoremove_wake_function+0x0/0x2e
[<ffffffff80253a06>] do_mmap_pgoff+0x5e1/0x74c
[<ffffffff8026134d>] vfs_read+0xaa/0x132
[<ffffffff80261662>] sys_read+0x45/0x6e
[<ffffffff8020991e>] system_call+0x7e/0x83
* [PATCH 01 of 16] remove nr_scan_inactive/active
From: Andrea Arcangeli @ 2007-06-08 20:02 UTC
To: linux-mm

# HG changeset patch
# User Andrea Arcangeli <andrea@suse.de>
# Date 1181332959 -7200
# Node ID 8e38f7656968417dfee09fbb6450a8f1e70f8b21
# Parent  8b84ac74c8464bb6e4a2c08ff2a656d06c8667ca
remove nr_scan_inactive/active

The older atomic_add/atomic_set were pointless (atomic_set vs atomic_add
would race), but removing them didn't actually remove the race: the race is
still there, for the same reasons atomic_add/set couldn't prevent it. This
is really the kind of code I dislike because it's sort of buggy, it
shouldn't be making any measurable difference, and when it does something
for real it can only hurt!

The real focus is on shrink_zone (ignore the other places where it's being
used, which are even less interesting). Assume two tasks add to
nr_scan_*active at the same time (first line of the old buggy code): they'll
effectively double their scan rate, for no good reason. What can happen is
that instead of scanning nr_entries each, they'll scan nr_entries*2 each.
The more CPUs, the bigger the race, the higher the multiplication effect,
and the harder it will be to detect oom.

In the case that nr_*active < sc->swap_cluster_max, regardless of any future
invocation of alloc_pages, we'll be going down the priorities in the current
alloc_pages invocation if DEF_PRIORITY was too high to do any work, so again
accumulating nr_scan_*active doesn't seem interesting even when it's smaller
than sc->swap_cluster_max. Each task should work for itself without much
care for what the others are doing.

Signed-off-by: Andrea Arcangeli <andrea@suse.de>

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -220,8 +220,6 @@ struct zone {
 	spinlock_t		lru_lock;
 	struct list_head	active_list;
 	struct list_head	inactive_list;
-	unsigned long		nr_scan_active;
-	unsigned long		nr_scan_inactive;
 	unsigned long		pages_scanned;	   /* since last reclaim */
 	int			all_unreclaimable; /* All pages pinned */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2649,8 +2649,6 @@ static void __meminit free_area_init_cor
 		zone_pcp_init(zone);
 		INIT_LIST_HEAD(&zone->active_list);
 		INIT_LIST_HEAD(&zone->inactive_list);
-		zone->nr_scan_active = 0;
-		zone->nr_scan_inactive = 0;
 		zap_zone_vm_stats(zone);
 		atomic_set(&zone->reclaim_in_progress, 0);
 		if (!size)
diff --git a/mm/vmscan.c b/mm/vmscan.c
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -915,20 +915,11 @@ static unsigned long shrink_zone(int pri
 	 * Add one to `nr_to_scan' just to make sure that the kernel will
 	 * slowly sift through the active list.
 	 */
-	zone->nr_scan_active +=
-		(zone_page_state(zone, NR_ACTIVE) >> priority) + 1;
-	nr_active = zone->nr_scan_active;
-	if (nr_active >= sc->swap_cluster_max)
-		zone->nr_scan_active = 0;
-	else
+	nr_active = zone_page_state(zone, NR_ACTIVE) >> priority;
+	if (nr_active < sc->swap_cluster_max)
 		nr_active = 0;
-
-	zone->nr_scan_inactive +=
-		(zone_page_state(zone, NR_INACTIVE) >> priority) + 1;
-	nr_inactive = zone->nr_scan_inactive;
-	if (nr_inactive >= sc->swap_cluster_max)
-		zone->nr_scan_inactive = 0;
-	else
+	nr_inactive = zone_page_state(zone, NR_INACTIVE) >> priority;
+	if (nr_inactive < sc->swap_cluster_max)
 		nr_inactive = 0;

 	while (nr_active || nr_inactive) {
@@ -1392,22 +1383,14 @@ static unsigned long shrink_all_zones(un

 			/* For pass = 0 we don't shrink the active list */
 			if (pass > 0) {
-				zone->nr_scan_active +=
-					(zone_page_state(zone, NR_ACTIVE) >> prio) + 1;
-				if (zone->nr_scan_active >= nr_pages || pass > 3) {
-					zone->nr_scan_active = 0;
-					nr_to_scan = min(nr_pages,
-						zone_page_state(zone, NR_ACTIVE));
+				nr_to_scan = (zone_page_state(zone, NR_ACTIVE) >> prio) + 1;
+				if (nr_to_scan >= nr_pages || pass > 3) {
 					shrink_active_list(nr_to_scan, zone, sc, prio);
 				}
 			}

-			zone->nr_scan_inactive +=
-				(zone_page_state(zone, NR_INACTIVE) >> prio) + 1;
-			if (zone->nr_scan_inactive >= nr_pages || pass > 3) {
-				zone->nr_scan_inactive = 0;
-				nr_to_scan = min(nr_pages,
-					zone_page_state(zone, NR_INACTIVE));
+			nr_to_scan = (zone_page_state(zone, NR_INACTIVE) >> prio) + 1;
+			if (nr_to_scan >= nr_pages || pass > 3) {
 				ret += shrink_inactive_list(nr_to_scan, zone, sc);
 				if (ret >= nr_pages)
 					return ret;
diff --git a/mm/vmstat.c b/mm/vmstat.c
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -554,7 +554,7 @@ static int zoneinfo_show(struct seq_file
 		   "\n        min      %lu"
 		   "\n        low      %lu"
 		   "\n        high     %lu"
-		   "\n        scanned  %lu (a: %lu i: %lu)"
+		   "\n        scanned  %lu"
 		   "\n        spanned  %lu"
 		   "\n        present  %lu",
 		   zone_page_state(zone, NR_FREE_PAGES),
@@ -562,7 +562,6 @@ static int zoneinfo_show(struct seq_file
 		   zone->pages_low,
 		   zone->pages_high,
 		   zone->pages_scanned,
-		   zone->nr_scan_active, zone->nr_scan_inactive,
 		   zone->spanned_pages,
 		   zone->present_pages);
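
The interleaving the changelog describes can be mimicked in a few lines of
plain user-space C (illustration of the pattern only, not the kernel code):
two racing "reclaimers" both add their quota to a shared, unlocked counter
and then read it back, so each can end up with roughly twice the scan work
it intended.

	#include <pthread.h>
	#include <stdio.h>

	static unsigned long nr_scan;                /* stands in for zone->nr_scan_active */
	static const unsigned long quota = 1000;     /* what each task meant to scan */
	static const unsigned long cluster_max = 32; /* stands in for sc->swap_cluster_max */

	static void *reclaimer(void *arg)
	{
		unsigned long mine;

		nr_scan += quota;    /* unlocked read-modify-write, like the old code */
		mine = nr_scan;      /* may already contain the other task's addition */
		if (mine >= cluster_max)
			nr_scan = 0;
		else
			mine = 0;
		/* with an unlucky interleaving both tasks print 2000 here */
		printf("task %ld scans %lu pages (intended %lu)\n",
		       (long)arg, mine, quota);
		return NULL;
	}

	int main(void)
	{
		pthread_t t1, t2;

		pthread_create(&t1, NULL, reclaimer, (void *)1L);
		pthread_create(&t2, NULL, reclaimer, (void *)2L);
		pthread_join(t1, NULL);
		pthread_join(t2, NULL);
		return 0;
	}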
* Re: [PATCH 01 of 16] remove nr_scan_inactive/active
From: Rik van Riel @ 2007-06-10 17:36 UTC
To: Andrea Arcangeli; +Cc: linux-mm

Andrea Arcangeli wrote:

> -	else
> +	nr_inactive = zone_page_state(zone, NR_INACTIVE) >> priority;
> +	if (nr_inactive < sc->swap_cluster_max)
>  		nr_inactive = 0;

This is a problem.

On workloads with lots of anonymous memory, for example running a very
large JVM or simply stressing the system with AIM7, the inactive list
can be very small.

If dozens (or even hundreds) of tasks get into the pageout code
simultaneously, they will all spend a lot of time moving pages from the
active to the inactive list, but they will not even try to free any of
the (few) inactive pages the system has!

We have observed systems in stress tests that spent well over 10 minutes
in shrink_active_list before the first call to shrink_inactive_list was
made.

Your code looks like it could exacerbate that situation, by not having
zone->nr_scan_inactive increment between calls.

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.
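
To put rough numbers on the situation described above (illustrative figures
only, not taken from the AIM7 runs): with an active list of about 500,000
pages, an inactive list of about 2,000 pages, swap_cluster_max = 32 and
DEF_PRIORITY = 12, the patched code computes

	nr_active   = 500000 >> 12 = 122   (>= 32, so shrink_active_list runs)
	nr_inactive =   2000 >> 12 =   0   (<  32, so it is forced to 0)

and nr_inactive stays below swap_cluster_max until priority has dropped all
the way to 5 (2000 >> 5 = 62), so every reclaimer spends its first several
priority levels exclusively rotating the active list.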
* Re: [PATCH 01 of 16] remove nr_scan_inactive/active
From: Andrea Arcangeli @ 2007-06-10 18:17 UTC
To: Rik van Riel; +Cc: linux-mm

On Sun, Jun 10, 2007 at 01:36:46PM -0400, Rik van Riel wrote:
> Andrea Arcangeli wrote:
>
> > -	else
> > +	nr_inactive = zone_page_state(zone, NR_INACTIVE) >> priority;
> > +	if (nr_inactive < sc->swap_cluster_max)
> >  		nr_inactive = 0;
>
> This is a problem.
>
> On workloads with lots of anonymous memory, for example running a very
> large JVM or simply stressing the system with AIM7, the inactive list
> can be very small.
>
> If dozens (or even hundreds) of tasks get into the pageout code
> simultaneously, they will all spend a lot of time moving pages from the
> active to the inactive list, but they will not even try to free any of
> the (few) inactive pages the system has!
>
> We have observed systems in stress tests that spent well over 10
> minutes in shrink_active_list before the first call to
> shrink_inactive_list was made.
>
> Your code looks like it could exacerbate that situation, by not having
> zone->nr_scan_inactive increment between calls.

If all tasks spend 10 minutes in shrink_active_list before the first call
to shrink_inactive_list, that could mean you hit the very race I'm trying
to fix with this patch (i.e. nr_*active going totally huge because of the
race triggering, and thrashing over the few pages left in the *active_list
until the artificially boosted nr_*active finally goes down to zero in all
the tasks that read it at the unlucky time when it got huge).

So my patch may actually fix your situation completely, if your trouble was
nr_scan_active becoming huge for no good reason just because many tasks
entered the VM at the same time on a big-SMP system. Did you monitor the
real sizes of the active lists during those 10 minutes and compare them to
the nr_active stored on the stack?

Normally if the highest priority passes only call into shrink_active_list,
that's because the two lists need rebalancing. But I fail to see how it
could ever take 10 minutes for the first shrink_inactive_list to trigger
with my patch applied, while if it happens in current vanilla that could be
the race triggering, or anyway something unrelated going wrong in the VM.

Overall this code seems quite flaky in its current "racy" form, so I doubt
it can be allowed to live as-is. In fact, even if we fix the race with a
slow shared lock in a fast path, or if we only make sure not to exacerbate
your situation with something as simple and lock-less as
"nr_active = min(sizeof_active_list, nr_scan_active)", I think it would
still be wrong to do more work in the current task if we've got other tasks
helping us at the same time. We should do nothing more, nothing less.

So I think that if we want those counters to avoid restarting from zero at
each priority step (which I understand is your worry), those counters
should be on the stack, task-local. That will still take into account the
previously not scanned "nr_inactive" value.

Not sure what's best. I have the feeling that introducing task-local
nr_active/nr_inactive counters shared by all priority steps won't move the
VM needle much, but I sure wouldn't be against it. It will change the
balancing to be more fair, but in practice I don't expect huge differences;
there are only 12 steps anyway, and very quickly the inactive list should
be shrunk even if the active list is huge. I'm only generally against the
current per-zone, global and racy approach without limits, potentially
exacerbating your situation when nr_active becomes very huge despite the
active list being very small.

Thanks.
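
A stand-alone sketch of the task-local variant floated above (illustration
only; the names and the clamp to the real list size are mine, not code from
the patch set): the carry-over lives with the caller rather than in the
zone, so concurrent reclaimers cannot inflate each other's quota, while work
skipped at one priority level is still remembered at the next.

	#include <stdio.h>

	#define SWAP_CLUSTER_MAX 32UL

	static unsigned long scan_quota(unsigned long list_pages, int priority,
	                                unsigned long *carry)
	{
		unsigned long nr;

		*carry += (list_pages >> priority) + 1;           /* remember skipped work... */
		nr = *carry < list_pages ? *carry : list_pages;   /* ...but never exceed the list */
		if (nr >= SWAP_CLUSTER_MAX)
			*carry = 0;
		else
			nr = 0;
		return nr;
	}

	int main(void)
	{
		unsigned long carry = 0;   /* task-local, lives on this task's stack */
		int prio;

		/* walk a 2000 page inactive list down the priority levels */
		for (prio = 12; prio >= 0; prio--)
			printf("priority %2d: scan %4lu pages (carry %lu)\n",
			       prio, scan_quota(2000, prio, &carry), carry);
		return 0;
	}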
* Re: [PATCH 01 of 16] remove nr_scan_inactive/active
From: Rik van Riel @ 2007-06-11 14:58 UTC
To: Andrea Arcangeli; +Cc: linux-mm

Andrea Arcangeli wrote:
> On Sun, Jun 10, 2007 at 01:36:46PM -0400, Rik van Riel wrote:
>> Andrea Arcangeli wrote:
>>
>>> -	else
>>> +	nr_inactive = zone_page_state(zone, NR_INACTIVE) >> priority;
>>> +	if (nr_inactive < sc->swap_cluster_max)
>>>  		nr_inactive = 0;
>>
>> This is a problem.
>>
>> On workloads with lots of anonymous memory, for example running a very
>> large JVM or simply stressing the system with AIM7, the inactive list
>> can be very small.
>>
>> If dozens (or even hundreds) of tasks get into the pageout code
>> simultaneously, they will all spend a lot of time moving pages from the
>> active to the inactive list, but they will not even try to free any of
>> the (few) inactive pages the system has!
>>
>> We have observed systems in stress tests that spent well over 10
>> minutes in shrink_active_list before the first call to
>> shrink_inactive_list was made.
>>
>> Your code looks like it could exacerbate that situation, by not having
>> zone->nr_scan_inactive increment between calls.
>
> If all tasks spend 10 minutes in shrink_active_list before the first
> call to shrink_inactive_list that could mean you hit the race that I'm
> just trying to fix with this very patch. (i.e. nr_*active going
> totally huge because of the race triggering,

Nope.  In this case it spends its time in shrink_active_list because the
active list is 99% of memory (several GB) while the inactive list is so
small that nr_inactive_pages >> priority is zero.

> Normally if the highest priority passes only call into
> shrink_active_list, that's because the two lists need rebalancing. But
> I fail to see how it could ever take 10 minutes for the first
> shrink_inactive_list to trigger with my patch applied, while if it
> happens in current vanilla that could be the race triggering, or
> anyway something unrelated going wrong in the VM.

Yeah, I have no real objection to your patch, but was just pointing out
that it does not fix the big problem with this code.

--
All Rights Reversed
* Re: [PATCH 01 of 16] remove nr_scan_inactive/active
From: Rik van Riel @ 2007-06-26 17:08 UTC
To: Andrea Arcangeli; +Cc: linux-mm, Andrew Morton

Andrea Arcangeli wrote:

> If all tasks spend 10 minutes in shrink_active_list before the first
> call to shrink_inactive_list that could mean you hit the race that I'm
> just trying to fix with this very patch.

I got around to testing it now.  I am using AIM7 since it is a very
anonymous memory heavy workload.

Unfortunately your patch does not fix the problem, but behaves as I had
feared :(

Both the normal kernel and your kernel fall over once memory pressure gets
big enough, but they explode differently and at different points.

I am running the test on a quad core x86-64 system with 2GB memory.  I am
"zooming in" on the 4000 user range, because that is where they start to
diverge.  I am running aim7 to cross-over, which is the point at which
fewer than 1 jobs/min/user are being completed.

First vanilla 2.6.22-rc5-git8:

Num     Parent   Child     Child   Jobs per    Jobs/min/   Std_dev   Std_dev   JTI
Forked  Time     SysTime   UTime   Minute      Child       Time      Percent
4000    119.97   432.86    47.17   204051.01   51.01       11.52     9.99      90
4100    141.59   517.31    48.92   177215.91   43.22       6.67      4.84      95
4200    154.95   569.16    50.51   165885.77   39.50       5.07      3.35      96
4300    166.24   613.40    51.58   158301.25   36.81       10.59     6.51      93
4400    170.40   628.63    52.72   158028.17   35.92       5.46      3.27      96
4500    188.88   701.84    54.06   145806.86   32.40       6.13      3.31      96
4600    200.37   745.73    55.55   140500.07   30.54       4.98      2.54      97
4700    219.25   819.80    57.01   131192.70   27.91       5.38      2.51      97
4800    219.70   820.36    58.22   133709.60   27.86       5.40      2.52      97
4900    232.45   870.08    59.56   129008.39   26.33       4.65      2.02      97
5105    1704.46  5406.56   64.03   18329.91    3.59        264.38    18.85     81
Crossover achieved
Max Jobs per Minute 204051.01

Now 2.6.22-rc5-git8 with your patches 01/16 and 15/16:

Num     Parent   Child     Child   Jobs per    Jobs/min/   Std_dev   Std_dev   JTI
Forked  Time     SysTime   UTime   Minute      Child       Time      Percent
4000    141.51   518.37    47.46   172991.31   43.25       5.20      3.75      96
4100    147.07   539.16    48.91   170612.63   41.61       5.11      3.58      96
4200    155.43   571.36    50.18   165373.48   39.37       5.42      3.58      96
4300    1317.89  4558.95   52.53   19968.28    4.64        219.76    18.42     81
Crossover achieved
Max Jobs per Minute 172991.31

One thing I noticed is that with the vanilla kernel, the lower numbers of
users allowed the system to still run fine, while with your patches the
system seemed to get stuck at ~90% system time pretty quickly...

--
All Rights Reversed
* Re: [PATCH 01 of 16] remove nr_scan_inactive/active
From: Andrew Morton @ 2007-06-26 17:55 UTC
To: Rik van Riel; +Cc: Andrea Arcangeli, linux-mm

On Tue, 26 Jun 2007 13:08:57 -0400 Rik van Riel <riel@redhat.com> wrote:

> > If all tasks spend 10 minutes in shrink_active_list before the first
> > call to shrink_inactive_list that could mean you hit the race that I'm
> > just trying to fix with this very patch.
>
> I got around to testing it now.  I am using AIM7 since it is
> a very anonymous memory heavy workload.
>
> Unfortunately your patch does not fix the problem, but behaves
> as I had feared :(
>
> Both the normal kernel and your kernel fall over once memory
> pressure gets big enough, but they explode differently and
> at different points.
>
> I am running the test on a quad core x86-64 system with 2GB
> memory.  I am "zooming in" on the 4000 user range, because
> that is where they start to diverge.  I am running aim7 to
> cross-over, which is the point at which fewer than 1 jobs/min/user
> are being completed.

with what command line and config scripts does one run aim7 to
reproduce this?

Where's the system time being spent?

Thanks.
* Re: [PATCH 01 of 16] remove nr_scan_inactive/active
From: Rik van Riel @ 2007-06-26 19:02 UTC
To: Andrew Morton; +Cc: Andrea Arcangeli, linux-mm

Andrew Morton wrote:
> On Tue, 26 Jun 2007 13:08:57 -0400 Rik van Riel <riel@redhat.com> wrote:
>
>>> If all tasks spend 10 minutes in shrink_active_list before the first
>>> call to shrink_inactive_list that could mean you hit the race that I'm
>>> just trying to fix with this very patch.
>>
>> I got around to testing it now.  I am using AIM7 since it is
>> a very anonymous memory heavy workload.
>>
>> Unfortunately your patch does not fix the problem, but behaves
>> as I had feared :(
>>
>> Both the normal kernel and your kernel fall over once memory
>> pressure gets big enough, but they explode differently and
>> at different points.
>>
>> I am running the test on a quad core x86-64 system with 2GB
>> memory.  I am "zooming in" on the 4000 user range, because
>> that is where they start to diverge.  I am running aim7 to
>> cross-over, which is the point at which fewer than 1 jobs/min/user
>> are being completed.
>
> with what command line and config scripts does one run aim7 to
> reproduce this?

reaim -x -i 100 -s 5000

Using the default reaim.config and workfile.shared

> Where's the system time being spent?

I will run the tests again with profiling enabled.

--
All Rights Reversed
* Re: [PATCH 01 of 16] remove nr_scan_inactive/active
From: Rik van Riel @ 2007-06-28 22:44 UTC
To: Andrew Morton; +Cc: Andrea Arcangeli, linux-mm

Andrew Morton wrote:

> Where's the system time being spent?

OK, it turns out that there is quite a bit of variability in where the
system spends its time.  I did a number of reaim runs and averaged the
time the system spent in the top functions.

This is with the Fedora rawhide kernel config, which has quite a few
debugging options enabled.

_raw_spin_lock                   32.0%
page_check_address               12.7%
__delay                          10.8%
mwait_idle                       10.4%
anon_vma_unlink                   5.7%
__anon_vma_link                   5.3%
lockdep_reset_lock                3.5%
__kmalloc_node_track_caller       2.8%
security_port_sid                 1.8%
kfree                             1.6%
anon_vma_link                     1.2%
page_referenced_one               1.1%

In short, the system is waiting on the anon_vma lock.

I wonder if Lee Schermerhorn's patch to turn that spinlock into an rwlock
would help this workload, or if we simply should scan fewer pages in the
pageout code.

Andrea, with your VM patches for some reason the number of users where
reaim has its crossover point is also somewhat variable, between 4200 and
5100 users, with 9 out of 10 runs under 4500 on my system.  A kernel
without your patches is not as variable, but has visibly more unfairness
between tasks, as seen in the reaim "Std_dev" columns.

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.
* Re: [PATCH 01 of 16] remove nr_scan_inactive/active
From: Andrew Morton @ 2007-06-28 22:57 UTC
To: Rik van Riel; +Cc: Andrea Arcangeli, linux-mm

On Thu, 28 Jun 2007 18:44:56 -0400 Rik van Riel <riel@redhat.com> wrote:

> Andrew Morton wrote:
>
> > Where's the system time being spent?
>
> OK, it turns out that there is quite a bit of variability
> in where the system spends its time.  I did a number of
> reaim runs and averaged the time the system spent in the
> top functions.
>
> This is with the Fedora rawhide kernel config, which has
> quite a few debugging options enabled.
>
> _raw_spin_lock                   32.0%
> page_check_address               12.7%
> __delay                          10.8%
> mwait_idle                       10.4%
> anon_vma_unlink                   5.7%
> __anon_vma_link                   5.3%
> lockdep_reset_lock                3.5%
> __kmalloc_node_track_caller       2.8%
> security_port_sid                 1.8%
> kfree                             1.6%
> anon_vma_link                     1.2%
> page_referenced_one               1.1%
>
> In short, the system is waiting on the anon_vma lock.

Sigh.  We had a workload (forget which, still unfixed) in which things
would basically melt down in that linear anon_vma walk, walking 10,000 or
more vma's.  I wonder if that's what's happening here?

Also, one thing to watch out for here is a problem with the spinlocks
themselves: the problem wherein the cores in one package keep rattling the
lock around between them and never let it out for the cores in another
package to grab.

> I wonder if Lee Schermerhorn's patch to turn that
> spinlock into an rwlock would help this workload,
> or if we simply should scan fewer pages in the
> pageout code.

Maybe.  I'm thinking that the problem here is really due to the huge amount
of processing which needs to occur when we are in the "all pages active,
referenced" state and then we hit pages_low.  Panic time, we need to scan
and deactivate a huge amount of stuff.

Would it not be better to prevent that situation from occurring by doing a
bit of scanning and balancing when adding pages to the LRU?  Make sure that
the lists will be in reasonable shape for when reclaim starts?

That'd deoptimise those workloads which allocate and free pages but never
enter reclaim.  Probably liveable with.

We would want to avoid needlessly unmapping pages and causing more minor
faults.
* Re: [PATCH 01 of 16] remove nr_scan_inactive/active
From: Rik van Riel @ 2007-06-28 23:04 UTC
To: Andrew Morton; +Cc: Andrea Arcangeli, linux-mm

Andrew Morton wrote:
> On Thu, 28 Jun 2007 18:44:56 -0400
> Rik van Riel <riel@redhat.com> wrote:
>
>> Andrew Morton wrote:
>>
>>> Where's the system time being spent?
>>
>> OK, it turns out that there is quite a bit of variability
>> in where the system spends its time.  I did a number of
>> reaim runs and averaged the time the system spent in the
>> top functions.
>>
>> This is with the Fedora rawhide kernel config, which has
>> quite a few debugging options enabled.
>>
>> _raw_spin_lock                   32.0%
>> page_check_address               12.7%
>> __delay                          10.8%
>> mwait_idle                       10.4%
>> anon_vma_unlink                   5.7%
>> __anon_vma_link                   5.3%
>> lockdep_reset_lock                3.5%
>> __kmalloc_node_track_caller       2.8%
>> security_port_sid                 1.8%
>> kfree                             1.6%
>> anon_vma_link                     1.2%
>> page_referenced_one               1.1%
>>
>> In short, the system is waiting on the anon_vma lock.
>
> Sigh.  We had a workload (forget which, still unfixed) in which things
> would basically melt down in that linear anon_vma walk, walking 10,000 or
> more vma's.  I wonder if that's what's happening here?

That would be a large multi-threaded application that fills up memory.
Customers are reproducing this with JVMs on some very large systems.

> Also, one thing to watch out for here is a problem with the spinlocks
> themselves: the problem wherein the cores in one package keep rattling the
> lock around between them and never let it out for the cores in another
> package to grab.

This is a single package quad core system, though.

>> I wonder if Lee Schermerhorn's patch to turn that
>> spinlock into an rwlock would help this workload,
>> or if we simply should scan fewer pages in the
>> pageout code.
>
> Maybe.  I'm thinking that the problem here is really due to the huge amount
> of processing which needs to occur when we are in the "all pages active,
> referenced" state and then we hit pages_low.  Panic time, we need to scan
> and deactivate a huge amount of stuff.
>
> Would it not be better to prevent that situation from occurring by doing a
> bit of scanning and balancing when adding pages to the LRU?  Make sure that
> the lists will be in reasonable shape for when reclaim starts?

Agreed, we need to simply scan fewer pages.

Doing something like SEQ replacement on the anonymous (and other swap
backed) pages might just do the trick here.  Page cache, of course, should
continue using a used-once scheme.

I suspect we want to split out the lists for many other reasons anyway, as
detailed on http://linux-mm.org/PageoutFailureModes

I'll whip up a patch that does this...

> That'd deoptimise those workloads which allocate and free pages but never
> enter reclaim.  Probably liveable with.

If we do true SEQ replacement for anonymous pages (deactivating active
pages without regard to the referenced bit) and keep the inactive list
reasonably small, that penalty should be negligible.

> We would want to avoid needlessly unmapping pages and causing more minor
> faults.

That's a minor issue, the page fault path is pretty cheap and very scalable.

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.
* Re: [PATCH 01 of 16] remove nr_scan_inactive/active
From: Andrew Morton @ 2007-06-28 23:13 UTC
To: Rik van Riel; +Cc: Andrea Arcangeli, linux-mm

On Thu, 28 Jun 2007 19:04:05 -0400 Rik van Riel <riel@redhat.com> wrote:

> > Sigh.  We had a workload (forget which, still unfixed) in which things
> > would basically melt down in that linear anon_vma walk, walking 10,000 or
> > more vma's.  I wonder if that's what's happening here?
>
> That would be a large multi-threaded application that fills up
> memory.  Customers are reproducing this with JVMs on some very
> large systems.

So.... does that mean "yes, it's scanning a lot of vmas"?

If so, I expect there will still be failure modes, whatever we do outside
of this.  A locked, linear walk of a list whose length is
application-controlled is going to be a problem.  Could be that we'll need
an O(n) -> O(log(n)) conversion, which will be tricky in there.
* Re: [PATCH 01 of 16] remove nr_scan_inactive/active
From: Rik van Riel @ 2007-06-28 23:16 UTC
To: Andrew Morton; +Cc: Andrea Arcangeli, linux-mm

Andrew Morton wrote:
> On Thu, 28 Jun 2007 19:04:05 -0400
> Rik van Riel <riel@redhat.com> wrote:
>
>>> Sigh.  We had a workload (forget which, still unfixed) in which things
>>> would basically melt down in that linear anon_vma walk, walking 10,000 or
>>> more vma's.  I wonder if that's what's happening here?
>>
>> That would be a large multi-threaded application that fills up
>> memory.  Customers are reproducing this with JVMs on some very
>> large systems.
>
> So.... does that mean "yes, it's scanning a lot of vmas"?

Not necessarily.

The problem can also be reproduced if you have many threads, from "enough"
CPUs, all scanning pages in the same huge VMA.

> If so, I expect there will still be failure modes, whatever we do outside
> of this.  A locked, linear walk of a list whose length is
> application-controlled is going to be a problem.  Could be that we'll need
> an O(n) -> O(log(n)) conversion, which will be tricky in there.

Scanning fewer pages in the pageout path is probably the way to go.

No matter how efficient we make the scanning of one individual page, we
simply cannot scan through 1TB worth of anonymous pages (which are all
referenced because they've been there for a week) in order to deactivate
something.

Systems that big are only a year or two away from general purpose use.

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.
* Re: [PATCH 01 of 16] remove nr_scan_inactive/active
From: Andrew Morton @ 2007-06-28 23:29 UTC
To: Rik van Riel; +Cc: Andrea Arcangeli, linux-mm

On Thu, 28 Jun 2007 19:16:45 -0400 Rik van Riel <riel@redhat.com> wrote:

> Andrew Morton wrote:
> > On Thu, 28 Jun 2007 19:04:05 -0400
> > Rik van Riel <riel@redhat.com> wrote:
> >
> >>> Sigh.  We had a workload (forget which, still unfixed) in which things
> >>> would basically melt down in that linear anon_vma walk, walking 10,000 or
> >>> more vma's.  I wonder if that's what's happening here?
> >>
> >> That would be a large multi-threaded application that fills up
> >> memory.  Customers are reproducing this with JVMs on some very
> >> large systems.
> >
> > So.... does that mean "yes, it's scanning a lot of vmas"?
>
> Not necessarily.
>
> The problem can also be reproduced if you have many
> threads, from "enough" CPUs, all scanning pages in
> the same huge VMA.

I wouldn't have expected the anon_vma lock to be the main problem for a
single vma.

If it _is_ the problem then significant improvements could probably be
obtained by passing the whole isolate_lru_pages() pile of pages into the
rmap code rather than doing them one-at-a-time.

> > If so, I expect there will still be failure modes, whatever we do outside
> > of this.  A locked, linear walk of a list whose length is
> > application-controlled is going to be a problem.  Could be that we'll need
> > an O(n) -> O(log(n)) conversion, which will be tricky in there.
>
> Scanning fewer pages in the pageout path is probably
> the way to go.

I don't see why that would help.  The bottom-line steady-state case is that
we need to reclaim N pages per second, and we need to scan N*M vmas per
second to do so.  How we chunk that up won't affect the aggregate amount of
work which needs to be done.

Or maybe you're referring to the ongoing LRU balancing thing.  Or to
something else.

> No matter how efficient we make the scanning of one
> individual page, we simply cannot scan through 1TB
> worth of anonymous pages (which are all referenced
> because they've been there for a week) in order to
> deactivate something.

Sure.  And we could avoid that sudden transition by balancing the LRU prior
to hitting the great pages_high wall.
* Re: [PATCH 01 of 16] remove nr_scan_inactive/active
From: Rik van Riel @ 2007-06-29 0:00 UTC
To: Andrew Morton; +Cc: Andrea Arcangeli, linux-mm

Andrew Morton wrote:

>> Scanning fewer pages in the pageout path is probably
>> the way to go.
>
> I don't see why that would help.  The bottom-line steady-state case is that
> we need to reclaim N pages per second, and we need to scan N*M vmas per
> second to do so.  How we chunk that up won't affect the aggregate amount of
> work which needs to be done.
>
> Or maybe you're referring to the ongoing LRU balancing thing.  Or to
> something else.

Yes, I am indeed talking about LRU balancing.

We pretty much *know* that an anonymous page on the active list is
accessed, so why bother scanning them all?

We could just deactivate the oldest ones and clear their referenced bits.

Once they reach the end of the inactive list, we check for the referenced
bit again.  If the page was accessed, we move it back to the active list.

The only problem with this is that anonymous pages could be easily pushed
out of memory by the page cache, because the page cache has totally
different locality of reference.

The page cache also benefits from the use-once scheme we have in place
today.

Because of these three reasons, I want to split the page cache LRU lists
from the anonymous memory LRU lists.

Does this make sense to you?

>> No matter how efficient we make the scanning of one
>> individual page, we simply cannot scan through 1TB
>> worth of anonymous pages (which are all referenced
>> because they've been there for a week) in order to
>> deactivate something.
>
> Sure.  And we could avoid that sudden transition by balancing the LRU prior
> to hitting the great pages_high wall.

Yes, we will need to do some proactive balancing.

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.
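
A compact stand-alone sketch of the aging rules just described (illustration
only, not a patch; the structures and names are invented): anonymous pages
are deactivated without a referenced check, page cache keeps the use-once
behaviour, and the referenced bit is only honoured at the tail of the
inactive list.

	#include <stdbool.h>
	#include <stdio.h>

	struct page {
		bool anon;              /* swap backed? */
		bool referenced;        /* hardware referenced bit */
	};

	enum action { DEACTIVATE, REACTIVATE, RECLAIM };

	static const char *name[] = { "deactivate", "reactivate", "reclaim" };

	/* page reaches the tail of its *active* list */
	static enum action active_tail(struct page *p)
	{
		if (p->anon) {
			p->referenced = false;  /* clear the bit and move on */
			return DEACTIVATE;      /* SEQ-style: assume anon pages are in use */
		}
		/* page cache keeps the use-once behaviour */
		return p->referenced ? REACTIVATE : DEACTIVATE;
	}

	/* page reaches the tail of its *inactive* list */
	static enum action inactive_tail(struct page *p)
	{
		return p->referenced ? REACTIVATE : RECLAIM;
	}

	int main(void)
	{
		struct page anon  = { .anon = true,  .referenced = true };
		struct page cache = { .anon = false, .referenced = true };

		printf("anon page, active tail:   %s\n", name[active_tail(&anon)]);
		printf("cache page, active tail:  %s\n", name[active_tail(&cache)]);
		printf("anon page, inactive tail: %s\n", name[inactive_tail(&anon)]);
		return 0;
	}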
* Re: [PATCH 01 of 16] remove nr_scan_inactive/active
From: Andrew Morton @ 2007-06-29 0:19 UTC
To: Rik van Riel; +Cc: Andrea Arcangeli, linux-mm

On Thu, 28 Jun 2007 20:00:03 -0400 Rik van Riel <riel@redhat.com> wrote:

> Andrew Morton wrote:
>
> >> Scanning fewer pages in the pageout path is probably
> >> the way to go.
> >
> > I don't see why that would help.  The bottom-line steady-state case is that
> > we need to reclaim N pages per second, and we need to scan N*M vmas per
> > second to do so.  How we chunk that up won't affect the aggregate amount of
> > work which needs to be done.
> >
> > Or maybe you're referring to the ongoing LRU balancing thing.  Or to
> > something else.
>
> Yes, I am indeed talking about LRU balancing.
>
> We pretty much *know* that an anonymous page on the
> active list is accessed, so why bother scanning them
> all?

Because there might well be pages in there which haven't been accessed in
days.  Confused.

> We could just deactivate the oldest ones and clear
> their referenced bits.
>
> Once they reach the end of the inactive list, we
> check for the referenced bit again.  If the page
> was accessed, we move it back to the active list.

ok.

> The only problem with this is that anonymous
> pages could be easily pushed out of memory by
> the page cache, because the page cache has
> totally different locality of reference.

I don't immediately see why we need to change the fundamental aging design
at all.  The problems afaict are

a) that huge burst of activity when we hit pages_high and

b) the fact that this huge burst happens on lots of CPUs at the same time.

And balancing the LRUs _prior_ to hitting pages_high can address both
problems?

It will I guess impact the page aging a bit though.

> The page cache also benefits from the use-once
> scheme we have in place today.
>
> Because of these three reasons, I want to split
> the page cache LRU lists from the anonymous
> memory LRU lists.
>
> Does this make sense to you?

Could do, don't know.  What new problems will it introduce? :(

> >> No matter how efficient we make the scanning of one
> >> individual page, we simply cannot scan through 1TB
> >> worth of anonymous pages (which are all referenced
> >> because they've been there for a week) in order to
> >> deactivate something.
> >
> > Sure.  And we could avoid that sudden transition by balancing the LRU prior
> > to hitting the great pages_high wall.
>
> Yes, we will need to do some proactive balancing.

OK..

And that huge anon-vma walk might need attention.  At the least we could do
something to prevent lots of CPUs from piling up in there.
* Re: [PATCH 01 of 16] remove nr_scan_inactive/active
From: Rik van Riel @ 2007-06-29 0:45 UTC
To: Andrew Morton; +Cc: Andrea Arcangeli, linux-mm

Andrew Morton wrote:
> On Thu, 28 Jun 2007 20:00:03 -0400
> Rik van Riel <riel@redhat.com> wrote:
>
>> We pretty much *know* that an anonymous page on the
>> active list is accessed, so why bother scanning them
>> all?
>
> Because there might well be pages in there which haven't been accessed in
> days.  Confused.

We won't know that unless we actually did some background scanning.
Currently hours old (or days old) referenced bits are not cleared from
anonymous pages.

>> We could just deactivate the oldest ones and clear
>> their referenced bits.
>>
>> Once they reach the end of the inactive list, we
>> check for the referenced bit again.  If the page
>> was accessed, we move it back to the active list.
>
> ok.
>
>> The only problem with this is that anonymous
>> pages could be easily pushed out of memory by
>> the page cache, because the page cache has
>> totally different locality of reference.
>
> I don't immediately see why we need to change the fundamental aging design
> at all.  The problems afaict are
>
> a) that huge burst of activity when we hit pages_high and
>
> b) the fact that this huge burst happens on lots of CPUs at the same time.
>
> And balancing the LRUs _prior_ to hitting pages_high can address both
> problems?

That may work on systems with up to a few GB of memory, but customers are
already rolling out systems with 256GB of RAM for general purpose use,
that's 64 million pages!

Even doing a background scan on that many pages will take insane amounts
of CPU time.

In a few years, they will be deploying systems with 1TB of memory and
throwing random workloads at them.

> It will I guess impact the page aging a bit though.

Yes, it will.  However, I believe that the current system of page aging is
simply not sustainable when memory size gets insanely large.

>> The page cache also benefits from the use-once
>> scheme we have in place today.
>>
>> Because of these three reasons, I want to split
>> the page cache LRU lists from the anonymous
>> memory LRU lists.
>>
>> Does this make sense to you?
>
> Could do, don't know.  What new problems will it introduce? :(

The obvious problem is how to balance the eviction of page cache backed
pages versus the eviction of swap backed pages.

The "good news" here is that the current VM does not really balance this
either, but relies on system administrators to tweak
/proc/sys/vm/swappiness on systems that run a "corner case" workload.

>>>> No matter how efficient we make the scanning of one
>>>> individual page, we simply cannot scan through 1TB
>>>> worth of anonymous pages (which are all referenced
>>>> because they've been there for a week) in order to
>>>> deactivate something.
>>>
>>> Sure.  And we could avoid that sudden transition by balancing the LRU prior
>>> to hitting the great pages_high wall.
>>
>> Yes, we will need to do some proactive balancing.
>
> OK..
>
> And that huge anon-vma walk might need attention.  At the least we could do
> something to prevent lots of CPUs from piling up in there.

Speaking of which, I have also seen a thousand processes waiting to grab
the iprune_mutex in prune_icache.

Maybe direct reclaim processes should not dive into this cache at all, but
simply increase some variable indicating that kswapd might want to prune
some extra pages from this cache on its next run?

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.
* Re: [PATCH 01 of 16] remove nr_scan_inactive/active
From: Andrew Morton @ 2007-06-29 1:12 UTC
To: Rik van Riel; +Cc: Andrea Arcangeli, linux-mm

On Thu, 28 Jun 2007 20:45:20 -0400 Rik van Riel <riel@redhat.com> wrote:

> >> The only problem with this is that anonymous
> >> pages could be easily pushed out of memory by
> >> the page cache, because the page cache has
> >> totally different locality of reference.
> >
> > I don't immediately see why we need to change the fundamental aging design
> > at all.  The problems afaict are
> >
> > a) that huge burst of activity when we hit pages_high and
> >
> > b) the fact that this huge burst happens on lots of CPUs at the same time.
> >
> > And balancing the LRUs _prior_ to hitting pages_high can address both
> > problems?
>
> That may work on systems with up to a few GB of memory,
> but customers are already rolling out systems with 256GB
> of RAM for general purpose use, that's 64 million pages!
>
> Even doing a background scan on that many pages will take
> insane amounts of CPU time.
>
> In a few years, they will be deploying systems with 1TB
> of memory and throwing random workloads at them.

I don't see how the amount of memory changes anything here: if there are
more pages, more work needs to be done regardless of when we do it.

Still confused.

> > And that huge anon-vma walk might need attention.  At the least we could do
> > something to prevent lots of CPUs from piling up in there.
>
> Speaking of which, I have also seen a thousand processes waiting
> to grab the iprune_mutex in prune_icache.

It would make sense to only permit one cpu at a time to go in and do
reclamation against a particular zone (or even node).

But the problem with the vfs caches is that they aren't node/zone-specific.
We wouldn't want to get into the situation where 1023 CPUs are twiddling
thumbs waiting for one CPU to free stuff up (or less extreme variants of
this).

> Maybe direct reclaim processes should not dive into this cache
> at all, but simply increase some variable indicating that kswapd
> might want to prune some extra pages from this cache on its next
> run?

Tell the node's kswapd to go off and do VFS reclaim while the CPUs on that
node wait for it?  That would help I guess, but those thousand processes
would still need to block _somewhere_ waiting for the memory to come back.

Of course, iprune_mutex is a particularly dumb place in which to do that,
because the memory may get freed up from somewhere else.  The general
design here could/should be to back off to the top-level when there's
contention (that's presently congestion_wait()) and to poll for
memory-became-allocatable.

So what we could do here is to back off when iprune_mutex is busy and, if
nothing else works out, block in congestion_wait() (which is becoming
increasingly misnamed).  Then, add some more smarts to congestion_wait():
deliver a wakeup when "enough" memory got freed from the VFS caches.

One suspects that at some stage, congestion_wait() will need to be told
what the calling task is actually waiting for (perhaps a zonelist) so that
the wakeup delivery can become smarter.

But for now, the question is: is this a reasonable overall design?  Back
off from contention points, block at the top-level, polling for allocatable
memory to turn up?
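
A toy user-space model of that back-off-and-poll idea (illustration only;
all names are invented and this is not the kernel API): a task that finds
the prune lock busy does not queue on it, but returns to the top level and
sleeps on a "memory was freed" event with a timeout before retrying.

	#include <pthread.h>
	#include <stdbool.h>
	#include <stdio.h>
	#include <time.h>
	#include <unistd.h>

	static pthread_mutex_t prune_lock = PTHREAD_MUTEX_INITIALIZER; /* plays iprune_mutex */
	static pthread_mutex_t event_lock = PTHREAD_MUTEX_INITIALIZER;
	static pthread_cond_t  mem_freed  = PTHREAD_COND_INITIALIZER;

	/* whoever frees "enough" memory wakes up the waiters */
	static void report_memory_freed(void)
	{
		pthread_mutex_lock(&event_lock);
		pthread_cond_broadcast(&mem_freed);
		pthread_mutex_unlock(&event_lock);
	}

	/* try to shrink the cache; never queue on a busy lock */
	static bool try_prune_cache(void)
	{
		if (pthread_mutex_trylock(&prune_lock) != 0)
			return false;                   /* busy: back off */
		usleep(200 * 1000);                     /* pretend to shrink the cache */
		pthread_mutex_unlock(&prune_lock);
		report_memory_freed();
		return true;
	}

	/* top-level slow path: back off, then poll for memory turning up */
	static void *alloc_slow_path(void *arg)
	{
		while (!try_prune_cache()) {
			struct timespec ts;

			clock_gettime(CLOCK_REALTIME, &ts);
			ts.tv_sec += 1;                 /* bounded wait, then re-check */
			pthread_mutex_lock(&event_lock);
			pthread_cond_timedwait(&mem_freed, &event_lock, &ts);
			pthread_mutex_unlock(&event_lock);
			printf("task %ld: woken, retrying\n", (long)arg);
		}
		printf("task %ld: made progress\n", (long)arg);
		return NULL;
	}

	int main(void)
	{
		pthread_t t[4];
		long i;

		for (i = 0; i < 4; i++)
			pthread_create(&t[i], NULL, alloc_slow_path, (void *)i);
		for (i = 0; i < 4; i++)
			pthread_join(t[i], NULL);
		return 0;
	}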
* Re: [PATCH 01 of 16] remove nr_scan_inactive/active 2007-06-29 1:12 ` Andrew Morton @ 2007-06-29 1:20 ` Rik van Riel 2007-06-29 1:29 ` Andrew Morton 0 siblings, 1 reply; 77+ messages in thread From: Rik van Riel @ 2007-06-29 1:20 UTC (permalink / raw) To: Andrew Morton; +Cc: Andrea Arcangeli, linux-mm Andrew Morton wrote: > On Thu, 28 Jun 2007 20:45:20 -0400 > Rik van Riel <riel@redhat.com> wrote: > >>>> The only problem with this is that anonymous >>>> pages could be easily pushed out of memory by >>>> the page cache, because the page cache has >>>> totally different locality of reference. >>> I don't immediately see why we need to change the fundamental aging design >>> at all. The problems afacit are >>> >>> a) that huge burst of activity when we hit pages_high and >>> >>> b) the fact that this huge burst happens on lots of CPUs at the same time. >>> >>> And balancing the LRUs _prior_ to hitting pages_high can address both >>> problems? >> That may work on systems with up to a few GB of memory, >> but customers are already rolling out systems with 256GB >> of RAM for general purpose use, that's 64 million pages! >> >> Even doing a background scan on that many pages will take >> insane amounts of CPU time. >> >> In a few years, they will be deploying systems with 1TB >> of memory and throwing random workloads at them. > > I don't see how the amount of memory changes anything here: if there are > more pages, more work needs to be done regardless of when we do it. > > Still confused. If we deactivate some of the active pages regardless of whether or not they were recently referenced, you end up with "hey, I need to deactivate 1GB worth of pages", instead of with "I need to scan through 1TB worth of pages to find 1GB of not recently accessed ones". Note that is the exact same argument used against the used-once cleanups that have been proposed in the past: it is more work to scan through the whole list than to have pages end up in a "reclaimable" state by default. > But the problem with the vfs caches is that they aren't node/zone-specific. > We wouldn't want to get into the situation where 1023 CPUs are twiddling > thumbs waiting for one CPU to free stuff up (or less extreme variants of > this). The direct reclaimers can free something else. Chances are they don't care about the little bit of memory coming out of these caches. We just need to make sure the pressure gets evened out later. >> Maybe direct reclaim processes should not dive into this cache >> at all, but simply increase some variable indicating that kswapd >> might want to prune some extra pages from this cache on its next >> run? > > Tell the node's kswapd to go off and do VFS reclaim while the CPUs on that > node wait for it? That would help I guess, but those thousand processes > would still need to block _somewhere_ waiting for the memory to come back. Not for the VFS memory. They can just recycle some page cache memory or start IO on anonymous memory going into swap. > So what we could do here is to back off when iprune_mutex is busy and, if > nothing else works out, block in congestion_wait() (which is becoming > increasingly misnamed). Then, add some more smarts to congestion_wait(): > deliver a wakeup when "enough" memory got freed from the VFS caches. Yeah, that sounds doable. Not sure if they should wait in congestion_wait() though, or if they should just return to __alloc_pages() since they may already have reclaimed enough pages from the anonymous list. 
> But for now, the question is: is this a reasonable overall design? Back > off from contention points, block at the top-level, polling for allocatable > memory to turn up? I'm not convinced. If we have already reclaimed some pages from the inactive list, why wait in congestion_wait() AT ALL? -- Politics is the struggle between those who want to make their country the best in the world, and those who believe it already is. Each group calls the other unpatriotic.
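The control flow being argued for here can be sketched roughly as follows (a simplification for illustration, not the actual __alloc_pages() code): only back off when direct reclaim made no progress at all, otherwise return straight to the freelists.

	for (;;) {
		page = get_page_from_freelist(gfp_mask, order, zonelist,
						alloc_flags);
		if (page)
			break;			/* got memory, no sleeping */

		did_some_progress = try_to_free_pages(zonelist->zones,
						      gfp_mask);
		if (!did_some_progress)
			congestion_wait(WRITE, HZ/50);	/* freed nothing: back off */
		/* else: something was reclaimed, retry the freelists immediately */
	}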
* Re: [PATCH 01 of 16] remove nr_scan_inactive/active 2007-06-29 1:20 ` Rik van Riel @ 2007-06-29 1:29 ` Andrew Morton 0 siblings, 0 replies; 77+ messages in thread From: Andrew Morton @ 2007-06-29 1:29 UTC (permalink / raw) To: Rik van Riel; +Cc: Andrea Arcangeli, linux-mm On Thu, 28 Jun 2007 21:20:40 -0400 Rik van Riel <riel@redhat.com> wrote: > > But for now, the question is: is this a reasonable overall design? Back > > off from contention points, block at the top-level, polling for allocatable > > memory to turn up? > > I'm not convinced. If we have already reclaimed some > pages from the inactive list, why wait in congestion_wait() > AT ALL? Well by top-level I meant top-level. The point where we either block or declare oom. We do that now in alloc_pages(), correctly I believe. The congestion_wait()s in vmscan.c might be misplaced (ie: too far down) because they could lead to us blocking when some memory actually got freed up (or became freeable?) somewhere else. To fix that we'd need to take a global look at things from within direct-reclaim, or back out of direct-reclaim back up to alloc_pages(), but remember where we were up to for the next pass. Perhaps by extending scan_control a bit and moving its instantiation up to __alloc_pages().
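The scan_control half of that suggestion might look roughly like this; the field list is abridged and the resume_* members are purely hypothetical, intended only to show the kind of state that could be carried between direct-reclaim passes once the structure is instantiated up in __alloc_pages().

struct scan_control {
	unsigned long nr_scanned;	/* pages examined so far */
	gfp_t gfp_mask;			/* allocation context of the caller */
	int may_writepage;
	int may_swap;
	int order;

	/*
	 * Hypothetical additions: remember where the previous pass stopped
	 * so the next retry from __alloc_pages() can continue from there
	 * instead of starting over (and instead of sleeping).
	 */
	struct zone *resume_zone;
	int resume_priority;
};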
* Re: [PATCH 01 of 16] remove nr_scan_inactive/active 2007-06-28 23:13 ` Andrew Morton 2007-06-28 23:16 ` Rik van Riel @ 2007-06-28 23:25 ` Andrea Arcangeli 2007-06-29 0:12 ` Andrew Morton 1 sibling, 1 reply; 77+ messages in thread From: Andrea Arcangeli @ 2007-06-28 23:25 UTC (permalink / raw) To: Andrew Morton; +Cc: Rik van Riel, linux-mm On Thu, Jun 28, 2007 at 04:13:50PM -0700, Andrew Morton wrote: > On Thu, 28 Jun 2007 19:04:05 -0400 > Rik van Riel <riel@redhat.com> wrote: > > > > Sigh. We had a workload (forget which, still unfixed) in which things > > > would basically melt down in that linear anon_vma walk, walking 10,000 or > > > more vma's. I wonder if that's what's happening here? > > > > That would be a large multi-threaded application that fills up > > memory. Customers are reproducing this with JVMs on some very > > large systems. > > So.... does that mean "yes, it's scanning a lot of vmas"? > > If so, I expect there will still be failure modes, whatever we do outside > of this. A locked, linear walk of a list whose length is > application-controlled is going to be a problem. Could be that we'll need > an O(n) -> O(log(n)) conversion, which will be tricky in there. There's no swapping, so are we sure we need to scan the pte? This might be as well the unmapping code being invoked too early despite there's still clean cache to free. If I/O would start because swapping is really needed, the O(N) walk wouldn't hog the cpu so much because lots of time would be spent waiting for I/O too. Decreasing DEF_PRIORITY should defer the invocation of the unmapping code too. Conversion to O(log(N)) like for the filebacked mappings shouldn't be a big problem but it'll waste more static memory for each vma and anon_vma. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
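For reference, the walk under discussion is the loop in page_referenced_anon() (and its twin in try_to_unmap_anon()): a linear scan of every vma hanging off the page's anon_vma, under the anon_vma lock, with a pte lookup per vma. The snippet below is a simplified rendering of that loop, not an exact copy of any tree, but it shows why the cost is proportional to the number of fork()ed children sharing the anon_vma.

static int page_referenced_anon(struct page *page)
{
	unsigned int mapcount;
	struct anon_vma *anon_vma;
	struct vm_area_struct *vma;
	int referenced = 0;

	anon_vma = page_lock_anon_vma(page);	/* takes the anon_vma lock */
	if (!anon_vma)
		return referenced;

	mapcount = page_mapcount(page);
	/* O(number of related vmas): every fork()ed child adds an entry */
	list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
		referenced += page_referenced_one(page, vma, &mapcount);
		if (!mapcount)
			break;
	}
	page_unlock_anon_vma(anon_vma);
	return referenced;
}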
* Re: [PATCH 01 of 16] remove nr_scan_inactive/active 2007-06-28 23:25 ` Andrea Arcangeli @ 2007-06-29 0:12 ` Andrew Morton 0 siblings, 0 replies; 77+ messages in thread From: Andrew Morton @ 2007-06-29 0:12 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Rik van Riel, linux-mm On Fri, 29 Jun 2007 01:25:36 +0200 Andrea Arcangeli <andrea@suse.de> wrote: > On Thu, Jun 28, 2007 at 04:13:50PM -0700, Andrew Morton wrote: > > On Thu, 28 Jun 2007 19:04:05 -0400 > > Rik van Riel <riel@redhat.com> wrote: > > > > > > Sigh. We had a workload (forget which, still unfixed) in which things > > > > would basically melt down in that linear anon_vma walk, walking 10,000 or > > > > more vma's. I wonder if that's what's happening here? > > > > > > That would be a large multi-threaded application that fills up > > > memory. Customers are reproducing this with JVMs on some very > > > large systems. > > > > So.... does that mean "yes, it's scanning a lot of vmas"? > > > > If so, I expect there will still be failure modes, whatever we do outside > > of this. A locked, linear walk of a list whose length is > > application-controlled is going to be a problem. Could be that we'll need > > an O(n) -> O(log(n)) conversion, which will be tricky in there. > > There's no swapping, so are we sure we need to scan the pte? well, for better or for worse, that's the design. We need to run page_referenced() when considering whether to deactivate the page and that involves a scan of all the ptes. > This > might be as well the unmapping code being invoked too early despite > there's still clean cache to free. Might be so, but even if we ade changes there, failure modes will remain. > If I/O would start because swapping > is really needed, the O(N) walk wouldn't hog the cpu so much because > lots of time would be spent waiting for I/O too. yup. The *total* amount of CPu we spend in there shouldn't matter a lot: unless something else is bust, it'll be relatively low. I think the problem here is that a) we do it all in a big burst and b) we do it on lots of CPUs at the same time, so that burst is quite an inefficient one. We _could_ teach kswapd to keep the lists in balance in some fashion even when we're above pages_high. But I suspect that'll have corner-cases and probably it'd be better to do it synchronously. There's not much point in having multiple CPUs doing this so some per-zone trylock could perhaps be used. > Decreasing > DEF_PRIORITY should defer the invocation of the unmapping code too. > > Conversion to O(log(N)) like for the filebacked mappings shouldn't be > a big problem but it'll waste more static memory for each vma and > anon_vma. hm, OK, I haven't looked at what would be involved there. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
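A sketch of the per-zone trylock idea mentioned above; balance_lock is a hypothetical field and the surrounding fragment is illustrative only. The point is that a second CPU arriving while the zone is already being rebalanced simply skips the work rather than queueing up on the lru_lock.

	/* in the direct-reclaim path, before refilling the inactive list */
	if (spin_trylock(&zone->balance_lock)) {
		shrink_active_list(nr_to_scan, zone, sc, priority);
		spin_unlock(&zone->balance_lock);
	}
	/* else: another CPU is already rebalancing this zone; don't pile on */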
* Re: [PATCH 01 of 16] remove nr_scan_inactive/active 2007-06-28 22:44 ` Rik van Riel 2007-06-28 22:57 ` Andrew Morton @ 2007-06-29 13:38 ` Lee Schermerhorn 2007-06-29 14:12 ` Andrea Arcangeli 1 sibling, 1 reply; 77+ messages in thread From: Lee Schermerhorn @ 2007-06-29 13:38 UTC (permalink / raw) To: Rik van Riel; +Cc: Andrew Morton, Andrea Arcangeli, linux-mm, Nick Dokos On Thu, 2007-06-28 at 18:44 -0400, Rik van Riel wrote: > Andrew Morton wrote: > > > Where's the system time being spent? > > OK, it turns out that there is quite a bit of variability > in where the system spends its time. I did a number of > reaim runs and averaged the time the system spent in the > top functions. > > This is with the Fedora rawhide kernel config, which has > quite a few debugging options enabled. > > _raw_spin_lock 32.0% > page_check_address 12.7% > __delay 10.8% > mwait_idle 10.4% > anon_vma_unlink 5.7% > __anon_vma_link 5.3% > lockdep_reset_lock 3.5% > __kmalloc_node_track_caller 2.8% > security_port_sid 1.8% > kfree 1.6% > anon_vma_link 1.2% > page_referenced_one 1.1% > > In short, the system is waiting on the anon_vma lock. > > I wonder if Lee Schemmerhorn's patch to turn that > spinlock into an rwlock would help this workload, > or if we simply should scan fewer pages in the > pageout code. > Rik: Here's a fairly recent version of the patch if you want to try it on your workload. We've seen mixed results on somewhat larger systems, with and without your split LRU patch. I've started writing up those results. I'll try to get back to finishing up the writeup after OLS and vacation. Regards, Lee ----------- Patch against 2.6.22-rc4-mm2 Make the anon_vma list lock a read/write lock. Heaviest use of this lock is in the page_referenced()/try_to_unmap() calls from vmscan [shrink_page_list()]. These functions can use a read lock to allow some parallelism for different cpus trying to reclaim pages mapped via the same set of vmas. This change should not change the footprint of the anon_vma in the non-debug case. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> include/linux/rmap.h | 9 ++++++--- mm/migrate.c | 4 ++-- mm/mmap.c | 4 ++-- mm/rmap.c | 20 ++++++++++---------- 4 files changed, 20 insertions(+), 17 deletions(-) Index: Linux/include/linux/rmap.h =================================================================== --- Linux.orig/include/linux/rmap.h 2007-06-11 14:39:56.000000000 -0400 +++ Linux/include/linux/rmap.h 2007-06-20 09:49:24.000000000 -0400 @@ -24,7 +24,7 @@ * pointing to this anon_vma once its vma list is empty. 
*/ struct anon_vma { - spinlock_t lock; /* Serialize access to vma list */ + rwlock_t rwlock; /* Serialize access to vma list */ struct list_head head; /* List of private "related" vmas */ }; @@ -42,18 +42,21 @@ static inline void anon_vma_free(struct kmem_cache_free(anon_vma_cachep, anon_vma); } +/* + * This needs to be a write lock for __vma_link() + */ static inline void anon_vma_lock(struct vm_area_struct *vma) { struct anon_vma *anon_vma = vma->anon_vma; if (anon_vma) - spin_lock(&anon_vma->lock); + write_lock(&anon_vma->rwlock); } static inline void anon_vma_unlock(struct vm_area_struct *vma) { struct anon_vma *anon_vma = vma->anon_vma; if (anon_vma) - spin_unlock(&anon_vma->lock); + write_unlock(&anon_vma->rwlock); } /* Index: Linux/mm/rmap.c =================================================================== --- Linux.orig/mm/rmap.c 2007-06-11 14:40:06.000000000 -0400 +++ Linux/mm/rmap.c 2007-06-20 09:50:27.000000000 -0400 @@ -25,7 +25,7 @@ * mm->mmap_sem * page->flags PG_locked (lock_page) * mapping->i_mmap_lock - * anon_vma->lock + * anon_vma->rwlock * mm->page_table_lock or pte_lock * zone->lru_lock (in mark_page_accessed, isolate_lru_page) * swap_lock (in swap_duplicate, swap_info_get) @@ -85,7 +85,7 @@ int anon_vma_prepare(struct vm_area_stru if (anon_vma) { allocated = NULL; locked = anon_vma; - spin_lock(&locked->lock); + write_lock(&locked->rwlock); } else { anon_vma = anon_vma_alloc(); if (unlikely(!anon_vma)) @@ -104,7 +104,7 @@ int anon_vma_prepare(struct vm_area_stru spin_unlock(&mm->page_table_lock); if (locked) - spin_unlock(&locked->lock); + write_unlock(&locked->rwlock); if (unlikely(allocated)) anon_vma_free(allocated); } @@ -132,10 +132,10 @@ void anon_vma_link(struct vm_area_struct struct anon_vma *anon_vma = vma->anon_vma; if (anon_vma) { - spin_lock(&anon_vma->lock); + write_lock(&anon_vma->rwlock); list_add_tail(&vma->anon_vma_node, &anon_vma->head); validate_anon_vma(vma); - spin_unlock(&anon_vma->lock); + write_unlock(&anon_vma->rwlock); } } @@ -147,13 +147,13 @@ void anon_vma_unlink(struct vm_area_stru if (!anon_vma) return; - spin_lock(&anon_vma->lock); + write_lock(&anon_vma->rwlock); validate_anon_vma(vma); list_del(&vma->anon_vma_node); /* We must garbage collect the anon_vma if it's empty */ empty = list_empty(&anon_vma->head); - spin_unlock(&anon_vma->lock); + write_unlock(&anon_vma->rwlock); if (empty) anon_vma_free(anon_vma); @@ -164,7 +164,7 @@ static void anon_vma_ctor(void *data, st { struct anon_vma *anon_vma = data; - spin_lock_init(&anon_vma->lock); + rwlock_init(&anon_vma->rwlock); INIT_LIST_HEAD(&anon_vma->head); } @@ -191,7 +191,7 @@ static struct anon_vma *page_lock_anon_v goto out; anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON); - spin_lock(&anon_vma->lock); + read_lock(&anon_vma->rwlock); return anon_vma; out: rcu_read_unlock(); @@ -200,7 +200,7 @@ out: static void page_unlock_anon_vma(struct anon_vma *anon_vma) { - spin_unlock(&anon_vma->lock); + read_unlock(&anon_vma->rwlock); rcu_read_unlock(); } Index: Linux/mm/mmap.c =================================================================== --- Linux.orig/mm/mmap.c 2007-06-20 09:39:03.000000000 -0400 +++ Linux/mm/mmap.c 2007-06-20 09:49:24.000000000 -0400 @@ -571,7 +571,7 @@ again: remove_next = 1 + (end > next-> if (vma->anon_vma) anon_vma = vma->anon_vma; if (anon_vma) { - spin_lock(&anon_vma->lock); + write_lock(&anon_vma->rwlock); /* * Easily overlooked: when mprotect shifts the boundary, * make sure the expanding vma has anon_vma set if the @@ -625,7 +625,7 @@ 
again: remove_next = 1 + (end > next-> } if (anon_vma) - spin_unlock(&anon_vma->lock); + write_unlock(&anon_vma->rwlock); if (mapping) spin_unlock(&mapping->i_mmap_lock); Index: Linux/mm/migrate.c =================================================================== --- Linux.orig/mm/migrate.c 2007-06-20 09:39:04.000000000 -0400 +++ Linux/mm/migrate.c 2007-06-20 09:49:24.000000000 -0400 @@ -228,12 +228,12 @@ static void remove_anon_migration_ptes(s * We hold the mmap_sem lock. So no need to call page_lock_anon_vma. */ anon_vma = (struct anon_vma *) (mapping - PAGE_MAPPING_ANON); - spin_lock(&anon_vma->lock); + read_lock(&anon_vma->rwlock); list_for_each_entry(vma, &anon_vma->head, anon_vma_node) remove_migration_pte(vma, old, new); - spin_unlock(&anon_vma->lock); + read_unlock(&anon_vma->rwlock); } /* -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH 01 of 16] remove nr_scan_inactive/active 2007-06-29 13:38 ` Lee Schermerhorn @ 2007-06-29 14:12 ` Andrea Arcangeli 2007-06-29 14:59 ` Rik van Riel ` (4 more replies) 0 siblings, 5 replies; 77+ messages in thread From: Andrea Arcangeli @ 2007-06-29 14:12 UTC (permalink / raw) To: Lee Schermerhorn; +Cc: Rik van Riel, Andrew Morton, linux-mm, Nick Dokos On Fri, Jun 29, 2007 at 09:38:29AM -0400, Lee Schermerhorn wrote: > On Thu, 2007-06-28 at 18:44 -0400, Rik van Riel wrote: > > Andrew Morton wrote: > > > > > Where's the system time being spent? > > > > OK, it turns out that there is quite a bit of variability > > in where the system spends its time. I did a number of > > reaim runs and averaged the time the system spent in the > > top functions. > > > > This is with the Fedora rawhide kernel config, which has > > quite a few debugging options enabled. > > > > _raw_spin_lock 32.0% > > page_check_address 12.7% > > __delay 10.8% > > mwait_idle 10.4% > > anon_vma_unlink 5.7% > > __anon_vma_link 5.3% > > lockdep_reset_lock 3.5% > > __kmalloc_node_track_caller 2.8% > > security_port_sid 1.8% > > kfree 1.6% > > anon_vma_link 1.2% > > page_referenced_one 1.1% BTW, hope the above numbers are measured before the trashing stage when the number of jobs per second is lower than 10. It'd be nice not to spend all that time in system time but after that point the system will shortly reach oom. It's more important to be fast and save cpu in "useful" conditions (like with <4000 tasks). > Here's a fairly recent version of the patch if you want to try it on > your workload. We've seen mixed results on somewhat larger systems, > with and without your split LRU patch. I've started writing up those > results. I'll try to get back to finishing up the writeup after OLS and > vacation. This looks a very good idea indeed. Overall the O(log(N)) change I doubt would help, being able to give an efficient answer to "give me only the vmas that maps this anon page" won't be helpful here since the answer will be the same as the current question "give me any vma that may be mapping this anon page". Only for the filebacked mappings it matters. Also I'm stunned this is being compared to a java workload, java is a threaded beast (unless you're capable of understanding async-io in which case it's still threaded but with tons less threads, but anyway you code it won't create any anonymous related overhead). What we deal with isn't really an issue with anon-vma but just with the fact the system is trying to unmap pages that are mapped in 4000-5000 pte, so no matter how you code it, there will be still 4000-5000 ptes to check for each page that we want to know if it's referenced and it will take system time, this is an hardware issue not a software one. And the other suspect thing is to do all that pte-mangling work without doing any I/O at all. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH 01 of 16] remove nr_scan_inactive/active 2007-06-29 14:12 ` Andrea Arcangeli @ 2007-06-29 14:59 ` Rik van Riel 2007-06-29 22:39 ` "Noreclaim Infrastructure" [was Re: [PATCH 01 of 16] remove nr_scan_inactive/active] Lee Schermerhorn ` (3 subsequent siblings) 4 siblings, 0 replies; 77+ messages in thread From: Rik van Riel @ 2007-06-29 14:59 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Lee Schermerhorn, Andrew Morton, linux-mm, Nick Dokos Andrea Arcangeli wrote: > BTW, hope the above numbers are measured before the trashing stage > when the number of jobs per second is lower than 10. It'd be nice not > to spend all that time in system time but after that point the system > will shortly reach oom. It's more important to be fast and save cpu in > "useful" conditions (like with <4000 tasks). If the numbers were measured only in the thrashing stage, mwait_idle would be the top CPU "user", not the scanning code. What I am trying to measure is more a question of system robustness than performance. We have seen a few cases where the system took 2 hours to recover to a useful state after running out of RAM, with enough free swap. Linux needs to deal better with memory filling up. It should start to swap instead of scanning pages for very long periods of time and not recovering for a while. >> Here's a fairly recent version of the patch if you want to try it on >> your workload. We've seen mixed results on somewhat larger systems, >> with and without your split LRU patch. I've started writing up those >> results. I'll try to get back to finishing up the writeup after OLS and >> vacation. > > This looks a very good idea indeed. I'm definately going to give Lee's patch a spin. > Also I'm stunned this is being compared to a java workload, java is a > threaded beast Interestingly enough, both a heavy Java workload and this AIM7 test block on the anon_vma lock contention. -- Politics is the struggle between those who want to make their country the best in the world, and those who believe it already is. Each group calls the other unpatriotic. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* "Noreclaim Infrastructure" [was Re: [PATCH 01 of 16] remove nr_scan_inactive/active] 2007-06-29 14:12 ` Andrea Arcangeli 2007-06-29 14:59 ` Rik van Riel @ 2007-06-29 22:39 ` Lee Schermerhorn 2007-06-29 22:42 ` RFC "Noreclaim Infrastructure - patch 1/3 basic infrastructure" Lee Schermerhorn ` (2 subsequent siblings) 4 siblings, 0 replies; 77+ messages in thread From: Lee Schermerhorn @ 2007-06-29 22:39 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Rik van Riel, Andrew Morton, linux-mm, Nick Dokos On Fri, 2007-06-29 at 16:12 +0200, Andrea Arcangeli wrote: > On Fri, Jun 29, 2007 at 09:38:29AM -0400, Lee Schermerhorn wrote: <snip> > > > Here's a fairly recent version of the patch if you want to try it on > > your workload. We've seen mixed results on somewhat larger systems, > > with and without your split LRU patch. I've started writing up those > > results. I'll try to get back to finishing up the writeup after OLS and > > vacation. > > This looks a very good idea indeed. > > Overall the O(log(N)) change I doubt would help, being able to give an > efficient answer to "give me only the vmas that maps this anon page" > won't be helpful here since the answer will be the same as the current > question "give me any vma that may be mapping this anon page". Only > for the filebacked mappings it matters. > > Also I'm stunned this is being compared to a java workload, java is a > threaded beast (unless you're capable of understanding async-io in > which case it's still threaded but with tons less threads, but anyway > you code it won't create any anonymous related overhead). What we deal > with isn't really an issue with anon-vma but just with the fact the > system is trying to unmap pages that are mapped in 4000-5000 pte, so > no matter how you code it, there will be still 4000-5000 ptes to check > for each page that we want to know if it's referenced and it will take > system time, this is an hardware issue not a software one. And the > other suspect thing is to do all that pte-mangling work without doing > any I/O at all. Andrea: Yes, the patch is not a panacea. At best, it allows different kswapd's to attempt to unmap different pages associated with the same VMA. But, as you say, you still have to unmap X000 ptes. On one of the smaller ia64 systems we've been testing, we hit this state in the 15000-20000 range of AIM jobs. This patch, along with Rik's split LRU patch allowed us to make forward progress at saturation, and we were actually swapping, instead of just spinning around in page_referenced() and try_to_unmap(). [Actually, I don't think we get past page_referenced() much w/o this patch--have to check.] I have experimented with another "noreclaim" infrastructure, based on some patches by Larry Woodman at Red Hat, to keep non-reclaimable pages off the active/inactive list. I envisioned this as a general infrastructure to handle this case--pages whose anon_vmas have excessively long vma lists, swap-backed pages for which no swap space is available and mlock()ed pages [a la Nick Piggin's patch]. I will include the patch overview here and send along the 2 infrastructure patches and one "client" patch--the excessively referenced anon_vma case. I'm not proposing that these be considered for inclusion. Just another take on this issue. The patches are against 2.6.21-rc6. I have been distracted by other issues lately, so they have languished, and even the overview is a bit out of date relative to on-going activity in this area. 
I did integrate this series with Rik's split LRU patch at one time, and it all "worked" for some definition thereof. One final note before the "noreclaim overview": I have seen similar behavior on the i_mmap_lock for file back pages running a [too] heavy Oracle/TPC-C workload--on a larger ia64 system with ~8TB of storage. System hung/unresponsive, spitting out "Soft lockup" messages. Stack traces showed cpus in spinlock contention called from page_referenced_file. So, it's not limited to anon pages. Lee ----------------- This series of patches introduces support for mananaging "non-reclaimable" pages off the LRU active and inactive list. In this rather long-winded overview, I attempt to provide the motivation for this work, describe how it relates to other recent patches that address different aspects of the "problem", and give an overview of the mechanism. I'll try not to repeat too much of this in the patch descriptions. We have seen instances of large linux servers [10s/100s of GB of memory => millions of pages] apparently hanging for extended periods [10s or minutes or more] while all processors attempt to reclaim memory. For various reasons many of the pages on the LRU lists become difficult or impossible to reclaim. The system spends a lot time trying to reclaim [unmap] the difficult pages and/or shuffling through the impossible ones. Some of the conditions that make pages difficult or impossible to reclaim: 1) page is anon or shmem, but no swap space available 2) page is mlocked into memory 3) page is anon with an excessive number of related vmas [on the anon_vma list]. More on this below. The basic noreclaim mechanism, described below, is based on a patch developed by Larry Woodman of Red Hat for RHEL4 [2.6.9+ based kernel] to address the first condition above--an x86_64 non-NUMA system with 64G-128G memory [16M-32M 4k pages] with very little swap space-- ~2GB. The majority of the memory on the system was consumed by large database shared memory areas. A file IO intensive operation, such as backup, causes remaining free memory to be consumed by the page cache, initiating reclaim. vmscan then spends a great deal of time shuffling non-swappable anon and shmem pages between the active to the inactive lists, only to find that it can't move them to the swap cache. The pages get reactivated and round and round it goes. Because pages cannot be easily reclaimed, eventually other processors need to allocate pages and enter direct reclaim, only to compete for the zone lru lock. The single [normal] zone on the non-numa platform exacerbates this problem, but it can also arise, per zone, on numa platforms. Larry's patch alleviates this problem by maintaining anon and shmem pages for which no swap space exists on a per zone noreclaim list. Once the pages have been parked there, vmscan deals only with page cache pages, and anon/shmem pages to which space space has already been assigned. Pages move from the noreclaim list back to the LRU when swap space becomes available. Upstream developers have been addressing some of these issues in other ways: Christoph Lameter posted a patch to keep anon pages off the LRU when SWAP support not configured into the kernel. With Christoph's patch, these pages are left out "in limbo"--not on any list. Because of this, Christoph's patch does not address the more common situation of kernels with SWAP configured in, but insufficient or no swap added. 
I think this is a more common situation because most distros will ship kernels with the SWAP support configured in--at least for "enterprise" use. Maintaining these pages on a noreclaim list, will make it possible to restore these pages to the [in]active lists when/if swap is added. Nick Piggin's patch to keep mlock'ed pages [condition 2 above] off the LRU list also lets the mlocked/non-reclaimable pages float, not on any list. While Nick's patch does allow these pages to become reclaimable when all memory locks are removed, there is another reason to keep pages on a separate list. We want to be able to migrate anon pages that have no swap space backing them, and those that are mlocked. Indeed, the migration infrastructure supports this. However, the LRU lists, via the zone lru locks, arbitrate between tasks attempting to migrate the same pages simultaneously. To migrate a page, we must isolate it from the LRU. If the page cannot be isolated, migration gives up and moves on to another page. Which ever task is successful in isolating the page proceeds with the migration. Keeping the nonreclaimable pages on a separate list, protected by the zone lru lock, would preserve this arbitration function. isolate_page_lru(), used by both migration and Nick's mlock patch, can be enhanced to find pages on the noreclaim list, as well as on the [in]active lists. What's the probability that tasks will race on migrating the same page? Fairly high if auto-migration ever makes it into the kernel, but non-zero in any case. Rik van Reil's patch to split the active and inactive lists can address the non-swappable page problem by throttling the scan of the anon LRU lists, that contain both anon and shmem pages. However, if the system supports any swap space at all, one still needs to scan the anon lists to free up memory consumed by pages already in the swap cache. On large memory systems, the anon lists can still be millions of pages long and contain a large per centage of non-swappable and mlocked pages. This series attempts to unify this work into a general mechanism for managing non-reclaimable pages. The basic objective is to make vmscan as productive as possible on very large memory systems, by eliminating non-productive page shuffling. Like Larry's patch, the noreclaim infrastructure maintains "non-reclaimable" pages on a separate per-zone list. This noreclaim list is, conceptually, another LRU list--a sibling of the active and inactive lists. A page on the noreclaim list will have the PG_lru and PG_noreclaim flags set. The PG_noreclaim flag is analogous to, and mutually exclusive with, the PG_active flag--it specifies which LRU list the page resides on. The noreclaim list supports a pagevec cache, like the active and inactive lists to reduce contention on the zone lru lock in vmscan and in the fault path. Pages on the noreclaim list are "hidden" from page reclaim scanning. Thus, reclaim will not spend time attempting to reclaim the pages, only to find that they can't be unmapped, have no swap space available, are locked into memory, ... However, vmscan may find pages on the [in]active lists that have become non-reclaimable since they were put on the list. It will move them to the noreclaim list at that time. This series of patches includes the basic noreclaim list support and one patch, as a proof of concept, to address the 3rd condition listed above: the excessively long anon_vma list of related vmas. 
This seemed to be the easiest of the 3 conditions to address, and I have a test case handy [AIM7--see below]. Additional patches to handle anon pages for which no swap exists and to layer Nick Piggin's patch to keep "mlock pages off the LRU" will be forthcoming, if feedback indicates that this approach is worth pursuing. Now, about those anon pages with really long "related vma" lists: We have only seen this in AIM7 benchmarks on largish servers. The situation occurs when a single task fork()s many [10s of] thousands of children, and the the system needs to reclaim memory. We've seen all processors on a system spinning on the anon_vma lock attempting to unmap pages mapped by these thousands of children--for 10s of minutes or until we give up and reboot. I discussed this issue at LCA'07 in a kernel miniconf presentation. Linus questioned whether this was a problem that really needs solving. After all, AIM7 is only a synthetic benchmark. Does any real application behave this way? After the presentation, someone came up to me and told me that Apache also fork()s for each incoming connection and can fork thousands of children. However, I have not witnessed this, nor do I know how long lived these children are. I have included another patch that makes the anon_vma lock a reader/write lock. This allows different cpus to attempt to reclaim, in parallel, different pages that point to the same anon_vma. However, this doesn't solve the problem of trying to unmap pages that are [potentially] mapped into thousands of vmas. The last patch in this series counts the number of related vmas on an anon_vma's list and, when it exceeds a tunable threshold, pages that reference that anon_vma are declared nonreclaimable. We detect these non-reclaimable pages either on fault [COW or new anon page in a vma with an excessively shared anon_vma] or when vmscan encounters such a page on the LRU list. The patch/series does not [yet] support moving such a page back to the [in]active lists when it's anon_vma sharing drops below the threshold. This usually occurs when a task exits or explicitly unmapps the area. Any COWed private pages will be freed at this time, but anon pages that are still shared will remain nonreclaimable even though the related vma count is below the no-reclaim limit. Again, I will address this if the overall approach is deemed worth pursuing. Additional considerations: If the noreclaim list contains mlocked pages, they can be directly deleted from the noreclaim list without scanning when the become unlocked. But, note that we can't use one of the lru link fields to contain the mlocked vma count in this case. If the noreclaim list contains anon/shmem pages for which no swap space exists, it will be necessary to scan the list when swap space becomes available, either because it has been freed from other pages, or because additional swap has been added. The latter case should not occur frequently enough to be a problem. We should be able to defer the scanning when swap space is freed from other pages until a sufficient number become available or system is under severe pressure. If the list contains pages that are merely difficult to reclaim because of the excessive anon_vma sharing, and if we want to make them reclaimable again when the anon_vma related vma count drops to an acceptable value, one would have to scan the list at some point. Again, this could be deferred until there are a sufficient number of such pages to make it worth while or until the system is under severe memory pressure. 
The above considerations suggest that one consider separate lists for non-reclaimable [no swap, mlocked] and difficult to reclaim. Or, maybe not... Interaction of noreclaim list and LRU lists: My current patch moves pages to the noreclaim list as soon as they are detected, either on the active or inactive list. I could change this such that non-reclaimable pages found on the active list go to the inactive list first, and take a ride there before being declared non-reclaimable. However, we still have the issue of where to place the pages when then come off the no reclaim list: back to the active list? the inactive list? head or tail thereof? My current mechanism, with the PG_active and PG_noreclaim flags being mutually exclusive, does not track activeness of pages on the noreclaim list. To do so would require additional scanning of the list, I think, sort of defeating the purpose of the list. But, maybe acceptable if we scan just to test/modify the active flags. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
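To make the three conditions above concrete, here is a hedged sketch of how they might eventually plug into page_reclaimable(), which the first patch below adds only as a stub. Only the long-anon_vma-list test actually exists in this series; the swap and mlock tests are guesses at what the other client patches would do, and anon_vma_excessively_shared() is an invented helper name.

int page_reclaimable(struct page *page, struct vm_area_struct *vma)
{
	VM_BUG_ON(PageNoreclaim(page));

	/* 1) anon page with no swap to move it to (a real test would also
	 *    handle swap that exists but is full) */
	if (PageAnon(page) && total_swap_pages == 0)
		return 0;

	/* 2) page locked into memory */
	if (vma && (vma->vm_flags & VM_LOCKED))
		return 0;

	/* 3) anon page whose anon_vma "related vma" list is excessively long */
	if (PageAnon(page) && anon_vma_excessively_shared(page, vma))
		return 0;

	return 1;
}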
* RFC "Noreclaim Infrastructure - patch 1/3 basic infrastructure" 2007-06-29 14:12 ` Andrea Arcangeli 2007-06-29 14:59 ` Rik van Riel 2007-06-29 22:39 ` "Noreclaim Infrastructure" [was Re: [PATCH 01 of 16] remove nr_scan_inactive/active] Lee Schermerhorn @ 2007-06-29 22:42 ` Lee Schermerhorn 2007-06-29 22:44 ` RFC "Noreclaim Infrastructure patch 2/3 - noreclaim statistics..." Lee Schermerhorn 2007-06-29 22:49 ` "Noreclaim - client patch 3/3 - treat pages w/ excessively references anon_vma as nonreclaimable" Lee Schermerhorn 4 siblings, 0 replies; 77+ messages in thread From: Lee Schermerhorn @ 2007-06-29 22:42 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Rik van Riel, Andrew Morton, linux-mm, Nick Dokos Patch against 2.6.21-rc5/6 Infrastructure to manage pages excluded from reclaim--i.e., hidden from vmscan. Based on a patch by Larry Woodman of Red Hat. Applies atop two patches from Nick Piggin's "mlock pages off the LRU" series: move-and-rework-isolate_lru_page and move-and-rename-install_arg_page Maintain "nonreclaimable" pages on a separate per-zone list, to "hide" them from vmscan. Although this patch series does not support it, the noreclaim list could be scanned at a lower rate--for example to attempt to reclaim the "difficult to reclaim" pages when pages are REALLY needed, such as when reserves are exhausted and a critical need arises. A new function 'page_reclaimable(page, vma)' in vmscan.c tests whether or not a page is reclaimable. Subsequent patches will add the various !reclaimable tests. Reclaimable pages are placed on the appropriate LRU list; non-reclaimable pages on the new noreclaim list. Notes: 1. Not sure I need the 'vma' arg to page_reclaimable(). I did in an earlier incarnation. Don't seem to now 2. for now, use bit 20 in page flags. Could restrict to 64-bit systems only and use one of bits 21-30 [ia64 uses bit 31; other archs ???]. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> include/linux/mm_inline.h | 34 +++++++++++++++++++- include/linux/mmzone.h | 6 +++ include/linux/page-flags.h | 20 ++++++++++++ include/linux/pagevec.h | 5 +++ include/linux/swap.h | 11 ++++++ mm/Kconfig | 8 ++++ mm/mempolicy.c | 2 - mm/migrate.c | 8 ++++ mm/page_alloc.c | 6 +++ mm/swap.c | 73 +++++++++++++++++++++++++++++++++++++++---- mm/vmscan.c | 75 ++++++++++++++++++++++++++++++++++++++++++++- 11 files changed, 237 insertions(+), 11 deletions(-) Index: Linux/mm/Kconfig =================================================================== --- Linux.orig/mm/Kconfig 2007-03-26 12:39:02.000000000 -0400 +++ Linux/mm/Kconfig 2007-03-26 13:14:05.000000000 -0400 @@ -163,3 +163,11 @@ config ZONE_DMA_FLAG default "0" if !ZONE_DMA default "1" +config NORECLAIM + bool "Track non-reclaimable pages" + help + Supports tracking of non-reclaimable pages off the [in]active lists + to avoid excessive reclaim overhead on large memory systems. Pages + may be non-reclaimable because: they are locked into memory, they + are anonymous pages for which no swap space exists, or they are anon + pages that are expensive to unmap [long anon_vma "related vma" list.] 
Index: Linux/include/linux/page-flags.h =================================================================== --- Linux.orig/include/linux/page-flags.h 2007-03-26 12:39:01.000000000 -0400 +++ Linux/include/linux/page-flags.h 2007-03-26 13:15:08.000000000 -0400 @@ -91,6 +91,9 @@ #define PG_nosave_free 18 /* Used for system suspend/resume */ #define PG_buddy 19 /* Page is free, on buddy lists */ +#define PG_noreclaim 20 /* Page is "non-reclaimable" */ + + /* PG_owner_priv_1 users should have descriptive aliases */ #define PG_checked PG_owner_priv_1 /* Used by some filesystems */ @@ -249,6 +252,23 @@ static inline void SetPageUptodate(struc #define PageSwapCache(page) 0 #endif +#ifdef CONFIG_NORECLAIM +#define PageNoreclaim(page) test_bit(PG_noreclaim, &(page)->flags) +#define SetPageNoreclaim(page) set_bit(PG_noreclaim, &(page)->flags) +#define ClearPageNoreclaim(page) clear_bit(PG_noreclaim, &(page)->flags) +#define __ClearPageNoreclaim(page) __clear_bit(PG_noreclaim, &(page)->flags) +//TODO: need test versions? +#define TestSetPageNoreclaim(page) \ + test_and_set_bit(PG_noreclaim, &(page)->flags) +#define TestClearPageNoreclaim(page) \ + test_and_clear_bit(PG_noreclaim, &(page)->flags) +#else +#define PageNoreclaim(page) 0 +#define SetPageNoreclaim(page) +#define ClearPageNoreclaim(page) +#define __ClearPageNoreclaim(page) +#endif + #define PageUncached(page) test_bit(PG_uncached, &(page)->flags) #define SetPageUncached(page) set_bit(PG_uncached, &(page)->flags) #define ClearPageUncached(page) clear_bit(PG_uncached, &(page)->flags) Index: Linux/include/linux/mmzone.h =================================================================== --- Linux.orig/include/linux/mmzone.h 2007-03-26 12:39:01.000000000 -0400 +++ Linux/include/linux/mmzone.h 2007-03-26 13:23:10.000000000 -0400 @@ -51,6 +51,9 @@ enum zone_stat_item { NR_FREE_PAGES, NR_INACTIVE, NR_ACTIVE, +#ifdef CONFIG_NORECLAIM + NR_NORECLAIM, +#endif NR_ANON_PAGES, /* Mapped anonymous pages */ NR_FILE_MAPPED, /* pagecache pages mapped into pagetables. 
only modified from process context */ @@ -217,6 +220,9 @@ struct zone { spinlock_t lru_lock; struct list_head active_list; struct list_head inactive_list; +#ifdef CONFIG_NORECLAIM + struct list_head noreclaim_list; +#endif unsigned long nr_scan_active; unsigned long nr_scan_inactive; unsigned long pages_scanned; /* since last reclaim */ Index: Linux/mm/page_alloc.c =================================================================== --- Linux.orig/mm/page_alloc.c 2007-03-26 12:39:02.000000000 -0400 +++ Linux/mm/page_alloc.c 2007-03-26 13:17:49.000000000 -0400 @@ -198,6 +198,7 @@ static void bad_page(struct page *page) 1 << PG_private | 1 << PG_locked | 1 << PG_active | + 1 << PG_noreclaim | 1 << PG_dirty | 1 << PG_reclaim | 1 << PG_slab | @@ -433,6 +434,7 @@ static inline int free_pages_check(struc 1 << PG_private | 1 << PG_locked | 1 << PG_active | + 1 << PG_noreclaim | 1 << PG_reclaim | 1 << PG_slab | 1 << PG_swapcache | @@ -582,6 +584,7 @@ static int prep_new_page(struct page *pa 1 << PG_private | 1 << PG_locked | 1 << PG_active | + 1 << PG_noreclaim | 1 << PG_dirty | 1 << PG_reclaim | 1 << PG_slab | @@ -2673,6 +2676,9 @@ static void __meminit free_area_init_cor zone_pcp_init(zone); INIT_LIST_HEAD(&zone->active_list); INIT_LIST_HEAD(&zone->inactive_list); +#ifdef CONFIG_NORECLAIM + INIT_LIST_HEAD(&zone->noreclaim_list); +#endif zone->nr_scan_active = 0; zone->nr_scan_inactive = 0; zap_zone_vm_stats(zone); Index: Linux/include/linux/mm_inline.h =================================================================== --- Linux.orig/include/linux/mm_inline.h 2007-03-26 12:39:01.000000000 -0400 +++ Linux/include/linux/mm_inline.h 2007-03-26 13:24:10.000000000 -0400 @@ -26,11 +26,43 @@ del_page_from_inactive_list(struct zone __dec_zone_state(zone, NR_INACTIVE); } +#ifdef CONFIG_NORECLAIM +static inline void __dec_zone_noreclaim(struct zone *zone) +{ + __dec_zone_state(zone, NR_NORECLAIM); +} + +static inline void +add_page_to_noreclaim_list(struct zone *zone, struct page *page) +{ + list_add(&page->lru, &zone->noreclaim_list); + __inc_zone_state(zone, NR_NORECLAIM); +} + +static inline void +del_page_from_noreclaim_list(struct zone *zone, struct page *page) +{ + list_del(&page->lru); + __dec_zone_noreclaim(zone); +} +#else +static inline void __dec_zone_noreclaim(struct zone *zone) { } + +static inline void +add_page_to_noreclaim_list(struct zone *zone, struct page *page) { } + +static inline void +del_page_from_noreclaim_list(struct zone *zone, struct page *page) { } +#endif + static inline void del_page_from_lru(struct zone *zone, struct page *page) { list_del(&page->lru); - if (PageActive(page)) { + if (PageNoreclaim(page)) { + __ClearPageNoreclaim(page); + __dec_zone_noreclaim(zone); + } else if (PageActive(page)) { __ClearPageActive(page); __dec_zone_state(zone, NR_ACTIVE); } else { Index: Linux/include/linux/swap.h =================================================================== --- Linux.orig/include/linux/swap.h 2007-03-26 12:39:01.000000000 -0400 +++ Linux/include/linux/swap.h 2007-03-26 13:13:18.000000000 -0400 @@ -186,6 +186,11 @@ extern void lru_add_drain(void); extern int lru_add_drain_all(void); extern int rotate_reclaimable_page(struct page *page); extern void swap_setup(void); +#ifdef CONFIG_NORECLAIM +extern void FASTCALL(lru_cache_add_noreclaim(struct page *page)); +#else +static inline void lru_cache_add_noreclaim(struct page *page) { } +#endif /* linux/mm/vmscan.c */ extern unsigned long try_to_free_pages(struct zone **, gfp_t); @@ -207,6 +212,12 @@ static inline int 
zone_reclaim(struct zo } #endif +#ifdef CONFIG_NORECLAIM +extern int page_reclaimable(struct page *page, struct vm_area_struct *vma); +#else +#define page_reclaimable(P, V) 1 +#endif + extern int kswapd_run(int nid); #ifdef CONFIG_MMU Index: Linux/include/linux/pagevec.h =================================================================== --- Linux.orig/include/linux/pagevec.h 2007-02-04 13:44:54.000000000 -0500 +++ Linux/include/linux/pagevec.h 2007-03-26 13:13:18.000000000 -0400 @@ -25,6 +25,11 @@ void __pagevec_release_nonlru(struct pag void __pagevec_free(struct pagevec *pvec); void __pagevec_lru_add(struct pagevec *pvec); void __pagevec_lru_add_active(struct pagevec *pvec); +#ifdef CONFIG_NORECLAIM +void __pagevec_lru_add_noreclaim(struct pagevec *pvec); +#else +static inline void __pagevec_lru_add_noreclaim(struct pagevec *pvec) { } +#endif void pagevec_strip(struct pagevec *pvec); unsigned pagevec_lookup(struct pagevec *pvec, struct address_space *mapping, pgoff_t start, unsigned nr_pages); Index: Linux/mm/swap.c =================================================================== --- Linux.orig/mm/swap.c 2007-02-04 13:44:54.000000000 -0500 +++ Linux/mm/swap.c 2007-03-26 13:13:18.000000000 -0400 @@ -117,14 +117,14 @@ int rotate_reclaimable_page(struct page return 1; if (PageDirty(page)) return 1; - if (PageActive(page)) + if (PageActive(page) | PageNoreclaim(page)) return 1; if (!PageLRU(page)) return 1; zone = page_zone(page); spin_lock_irqsave(&zone->lru_lock, flags); - if (PageLRU(page) && !PageActive(page)) { + if (PageLRU(page) && !PageActive(page) && !PageNoreclaim(page)) { list_move_tail(&page->lru, &zone->inactive_list); __count_vm_event(PGROTATED); } @@ -142,7 +142,7 @@ void fastcall activate_page(struct page struct zone *zone = page_zone(page); spin_lock_irq(&zone->lru_lock); - if (PageLRU(page) && !PageActive(page)) { + if (PageLRU(page) && !PageActive(page) && !PageNoreclaim(page)) { del_page_from_inactive_list(zone, page); SetPageActive(page); add_page_to_active_list(zone, page); @@ -160,7 +160,8 @@ void fastcall activate_page(struct page */ void fastcall mark_page_accessed(struct page *page) { - if (!PageActive(page) && PageReferenced(page) && PageLRU(page)) { + if (!PageActive(page) && !PageNoreclaim(page) && + PageReferenced(page) && PageLRU(page)) { activate_page(page); ClearPageReferenced(page); } else if (!PageReferenced(page)) { @@ -197,6 +198,29 @@ void fastcall lru_cache_add_active(struc put_cpu_var(lru_add_active_pvecs); } +#ifdef CONFIG_NORECLAIM +static DEFINE_PER_CPU(struct pagevec, lru_add_noreclaim_pvecs) = { 0, }; + +void fastcall lru_cache_add_noreclaim(struct page *page) +{ + struct pagevec *pvec = &get_cpu_var(lru_add_noreclaim_pvecs); + + page_cache_get(page); + if (!pagevec_add(pvec, page)) + __pagevec_lru_add_noreclaim(pvec); + put_cpu_var(lru_add_noreclaim_pvecs); +} + +static inline void __drain_noreclaim_pvec(struct pagevec **pvec, int cpu) +{ + *pvec = &per_cpu(lru_add_noreclaim_pvecs, cpu); + if (pagevec_count(*pvec)) + __pagevec_lru_add_noreclaim(*pvec); +} +#else +static inline void __drain_noreclaim_pvec(struct pagevec **pvec, int cpu) { } +#endif + static void __lru_add_drain(int cpu) { struct pagevec *pvec = &per_cpu(lru_add_pvecs, cpu); @@ -207,6 +231,8 @@ static void __lru_add_drain(int cpu) pvec = &per_cpu(lru_add_active_pvecs, cpu); if (pagevec_count(pvec)) __pagevec_lru_add_active(pvec); + + __drain_noreclaim_pvec(&pvec, cpu); } void lru_add_drain(void) @@ -277,14 +303,18 @@ void release_pages(struct page **pages, if (PageLRU(page)) { 
struct zone *pagezone = page_zone(page); + int is_lru_page; + if (pagezone != zone) { if (zone) spin_unlock_irq(&zone->lru_lock); zone = pagezone; spin_lock_irq(&zone->lru_lock); } - VM_BUG_ON(!PageLRU(page)); - __ClearPageLRU(page); + is_lru_page = PageLRU(page); + VM_BUG_ON(!(is_lru_page)); + if (is_lru_page) + __ClearPageLRU(page); del_page_from_lru(zone, page); } @@ -392,7 +422,7 @@ void __pagevec_lru_add_active(struct pag } VM_BUG_ON(PageLRU(page)); SetPageLRU(page); - VM_BUG_ON(PageActive(page)); + VM_BUG_ON(PageActive(page) || PageNoreclaim(page)); SetPageActive(page); add_page_to_active_list(zone, page); } @@ -402,6 +432,35 @@ void __pagevec_lru_add_active(struct pag pagevec_reinit(pvec); } +#ifdef CONFIG_NORECLAIM +void __pagevec_lru_add_noreclaim(struct pagevec *pvec) +{ + int i; + struct zone *zone = NULL; + + for (i = 0; i < pagevec_count(pvec); i++) { + struct page *page = pvec->pages[i]; + struct zone *pagezone = page_zone(page); + + if (pagezone != zone) { + if (zone) + spin_unlock_irq(&zone->lru_lock); + zone = pagezone; + spin_lock_irq(&zone->lru_lock); + } + VM_BUG_ON(PageLRU(page)); + SetPageLRU(page); + VM_BUG_ON(PageActive(page) || PageNoreclaim(page)); + SetPageNoreclaim(page); + add_page_to_noreclaim_list(zone, page); + } + if (zone) + spin_unlock_irq(&zone->lru_lock); + release_pages(pvec->pages, pvec->nr, pvec->cold); + pagevec_reinit(pvec); +} +#endif + /* * Try to drop buffers from the pages in a pagevec */ Index: Linux/mm/migrate.c =================================================================== --- Linux.orig/mm/migrate.c 2007-03-26 13:11:51.000000000 -0400 +++ Linux/mm/migrate.c 2007-03-26 13:13:18.000000000 -0400 @@ -52,7 +52,10 @@ int migrate_prep(void) static inline void move_to_lru(struct page *page) { - if (PageActive(page)) { + if (PageNoreclaim(page)) { + ClearPageNoreclaim(page); + lru_cache_add_noreclaim(page); + } else if (PageActive(page)) { /* * lru_cache_add_active checks that * the PG_active bit is off. 
@@ -322,6 +325,9 @@ static void migrate_page_copy(struct pag SetPageUptodate(newpage); if (PageActive(page)) SetPageActive(newpage); + else + if (PageNoreclaim(page)) + SetPageNoreclaim(newpage); if (PageChecked(page)) SetPageChecked(newpage); if (PageMappedToDisk(page)) Index: Linux/mm/vmscan.c =================================================================== --- Linux.orig/mm/vmscan.c 2007-03-26 13:11:51.000000000 -0400 +++ Linux/mm/vmscan.c 2007-03-26 13:24:56.000000000 -0400 @@ -473,6 +473,11 @@ static unsigned long shrink_page_list(st sc->nr_scanned++; + if (!page_reclaimable(page, NULL)) { + SetPageNoreclaim(page); + goto keep_locked; + } + if (!sc->may_swap && page_mapped(page)) goto keep_locked; @@ -587,6 +592,7 @@ free_it: continue; activate_locked: + VM_BUG_ON(PageActive(page)); SetPageActive(page); pgactivate++; keep_locked: @@ -682,6 +688,8 @@ int isolate_lru_page(struct page *page) ClearPageLRU(page); if (PageActive(page)) del_page_from_active_list(zone, page); + else if (PageNoreclaim(page)) + del_page_from_noreclaim_list(zone, page); else del_page_from_inactive_list(zone, page); } @@ -742,8 +750,11 @@ static unsigned long shrink_inactive_lis VM_BUG_ON(PageLRU(page)); SetPageLRU(page); list_del(&page->lru); - if (PageActive(page)) + if (PageActive(page)) { add_page_to_active_list(zone, page); + VM_BUG_ON(PageNoreclaim(page)); + } else if (PageNoreclaim(page)) + add_page_to_noreclaim_list(zone, page); else add_page_to_inactive_list(zone, page); if (!pagevec_add(&pvec, page)) { @@ -806,6 +817,9 @@ static void shrink_active_list(unsigned LIST_HEAD(l_hold); /* The pages which were snipped off */ LIST_HEAD(l_inactive); /* Pages to go onto the inactive_list */ LIST_HEAD(l_active); /* Pages to go onto the active_list */ +#ifdef CONFIG_NORECLAIM + LIST_HEAD(l_noreclaim); /* Pages to go onto the noreclaim list */ +#endif struct page *page; struct pagevec pvec; int reclaim_mapped = 0; @@ -869,6 +883,14 @@ force_reclaim_mapped: cond_resched(); page = lru_to_page(&l_hold); list_del(&page->lru); + if (!page_reclaimable(page, NULL)) { + /* + * divert any non-reclaimable pages onto the + * noreclaim list + */ + list_add(&page->lru, &l_noreclaim); + continue; + } if (page_mapped(page)) { if (!reclaim_mapped || (total_swap_pages == 0 && PageAnon(page)) || @@ -931,6 +953,30 @@ force_reclaim_mapped: } __mod_zone_page_state(zone, NR_ACTIVE, pgmoved); +#ifdef CONFIG_NORECLAIM + pgmoved = 0; + while (!list_empty(&l_noreclaim)) { + page = lru_to_page(&l_noreclaim); + prefetchw_prev_lru_page(page, &l_noreclaim, flags); + VM_BUG_ON(PageLRU(page)); + SetPageLRU(page); + VM_BUG_ON(!PageActive(page)); + ClearPageActive(page); + VM_BUG_ON(PageNoreclaim(page)); + SetPageNoreclaim(page); + list_move(&page->lru, &zone->noreclaim_list); + pgmoved++; + if (!pagevec_add(&pvec, page)) { + __mod_zone_page_state(zone, NR_NORECLAIM, pgmoved); + pgmoved = 0; + spin_unlock_irq(&zone->lru_lock); + __pagevec_release(&pvec); + spin_lock_irq(&zone->lru_lock); + } + } + __mod_zone_page_state(zone, NR_NORECLAIM, pgmoved); +#endif + __count_zone_vm_events(PGREFILL, zone, pgscanned); __count_vm_events(PGDEACTIVATE, pgdeactivate); spin_unlock_irq(&zone->lru_lock); @@ -1764,3 +1810,30 @@ int zone_reclaim(struct zone *zone, gfp_ return __zone_reclaim(zone, gfp_mask, order); } #endif + +#ifdef CONFIG_NORECLAIM +/* + * page_reclaimable(struct page *page, struct vm_area_struct *vma) + * Test whether page is reclaimable--i.e., should be placed on active/inactive + * lists vs noreclaim list. 
+ * + * @page - page to test + * @vma - vm area in which page is/will be mapped. May be NULL. + * If !NULL, called from fault path. + * + * Reasons page might not be reclaimable: + * TODO - later patches + * + * TODO: specify locking assumptions + */ +int page_reclaimable(struct page *page, struct vm_area_struct *vma) +{ + int reclaimable = 1; + + VM_BUG_ON(PageNoreclaim(page)); + + /* TODO: test page [!]reclaimable conditions */ + + return reclaimable; +} +#endif Index: Linux/mm/mempolicy.c =================================================================== --- Linux.orig/mm/mempolicy.c 2007-03-26 13:11:51.000000000 -0400 +++ Linux/mm/mempolicy.c 2007-03-26 13:13:18.000000000 -0400 @@ -1790,7 +1790,7 @@ static void gather_stats(struct page *pa if (PageSwapCache(page)) md->swapcache++; - if (PageActive(page)) + if (PageActive(page) || PageNoreclaim(page)) md->active++; if (PageWriteback(page)) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
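A sketch of how the fault path is expected to consume this infrastructure, based on the description in the third patch of the series (this fragment is illustrative, not code from the patch itself): the rmap is added first so that page_reclaimable() can see the page's anon_vma, then the page goes to whichever list it belongs on.

	page_add_new_anon_rmap(page, vma, address);
	if (page_reclaimable(page, vma))
		lru_cache_add_active(page);
	else
		lru_cache_add_noreclaim(page);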
* RFC "Noreclaim Infrastructure patch 2/3 - noreclaim statistics..." 2007-06-29 14:12 ` Andrea Arcangeli ` (2 preceding siblings ...) 2007-06-29 22:42 ` RFC "Noreclaim Infrastructure - patch 1/3 basic infrastructure" Lee Schermerhorn @ 2007-06-29 22:44 ` Lee Schermerhorn 2007-06-29 22:49 ` "Noreclaim - client patch 3/3 - treat pages w/ excessively references anon_vma as nonreclaimable" Lee Schermerhorn 4 siblings, 0 replies; 77+ messages in thread From: Lee Schermerhorn @ 2007-06-29 22:44 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Rik van Riel, Andrew Morton, linux-mm, Nick Dokos Patch against 2.6.21-rc5 n/m in the noreclaim series Report non-reclaimable pages per zone and system wide. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> drivers/base/node.c | 6 ++++++ fs/proc/proc_misc.c | 6 ++++++ mm/page_alloc.c | 16 +++++++++++++++- mm/vmstat.c | 3 +++ 4 files changed, 30 insertions(+), 1 deletion(-) Index: Linux/mm/page_alloc.c =================================================================== --- Linux.orig/mm/page_alloc.c 2007-03-26 13:17:49.000000000 -0400 +++ Linux/mm/page_alloc.c 2007-03-26 13:44:51.000000000 -0400 @@ -1574,10 +1574,18 @@ void show_free_areas(void) } } - printk("Active:%lu inactive:%lu dirty:%lu writeback:%lu unstable:%lu\n" +//TODO: check/adjust line lengths + printk("Active:%lu inactive:%lu" +#ifdef CONFIG_NORECLAIM + " noreclaim:%lu" +#endif + " dirty:%lu writeback:%lu unstable:%lu\n" " free:%lu slab:%lu mapped:%lu pagetables:%lu bounce:%lu\n", global_page_state(NR_ACTIVE), global_page_state(NR_INACTIVE), +#ifdef CONFIG_NORECLAIM + global_page_state(NR_NORECLAIM), +#endif global_page_state(NR_FILE_DIRTY), global_page_state(NR_WRITEBACK), global_page_state(NR_UNSTABLE_NFS), @@ -1602,6 +1610,9 @@ void show_free_areas(void) " high:%lukB" " active:%lukB" " inactive:%lukB" +#ifdef CONFIG_NORECLAIM + " noreclaim:%lukB" +#endif " present:%lukB" " pages_scanned:%lu" " all_unreclaimable? %s" @@ -1613,6 +1624,9 @@ void show_free_areas(void) K(zone->pages_high), K(zone_page_state(zone, NR_ACTIVE)), K(zone_page_state(zone, NR_INACTIVE)), +#ifdef CONFIG_NORECLAIM + K(zone_page_state(zone, NR_NORECLAIM)), +#endif K(zone->present_pages), zone->pages_scanned, (zone->all_unreclaimable ? 
"yes" : "no") Index: Linux/mm/vmstat.c =================================================================== --- Linux.orig/mm/vmstat.c 2007-03-26 12:39:02.000000000 -0400 +++ Linux/mm/vmstat.c 2007-03-26 13:35:43.000000000 -0400 @@ -434,6 +434,9 @@ static const char * const vmstat_text[] "nr_free_pages", "nr_active", "nr_inactive", +#ifdef CONFIG_NORECLAIM + "nr_noreclaim", +#endif "nr_anon_pages", "nr_mapped", "nr_file_pages", Index: Linux/drivers/base/node.c =================================================================== --- Linux.orig/drivers/base/node.c 2007-03-26 12:38:59.000000000 -0400 +++ Linux/drivers/base/node.c 2007-03-26 13:37:35.000000000 -0400 @@ -49,6 +49,9 @@ static ssize_t node_read_meminfo(struct "Node %d MemUsed: %8lu kB\n" "Node %d Active: %8lu kB\n" "Node %d Inactive: %8lu kB\n" +#ifdef CONFIG_NORECLAIM + "Node %d Noreclaim: %8lu kB\n" +#endif #ifdef CONFIG_HIGHMEM "Node %d HighTotal: %8lu kB\n" "Node %d HighFree: %8lu kB\n" @@ -71,6 +74,9 @@ static ssize_t node_read_meminfo(struct nid, K(i.totalram - i.freeram), nid, node_page_state(nid, NR_ACTIVE), nid, node_page_state(nid, NR_INACTIVE), +#ifdef CONFIG_NORECLAIM + nid, node_page_state(nid, NR_NORECLAIM), +#endif #ifdef CONFIG_HIGHMEM nid, K(i.totalhigh), nid, K(i.freehigh), Index: Linux/fs/proc/proc_misc.c =================================================================== --- Linux.orig/fs/proc/proc_misc.c 2007-03-26 12:39:01.000000000 -0400 +++ Linux/fs/proc/proc_misc.c 2007-03-26 13:39:05.000000000 -0400 @@ -154,6 +154,9 @@ static int meminfo_read_proc(char *page, "SwapCached: %8lu kB\n" "Active: %8lu kB\n" "Inactive: %8lu kB\n" +#ifdef CONFIG_NORECLAIM + "Noreclaim: %8lu kB\n" +#endif #ifdef CONFIG_HIGHMEM "HighTotal: %8lu kB\n" "HighFree: %8lu kB\n" @@ -184,6 +187,9 @@ static int meminfo_read_proc(char *page, K(total_swapcache_pages), K(global_page_state(NR_ACTIVE)), K(global_page_state(NR_INACTIVE)), +#ifdef CONFIG_NORECLAIM + K(global_page_state(NR_NORECLAIM)), +#endif #ifdef CONFIG_HIGHMEM K(i.totalhigh), K(i.freehigh), -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* "Noreclaim - client patch 3/3 - treat pages w/ excessively references anon_vma as nonreclaimable" 2007-06-29 14:12 ` Andrea Arcangeli ` (3 preceding siblings ...) 2007-06-29 22:44 ` RFC "Noreclaim Infrastructure patch 2/3 - noreclaim statistics..." Lee Schermerhorn @ 2007-06-29 22:49 ` Lee Schermerhorn 4 siblings, 0 replies; 77+ messages in thread From: Lee Schermerhorn @ 2007-06-29 22:49 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Rik van Riel, Andrew Morton, linux-mm, Nick Dokos Here's the last one for now. I have a couple more in this series that handle the swap-backed w/ no swap space avail, but that's a different topic, right? ---- Patch m/n against 2.6.21-rc5 - track anon_vma "related vmas" == list length When a single parent forks a large number [thousands, 10s of thousands] of children, the anon_vma list of related vmas becomes very long. In reclaim, this list must be traversed twice--once in page_referenced_anon() and once in try_to_unmap_anon()--under a spin lock to reclaim the page. Multiple cpus can end up spinning behind the same anon_vma spinlock and traversing the lists. This patch, part of the "noreclaim" series, treats anon pages with list lengths longer than a tunable threshold as non- reclaimable. 1) add mm Kconfig option NORECLAIM_ANON_VMA, dependent on NORECLAIM. 32-bit systems may not want/need this features. 2) add a counter of related vmas to the anon_vma structure. This won't increase the size of the structure on 64-bit systems, as it will fit in a padding slot. 3) In [__]anon_vma_[un]link(), track number of related vmas. The count is only incremented/decremented while the anon_vma lock is held, so regular, non-atomic, increment/decrement is used. 4) in page_reclaimable(), check anon_vma count in vma's anon_vma, if vma supplied, or in page's anon_vma. In fault path, new anon pages are placed on the LRU before adding the anon rmap, so we need to check the vma's anon_vma. Fortunately, the vma is available at that point. In vmscan, we can just check the page's anon_vma for any anon pages that made it onto the [in]active list before the anon_vma list length became "excessive". 5) make the threshold tunable via /proc/sys/vm/anon_vma_reclaim_limit. Default value of 64 is totally arbitrary, but should be high enough that most applications won't hit it. 6) In the fault paths that install new anonymous pages, check whether the page is reclaimable or not [#4 above]. If it is, just add it to the active lru list [via the pagevec cache], else add it to the noreclaim list. Notes: 1) a separate patch makes the anon_vma lock a reader/writer lock. This allows some parallelism--different cpus can work on different pages that reference the same anon_vma--but this does not address the problem of long lists and potentially many pte's to unmap. 2) I moved the call to page_add_new_anon_rmap() to before the test for page_reclaimable() and thus before the calls to lru_cache_add_{active|noreclaim}(), so that page_reclaimable() could recognize the page as anon, thus obviating, I think, the vma arg to page_reclaimable(). TBD I think this reordering is OK, but the previous order may have existed to close some obscure race? 
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> include/linux/rmap.h | 57 ++++++++++++++++++++++++++++++++++++++++++++++++- include/linux/swap.h | 3 ++ include/linux/sysctl.h | 1 kernel/sysctl.c | 12 ++++++++++ mm/Kconfig | 11 +++++++++ mm/memory.c | 20 +++++++++++++---- mm/rmap.c | 9 ++++++- mm/vmscan.c | 22 +++++++++++++++++- 8 files changed, 127 insertions(+), 8 deletions(-) Index: Linux/mm/Kconfig =================================================================== --- Linux.orig/mm/Kconfig 2007-03-28 16:33:18.000000000 -0400 +++ Linux/mm/Kconfig 2007-03-28 16:34:00.000000000 -0400 @@ -171,3 +171,14 @@ config NORECLAIM may be non-reclaimable because: they are locked into memory, they are anonymous pages for which no swap space exists, or they are anon pages that are expensive to unmap [long anon_vma "related vma" list.] + +config NORECLAIM_ANON_VMA + bool "Exclude pages with excessively long anon_vma lists" + depends on NORECLAIM + help + Treats anonymous pages with excessively long anon_vma lists as + non-reclaimable. Long anon_vma lists results from fork()ing + many [hundreds, thousands] of children from a single parent. The + anonymous pages in such tasks are very expensive [sometimes almost + impossible] to reclaim. Treating them as non-reclaimable avoids + the overhead of attempting to reclaim them. Index: Linux/include/linux/rmap.h =================================================================== --- Linux.orig/include/linux/rmap.h 2007-03-28 16:33:18.000000000 -0400 +++ Linux/include/linux/rmap.h 2007-03-28 16:33:29.000000000 -0400 @@ -10,6 +10,18 @@ #include <linux/spinlock.h> /* + * Optionally, limit the growth of the anon_vma list of "related" vmas + * to ANON_VMA_LIST_LIMIT. Add a count member + * to the anon_vma structure where we'd have padding on a 64-bit + * system w/o lock debugging. + */ +#ifdef CONFIG_NORECLAIM_ANON_VMA +#define DEFAULT_ANON_VMA_RECLAIM_LIMIT 64 +#else +#define DEFAULT_ANON_VMA_RECLAIM_LIMIT 0 +#endif + +/* * The anon_vma heads a list of private "related" vmas, to scan if * an anonymous page pointing to this anon_vma needs to be unmapped: * the vmas on the list will be related by forking, or by splitting. @@ -25,6 +37,9 @@ */ struct anon_vma { rwlock_t rwlock; /* Serialize access to vma list */ +#if CONFIG_NORECLAIM_ANON_VMA + int count; /* number of "related" vmas */ +#endif struct list_head head; /* List of private "related" vmas */ }; @@ -34,11 +49,18 @@ extern struct kmem_cache *anon_vma_cache static inline struct anon_vma *anon_vma_alloc(void) { - return kmem_cache_alloc(anon_vma_cachep, GFP_KERNEL); + struct anon_vma *anon_vma; + + anon_vma = kmem_cache_alloc(anon_vma_cachep, GFP_KERNEL); + if (DEFAULT_ANON_VMA_RECLAIM_LIMIT && anon_vma) + anon_vma->count = 0; + return anon_vma; } static inline void anon_vma_free(struct anon_vma *anon_vma) { + if (DEFAULT_ANON_VMA_RECLAIM_LIMIT) + VM_BUG_ON(anon_vma->count); kmem_cache_free(anon_vma_cachep, anon_vma); } @@ -59,6 +81,39 @@ static inline void anon_vma_unlock(struc write_unlock(&anon_vma->rwlock); } +#if CONFIG_NORECLAIM_ANON_VMA + +/* + * Track number of "related" vmas on anon_vma list. + * Only called with anon_vma lock held. + * Note: we track related vmas on fork() and splits, but + * only enforce the limit on fork(). 
+ */ +static inline void add_related_vma(struct anon_vma *anon_vma) +{ + ++anon_vma->count; +} + +static inline void remove_related_vma(struct anon_vma *anon_vma) +{ + --anon_vma->count; + VM_BUG_ON(anon_vma->count < 0); +} + +static inline struct anon_vma *page_anon_vma(struct page *page) +{ + VM_BUG_ON(!PageAnon(page)); + return (struct anon_vma *)((unsigned long)page->mapping & + ~PAGE_MAPPING_ANON); +} + +#else + +#define add_related_vma(A) +#define remove_related_vma(A) + +#endif + /* * anon_vma helper functions. */ Index: Linux/mm/rmap.c =================================================================== --- Linux.orig/mm/rmap.c 2007-03-28 16:33:18.000000000 -0400 +++ Linux/mm/rmap.c 2007-03-28 16:33:29.000000000 -0400 @@ -99,6 +99,7 @@ int anon_vma_prepare(struct vm_area_stru if (likely(!vma->anon_vma)) { vma->anon_vma = anon_vma; list_add_tail(&vma->anon_vma_node, &anon_vma->head); + add_related_vma(anon_vma); allocated = NULL; } spin_unlock(&mm->page_table_lock); @@ -113,8 +114,11 @@ int anon_vma_prepare(struct vm_area_stru void __anon_vma_merge(struct vm_area_struct *vma, struct vm_area_struct *next) { - BUG_ON(vma->anon_vma != next->anon_vma); + struct anon_vma *anon_vma = vma->anon_vma; + + BUG_ON(anon_vma != next->anon_vma); list_del(&next->anon_vma_node); + remove_related_vma(anon_vma); } void __anon_vma_link(struct vm_area_struct *vma) @@ -123,6 +127,7 @@ void __anon_vma_link(struct vm_area_stru if (anon_vma) { list_add_tail(&vma->anon_vma_node, &anon_vma->head); + add_related_vma(anon_vma); validate_anon_vma(vma); } } @@ -134,6 +139,7 @@ void anon_vma_link(struct vm_area_struct if (anon_vma) { write_lock(&anon_vma->rwlock); list_add_tail(&vma->anon_vma_node, &anon_vma->head); + add_related_vma(anon_vma); validate_anon_vma(vma); write_unlock(&anon_vma->rwlock); } @@ -150,6 +156,7 @@ void anon_vma_unlink(struct vm_area_stru write_lock(&anon_vma->rwlock); validate_anon_vma(vma); list_del(&vma->anon_vma_node); + remove_related_vma(anon_vma); /* We must garbage collect the anon_vma if it's empty */ empty = list_empty(&anon_vma->head); Index: Linux/include/linux/swap.h =================================================================== --- Linux.orig/include/linux/swap.h 2007-03-28 16:33:18.000000000 -0400 +++ Linux/include/linux/swap.h 2007-03-28 16:33:29.000000000 -0400 @@ -214,6 +214,9 @@ static inline int zone_reclaim(struct zo #ifdef CONFIG_NORECLAIM extern int page_reclaimable(struct page *page, struct vm_area_struct *vma); +#ifdef CONFIG_NORECLAIM_ANON_VMA +extern int anon_vma_reclaim_limit; +#endif #else #define page_reclaimable(P, V) 1 #endif Index: Linux/mm/vmscan.c =================================================================== --- Linux.orig/mm/vmscan.c 2007-03-28 16:33:18.000000000 -0400 +++ Linux/mm/vmscan.c 2007-03-28 16:34:00.000000000 -0400 @@ -1812,6 +1812,10 @@ int zone_reclaim(struct zone *zone, gfp_ #endif #ifdef CONFIG_NORECLAIM + +#ifdef CONFIG_NORECLAIM_ANON_VMA +int anon_vma_reclaim_limit = DEFAULT_ANON_VMA_RECLAIM_LIMIT; +#endif /* * page_reclaimable(struct page *page, struct vm_area_struct *vma) * Test whether page is reclaimable--i.e., should be placed on active/inactive @@ -1822,7 +1826,8 @@ int zone_reclaim(struct zone *zone, gfp_ * If !NULL, called from fault path. * * Reasons page might not be reclaimable: - * TODO - later patches + * 1) anon_vma [if any] has too many related vmas + * [more TBD. e.g., anon page and no swap available, page mlocked, ...] 
* * TODO: specify locking assumptions */ @@ -1832,7 +1837,20 @@ int page_reclaimable(struct page *page, VM_BUG_ON(PageNoreclaim(page)); - /* TODO: test page [!]reclaimable conditions */ +#ifdef CONFIG_NORECLAIM_ANON_VMA + if (PageAnon(page)) { + struct anon_vma *anon_vma; + + /* + * anon page with too many related vmas? + */ + anon_vma = page_anon_vma(page); + VM_BUG_ON(!anon_vma); + if (anon_vma_reclaim_limit && + anon_vma->count > anon_vma_reclaim_limit) + reclaimable = 0; + } +#endif return reclaimable; } Index: Linux/include/linux/sysctl.h =================================================================== --- Linux.orig/include/linux/sysctl.h 2007-03-28 16:33:18.000000000 -0400 +++ Linux/include/linux/sysctl.h 2007-03-28 16:33:29.000000000 -0400 @@ -207,6 +207,7 @@ enum VM_PANIC_ON_OOM=33, /* panic at out-of-memory */ VM_VDSO_ENABLED=34, /* map VDSO into new processes? */ VM_MIN_SLAB=35, /* Percent pages ignored by zone reclaim */ + VM_ANON_VMA_RECLAIM_LIMIT=36, /* max "related vmas" for reclaim */ /* s390 vm cmm sysctls */ VM_CMM_PAGES=1111, Index: Linux/kernel/sysctl.c =================================================================== --- Linux.orig/kernel/sysctl.c 2007-03-28 16:33:18.000000000 -0400 +++ Linux/kernel/sysctl.c 2007-03-28 16:33:29.000000000 -0400 @@ -859,6 +859,18 @@ static ctl_table vm_table[] = { .extra1 = &zero, }, #endif +#ifdef CONFIG_NORECLAIM_ANON_VMA + { + .ctl_name = VM_ANON_VMA_RECLAIM_LIMIT, + .procname = "anon_vma_reclaim_limit", + .data = &anon_vma_reclaim_limit, + .maxlen = sizeof(anon_vma_reclaim_limit), + .mode = 0644, + .proc_handler = &proc_dointvec, + .strategy = &sysctl_intvec, + .extra1 = &zero, + }, +#endif { .ctl_name = 0 } }; Index: Linux/mm/memory.c =================================================================== --- Linux.orig/mm/memory.c 2007-03-28 16:33:18.000000000 -0400 +++ Linux/mm/memory.c 2007-03-28 16:33:29.000000000 -0400 @@ -1650,8 +1650,11 @@ gotten: ptep_clear_flush(vma, address, page_table); set_pte_at(mm, address, page_table, entry); update_mmu_cache(vma, address, entry); - lru_cache_add_active(new_page); page_add_new_anon_rmap(new_page, vma, address); + if (page_reclaimable(new_page, vma)) + lru_cache_add_active(new_page); + else + lru_cache_add_noreclaim(new_page); /* Free the old page.. 
*/ new_page = old_page; @@ -2149,8 +2152,11 @@ int install_new_anon_page(struct vm_area inc_mm_counter(mm, anon_rss); set_pte_at(mm, address, pte, pte_mkdirty(pte_mkwrite(mk_pte( page, vma->vm_page_prot)))); - lru_cache_add_active(page); page_add_new_anon_rmap(page, vma, address); + if (page_reclaimable(page, vma)) + lru_cache_add_active(page); + else + lru_cache_add_noreclaim(page); pte_unmap_unlock(pte, ptl); /* no need for flush_tlb */ @@ -2187,8 +2193,11 @@ static int do_anonymous_page(struct mm_s if (!pte_none(*page_table)) goto release; inc_mm_counter(mm, anon_rss); - lru_cache_add_active(page); page_add_new_anon_rmap(page, vma, address); + if (page_reclaimable(page, vma)) + lru_cache_add_active(page); + else + lru_cache_add_noreclaim(page); } else { /* Map the ZERO_PAGE - vm_page_prot is readonly */ page = ZERO_PAGE(address); @@ -2334,8 +2343,11 @@ retry: set_pte_at(mm, address, page_table, entry); if (anon) { inc_mm_counter(mm, anon_rss); - lru_cache_add_active(new_page); page_add_new_anon_rmap(new_page, vma, address); + if (page_reclaimable(new_page, vma)) + lru_cache_add_active(new_page); + else + lru_cache_add_noreclaim(new_page); } else { inc_mm_counter(mm, file_rss); page_add_file_rmap(new_page); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
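To make the scenario above concrete: the pathological case is a single parent that maps anonymous memory and then forks a few thousand children, so that every child vma is linked onto the same anon_vma and the "related vma" list grows far past the default anon_vma_reclaim_limit of 64. A minimal userspace reproducer might look like the following (illustrative only, not part of the series; the child count and mapping size are arbitrary):

/* anon_vma_fork_test.c - hypothetical reproducer, not part of the series */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/mman.h>

#define NCHILDREN 4096
#define ANON_SIZE (16UL << 20)		/* 16MB of anonymous memory */

int main(void)
{
	char *buf;
	int i;

	buf = mmap(NULL, ANON_SIZE, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	memset(buf, 0xaa, ANON_SIZE);	/* fault in the anon pages */

	for (i = 0; i < NCHILDREN; i++) {
		pid_t pid = fork();
		if (pid == 0) {
			pause();	/* child keeps its vma on the anon_vma list */
			_exit(0);
		}
		if (pid < 0) {
			perror("fork");
			break;
		}
	}
	/*
	 * The parent's anon pages now point to an anon_vma whose "related
	 * vma" list has roughly NCHILDREN entries; with the patch applied
	 * and the default /proc/sys/vm/anon_vma_reclaim_limit of 64,
	 * reclaim treats them as non-reclaimable instead of spinning over
	 * the long list in page_referenced_anon()/try_to_unmap_anon().
	 */
	pause();
	return 0;
}

Each fork() links the child's vma onto the parent's anon_vma under the anon_vma lock, which is exactly the list that reclaim has to walk twice per page.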
* Re: [PATCH 01 of 16] remove nr_scan_inactive/active
2007-06-26 17:08 ` Rik van Riel
2007-06-26 17:55 ` Andrew Morton
@ 2007-06-26 20:37 ` Andrea Arcangeli
2007-06-26 20:57 ` Rik van Riel
1 sibling, 1 reply; 77+ messages in thread
From: Andrea Arcangeli @ 2007-06-26 20:37 UTC (permalink / raw)
To: Rik van Riel; +Cc: linux-mm, Andrew Morton
On Tue, Jun 26, 2007 at 01:08:57PM -0400, Rik van Riel wrote:
> Both the normal kernel and your kernel fall over once memory
> pressure gets big enough, but they explode differently and
> at different points.
Ok, at some point it's normal that they start thrashing. What is strange is
that it seems patch 01 requires the VM to do more work and in turn more
memory to be free. The only explanation I can think of is that the race has
the side effect of, on average, reducing the amount of VM activity for each
task instead of increasing it (this in turn reduces thrashing and the free
memory level required before the workload halts).
Even if it may have a positive effect in practice, I still think the current
racy behavior (randomly overestimating and randomly underestimating the
amount of work each task has to do, depending on who adds and who reads the
zone values first) isn't good.
Perhaps if you change DEF_PRIORITY you'll get closer to current mainline but
without any race. You can try to halve it and see what happens. If the
initial passes fail, it'll start swapping and performance will go down
quickly. So perhaps once we fix the race we have to decrease DEF_PRIORITY to
get the same VM tuning. It'd also be interesting to see what we get between
3000 and 4000.
Where exactly we reach the halting point (4300 vs 5105) isn't crucial;
otherwise one could win simply by decreasing min_free_kbytes as well, which
clearly shows that "when" we hang isn't the real point of interest. OTOH I
agree the difference between 4300 and 5105 seems way too big, but if this
were between 5000 and 5105 I wouldn't worry too much (5000 instead of 5105
would result in more memory being free at the oom point, which isn't a net
negative).
Hope the benchmark is repeatable. This week I've been working on another
project but I'll shortly try to install AIM and reproduce, and see what
happens by decreasing DEF_PRIORITY. Thanks for the testing!
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 77+ messages in thread
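For context on the knob being discussed: in this era of the VM, DEF_PRIORITY is 12 and each pass of the direct-reclaim loop scans roughly zone_size >> priority pages, so every failed pass doubles the scan window. A rough sketch of the shape of the loop (simplified, not the literal mainline code):

/*
 * Simplified sketch of how DEF_PRIORITY shapes direct reclaim: halving
 * DEF_PRIORITY makes even the first passes scan much larger chunks of
 * the LRU lists, which is the tuning effect discussed above.
 */
#define DEF_PRIORITY	12

/* stand-in for shrink_zone()+shrink_slab(): returns pages actually freed */
extern unsigned long reclaim_pass(unsigned long nr_to_scan);

static unsigned long pages_to_scan(unsigned long zone_lru_pages, int priority)
{
	/* priority 12 -> 1/4096th of the zone, priority 0 -> the whole zone */
	return (zone_lru_pages >> priority) + 1;
}

static int try_to_free_pages_sketch(unsigned long zone_lru_pages,
				    unsigned long swap_cluster_max)
{
	int priority;

	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
		unsigned long freed;

		freed = reclaim_pass(pages_to_scan(zone_lru_pages, priority));
		if (freed >= swap_cluster_max)
			return 1;	/* enough progress, back to the allocator */
	}
	return 0;	/* no progress at any priority: next stop is the OOM killer */
}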
* Re: [PATCH 01 of 16] remove nr_scan_inactive/active
2007-06-26 20:37 ` [PATCH 01 of 16] remove nr_scan_inactive/active Andrea Arcangeli
@ 2007-06-26 20:57 ` Rik van Riel
2007-06-26 22:21 ` Andrea Arcangeli
0 siblings, 1 reply; 77+ messages in thread
From: Rik van Riel @ 2007-06-26 20:57 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: linux-mm, Andrew Morton
Andrea Arcangeli wrote:
> On Tue, Jun 26, 2007 at 01:08:57PM -0400, Rik van Riel wrote:
>> Both the normal kernel and your kernel fall over once memory
>> pressure gets big enough, but they explode differently and
>> at different points.
>
> Ok, at some point it's normal that they start thrashing.
Yes, but I would hope that the system would be disk bound
at that time instead of CPU bound.
There was no swap IO going on yet, the system was just
wasting CPU time in the VM.
> Even if it may have a positive effect in practice, I still think the
> current racy behavior (randomly overestimating and randomly
> underestimating the amount of work each task has to do, depending on
> who adds and who reads the zone values first) isn't good.
Oh, I like your simplification of the code, too.
I was running the test to see if that patch could be
merged without any negative side effects, because I
would have liked to see it.
> Where exactly we reach the halting point (4300 vs 5105) isn't
> crucial;
However, neither of the two seems to be IO bound
at that point...
> Hope the benchmark is repeatable. This week I've been working on another
> project but I'll shortly try to install AIM and reproduce, and see what
> happens by decreasing DEF_PRIORITY. Thanks for the testing!
Not only is the AIM7 test perfectly repeatable, it also
causes the VM to show some of the same behaviour that
customers are seeing in the field with large JVM workloads.
--
All Rights Reversed
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH 01 of 16] remove nr_scan_inactive/active
2007-06-26 20:57 ` Rik van Riel
@ 2007-06-26 22:21 ` Andrea Arcangeli
0 siblings, 0 replies; 77+ messages in thread
From: Andrea Arcangeli @ 2007-06-26 22:21 UTC (permalink / raw)
To: Rik van Riel; +Cc: linux-mm, Andrew Morton
On Tue, Jun 26, 2007 at 04:57:20PM -0400, Rik van Riel wrote:
> Yes, but I would hope that the system would be disk bound
> at that time instead of CPU bound.
>
> There was no swap IO going on yet, the system was just
> wasting CPU time in the VM.
That seems a separate problem: 01 starts wasting CPU sooner, and that's the
regression you discovered, but mainline wastes CPU the same way later on. We
should do some profiling like Andrew suggested to see what's going on when it
starts thrashing the CPU (perhaps it's some SMP lock? you said you have only
4 cores, so it must be a highly contended one if it's really a lock).
> Oh, I like your simplification of the code, too.
>
> I was running the test to see if that patch could be merged without any
> negative side effects, because I would have liked to see it.
I see. Good that you tested this with this workload so we noticed this
regression. At the moment I hope it's only a tuning knob in DEF_PRIORITY (or
similar); it'd be really sad if this had a magic racy behavior that couldn't
be reproduced with a static non-racy algorithm.
If nothing else, if we want to stick with this explicit SMP race in the VM
core, somebody should at least attempt to document how to predict what the
race will do at runtime, because to me it seems quite an unpredictable beast.
On average it will probably reach a stable state, but this stable state will
depend on the speed of the CPU caches, on the number of CPUs, on the
architecture and on the assembly generated by gcc, and then the race will
trigger more or less, or in a different way...
> However, neither of the two seems to be IO bound
> at that point...
Yes. For now I'd be happy to see the same results for both, to eliminate the
regression.
> Not only is the AIM7 test perfectly repeatable, it also
> causes the VM to show some of the same behaviour that
> customers are seeing in the field with large JVM workloads.
Sounds good, thanks!
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 77+ messages in thread
* [PATCH 02 of 16] avoid oom deadlock in nfs_create_request 2007-06-08 20:02 [PATCH 00 of 16] OOM related fixes Andrea Arcangeli 2007-06-08 20:02 ` [PATCH 01 of 16] remove nr_scan_inactive/active Andrea Arcangeli @ 2007-06-08 20:03 ` Andrea Arcangeli 2007-06-10 17:38 ` Rik van Riel 2007-06-08 20:03 ` [PATCH 03 of 16] prevent oom deadlocks during read/write operations Andrea Arcangeli ` (14 subsequent siblings) 16 siblings, 1 reply; 77+ messages in thread From: Andrea Arcangeli @ 2007-06-08 20:03 UTC (permalink / raw) To: linux-mm # HG changeset patch # User Andrea Arcangeli <andrea@suse.de> # Date 1181332960 -7200 # Node ID d64cb81222748354bf5b16258197217465f35aeb # Parent 8e38f7656968417dfee09fbb6450a8f1e70f8b21 avoid oom deadlock in nfs_create_request When sigkill is pending after the oom killer set TIF_MEMDIE, the task must go away or the VM will malfunction. Signed-off-by: Andrea Arcangeli <andrea@suse.de> diff --git a/fs/nfs/pagelist.c b/fs/nfs/pagelist.c --- a/fs/nfs/pagelist.c +++ b/fs/nfs/pagelist.c @@ -61,16 +61,20 @@ nfs_create_request(struct nfs_open_conte struct nfs_server *server = NFS_SERVER(inode); struct nfs_page *req; - for (;;) { - /* try to allocate the request struct */ - req = nfs_page_alloc(); - if (req != NULL) - break; - - if (signalled() && (server->flags & NFS_MOUNT_INTR)) - return ERR_PTR(-ERESTARTSYS); - yield(); - } + /* try to allocate the request struct */ + req = nfs_page_alloc(); + if (unlikely(!req)) { + /* + * -ENOMEM will be returned only when TIF_MEMDIE is set + * so userland shouldn't risk to get confused by a new + * unhandled ENOMEM errno. + */ + WARN_ON(!test_thread_flag(TIF_MEMDIE)); + return ERR_PTR(-ENOMEM); + } + + if (signalled() && (server->flags & NFS_MOUNT_INTR)) + return ERR_PTR(-ERESTARTSYS); /* Initialize the request struct. Initially, we assume a * long write-back delay. This will be adjusted in -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH 02 of 16] avoid oom deadlock in nfs_create_request 2007-06-08 20:03 ` [PATCH 02 of 16] avoid oom deadlock in nfs_create_request Andrea Arcangeli @ 2007-06-10 17:38 ` Rik van Riel 2007-06-10 18:27 ` Andrea Arcangeli 0 siblings, 1 reply; 77+ messages in thread From: Rik van Riel @ 2007-06-10 17:38 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: linux-mm Andrea Arcangeli wrote: > When sigkill is pending after the oom killer set TIF_MEMDIE, the task > must go away or the VM will malfunction. However, if the sigkill is pending against ANOTHER task, this patch looks like it could introduce an IO error where the system would recover fine before. Tasks that do not have a pending SIGKILL should retry the allocation, shouldn't they? > diff --git a/fs/nfs/pagelist.c b/fs/nfs/pagelist.c > --- a/fs/nfs/pagelist.c > +++ b/fs/nfs/pagelist.c > @@ -61,16 +61,20 @@ nfs_create_request(struct nfs_open_conte > struct nfs_server *server = NFS_SERVER(inode); > struct nfs_page *req; > > - for (;;) { > - /* try to allocate the request struct */ > - req = nfs_page_alloc(); > - if (req != NULL) > - break; > - > - if (signalled() && (server->flags & NFS_MOUNT_INTR)) > - return ERR_PTR(-ERESTARTSYS); > - yield(); > - } > + /* try to allocate the request struct */ > + req = nfs_page_alloc(); > + if (unlikely(!req)) { > + /* > + * -ENOMEM will be returned only when TIF_MEMDIE is set > + * so userland shouldn't risk to get confused by a new > + * unhandled ENOMEM errno. > + */ > + WARN_ON(!test_thread_flag(TIF_MEMDIE)); > + return ERR_PTR(-ENOMEM); > + } > + > + if (signalled() && (server->flags & NFS_MOUNT_INTR)) > + return ERR_PTR(-ERESTARTSYS); > > /* Initialize the request struct. Initially, we assume a > * long write-back delay. This will be adjusted in > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> -- Politics is the struggle between those who want to make their country the best in the world, and those who believe it already is. Each group calls the other unpatriotic. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH 02 of 16] avoid oom deadlock in nfs_create_request
2007-06-10 17:38 ` Rik van Riel
@ 2007-06-10 18:27 ` Andrea Arcangeli
0 siblings, 0 replies; 77+ messages in thread
From: Andrea Arcangeli @ 2007-06-10 18:27 UTC (permalink / raw)
To: Rik van Riel; +Cc: linux-mm
On Sun, Jun 10, 2007 at 01:38:49PM -0400, Rik van Riel wrote:
> Andrea Arcangeli wrote:
>
> >When sigkill is pending after the oom killer set TIF_MEMDIE, the task
> >must go away or the VM will malfunction.
>
> However, if the sigkill is pending against ANOTHER task,
> this patch looks like it could introduce an IO error
> where the system would recover fine before.
The error being returned would be -ENOMEM. But even that should not be
returned, because do_exit will run before userland runs again. When I
mentioned this to Neil he didn't seem to object that do_exit will be called
first, so I hope we didn't get it wrong.
The only risk would be if we set TIF_MEMDIE but kill the task with SIGTERM;
then the I/O error could reach userland if the user caught the SIGTERM signal
in userland. I didn't add the warn-on for sigkill because, even if we decide
to send SIGTERM first, in theory it wouldn't be a kernel issue if we
correctly return -ENOMEM to userland when that is the task that must exit (we
don't support a graceful exit path today, perhaps we never will). But clearly
we don't know if all userland code is capable of coping with -ENOMEM, so for
now we don't have to worry, thanks to the SIGKILL.
> Tasks that do not have a pending SIGKILL should retry
> the allocation, shouldn't they?
All tasks not having TIF_MEMDIE set (and currently a SIGKILL pending as well)
should retry, yes.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 77+ messages in thread
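To spell out the semantics being agreed on here (a sketch only, not the literal posted patch, which instead assumes nfs_page_alloc() only fails once TIF_MEMDIE grants access to the reserves): only the task the OOM killer has already selected gives up with -ENOMEM, while every other task keeps waiting for memory as before, so the error can never surface in a task that is expected to survive.

/* sketch of the intended caller-visible behavior, not the posted diff */
static struct nfs_page *nfs_alloc_request_sketch(void)
{
	struct nfs_page *req;

	for (;;) {
		req = nfs_page_alloc();
		if (req)
			return req;
		if (test_thread_flag(TIF_MEMDIE))
			/*
			 * We are the OOM victim: do_exit() runs before we
			 * ever return to userland, so the -ENOMEM only
			 * makes the kernel code unwind and exit promptly.
			 */
			return ERR_PTR(-ENOMEM);
		yield();	/* everyone else retries as before */
	}
}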
* [PATCH 03 of 16] prevent oom deadlocks during read/write operations 2007-06-08 20:02 [PATCH 00 of 16] OOM related fixes Andrea Arcangeli 2007-06-08 20:02 ` [PATCH 01 of 16] remove nr_scan_inactive/active Andrea Arcangeli 2007-06-08 20:03 ` [PATCH 02 of 16] avoid oom deadlock in nfs_create_request Andrea Arcangeli @ 2007-06-08 20:03 ` Andrea Arcangeli 2007-06-08 20:03 ` [PATCH 04 of 16] serialize oom killer Andrea Arcangeli ` (13 subsequent siblings) 16 siblings, 0 replies; 77+ messages in thread From: Andrea Arcangeli @ 2007-06-08 20:03 UTC (permalink / raw) To: linux-mm # HG changeset patch # User Andrea Arcangeli <andrea@suse.de> # Date 1181332960 -7200 # Node ID 532a5f712848ee75d827bfe233b9364a709e1fc1 # Parent d64cb81222748354bf5b16258197217465f35aeb prevent oom deadlocks during read/write operations We need to react to SIGKILL during read/write with huge buffers or it becomes too easy to prevent a SIGKILLED task to run do_exit promptly after it has been selected for oom-killage. Signed-off-by: Andrea Arcangeli <andrea@suse.de> diff --git a/mm/filemap.c b/mm/filemap.c --- a/mm/filemap.c +++ b/mm/filemap.c @@ -894,6 +894,13 @@ void do_generic_mapping_read(struct addr struct page *page; unsigned long nr, ret; + if (unlikely(sigismember(¤t->pending.signal, SIGKILL))) + /* + * Must not hang almost forever in D state in presence of sigkill + * and lots of ram/swap (think during OOM). + */ + break; + /* nr is the maximum number of bytes to copy from this page */ nr = PAGE_CACHE_SIZE; if (index >= end_index) { @@ -2105,6 +2112,13 @@ generic_file_buffered_write(struct kiocb unsigned long index; unsigned long offset; size_t copied; + + if (unlikely(sigismember(¤t->pending.signal, SIGKILL))) + /* + * Must not hang almost forever in D state in presence of sigkill + * and lots of ram/swap (think during OOM). + */ + break; offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */ index = pos >> PAGE_CACHE_SHIFT; -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
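The workload this protects against is the one from the cover letter: many tasks reading large files with very big read buffers, so a task can sit inside a single read() for a long time after it has been chosen as the OOM victim. A minimal illustration of such a reader (hypothetical test code, not part of the series; the file name and buffer size are arbitrary):

/* big_read.c - hypothetical reader that spends a long time in one read() */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

#define BUF_SIZE (256UL << 20)		/* one 256MB read buffer */

int main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : "/mnt/nfs/bigfile";
	char *buf = malloc(BUF_SIZE);
	ssize_t n;
	int fd;

	if (!buf)
		return 1;
	fd = open(path, O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	/*
	 * Without the SIGKILL check in do_generic_mapping_read(), a task
	 * killed in the middle of this loop keeps copying page after page
	 * before it can reach do_exit(); with the patch it breaks out of
	 * the copy loop as soon as the signal is pending.
	 */
	while ((n = read(fd, buf, BUF_SIZE)) > 0)
		;
	close(fd);
	free(buf);
	return 0;
}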
* [PATCH 04 of 16] serialize oom killer 2007-06-08 20:02 [PATCH 00 of 16] OOM related fixes Andrea Arcangeli ` (2 preceding siblings ...) 2007-06-08 20:03 ` [PATCH 03 of 16] prevent oom deadlocks during read/write operations Andrea Arcangeli @ 2007-06-08 20:03 ` Andrea Arcangeli 2007-06-09 6:43 ` Peter Zijlstra 2007-06-08 20:03 ` [PATCH 05 of 16] avoid selecting already killed tasks Andrea Arcangeli ` (12 subsequent siblings) 16 siblings, 1 reply; 77+ messages in thread From: Andrea Arcangeli @ 2007-06-08 20:03 UTC (permalink / raw) To: linux-mm # HG changeset patch # User Andrea Arcangeli <andrea@suse.de> # Date 1181332960 -7200 # Node ID baa866fedc79cb333b90004da2730715c145f1d5 # Parent 532a5f712848ee75d827bfe233b9364a709e1fc1 serialize oom killer It's risky and useless to run two oom killers in parallel, let serialize it to reduce the probability of spurious oom-killage. Signed-off-by: Andrea Arcangeli <andrea@suse.de> diff --git a/mm/oom_kill.c b/mm/oom_kill.c --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -400,12 +400,15 @@ void out_of_memory(struct zonelist *zone unsigned long points = 0; unsigned long freed = 0; int constraint; + static DECLARE_MUTEX(OOM_lock); blocking_notifier_call_chain(&oom_notify_list, 0, &freed); if (freed > 0) /* Got some memory back in the last second. */ return; + if (down_trylock(&OOM_lock)) + return; if (printk_ratelimit()) { printk(KERN_WARNING "%s invoked oom-killer: " "gfp_mask=0x%x, order=%d, oomkilladj=%d\n", @@ -472,4 +475,6 @@ out: */ if (!test_thread_flag(TIF_MEMDIE)) schedule_timeout_uninterruptible(1); -} + + up(&OOM_lock); +} -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH 04 of 16] serialize oom killer 2007-06-08 20:03 ` [PATCH 04 of 16] serialize oom killer Andrea Arcangeli @ 2007-06-09 6:43 ` Peter Zijlstra 2007-06-09 15:27 ` Andrea Arcangeli 0 siblings, 1 reply; 77+ messages in thread From: Peter Zijlstra @ 2007-06-09 6:43 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: linux-mm On Fri, 2007-06-08 at 22:03 +0200, Andrea Arcangeli wrote: > # HG changeset patch > # User Andrea Arcangeli <andrea@suse.de> > # Date 1181332960 -7200 > # Node ID baa866fedc79cb333b90004da2730715c145f1d5 > # Parent 532a5f712848ee75d827bfe233b9364a709e1fc1 > serialize oom killer > > It's risky and useless to run two oom killers in parallel, let serialize it to > reduce the probability of spurious oom-killage. > > Signed-off-by: Andrea Arcangeli <andrea@suse.de> > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -400,12 +400,15 @@ void out_of_memory(struct zonelist *zone > unsigned long points = 0; > unsigned long freed = 0; > int constraint; > + static DECLARE_MUTEX(OOM_lock); I thought we depricated that construct in favour of DEFINE_MUTEX. Also, putting it in a function like so is a little icky IMHO. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH 04 of 16] serialize oom killer
2007-06-09 6:43 ` Peter Zijlstra
@ 2007-06-09 15:27 ` Andrea Arcangeli
0 siblings, 0 replies; 77+ messages in thread
From: Andrea Arcangeli @ 2007-06-09 15:27 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: linux-mm
On Sat, Jun 09, 2007 at 08:43:47AM +0200, Peter Zijlstra wrote:
> On Fri, 2007-06-08 at 22:03 +0200, Andrea Arcangeli wrote:
> > + static DECLARE_MUTEX(OOM_lock);
>
> I thought we depricated that construct in favour of DEFINE_MUTEX. Also,
Ok, so it should be changed to DEFINE_MUTEX. I have to trust you on this,
because there's no warning in asm-i386/semaphore.h that DECLARE_MUTEX has
been deprecated, and tons of code is still using it in the current kernel. I
couldn't imagine that somebody duplicated it somewhere else for whatever
reason without removing DECLARE_MUTEX. It's not like we have to keep
deprecated and redundant interfaces in the kernel for no good reason,
especially if `sed` can fix it without human intervention. Let's say renaming
it is a low priority; if I have to generate a new diff, I'd probably prefer
to generate one that drops DECLARE_MUTEX in all the other places too.
> putting it in a function like so is a little icky IMHO.
On this I disagree: the whole point of static/private variables is to
decrease visibility where it's unnecessary. A function-local static variable
is even less visible, so it's a good thing and it helps make the code
self-documenting. So I very much like to keep it there; strict scoping
improves readability (you immediately know that no other code could ever try
to acquire that lock).
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 77+ messages in thread
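For reference, the construct Peter is pointing at would look like this with the mutex API (a sketch only, assuming nothing else about the surrounding oom_kill.c code); note that mutex_trylock(), unlike down_trylock(), returns nonzero on success:

#include <linux/mutex.h>

static DEFINE_MUTEX(oom_lock);

void out_of_memory_sketch(void)
{
	/*
	 * mutex_trylock() returns 1 if the lock was taken and 0 if it was
	 * contended, the opposite convention of the down_trylock() used in
	 * the patch.
	 */
	if (!mutex_trylock(&oom_lock))
		return;		/* another CPU is already oom killing */

	/* ... select_bad_process()/oom_kill_process() would run here ... */

	mutex_unlock(&oom_lock);
}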
* [PATCH 05 of 16] avoid selecting already killed tasks 2007-06-08 20:02 [PATCH 00 of 16] OOM related fixes Andrea Arcangeli ` (3 preceding siblings ...) 2007-06-08 20:03 ` [PATCH 04 of 16] serialize oom killer Andrea Arcangeli @ 2007-06-08 20:03 ` Andrea Arcangeli 2007-06-08 20:03 ` [PATCH 06 of 16] reduce the probability of an OOM livelock Andrea Arcangeli ` (11 subsequent siblings) 16 siblings, 0 replies; 77+ messages in thread From: Andrea Arcangeli @ 2007-06-08 20:03 UTC (permalink / raw) To: linux-mm # HG changeset patch # User Andrea Arcangeli <andrea@suse.de> # Date 1181332960 -7200 # Node ID 2ebc46595ead0f1790c6ec1d0302dd60ffbb1978 # Parent baa866fedc79cb333b90004da2730715c145f1d5 avoid selecting already killed tasks If the killed task doesn't go away because it's waiting on some other task who needs to allocate memory, to release the i_sem or some other lock, we must fallback to killing some other task in order to kill the original selected and already oomkilled task, but the logic that kills the childs first, would deadlock, if the already oom-killed task was actually the first child of the newly oom-killed task. Signed-off-by: Andrea Arcangeli <andrea@suse.de> diff --git a/mm/oom_kill.c b/mm/oom_kill.c --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -366,6 +366,12 @@ static int oom_kill_process(struct task_ c = list_entry(tsk, struct task_struct, sibling); if (c->mm == p->mm) continue; + /* + * We cannot select tasks with TIF_MEMDIE already set + * or we'll hard deadlock. + */ + if (unlikely(test_tsk_thread_flag(c, TIF_MEMDIE))) + continue; if (!oom_kill_task(c)) return 0; } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* [PATCH 06 of 16] reduce the probability of an OOM livelock 2007-06-08 20:02 [PATCH 00 of 16] OOM related fixes Andrea Arcangeli ` (4 preceding siblings ...) 2007-06-08 20:03 ` [PATCH 05 of 16] avoid selecting already killed tasks Andrea Arcangeli @ 2007-06-08 20:03 ` Andrea Arcangeli 2007-06-08 20:03 ` [PATCH 07 of 16] balance_pgdat doesn't return the number of pages freed Andrea Arcangeli ` (10 subsequent siblings) 16 siblings, 0 replies; 77+ messages in thread From: Andrea Arcangeli @ 2007-06-08 20:03 UTC (permalink / raw) To: linux-mm # HG changeset patch # User Andrea Arcangeli <andrea@suse.de> # Date 1181332961 -7200 # Node ID fe82f6d082c859c641664990c6e14de8d16dcb5d # Parent 2ebc46595ead0f1790c6ec1d0302dd60ffbb1978 reduce the probability of an OOM livelock There's no need to loop way too many times over the lrus in order to declare defeat and decide to kill a task. The more loops we do the more likely there we'll run in a livelock with a page bouncing back and forth between tasks. The maximum number of entries to check in a loop that returns less than swap-cluster-max pages freed, should be the size of the list (or at most twice the size of the list if you want to be really paranoid about the PG_referenced bit). Our objective there is to know reliably when it's time that we kill a task, tring to free a few more pages at that already ciritical point is worthless. This seems to have the effect of reducing the "hang" time during oom killing. Signed-off-by: Andrea Arcangeli <andrea@suse.de> diff --git a/mm/vmscan.c b/mm/vmscan.c --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1007,7 +1007,7 @@ unsigned long try_to_free_pages(struct z int priority; int ret = 0; unsigned long total_scanned = 0; - unsigned long nr_reclaimed = 0; + unsigned long nr_reclaimed; struct reclaim_state *reclaim_state = current->reclaim_state; unsigned long lru_pages = 0; int i; @@ -1035,12 +1035,12 @@ unsigned long try_to_free_pages(struct z sc.nr_scanned = 0; if (!priority) disable_swap_token(); - nr_reclaimed += shrink_zones(priority, zones, &sc); + nr_reclaimed = shrink_zones(priority, zones, &sc); + if (reclaim_state) + reclaim_state->reclaimed_slab = 0; shrink_slab(sc.nr_scanned, gfp_mask, lru_pages); - if (reclaim_state) { + if (reclaim_state) nr_reclaimed += reclaim_state->reclaimed_slab; - reclaim_state->reclaimed_slab = 0; - } total_scanned += sc.nr_scanned; if (nr_reclaimed >= sc.swap_cluster_max) { ret = 1; @@ -1131,7 +1131,6 @@ static unsigned long balance_pgdat(pg_da loop_again: total_scanned = 0; - nr_reclaimed = 0; sc.may_writepage = !laptop_mode; count_vm_event(PAGEOUTRUN); @@ -1186,6 +1185,7 @@ loop_again: * pages behind kswapd's direction of progress, which would * cause too much scanning of the lower zones. */ + nr_reclaimed = 0; for (i = 0; i <= end_zone; i++) { struct zone *zone = pgdat->node_zones + i; int nr_slab; -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* [PATCH 07 of 16] balance_pgdat doesn't return the number of pages freed 2007-06-08 20:02 [PATCH 00 of 16] OOM related fixes Andrea Arcangeli ` (5 preceding siblings ...) 2007-06-08 20:03 ` [PATCH 06 of 16] reduce the probability of an OOM livelock Andrea Arcangeli @ 2007-06-08 20:03 ` Andrea Arcangeli 2007-06-08 20:03 ` [PATCH 08 of 16] don't depend on PF_EXITING tasks to go away Andrea Arcangeli ` (9 subsequent siblings) 16 siblings, 0 replies; 77+ messages in thread From: Andrea Arcangeli @ 2007-06-08 20:03 UTC (permalink / raw) To: linux-mm # HG changeset patch # User Andrea Arcangeli <andrea@suse.de> # Date 1181332961 -7200 # Node ID aafcc5c9057f11d88c43b823c241f14a5ebdd638 # Parent fe82f6d082c859c641664990c6e14de8d16dcb5d balance_pgdat doesn't return the number of pages freed nr_reclaimed would be the number of pages freed in the last pass. Signed-off-by: Andrea Arcangeli <andrea@suse.de> diff --git a/mm/vmscan.c b/mm/vmscan.c --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1092,8 +1092,6 @@ out: * For kswapd, balance_pgdat() will work across all this node's zones until * they are all at pages_high. * - * Returns the number of pages which were actually freed. - * * There is special handling here for zones which are full of pinned pages. * This can happen if the pages are all mlocked, or if they are all used by * device drivers (say, ZONE_DMA). Or if they are all in use by hugetlb. @@ -1109,7 +1107,7 @@ out: * the page allocator fallback scheme to ensure that aging of pages is balanced * across the zones. */ -static unsigned long balance_pgdat(pg_data_t *pgdat, int order) +static void balance_pgdat(pg_data_t *pgdat, int order) { int all_zones_ok; int priority; @@ -1259,8 +1257,6 @@ out: goto loop_again; } - - return nr_reclaimed; } /* -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* [PATCH 08 of 16] don't depend on PF_EXITING tasks to go away 2007-06-08 20:02 [PATCH 00 of 16] OOM related fixes Andrea Arcangeli ` (6 preceding siblings ...) 2007-06-08 20:03 ` [PATCH 07 of 16] balance_pgdat doesn't return the number of pages freed Andrea Arcangeli @ 2007-06-08 20:03 ` Andrea Arcangeli 2007-06-08 20:03 ` [PATCH 09 of 16] fallback killing more tasks if tif-memdie doesn't " Andrea Arcangeli ` (8 subsequent siblings) 16 siblings, 0 replies; 77+ messages in thread From: Andrea Arcangeli @ 2007-06-08 20:03 UTC (permalink / raw) To: linux-mm # HG changeset patch # User Andrea Arcangeli <andrea@suse.de> # Date 1181332961 -7200 # Node ID 60059913ab07906fceda14ffa72f2c77ef282fca # Parent aafcc5c9057f11d88c43b823c241f14a5ebdd638 don't depend on PF_EXITING tasks to go away A PF_EXITING task don't have TIF_MEMDIE set so it might get stuck in memory allocations without access to the PF_MEMALLOC pool (said that ideally do_exit would better not require memory allocations, especially not before calling exit_mm). The same way we raise its privilege to TIF_MEMDIE if it's the current task, we should do it even if it's not the current task to speedup oom killing. Signed-off-by: Andrea Arcangeli <andrea@suse.de> diff --git a/mm/oom_kill.c b/mm/oom_kill.c --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -233,27 +233,13 @@ static struct task_struct *select_bad_pr * Note: this may have a chance of deadlock if it gets * blocked waiting for another task which itself is waiting * for memory. Is there a better alternative? + * + * Better not to skip PF_EXITING tasks, since they + * don't have access to the PF_MEMALLOC pool until + * we select them here first. */ if (test_tsk_thread_flag(p, TIF_MEMDIE)) return ERR_PTR(-1UL); - - /* - * This is in the process of releasing memory so wait for it - * to finish before killing some other task by mistake. - * - * However, if p is the current task, we allow the 'kill' to - * go ahead if it is exiting: this will simply set TIF_MEMDIE, - * which will allow it to gain access to memory reserves in - * the process of exiting and releasing its resources. - * Otherwise we could get an easy OOM deadlock. - */ - if (p->flags & PF_EXITING) { - if (p != current) - return ERR_PTR(-1UL); - - chosen = p; - *ppoints = ULONG_MAX; - } if (p->oomkilladj == OOM_DISABLE) continue; -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* [PATCH 09 of 16] fallback killing more tasks if tif-memdie doesn't go away 2007-06-08 20:02 [PATCH 00 of 16] OOM related fixes Andrea Arcangeli ` (7 preceding siblings ...) 2007-06-08 20:03 ` [PATCH 08 of 16] don't depend on PF_EXITING tasks to go away Andrea Arcangeli @ 2007-06-08 20:03 ` Andrea Arcangeli 2007-06-08 21:57 ` Christoph Lameter 2007-06-08 20:03 ` [PATCH 10 of 16] stop useless vm trashing while we wait the TIF_MEMDIE task to exit Andrea Arcangeli ` (7 subsequent siblings) 16 siblings, 1 reply; 77+ messages in thread From: Andrea Arcangeli @ 2007-06-08 20:03 UTC (permalink / raw) To: linux-mm # HG changeset patch # User Andrea Arcangeli <andrea@suse.de> # Date 1181332961 -7200 # Node ID 4a70e6a4142230fa161dd37202cd62fede122880 # Parent 60059913ab07906fceda14ffa72f2c77ef282fca fallback killing more tasks if tif-memdie doesn't go away Waiting indefinitely for a TIF_MEMDIE task to go away will deadlock. Two tasks reading from the same inode at the same time and both going out of memory inside a read(largebuffer) syscall, will even deadlock through contention over the PG_locked bitflag. The task holding the semaphore detects oom but the oom killer decides to kill the task blocked in wait_on_page_locked(). The task holding the semaphore will hang inside alloc_pages that will never return because it will wait the TIF_MEMDIE task to go away, but the TIF_MEMDIE task can't go away until the task holding the semaphore is killed in the first place. It's quite unpractical to teach the oom killer the locking dependencies across running tasks, so the feasible fix is to develop a logic that after waiting a long time for a TIF_MEMDIE tasks goes away, fallbacks on killing one more task. This also eliminates the possibility of suprious oom killage (i.e. two tasks killed despite only one had to be killed). It's not a math guarantee because we can't demonstrate that if a TIF_MEMDIE SIGKILLED task didn't mange to complete do_exit within 10sec, it never will. But the current probability of suprious oom killing is sure much higher than the probability of suprious oom killing with this patch applied. The whole locking is around the tasklist_lock. On one side do_exit reads TIF_MEMDIE and clears VM_is_OOM under the lock, on the other side the oom killer accesses VM_is_OOM and TIF_MEMDIE under the lock. This is a read_lock in the oom killer but it's actually a write lock thanks to the OOM_lock semaphore running one oom killer at once (the locking rule is, either use write_lock_irq or read_lock+OOM_lock). Signed-off-by: Andrea Arcangeli <andrea@suse.de> diff --git a/kernel/exit.c b/kernel/exit.c --- a/kernel/exit.c +++ b/kernel/exit.c @@ -845,6 +845,15 @@ static void exit_notify(struct task_stru unlikely(tsk->parent->signal->flags & SIGNAL_GROUP_EXIT))) state = EXIT_DEAD; tsk->exit_state = state; + + /* + * Read TIF_MEMDIE and set VM_is_OOM to 0 atomically inside + * the tasklist_lock_lock. 
+ */ + if (unlikely(test_tsk_thread_flag(tsk, TIF_MEMDIE))) { + extern unsigned long VM_is_OOM; + clear_bit(0, &VM_is_OOM); + } write_unlock_irq(&tasklist_lock); diff --git a/mm/oom_kill.c b/mm/oom_kill.c --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -28,6 +28,9 @@ int sysctl_panic_on_oom; int sysctl_panic_on_oom; /* #define DEBUG */ +unsigned long VM_is_OOM; +static unsigned long last_tif_memdie_jiffies; + /** * badness - calculate a numeric value for how bad this task has been * @p: task struct of which task we should calculate @@ -225,21 +228,14 @@ static struct task_struct *select_bad_pr if (is_init(p)) continue; - /* - * This task already has access to memory reserves and is - * being killed. Don't allow any other task access to the - * memory reserve. - * - * Note: this may have a chance of deadlock if it gets - * blocked waiting for another task which itself is waiting - * for memory. Is there a better alternative? - * - * Better not to skip PF_EXITING tasks, since they - * don't have access to the PF_MEMALLOC pool until - * we select them here first. - */ - if (test_tsk_thread_flag(p, TIF_MEMDIE)) - return ERR_PTR(-1UL); + if (unlikely(test_tsk_thread_flag(p, TIF_MEMDIE))) { + /* + * Either we already waited long enough, + * or exit_mm already run, so we must + * try to kill another task. + */ + continue; + } if (p->oomkilladj == OOM_DISABLE) continue; @@ -276,13 +272,16 @@ static void __oom_kill_task(struct task_ if (verbose) printk(KERN_ERR "Killed process %d (%s)\n", p->pid, p->comm); + if (!test_and_set_tsk_thread_flag(p, TIF_MEMDIE)) { + last_tif_memdie_jiffies = jiffies; + set_bit(0, &VM_is_OOM); + } /* * We give our sacrificial lamb high priority and access to * all the memory it needs. That way it should be able to * exit() and clear out its resources quickly... */ p->time_slice = HZ; - set_tsk_thread_flag(p, TIF_MEMDIE); force_sig(SIGKILL, p); } @@ -419,6 +418,18 @@ void out_of_memory(struct zonelist *zone constraint = constrained_alloc(zonelist, gfp_mask); cpuset_lock(); read_lock(&tasklist_lock); + + /* + * This holds the down(OOM_lock)+read_lock(tasklist_lock), so it's + * equivalent to write_lock_irq(tasklist_lock) as far as VM_is_OOM + * is concerned. + */ + if (unlikely(test_bit(0, &VM_is_OOM))) { + if (time_before(jiffies, last_tif_memdie_jiffies + 10*HZ)) + goto out; + printk("detected probable OOM deadlock, so killing another task\n"); + last_tif_memdie_jiffies = jiffies; + } switch (constraint) { case CONSTRAINT_MEMORY_POLICY: @@ -440,10 +451,6 @@ retry: * issues we may have. */ p = select_bad_process(&points); - - if (PTR_ERR(p) == -1UL) - goto out; - /* Found nothing?!?! Either we hang forever, or we panic. */ if (!p) { read_unlock(&tasklist_lock); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH 09 of 16] fallback killing more tasks if tif-memdie doesn't go away
2007-06-08 20:03 ` [PATCH 09 of 16] fallback killing more tasks if tif-memdie doesn't " Andrea Arcangeli
@ 2007-06-08 21:57 ` Christoph Lameter
0 siblings, 0 replies; 77+ messages in thread
From: Christoph Lameter @ 2007-06-08 21:57 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: linux-mm
On Fri, 8 Jun 2007, Andrea Arcangeli wrote:
> @@ -276,13 +272,16 @@ static void __oom_kill_task(struct task_
> 	if (verbose)
> 		printk(KERN_ERR "Killed process %d (%s)\n", p->pid, p->comm);
>
> +	if (!test_and_set_tsk_thread_flag(p, TIF_MEMDIE)) {
> +		last_tif_memdie_jiffies = jiffies;
> +		set_bit(0, &VM_is_OOM);
> +	}
> 	/*
You cannot set VM_is_OOM here, since __oom_kill_task can be called for a
process that has constrained allocations. With this patch a user can cause an
OOM by restricting access to a single node using MPOL_BIND. Then VM_is_OOM
will be set despite lots of available memory elsewhere.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 77+ messages in thread
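One way to address this objection, sketched on top of the VM_is_OOM bit from patch 09 (illustrative only; it assumes the constraint computed by constrained_alloc() is passed down to the kill path): keep TIF_MEMDIE for every victim, but only flag the VM as globally OOM when the allocation was not constrained by a cpuset or MPOL_BIND policy.

/* sketch: only unconstrained OOMs become a global event */
static void mark_oom_victim_sketch(struct task_struct *p, int constraint)
{
	if (!test_and_set_tsk_thread_flag(p, TIF_MEMDIE)) {
		last_tif_memdie_jiffies = jiffies;
		/* a cpuset/mempolicy constrained kill stays local */
		if (constraint == CONSTRAINT_NONE)
			set_bit(0, &VM_is_OOM);
	}
}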
* [PATCH 10 of 16] stop useless vm trashing while we wait the TIF_MEMDIE task to exit 2007-06-08 20:02 [PATCH 00 of 16] OOM related fixes Andrea Arcangeli ` (8 preceding siblings ...) 2007-06-08 20:03 ` [PATCH 09 of 16] fallback killing more tasks if tif-memdie doesn't " Andrea Arcangeli @ 2007-06-08 20:03 ` Andrea Arcangeli 2007-06-08 21:48 ` Christoph Lameter 2007-06-08 20:03 ` [PATCH 11 of 16] the oom schedule timeout isn't needed with the VM_is_OOM logic Andrea Arcangeli ` (6 subsequent siblings) 16 siblings, 1 reply; 77+ messages in thread From: Andrea Arcangeli @ 2007-06-08 20:03 UTC (permalink / raw) To: linux-mm # HG changeset patch # User Andrea Arcangeli <andrea@suse.de> # Date 1181332962 -7200 # Node ID 24250f0be1aa26e5c6e33fd97b9eae125db9fbde # Parent 4a70e6a4142230fa161dd37202cd62fede122880 stop useless vm trashing while we wait the TIF_MEMDIE task to exit There's no point in trying to free memory if we're oom. Signed-off-by: Andrea Arcangeli <andrea@suse.de> diff --git a/include/linux/swap.h b/include/linux/swap.h --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -159,6 +159,8 @@ struct swap_list_t { #define vm_swap_full() (nr_swap_pages*2 < total_swap_pages) /* linux/mm/oom_kill.c */ +extern unsigned long VM_is_OOM; +#define is_VM_OOM() unlikely(test_bit(0, &VM_is_OOM)) extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order); extern int register_oom_notifier(struct notifier_block *nb); extern int unregister_oom_notifier(struct notifier_block *nb); diff --git a/mm/vmscan.c b/mm/vmscan.c --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -923,6 +923,8 @@ static unsigned long shrink_zone(int pri nr_inactive = 0; while (nr_active || nr_inactive) { + if (is_VM_OOM()) + break; if (nr_active) { nr_to_scan = min(nr_active, (unsigned long)sc->swap_cluster_max); @@ -1032,6 +1034,17 @@ unsigned long try_to_free_pages(struct z } for (priority = DEF_PRIORITY; priority >= 0; priority--) { + if (is_VM_OOM()) { + if (!test_thread_flag(TIF_MEMDIE)) { + /* get out of the way */ + schedule_timeout_interruptible(1); + /* don't waste cpu if we're still oom */ + if (is_VM_OOM()) + goto out; + } else + goto out; + } + sc.nr_scanned = 0; if (!priority) disable_swap_token(); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH 10 of 16] stop useless vm trashing while we wait the TIF_MEMDIE task to exit 2007-06-08 20:03 ` [PATCH 10 of 16] stop useless vm trashing while we wait the TIF_MEMDIE task to exit Andrea Arcangeli @ 2007-06-08 21:48 ` Christoph Lameter 2007-06-09 1:59 ` Andrea Arcangeli 0 siblings, 1 reply; 77+ messages in thread From: Christoph Lameter @ 2007-06-08 21:48 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: linux-mm On Fri, 8 Jun 2007, Andrea Arcangeli wrote: > There's no point in trying to free memory if we're oom. OOMs can occur because we are in a cpuset or have a memory policy that restricts the allocations. So I guess that OOMness is a per node property and not a global one. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH 10 of 16] stop useless vm trashing while we wait the TIF_MEMDIE task to exit
2007-06-08 21:48 ` Christoph Lameter
@ 2007-06-09 1:59 ` Andrea Arcangeli
2007-06-09 3:01 ` Christoph Lameter
0 siblings, 1 reply; 77+ messages in thread
From: Andrea Arcangeli @ 2007-06-09 1:59 UTC (permalink / raw)
To: Christoph Lameter; +Cc: linux-mm
On Fri, Jun 08, 2007 at 02:48:15PM -0700, Christoph Lameter wrote:
> On Fri, 8 Jun 2007, Andrea Arcangeli wrote:
>
> > There's no point in trying to free memory if we're oom.
>
> OOMs can occur because we are in a cpuset or have a memory policy that
> restricts the allocations. So I guess that OOMness is a per node property
> and not a global one.
I'm sorry to inform you that the oom killing in current mainline has always
been a global event, not a per-node one, regardless of the fixes I just
posted.
	if (test_tsk_thread_flag(p, TIF_MEMDIE))
		return ERR_PTR(-1UL);
[..]
	if (PTR_ERR(p) == -1UL)
		goto out;
The best thing would be for you to send me more changes to append at the end
of the patchbomb, so that for the first time _ever_ the oom will become a
per-node event and not a global one anymore.
That said, it's not entirely obvious to me that it makes any sense to disrupt
functionality instead of just running slower but safely (I would generally
prefer to printk a warning instead of killing a task if we have to override
the restriction on the memory policy). But that's your call, I'm fine either
way... Thanks!
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH 10 of 16] stop useless vm trashing while we wait the TIF_MEMDIE task to exit 2007-06-09 1:59 ` Andrea Arcangeli @ 2007-06-09 3:01 ` Christoph Lameter 2007-06-09 14:05 ` Andrea Arcangeli 0 siblings, 1 reply; 77+ messages in thread From: Christoph Lameter @ 2007-06-09 3:01 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: linux-mm On Sat, 9 Jun 2007, Andrea Arcangeli wrote: > I'm sorry to inform you that the oom killing in current mainline has > always been a global event not a per-node one, regardless of the fixes > I just posted. Wrong. The oom killling is a local event if we are in a constrained allocation. The allocating task is killed not a random task. That call to kill the allocating task should not set any global flags. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH 10 of 16] stop useless vm trashing while we wait the TIF_MEMDIE task to exit 2007-06-09 3:01 ` Christoph Lameter @ 2007-06-09 14:05 ` Andrea Arcangeli 2007-06-09 14:38 ` Andrea Arcangeli 2007-06-11 16:04 ` Christoph Lameter 0 siblings, 2 replies; 77+ messages in thread From: Andrea Arcangeli @ 2007-06-09 14:05 UTC (permalink / raw) To: Christoph Lameter; +Cc: linux-mm On Fri, Jun 08, 2007 at 08:01:58PM -0700, Christoph Lameter wrote: > On Sat, 9 Jun 2007, Andrea Arcangeli wrote: > > > I'm sorry to inform you that the oom killing in current mainline has > > always been a global event not a per-node one, regardless of the fixes > > I just posted. > > Wrong. The oom killling is a local event if we are in a constrained > allocation. The allocating task is killed not a random task. That call to > kill the allocating task should not set any global flags. I just showed the global flag that is being checked. TIF_MEMDIE affects the whole system, not just your node-constrained allocating task. If your local constrained task fails to exit because it's running in the nfs path that loops forever even if NULL is returned from alloc_pages, it will deadlock the whole system if later a regular oom happens (alloc_pages isn't guaranteed to be called by a page fault where we know do_exit will guaranteed to be called if a sigkill is pending). This is just an example. Amittedly my fixes made things worse for your "local" oom killing, but your code was only apparently "local" because TIF_MEMDIE is a _global_ flag in the mainline kernel. So again, I'm very willing to improve the local oom killing, so that it will really become a local event for the first time ever. Infact with my fixes applied the whole system will stop waiting for the TIF_MEMDIE flag to go away, so it'll be much easier to really make the global oom killing independent from the local one. I didn't look into the details of the local oom killing yet (exactly because it wasn't so local in the first place) but it may be enough to set VM_is_OOM only for tasks that are not being locally killed and then those new changes will automatically prevent TIF_MEMDIE being set on a local-oom to affect the global-oom event. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH 10 of 16] stop useless vm trashing while we wait the TIF_MEMDIE task to exit 2007-06-09 14:05 ` Andrea Arcangeli @ 2007-06-09 14:38 ` Andrea Arcangeli 2007-06-11 16:07 ` Christoph Lameter 2007-06-11 16:04 ` Christoph Lameter 1 sibling, 1 reply; 77+ messages in thread From: Andrea Arcangeli @ 2007-06-09 14:38 UTC (permalink / raw) To: Christoph Lameter; +Cc: linux-mm On a side note about the current way you select the task to kill if a constrained alloc failure triggers, I think it would have been better if you simply extended the oom-selector by filtering tasks in function of the current->mems_allowed. Now I agree the current badness is quite bad, now with rss instead of the virtual space, it works a bit better at least, but the whole point is that if you integrate the cpuset task filtering in the oom-selector algorithm, then once we fix the badness algorithm to actually do something more meaningful than to check static values, you'll get the better algorithm working for your local-oom killing too. This if you really care about the huge-numa niche to get node-partitioning working really like if this was a virtualized environment. If you just have kill something to release memory, killing the current task is always the safest choice obviously, so as your customers are ok with it I'm certainly fine with the current approach too. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH 10 of 16] stop useless vm trashing while we wait the TIF_MEMDIE task to exit 2007-06-09 14:38 ` Andrea Arcangeli @ 2007-06-11 16:07 ` Christoph Lameter 2007-06-11 16:50 ` Andrea Arcangeli 0 siblings, 1 reply; 77+ messages in thread From: Christoph Lameter @ 2007-06-11 16:07 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: linux-mm On Sat, 9 Jun 2007, Andrea Arcangeli wrote: > On a side note about the current way you select the task to kill if a > constrained alloc failure triggers, I think it would have been better > if you simply extended the oom-selector by filtering tasks in function > of the current->mems_allowed. Now I agree the current badness is quite Filtering tasks is a very expensive operation on huge systems. We have had cases where it took an hour or so for the OOM to complete. OOM usually occurs under heavy processing loads which makes the taking of global locks quite expensive. > bad, now with rss instead of the virtual space, it works a bit better > at least, but the whole point is that if you integrate the cpuset task > filtering in the oom-selector algorithm, then once we fix the badness > algorithm to actually do something more meaningful than to check > static values, you'll get the better algorithm working for your > local-oom killing too. This if you really care about the huge-numa > niche to get node-partitioning working really like if this was a > virtualized environment. If you just have kill something to release > memory, killing the current task is always the safest choice > obviously, so as your customers are ok with it I'm certainly fine with > the current approach too. The "kill-the-current-process" approach is most effective in hitting the process that is allocating the most. And as far as I can tell its easiest to understand for our customer. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH 10 of 16] stop useless vm trashing while we wait the TIF_MEMDIE task to exit 2007-06-11 16:07 ` Christoph Lameter @ 2007-06-11 16:50 ` Andrea Arcangeli 2007-06-11 16:57 ` Christoph Lameter 0 siblings, 1 reply; 77+ messages in thread From: Andrea Arcangeli @ 2007-06-11 16:50 UTC (permalink / raw) To: Christoph Lameter; +Cc: linux-mm On Mon, Jun 11, 2007 at 09:07:59AM -0700, Christoph Lameter wrote: > Filtering tasks is a very expensive operation on huge systems. We have had Come on, oom_kill.c only happens at oom time, after the huge complex processing has figured out it's time to call into oom_kill.c, how can you care about the performance of oom_kill.c? Apparently some folks prefer to panic when oom triggers go figure... > cases where it took an hour or so for the OOM to complete. OOM usually > occurs under heavy processing loads which makes the taking of global locks > quite expensive. Since you mean that a _global_ OOM took one hour (you just used it as the comparison of the slow-one, the local-oom is supposed to be the fast one instead) I'd appreciate if you could try again with all my fixes applied and see if the time to recover the global oom is reduced (which is the whole objective of most of the fixes I've just posted). In general whatever you do inside oom_kill.c has nothing to do with the "expensive operations" (the expensive operations are infact halted with my fixes). In turn killing the current task so that oom_kill.c is faster, is quite a dubious argument. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH 10 of 16] stop useless vm trashing while we wait the TIF_MEMDIE task to exit 2007-06-11 16:50 ` Andrea Arcangeli @ 2007-06-11 16:57 ` Christoph Lameter 2007-06-11 17:51 ` Andrea Arcangeli 0 siblings, 1 reply; 77+ messages in thread From: Christoph Lameter @ 2007-06-11 16:57 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: linux-mm On Mon, 11 Jun 2007, Andrea Arcangeli wrote: > On Mon, Jun 11, 2007 at 09:07:59AM -0700, Christoph Lameter wrote: > > Filtering tasks is a very expensive operation on huge systems. We have had > > Come on, oom_kill.c only happens at oom time, after the huge complex > processing has figured out it's time to call into oom_kill.c, how can > you care about the performance of oom_kill.c? Apparently some folks > prefer to panic when oom triggers go figure... Its pretty bad if a large system sits for hours just because it cannot finish its OOM processing. We have reports of that taking 4 hours! > In turn killing the current task so that oom_kill.c is faster, is > quite a dubious argument. It avoids repeated scans over a super sized tasklist with heavy lock contention. 4 loops for every OOM kill! If a number of processes will be OOM killed then it will take hours to sort out the lock contention. Want this as a a SUSE bug? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH 10 of 16] stop useless vm trashing while we wait the TIF_MEMDIE task to exit 2007-06-11 16:57 ` Christoph Lameter @ 2007-06-11 17:51 ` Andrea Arcangeli 2007-06-11 17:56 ` Christoph Lameter 0 siblings, 1 reply; 77+ messages in thread From: Andrea Arcangeli @ 2007-06-11 17:51 UTC (permalink / raw) To: Christoph Lameter; +Cc: linux-mm On Mon, Jun 11, 2007 at 09:57:59AM -0700, Christoph Lameter wrote: > On Mon, 11 Jun 2007, Andrea Arcangeli wrote: > > > On Mon, Jun 11, 2007 at 09:07:59AM -0700, Christoph Lameter wrote: > > > Filtering tasks is a very expensive operation on huge systems. We have had > > > > Come on, oom_kill.c only happens at oom time, after the huge complex > > processing has figured out it's time to call into oom_kill.c, how can > > you care about the performance of oom_kill.c? Apparently some folks > > prefer to panic when oom triggers go figure... > > Its pretty bad if a large system sits for hours just because it cannot > finish its OOM processing. We have reports of that taking 4 hours! Which is why I posted these fixes, so it will hopefully take much less than 4 hours. Even normal production systems takes far too long today. Most of these fixes are meant to reduce the complexity involved in detecting when the system is oom (starting from number 01). Keep in mind the whole 4 hours are spent _outside_ oom_kill.c. > It avoids repeated scans over a super sized tasklist with heavy lock > contention. 4 loops for every OOM kill! If a number of processes will be Once the tasklist_lock has been taken, what else is going to trash inside oom_kill.c? > OOM killed then it will take hours to sort out the lock contention. Did you measure it or this is just your imagination? I don't buy your hypothetical "several hours spent in oom_kill.c" numbers. How long does "ls /proc" takes? Can your run top at all? > Want this as a a SUSE bug? Feel free to file a SUSE bugreport so I hope you will back your claim with some real profiling data and so we can check if this can be fixed in software of it's the hardware to blame (in which case we need a CONFIG_SLOW_NUMA, since other hardware implementations may prefer to use the oom-selector during local-oom killing too and not only during the global ones). Thanks. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH 10 of 16] stop useless vm trashing while we wait the TIF_MEMDIE task to exit 2007-06-11 17:51 ` Andrea Arcangeli @ 2007-06-11 17:56 ` Christoph Lameter 2007-06-11 18:22 ` Andrea Arcangeli 0 siblings, 1 reply; 77+ messages in thread From: Christoph Lameter @ 2007-06-11 17:56 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: linux-mm On Mon, 11 Jun 2007, Andrea Arcangeli wrote: > Did you measure it or this is just your imagination? I don't buy your > hypothetical "several hours spent in oom_kill.c" numbers. How long > does "ls /proc" takes? Can your run top at all? These are customer reports. 4 hours one and another 2 hours. I can certainly get more reports if I ask them for more details. I will get this on your SUSE radar. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH 10 of 16] stop useless vm trashing while we wait the TIF_MEMDIE task to exit 2007-06-11 17:56 ` Christoph Lameter @ 2007-06-11 18:22 ` Andrea Arcangeli 2007-06-11 18:39 ` Christoph Lameter 0 siblings, 1 reply; 77+ messages in thread From: Andrea Arcangeli @ 2007-06-11 18:22 UTC (permalink / raw) To: Christoph Lameter; +Cc: linux-mm On Mon, Jun 11, 2007 at 10:56:56AM -0700, Christoph Lameter wrote: > On Mon, 11 Jun 2007, Andrea Arcangeli wrote: > > > Did you measure it or this is just your imagination? I don't buy your > > hypothetical "several hours spent in oom_kill.c" numbers. How long > > does "ls /proc" takes? Can your run top at all? > > These are customer reports. 4 hours one and another 2 hours. I can How long does "ls /proc" take? Can you run top at all on such a system (I mean before it reaches the oom point, then it'll hang for those 4 hours with the mainline kernel, I know this and that's why I worked to fix it and posted 18 patches so far about it). > certainly get more reports if I ask them for more details. I will get this > on your SUSE radar. If it takes 4 hours for the function out_of_memory to return, please report it. If instead as I start to suspect, you're going to show me the function out_of_memory called one million times and taking a few seconds for each invocation, please test all my fixes before reporting, there's a reason I made those changes... Back to the local-oom: if out_of_memory takes a couple of seconds at most as I expect (it'll be the same order of ls /proc, actually ls /proc will be a lot slower), killing the current task in the local-oom as a performance optimization remains a very dubious argument. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH 10 of 16] stop useless vm trashing while we wait the TIF_MEMDIE task to exit 2007-06-11 18:22 ` Andrea Arcangeli @ 2007-06-11 18:39 ` Christoph Lameter 2007-06-11 18:58 ` Andrea Arcangeli 0 siblings, 1 reply; 77+ messages in thread From: Christoph Lameter @ 2007-06-11 18:39 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: linux-mm On Mon, 11 Jun 2007, Andrea Arcangeli wrote: > > These are customer reports. 4 hours one and another 2 hours. I can > > How long does "ls /proc" take? Can you run top at all on such a > system (I mean before it reaches the oom point, then it'll hang for > those 4 hours with the mainline kernel, I know this and that's why I > worked to fix it and posted 18 patches so far about it). These are big systems and it would take some time to reproduce these issues. Thanks for your work. I'd really like to see improvements there. If you take care of not worsening the local kill path then I am okay with the rest. > > certainly get more reports if I ask them for more details. I will get this > > on your SUSE radar. > > If it takes 4 hours for the function out_of_memory to return, please > report it. If instead as I start to suspect, you're going to show me > the function out_of_memory called one million times and taking a few > seconds for each invocation, please test all my fixes before > reporting, there's a reason I made those changes... out_of_memory takes about 5-10 minutes each (according to one report). An OOM storm will then take the machine out for 4 hours. The on site SE can likely tell you more details in the bugzilla. Another reporter had been waiting for 2 hours after an oom without any messages indicating that a single OOM was processed. > Back to the local-oom: if out_of_memory takes a couple of seconds at > most as I expect (it'll be the same order of ls /proc, actually ls > /proc will be a lot slower), killing the current task in the local-oom > as a performance optimization remains a very dubious argument. Killing the local process avoids 4 slow scans over a pretty large tasklist. But I agree that there may be additionial other issues lurking there fore large systems. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH 10 of 16] stop useless vm trashing while we wait the TIF_MEMDIE task to exit 2007-06-11 18:39 ` Christoph Lameter @ 2007-06-11 18:58 ` Andrea Arcangeli 2007-06-11 19:25 ` Christoph Lameter 0 siblings, 1 reply; 77+ messages in thread From: Andrea Arcangeli @ 2007-06-11 18:58 UTC (permalink / raw) To: Christoph Lameter; +Cc: linux-mm On Mon, Jun 11, 2007 at 11:39:03AM -0700, Christoph Lameter wrote: > These are big systems and it would take some time to reproduce these Sure I understand. > issues. Thanks for your work. I'd really like to see improvements there. I appreciate and hope it already helps for your oom troubles too. > If you take care of not worsening the local kill path then I am okay with > the rest. The slight regression I introduced for the numa local oom path clearly needed correction. Let me know if you still see problems after the incremental patch I posted today of course. I think that should be enough to correct the local-oom without altering the global-oom. I tested it on non-numa and it still works fine. > out_of_memory takes about 5-10 minutes each (according to one report). An Even 10 minutes is way beyond what I expected (but with the background trashing of the mainline kernel, I can imagine it happening). > OOM storm will then take the machine out for 4 hours. The on site SE can > likely tell you more details in the bugzilla. Ok, then I think you really want to try my patchset for the oom storm since at least that one should be gone. When the first oom starts, the whole VM will stop, no other oom_kill will be called, and even if they're on their way to call a spurious out_of_memory, the semaphore trylock will put them back in S state immediately inside try_to_free_pages. Especially in systems like yours where trashing cachelines is practically forbidden, I suspect this could make a substantial difference and perhaps then out_of_memory will return in less than 10 minutes by the fact of practically running single threaded. > Another reporter had been waiting for 2 hours after an oom without any > messages indicating that a single OOM was processed. This is the case I'm dealing with more commonly, normally the more swap more more it takes, and that's expectable. It should have improved too with the patchset. Thanks. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH 10 of 16] stop useless vm trashing while we wait the TIF_MEMDIE task to exit 2007-06-11 18:58 ` Andrea Arcangeli @ 2007-06-11 19:25 ` Christoph Lameter 0 siblings, 0 replies; 77+ messages in thread From: Christoph Lameter @ 2007-06-11 19:25 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: linux-mm On Mon, 11 Jun 2007, Andrea Arcangeli wrote: > This is the case I'm dealing with more commonly, normally the more > swap more more it takes, and that's expectable. It should have > improved too with the patchset. Do you have a SLES10 kernel with these fixes? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH 10 of 16] stop useless vm trashing while we wait the TIF_MEMDIE task to exit 2007-06-09 14:05 ` Andrea Arcangeli 2007-06-09 14:38 ` Andrea Arcangeli @ 2007-06-11 16:04 ` Christoph Lameter 1 sibling, 0 replies; 77+ messages in thread From: Christoph Lameter @ 2007-06-11 16:04 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: linux-mm On Sat, 9 Jun 2007, Andrea Arcangeli wrote: > I just showed the global flag that is being checked. TIF_MEMDIE > affects the whole system, not just your node-constrained allocating TIF_MEMDIE affects the task that attempted to perform an constrained allocation. The effects are global for that task but there are not as severe as setting a global OOM flag! > Amittedly my fixes made things worse for your "local" oom killing, but > your code was only apparently "local" because TIF_MEMDIE is a _global_ > flag in the mainline kernel. So again, I'm very willing to improve the TIF_MEMDIE is confined to a process. > local one. I didn't look into the details of the local oom killing yet > (exactly because it wasn't so local in the first place) but it may be > enough to set VM_is_OOM only for tasks that are not being locally > killed and then those new changes will automatically prevent > TIF_MEMDIE being set on a local-oom to affect the global-oom event. TIF_MEMDIE must be set in order for the task to die properly even if its a constrained allocation because TIF_MEMDIE relaxes the constraints so that the task can terminate. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* [PATCH 11 of 16] the oom schedule timeout isn't needed with the VM_is_OOM logic 2007-06-08 20:02 [PATCH 00 of 16] OOM related fixes Andrea Arcangeli ` (9 preceding siblings ...) 2007-06-08 20:03 ` [PATCH 10 of 16] stop useless vm trashing while we wait the TIF_MEMDIE task to exit Andrea Arcangeli @ 2007-06-08 20:03 ` Andrea Arcangeli 2007-06-08 20:03 ` [PATCH 12 of 16] show mem information only when a task is actually being killed Andrea Arcangeli ` (5 subsequent siblings) 16 siblings, 0 replies; 77+ messages in thread From: Andrea Arcangeli @ 2007-06-08 20:03 UTC (permalink / raw) To: linux-mm # HG changeset patch # User Andrea Arcangeli <andrea@suse.de> # Date 1181332962 -7200 # Node ID c6dfb528f53eaac2188b49f67eed51c1a33ce7cd # Parent 24250f0be1aa26e5c6e33fd97b9eae125db9fbde the oom schedule timeout isn't needed with the VM_is_OOM logic VM_is_OOM whole point is to give a proper time to the TIF_MEMDIE task in order to exit. Signed-off-by: Andrea Arcangeli <andrea@suse.de> diff --git a/mm/oom_kill.c b/mm/oom_kill.c --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -468,12 +468,5 @@ out: read_unlock(&tasklist_lock); cpuset_unlock(); - /* - * Give "p" a good chance of killing itself before we - * retry to allocate memory unless "p" is current - */ - if (!test_thread_flag(TIF_MEMDIE)) - schedule_timeout_uninterruptible(1); - up(&OOM_lock); } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* [PATCH 12 of 16] show mem information only when a task is actually being killed 2007-06-08 20:02 [PATCH 00 of 16] OOM related fixes Andrea Arcangeli ` (10 preceding siblings ...) 2007-06-08 20:03 ` [PATCH 11 of 16] the oom schedule timeout isn't needed with the VM_is_OOM logic Andrea Arcangeli @ 2007-06-08 20:03 ` Andrea Arcangeli 2007-06-08 20:03 ` [PATCH 13 of 16] simplify oom heuristics Andrea Arcangeli ` (4 subsequent siblings) 16 siblings, 0 replies; 77+ messages in thread From: Andrea Arcangeli @ 2007-06-08 20:03 UTC (permalink / raw) To: linux-mm # HG changeset patch # User Andrea Arcangeli <andrea@suse.de> # Date 1181332962 -7200 # Node ID db4c0ce6754d7838713eda1851aef43c2fb52fca # Parent c6dfb528f53eaac2188b49f67eed51c1a33ce7cd show mem information only when a task is actually being killed Don't show garbage while VM_is_OOM and the timeout didn't trigger. Signed-off-by: Andrea Arcangeli <andrea@suse.de> diff --git a/mm/oom_kill.c b/mm/oom_kill.c --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -286,7 +286,7 @@ static void __oom_kill_task(struct task_ force_sig(SIGKILL, p); } -static int oom_kill_task(struct task_struct *p) +static int oom_kill_task(struct task_struct *p, gfp_t gfp_mask, int order) { struct mm_struct *mm; struct task_struct *g, *q; @@ -313,93 +313,6 @@ static int oom_kill_task(struct task_str return 1; } while_each_thread(g, q); - __oom_kill_task(p, 1); - - /* - * kill all processes that share the ->mm (i.e. all threads), - * but are in a different thread group. Don't let them have access - * to memory reserves though, otherwise we might deplete all memory. - */ - do_each_thread(g, q) { - if (q->mm == mm && q->tgid != p->tgid) - force_sig(SIGKILL, q); - } while_each_thread(g, q); - - return 0; -} - -static int oom_kill_process(struct task_struct *p, unsigned long points, - const char *message) -{ - struct task_struct *c; - struct list_head *tsk; - - /* - * If the task is already exiting, don't alarm the sysadmin or kill - * its children or threads, just set TIF_MEMDIE so it can die quickly - */ - if (p->flags & PF_EXITING) { - __oom_kill_task(p, 0); - return 0; - } - - printk(KERN_ERR "%s: kill process %d (%s) score %li or a child\n", - message, p->pid, p->comm, points); - - /* Try to kill a child first */ - list_for_each(tsk, &p->children) { - c = list_entry(tsk, struct task_struct, sibling); - if (c->mm == p->mm) - continue; - /* - * We cannot select tasks with TIF_MEMDIE already set - * or we'll hard deadlock. - */ - if (unlikely(test_tsk_thread_flag(c, TIF_MEMDIE))) - continue; - if (!oom_kill_task(c)) - return 0; - } - return oom_kill_task(p); -} - -static BLOCKING_NOTIFIER_HEAD(oom_notify_list); - -int register_oom_notifier(struct notifier_block *nb) -{ - return blocking_notifier_chain_register(&oom_notify_list, nb); -} -EXPORT_SYMBOL_GPL(register_oom_notifier); - -int unregister_oom_notifier(struct notifier_block *nb) -{ - return blocking_notifier_chain_unregister(&oom_notify_list, nb); -} -EXPORT_SYMBOL_GPL(unregister_oom_notifier); - -/** - * out_of_memory - kill the "best" process when we run out of memory - * - * If we run out of memory, we have the choice between either - * killing a random task (bad), letting the system crash (worse) - * OR try to be smart about which process to kill. Note that we - * don't have to be perfect here, we just have to be good. 
- */ -void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order) -{ - struct task_struct *p; - unsigned long points = 0; - unsigned long freed = 0; - int constraint; - static DECLARE_MUTEX(OOM_lock); - - blocking_notifier_call_chain(&oom_notify_list, 0, &freed); - if (freed > 0) - /* Got some memory back in the last second. */ - return; - - if (down_trylock(&OOM_lock)) - return; if (printk_ratelimit()) { printk(KERN_WARNING "%s invoked oom-killer: " "gfp_mask=0x%x, order=%d, oomkilladj=%d\n", @@ -408,6 +321,94 @@ void out_of_memory(struct zonelist *zone show_mem(); } + __oom_kill_task(p, 1); + + /* + * kill all processes that share the ->mm (i.e. all threads), + * but are in a different thread group. Don't let them have access + * to memory reserves though, otherwise we might deplete all memory. + */ + do_each_thread(g, q) { + if (q->mm == mm && q->tgid != p->tgid) + force_sig(SIGKILL, q); + } while_each_thread(g, q); + + return 0; +} + +static int oom_kill_process(struct task_struct *p, unsigned long points, + const char *message, gfp_t gfp_mask, int order) +{ + struct task_struct *c; + struct list_head *tsk; + + /* + * If the task is already exiting, don't alarm the sysadmin or kill + * its children or threads, just set TIF_MEMDIE so it can die quickly + */ + if (p->flags & PF_EXITING) { + __oom_kill_task(p, 0); + return 0; + } + + printk(KERN_ERR "%s: kill process %d (%s) score %li or a child\n", + message, p->pid, p->comm, points); + + /* Try to kill a child first */ + list_for_each(tsk, &p->children) { + c = list_entry(tsk, struct task_struct, sibling); + if (c->mm == p->mm) + continue; + /* + * We cannot select tasks with TIF_MEMDIE already set + * or we'll hard deadlock. + */ + if (unlikely(test_tsk_thread_flag(c, TIF_MEMDIE))) + continue; + if (!oom_kill_task(c, gfp_mask, order)) + return 0; + } + return oom_kill_task(p, gfp_mask, order); +} + +static BLOCKING_NOTIFIER_HEAD(oom_notify_list); + +int register_oom_notifier(struct notifier_block *nb) +{ + return blocking_notifier_chain_register(&oom_notify_list, nb); +} +EXPORT_SYMBOL_GPL(register_oom_notifier); + +int unregister_oom_notifier(struct notifier_block *nb) +{ + return blocking_notifier_chain_unregister(&oom_notify_list, nb); +} +EXPORT_SYMBOL_GPL(unregister_oom_notifier); + +/** + * out_of_memory - kill the "best" process when we run out of memory + * + * If we run out of memory, we have the choice between either + * killing a random task (bad), letting the system crash (worse) + * OR try to be smart about which process to kill. Note that we + * don't have to be perfect here, we just have to be good. + */ +void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order) +{ + struct task_struct *p; + unsigned long points = 0; + unsigned long freed = 0; + int constraint; + static DECLARE_MUTEX(OOM_lock); + + blocking_notifier_call_chain(&oom_notify_list, 0, &freed); + if (freed > 0) + /* Got some memory back in the last second. */ + return; + + if (down_trylock(&OOM_lock)) + return; + if (sysctl_panic_on_oom == 2) panic("out of memory. 
Compulsory panic_on_oom is selected.\n"); @@ -434,12 +435,12 @@ void out_of_memory(struct zonelist *zone switch (constraint) { case CONSTRAINT_MEMORY_POLICY: oom_kill_process(current, points, - "No available memory (MPOL_BIND)"); + "No available memory (MPOL_BIND)", gfp_mask, order); break; case CONSTRAINT_CPUSET: oom_kill_process(current, points, - "No available memory in cpuset"); + "No available memory in cpuset", gfp_mask, order); break; case CONSTRAINT_NONE: @@ -458,7 +459,7 @@ retry: panic("Out of memory and no killable processes...\n"); } - if (oom_kill_process(p, points, "Out of memory")) + if (oom_kill_process(p, points, "Out of memory", gfp_mask, order)) goto retry; break; -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* [PATCH 13 of 16] simplify oom heuristics 2007-06-08 20:02 [PATCH 00 of 16] OOM related fixes Andrea Arcangeli ` (11 preceding siblings ...) 2007-06-08 20:03 ` [PATCH 12 of 16] show mem information only when a task is actually being killed Andrea Arcangeli @ 2007-06-08 20:03 ` Andrea Arcangeli 2007-06-08 20:03 ` [PATCH 14 of 16] oom select should only take rss into account Andrea Arcangeli ` (3 subsequent siblings) 16 siblings, 0 replies; 77+ messages in thread From: Andrea Arcangeli @ 2007-06-08 20:03 UTC (permalink / raw) To: linux-mm # HG changeset patch # User Andrea Arcangeli <andrea@suse.de> # Date 1181332962 -7200 # Node ID dfac333eb29032dab87dd2c46f71a22037a6dc4a # Parent db4c0ce6754d7838713eda1851aef43c2fb52fca simplify oom heuristics Over time somebody had the good idea to remove the rcvd_sigterm points, this removes more of them. The selected task should be the one that if we don't kill, it will turn the system oom again sooner than later. These informations tell us nothing about which task is best to kill so they should be removed. Signed-off-by: Andrea Arcangeli <andrea@suse.de> diff --git a/mm/oom_kill.c b/mm/oom_kill.c --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -52,7 +52,7 @@ static unsigned long last_tif_memdie_jif unsigned long badness(struct task_struct *p, unsigned long uptime) { - unsigned long points, cpu_time, run_time, s; + unsigned long points; struct mm_struct *mm; struct task_struct *child; @@ -93,26 +93,6 @@ unsigned long badness(struct task_struct points += child->mm->total_vm/2 + 1; task_unlock(child); } - - /* - * CPU time is in tens of seconds and run time is in thousands - * of seconds. There is no particular reason for this other than - * that it turned out to work very well in practice. - */ - cpu_time = (cputime_to_jiffies(p->utime) + cputime_to_jiffies(p->stime)) - >> (SHIFT_HZ + 3); - - if (uptime >= p->start_time.tv_sec) - run_time = (uptime - p->start_time.tv_sec) >> 10; - else - run_time = 0; - - s = int_sqrt(cpu_time); - if (s) - points /= s; - s = int_sqrt(int_sqrt(run_time)); - if (s) - points /= s; /* * Niced processes are most likely less important, so double -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
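To make the removal concrete, a small userspace sketch (all numbers invented for illustration) of how the dropped cpu_time/run_time divisors could let a long-running task with twenty times the footprint score lower than a freshly started, modest one:

#include <stdio.h>

/* toy stand-in for the kernel's int_sqrt() */
static unsigned long int_sqrt(unsigned long x)
{
	unsigned long r = 0;

	while ((r + 1) * (r + 1) <= x)
		r++;
	return r;
}

/* the scaling that patch 13 removes, applied to a base score */
static unsigned long old_badness(unsigned long points,
				 unsigned long cpu_time,
				 unsigned long run_time)
{
	unsigned long s;

	s = int_sqrt(cpu_time);
	if (s)
		points /= s;
	s = int_sqrt(int_sqrt(run_time));
	if (s)
		points /= s;
	return points;
}

int main(void)
{
	/* long-running leaker: huge footprint, lots of CPU and uptime */
	unsigned long leaker = old_badness(2000000, 97, 976);
	/* just-started task: both divisors are still zero */
	unsigned long fresh = old_badness(100000, 0, 0);

	printf("leaker=%lu fresh=%lu\n", leaker, fresh); /* 44444 vs 100000 */
	return 0;
}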
* [PATCH 14 of 16] oom select should only take rss into account 2007-06-08 20:02 [PATCH 00 of 16] OOM related fixes Andrea Arcangeli ` (12 preceding siblings ...) 2007-06-08 20:03 ` [PATCH 13 of 16] simplify oom heuristics Andrea Arcangeli @ 2007-06-08 20:03 ` Andrea Arcangeli 2007-06-10 17:17 ` Rik van Riel 2007-06-08 20:03 ` [PATCH 15 of 16] limit reclaim if enough pages have been freed Andrea Arcangeli ` (2 subsequent siblings) 16 siblings, 1 reply; 77+ messages in thread From: Andrea Arcangeli @ 2007-06-08 20:03 UTC (permalink / raw) To: linux-mm # HG changeset patch # User Andrea Arcangeli <andrea@suse.de> # Date 1181332962 -7200 # Node ID dbd70ffd95f34cd12f1fd2f05a9cc0f9a50edb4a # Parent dfac333eb29032dab87dd2c46f71a22037a6dc4a oom select should only take rss into account Running workloads where many tasks grow their virtual memory simultaneously, so they all have a relatively small virtual memory when oom triggers (if compared to innocent longstanding tasks), the oom killer then selects mysql/apache and other things with very large VM but very small RSS. RSS is the only thing that matters, killing a task with huge VM but zero RSS is not useful. Many apps tend to have large VM but small RSS in the first place (regardless of swapping activity) and they shouldn't be penalized like this. Signed-off-by: Andrea Arcangeli <andrea@suse.de> diff --git a/mm/oom_kill.c b/mm/oom_kill.c --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -66,7 +66,7 @@ unsigned long badness(struct task_struct /* * The memory size of the process is the basis for the badness. */ - points = mm->total_vm; + points = get_mm_rss(mm); /* * After this unlock we can no longer dereference local variable `mm' @@ -90,7 +90,7 @@ unsigned long badness(struct task_struct list_for_each_entry(child, &p->children, sibling) { task_lock(child); if (child->mm != mm && child->mm) - points += child->mm->total_vm/2 + 1; + points += get_mm_rss(child->mm)/2 + 1; task_unlock(child); } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
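A userspace illustration (not part of the patch) of why total_vm can mislead the selector: a large anonymous mapping shows up in VmSize immediately, but in VmRSS only once the pages are actually touched.

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 1UL << 28;	/* 256 MB of address space */
	char *p;

	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return 1;
	/* here: VmSize grew by 256 MB, VmRSS is unchanged */
	memset(p, 1, len);
	/* here: VmRSS grew too -- this is what the selector should weigh */
	getchar();	/* pause so /proc/<pid>/status can be inspected */
	return 0;
}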
* Re: [PATCH 14 of 16] oom select should only take rss into account 2007-06-08 20:03 ` [PATCH 14 of 16] oom select should only take rss into account Andrea Arcangeli @ 2007-06-10 17:17 ` Rik van Riel 2007-06-10 17:30 ` Andrea Arcangeli 0 siblings, 1 reply; 77+ messages in thread From: Rik van Riel @ 2007-06-10 17:17 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: linux-mm Andrea Arcangeli wrote: > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -66,7 +66,7 @@ unsigned long badness(struct task_struct > /* > * The memory size of the process is the basis for the badness. > */ > - points = mm->total_vm; > + points = get_mm_rss(mm); Makes sense. Originally it used total_vm so it could also select tasks that use up lots of swap, but I guess that in almost all the cases the preferred OOM task to kill is also using a lot of RAM. Acked-by: Rik van Riel <riel@redhat.com> -- Politics is the struggle between those who want to make their country the best in the world, and those who believe it already is. Each group calls the other unpatriotic. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH 14 of 16] oom select should only take rss into account 2007-06-10 17:17 ` Rik van Riel @ 2007-06-10 17:30 ` Andrea Arcangeli 0 siblings, 0 replies; 77+ messages in thread From: Andrea Arcangeli @ 2007-06-10 17:30 UTC (permalink / raw) To: Rik van Riel; +Cc: linux-mm On Sun, Jun 10, 2007 at 01:17:13PM -0400, Rik van Riel wrote: > Andrea Arcangeli wrote: > > >diff --git a/mm/oom_kill.c b/mm/oom_kill.c > >--- a/mm/oom_kill.c > >+++ b/mm/oom_kill.c > >@@ -66,7 +66,7 @@ unsigned long badness(struct task_struct > > /* > > * The memory size of the process is the basis for the badness. > > */ > >- points = mm->total_vm; > >+ points = get_mm_rss(mm); > > Makes sense. Originally it used total_vm so it could also > select tasks that use up lots of swap, but I guess that in > almost all the cases the preferred OOM task to kill is also > using a lot of RAM. Agreed. > Acked-by: Rik van Riel <riel@redhat.com> Thanks for the Ack. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* [PATCH 15 of 16] limit reclaim if enough pages have been freed 2007-06-08 20:02 [PATCH 00 of 16] OOM related fixes Andrea Arcangeli ` (13 preceding siblings ...) 2007-06-08 20:03 ` [PATCH 14 of 16] oom select should only take rss into account Andrea Arcangeli @ 2007-06-08 20:03 ` Andrea Arcangeli 2007-06-10 17:20 ` Rik van Riel 2007-06-08 20:03 ` [PATCH 16 of 16] avoid some lock operation in vm fast path Andrea Arcangeli 2007-06-08 21:26 ` [PATCH 00 of 16] OOM related fixes William Lee Irwin III 16 siblings, 1 reply; 77+ messages in thread From: Andrea Arcangeli @ 2007-06-08 20:03 UTC (permalink / raw) To: linux-mm # HG changeset patch # User Andrea Arcangeli <andrea@suse.de> # Date 1181332962 -7200 # Node ID 31ef5d0bf924fb47da144321f692f4fefebf5cf5 # Parent dbd70ffd95f34cd12f1fd2f05a9cc0f9a50edb4a limit reclaim if enough pages have been freed No need to wipe out an huge chunk of the cache. Signed-off-by: Andrea Arcangeli <andrea@suse.de> diff --git a/mm/vmscan.c b/mm/vmscan.c --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -938,6 +938,8 @@ static unsigned long shrink_zone(int pri nr_inactive -= nr_to_scan; nr_reclaimed += shrink_inactive_list(nr_to_scan, zone, sc); + if (nr_reclaimed >= sc->swap_cluster_max) + break; } } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
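For scale: in this era's try_to_free_pages() the target is sc->swap_cluster_max = SWAP_CLUSTER_MAX, i.e. 32 pages (128 KB with 4 KB pages), so with this break a single shrink_zone() invocation stops once it has reclaimed roughly that much instead of finishing the whole scan window for the pass.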
* Re: [PATCH 15 of 16] limit reclaim if enough pages have been freed 2007-06-08 20:03 ` [PATCH 15 of 16] limit reclaim if enough pages have been freed Andrea Arcangeli @ 2007-06-10 17:20 ` Rik van Riel 2007-06-10 17:32 ` Andrea Arcangeli 0 siblings, 1 reply; 77+ messages in thread From: Rik van Riel @ 2007-06-10 17:20 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: linux-mm, Larry Woodman Andrea Arcangeli wrote: > No need to wipe out an huge chunk of the cache. I've seen recent upstream kernels free up to 75% of memory on my test system, when pushed hard enough. It is not hard to get hundreds of tasks into the pageout code simultaneously, all starting out at priority 12 and not freeing anything until they all get to much lower priorities. A workload that is dominated by anonymous memory will trigger this. All anonymous memory starts out on the active list and tasks will not even try to shrink the inactive list because nr_inactive >> priority is 0. This patch is a step in the right direction. However, I believe that your [PATCH 01 of 16] is a step in the wrong direction for these workloads... > Signed-off-by: Andrea Arcangeli <andrea@suse.de> Acked-by: Rik van Riel <riel@redhat.com> > diff --git a/mm/vmscan.c b/mm/vmscan.c > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -938,6 +938,8 @@ static unsigned long shrink_zone(int pri > nr_inactive -= nr_to_scan; > nr_reclaimed += shrink_inactive_list(nr_to_scan, zone, > sc); > + if (nr_reclaimed >= sc->swap_cluster_max) > + break; > } > } -- Politics is the struggle between those who want to make their country the best in the world, and those who believe it already is. Each group calls the other unpatriotic. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
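For reference, the arithmetic behind "nr_inactive >> priority is 0": with DEF_PRIORITY = 12, a zone holding, say, 3500 inactive pages (about 14 MB with 4 KB pages) gets a scan window of 3500 >> 12 = 0 on the first pass and 3500 >> 11 = 1 on the next; only around priority 5 does the window (3500 >> 5 = 109 pages) become big enough to reclaim anything useful, so hundreds of concurrent reclaimers can all walk down several priority levels before any of them frees a page.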
* Re: [PATCH 15 of 16] limit reclaim if enough pages have been freed 2007-06-10 17:20 ` Rik van Riel @ 2007-06-10 17:32 ` Andrea Arcangeli 2007-06-10 17:52 ` Rik van Riel 0 siblings, 1 reply; 77+ messages in thread From: Andrea Arcangeli @ 2007-06-10 17:32 UTC (permalink / raw) To: Rik van Riel; +Cc: linux-mm, Larry Woodman On Sun, Jun 10, 2007 at 01:20:50PM -0400, Rik van Riel wrote: > code simultaneously, all starting out at priority 12 and > not freeing anything until they all get to much lower > priorities. BTW, this reminds me that I've been wondering if 2**12 is a too small fraction of the lru to start the scan with. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH 15 of 16] limit reclaim if enough pages have been freed 2007-06-10 17:32 ` Andrea Arcangeli @ 2007-06-10 17:52 ` Rik van Riel 2007-06-11 16:23 ` Christoph Lameter 0 siblings, 1 reply; 77+ messages in thread From: Rik van Riel @ 2007-06-10 17:52 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: linux-mm, Larry Woodman Andrea Arcangeli wrote: > On Sun, Jun 10, 2007 at 01:20:50PM -0400, Rik van Riel wrote: >> code simultaneously, all starting out at priority 12 and >> not freeing anything until they all get to much lower >> priorities. > > BTW, this reminds me that I've been wondering if 2**12 is a too small > fraction of the lru to start the scan with. If the system has 1 TB of RAM, it's probably too big of a fraction :) We need something smarter. -- Politics is the struggle between those who want to make their country the best in the world, and those who believe it already is. Each group calls the other unpatriotic. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH 15 of 16] limit reclaim if enough pages have been freed 2007-06-10 17:52 ` Rik van Riel @ 2007-06-11 16:23 ` Christoph Lameter 2007-06-11 16:57 ` Rik van Riel 0 siblings, 1 reply; 77+ messages in thread From: Christoph Lameter @ 2007-06-11 16:23 UTC (permalink / raw) To: Rik van Riel; +Cc: Andrea Arcangeli, linux-mm, Larry Woodman On Sun, 10 Jun 2007, Rik van Riel wrote: > Andrea Arcangeli wrote: > > On Sun, Jun 10, 2007 at 01:20:50PM -0400, Rik van Riel wrote: > > > code simultaneously, all starting out at priority 12 and > > > not freeing anything until they all get to much lower > > > priorities. > > > > BTW, this reminds me that I've been wondering if 2**12 is a too small > > fraction of the lru to start the scan with. > > If the system has 1 TB of RAM, it's probably too big > of a fraction :) > > We need something smarter. Well this value is depending on a nodes memory not on the systems total memory. So I think we are fine. 1TB systems (at least ours) are comprised of nodes with 4GB/8GB/16GB of memory. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH 15 of 16] limit reclaim if enough pages have been freed 2007-06-11 16:23 ` Christoph Lameter @ 2007-06-11 16:57 ` Rik van Riel 0 siblings, 0 replies; 77+ messages in thread From: Rik van Riel @ 2007-06-11 16:57 UTC (permalink / raw) To: Christoph Lameter; +Cc: Andrea Arcangeli, linux-mm, Larry Woodman Christoph Lameter wrote: > On Sun, 10 Jun 2007, Rik van Riel wrote: > >> Andrea Arcangeli wrote: >>> On Sun, Jun 10, 2007 at 01:20:50PM -0400, Rik van Riel wrote: >>>> code simultaneously, all starting out at priority 12 and >>>> not freeing anything until they all get to much lower >>>> priorities. >>> BTW, this reminds me that I've been wondering if 2**12 is a too small >>> fraction of the lru to start the scan with. >> If the system has 1 TB of RAM, it's probably too big >> of a fraction :) >> >> We need something smarter. > > Well this value is depending on a nodes memory not on the systems > total memory. So I think we are fine. 1TB systems (at least ours) are > comprised of nodes with 4GB/8GB/16GB of memory. Yours are fine, because currently the very large system customers tend to run fine tuned workloads. We are seeing some other users throwing random workloads at systems with 256GB of RAM in a single zone. General purpose computing is moving up, VM explosions are becoming more spectacular :) -- All Rights Reversed -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* [PATCH 16 of 16] avoid some lock operation in vm fast path 2007-06-08 20:02 [PATCH 00 of 16] OOM related fixes Andrea Arcangeli ` (14 preceding siblings ...) 2007-06-08 20:03 ` [PATCH 15 of 16] limit reclaim if enough pages have been freed Andrea Arcangeli @ 2007-06-08 20:03 ` Andrea Arcangeli 2007-06-08 21:26 ` [PATCH 00 of 16] OOM related fixes William Lee Irwin III 16 siblings, 0 replies; 77+ messages in thread From: Andrea Arcangeli @ 2007-06-08 20:03 UTC (permalink / raw) To: linux-mm # HG changeset patch # User Andrea Arcangeli <andrea@suse.de> # Date 1181332962 -7200 # Node ID 19fb832beb3c83b7bed13c1a2f54ec4e077cfc0d # Parent 31ef5d0bf924fb47da144321f692f4fefebf5cf5 avoid some lock operation in vm fast path Let's not bloat the kernel for numa. Not nice, but at least this way perhaps somebody will clean it up instead of hiding the inefficiency in there. Signed-off-by: Andrea Arcangeli <andrea@suse.de> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -223,8 +223,10 @@ struct zone { unsigned long pages_scanned; /* since last reclaim */ int all_unreclaimable; /* All pages pinned */ +#ifdef CONFIG_NUMA /* A count of how many reclaimers are scanning this zone */ atomic_t reclaim_in_progress; +#endif /* Zone statistics */ atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS]; diff --git a/mm/page_alloc.c b/mm/page_alloc.c --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2650,7 +2650,9 @@ static void __meminit free_area_init_cor INIT_LIST_HEAD(&zone->active_list); INIT_LIST_HEAD(&zone->inactive_list); zap_zone_vm_stats(zone); +#ifdef CONFIG_NUMA atomic_set(&zone->reclaim_in_progress, 0); +#endif if (!size) continue; diff --git a/mm/vmscan.c b/mm/vmscan.c --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -909,7 +909,9 @@ static unsigned long shrink_zone(int pri unsigned long nr_to_scan; unsigned long nr_reclaimed = 0; +#ifdef CONFIG_NUMA atomic_inc(&zone->reclaim_in_progress); +#endif /* * Add one to `nr_to_scan' just to make sure that the kernel will @@ -945,7 +947,9 @@ static unsigned long shrink_zone(int pri throttle_vm_writeout(sc->gfp_mask); +#ifdef CONFIG_NUMA atomic_dec(&zone->reclaim_in_progress); +#endif return nr_reclaimed; } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
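One way the admittedly ugly #ifdefs could later be hidden, sketched here with invented helper names (zone_reclaim_begin()/zone_reclaim_end() are not in the tree): inline wrappers keep vmscan.c free of #ifdefs while the !NUMA build still compiles the calls away, and shrink_zone() would then call them unconditionally.

#include <linux/mmzone.h>

#ifdef CONFIG_NUMA
static inline void zone_reclaim_begin(struct zone *zone)
{
	atomic_inc(&zone->reclaim_in_progress);
}
static inline void zone_reclaim_end(struct zone *zone)
{
	atomic_dec(&zone->reclaim_in_progress);
}
#else
static inline void zone_reclaim_begin(struct zone *zone) { }
static inline void zone_reclaim_end(struct zone *zone) { }
#endif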
* Re: [PATCH 00 of 16] OOM related fixes 2007-06-08 20:02 [PATCH 00 of 16] OOM related fixes Andrea Arcangeli ` (15 preceding siblings ...) 2007-06-08 20:03 ` [PATCH 16 of 16] avoid some lock operation in vm fast path Andrea Arcangeli @ 2007-06-08 21:26 ` William Lee Irwin III 2007-06-09 14:55 ` Andrea Arcangeli 16 siblings, 1 reply; 77+ messages in thread From: William Lee Irwin III @ 2007-06-08 21:26 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: linux-mm On Fri, Jun 08, 2007 at 10:02:58PM +0200, Andrea Arcangeli wrote: > Hello everyone, > this is a set of fixes done in the context of a quite evil workload reading > from nfs large files with big read buffers in parallel from many tasks at > the same time until the system goes oom. Mostly all of these fixes seems to be > required to fix the customer workload on top of an older sles kernel. The > forward port of the fixes has been already tested successfully on similar evil > workloads. > mainline vanilla running a somewhat simulated workload: [...] Interesting. This seems to demonstrate a need for file IO to handle fatal signals, beyond just people wanting faster responses to kill -9. Perhaps it's the case that fatal signals should always be handled, and there should be no waiting primitives excluding them. __GFP_NOFAIL is also "interesting." -- wli -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH 00 of 16] OOM related fixes 2007-06-08 21:26 ` [PATCH 00 of 16] OOM related fixes William Lee Irwin III @ 2007-06-09 14:55 ` Andrea Arcangeli 2007-06-12 8:58 ` Petr Tesarik 0 siblings, 1 reply; 77+ messages in thread From: Andrea Arcangeli @ 2007-06-09 14:55 UTC (permalink / raw) To: William Lee Irwin III; +Cc: linux-mm, Petr Tesarik Hi Wil, On Fri, Jun 08, 2007 at 02:26:10PM -0700, William Lee Irwin III wrote: > Interesting. This seems to demonstrate a need for file IO to handle > fatal signals, beyond just people wanting faster responses to kill -9. > Perhaps it's the case that fatal signals should always be handled, and > there should be no waiting primitives excluding them. __GFP_NOFAIL is > also "interesting." Clearly the sooner we respond to a SIGKILL the better. We tried to catch the two critical points to solve the evil read(huge)->oom. BTW, the first suggestion that we had to also break out of read to make progress substantially quicker, was from Petr so I'm cc'ing him. I'm unsure what else of more generic we could do to solve more of those troubles at the same time without having to pollute the code with sigkill checks. For example we're not yet covering the o-direct paths but I did the minimal changes to resolve the current workload and that used buffered io of course ;). BTW, I could have checked the TIF_MEMDIE instead of seeing if sigkill was pending, but since I had to check the task structure anyway, I preferred to check for the sigkill so that kill -9 will now work for the first time against a large read/write syscall, besides allowing the TIF_MEMDIE task to exit in reasonable time without triggering the deadlock detection in the later patches. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
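A sketch of the kind of check being discussed; the series' actual changes are in patches 02 and 03 (not quoted here), and task_sigkill_pending() below is an invented helper name:

#include <linux/sched.h>
#include <linux/signal.h>

/* true once SIGKILL has been queued for the task */
static inline int task_sigkill_pending(struct task_struct *t)
{
	return signal_pending(t) && sigismember(&t->pending.signal, SIGKILL);
}

/*
 * Inside a long per-page read()/write() loop, something like this lets
 * a killed (possibly TIF_MEMDIE) task stop allocating and reach
 * do_exit() instead of looping until the whole buffer is transferred:
 *
 *	if (unlikely(task_sigkill_pending(current))) {
 *		if (!retval)
 *			retval = -EINTR;
 *		break;
 *	}
 */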
* Re: [PATCH 00 of 16] OOM related fixes 2007-06-09 14:55 ` Andrea Arcangeli @ 2007-06-12 8:58 ` Petr Tesarik 0 siblings, 0 replies; 77+ messages in thread From: Petr Tesarik @ 2007-06-12 8:58 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: William Lee Irwin III, linux-mm -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Andrea Arcangeli wrote: > Hi Wil, > > On Fri, Jun 08, 2007 at 02:26:10PM -0700, William Lee Irwin III wrote: >> Interesting. This seems to demonstrate a need for file IO to handle >> fatal signals, beyond just people wanting faster responses to kill -9. >> Perhaps it's the case that fatal signals should always be handled, and >> there should be no waiting primitives excluding them. __GFP_NOFAIL is >> also "interesting." > > Clearly the sooner we respond to a SIGKILL the better. We tried to > catch the two critical points to solve the evil read(huge)->oom. BTW, > the first suggestion that we had to also break out of read to make > progress substantially quicker, was from Petr so I'm cc'ing him. I'm Late as always... :(( It's not only about getting it quicker - the loop wouldn't break until the whole chunk has been read, which couldn't be accomplished until some memory was freed first, but the memory would be freed by killing this task which wouldn't terminate until everything is read, and so on... We obviously need to break the vicious circle somewhere. If we want to resolve all such cases we would have to ensure that delivering a SIGKILL can't fail on OOM conditions, i.e. that SIGKILL can always be handled without memory allocation. I'm planning to do some investigations on which places in the kernel are (worst) affected and then think about ways of fixing them. I don't expect we can fix them all, or at least not in the first round, but this looks like the only way to go... Cheers, Petr Tesarik > unsure what else of more generic we could do to solve more of those > troubles at the same time without having to pollute the code with > sigkill checks. For example we're not yet covering the o-direct paths > but I did the minimal changes to resolve the current workload and that > used buffered io of course ;). BTW, I could have checked the > TIF_MEMDIE instead of seeing if sigkill was pending, but since I had > to check the task structure anyway, I preferred to check for the > sigkill so that kill -9 will now work for the first time against a > large read/write syscall, besides allowing the TIF_MEMDIE task to exit > in reasonable time without triggering the deadlock detection in the > later patches. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.6 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFGbmBRjpY2ODFi2ogRAseoAKCV+rX+PTmdGdjmjdObBwmdYDlqXACfXI9f BT5dOXg5qPVhH7Wj/DlHCP4= =ZlW9 -----END PGP SIGNATURE----- -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
Thread overview: 77+ messages
2007-06-08 20:02 [PATCH 00 of 16] OOM related fixes Andrea Arcangeli
2007-06-08 20:02 ` [PATCH 01 of 16] remove nr_scan_inactive/active Andrea Arcangeli
2007-06-10 17:36 ` Rik van Riel
2007-06-10 18:17 ` Andrea Arcangeli
2007-06-11 14:58 ` Rik van Riel
2007-06-26 17:08 ` Rik van Riel
2007-06-26 17:55 ` Andrew Morton
2007-06-26 19:02 ` Rik van Riel
2007-06-28 22:44 ` Rik van Riel
2007-06-28 22:57 ` Andrew Morton
2007-06-28 23:04 ` Rik van Riel
2007-06-28 23:13 ` Andrew Morton
2007-06-28 23:16 ` Rik van Riel
2007-06-28 23:29 ` Andrew Morton
2007-06-29 0:00 ` Rik van Riel
2007-06-29 0:19 ` Andrew Morton
2007-06-29 0:45 ` Rik van Riel
2007-06-29 1:12 ` Andrew Morton
2007-06-29 1:20 ` Rik van Riel
2007-06-29 1:29 ` Andrew Morton
2007-06-28 23:25 ` Andrea Arcangeli
2007-06-29 0:12 ` Andrew Morton
2007-06-29 13:38 ` Lee Schermerhorn
2007-06-29 14:12 ` Andrea Arcangeli
2007-06-29 14:59 ` Rik van Riel
2007-06-29 22:39 ` "Noreclaim Infrastructure" [was Re: [PATCH 01 of 16] remove nr_scan_inactive/active] Lee Schermerhorn
2007-06-29 22:42 ` RFC "Noreclaim Infrastructure - patch 1/3 basic infrastructure" Lee Schermerhorn
2007-06-29 22:44 ` RFC "Noreclaim Infrastructure patch 2/3 - noreclaim statistics..." Lee Schermerhorn
2007-06-29 22:49 ` "Noreclaim - client patch 3/3 - treat pages w/ excessively references anon_vma as nonreclaimable" Lee Schermerhorn
2007-06-26 20:37 ` [PATCH 01 of 16] remove nr_scan_inactive/active Andrea Arcangeli
2007-06-26 20:57 ` Rik van Riel
2007-06-26 22:21 ` Andrea Arcangeli
2007-06-08 20:03 ` [PATCH 02 of 16] avoid oom deadlock in nfs_create_request Andrea Arcangeli
2007-06-10 17:38 ` Rik van Riel
2007-06-10 18:27 ` Andrea Arcangeli
2007-06-08 20:03 ` [PATCH 03 of 16] prevent oom deadlocks during read/write operations Andrea Arcangeli
2007-06-08 20:03 ` [PATCH 04 of 16] serialize oom killer Andrea Arcangeli
2007-06-09 6:43 ` Peter Zijlstra
2007-06-09 15:27 ` Andrea Arcangeli
2007-06-08 20:03 ` [PATCH 05 of 16] avoid selecting already killed tasks Andrea Arcangeli
2007-06-08 20:03 ` [PATCH 06 of 16] reduce the probability of an OOM livelock Andrea Arcangeli
2007-06-08 20:03 ` [PATCH 07 of 16] balance_pgdat doesn't return the number of pages freed Andrea Arcangeli
2007-06-08 20:03 ` [PATCH 08 of 16] don't depend on PF_EXITING tasks to go away Andrea Arcangeli
2007-06-08 20:03 ` [PATCH 09 of 16] fallback killing more tasks if tif-memdie doesn't " Andrea Arcangeli
2007-06-08 21:57 ` Christoph Lameter
2007-06-08 20:03 ` [PATCH 10 of 16] stop useless vm trashing while we wait the TIF_MEMDIE task to exit Andrea Arcangeli
2007-06-08 21:48 ` Christoph Lameter
2007-06-09 1:59 ` Andrea Arcangeli
2007-06-09 3:01 ` Christoph Lameter
2007-06-09 14:05 ` Andrea Arcangeli
2007-06-09 14:38 ` Andrea Arcangeli
2007-06-11 16:07 ` Christoph Lameter
2007-06-11 16:50 ` Andrea Arcangeli
2007-06-11 16:57 ` Christoph Lameter
2007-06-11 17:51 ` Andrea Arcangeli
2007-06-11 17:56 ` Christoph Lameter
2007-06-11 18:22 ` Andrea Arcangeli
2007-06-11 18:39 ` Christoph Lameter
2007-06-11 18:58 ` Andrea Arcangeli
2007-06-11 19:25 ` Christoph Lameter
2007-06-11 16:04 ` Christoph Lameter
2007-06-08 20:03 ` [PATCH 11 of 16] the oom schedule timeout isn't needed with the VM_is_OOM logic Andrea Arcangeli
2007-06-08 20:03 ` [PATCH 12 of 16] show mem information only when a task is actually being killed Andrea Arcangeli
2007-06-08 20:03 ` [PATCH 13 of 16] simplify oom heuristics Andrea Arcangeli
2007-06-08 20:03 ` [PATCH 14 of 16] oom select should only take rss into account Andrea Arcangeli
2007-06-10 17:17 ` Rik van Riel
2007-06-10 17:30 ` Andrea Arcangeli
2007-06-08 20:03 ` [PATCH 15 of 16] limit reclaim if enough pages have been freed Andrea Arcangeli
2007-06-10 17:20 ` Rik van Riel
2007-06-10 17:32 ` Andrea Arcangeli
2007-06-10 17:52 ` Rik van Riel
2007-06-11 16:23 ` Christoph Lameter
2007-06-11 16:57 ` Rik van Riel
2007-06-08 20:03 ` [PATCH 16 of 16] avoid some lock operation in vm fast path Andrea Arcangeli
2007-06-08 21:26 ` [PATCH 00 of 16] OOM related fixes William Lee Irwin III
2007-06-09 14:55 ` Andrea Arcangeli
2007-06-12 8:58 ` Petr Tesarik