* dbench on tmpfs OOM's
@ 2002-09-17 4:43 William Lee Irwin III
2002-09-17 4:58 ` Andrew Morton
0 siblings, 1 reply; 15+ messages in thread
From: William Lee Irwin III @ 2002-09-17 4:43 UTC (permalink / raw)
To: linux-mm; +Cc: akpm, hugh, linux-kernel
My machine OOM'd during a run of two simultaneous dbench 512s on two
separate 12GB tmpfs filesystems, on a 32x NUMA-Q with 32GB of RAM
running 2.5.35 plus minimalistic booting fixes, with no swap, and with
my recently posted false-OOM fix (though that is perhaps not the most
desirable fix).
Note: gfp_mask == __GFP_FS | __GFP_HIGHIO | __GFP_IO | __GFP_WAIT
... the __GFP_FS checks can't save us here.
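(For reference: 464 == 0x1d0, which decodes to exactly those four flags,
i.e. GFP_USER. A user-space sketch, assuming the 2.5-era flag values --
the constants below are an assumption for illustration, not quoted from
the tree:)

#include <stdio.h>

/* Assumed 2.5.35-era gfp flag values, for decoding only. */
#define __GFP_WAIT	0x10u	/* may sleep */
#define __GFP_HIGH	0x20u	/* may dip into emergency pools */
#define __GFP_IO	0x40u	/* may start low-level I/O */
#define __GFP_HIGHIO	0x80u	/* may do I/O on highmem pages */
#define __GFP_FS	0x100u	/* may call into the fs layer */

int main(void)
{
	unsigned int mask = 464;	/* == 0x1d0, from frame #5 above */

	printf("WAIT=%d IO=%d HIGHIO=%d FS=%d HIGH=%d\n",
	       !!(mask & __GFP_WAIT), !!(mask & __GFP_IO),
	       !!(mask & __GFP_HIGHIO), !!(mask & __GFP_FS),
	       !!(mask & __GFP_HIGH));	/* prints 1 1 1 1 0: GFP_USER */
	return 0;
}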
Cheers,
Bill
#1 0xc013ce2e in oom_kill () at oom_kill.c:182
#2 0xc013cee8 in out_of_memory () at oom_kill.c:250
#3 0xc0138bec in try_to_free_pages (classzone=0xc0389300, gfp_mask=464,
order=0) at vmscan.c:561
#4 0xc013989b in balance_classzone (classzone=0xc0389300, gfp_mask=464,
order=0, freed=0xf6169ed0) at page_alloc.c:278
#5 0xc0139b97 in __alloc_pages (gfp_mask=464, order=0, zonelist=0xc02a3e94)
at page_alloc.c:401
#6 0xc013cb57 in alloc_pages_pgdat (pgdat=0xc0389000, gfp_mask=464, order=0)
at numa.c:77
#7 0xc013cba3 in _alloc_pages (gfp_mask=464, order=0) at numa.c:105
#8 0xc0139c23 in get_zeroed_page (gfp_mask=45) at page_alloc.c:442
#9 0xc013db12 in shmem_getpage_locked (info=0xd950f900, inode=0xd950f978,
idx=16) at shmem.c:205
#10 0xc013e6ea in shmem_file_write (file=0xf45dd6c0,
buf=0x804bda1 '\001' <repeats 200 times>..., count=65474, ppos=0xf45dd6e0)
at shmem.c:957
#11 0xc014553c in vfs_write (file=0xf45dd6c0,
buf=0x804bda0 '\001' <repeats 200 times>..., count=65475, pos=0xf45dd6e0)
at read_write.c:216
#12 0xc014561e in sys_write (fd=6,
buf=0x804bda0 '\001' <repeats 200 times>..., count=65475)
at read_write.c:246
#13 0xc010771f in syscall_call () at process.c:966
MemTotal: 32107256 kB
MemFree: 27564648 kB
MemShared: 0 kB
Cached: 4196528 kB
SwapCached: 0 kB
Active: 1924400 kB
Inactive: 2381404 kB
HighTotal: 31588352 kB
HighFree: 27563224 kB
LowTotal: 518904 kB
LowFree: 1424 kB
SwapTotal: 0 kB
SwapFree: 0 kB
Dirty: 0 kB
Writeback: 0 kB
Committed_AS: 4330600 kB
PageTables: 12804 kB
ReverseMaps: 133841
TLB flushes: 1709
non flushes: 1093
* Re: dbench on tmpfs OOM's
2002-09-17 4:43 dbench on tmpfs OOM's William Lee Irwin III
@ 2002-09-17 4:58 ` Andrew Morton
2002-09-17 5:01 ` Martin J. Bligh
2002-09-17 5:15 ` William Lee Irwin III
0 siblings, 2 replies; 15+ messages in thread
From: Andrew Morton @ 2002-09-17 4:58 UTC (permalink / raw)
To: William Lee Irwin III; +Cc: linux-mm, akpm, hugh, linux-kernel

William Lee Irwin III wrote:
>
> ...
> MemTotal: 32107256 kB
> MemFree: 27564648 kB

I'd be suspecting that your node fallback is bust.

Suggest you add a call to show_free_areas() somewhere; consider
exposing the full per-zone status via /proc with a proper patch.
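(For concreteness, the suggestion might look like the sketch below.
The call site inside mm/oom_kill.c is hypothetical, and the real
out_of_memory() carries rate-limiting logic omitted here;
show_free_areas() itself is the existing mm/page_alloc.c helper:)

#include <linux/mm.h>
#include <linux/swap.h>

/* Untested sketch: dump the per-zone picture before picking a victim. */
void out_of_memory(void)
{
	show_free_areas();	/* free/min/low/high for every zone */
	oom_kill();
}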
* Re: dbench on tmpfs OOM's
2002-09-17 4:58 ` Andrew Morton
@ 2002-09-17 5:01 ` Martin J. Bligh
2002-09-17 5:14 ` Andrew Morton
2002-09-17 5:15 ` William Lee Irwin III
1 sibling, 1 reply; 15+ messages in thread
From: Martin J. Bligh @ 2002-09-17 5:01 UTC (permalink / raw)
To: Andrew Morton, William Lee Irwin III; +Cc: linux-mm, akpm, hugh, linux-kernel

>> ...
>> MemTotal: 32107256 kB
>> MemFree: 27564648 kB
>
> I'd be suspecting that your node fallback is bust.
>
> Suggest you add a call to show_free_areas() somewhere; consider
> exposing the full per-zone status via /proc with a proper patch.

Won't /proc/meminfo.numa show that? Or do you mean something
else by "full per-zone status"?

Looks to me like it's just out of low memory:

> LowFree: 1424 kB

There is no low memory on anything but node 0 ...

M.
* Re: dbench on tmpfs OOM's
2002-09-17 5:01 ` Martin J. Bligh
@ 2002-09-17 5:14 ` Andrew Morton
2002-09-17 5:18 ` Martin J. Bligh
0 siblings, 1 reply; 15+ messages in thread
From: Andrew Morton @ 2002-09-17 5:14 UTC (permalink / raw)
To: Martin J. Bligh; +Cc: William Lee Irwin III, linux-mm, akpm, hugh, linux-kernel

"Martin J. Bligh" wrote:
>
> >> ...
> >> MemTotal: 32107256 kB
> >> MemFree: 27564648 kB
> >
> > I'd be suspecting that your node fallback is bust.
> >
> > Suggest you add a call to show_free_areas() somewhere; consider
> > exposing the full per-zone status via /proc with a proper patch.
>
> Won't /proc/meminfo.numa show that? Or do you mean something
> else by "full per-zone status"?

meminfo.what? Remember when I suggested that you put
a testing mode into the numa code so that mortals could
run numa builds on non-numa boxes?

> Looks to me like it's just out of low memory:
>
> > LowFree: 1424 kB
>
> There is no low memory on anything but node 0 ...

It was a GFP_HIGH allocation - just pagecache.
* Re: dbench on tmpfs OOM's
2002-09-17 5:14 ` Andrew Morton
@ 2002-09-17 5:18 ` Martin J. Bligh
0 siblings, 0 replies; 15+ messages in thread
From: Martin J. Bligh @ 2002-09-17 5:18 UTC (permalink / raw)
To: Andrew Morton; +Cc: William Lee Irwin III, linux-mm, akpm, hugh, linux-kernel

> meminfo.what? Remember when I suggested that you put
> a testing mode into the numa code so that mortals could
> run numa builds on non-numa boxes?

NUMA-aware meminfo is one of the patches you have sitting in your
tree. I haven't got around to the NUMA-sim yet ... maybe after
Halloween, when management stop asking me to get other bits of code
in before the freeze ;-)

mbligh@larry:~$ cat /proc/meminfo.numa
Node 0 MemTotal: 4194304 kB
Node 0 MemFree: 3420660 kB
Node 0 MemUsed: 773644 kB
Node 0 HighTotal: 3418112 kB
Node 0 HighFree: 2737992 kB
Node 0 LowTotal: 776192 kB
Node 0 LowFree: 682668 kB
Node 1 MemTotal: 4147200 kB
Node 1 MemFree: 4116444 kB
Node 1 MemUsed: 30756 kB
Node 1 HighTotal: 4147200 kB
Node 1 HighFree: 4116444 kB
Node 1 LowTotal: 0 kB
Node 1 LowFree: 0 kB
Node 2 MemTotal: 4147200 kB
Node 2 MemFree: 4131816 kB
Node 2 MemUsed: 15384 kB
Node 2 HighTotal: 4147200 kB
Node 2 HighFree: 4131816 kB
Node 2 LowTotal: 0 kB
Node 2 LowFree: 0 kB
Node 3 MemTotal: 4147200 kB
Node 3 MemFree: 4128432 kB
Node 3 MemUsed: 18768 kB
Node 3 HighTotal: 4147200 kB
Node 3 HighFree: 4128432 kB
Node 3 LowTotal: 0 kB
Node 3 LowFree: 0 kB

>> Looks to me like it's just out of low memory:
>>
>> > LowFree: 1424 kB
>>
>> There is no low memory on anything but node 0 ...
>
> It was a GFP_HIGH allocation - just pagecache.

Ah, but what does balance_classzone do on a NUMA box? Once you've
finished rototilling the code you're looking at, I think we might
have a better clue what it's supposed to do, at least ...

M.
* Re: dbench on tmpfs OOM's
2002-09-17 4:58 ` Andrew Morton
2002-09-17 5:01 ` Martin J. Bligh
@ 2002-09-17 5:15 ` William Lee Irwin III
2002-09-17 5:31 ` Andrew Morton
1 sibling, 1 reply; 15+ messages in thread
From: William Lee Irwin III @ 2002-09-17 5:15 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-mm, akpm, hugh, linux-kernel

William Lee Irwin III wrote:
>> MemTotal: 32107256 kB
>> MemFree: 27564648 kB

On Mon, Sep 16, 2002 at 09:58:43PM -0700, Andrew Morton wrote:
> I'd be suspecting that your node fallback is bust.
> Suggest you add a call to show_free_areas() somewhere; consider
> exposing the full per-zone status via /proc with a proper patch.

I went through the nodes by hand. It's just a run-of-the-mill
ZONE_NORMAL OOM coming out of the GFP_USER allocation. None of
the highmem zones were anywhere near ->pages_low.

Cheers,
Bill
* Re: dbench on tmpfs OOM's
2002-09-17 5:15 ` William Lee Irwin III
@ 2002-09-17 5:31 ` Andrew Morton
2002-09-17 6:43 ` Christoph Rohland
2002-09-17 7:01 ` Hugh Dickins
0 siblings, 2 replies; 15+ messages in thread
From: Andrew Morton @ 2002-09-17 5:31 UTC (permalink / raw)
To: William Lee Irwin III; +Cc: linux-mm, hugh, linux-kernel

William Lee Irwin III wrote:
>
> William Lee Irwin III wrote:
> >> MemTotal: 32107256 kB
> >> MemFree: 27564648 kB
>
> On Mon, Sep 16, 2002 at 09:58:43PM -0700, Andrew Morton wrote:
> > I'd be suspecting that your node fallback is bust.
> > Suggest you add a call to show_free_areas() somewhere; consider
> > exposing the full per-zone status via /proc with a proper patch.
>
> I went through the nodes by hand. It's just a run-of-the-mill
> ZONE_NORMAL OOM coming out of the GFP_USER allocation. None of
> the highmem zones were anywhere near ->pages_low.

erk. Why is shmem using GFP_USER?

mnm:/usr/src/25> grep page_address mm/shmem.c
mnm:/usr/src/25>
* Re: dbench on tmpfs OOM's
2002-09-17 5:31 ` Andrew Morton
@ 2002-09-17 6:43 ` Christoph Rohland
2002-09-17 7:01 ` Hugh Dickins
1 sibling, 0 replies; 15+ messages in thread
From: Christoph Rohland @ 2002-09-17 6:43 UTC (permalink / raw)
To: Andrew Morton; +Cc: William Lee Irwin III, linux-mm, hugh, linux-kernel

Hi Andrew,

Andrew Morton wrote:
> William Lee Irwin III wrote:
>>I went through the nodes by hand. It's just a run-of-the-mill
>>ZONE_NORMAL OOM coming out of the GFP_USER allocation. None of
>>the highmem zones were anywhere near ->pages_low.
>
> erk. Why is shmem using GFP_USER?
>
> mnm:/usr/src/25> grep page_address mm/shmem.c

For inode and page vector allocation.

Greetings
Christoph
* Re: dbench on tmpfs OOM's
2002-09-17 5:31 ` Andrew Morton
2002-09-17 6:43 ` Christoph Rohland
@ 2002-09-17 7:01 ` Hugh Dickins
2002-09-17 7:27 ` William Lee Irwin III
` (2 more replies)
1 sibling, 3 replies; 15+ messages in thread
From: Hugh Dickins @ 2002-09-17 7:01 UTC (permalink / raw)
To: Andrew Morton; +Cc: William Lee Irwin III, linux-mm, linux-kernel

On Mon, 16 Sep 2002, Andrew Morton wrote:
> William Lee Irwin III wrote:
> >
> > William Lee Irwin III wrote:
> > >> MemTotal: 32107256 kB
> > >> MemFree: 27564648 kB
> >
> > On Mon, Sep 16, 2002 at 09:58:43PM -0700, Andrew Morton wrote:
> > > I'd be suspecting that your node fallback is bust.
> > > Suggest you add a call to show_free_areas() somewhere; consider
> > > exposing the full per-zone status via /proc with a proper patch.
> >
> > I went through the nodes by hand. It's just a run-of-the-mill
> > ZONE_NORMAL OOM coming out of the GFP_USER allocation. None of
> > the highmem zones were anywhere near ->pages_low.
>
> erk. Why is shmem using GFP_USER?
>
> mnm:/usr/src/25> grep page_address mm/shmem.c
> mnm:/usr/src/25>

shmem uses GFP_USER for the index pages that map its GFP_HIGHUSER
data pages. Not to say there aren't other problems in the mix too,
but Bill's main problem here will be one you discovered a while ago,
Andrew. We fixed it then, but in my loopable tmpfs version, and I've
been slow to extract the fixes and push them to mainline (or now
-mm), since little other than dbench suffers from it.

The problem is that dbench likes to do large random(?) seeks and then
writes at the resulting offset; and although shmem-tmpfs imposes a
cap (default: half of memory) on the data pages, it imposes no cap on
its index pages. So it foolishly ends up filling the normal zone with
empty index blocks for zero-length files, the index page allocation
being done _before_ the data cap check.

I'll rebase the relevant fixes against 2.5.35-mm1 later today,
do a little testing and post the patch.

What I never did was try GFP_HIGHUSER and kmap on the index pages:
I think I decided back then that it wasn't likely to be needed
(sparsely filled file indexes are a rarer case than sparsely filled
pagetables, once the stupidity is fixed; and small files don't use
index pages at all). But Bill's testing may well prove me wrong.

Hugh
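(The shape of the ordering bug Hugh describes, reduced to a toy
user-space model -- illustrative C only, not kernel code; the cap
constant and helper name are invented for the example:)

#include <stdio.h>

/* Toy model: the index block is allocated before the data cap is
 * checked, so each write to a distinct sparse offset still grows the
 * (uncapped) index even when the data allocation is refused. */
static long data_cap = 8;	/* stands in for the tmpfs size= cap */
static long data_used, index_used;

static int write_sparse_page_buggy_order(void)
{
	index_used++;			/* index block: no cap applied */
	if (data_used >= data_cap)
		return -1;		/* data cap checked too late */
	data_used++;
	return 0;
}

int main(void)
{
	int i;

	for (i = 0; i < 1000; i++)
		write_sparse_page_buggy_order();
	/* data stays capped at 8; the index grew to 1000 blocks */
	printf("data=%ld index=%ld\n", data_used, index_used);
	return 0;
}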
* Re: dbench on tmpfs OOM's
2002-09-17 7:01 ` Hugh Dickins
@ 2002-09-17 7:27 ` William Lee Irwin III
2002-09-17 8:02 ` Andrew Morton
0 siblings, 1 reply; 15+ messages in thread
From: William Lee Irwin III @ 2002-09-17 7:27 UTC (permalink / raw)
To: Hugh Dickins; +Cc: Andrew Morton, linux-mm, linux-kernel

On Tue, Sep 17, 2002 at 08:01:20AM +0100, Hugh Dickins wrote:
> shmem uses GFP_USER for the index pages that map its GFP_HIGHUSER
> data pages. Not to say there aren't other problems in the mix too,
> but Bill's main problem here will be one you discovered a while ago,
> Andrew. We fixed it then, but in my loopable tmpfs version, and I've
> been slow to extract the fixes and push them to mainline (or now
> -mm), since little other than dbench suffers from it.
> The problem is that dbench likes to do large random(?) seeks and then
> writes at the resulting offset; and although shmem-tmpfs imposes a
> cap (default: half of memory) on the data pages, it imposes no cap on
> its index pages. So it foolishly ends up filling the normal zone with
> empty index blocks for zero-length files, the index page allocation
> being done _before_ the data cap check.

The extreme configurations of my machines put a great deal of stress
on many codepaths. It shouldn't be regarded as any great failing that
some corrections are required to function properly on them, as the
visibility they give to highmem-related pressures is far, far greater
than seen elsewhere. One thing to bear in mind is that though the
kernel may be functioning correctly in refusing the requests made,
the additional capability of servicing them would be much
appreciated, and likely of good use for the machines in the field we
serve. Although from your general stance on things, it seems I've
little to convince you of.

On Tue, Sep 17, 2002 at 08:01:20AM +0100, Hugh Dickins wrote:
> I'll rebase the relevant fixes against 2.5.35-mm1 later today,
> do a little testing and post the patch.
> What I never did was try GFP_HIGHUSER and kmap on the index pages:
> I think I decided back then that it wasn't likely to be needed
> (sparsely filled file indexes are a rarer case than sparsely filled
> pagetables, once the stupidity is fixed; and small files don't use
> index pages at all). But Bill's testing may well prove me wrong.

I suspected you might well have plans here, as you often have before,
so I held off on brewing up attempts at such myself until you
replied. I'll defer to you.

Thanks,
Bill

P.S.: The original intent of this testing was to obtain a profile of
"best case" behavior not limited by throttling within the block
layer. The method was derived from the observation that though the
on-disk test was seek-bound according to the block layer, as
configured it should have been possible to carry the test out
entirely in-core. Running the test on tmpfs was the method chosen to
eliminate the block-layer throttling. Whatever kind of additional
stress-testing and optimization you have an interest in enabling
tmpfs for, I'd be at least moderately interested in providing, as I
have some interest in using tmpfs to assist with generalized large
page support within the core kernel.
* Re: dbench on tmpfs OOM's
2002-09-17 7:27 ` William Lee Irwin III
@ 2002-09-17 8:02 ` Andrew Morton
2002-09-17 11:38 ` 35-mm1 triggers watchdog Ed Tomlinson
0 siblings, 1 reply; 15+ messages in thread
From: Andrew Morton @ 2002-09-17 8:02 UTC (permalink / raw)
To: William Lee Irwin III; +Cc: Hugh Dickins, linux-mm, linux-kernel

William Lee Irwin III wrote:
>
> ...
> The extreme configurations of my machines put a great deal of stress
> on many codepaths.

Nevertheless, the issue you raised is valid. Should direct reclaim
reach across and reclaim from other nodes, or not? If so, when?

I suspect that it shouldn't. Direct reclaim is increasingly a last
resort and only seems to happen (now) with stresstest-type workloads,
and when there is latency in getting kswapd up and running. I'd
suggest that the balancing of other nodes be left in the hands of
kswapd, and that the fallback logic be isolated to the page
allocation path.

The current kswapd logic is all tangled up with direct reclaim. I've
split this up, so that we can balance these two different functions.
Code passes initial stresstesting; there may be corner cases.

There is some lack of clarity in what kswapd does and what
direct-reclaim tasks do; try_to_free_pages() tries to service both
functions, and they are different.

- kswapd's role is to keep all zones on its node at
  zone->free_pages >= zone->pages_high, and to never stop as long as
  any zone does not meet that condition.

- A direct reclaimer's role is to try to free some pages from the
  zones which are suitable for this particular allocation request,
  and to return when that has been achieved, or when all the relevant
  zones are at zone->free_pages >= zone->pages_high.

The patch explicitly separates these two code paths; kswapd does not
run try_to_free_pages() any more. kswapd should not be aware of zone
fallbacks.
 include/linux/mmzone.h |    1
 mm/page_alloc.c        |    3
 mm/vmscan.c            |  238 +++++++++++++++++++++++--------------------------
 3 files changed, 116 insertions(+), 126 deletions(-)

--- 2.5.35/mm/vmscan.c~per-zone-vm	Tue Sep 17 00:23:57 2002
+++ 2.5.35-akpm/mm/vmscan.c	Tue Sep 17 00:59:47 2002
@@ -109,7 +109,8 @@ struct vmstats {
 	int refiled_nonfreeable;
 	int refiled_no_mapping;
 	int refiled_nofs;
-	int refiled_congested;
+	int refiled_congested_kswapd;
+	int refiled_congested_non_kswapd;
 	int written_back;
 	int refiled_writeback;
 	int refiled_writeback_nofs;
@@ -122,15 +123,19 @@ struct vmstats {
 	int refill_refiled;
 } vmstats;

+
+/*
+ * shrink_list returns the number of reclaimed pages
+ */
 static /* inline */ int
-shrink_list(struct list_head *page_list, int nr_pages,
-	unsigned int gfp_mask, int *max_scan, int *nr_mapped)
+shrink_list(struct list_head *page_list, unsigned int gfp_mask,
+	int *max_scan, int *nr_mapped)
 {
 	struct address_space *mapping;
 	LIST_HEAD(ret_pages);
 	struct pagevec freed_pvec;
-	const int nr_pages_in = nr_pages;
 	int pgactivate = 0;
+	int ret = 0;

 	pagevec_init(&freed_pvec);
 	while (!list_empty(page_list)) {
@@ -268,7 +273,10 @@ shrink_list(struct page_list,
 			bdi = mapping->backing_dev_info;
 			if (bdi != current->backing_dev_info &&
					bdi_write_congested(bdi)){
-				vmstats.refiled_congested++;
+				if (current->flags & PF_KSWAPD)
+					vmstats.refiled_congested_kswapd++;
+				else
+					vmstats.refiled_congested_non_kswapd++;
 				goto keep_locked;
 			}
@@ -336,7 +344,7 @@ shrink_list(struct page_list,
 			__put_page(page);	/* The pagecache ref */
 free_it:
 		unlock_page(page);
-		nr_pages--;
+		ret++;
 		vmstats.reclaimed++;
 		if (!pagevec_add(&freed_pvec, page))
 			__pagevec_release_nonlru(&freed_pvec);
@@ -354,11 +362,11 @@ keep:
 	list_splice(&ret_pages, page_list);
 	if (pagevec_count(&freed_pvec))
 		__pagevec_release_nonlru(&freed_pvec);
-	mod_page_state(pgsteal, nr_pages_in - nr_pages);
+	mod_page_state(pgsteal, ret);
 	if (current->flags & PF_KSWAPD)
-		mod_page_state(kswapd_steal, nr_pages_in - nr_pages);
+		mod_page_state(kswapd_steal, ret);
 	mod_page_state(pgactivate, pgactivate);
-	return nr_pages;
+	return ret;
 }

 /*
@@ -367,18 +375,19 @@ keep:
  * not freed will be added back to the LRU.
  *
  * shrink_cache() is passed the number of pages to try to free, and returns
- * the number which are yet-to-free.
+ * the number of pages which were reclaimed.
  *
  * For pagecache intensive workloads, the first loop here is the hottest spot
  * in the kernel (apart from the copy_*_user functions).
  */
 static /* inline */ int
-shrink_cache(int nr_pages, struct zone *zone,
+shrink_cache(const int nr_pages, struct zone *zone,
 		unsigned int gfp_mask, int max_scan, int *nr_mapped)
 {
 	LIST_HEAD(page_list);
 	struct pagevec pvec;
 	int nr_to_process;
+	int ret = 0;

 	/*
 	 * Try to ensure that we free `nr_pages' pages in one pass of the loop.
@@ -391,10 +400,11 @@ shrink_cache(int nr_pages, struct zone *
 	lru_add_drain();
 	spin_lock_irq(&zone->lru_lock);
-	while (max_scan > 0 && nr_pages > 0) {
+	while (max_scan > 0 && ret < nr_pages) {
 		struct page *page;
 		int nr_taken = 0;
 		int nr_scan = 0;
+		int nr_freed;

 		while (nr_scan++ < nr_to_process &&
 				!list_empty(&zone->inactive_list)) {
@@ -425,10 +435,10 @@ shrink_cache(int nr_pages, struct zone *
 		max_scan -= nr_scan;
 		mod_page_state(pgscan, nr_scan);
-		nr_pages = shrink_list(&page_list, nr_pages,
-				gfp_mask, &max_scan, nr_mapped);
-
-		if (nr_pages <= 0 && list_empty(&page_list))
+		nr_freed = shrink_list(&page_list, gfp_mask,
+					&max_scan, nr_mapped);
+		ret += nr_freed;
+		if (nr_freed <= 0 && list_empty(&page_list))
 			goto done;

 		spin_lock_irq(&zone->lru_lock);
@@ -454,7 +464,7 @@ shrink_cache(int nr_pages, struct zone *
 	spin_unlock_irq(&zone->lru_lock);
 done:
 	pagevec_release(&pvec);
-	return nr_pages;
+	return ret;
 }

 /*
@@ -578,9 +588,14 @@ refill_inactive_zone(struct zone *zone,
 	mod_page_state(pgdeactivate, pgdeactivate);
 }

+/*
+ * Try to reclaim `nr_pages' from this zone.  Returns the number of reclaimed
+ * pages.  This is a basic per-zone page freer.  Used by both kswapd and
+ * direct reclaim.
+ */
 static /* inline */ int
-shrink_zone(struct zone *zone, int max_scan,
-		unsigned int gfp_mask, int nr_pages, int *nr_mapped)
+shrink_zone(struct zone *zone, int max_scan, unsigned int gfp_mask,
+	const int nr_pages, int *nr_mapped)
 {
 	unsigned long ratio;

@@ -601,36 +616,60 @@ shrink_zone(struct zone *zone, int max_s
 		atomic_sub(SWAP_CLUSTER_MAX, &zone->refill_counter);
 		refill_inactive_zone(zone, SWAP_CLUSTER_MAX);
 	}
-	nr_pages = shrink_cache(nr_pages, zone, gfp_mask,
-				max_scan, nr_mapped);
-	return nr_pages;
+	return shrink_cache(nr_pages, zone, gfp_mask, max_scan, nr_mapped);
+}
+
+/*
+ * FIXME: don't do this for ZONE_HIGHMEM
+ */
+/*
+ * Here we assume it costs one seek to replace a lru page and that it also
+ * takes a seek to recreate a cache object.  With this in mind we age equal
+ * percentages of the lru and ageable caches.  This should balance the seeks
+ * generated by these structures.
+ *
+ * NOTE: for now I do this for all zones.  If we find this is too aggressive
+ * on large boxes we may want to exclude ZONE_HIGHMEM.
+ *
+ * If we're encountering mapped pages on the LRU then increase the pressure on
+ * slab to avoid swapping.
+ */
+static void shrink_slab(int total_scanned, int gfp_mask)
+{
+	int shrink_ratio;
+	int pages = nr_used_zone_pages();
+
+	shrink_ratio = (pages / (total_scanned + 1)) + 1;
+	shrink_dcache_memory(shrink_ratio, gfp_mask);
+	shrink_icache_memory(shrink_ratio, gfp_mask);
+	shrink_dqcache_memory(shrink_ratio, gfp_mask);
 }

+/*
+ * This is the direct reclaim path, for page-allocating processes.  We only
+ * try to reclaim pages from zones which will satisfy the caller's allocation
+ * request.
+ */
 static int
 shrink_caches(struct zone *classzone, int priority,
-		int *total_scanned, int gfp_mask, int nr_pages)
+		int *total_scanned, int gfp_mask, const int nr_pages)
 {
 	struct zone *first_classzone;
 	struct zone *zone;
-	int ratio;
 	int nr_mapped = 0;
-	int pages = nr_used_zone_pages();
+	int ret = 0;

 	first_classzone = classzone->zone_pgdat->node_zones;
 	for (zone = classzone; zone >= first_classzone; zone--) {
 		int max_scan;
 		int to_reclaim;
-		int unreclaimed;

 		to_reclaim = zone->pages_high - zone->free_pages;
 		if (to_reclaim < 0)
 			continue;	/* zone has enough memory */
-		if (to_reclaim > SWAP_CLUSTER_MAX)
-			to_reclaim = SWAP_CLUSTER_MAX;
-
-		if (to_reclaim < nr_pages)
-			to_reclaim = nr_pages;
+		to_reclaim = min(to_reclaim, SWAP_CLUSTER_MAX);
+		to_reclaim = max(to_reclaim, nr_pages);

 		/*
 		 * If we cannot reclaim `nr_pages' pages by scanning twice
@@ -639,33 +678,18 @@ shrink_caches(struct zone *classzone, in
 		max_scan = zone->nr_inactive >> priority;
 		if (max_scan < to_reclaim * 2)
 			max_scan = to_reclaim * 2;
-		unreclaimed = shrink_zone(zone, max_scan,
-				gfp_mask, to_reclaim, &nr_mapped);
-		nr_pages -= to_reclaim - unreclaimed;
+		ret += shrink_zone(zone, max_scan, gfp_mask,
+				to_reclaim, &nr_mapped);
 		*total_scanned += max_scan;
+		*total_scanned += nr_mapped;
+		if (ret >= nr_pages)
+			break;
 	}
-
-	/*
-	 * Here we assume it costs one seek to replace a lru page and that
-	 * it also takes a seek to recreate a cache object.  With this in
-	 * mind we age equal percentages of the lru and ageable caches.
-	 * This should balance the seeks generated by these structures.
-	 *
-	 * NOTE: for now I do this for all zones.  If we find this is too
-	 * aggressive on large boxes we may want to exclude ZONE_HIGHMEM
-	 *
-	 * If we're encountering mapped pages on the LRU then increase the
-	 * pressure on slab to avoid swapping.
-	 */
-	ratio = (pages / (*total_scanned + nr_mapped + 1)) + 1;
-	shrink_dcache_memory(ratio, gfp_mask);
-	shrink_icache_memory(ratio, gfp_mask);
-	shrink_dqcache_memory(ratio, gfp_mask);
-	return nr_pages;
+	return ret;
 }

 /*
- * This is the main entry point to page reclaim.
+ * This is the main entry point to direct page reclaim.
  *
  * If a full scan of the inactive list fails to free enough memory then we
  * are "out of memory" and something needs to be killed.
@@ -685,17 +709,18 @@
 int
 try_to_free_pages(struct zone *classzone,
 		unsigned int gfp_mask, unsigned int order)
 {
-	int priority = DEF_PRIORITY;
-	int nr_pages = SWAP_CLUSTER_MAX;
+	int priority;
+	const int nr_pages = SWAP_CLUSTER_MAX;
+	int nr_reclaimed = 0;

 	inc_page_state(pageoutrun);

 	for (priority = DEF_PRIORITY; priority; priority--) {
 		int total_scanned = 0;

-		nr_pages = shrink_caches(classzone, priority, &total_scanned,
-					gfp_mask, nr_pages);
-		if (nr_pages <= 0)
+		nr_reclaimed += shrink_caches(classzone, priority,
+					&total_scanned, gfp_mask, nr_pages);
+		if (nr_reclaimed >= nr_pages)
 			return 1;
 		if (total_scanned == 0)
 			return 1;	/* All zones had enough free memory */
@@ -710,62 +735,46 @@ try_to_free_pages(struct zone *classzone
 		/* Take a nap, wait for some writeback to complete */
 		blk_congestion_wait(WRITE, HZ/4);
+		shrink_slab(total_scanned, gfp_mask);
 	}
 	if (gfp_mask & __GFP_FS)
 		out_of_memory();
 	return 0;
 }

-static int check_classzone_need_balance(struct zone *classzone)
+/*
+ * kswapd will work across all this node's zones until they are all at
+ * pages_high.
+ */
+static void kswapd_balance_pgdat(pg_data_t *pgdat)
 {
-	struct zone *first_classzone;
+	int priority = DEF_PRIORITY;
+	int i;

-	first_classzone = classzone->zone_pgdat->node_zones;
-	while (classzone >= first_classzone) {
-		if (classzone->free_pages > classzone->pages_high)
-			return 0;
-		classzone--;
-	}
-	return 1;
-}
+	for (priority = DEF_PRIORITY; priority; priority--) {
+		int success = 1;

-static int kswapd_balance_pgdat(pg_data_t * pgdat)
-{
-	int need_more_balance = 0, i;
-	struct zone *zone;
+		for (i = 0; i < pgdat->nr_zones; i++) {
+			struct zone *zone = pgdat->node_zones + i;
+			int nr_mapped = 0;
+			int max_scan;
+			int to_reclaim;

-	for (i = pgdat->nr_zones-1; i >= 0; i--) {
-		zone = pgdat->node_zones + i;
-		cond_resched();
-		if (!zone->need_balance)
-			continue;
-		if (!try_to_free_pages(zone, GFP_KSWAPD, 0)) {
-			zone->need_balance = 0;
-			__set_current_state(TASK_INTERRUPTIBLE);
-			schedule_timeout(HZ);
-			continue;
+			to_reclaim = zone->pages_high - zone->free_pages;
+			if (to_reclaim <= 0)
+				continue;
+			success = 0;
+			max_scan = zone->nr_inactive >> priority;
+			if (max_scan < to_reclaim * 2)
+				max_scan = to_reclaim * 2;
+			shrink_zone(zone, max_scan, GFP_KSWAPD,
+					to_reclaim, &nr_mapped);
+			shrink_slab(max_scan + nr_mapped, GFP_KSWAPD);
 		}
-		if (check_classzone_need_balance(zone))
-			need_more_balance = 1;
-		else
-			zone->need_balance = 0;
-	}
-
-	return need_more_balance;
-}
-
-static int kswapd_can_sleep_pgdat(pg_data_t * pgdat)
-{
-	struct zone *zone;
-	int i;
-
-	for (i = pgdat->nr_zones-1; i >= 0; i--) {
-		zone = pgdat->node_zones + i;
-		if (zone->need_balance)
-			return 0;
+		if (success)
+			break;	/* All zones are at pages_high */
+		blk_congestion_wait(WRITE, HZ/4);
 	}
-
-	return 1;
 }

 /*
@@ -785,7 +794,7 @@
 int kswapd(void *p)
 {
 	pg_data_t *pgdat = (pg_data_t*)p;
 	struct task_struct *tsk = current;
-	DECLARE_WAITQUEUE(wait, tsk);
+	DEFINE_WAIT(wait);

 	daemonize();
 	set_cpus_allowed(tsk, __node_to_cpu_mask(p->node_id));
@@ -806,27 +815,12 @@ int kswapd(void *p)
 	 */
 	tsk->flags |= PF_MEMALLOC|PF_KSWAPD;

-	/*
-	 * Kswapd main loop.
-	 */
-	for (;;) {
+	for ( ; ; ) {
 		if (current->flags & PF_FREEZE)
 			refrigerator(PF_IOTHREAD);
-		__set_current_state(TASK_INTERRUPTIBLE);
-		add_wait_queue(&pgdat->kswapd_wait, &wait);
-
-		mb();
-		if (kswapd_can_sleep_pgdat(pgdat))
-			schedule();
-
-		__set_current_state(TASK_RUNNING);
-		remove_wait_queue(&pgdat->kswapd_wait, &wait);
-
-		/*
-		 * If we actually get into a low-memory situation,
-		 * the processes needing more memory will wake us
-		 * up on a more timely basis.
-		 */
+		prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
+		schedule();
+		finish_wait(&pgdat->kswapd_wait, &wait);
 		kswapd_balance_pgdat(pgdat);
 		blk_run_queues();
 	}
--- 2.5.35/mm/page_alloc.c~per-zone-vm	Tue Sep 17 00:23:57 2002
+++ 2.5.35-akpm/mm/page_alloc.c	Tue Sep 17 00:23:57 2002
@@ -343,8 +343,6 @@ __alloc_pages(unsigned int gfp_mask, uns
 		}
 	}

-	classzone->need_balance = 1;
-	mb();
 	/* we're somewhat low on memory, failed to find what we needed */
 	for (i = 0; zones[i] != NULL; i++) {
 		struct zone *z = zones[i];
@@ -869,7 +867,6 @@ void __init free_area_init_core(pg_data_
 		spin_lock_init(&zone->lru_lock);
 		zone->zone_pgdat = pgdat;
 		zone->free_pages = 0;
-		zone->need_balance = 0;
 		INIT_LIST_HEAD(&zone->active_list);
 		INIT_LIST_HEAD(&zone->inactive_list);
 		atomic_set(&zone->refill_counter, 0);
--- 2.5.35/include/linux/mmzone.h~per-zone-vm	Tue Sep 17 00:23:57 2002
+++ 2.5.35-akpm/include/linux/mmzone.h	Tue Sep 17 00:23:57 2002
@@ -62,7 +62,6 @@ struct zone {
 	spinlock_t		lock;
 	unsigned long		free_pages;
 	unsigned long		pages_min, pages_low, pages_high;
-	int			need_balance;

 	ZONE_PADDING(_pad1_)
* 35-mm1 triggers watchdog
2002-09-17 8:02 ` Andrew Morton
@ 2002-09-17 11:38 ` Ed Tomlinson
2002-09-17 20:12 ` Andrew Morton
0 siblings, 1 reply; 15+ messages in thread
From: Ed Tomlinson @ 2002-09-17 11:38 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-mm

Hi Andrew,

I have had 35-mm1 reboot twice via the software watchdog. What is the
best way to debug this? I do have a serial terminal and can rebuild
patched with the kernel debugger; I just need some instructions on how
to catch the stall and what info to gather. Is there a good FAQ on the
kernel debugger?

The kernel is 35-mm1 UP, no preempt, plus the ide probe fixes and a
corrected slab callback patch.

TIA
Ed
* Re: 35-mm1 triggers watchdog
2002-09-17 11:38 ` 35-mm1 triggers watchdog Ed Tomlinson
@ 2002-09-17 20:12 ` Andrew Morton
0 siblings, 0 replies; 15+ messages in thread
From: Andrew Morton @ 2002-09-17 20:12 UTC (permalink / raw)
To: Ed Tomlinson; +Cc: linux-mm

Ed Tomlinson wrote:
>
> Hi Andrew,
>
> I have had 35-mm1 reboot twice via the software watchdog. What is the
> best way to debug this? I do have a serial terminal and can rebuild
> patched with the kernel debugger; I just need some instructions on how
> to catch the stall and what info to gather. Is there a good FAQ on the
> kernel debugger?

Normally ksymoops will tell you where it was locked up when the NMI
watchdog hit. Aren't you getting a stack trace?

Kernel debugger? kgdb.sourceforge.net, with patches from
http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.35/2.5.35-mm1/experimental/

You're best off cross-compiling, so the source, vmlinux, etc are on the
workstation and you copy kernels to the test box. umm,

- patch the kernel
- enable kgdb in config
- build it, lilo it, add:
	gdb gdbbaud=115200 gdbttyS=ttyS1
  to the kernel boot line.
- Put my .gdbinit in $HOME.
- reboot test box
- gdb vmlinux
	rmt
	(gdb) c
- Run test, wait for NMI watchdog.

Sometimes it's a bit hard to work out _why_ the target trapped into the
debugger, so I changed kgdb to deliver a SIGEMT in response to NMI
rather than SIGBUS/SIGSEGV.

set editing on
set radix 0x0a

define rmt
	set remotebaud 115200
	target remote /dev/ttyS0
end

define comm25
	p ((struct thread_info *)((int)$esp & ~0x1fff))->task->comm
end

define task25
	p ((struct thread_info *)((int)$esp & ~0x1fff))->task
end

define thread25
	p ((struct thread_info *)((int)$esp & ~0x1fff))
end

define reboot
	maintenance packet r
end

# process information macros
define psname
	if $arg0 == 0
		set $athread = init_tasks[0]
	else
		set $athread = pidhash[(($arg0 >> 8) ^ $arg0) & 1023]
	end
	if $athread != 0
		while $athread->pid != $arg0 && $athread != 0
			set $athread = $athread->hash_next
		end
		if $athread != 0
			printf "%d %s\n", $arg0, (char*)$athread->comm
		end
	end
end

define ps
	set $initthread = init_tasks[0]
	set $athread = init_tasks[0]
	printf "%d %s\n", $athread->pid, (char*)($athread->comm)
	set $athread = $athread->next_task
	while $athread != ($initthread)
		if ($athread->pid) != (0)
			printf "%d %s\n", $athread->pid, (char*)$athread->comm
		end
		set $athread = $athread->next_task
	end
end

define page_states
	printf "Dirty: %dK\n", (page_states[0].nr_dirty + page_states[1].nr_dirty + page_states[2].nr_dirty + page_states[3].nr_dirty) * 4
	printf "Writeback: %dK\n", (page_states[0].nr_writeback + page_states[1].nr_writeback + page_states[2].nr_writeback + page_states[3].nr_writeback) * 4
	printf "Pagecache: %dK\n", (page_states[0].nr_pagecache + page_states[1].nr_pagecache + page_states[2].nr_pagecache + page_states[3].nr_pagecache) * 4
	printf "Page Table Pages: %d\n", (page_states[0].nr_page_table_pages + page_states[1].nr_page_table_pages + page_states[2].nr_page_table_pages + page_states[3].nr_page_table_pages) * 4
	printf "nr_reverse_maps: %d\n", page_states[0].nr_reverse_maps + page_states[1].nr_reverse_maps + page_states[2].nr_reverse_maps + page_states[3].nr_reverse_maps
end

define offsetof
	set $off = &(((struct $arg0 *)0)->$arg1)
	printf "%d 0x%x\n", $off, $off
end

# list_entry list type member
define list_entry
	set $off = (int)&(((struct $arg1 *)0)->$arg2)
	set $addr = (int)$arg0
	set $res = $addr - $off
	printf "0x%x\n", $res
end
* Re: dbench on tmpfs OOM's
2002-09-17 7:01 ` Hugh Dickins
2002-09-17 7:27 ` William Lee Irwin III
@ 2002-09-17 7:57 ` Christoph Rohland
2002-12-10 5:28 ` William Lee Irwin III
2 siblings, 0 replies; 15+ messages in thread
From: Christoph Rohland @ 2002-09-17 7:57 UTC (permalink / raw)
To: Hugh Dickins; +Cc: Andrew Morton, William Lee Irwin III, linux-mm, linux-kernel

Hi Hugh,

On Tue, 17 Sep 2002, Hugh Dickins wrote:
> What I never did was try GFP_HIGHUSER and kmap on the index pages:
> I think I decided back then that it wasn't likely to be needed
> (sparsely filled file indexes are a rarer case than sparsely filled
> pagetables, once the stupidity is fixed; and small files don't use
> index pages at all). But Bill's testing may well prove me wrong.

I think that this would be a good improvement. Big database and
application servers would definitely benefit from it, and desktops
could more easily use tmpfs for temporary file systems. I never dared
to do it in my limited time, since I feared deadlock situations.

Also, I ended up deciding I would try to go one step further: make
the index pages swappable, i.e. make the directory nodes normal tmpfs
files. This would even make the accounting right.

Greetings
Christoph
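(A sketch of the GFP_HIGHUSER-plus-kmap direction under discussion --
the helper below is hypothetical and not from any tree; note that
kmap() can sleep, which is exactly the kind of deadlock concern
Christoph raises:)

#include <linux/highmem.h>
#include <linux/mm.h>

/* Hypothetical helper: read one swap entry out of a highmem index
 * page.  kmap() may sleep, so callers must not hold spinlocks. */
static swp_entry_t shmem_swp_read(struct page *index_page,
				  unsigned long offset)
{
	swp_entry_t *vec, entry;

	vec = (swp_entry_t *)kmap(index_page);	/* may sleep */
	entry = vec[offset];
	kunmap(index_page);
	return entry;
}

(The index page itself would then be allocated from highmem, e.g.
alloc_page(GFP_HIGHUSER) followed by clear_highpage(), rather than
get_zeroed_page(GFP_USER), so it no longer has to come out of
ZONE_NORMAL.)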
* Re: dbench on tmpfs OOM's
2002-09-17 7:01 ` Hugh Dickins
2002-09-17 7:27 ` William Lee Irwin III
2002-09-17 7:57 ` dbench on tmpfs OOM's Christoph Rohland
@ 2002-12-10 5:28 ` William Lee Irwin III
2 siblings, 0 replies; 15+ messages in thread
From: William Lee Irwin III @ 2002-12-10 5:28 UTC (permalink / raw)
To: Hugh Dickins; +Cc: Andrew Morton, linux-mm, linux-kernel

On Tue, Sep 17, 2002 at 08:01:20AM +0100, Hugh Dickins wrote:
> What I never did was try GFP_HIGHUSER and kmap on the index pages:
> I think I decided back then that it wasn't likely to be needed
> (sparsely filled file indexes are a rarer case than sparsely filled
> pagetables, once the stupidity is fixed; and small files don't use
> index pages at all). But Bill's testing may well prove me wrong.

The included fix works flawlessly under the conditions of the
originally reported problem, on 2.5.50-bk6-wli-1. Sorry for not
getting back to you sooner.

Thanks,
Bill

Results:
-------

instance 1:
----------
Throughput 86.2057 MB/sec (NB=107.757 MB/sec 862.057 MBit/sec) 512 procs
dbench 512 360.36s user 12645.64s system 1648% cpu 13:08.91 total

instance 2:
----------
Throughput 85.8913 MB/sec (NB=107.364 MB/sec 858.913 MBit/sec) 512 procs
dbench 512 361.96s user 11780.65s system 1539% cpu 13:08.97 total

Peak memory consumption during the run:

/proc/meminfo:
-------------
MemTotal: 32125300 kB
MemFree: 7841472 kB
MemShared: 0 kB
Buffers: 1236 kB
Cached: 23397036 kB
SwapCached: 0 kB
Active: 149512 kB
Inactive: 23386864 kB
HighTotal: 31588352 kB
HighFree: 7681344 kB
LowTotal: 536948 kB
LowFree: 160128 kB
SwapTotal: 0 kB
SwapFree: 0 kB
Dirty: 0 kB
Writeback: 0 kB
Mapped: 142508 kB
Slab: 133020 kB
Committed_AS: 23757472 kB
PageTables: 18820 kB
ReverseMaps: 168934
HugePages_Total: 0
HugePages_Free: 0
Hugepagesize: 2048 kB

/proc/slabinfo (reported by bloatmost):
---------------------------------------
shmem_inode_cache: 39321KB 39682KB 99.9
radix_tree_node: 33870KB 34335KB 98.64
pae_pmd: 18732KB 18732KB 100.0
dentry_cache: 11612KB 14156KB 82.2
task_struct: 2691KB 2710KB 99.32
signal_act: 2207KB 2216KB 99.58
filp: 1976KB 2032KB 97.23
size-1024: 1824KB 1824KB 100.0
names_cache: 1740KB 1740KB 100.0
vm_area_struct: 1598KB 1650KB 96.88
pte_chain: 1271KB 1305KB 97.39
size-2048: 982KB 1032KB 95.15
biovec-BIO_MAX_PAGES: 768KB 780KB 98.46
files_cache: 704KB 704KB 100.0
mm_struct: 656KB 665KB 98.65
size-512: 421KB 436KB 96.55
blkdev_requests: 400KB 405KB 98.76
biovec-128: 384KB 390KB 98.46
ext2_inode_cache: 309KB 315KB 98.33
inode_cache: 253KB 253KB 100.0
skbuff_head_cache: 221KB 251KB 88.24
size-32: 183KB 211KB 86.68
end of thread, other threads: [~2002-12-10 5:28 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
2002-09-17  4:43 dbench on tmpfs OOM's William Lee Irwin III
2002-09-17  4:58 ` Andrew Morton
2002-09-17  5:01   ` Martin J. Bligh
2002-09-17  5:14     ` Andrew Morton
2002-09-17  5:18       ` Martin J. Bligh
2002-09-17  5:15   ` William Lee Irwin III
2002-09-17  5:31     ` Andrew Morton
2002-09-17  6:43       ` Christoph Rohland
2002-09-17  7:01       ` Hugh Dickins
2002-09-17  7:27         ` William Lee Irwin III
2002-09-17  8:02           ` Andrew Morton
2002-09-17 11:38             ` 35-mm1 triggers watchdog Ed Tomlinson
2002-09-17 20:12               ` Andrew Morton
2002-09-17  7:57         ` dbench on tmpfs OOM's Christoph Rohland
2002-12-10  5:28         ` William Lee Irwin III