* [PATCH] modified segq for 2.5
@ 2002-08-15 14:24 Rik van Riel
2002-09-09 9:38 ` Andrew Morton
0 siblings, 1 reply; 28+ messages in thread
From: Rik van Riel @ 2002-08-15 14:24 UTC (permalink / raw)
To: William Lee Irwin III; +Cc: sfkaplan, linux-mm, Andrew Morton
Hi,
here is a patch that implements a modified SEGQ replacement
for the 2.5 kernel.
- new pages start out on the active list
- once a page reaches the end of the active list:
  - if it is (mapped && referenced) it goes to the front of the active list
  - otherwise, it gets moved to the front of the inactive list
- linear IO drops pages to the inactive list after it is done with them
- once a page reaches the end of the inactive list:
  - if it is referenced, it goes to the front of the active list
  - otherwise, it is reclaimed
This means accesses to unmapped pagecache pages are ignored while
the page is on the active list, whereas accesses to process-mapped
pages on the active list are counted. I hope this bias will help
keep the working set of processes in RAM.
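
As a rough sketch of the resulting decision logic (illustration only,
not part of the patch; the move_*/reclaim_* names below are made up,
the tests are the ones the patch itself uses):

	/* when a page reaches the tail of the active list: */
	if (page_referenced(page) && page_mapping_inuse(page))
		move_to_active_head(page);	/* mapped and recently used */
	else
		move_to_inactive_head(page);	/* unmapped or idle: demote */

	/* when a page reaches the tail of the inactive list: */
	if (page_referenced(page))
		move_to_active_head(page);	/* referenced while inactive */
	else
		reclaim_page(page);		/* otherwise reclaim it */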
(note that the patch was made against 2.5.29, but it should be
trivial to port to newer kernels)
kind regards,
Rik
--
Bravely reimplemented by the knights who say "NIH".
http://www.surriel.com/ http://distro.conectiva.com/
# This is a BitKeeper generated patch for the following project:
# Project Name: Linux kernel tree
# This patch format is intended for GNU patch command version 2.5 or higher.
# This patch includes the following deltas:
# ChangeSet 1.476 -> 1.477
# include/linux/swap.h 1.48 -> 1.49
# mm/readahead.c 1.13 -> 1.14
# mm/vmscan.c 1.85 -> 1.86
# mm/filemap.c 1.114 -> 1.115
# mm/swap.c 1.17 -> 1.18
#
# The following is the BitKeeper ChangeSet Log
# --------------------------------------------
# 02/07/29 riel@imladris.surriel.com 1.477
# second chance replacement
# --------------------------------------------
#
diff -Nru a/include/linux/swap.h b/include/linux/swap.h
--- a/include/linux/swap.h Thu Aug 15 11:19:09 2002
+++ b/include/linux/swap.h Thu Aug 15 11:19:09 2002
@@ -161,6 +161,7 @@
extern void FASTCALL(lru_cache_del(struct page *));
extern void FASTCALL(activate_page(struct page *));
+extern void FASTCALL(deactivate_page(struct page *));
extern void swap_setup(void);
diff -Nru a/mm/filemap.c b/mm/filemap.c
--- a/mm/filemap.c Thu Aug 15 11:19:09 2002
+++ b/mm/filemap.c Thu Aug 15 11:19:09 2002
@@ -848,20 +848,11 @@
/*
* Mark a page as having seen activity.
- *
- * inactive,unreferenced -> inactive,referenced
- * inactive,referenced -> active,unreferenced
- * active,unreferenced -> active,referenced
*/
void mark_page_accessed(struct page *page)
{
- if (!PageActive(page) && PageReferenced(page)) {
- activate_page(page);
- ClearPageReferenced(page);
- return;
- } else if (!PageReferenced(page)) {
+ if (!PageReferenced(page))
SetPageReferenced(page);
- }
}
/*
diff -Nru a/mm/readahead.c b/mm/readahead.c
--- a/mm/readahead.c Thu Aug 15 11:19:09 2002
+++ b/mm/readahead.c Thu Aug 15 11:19:09 2002
@@ -204,6 +204,39 @@
}
/*
+ * Since we're less likely to use the pages we've already read than
+ * the pages we're about to read we move the pages from the past
+ * window to the inactive list.
+ */
+static void
+drop_behind(struct file *file, unsigned long offset, pgoff_t size)
+{
+ unsigned long page_idx, lower_limit = 0;
+ struct address_space *mapping;
+ struct page *page;
+
+ /* We're re-using already present data or just started reading. */
+ if (size == -1UL || offset == 0)
+ return;
+
+ mapping = file->f_dentry->d_inode->i_mapping;
+
+ if (offset > size)
+ lower_limit = offset - size;
+
+ read_lock(&mapping->page_lock);
+ for (page_idx = offset; page_idx > lower_limit; page_idx--) {
+ page = radix_tree_lookup(&mapping->page_tree, page_idx);
+
+ if (!page || (!PageActive(page) && !PageReferenced(page)))
+ break;
+
+ deactivate_page(page);
+ }
+ read_unlock(&mapping->page_lock);
+}
+
+/*
* page_cache_readahead is the main function. If performs the adaptive
* readahead window size management and submits the readahead I/O.
*/
@@ -286,6 +319,11 @@
ra->prev_page = ra->start;
ra->ahead_start = 0;
ra->ahead_size = 0;
+ /*
+ * Drop the pages from the old window into the
+ * inactive list.
+ */
+ drop_behind(file, offset, ra->size);
/*
* Control now returns, probably to sleep until I/O
* completes against the first ahead page.
diff -Nru a/mm/swap.c b/mm/swap.c
--- a/mm/swap.c Thu Aug 15 11:19:09 2002
+++ b/mm/swap.c Thu Aug 15 11:19:09 2002
@@ -53,6 +53,24 @@
}
/**
+ * deactivate_page - move an active page to the inactive list.
+ * @page: page to deactivate
+ */
+void deactivate_page(struct page * page)
+{
+ spin_lock(&pagemap_lru_lock);
+ if (PageLRU(page) && PageActive(page)) {
+ del_page_from_active_list(page);
+ add_page_to_inactive_list(page);
+ KERNEL_STAT_INC(pgdeactivate);
+ }
+ spin_unlock(&pagemap_lru_lock);
+
+ if (PageReferenced(page))
+ ClearPageReferenced(page);
+}
+
+/**
* lru_cache_add: add a page to the page lists
* @page: the page to add
*/
@@ -60,7 +78,7 @@
{
if (!TestSetPageLRU(page)) {
spin_lock(&pagemap_lru_lock);
- add_page_to_inactive_list(page);
+ add_page_to_active_list(page);
spin_unlock(&pagemap_lru_lock);
}
}
diff -Nru a/mm/vmscan.c b/mm/vmscan.c
--- a/mm/vmscan.c Thu Aug 15 11:19:09 2002
+++ b/mm/vmscan.c Thu Aug 15 11:19:09 2002
@@ -138,7 +138,7 @@
* the active list.
*/
pte_chain_lock(page);
- if (page_referenced(page) && page_mapping_inuse(page)) {
+ if (page_referenced(page)) {
del_page_from_inactive_list(page);
add_page_to_active_list(page);
pte_chain_unlock(page);
@@ -346,7 +346,7 @@
KERNEL_STAT_INC(pgscan);
pte_chain_lock(page);
- if (page->pte.chain && page_referenced(page)) {
+ if (page_referenced(page) && page_mapping_inuse(page)) {
list_del(&page->lru);
list_add(&page->lru, &active_list);
pte_chain_unlock(page);
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/
^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] modified segq for 2.5
  2002-08-15 14:24 [PATCH] modified segq for 2.5 Rik van Riel
@ 2002-09-09  9:38 ` Andrew Morton
  2002-09-09 11:40   ` Ed Tomlinson
                     ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: Andrew Morton @ 2002-09-09 9:38 UTC (permalink / raw)
To: Rik van Riel; +Cc: William Lee Irwin III, sfkaplan, linux-mm

Rik van Riel wrote:
>
> Hi,
>
> here is a patch that implements a modified SEGQ replacement
> for the 2.5 kernel.
>
> - new pages start out on the active list
> - once a page reaches the end of the active list:
>   - if it is (mapped && referenced) it goes to the front of the active list
>   - otherwise, it gets moved to the front of the inactive list
> - linear IO drops pages to the inactive list after it is done with them
> - once a page reaches the end of the inactive list:
>   - if it is referenced, it goes to the front of the active list
>   - otherwise, it is reclaimed
>
> This means accesses to not mapped pagecache pages while that
> page is on the active list get ignored, while accesses to
> process pages on the active list get counted. I hope this
> bias will help keeping the working set of processes in RAM.
>
> (note that the patch was made against 2.5.29, but it should be
> trivial to port to newer kernels)
>

I ported this up.  The below patch applies with or without my recent
vmscan.c maulings.

I haven't really had time to test it much.  Running `make -j6 dep' on
a setup where userspace has 14M available seems to be in the operating
region.  That's fairly swappy but not ridiculously so.

Didn't seem to make much difference in that particular dot on the
spectrum.  105 seconds all up.  2.4.19 does it in 80 or so, but I
wasn't very careful in making sure that both kernels had the same
available memory - half a meg here or there could make a big
difference.

I fiddled with it a bit: did you forget to move the write(2) pages to
the inactive list?  I changed it to do that at IO completion.  It had
little effect.  Probably should be looking at the page state before
doing that.

One thing this patch did do was to speed up the initial untar of the
kernel source - 50 seconds down to 25.  That'll be due to not having
so much dirt on the inactive list.  The "nonblocking page reclaim"
code (needs a better name...) does that in 18 secs.

The inactive list was smaller with this patch.  Around 10% of
allocatable memory usually.

btw, I've added the `page_mapped()' helper to replace open-coded
testing of page->pte.chain.  Because with highpte and HIGHMEM_64G,
page->pte.chain is wrong.  pte.direct is 64-bit and we need to test
all those bits to see if the page is in pagetables.

With nonblocking-vm and slabasap, the test took 150 seconds.  Removing
slabasap took it down to 98 seconds.  The slab rework seemed to leave
an extra megabyte average in cache.  Which is not to say that the
algorithms in there are wrong, but perhaps we should push it a bit
harder if there's swapout pressure.  And the fact that a meg makes
that much difference indicates that it's right on the knee of the
curve and perhaps not a very interesting test.

I like the way in which the patch improves the reclaim success rate.
It went from 50% to 80 or 90%.

It worries me that the inactive list is so small.  But I need to test
it more.

(This patch looks a lot like NRU - what's the difference?)
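
(On the page_mapped() remark above: the idea is to test the whole
64-bit pte.direct field rather than the pte.chain pointer.  A sketch
of such a helper, for illustration only -- not necessarily the exact
code in Andrew's tree:)

	static inline int page_mapped(struct page *page)
	{
		/*
		 * pte.direct is a pte_addr_t (64-bit with HIGHMEM_64G);
		 * any nonzero value in the pte union means the page is
		 * in somebody's pagetables.
		 */
		return page->pte.direct != 0;
	}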
include/linux/mm_inline.h | 9 ++++++++ include/linux/pagevec.h | 7 ++++++ mm/filemap.c | 14 +++---------- mm/readahead.c | 46 +++++++++++++++++++++++++++++++++++++++++++ mm/rmap.c | 4 +++ mm/swap.c | 49 +++++++++++++++++++++++++++++++++++++++++++++- mm/vmscan.c | 8 +++++-- 7 files changed, 124 insertions(+), 13 deletions(-) --- 2.5.33/mm/filemap.c~segq Mon Sep 9 01:44:49 2002 +++ 2.5.33-akpm/mm/filemap.c Mon Sep 9 02:03:48 2002 @@ -24,6 +24,8 @@ #include <linux/writeback.h> #include <linux/pagevec.h> #include <linux/security.h> +#include <linux/mm_inline.h> + /* * This is needed for the following functions: * - try_to_release_page @@ -685,6 +687,7 @@ void end_page_writeback(struct page *pag smp_mb__after_clear_bit(); if (waitqueue_active(waitqueue)) wake_up_all(waitqueue); + deactivate_page(page); } EXPORT_SYMBOL(end_page_writeback); @@ -868,20 +871,11 @@ grab_cache_page_nowait(struct address_sp /* * Mark a page as having seen activity. - * - * inactive,unreferenced -> inactive,referenced - * inactive,referenced -> active,unreferenced - * active,unreferenced -> active,referenced */ void mark_page_accessed(struct page *page) { - if (!PageActive(page) && PageReferenced(page)) { - activate_page(page); - ClearPageReferenced(page); - return; - } else if (!PageReferenced(page)) { + if (!PageReferenced(page)) SetPageReferenced(page); - } } /* --- 2.5.33/mm/readahead.c~segq Mon Sep 9 01:44:49 2002 +++ 2.5.33-akpm/mm/readahead.c Mon Sep 9 01:44:49 2002 @@ -213,6 +213,45 @@ check_ra_success(struct file_ra_state *r } /* + * Since we're less likely to use the pages we've already read than the pages + * we're about to read we move the pages from the past window to the inactive + * list. + */ +static void +drop_behind(struct address_space *mapping, pgoff_t offset, unsigned long size) +{ + unsigned long page_idx; + unsigned long lower_limit = 0; + struct page *page; + struct pagevec pvec; + + /* We're re-using already present data or just started reading. */ + if (size == -1UL || offset == 0) + return; + + if (offset > size) + lower_limit = offset - size; + + pagevec_init(&pvec); + read_lock(&mapping->page_lock); + for (page_idx = offset; page_idx > lower_limit; page_idx--) { + page = radix_tree_lookup(&mapping->page_tree, page_idx); + + if (!page || (!PageActive(page) && !PageReferenced(page))) + break; + + page_cache_get(page); + if (!pagevec_add(&pvec, page)) { + read_unlock(&mapping->page_lock); + __pagevec_deactivate_active(&pvec); + read_lock(&mapping->page_lock); + } + } + read_unlock(&mapping->page_lock); + pagevec_deactivate_active(&pvec); +} + +/* * page_cache_readahead is the main function. If performs the adaptive * readahead window size management and submits the readahead I/O. */ @@ -296,6 +335,13 @@ void page_cache_readahead(struct file *f ra->ahead_start = 0; ra->ahead_size = 0; /* + * Drop the pages from the old window into the + * inactive list. + */ + drop_behind(file->f_dentry->d_inode->i_mapping, + offset, ra->size); + + /* * Control now returns, probably to sleep until I/O * completes against the first ahead page. 
* When the second page in the old ahead window is --- 2.5.33/include/linux/pagevec.h~segq Mon Sep 9 01:44:49 2002 +++ 2.5.33-akpm/include/linux/pagevec.h Mon Sep 9 01:44:49 2002 @@ -18,6 +18,7 @@ void __pagevec_release(struct pagevec *p void __pagevec_release_nonlru(struct pagevec *pvec); void __pagevec_free(struct pagevec *pvec); void __pagevec_lru_add(struct pagevec *pvec); +void __pagevec_deactivate_active(struct pagevec *pvec); void lru_add_drain(void); void pagevec_deactivate_inactive(struct pagevec *pvec); void pagevec_strip(struct pagevec *pvec); @@ -69,3 +70,9 @@ static inline void pagevec_lru_add(struc if (pagevec_count(pvec)) __pagevec_lru_add(pvec); } + +static inline void pagevec_deactivate_active(struct pagevec *pvec) +{ + if (pagevec_count(pvec)) + __pagevec_deactivate_active(pvec); +} --- 2.5.33/mm/swap.c~segq Mon Sep 9 01:44:49 2002 +++ 2.5.33-akpm/mm/swap.c Mon Sep 9 01:44:49 2002 @@ -196,6 +196,38 @@ void pagevec_deactivate_inactive(struct } /* + * Move all the active pages to the head of the inactive list and release them. + * Reinitialises the caller's pagevec. + */ +void __pagevec_deactivate_active(struct pagevec *pvec) +{ + int i; + struct zone *zone = NULL; + + for (i = 0; i < pagevec_count(pvec); i++) { + struct page *page = pvec->pages[i]; + struct zone *pagezone = page_zone(page); + + if (pagezone != zone) { + if (!PageActive(page) || !PageLRU(page)) + continue; + if (zone) + spin_unlock_irq(&zone->lru_lock); + zone = pagezone; + spin_lock_irq(&zone->lru_lock); + } + if (PageActive(page) && PageLRU(page)) { + del_page_from_active_list(zone, page); + ClearPageActive(page); + add_page_to_inactive_list(zone, page); + } + } + if (zone) + spin_unlock_irq(&zone->lru_lock); + __pagevec_release(pvec); +} + +/* * Add the passed pages to the inactive_list, then drop the caller's refcount * on them. Reinitialises the caller's pagevec. */ @@ -216,7 +248,8 @@ void __pagevec_lru_add(struct pagevec *p } if (TestSetPageLRU(page)) BUG(); - add_page_to_inactive_list(zone, page); + add_page_to_active_list(zone, page); + SetPageActive(page); } if (zone) spin_unlock_irq(&zone->lru_lock); @@ -240,6 +273,20 @@ void pagevec_strip(struct pagevec *pvec) } } +void __deactivate_page(struct page *page) +{ + struct zone *zone = page_zone(page); + unsigned long flags; + + spin_lock_irqsave(&zone->lru_lock, flags); + if (PageLRU(page) && PageActive(page)) { + del_page_from_active_list(zone, page); + ClearPageActive(page); + add_page_to_inactive_list(zone, page); + } + spin_unlock_irqrestore(&zone->lru_lock, flags); +} + /* * Perform any setup for the swap system */ --- 2.5.33/mm/vmscan.c~segq Mon Sep 9 01:44:49 2002 +++ 2.5.33-akpm/mm/vmscan.c Mon Sep 9 01:44:49 2002 @@ -126,7 +126,7 @@ shrink_list(struct list_head *page_list, } pte_chain_lock(page); - if (page_referenced(page) && page_mapping_inuse(page)) { + if (page_referenced(page)) { /* In active use or really unfreeable. Activate it. 
*/ pte_chain_unlock(page); goto activate_locked; @@ -411,9 +411,13 @@ refill_inactive_zone(struct zone *zone, while (!list_empty(&l_hold)) { page = list_entry(l_hold.prev, struct page, lru); list_del(&page->lru); + if (TestClearPageReferenced(page)) { + list_add(&page->lru, &l_active); + continue; + } if (page_mapped(page)) { pte_chain_lock(page); - if (page_mapped(page) && page_referenced(page)) { + if (page_referenced(page) && page_mapping_inuse(page)) { pte_chain_unlock(page); list_add(&page->lru, &l_active); continue; --- 2.5.33/mm/rmap.c~segq Mon Sep 9 01:44:49 2002 +++ 2.5.33-akpm/mm/rmap.c Mon Sep 9 01:44:49 2002 @@ -125,6 +125,9 @@ int page_referenced(struct page * page) if (TestClearPageReferenced(page)) referenced++; + if (!page_mapped(page)) + goto out; + if (PageDirect(page)) { pte_t *pte = rmap_ptep_map(page->pte.direct); if (ptep_test_and_clear_young(pte)) @@ -158,6 +161,7 @@ int page_referenced(struct page * page) pte_chain_free(pc); } } +out: return referenced; } --- 2.5.33/include/linux/mm_inline.h~segq Mon Sep 9 01:44:49 2002 +++ 2.5.33-akpm/include/linux/mm_inline.h Mon Sep 9 01:44:49 2002 @@ -38,3 +38,12 @@ del_page_from_lru(struct zone *zone, str zone->nr_inactive--; } } + + +void __deactivate_page(struct page *page); + +static inline void deactivate_page(struct page *page) +{ + if (PageLRU(page) && PageActive(page)) + __deactivate_page(page); +} . -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] modified segq for 2.5 2002-09-09 9:38 ` Andrew Morton @ 2002-09-09 11:40 ` Ed Tomlinson 2002-09-09 17:10 ` William Lee Irwin III 2002-09-09 18:58 ` Andrew Morton 2002-09-09 13:10 ` Rik van Riel 2002-09-09 22:46 ` Daniel Phillips 2 siblings, 2 replies; 28+ messages in thread From: Ed Tomlinson @ 2002-09-09 11:40 UTC (permalink / raw) To: Andrew Morton, Rik van Riel; +Cc: William Lee Irwin III, sfkaplan, linux-mm On September 9, 2002 05:38 am, Andrew Morton wrote: > With nonblocking-vm and slabasap, the test took 150 seconds. > Removing slabasap took it down to 98 seconds. The slab rework > seemed to leave an extra megabyte average in cache. Which is not > to say that the algorithms in there are wrong, but perhaps we should > push it a bit harder if there's swapout pressure. Andrew, One simple change that will make slabasap try harder is to use only inactive pages caculating the ratio. unsigned int nr_used_zone_pages(void) { unsigned int pages = 0; struct zone *zone; for_each_zone(zone) pages += zone->nr_inactive; return pages; } This will make it closer to slablru which used the inactive list. Second item. Do you run gkrelmon when doing your tests? If not please install it and watch it slowly start to eat resources. This morning (uptime 12hr) it was using 31% of CPU. Stopping and starting it did not change this. Think we have something we can improve here. I have inclued an strace of one (and a bit) update cycle. This was with 33-mm5 with your varient of slabasap. Ed open("/proc/meminfo", O_RDONLY) = 6 fstat64(6, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0 old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x4001d000 read(6, "MemTotal: 516920 kB\nMemFre"..., 1024) = 491 read(6, "", 1024) = 0 close(6) = 0 munmap(0x4001d000, 4096) = 0 gettimeofday({1031571076, 678996}, NULL) = 0 write(3, ">\2\7\0\30\2`\2\375\1`\2\35\0`\2\0\0%\0\0\0%\0P\0\3\0>"..., 1956) = 1956 ioctl(3, 0x541b, [0]) = 0 poll([{fd=3, events=POLLIN}, {fd=4, events=POLLIN}], 2, 262) = 0 gettimeofday({1031571076, 945260}, NULL) = 0 time([1031571076]) = 1031571076 open("/proc/stat", O_RDONLY) = 6 fstat64(6, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0 old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x4001d000 read(6, "cpu 418635 1309463 315263 22714"..., 1024) = 591 close(6) = 0 munmap(0x4001d000, 4096) = 0 open("/proc/loadavg", O_RDONLY) = 6 fstat64(6, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0 old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x4001d000 read(6, "3.27 2.32 3.38 3/132 14540\n", 1024) = 27 close(6) = 0 munmap(0x4001d000, 4096) = 0 open("/proc/net/dev", O_RDONLY) = 6 fstat64(6, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0 old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x4001d000 read(6, "Inter-| Receive "..., 1024) = 938 read(6, "", 1024) = 0 close(6) = 0 munmap(0x4001d000, 4096) = 0 gettimeofday({1031571076, 949176}, NULL) = 0 write(3, "F\2\5\0\213\0`\2$\0`\2\0\0\0\0\5\0\6\0>\0\7\0\211\0`\2"..., 784) = 784 ioctl(3, 0x541b, [0]) = 0 poll([{fd=3, events=POLLIN}, {fd=4, events=POLLIN}], 2, 473) = 0 gettimeofday({1031571077, 424287}, NULL) = 0 time([1031571077]) = 1031571077 open("/proc/stat", O_RDONLY) = 6 fstat64(6, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0 old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x4001d000 read(6, "cpu 418639 1309506 315264 22714"..., 1024) = 591 close(6) = 0 munmap(0x4001d000, 4096) = 0 open("/proc/loadavg", O_RDONLY) = 6 
fstat64(6, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0 old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x4001d000 read(6, "3.27 2.32 3.38 2/132 14540\n", 1024) = 27 close(6) = 0 munmap(0x4001d000, 4096) = 0 write(3, "8\2\5\0!\0`\2\4@\0\0\0\0\0\0)\0`\2J\0\5\0m\0`\2!\0`\2"..., 2048) = 2048 open("/proc/net/tcp", O_RDONLY) = 6 fstat64(6, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0 old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x4001d000 read(6, " sl local_address rem_address "..., 1024) = 1024 read(6, " \n 6: "..., 1024) = 1024 read(6, "dc00040 3000 0 0 2 -1 "..., 1024) = 1024 read(6, "000 0 0 6460 1 da6dfc"..., 1024) = 1024 read(6, "00000000 00:00000000 00000000 1"..., 1024) = 1024 read(6, "0100007F:8001 01 00000000:000000"..., 1024) = 1024 read(6, " \n 40: 0100007F:866E 010000"..., 1024) = 1024 read(6, "-1 \n", 1024) = 32 read(6, "", 1024) = 0 close(6) = 0 munmap(0x4001d000, 4096) = 0 open("/proc/net/tcp6", O_RDONLY) = -1 ENOENT (No such file or directory) open("/proc/net/dev", O_RDONLY) = 6 fstat64(6, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0 old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x4001d000 read(6, "Inter-| Receive "..., 1024) = 938 read(6, "", 1024) = 0 close(6) = 0 munmap(0x4001d000, 4096) = 0 open("/proc/net/route", O_RDONLY) = 6 fstat64(6, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0 old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x4001d000 read(6, "Iface\tDestination\tGateway \tFlags"..., 1024) = 512 read(6, "", 1024) = 0 close(6) = 0 munmap(0x4001d000, 4096) = 0 time(NULL) = 1031571077 open("/proc/meminfo", O_RDONLY) = 6 fstat64(6, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0 old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x4001d000 read(6, "MemTotal: 516920 kB\nMemFre"..., 1024) = 491 read(6, "", 1024) = 0 close(6) = 0 munmap(0x4001d000, 4096) = 0 open("/proc/mounts", O_RDONLY) = 6 fstat64(6, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0 old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x4001d000 read(6, "rootfs / rootfs rw 0 0\n/dev/root"..., 1024) = 314 read(6, "", 1024) = 0 close(6) = 0 munmap(0x4001d000, 4096) = 0 statfs("/", {f_type="REISERFS_SUPER_MAGIC", f_bsize=4096, f_blocks=786466, f_bfree=120154, f_files=4294967295, f_ffree=4294967295, f_namelen=255}) = 0 statfs("/poola", {f_type="REISERFS_SUPER_MAGIC", f_bsize=4096, f_blocks=2477941, f_bfree=892388, f_files=4294967295, f_ffree=4294967295, f_namelen=255}) = 0 statfs("/poole", {f_type="REISERFS_SUPER_MAGIC", f_bsize=4096, f_blocks=8870498, f_bfree=2468598, f_files=4294967295, f_ffree=4294967295, f_namelen=255}) = 0 statfs("/boot", {f_type="EXT2_SUPER_MAGIC", f_bsize=1024, f_blocks=63925, f_bfree=21000, f_files=16560, f_ffree=14904, f_namelen=255}) = 0 statfs("/tmp", {f_type=0x1021994, f_bsize=4096, f_blocks=192000, f_bfree=191685, f_files=64615, f_ffree=64593, f_namelen=255}) = 0 statfs("/poolg", {f_type="REISERFS_SUPER_MAGIC", f_bsize=4096, f_blocks=8870624, f_bfree=2371206, f_files=4294967295, f_ffree=4294967295, f_namelen=255}) = 0 statfs("/root2", {f_type="EXT2_SUPER_MAGIC", f_bsize=4096, f_blocks=774823, f_bfree=137303, f_files=393600, f_ffree=261747, f_namelen=255}) = 0 gettimeofday({1031571077, 639770}, NULL) = 0 write(3, "8\2\5\0!\0`\2\4@\0\0\0\0\0\0\'\0`\2J\0\5\0k\1`\2!\0`\2"..., 1900) = 1900 ioctl(3, 0x541b, [0]) = 0 poll([{fd=3, events=POLLIN}, {fd=4, events=POLLIN}], 2, 260) = 0 gettimeofday({1031571077, 916658}, NULL) 
= 0 time([1031571077]) = 1031571077 open("/proc/stat", O_RDONLY) = 6 fstat64(6, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0 old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x4001d000 read(6, "cpu 418649 1309524 315285 22714"..., 1024) = 591 close(6) = 0 munmap(0x4001d000, 4096) = 0 open("/proc/loadavg", O_RDONLY) = 6 fstat64(6, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0 old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x4001d000 read(6, "3.27 2.32 3.38 4/132 14540\n", 1024) = 27 close(6) = 0 munmap(0x4001d000, 4096) = 0 open("/proc/net/dev", O_RDONLY) = 6 fstat64(6, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0 old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x4001d000 read(6, "Inter-| Receive "..., 1024) = 938 read(6, "", 1024) = 0 close(6) = 0 munmap(0x4001d000, 4096) = 0 gettimeofday({1031571077, 920415}, NULL) = 0 write(3, "F\2\5\0\213\0`\2$\0`\2\0\0\0\0\5\0\6\0>\0\7\0\211\0`\2"..., 192) = 192 ioctl(3, 0x541b, [0]) = 0 poll([{fd=3, events=POLLIN}, {fd=4, events=POLLIN}], 2, 473) = 0 gettimeofday({1031571078, 396278}, NULL) = 0 time([1031571078]) = 1031571078 open("/proc/stat", O_RDONLY) = 6 fstat64(6, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0 old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x4001d000 read(6, "cpu 418653 1309567 315286 22714"..., 1024) = 591 close(6) = 0 munmap(0x4001d000, 4096) = 0 open("/proc/loadavg", O_RDONLY) = 6 fstat64(6, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0 old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x4001d000 read(6, "3.27 2.32 3.38 3/132 14540\n", 1024) = 27 close(6) = 0 munmap(0x4001d000, 4096) = 0 write(3, "8\2\5\0!\0`\2\4@\0\0\0\0\0\0)\0`\2J\0\5\0m\0`\2!\0`\2"..., 2048) = 2048 open("/proc/net/tcp", O_RDONLY) = 6 fstat64(6, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0 old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x4001d000 read(6, " sl local_address rem_address "..., 1024) = 1024 read(6, " \n 6: "..., 1024) = 1024 read(6, "dc00040 3000 0 0 2 -1 "..., 1024) = 1024 read(6, "000 0 0 6460 1 da6dfc"..., 1024) = 1024 read(6, "00000000 00:00000000 00000000 1"..., 1024) = 1024 read(6, "0100007F:8001 01 00000000:000000"..., 1024) = 1024 read(6, " \n 40: 0100007F:866E 010000"..., 1024) = 1024 read(6, "-1 \n", 1024) = 32 read(6, "", 1024) = 0 close(6) = 0 munmap(0x4001d000, 4096) = 0 open("/proc/net/tcp6", O_RDONLY) = -1 ENOENT (No such file or directory) open("/proc/net/dev", O_RDONLY) = 6 fstat64(6, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0 old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x4001d000 read(6, "Inter-| Receive "..., 1024) = 938 read(6, "", 1024) = 0 close(6) = 0 munmap(0x4001d000, 4096) = 0 writev(3, [{"8\2\5\0!\0`\2\4@\0\0\0\0\0\0\'\0`\2J\0\5\0k\1`\2!\0`\2"..., 2048}, {"\227\320\357\0", 4}], 2) = 2052 open("/proc/net/route", O_RDONLY) = 6 fstat64(6, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0 old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x4001d000 read(6, "Iface\tDestination\tGateway \tFlags"..., 1024) = 512 read(6, "", 1024) = 0 close(6) = 0 munmap(0x4001d000, 4096) = 0 time(NULL) = 1031571078 open("/proc/meminfo", O_RDONLY) = 6 fstat64(6, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0 old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x4001d000 read(6, "MemTotal: 516920 kB\nMemFre"..., 1024) = 491 read(6, "", 1024) = 0 close(6) = 0 munmap(0x4001d000, 4096) = 0 
gettimeofday({1031571078, 614278}, NULL) = 0 write(3, "J\2\5\0\320\2`\2!\0`\2\2\0\f\0\1\0000\0>\0\7\0\320\2`\2"..., 404) = 404 ioctl(3, 0x541b, [0]) = 0 poll([{fd=3, events=POLLIN}, {fd=4, events=POLLIN}], 2, 258) = 0 gettimeofday({1031571078, 875241}, NULL) = 0 time([1031571078]) = 1031571078 open("/proc/stat", O_RDONLY) = 6 fstat64(6, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0 old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x4001d000 read(6, "cpu 418657 1309592 315306 22714"..., 1024) = 591 close(6) = 0 munmap(0x4001d000, 4096) = 0 open("/proc/loadavg", O_RDONLY) = 6 fstat64(6, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0 old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x4001d000 read(6, "3.27 2.32 3.38 2/132 14540\n", 1024) = 27 close(6) = 0 munmap(0x4001d000, 4096) = 0 open("/proc/net/dev", O_RDONLY) = 6 fstat64(6, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0 old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x4001d000 read(6, "Inter-| Receive "..., 1024) = 938 read(6, "", 1024) = 0 close(6) = 0 munmap(0x4001d000, 4096) = 0 gettimeofday({1031571078, 879754}, NULL) = 0 write(3, "F\2\5\0\213\0`\2$\0`\2\0\0\0\0\5\0\6\0>\0\7\0\211\0`\2"..., 700) = 700 ioctl(3, 0x541b, [0]) = 0 poll( <unfinished ...> -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] modified segq for 2.5 2002-09-09 11:40 ` Ed Tomlinson @ 2002-09-09 17:10 ` William Lee Irwin III 2002-09-09 18:58 ` Andrew Morton 1 sibling, 0 replies; 28+ messages in thread From: William Lee Irwin III @ 2002-09-09 17:10 UTC (permalink / raw) To: Ed Tomlinson; +Cc: Andrew Morton, Rik van Riel, sfkaplan, linux-mm On Mon, Sep 09, 2002 at 07:40:16AM -0400, Ed Tomlinson wrote: > Second item. Do you run gkrelmon when doing your tests? If not please > install it and watch it slowly start to eat resources. This morning (uptime > Think we have something we can improve here. I have inclued an strace > of one (and a bit) update cycle. > This was with 33-mm5 with your varient of slabasap. strace -r to get relative timestamps. I've seen some issues where tasks suck progressively more cpu over time and the box gets unusable, leading most notably to 30+s or longer fork/exit latencies. Still on idea what's going wrong when it does, though. Cheers, Bill -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] modified segq for 2.5 2002-09-09 11:40 ` Ed Tomlinson 2002-09-09 17:10 ` William Lee Irwin III @ 2002-09-09 18:58 ` Andrew Morton 1 sibling, 0 replies; 28+ messages in thread From: Andrew Morton @ 2002-09-09 18:58 UTC (permalink / raw) To: Ed Tomlinson; +Cc: Rik van Riel, William Lee Irwin III, sfkaplan, linux-mm Ed Tomlinson wrote: > > On September 9, 2002 05:38 am, Andrew Morton wrote: > > > With nonblocking-vm and slabasap, the test took 150 seconds. > > Removing slabasap took it down to 98 seconds. The slab rework > > seemed to leave an extra megabyte average in cache. Which is not > > to say that the algorithms in there are wrong, but perhaps we should > > push it a bit harder if there's swapout pressure. > > Andrew, One simple change that will make slabasap try harder is to > use only inactive pages caculating the ratio. > > unsigned int nr_used_zone_pages(void) > { > unsigned int pages = 0; > struct zone *zone; > > for_each_zone(zone) > pages += zone->nr_inactive; > > return pages; > } > > This will make it closer to slablru which used the inactive list. hmm. Well if we are to be honest to the "account for seeks" thing then perhaps we should double-count for swap activity - a swapout and a swapin is two units of seekiness. So consider add_to_swap() to be worth two page scans. Maybe the same for swap_writepage(). That should increase pressure on slab when anon pages are being victimised. Ditto for dirty MAP_SHARED I guess. > Second item. Do you run gkrelmon when doing your tests? If not please > install it and watch it slowly start to eat resources. This morning (uptime > 12hr) it was using 31% of CPU. Stopping and starting it did not change this. > Think we have something we can improve here. I have inclued an strace > of one (and a bit) update cycle. I was running gkrellm for a while. Is that the same thing? I didn't see anything untoward in there. It seems to update at 10Hz or more, so it's fairly expensive. But no obvious increase in load across time. It seems that the CPU load accounting in 2.5 is a bit odd; perhaps as a result of the HZ changes. Certainly it is hard to make comparisons with 2.4 based upon it. Probably one needs to equalise the HZ settings to make useful comparison. Anyway. Could you please run the kernel profiler, see where the time is being spent? Just add `profile=1' to the kernel boot line and use this: readprofile -r sleep 30 readprofile -v -m /boot/System.map | sort -n +2 | tail -40 (If readprofile screws up, edit your System.map and remove all the lines containing " w " and " W ") -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] modified segq for 2.5
  2002-09-09  9:38 ` Andrew Morton
  2002-09-09 11:40   ` Ed Tomlinson
@ 2002-09-09 13:10   ` Rik van Riel
  2002-09-09 19:03     ` Andrew Morton
  2002-09-09 22:46   ` Daniel Phillips
  2 siblings, 1 reply; 28+ messages in thread
From: Rik van Riel @ 2002-09-09 13:10 UTC (permalink / raw)
To: Andrew Morton; +Cc: William Lee Irwin III, sfkaplan, linux-mm

On Mon, 9 Sep 2002, Andrew Morton wrote:

> I fiddled with it a bit: did you forget to move the write(2) pages
> to the inactive list?  I changed it to do that at IO completion.
> It had little effect.  Probably should be looking at the page state
> before doing that.

Hmmm indeed, I forgot this.  Note that IO completion state is
too late, since then you'll have already pushed other pages
out to the inactive list...

> The inactive list was smaller with this patch.  Around 10%
> of allocatable memory usually.

It should be a bit bigger than this, I think.  If it isn't
something may be going wrong ;)

> I like the way in which the patch improves the reclaim success rate.
> It went from 50% to 80 or 90%.

That should help reduce the randomizing of the inactive list ;)

> It worries me that the inactive list is so small.  But I need to
> test it more.

It's actually ok, though a larger inactive list might help with
some workloads (or make the system worse with some others?).

> (This patch looks a lot like NRU - what's the difference?)

For mapped pages, it basically is NRU.

For normal cache pages, references while on the active list don't
count, they will still get evicted.  Only references while on the
inactive list can save such a page.

What this means is that (in clock terminology) the handspread for
non-mapped cache pages is much smaller than for mapped pages.  With
an inactive list size of 10%, the handspread for mapped pages is
about 10 times as wide as that for non-mapped pages, giving the
mapped pages a bit of an advantage over the cache...

regards,

Rik
--
Bravely reimplemented by the knights who say "NIH".
http://www.surriel.com/		http://distro.conectiva.com/
Spamtraps of the month:  september@surriel.com trac@trac.org
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 28+ messages in thread
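
(Rough numbers for that handspread argument, purely as an
illustration and not from the thread itself: with the inactive list
at 10% of RAM,

	window(unmapped cache page) ~ inactive list only  ~ 0.1 * RAM
	window(mapped page)         ~ active + inactive   ~ 1.0 * RAM

so a mapped page gets roughly ten times as long to be referenced
before it is reclaimed.)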
* Re: [PATCH] modified segq for 2.5 2002-09-09 13:10 ` Rik van Riel @ 2002-09-09 19:03 ` Andrew Morton 2002-09-09 19:25 ` Rik van Riel 0 siblings, 1 reply; 28+ messages in thread From: Andrew Morton @ 2002-09-09 19:03 UTC (permalink / raw) To: Rik van Riel; +Cc: William Lee Irwin III, sfkaplan, linux-mm Rik van Riel wrote: > > On Mon, 9 Sep 2002, Andrew Morton wrote: > > > I fiddled with it a bit: did you forget to move the write(2) pages > > to the inactive list? I changed it to do that at IO completion. > > It had little effect. Probably should be looking at the page state > > before doing that. > > Hmmm indeed, I forgot this. Note that IO completion state is > too late, since then you'll have already pushed other pages > out to the inactive list... OK. So how would you like to handle those pages? > > The inactive list was smaller with this patch. Around 10% > > of allocatable memory usually. > > It should be a bit bigger than this, I think. If it isn't > something may be going wrong ;) Well the working set _was_ large. Sure, we'll be running refill_inactive a lot. But spending some CPU in there with this sort of workload is the right thing to do, if it ends up in better replacement decisions. So it doesn't seem to be a problem per-se? (It's soaking CPU when the VM isn't adding value which offends me ;)) Generally, where do you want to go with this code? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] modified segq for 2.5 2002-09-09 19:03 ` Andrew Morton @ 2002-09-09 19:25 ` Rik van Riel 2002-09-09 19:55 ` Andrew Morton 2002-09-09 20:51 ` Andrew Morton 0 siblings, 2 replies; 28+ messages in thread From: Rik van Riel @ 2002-09-09 19:25 UTC (permalink / raw) To: Andrew Morton; +Cc: William Lee Irwin III, sfkaplan, linux-mm On Mon, 9 Sep 2002, Andrew Morton wrote: > Rik van Riel wrote: > > On Mon, 9 Sep 2002, Andrew Morton wrote: > > > > > I fiddled with it a bit: did you forget to move the write(2) pages > > > to the inactive list? I changed it to do that at IO completion. > > > It had little effect. Probably should be looking at the page state > > > before doing that. > > > > Hmmm indeed, I forgot this. Note that IO completion state is > > too late, since then you'll have already pushed other pages > > out to the inactive list... > > OK. So how would you like to handle those pages? Move them to the inactive list the moment we're done writing them, that is, the moment we move on to the next page. We wouldn't want to move the last page from /var/log/messages to the inactive list all the time ;) > > > The inactive list was smaller with this patch. Around 10% > > > of allocatable memory usually. > > > > It should be a bit bigger than this, I think. If it isn't > > something may be going wrong ;) > > Well the working set _was_ large. Sure, we'll be running refill_inactive > a lot. But spending some CPU in there with this sort of workload is the > right thing to do, if it ends up in better replacement decisions. So > it doesn't seem to be a problem per-se? OK, in that case there's no problem. If the working set really does take 90% of RAM that's a good thing to know ;) > Generally, where do you want to go with this code? If this code turns out to be more predictable and better or equal performance to use-once, I'd like to see it in the kernel. Use-once seems just too hard to tune right for all workloads. regards, Rik -- Bravely reimplemented by the knights who say "NIH". http://www.surriel.com/ http://distro.conectiva.com/ Spamtraps of the month: september@surriel.com trac@trac.org -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] modified segq for 2.5 2002-09-09 19:25 ` Rik van Riel @ 2002-09-09 19:55 ` Andrew Morton 2002-09-09 20:03 ` Rik van Riel 2002-09-09 20:51 ` Andrew Morton 1 sibling, 1 reply; 28+ messages in thread From: Andrew Morton @ 2002-09-09 19:55 UTC (permalink / raw) To: Rik van Riel; +Cc: William Lee Irwin III, sfkaplan, linux-mm Rik van Riel wrote: > > On Mon, 9 Sep 2002, Andrew Morton wrote: > > Rik van Riel wrote: > > > On Mon, 9 Sep 2002, Andrew Morton wrote: > > > > > > > I fiddled with it a bit: did you forget to move the write(2) pages > > > > to the inactive list? I changed it to do that at IO completion. > > > > It had little effect. Probably should be looking at the page state > > > > before doing that. > > > > > > Hmmm indeed, I forgot this. Note that IO completion state is > > > too late, since then you'll have already pushed other pages > > > out to the inactive list... > > > > OK. So how would you like to handle those pages? > > Move them to the inactive list the moment we're done writing > them, that is, the moment we move on to the next page. We > wouldn't want to move the last page from /var/log/messages to > the inactive list all the time ;) That's easy. > > > > The inactive list was smaller with this patch. Around 10% > > > > of allocatable memory usually. > > > > > > It should be a bit bigger than this, I think. If it isn't > > > something may be going wrong ;) > > > > Well the working set _was_ large. Sure, we'll be running refill_inactive > > a lot. But spending some CPU in there with this sort of workload is the > > right thing to do, if it ends up in better replacement decisions. So > > it doesn't seem to be a problem per-se? > > OK, in that case there's no problem. If the working set > really does take 90% of RAM that's a good thing to know ;) The working set appears to be 100.000% of RAM, hence the wild swings in throughput when you give or take half a meg. > > Generally, where do you want to go with this code? > > If this code turns out to be more predictable and better > or equal performance to use-once, I'd like to see it in > the kernel. Use-once seems just too hard to tune right > for all workloads. > gack. How do we judge that, without waiting a month and measuring the complaint level? (Here I go again). -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] modified segq for 2.5 2002-09-09 19:55 ` Andrew Morton @ 2002-09-09 20:03 ` Rik van Riel 0 siblings, 0 replies; 28+ messages in thread From: Rik van Riel @ 2002-09-09 20:03 UTC (permalink / raw) To: Andrew Morton; +Cc: William Lee Irwin III, sfkaplan, linux-mm On Mon, 9 Sep 2002, Andrew Morton wrote: > > OK, in that case there's no problem. If the working set > > really does take 90% of RAM that's a good thing to know ;) > > The working set appears to be 100.000% of RAM, hence the wild > swings in throughput when you give or take half a meg. In that case some form of load control should kick in, when the working set no longer fits in RAM we should degrade gracefully instead of just breaking down. Implementing load control is not an excercise that should be left to most readers, however ;) > > > Generally, where do you want to go with this code? > > > > If this code turns out to be more predictable and better > > or equal performance to use-once, I'd like to see it in > > the kernel. Use-once seems just too hard to tune right > > for all workloads. > > gack. How do we judge that, without waiting a month and > measuring the complaint level? (Here I go again). Beats me. We have reasoning and trying the thing on our own systems, but there don't seem to be any tools to measure what you want to know... regards, Rik -- Bravely reimplemented by the knights who say "NIH". http://www.surriel.com/ http://distro.conectiva.com/ Spamtraps of the month: september@surriel.com trac@trac.org -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] modified segq for 2.5 2002-09-09 19:25 ` Rik van Riel 2002-09-09 19:55 ` Andrew Morton @ 2002-09-09 20:51 ` Andrew Morton 2002-09-09 20:57 ` Andrew Morton ` (2 more replies) 1 sibling, 3 replies; 28+ messages in thread From: Andrew Morton @ 2002-09-09 20:51 UTC (permalink / raw) To: Rik van Riel; +Cc: William Lee Irwin III, sfkaplan, linux-mm Rik van Riel wrote: > > ... > > > Hmmm indeed, I forgot this. Note that IO completion state is > > > too late, since then you'll have already pushed other pages > > > out to the inactive list... > > > > OK. So how would you like to handle those pages? > > Move them to the inactive list the moment we're done writing > them, that is, the moment we move on to the next page. We > wouldn't want to move the last page from /var/log/messages to > the inactive list all the time ;) The moment "who" has done writing them? Some writeout comes in via shrink_foo() and a ton of writeout comes in via balance_dirty_pages(), pdflush, etc. Do we need to distinguish between the various contexts? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] modified segq for 2.5 2002-09-09 20:51 ` Andrew Morton @ 2002-09-09 20:57 ` Andrew Morton 2002-09-09 21:09 ` Rik van Riel 2002-09-09 22:49 ` William Lee Irwin III 2 siblings, 0 replies; 28+ messages in thread From: Andrew Morton @ 2002-09-09 20:57 UTC (permalink / raw) To: Rik van Riel, William Lee Irwin III, sfkaplan, linux-mm Andrew Morton wrote: > > Rik van Riel wrote: > > > > ... > > > > Hmmm indeed, I forgot this. Note that IO completion state is > > > > too late, since then you'll have already pushed other pages > > > > out to the inactive list... > > > > > > OK. So how would you like to handle those pages? > > > > Move them to the inactive list the moment we're done writing > > them, that is, the moment we move on to the next page. We > > wouldn't want to move the last page from /var/log/messages to > > the inactive list all the time ;) > > The moment "who" has done writing them? Some writeout > comes in via shrink_foo() and a ton of writeout comes in > via balance_dirty_pages(), pdflush, etc. > > Do we need to distinguish between the various contexts? Forget I said that. I added this: --- 2.5.34/fs/mpage.c~segq Mon Sep 9 13:53:25 2002 +++ 2.5.34-akpm/fs/mpage.c Mon Sep 9 13:54:07 2002 @@ -583,10 +583,9 @@ mpage_writepages(struct address_space *m bio = mpage_writepage(bio, page, get_block, &last_block_in_bio, &ret); } - if ((current->flags & PF_MEMALLOC) && - !PageActive(page) && PageLRU(page)) { + if (PageActive(page) && PageLRU(page)) { if (!pagevec_add(&pvec, page)) - pagevec_deactivate_inactive(&pvec); + pagevec_deactivate_active(&pvec); page = NULL; } if (ret == -EAGAIN && page) { @@ -612,7 +611,7 @@ mpage_writepages(struct address_space *m * Leave any remaining dirty pages on ->io_pages */ write_unlock(&mapping->page_lock); - pagevec_deactivate_inactive(&pvec); + pagevec_deactivate_active(&pvec); if (bio) mpage_bio_submit(WRITE, bio); return ret; -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] modified segq for 2.5
  2002-09-09 20:51 ` Andrew Morton
  2002-09-09 20:57   ` Andrew Morton
@ 2002-09-09 21:09   ` Rik van Riel
  2002-09-09 21:52     ` Andrew Morton
  2002-09-09 22:49   ` William Lee Irwin III
  2 siblings, 1 reply; 28+ messages in thread
From: Rik van Riel @ 2002-09-09 21:09 UTC (permalink / raw)
To: Andrew Morton; +Cc: William Lee Irwin III, sfkaplan, linux-mm

On Mon, 9 Sep 2002, Andrew Morton wrote:
> Rik van Riel wrote:

> > Move them to the inactive list the moment we're done writing
> > them, that is, the moment we move on to the next page.  We
>
> The moment "who" has done writing them?  Some writeout
> comes in via shrink_foo() and a ton of writeout comes in
> via balance_dirty_pages(), pdflush, etc.

generic_file_write, once that function moves beyond the last
byte of the page, onto the next page, we can be pretty sure
it's done writing to this page

pages where it always does partial writes, like buffer cache,
database indices, etc... will stay in memory for a longer time.

regards,

Rik
--
Bravely reimplemented by the knights who say "NIH".
http://www.surriel.com/		http://distro.conectiva.com/
Spamtraps of the month:  september@surriel.com trac@trac.org
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: [PATCH] modified segq for 2.5 2002-09-09 21:09 ` Rik van Riel @ 2002-09-09 21:52 ` Andrew Morton 2002-09-09 22:41 ` Rik van Riel 0 siblings, 1 reply; 28+ messages in thread From: Andrew Morton @ 2002-09-09 21:52 UTC (permalink / raw) To: Rik van Riel; +Cc: William Lee Irwin III, sfkaplan, linux-mm Rik van Riel wrote: > > On Mon, 9 Sep 2002, Andrew Morton wrote: > > Rik van Riel wrote: > > > > Move them to the inactive list the moment we're done writing > > > them, that is, the moment we move on to the next page. We > > > > The moment "who" has done writing them? Some writeout > > comes in via shrink_foo() and a ton of writeout comes in > > via balance_dirty_pages(), pdflush, etc. > > generic_file_write, once that function moves beyond the last > byte of the page, onto the next page, we can be pretty sure > it's done writing to this page Oh. So why don't we just start those new pages out on the inactive list? I fear that this change will result in us encountering more dirty pages on the inactive list. It could be that moving then onto the inactive list when IO is started is a good compromise - that will happen pretty darn quick if the system is under dirty pressure anyway. Do we remove the SetPageReferenced() in generic_file_write? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] modified segq for 2.5
  2002-09-09 21:52 ` Andrew Morton
@ 2002-09-09 22:41   ` Rik van Riel
  2002-09-10  0:17     ` Daniel Phillips
  0 siblings, 1 reply; 28+ messages in thread
From: Rik van Riel @ 2002-09-09 22:41 UTC (permalink / raw)
To: Andrew Morton; +Cc: William Lee Irwin III, sfkaplan, linux-mm

On Mon, 9 Sep 2002, Andrew Morton wrote:

> > generic_file_write, once that function moves beyond the last
> > byte of the page, onto the next page, we can be pretty sure
> > it's done writing to this page
>
> Oh.  So why don't we just start those new pages out on the
> inactive list?

I guess that should work, combined with a re-dropping of pages
when we're doing sequential writes.

> I fear that this change will result in us encountering more dirty
> pages on the inactive list.

If that's a problem, something is seriously fucked with the VM ;)

> Do we remove the SetPageReferenced() in generic_file_write?

Good question, I think we'll want to SetPageReferenced() when
we do a partial write but ClearPageReferenced() when we've
"written past the end" of the page.

regards,

Rik
--
Bravely reimplemented by the knights who say "NIH".
http://www.surriel.com/		http://distro.conectiva.com/
Spamtraps of the month:  september@surriel.com trac@trac.org
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 28+ messages in thread
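
(A kernel-style sketch of that rule, for illustration only -- the
offset/bytes names stand in for whatever the write loop actually
uses; deactivate_page() is the helper Rik's patch adds:)

	/* in the per-page write loop, after copying the user data */
	if (offset + bytes >= PAGE_CACHE_SIZE) {
		/* wrote through the end of this page: looks like streaming IO */
		ClearPageReferenced(page);
		deactivate_page(page);
	} else {
		/* partial write (log files, database indices): keep it resident */
		SetPageReferenced(page);
	}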
* Re: [PATCH] modified segq for 2.5 2002-09-09 22:41 ` Rik van Riel @ 2002-09-10 0:17 ` Daniel Phillips 0 siblings, 0 replies; 28+ messages in thread From: Daniel Phillips @ 2002-09-10 0:17 UTC (permalink / raw) To: Rik van Riel, Andrew Morton; +Cc: William Lee Irwin III, sfkaplan, linux-mm On Tuesday 10 September 2002 00:41, Rik van Riel wrote: > On Mon, 9 Sep 2002, Andrew Morton wrote: > > Do we remove the SetPageReferenced() in generic_file_write? > > Good question, I think we'll want to SetPageReferenced() when > we do a partial write but ClearPageReferenced() when we've > "written past the end" of the page. There's no substitute for the real thing: a short delay queue where we treat all references as a single reference. In generic_file_write, a page goes onto this list immediately on instantiation. On exit from the delay queue we unconditionally clear the referenced bit and use the rmap list to discard the pte referenced bits, then move the page to the inactive list. >From there, a second reference will rescue the page to the hot end of the active list. Faulted-in pages, including swapped-in pages, mmaped pages and zeroed anon pages, take the same path as file IO pages. A reminder of why we're going to all this effort in the first place: it's to distinguish automatically between streaming IO and repeated use of data. With the improvements described here, we will additionally be able to detect used-once anon pages, which would include execute-once. Because of readahead, generic_file_read has to work a little differently. Ideally, we'd have a time-ordered readahead list and when the readahead heuristics accidently get too aggressive, we can cannibalize the future end of the list (and pour some cold water on the readahead thingy). A crude approximation of that behavior is just to have a readahead FIFO, and an even cruder approximation is to use the inactive list for this purpose. Unfortunately, the latter is too crude, because not-yet-used-readahead pages have to have a higher priority than just-used pages, otherwise the former will be recovered before the latter, which is not what we want. In any event, each page that passes under the read head of generic_file_read goes to the hot end of the delay queue, and from there behaves just like other kinds of pages. Attention has to be paid to balancing the aggressiveness of readahead against the refill_inactive scanning rate. These move in opposite directions in response to memory pressure. One could argue that program text is inherently more valuable than allocated data or file cache, in which case it may want its own inactive list, so that we can reclaim program text vs other kinds of data at different rates. The relative rates could depend on the relative instantiation rates (which includes the faulting rate and the file IO cache page creation rate). However, I'd like to see how well the crude presumption of equality works, and besides, it's less work that way. (So ignore this paragraph, please.) As far as zones go, the route of least resistance is to make both the delay queue and the readahead list per-zone, and since that means it's also per-node, numa people should like it. On the testing front, one useful cross-check is to determine whether hot spots in code are correctly detected. After running a while under mixed program activity and file IO, we should see that the hot spots as determined by a profiler (or cooked by a test program) have in fact moved to the active list, while initialization code has been evicted. 
All of the above is O(1). -- Daniel -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] modified segq for 2.5 2002-09-09 20:51 ` Andrew Morton 2002-09-09 20:57 ` Andrew Morton 2002-09-09 21:09 ` Rik van Riel @ 2002-09-09 22:49 ` William Lee Irwin III 2002-09-09 22:54 ` Rik van Riel 2 siblings, 1 reply; 28+ messages in thread From: William Lee Irwin III @ 2002-09-09 22:49 UTC (permalink / raw) To: Andrew Morton; +Cc: Rik van Riel, sfkaplan, linux-mm Rik van Riel wrote: >> Move them to the inactive list the moment we're done writing >> them, that is, the moment we move on to the next page. We >> wouldn't want to move the last page from /var/log/messages to >> the inactive list all the time ;) On Mon, Sep 09, 2002 at 01:51:35PM -0700, Andrew Morton wrote: > The moment "who" has done writing them? Some writeout > comes in via shrink_foo() and a ton of writeout comes in > via balance_dirty_pages(), pdflush, etc. > Do we need to distinguish between the various contexts? Ideally some distinction would be nice, even if only to distinguish I/O demanded to be done directly by the workload from background writeback and/or readahead. Cheers, Bill -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] modified segq for 2.5 2002-09-09 22:49 ` William Lee Irwin III @ 2002-09-09 22:54 ` Rik van Riel 2002-09-09 23:32 ` William Lee Irwin III 0 siblings, 1 reply; 28+ messages in thread From: Rik van Riel @ 2002-09-09 22:54 UTC (permalink / raw) To: William Lee Irwin III; +Cc: Andrew Morton, sfkaplan, linux-mm On Mon, 9 Sep 2002, William Lee Irwin III wrote: > Ideally some distinction would be nice, even if only to distinguish I/O > demanded to be done directly by the workload from background writeback > and/or readahead. OK, are we talking about page replacement or does queue scanning have priority over the quality of page replacement ? ;) Rik -- Bravely reimplemented by the knights who say "NIH". http://www.surriel.com/ http://distro.conectiva.com/ Spamtraps of the month: september@surriel.com trac@trac.org -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] modified segq for 2.5
2002-09-09 22:54 ` Rik van Riel
@ 2002-09-09 23:32 ` William Lee Irwin III
2002-09-09 23:53 ` Rik van Riel
0 siblings, 1 reply; 28+ messages in thread
From: William Lee Irwin III @ 2002-09-09 23:32 UTC (permalink / raw)
To: Rik van Riel; +Cc: Andrew Morton, sfkaplan, linux-mm

On Mon, 9 Sep 2002, William Lee Irwin III wrote:
>> Ideally some distinction would be nice, even if only to distinguish I/O
>> demanded to be done directly by the workload from background writeback
>> and/or readahead.

On Mon, Sep 09, 2002 at 07:54:29PM -0300, Rik van Riel wrote:
> OK, are we talking about page replacement, or does queue scanning
> have priority over the quality of page replacement? ;)

This is relatively tangential.  The concern expressed has more to do
with VM writeback starving workload-issued I/O than page replacement.

Cheers,
Bill
* Re: [PATCH] modified segq for 2.5
2002-09-09 23:32 ` William Lee Irwin III
@ 2002-09-09 23:53 ` Rik van Riel
0 siblings, 0 replies; 28+ messages in thread
From: Rik van Riel @ 2002-09-09 23:53 UTC (permalink / raw)
To: William Lee Irwin III; +Cc: Andrew Morton, sfkaplan, linux-mm

On Mon, 9 Sep 2002, William Lee Irwin III wrote:
> On Mon, 9 Sep 2002, William Lee Irwin III wrote:
> >> Ideally some distinction would be nice, even if only to distinguish I/O
> >> demanded to be done directly by the workload from background writeback
> >> and/or readahead.
>
> On Mon, Sep 09, 2002 at 07:54:29PM -0300, Rik van Riel wrote:
> > OK, are we talking about page replacement, or does queue scanning
> > have priority over the quality of page replacement? ;)
>
> This is relatively tangential.  The concern expressed has more to do
> with VM writeback starving workload-issued I/O than page replacement.

If that happens, the asynchronous writeback threshold should be lower.
Maybe we could even tune this dynamically ...

Compromising on page replacement is generally a Bad Idea(tm) because
page faults are expensive, very expensive.

Rik
--
Bravely reimplemented by the knights who say "NIH".
http://www.surriel.com/   http://distro.conectiva.com/
Spamtraps of the month:  september@surriel.com trac@trac.org
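One hypothetical way to "tune this dynamically", sketched below: lower
the asynchronous writeback threshold while the workload keeps stalling
behind VM writeback, and relax it again when the pressure goes away.
The variable and function names, and the feedback rule itself, are
invented for illustration; only the idea of an adjustable async
writeback threshold comes from the mail.

/* % of memory allowed to be dirty before async (pdflush) writeback starts */
static int async_writeback_ratio = 40;

static void tune_async_writeback(int workload_io_stalls_per_sec)
{
	if (workload_io_stalls_per_sec > 10 && async_writeback_ratio > 5)
		async_writeback_ratio--;	/* start writeback sooner */
	else if (workload_io_stalls_per_sec == 0 && async_writeback_ratio < 40)
		async_writeback_ratio++;	/* relax when pressure is gone */
}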
* Re: [PATCH] modified segq for 2.5
2002-09-09 9:38 ` Andrew Morton
2002-09-09 11:40 ` Ed Tomlinson
2002-09-09 13:10 ` Rik van Riel
@ 2002-09-09 22:46 ` Daniel Phillips
2002-09-09 22:58 ` Andrew Morton
2 siblings, 1 reply; 28+ messages in thread
From: Daniel Phillips @ 2002-09-09 22:46 UTC (permalink / raw)
To: Andrew Morton, Rik van Riel; +Cc: William Lee Irwin III, sfkaplan, linux-mm

On Monday 09 September 2002 11:38, Andrew Morton wrote:
> One thing this patch did do was to speed up the initial untar of
> the kernel source - 50 seconds down to 25.  That'll be due to not
> having so much dirt on the inactive list.  The "nonblocking page
> reclaim" code (needs a better name...)

Nonblocking kswapd, no?  Perhaps 'kscand' would be a better name, now.

> ...does that in 18 secs.

Woohoo!  I didn't think it would make *that* much difference, did you
dig into why?

My reason for wanting nonblocking kswapd has always been to be able to
untangle the multiple-simultaneous-scanners mess, which we are now in a
good position to do.  Erm, it never occurred to me it would be as easy
as checking whether the page *might* block and skipping it if so.

--
Daniel
* Re: [PATCH] modified segq for 2.5
2002-09-09 22:46 ` Daniel Phillips
@ 2002-09-09 22:58 ` Andrew Morton
2002-09-09 23:40 ` William Lee Irwin III
2002-09-10 1:50 ` Daniel Phillips
0 siblings, 2 replies; 28+ messages in thread
From: Andrew Morton @ 2002-09-09 22:58 UTC (permalink / raw)
To: Daniel Phillips; +Cc: Rik van Riel, William Lee Irwin III, sfkaplan, linux-mm

Daniel Phillips wrote:
>
> On Monday 09 September 2002 11:38, Andrew Morton wrote:
> > One thing this patch did do was to speed up the initial untar of
> > the kernel source - 50 seconds down to 25.  That'll be due to not
> > having so much dirt on the inactive list.  The "nonblocking page
> > reclaim" code (needs a better name...)
>
> Nonblocking kswapd, no?  Perhaps 'kscand' would be a better name, now.

Well, it blocks still.  But it doesn't block on "this particular
request queue" or on "that particular page ending IO".  It blocks on
"any queue putting back a write request".  Which is basically
equivalent to blocking on "a bunch of pages came clean".

This logic is too global at present.  It really needs to be per-zone,
to fix an oom problem which you-know-who managed to trigger.  All
ZONE_NORMAL is dirty, we keep on getting woken up by IO completion in
ZONE_HIGHMEM, we end up scanning enough ZONE_NORMAL pages to conclude
that we're oom.  (Plus I reduced the maximum-scan-before-oom by 2.5x)

Then again, Bill had twiddled the dirty memory thresholds to permit
12G of dirty ZONE_HIGHMEM.

> > ...does that in 18 secs.
>
> Woohoo!  I didn't think it would make *that* much difference, did you
> dig into why?

That's nuthin.  Some tests are 10-50 times faster.  Tests like trying
to compile something while some other process is doing a bunch of big
file writes.

> My reason for wanting nonblocking kswapd has always been to be able to
> untangle the multiple-simultaneous-scanners mess, which we are now in
> a good position to do.  Erm, it never occurred to me it would be as easy
> as checking whether the page *might* block and skipping it if so.

Skipping is dumb.  It shouldn't have been on that list in the first
place.
* Re: [PATCH] modified segq for 2.5
2002-09-09 22:58 ` Andrew Morton
@ 2002-09-09 23:40 ` William Lee Irwin III
2002-09-10 0:02 ` Andrew Morton
2002-09-10 1:50 ` Daniel Phillips
1 sibling, 1 reply; 28+ messages in thread
From: William Lee Irwin III @ 2002-09-09 23:40 UTC (permalink / raw)
To: Andrew Morton; +Cc: Daniel Phillips, Rik van Riel, sfkaplan, linux-mm

On Mon, Sep 09, 2002 at 03:58:06PM -0700, Andrew Morton wrote:
> This logic is too global at present.  It really needs to be per-zone,
> to fix an oom problem which you-know-who managed to trigger.  All
> ZONE_NORMAL is dirty, we keep on getting woken up by IO completion in
> ZONE_HIGHMEM, we end up scanning enough ZONE_NORMAL pages to conclude
> that we're oom.  (Plus I reduced the maximum-scan-before-oom by 2.5x)
> Then again, Bill had twiddled the dirty memory thresholds
> to permit 12G of dirty ZONE_HIGHMEM.

This seemed to work fine when I just tweaked problem areas to use
__GFP_NOKILL.  mempool was fixed by the __GFP_FS checks, but
generic_file_read(), generic_file_write(), the rest of filemap.c, slab
allocations, and allocating file descriptor tables for poll() and
select() still generated OOM in places where failing the system call
with -ENOMEM looked like a better alternative than shooting tasks.
After doing that, the system did just fine until the disk driver
oopsed.

Given the lack of forward progress on the driver front (basically
nobody we know knows or cares about that device), and with the mempool
issue triggered by bounce buffering already fixed, I've obtained a
replacement and am just chucking the isp1020 out the window.  I'm also
hunting for a (non-Emulex!) FC adapter so I can get more interesting
dbench results from non-clockwork disks.  =)

Cheers,
Bill
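The __GFP_NOKILL idea Bill describes could be sketched roughly as
below: callers that can cleanly return -ENOMEM to userspace opt out of
the OOM killer.  The flag value and the hook function are invented for
illustration; this was an experimental patch, not a mainline
interface.

#define __GFP_NOKILL	0x8000	/* hypothetical: fail the allocation,
				 * don't shoot tasks */

extern void out_of_memory(void);

/* Called from the allocator when reclaim has made no progress.
 * Returns nonzero if the caller should simply fail the allocation
 * (and let the syscall return -ENOMEM) instead of killing a task. */
static int oom_fail_allocation(unsigned int gfp_mask)
{
	if (gfp_mask & __GFP_NOKILL)
		return 1;

	out_of_memory();	/* pick a victim task, as before */
	return 0;
}

A caller such as generic_file_write() would then pass its usual gfp
mask with __GFP_NOKILL or'ed in and propagate -ENOMEM when the
allocation fails.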
* Re: [PATCH] modified segq for 2.5
2002-09-09 23:40 ` William Lee Irwin III
@ 2002-09-10 0:02 ` Andrew Morton
2002-09-10 0:21 ` William Lee Irwin III
0 siblings, 1 reply; 28+ messages in thread
From: Andrew Morton @ 2002-09-10 0:02 UTC (permalink / raw)
To: William Lee Irwin III; +Cc: Daniel Phillips, Rik van Riel, sfkaplan, linux-mm

William Lee Irwin III wrote:
>
> On Mon, Sep 09, 2002 at 03:58:06PM -0700, Andrew Morton wrote:
> > This logic is too global at present.  It really needs to be per-zone,
> > to fix an oom problem which you-know-who managed to trigger.  All
> > ZONE_NORMAL is dirty, we keep on getting woken up by IO completion in
> > ZONE_HIGHMEM, we end up scanning enough ZONE_NORMAL pages to conclude
> > that we're oom.  (Plus I reduced the maximum-scan-before-oom by 2.5x)
> > Then again, Bill had twiddled the dirty memory thresholds
> > to permit 12G of dirty ZONE_HIGHMEM.
>
> This seemed to work fine when I just tweaked problem areas to use
> __GFP_NOKILL.  mempool was fixed by the __GFP_FS checks, but
> generic_file_read(), generic_file_write(), the rest of filemap.c, slab
> allocations, and allocating file descriptor tables for poll() and
> select() still generated OOM in places where failing the system call
> with -ENOMEM looked like a better alternative than shooting tasks.

But clearly there is reclaimable pagecache down there; we just have to
wait for it.  No idea why you'd get an oom on ZONE_HIGHMEM, but when I
have a few more gigs I might be able to say.

Anyway, it's all too much scanning.

You'll probably find that segq helps by accident.  I installed SEGQ
(and the shrink-slab-harder-if-mapped-pages-are-encountered change) on
my desktop here.  Initial indications are that SEGQ kicks butt.
* Re: [PATCH] modified segq for 2.5
2002-09-10 0:02 ` Andrew Morton
@ 2002-09-10 0:21 ` William Lee Irwin III
2002-09-10 1:13 ` Andrew Morton
0 siblings, 1 reply; 28+ messages in thread
From: William Lee Irwin III @ 2002-09-10 0:21 UTC (permalink / raw)
To: Andrew Morton; +Cc: Daniel Phillips, Rik van Riel, sfkaplan, linux-mm

William Lee Irwin III wrote:
>> This seemed to work fine when I just tweaked problem areas to use
>> __GFP_NOKILL.  mempool was fixed by the __GFP_FS checks, but
>> generic_file_read(), generic_file_write(), the rest of filemap.c, slab
>> allocations, and allocating file descriptor tables for poll() and
>> select() still generated OOM in places where failing the system call
>> with -ENOMEM looked like a better alternative than shooting tasks.

On Mon, Sep 09, 2002 at 05:02:31PM -0700, Andrew Morton wrote:
> But clearly there is reclaimable pagecache down there; we just
> have to wait for it.  No idea why you'd get an oom on ZONE_HIGHMEM,
> but when I have a few more gigs I might be able to say.
> Anyway, it's all too much scanning.

Well, there was no swap, and most things were dirty.  Not sure about
the rest.  I was miffed by "something tells it there's no memory and it
shoots tasks instead of returning -ENOMEM to userspace in a syscall?"
Saying "no" to the allocating task seems better than shooting tasks to
me.  out_of_memory() being called too early sounds bad, too, though.

On Mon, Sep 09, 2002 at 05:02:31PM -0700, Andrew Morton wrote:
> You'll probably find that segq helps by accident.  I installed
> SEGQ (and the shrink-slab-harder-if-mapped-pages-are-encountered
> change) on my desktop here.  Initial indications are that SEGQ
> kicks butt.

It seems to be a nice strategy a priori.  It's good to hear initial
indications of the advantages coming out in practice.  Something to
bench soon, for sure.

Cheers,
Bill
* Re: [PATCH] modified segq for 2.5
2002-09-10 0:21 ` William Lee Irwin III
@ 2002-09-10 1:13 ` Andrew Morton
0 siblings, 0 replies; 28+ messages in thread
From: Andrew Morton @ 2002-09-10 1:13 UTC (permalink / raw)
To: William Lee Irwin III; +Cc: Daniel Phillips, Rik van Riel, sfkaplan, linux-mm

William Lee Irwin III wrote:
>
> William Lee Irwin III wrote:
> >> This seemed to work fine when I just tweaked problem areas to use
> >> __GFP_NOKILL.  mempool was fixed by the __GFP_FS checks, but
> >> generic_file_read(), generic_file_write(), the rest of filemap.c, slab
> >> allocations, and allocating file descriptor tables for poll() and
> >> select() still generated OOM in places where failing the system call
> >> with -ENOMEM looked like a better alternative than shooting tasks.
>
> On Mon, Sep 09, 2002 at 05:02:31PM -0700, Andrew Morton wrote:
> > But clearly there is reclaimable pagecache down there; we just
> > have to wait for it.  No idea why you'd get an oom on ZONE_HIGHMEM,
> > but when I have a few more gigs I might be able to say.
> > Anyway, it's all too much scanning.
>
> Well, there was no swap, and most things were dirty.  Not sure about
> the rest.  I was miffed by "something tells it there's no memory and it
> shoots tasks instead of returning -ENOMEM to userspace in a syscall?"
> Saying "no" to the allocating task seems better than shooting tasks to
> me.  out_of_memory() being called too early sounds bad, too, though.

If there is dirty memory or memory under writeback then going oom or
returning NULL is a bug.  It's just a search problem, and not a very
complex one.

Per-zone dirty accounting, per-zone throttling and a separate
known-to-be-unreclaimable list should fix it up.  Give me a few days
to find a motivated moment...
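The per-zone throttling Andrew outlines could take roughly the shape
below.  The fields and helpers (nr_dirty, nr_writeback, writeback_wait,
zone_end_writeback) do not exist on struct zone in 2.5.33; they are
placeholders showing the structure of the fix, namely that a scanner
only sleeps on, and is only woken by, IO completing in the zone it is
actually trying to free.

#include <linux/wait.h>

struct zone_throttle {
	unsigned long nr_dirty;		/* dirty pages in this zone */
	unsigned long nr_writeback;	/* pages under IO in this zone */
	wait_queue_head_t writeback_wait;
};

/* reclaim blocks only on the zone it is trying to free */
static void throttle_on_zone(struct zone_throttle *zt)
{
	unsigned long seen = zt->nr_writeback;

	/* sleep until at least some of this zone's writeback completes */
	if (seen)
		wait_event(zt->writeback_wait, zt->nr_writeback < seen);
}

/* IO completion wakes only waiters for the page's own zone, so a flood
 * of ZONE_HIGHMEM completions no longer convinces a ZONE_NORMAL
 * scanner that it is making progress */
void zone_end_writeback(struct zone_throttle *zt)
{
	zt->nr_writeback--;
	wake_up(&zt->writeback_wait);
}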
* Re: [PATCH] modified segq for 2.5
2002-09-09 22:58 ` Andrew Morton
@ 2002-09-10 1:50 ` Daniel Phillips
2002-09-10 2:02 ` Rik van Riel
1 sibling, 1 reply; 28+ messages in thread
From: Daniel Phillips @ 2002-09-10 1:50 UTC (permalink / raw)
To: Andrew Morton; +Cc: Rik van Riel, William Lee Irwin III, sfkaplan, linux-mm

On Tuesday 10 September 2002 00:58, Andrew Morton wrote:
> Daniel Phillips wrote:
> >
> > On Monday 09 September 2002 11:38, Andrew Morton wrote:
> > > One thing this patch did do was to speed up the initial untar of
> > > the kernel source - 50 seconds down to 25.  That'll be due to not
> > > having so much dirt on the inactive list.  The "nonblocking page
> > > reclaim" code (needs a better name...)
> >
> > Nonblocking kswapd, no?  Perhaps 'kscand' would be a better name, now.
>
> Well, it blocks still.  But it doesn't block on "this particular
> request queue" or on "that particular page ending IO".  It
> blocks on "any queue putting back a write request".  Which is
> basically equivalent to blocking on "a bunch of pages came clean".

It's not that far from being truly nonblocking, which would be a useful
property.  Instead of calling ->writepage, just bump the page to the
front of the pdlist (getting deja vu here).  Move locked pages off to a
locked list and let them rehabilitate themselves asynchronously (since
we can now do lru list moves inside interrupts).  If necessary, fall
back to scanning the locked list for pages that slipped through the
cracks, though it may be possible to make things airtight so that never
happens.

What other ways for kswapd to block are there?  Buffers may be locked;
a similar strategy applies, which is one reason why buffer state should
not be opaque to the vfs.  ->releasepage is a can of worms, at which
I'm looking suspiciously.

> Skipping is dumb.  It shouldn't have been on that list in the
> first place.

Sure, it's not the only way to skin the cat.  Anyway, skipping isn't so
dumb that we haven't been doing it for years.

--
Daniel
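A sketch of the truly nonblocking scan Daniel describes: never call
->writepage or wait for page IO from the scanner itself, just reroute
pages that could block.  The container and helper names (zone_lists,
pdlist, locked_list, reclaim_page) are placeholders, not existing 2.5
code.

#include <linux/mm.h>
#include <linux/list.h>

struct zone_lists {			/* placeholder container */
	struct list_head pdlist;	/* pages for pdflush to write soon */
	struct list_head locked_list;	/* pages parked until IO completes */
};

/* placeholder: actually free a clean, unlocked page */
static void reclaim_page(struct page *page);

static void scan_page_nonblocking(struct page *page, struct zone_lists *z)
{
	if (PageLocked(page)) {
		/* park it; the IO completion path puts it back later */
		list_del(&page->lru);
		list_add(&page->lru, &z->locked_list);
		return;
	}

	if (PageDirty(page)) {
		/* don't call ->writepage and risk blocking; just make
		 * sure pdflush gets to this page sooner */
		list_del(&page->lru);
		list_add(&page->lru, &z->pdlist);
		return;
	}

	reclaim_page(page);		/* clean and unlocked: take it */
}

Since lru list moves can now be done inside interrupts, the IO
completion handler can move a page from locked_list back onto the
inactive list without the scanner's help.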
* Re: [PATCH] modified segq for 2.5
2002-09-10 1:50 ` Daniel Phillips
@ 2002-09-10 2:02 ` Rik van Riel
0 siblings, 0 replies; 28+ messages in thread
From: Rik van Riel @ 2002-09-10 2:02 UTC (permalink / raw)
To: Daniel Phillips; +Cc: Andrew Morton, William Lee Irwin III, sfkaplan, linux-mm

On Tue, 10 Sep 2002, Daniel Phillips wrote:

> > Skipping is dumb.  It shouldn't have been on that list in the
> > first place.
>
> Sure, it's not the only way to skin the cat.  Anyway, skipping isn't so
> dumb that we haven't been doing it for years.

Skipping might even be the correct thing to do, if we leave the pages
on the inactive list in strict LRU order instead of wrapping them over
to the other end of the list...

regards,

Rik
--
Bravely reimplemented by the knights who say "NIH".
http://www.surriel.com/   http://distro.conectiva.com/
Spamtraps of the month:  september@surriel.com trac@trac.org
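The two behaviours Rik contrasts, sketched with placeholder names
(page_is_freeable, reclaim_page, the ROTATE_UNRECLAIMABLE switch):
rotating an unreclaimable page to the other end of the inactive list
makes it look recently used, while skipping it in place preserves
strict LRU order.

#include <linux/mm.h>
#include <linux/list.h>

static int page_is_freeable(struct page *page);	/* placeholder */
static void reclaim_page(struct page *page);	/* placeholder */

static void scan_inactive(struct list_head *inactive_list, int nr_to_scan)
{
	struct list_head *cursor = inactive_list->prev;	/* oldest page */

	while (nr_to_scan-- && cursor != inactive_list) {
		struct page *page = list_entry(cursor, struct page, lru);
		cursor = cursor->prev;		/* next-oldest candidate */

		if (page_is_freeable(page)) {
			reclaim_page(page);
			continue;
		}
#ifdef ROTATE_UNRECLAIMABLE
		/* wrap the page to the hot end: it now looks young */
		list_del(&page->lru);
		list_add(&page->lru, inactive_list);
#else
		/* skip in place: strict LRU order is preserved */
#endif
	}
}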