linux-mm.kvack.org archive mirror
* [PATCH] modified segq for 2.5
@ 2002-08-15 14:24 Rik van Riel
  2002-09-09  9:38 ` Andrew Morton
  0 siblings, 1 reply; 28+ messages in thread
From: Rik van Riel @ 2002-08-15 14:24 UTC (permalink / raw)
  To: William Lee Irwin III; +Cc: sfkaplan, linux-mm, Andrew Morton

Hi,

here is a patch that implements a modified SEGQ replacement
for the 2.5 kernel.

- new pages start out on the active list
- once a page reaches the end of the active list:
  - if it is (mapped && referenced) it goes to the front of the active list
  - otherwise, it gets moved to the front of the inactive list
- linear IO drops pages to the inactive list after it is done with them
- once a page reaches the end of the inactive list:
  - if it is referenced, it goes to the front of the active list
  - otherwise, it is reclaimed

This means that accesses to unmapped pagecache pages while the
page is on the active list are ignored, while accesses to
process pages on the active list are counted.  I hope this
bias will help keep the working set of processes in RAM.
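
Condensed, the two end-of-list decisions look something like the
sketch below (illustrative pseudocode only; the helper names here
are made up, the real code is in the patch that follows):

	/* end of active list: only mapped && referenced survives */
	if (is_mapped(page) && was_referenced(page))
		move_to_active_front(page);
	else
		move_to_inactive_front(page);

	/* end of inactive list: any reference is a second chance */
	if (was_referenced(page))
		move_to_active_front(page);
	else
		reclaim(page);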

(note that the patch was made against 2.5.29, but it should be
trivial to port to newer kernels)

kind regards,

Rik
-- 
Bravely reimplemented by the knights who say "NIH".

http://www.surriel.com/		http://distro.conectiva.com/


# This is a BitKeeper generated patch for the following project:
# Project Name: Linux kernel tree
# This patch format is intended for GNU patch command version 2.5 or higher.
# This patch includes the following deltas:
#	           ChangeSet	1.476   -> 1.477
#	include/linux/swap.h	1.48    -> 1.49
#	      mm/readahead.c	1.13    -> 1.14
#	         mm/vmscan.c	1.85    -> 1.86
#	        mm/filemap.c	1.114   -> 1.115
#	           mm/swap.c	1.17    -> 1.18
#
# The following is the BitKeeper ChangeSet Log
# --------------------------------------------
# 02/07/29	riel@imladris.surriel.com	1.477
# second chance replacement
# --------------------------------------------
#
diff -Nru a/include/linux/swap.h b/include/linux/swap.h
--- a/include/linux/swap.h	Thu Aug 15 11:19:09 2002
+++ b/include/linux/swap.h	Thu Aug 15 11:19:09 2002
@@ -161,6 +161,7 @@
 extern void FASTCALL(lru_cache_del(struct page *));

 extern void FASTCALL(activate_page(struct page *));
+extern void FASTCALL(deactivate_page(struct page *));

 extern void swap_setup(void);

diff -Nru a/mm/filemap.c b/mm/filemap.c
--- a/mm/filemap.c	Thu Aug 15 11:19:09 2002
+++ b/mm/filemap.c	Thu Aug 15 11:19:09 2002
@@ -848,20 +848,11 @@

 /*
  * Mark a page as having seen activity.
- *
- * inactive,unreferenced	->	inactive,referenced
- * inactive,referenced		->	active,unreferenced
- * active,unreferenced		->	active,referenced
  */
 void mark_page_accessed(struct page *page)
 {
-	if (!PageActive(page) && PageReferenced(page)) {
-		activate_page(page);
-		ClearPageReferenced(page);
-		return;
-	} else if (!PageReferenced(page)) {
+	if (!PageReferenced(page))
 		SetPageReferenced(page);
-	}
 }

 /*
diff -Nru a/mm/readahead.c b/mm/readahead.c
--- a/mm/readahead.c	Thu Aug 15 11:19:09 2002
+++ b/mm/readahead.c	Thu Aug 15 11:19:09 2002
@@ -204,6 +204,39 @@
 }

 /*
+ * Since we're less likely to use the pages we've already read than
+ * the pages we're about to read we move the pages from the past
+ * window to the inactive list.
+ */
+static void
+drop_behind(struct file *file, unsigned long offset, pgoff_t size)
+{
+	unsigned long page_idx, lower_limit = 0;
+	struct address_space *mapping;
+	struct page *page;
+
+	/* We're re-using already present data or just started reading. */
+	if (size == -1UL || offset == 0)
+		return;
+
+	mapping = file->f_dentry->d_inode->i_mapping;
+
+	if (offset > size)
+		lower_limit = offset - size;
+
+	read_lock(&mapping->page_lock);
+	for (page_idx = offset; page_idx > lower_limit; page_idx--) {
+		page = radix_tree_lookup(&mapping->page_tree, page_idx);
+
+		if (!page || (!PageActive(page) && !PageReferenced(page)))
+			break;
+
+		deactivate_page(page);
+	}
+	read_unlock(&mapping->page_lock);
+}
+
+/*
  * page_cache_readahead is the main function.  If performs the adaptive
  * readahead window size management and submits the readahead I/O.
  */
@@ -286,6 +319,11 @@
 			ra->prev_page = ra->start;
 			ra->ahead_start = 0;
 			ra->ahead_size = 0;
+			/*
+			 * Drop the pages from the old window into the
+			 * inactive list.
+			 */
+			drop_behind(file, offset, ra->size);
 			/*
 			 * Control now returns, probably to sleep until I/O
 			 * completes against the first ahead page.
diff -Nru a/mm/swap.c b/mm/swap.c
--- a/mm/swap.c	Thu Aug 15 11:19:09 2002
+++ b/mm/swap.c	Thu Aug 15 11:19:09 2002
@@ -53,6 +53,24 @@
 }

 /**
+ * deactivate_page - move an active page to the inactive list.
+ * @page: page to deactivate
+ */
+void deactivate_page(struct page * page)
+{
+	spin_lock(&pagemap_lru_lock);
+	if (PageLRU(page) && PageActive(page)) {
+		del_page_from_active_list(page);
+		add_page_to_inactive_list(page);
+		KERNEL_STAT_INC(pgdeactivate);
+	}
+	spin_unlock(&pagemap_lru_lock);
+
+	if (PageReferenced(page))
+		ClearPageReferenced(page);
+}
+
+/**
  * lru_cache_add: add a page to the page lists
  * @page: the page to add
  */
@@ -60,7 +78,7 @@
 {
 	if (!TestSetPageLRU(page)) {
 		spin_lock(&pagemap_lru_lock);
-		add_page_to_inactive_list(page);
+		add_page_to_active_list(page);
 		spin_unlock(&pagemap_lru_lock);
 	}
 }
diff -Nru a/mm/vmscan.c b/mm/vmscan.c
--- a/mm/vmscan.c	Thu Aug 15 11:19:09 2002
+++ b/mm/vmscan.c	Thu Aug 15 11:19:09 2002
@@ -138,7 +138,7 @@
 		 * the active list.
 		 */
 		pte_chain_lock(page);
-		if (page_referenced(page) && page_mapping_inuse(page)) {
+		if (page_referenced(page)) {
 			del_page_from_inactive_list(page);
 			add_page_to_active_list(page);
 			pte_chain_unlock(page);
@@ -346,7 +346,7 @@
 		KERNEL_STAT_INC(pgscan);

 		pte_chain_lock(page);
-		if (page->pte.chain && page_referenced(page)) {
+		if (page_referenced(page) && page_mapping_inuse(page)) {
 			list_del(&page->lru);
 			list_add(&page->lru, &active_list);
 			pte_chain_unlock(page);


* Re: [PATCH] modified segq for 2.5
  2002-08-15 14:24 [PATCH] modified segq for 2.5 Rik van Riel
@ 2002-09-09  9:38 ` Andrew Morton
  2002-09-09 11:40   ` Ed Tomlinson
                     ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: Andrew Morton @ 2002-09-09  9:38 UTC (permalink / raw)
  To: Rik van Riel; +Cc: William Lee Irwin III, sfkaplan, linux-mm

Rik van Riel wrote:
> 
> Hi,
> 
> here is a patch that implements a modified SEGQ replacement
> for the 2.5 kernel.
> 
> - new pages start out on the active list
> - once a page reaches the end of the active list:
>   - if it is (mapped && referenced) it goes to the front of the active list
>   - otherwise, it gets moved to the front of the inactive list
> - linear IO drops pages to the inactive list after it is done with them
> - once a page reaches the end of the inactive list:
>   - if it is referenced, it goes to the front of the active list
>   - otherwise, it is reclaimed
> 
> This means that accesses to unmapped pagecache pages while the
> page is on the active list are ignored, while accesses to
> process pages on the active list are counted.  I hope this
> bias will help keep the working set of processes in RAM.
> 
> (note that the patch was made against 2.5.29, but it should be
> trivial to port to newer kernels)
> 
>


I ported this up.  The below patch applies with or without my recent
vmscan.c maulings.

I haven't really had time to test it much.  Running `make -j6 dep'
on a setup where userspace has 14M available seems to be in the
operating region.  That's fairly swappy but not ridiculously so.

Didn't seem to make much difference in that particular dot on the
spectrum.  105 seconds all up.   2.4.19 does it in 80 or so, but
I wasn't very careful in making sure that both kernels had the
same available memory - half a meg here or there could make a big
difference.

I fiddled with it a bit:  did you forget to move the write(2) pages
to the inactive list?  I changed it to do that at IO completion.
It had little effect.  Probably should be looking at the page state
before doing that.

One thing this patch did do was to speed up the initial untar of
the kernel source - 50 seconds down to 25.  That'll be due to not
having so much dirt on the inactive list.  The "nonblocking page
reclaim" code (needs a better name...) does that in 18 secs.

The inactive list was smaller with this patch.  Around 10%
of allocatable memory usually.

btw, I've added the `page_mapped()' helper to replace open-coded
testing of page->pte.chain, because with highpte and HIGHMEM_64G
that test is wrong: pte.direct is 64-bit and we need to test all
those bits to see if the page is in pagetables.
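
Roughly like this (a sketch of the idea only; the exact definition
that went in may differ):

	/*
	 * Is the page in anyone's pagetables?  Test the full-width
	 * pte.direct member of the union, so that none of the 64
	 * bits are ignored.
	 */
	static inline int page_mapped(struct page *page)
	{
		return page->pte.direct != 0;
	}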

With nonblocking-vm and slabasap, the test took 150 seconds.
Removing slabasap took it down to 98 seconds.  The slab rework
seemed to leave an extra megabyte average in cache.  Which is not
to say that the algorithms in there are wrong, but perhaps we should
push it a bit harder if there's swapout pressure.

And the fact that a meg makes that much difference indicates that it's
right on the knee of the curve and perhaps not a very interesting test.

I like the way in which the patch improves the reclaim success rate.
It went from 50% to 80 or 90%.

It worries me that the inactive list is so small.  But I need to
test it more.

(This patch looks a lot like NRU - what's the difference?)

 include/linux/mm_inline.h |    9 ++++++++
 include/linux/pagevec.h   |    7 ++++++
 mm/filemap.c              |   14 +++----------
 mm/readahead.c            |   46 +++++++++++++++++++++++++++++++++++++++++++
 mm/rmap.c                 |    4 +++
 mm/swap.c                 |   49 +++++++++++++++++++++++++++++++++++++++++++++-
 mm/vmscan.c               |    8 +++++--
 7 files changed, 124 insertions(+), 13 deletions(-)

--- 2.5.33/mm/filemap.c~segq	Mon Sep  9 01:44:49 2002
+++ 2.5.33-akpm/mm/filemap.c	Mon Sep  9 02:03:48 2002
@@ -24,6 +24,8 @@
 #include <linux/writeback.h>
 #include <linux/pagevec.h>
 #include <linux/security.h>
+#include <linux/mm_inline.h>
+
 /*
  * This is needed for the following functions:
  *  - try_to_release_page
@@ -685,6 +687,7 @@ void end_page_writeback(struct page *pag
 	smp_mb__after_clear_bit(); 
 	if (waitqueue_active(waitqueue))
 		wake_up_all(waitqueue);
+	deactivate_page(page);
 }
 EXPORT_SYMBOL(end_page_writeback);
 
@@ -868,20 +871,11 @@ grab_cache_page_nowait(struct address_sp
 
 /*
  * Mark a page as having seen activity.
- *
- * inactive,unreferenced	->	inactive,referenced
- * inactive,referenced		->	active,unreferenced
- * active,unreferenced		->	active,referenced
  */
 void mark_page_accessed(struct page *page)
 {
-	if (!PageActive(page) && PageReferenced(page)) {
-		activate_page(page);
-		ClearPageReferenced(page);
-		return;
-	} else if (!PageReferenced(page)) {
+	if (!PageReferenced(page))
 		SetPageReferenced(page);
-	}
 }
 
 /*
--- 2.5.33/mm/readahead.c~segq	Mon Sep  9 01:44:49 2002
+++ 2.5.33-akpm/mm/readahead.c	Mon Sep  9 01:44:49 2002
@@ -213,6 +213,45 @@ check_ra_success(struct file_ra_state *r
 }
 
 /*
+ * Since we're less likely to use the pages we've already read than the pages
+ * we're about to read we move the pages from the past window to the inactive
+ * list.
+ */
+static void
+drop_behind(struct address_space *mapping, pgoff_t offset, unsigned long size)
+{
+	unsigned long page_idx;
+	unsigned long lower_limit = 0;
+	struct page *page;
+	struct pagevec pvec;
+
+	/* We're re-using already present data or just started reading. */
+	if (size == -1UL || offset == 0)
+		return;
+
+	if (offset > size)
+		lower_limit = offset - size;
+
+	pagevec_init(&pvec);
+	read_lock(&mapping->page_lock);
+	for (page_idx = offset; page_idx > lower_limit; page_idx--) {
+		page = radix_tree_lookup(&mapping->page_tree, page_idx);
+
+		if (!page || (!PageActive(page) && !PageReferenced(page)))
+			break;
+
+		page_cache_get(page);
+		if (!pagevec_add(&pvec, page)) {
+			read_unlock(&mapping->page_lock);
+			__pagevec_deactivate_active(&pvec);
+			read_lock(&mapping->page_lock);
+		}
+	}
+	read_unlock(&mapping->page_lock);
+	pagevec_deactivate_active(&pvec);
+}
+
+/*
  * page_cache_readahead is the main function.  If performs the adaptive
  * readahead window size management and submits the readahead I/O.
  */
@@ -296,6 +335,13 @@ void page_cache_readahead(struct file *f
 			ra->ahead_start = 0;
 			ra->ahead_size = 0;
 			/*
+			 * Drop the pages from the old window into the
+			 * inactive list.
+			 */
+			drop_behind(file->f_dentry->d_inode->i_mapping,
+					offset, ra->size);
+
+			/*
 			 * Control now returns, probably to sleep until I/O
 			 * completes against the first ahead page.
 			 * When the second page in the old ahead window is
--- 2.5.33/include/linux/pagevec.h~segq	Mon Sep  9 01:44:49 2002
+++ 2.5.33-akpm/include/linux/pagevec.h	Mon Sep  9 01:44:49 2002
@@ -18,6 +18,7 @@ void __pagevec_release(struct pagevec *p
 void __pagevec_release_nonlru(struct pagevec *pvec);
 void __pagevec_free(struct pagevec *pvec);
 void __pagevec_lru_add(struct pagevec *pvec);
+void __pagevec_deactivate_active(struct pagevec *pvec);
 void lru_add_drain(void);
 void pagevec_deactivate_inactive(struct pagevec *pvec);
 void pagevec_strip(struct pagevec *pvec);
@@ -69,3 +70,9 @@ static inline void pagevec_lru_add(struc
 	if (pagevec_count(pvec))
 		__pagevec_lru_add(pvec);
 }
+
+static inline void pagevec_deactivate_active(struct pagevec *pvec)
+{
+	if (pagevec_count(pvec))
+		__pagevec_deactivate_active(pvec);
+}
--- 2.5.33/mm/swap.c~segq	Mon Sep  9 01:44:49 2002
+++ 2.5.33-akpm/mm/swap.c	Mon Sep  9 01:44:49 2002
@@ -196,6 +196,38 @@ void pagevec_deactivate_inactive(struct 
 }
 
 /*
+ * Move all the active pages to the head of the inactive list and release them.
+ * Reinitialises the caller's pagevec.
+ */
+void __pagevec_deactivate_active(struct pagevec *pvec)
+{
+	int i;
+	struct zone *zone = NULL;
+
+	for (i = 0; i < pagevec_count(pvec); i++) {
+		struct page *page = pvec->pages[i];
+		struct zone *pagezone = page_zone(page);
+
+		if (pagezone != zone) {
+			if (!PageActive(page) || !PageLRU(page))
+				continue;
+			if (zone)
+				spin_unlock_irq(&zone->lru_lock);
+			zone = pagezone;
+			spin_lock_irq(&zone->lru_lock);
+		}
+		if (PageActive(page) && PageLRU(page)) {
+			del_page_from_active_list(zone, page);
+			ClearPageActive(page);
+			add_page_to_inactive_list(zone, page);
+		}
+	}
+	if (zone)
+		spin_unlock_irq(&zone->lru_lock);
+	__pagevec_release(pvec);
+}
+
+/*
  * Add the passed pages to the inactive_list, then drop the caller's refcount
  * on them.  Reinitialises the caller's pagevec.
  */
@@ -216,7 +248,8 @@ void __pagevec_lru_add(struct pagevec *p
 		}
 		if (TestSetPageLRU(page))
 			BUG();
-		add_page_to_inactive_list(zone, page);
+		add_page_to_active_list(zone, page);
+		SetPageActive(page);
 	}
 	if (zone)
 		spin_unlock_irq(&zone->lru_lock);
@@ -240,6 +273,20 @@ void pagevec_strip(struct pagevec *pvec)
 	}
 }
 
+void __deactivate_page(struct page *page)
+{
+	struct zone *zone = page_zone(page);
+	unsigned long flags;
+
+	spin_lock_irqsave(&zone->lru_lock, flags);
+	if (PageLRU(page) && PageActive(page)) {
+		del_page_from_active_list(zone, page);
+		ClearPageActive(page);
+		add_page_to_inactive_list(zone, page);
+	}
+	spin_unlock_irqrestore(&zone->lru_lock, flags);
+}
+
 /*
  * Perform any setup for the swap system
  */
--- 2.5.33/mm/vmscan.c~segq	Mon Sep  9 01:44:49 2002
+++ 2.5.33-akpm/mm/vmscan.c	Mon Sep  9 01:44:49 2002
@@ -126,7 +126,7 @@ shrink_list(struct list_head *page_list,
 		}
 
 		pte_chain_lock(page);
-		if (page_referenced(page) && page_mapping_inuse(page)) {
+		if (page_referenced(page)) {
 			/* In active use or really unfreeable.  Activate it. */
 			pte_chain_unlock(page);
 			goto activate_locked;
@@ -411,9 +411,13 @@ refill_inactive_zone(struct zone *zone, 
 	while (!list_empty(&l_hold)) {
 		page = list_entry(l_hold.prev, struct page, lru);
 		list_del(&page->lru);
+		if (TestClearPageReferenced(page)) {
+			list_add(&page->lru, &l_active);
+			continue;
+		}
 		if (page_mapped(page)) {
 			pte_chain_lock(page);
-			if (page_mapped(page) && page_referenced(page)) {
+			if (page_referenced(page) && page_mapping_inuse(page)) {
 				pte_chain_unlock(page);
 				list_add(&page->lru, &l_active);
 				continue;
--- 2.5.33/mm/rmap.c~segq	Mon Sep  9 01:44:49 2002
+++ 2.5.33-akpm/mm/rmap.c	Mon Sep  9 01:44:49 2002
@@ -125,6 +125,9 @@ int page_referenced(struct page * page)
 	if (TestClearPageReferenced(page))
 		referenced++;
 
+	if (!page_mapped(page))
+		goto out;
+
 	if (PageDirect(page)) {
 		pte_t *pte = rmap_ptep_map(page->pte.direct);
 		if (ptep_test_and_clear_young(pte))
@@ -158,6 +161,7 @@ int page_referenced(struct page * page)
 			pte_chain_free(pc);
 		}
 	}
+out:
 	return referenced;
 }
 
--- 2.5.33/include/linux/mm_inline.h~segq	Mon Sep  9 01:44:49 2002
+++ 2.5.33-akpm/include/linux/mm_inline.h	Mon Sep  9 01:44:49 2002
@@ -38,3 +38,12 @@ del_page_from_lru(struct zone *zone, str
 		zone->nr_inactive--;
 	}
 }
+
+
+void __deactivate_page(struct page *page);
+
+static inline void deactivate_page(struct page *page)
+{
+	if (PageLRU(page) && PageActive(page))
+		__deactivate_page(page);
+}


* Re: [PATCH] modified segq for 2.5
  2002-09-09  9:38 ` Andrew Morton
@ 2002-09-09 11:40   ` Ed Tomlinson
  2002-09-09 17:10     ` William Lee Irwin III
  2002-09-09 18:58     ` Andrew Morton
  2002-09-09 13:10   ` Rik van Riel
  2002-09-09 22:46   ` Daniel Phillips
  2 siblings, 2 replies; 28+ messages in thread
From: Ed Tomlinson @ 2002-09-09 11:40 UTC (permalink / raw)
  To: Andrew Morton, Rik van Riel; +Cc: William Lee Irwin III, sfkaplan, linux-mm

On September 9, 2002 05:38 am, Andrew Morton wrote:

> With nonblocking-vm and slabasap, the test took 150 seconds.
> Removing slabasap took it down to 98 seconds.  The slab rework
> seemed to leave an extra megabyte average in cache.  Which is not
> to say that the algorithms in there are wrong, but perhaps we should
> push it a bit harder if there's swapout pressure.

Andrew, one simple change that will make slabasap try harder is to
use only inactive pages when calculating the ratio.
unsigned int nr_used_zone_pages(void)
{
        unsigned int pages = 0;
        struct zone *zone;

        for_each_zone(zone)
                pages += zone->nr_inactive;

        return pages;
}

This will make it closer to slablru which used the inactive list.

Second item.  Do you run gkrelmon when doing your tests?  If not, please
install it and watch it slowly start to eat resources.  This morning (uptime
12hr) it was using 31% of CPU.  Stopping and starting it did not change this.
I think we have something we can improve here.  I have included an strace
of one (and a bit) update cycle.

This was with 33-mm5 with your variant of slabasap.

Ed

open("/proc/meminfo", O_RDONLY)         = 6
fstat64(6, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x4001d000
read(6, "MemTotal:       516920 kB\nMemFre"..., 1024) = 491
read(6, "", 1024)                       = 0
close(6)                                = 0
munmap(0x4001d000, 4096)                = 0
gettimeofday({1031571076, 678996}, NULL) = 0
write(3, ">\2\7\0\30\2`\2\375\1`\2\35\0`\2\0\0%\0\0\0%\0P\0\3\0>"..., 1956) = 1956
ioctl(3, 0x541b, [0])                   = 0
poll([{fd=3, events=POLLIN}, {fd=4, events=POLLIN}], 2, 262) = 0
gettimeofday({1031571076, 945260}, NULL) = 0
time([1031571076])                      = 1031571076
open("/proc/stat", O_RDONLY)            = 6
fstat64(6, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x4001d000
read(6, "cpu  418635 1309463 315263 22714"..., 1024) = 591
close(6)                                = 0
munmap(0x4001d000, 4096)                = 0
open("/proc/loadavg", O_RDONLY)         = 6
fstat64(6, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x4001d000
read(6, "3.27 2.32 3.38 3/132 14540\n", 1024) = 27
close(6)                                = 0
munmap(0x4001d000, 4096)                = 0
open("/proc/net/dev", O_RDONLY)         = 6
fstat64(6, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x4001d000
read(6, "Inter-|   Receive               "..., 1024) = 938
read(6, "", 1024)                       = 0
close(6)                                = 0
munmap(0x4001d000, 4096)                = 0
gettimeofday({1031571076, 949176}, NULL) = 0
write(3, "F\2\5\0\213\0`\2$\0`\2\0\0\0\0\5\0\6\0>\0\7\0\211\0`\2"..., 784) = 784
ioctl(3, 0x541b, [0])                   = 0
poll([{fd=3, events=POLLIN}, {fd=4, events=POLLIN}], 2, 473) = 0
gettimeofday({1031571077, 424287}, NULL) = 0
time([1031571077])                      = 1031571077
open("/proc/stat", O_RDONLY)            = 6
fstat64(6, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x4001d000
read(6, "cpu  418639 1309506 315264 22714"..., 1024) = 591
close(6)                                = 0
munmap(0x4001d000, 4096)                = 0
open("/proc/loadavg", O_RDONLY)         = 6
fstat64(6, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x4001d000
read(6, "3.27 2.32 3.38 2/132 14540\n", 1024) = 27
close(6)                                = 0
munmap(0x4001d000, 4096)                = 0
write(3, "8\2\5\0!\0`\2\4@\0\0\0\0\0\0)\0`\2J\0\5\0m\0`\2!\0`\2"..., 2048) = 2048
open("/proc/net/tcp", O_RDONLY)         = 6
fstat64(6, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x4001d000
read(6, "  sl  local_address rem_address "..., 1024) = 1024
read(6, "                         \n   6: "..., 1024) = 1024
read(6, "dc00040 3000 0 0 2 -1           "..., 1024) = 1024
read(6, "000     0        0 6460 1 da6dfc"..., 1024) = 1024
read(6, "00000000 00:00000000 00000000  1"..., 1024) = 1024
read(6, "0100007F:8001 01 00000000:000000"..., 1024) = 1024
read(6, "     \n  40: 0100007F:866E 010000"..., 1024) = 1024
read(6, "-1                             \n", 1024) = 32
read(6, "", 1024)                       = 0
close(6)                                = 0
munmap(0x4001d000, 4096)                = 0
open("/proc/net/tcp6", O_RDONLY)        = -1 ENOENT (No such file or directory)
open("/proc/net/dev", O_RDONLY)         = 6
fstat64(6, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x4001d000
read(6, "Inter-|   Receive               "..., 1024) = 938
read(6, "", 1024)                       = 0
close(6)                                = 0
munmap(0x4001d000, 4096)                = 0
open("/proc/net/route", O_RDONLY)       = 6
fstat64(6, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x4001d000
read(6, "Iface\tDestination\tGateway \tFlags"..., 1024) = 512
read(6, "", 1024)                       = 0
close(6)                                = 0
munmap(0x4001d000, 4096)                = 0
time(NULL)                              = 1031571077
open("/proc/meminfo", O_RDONLY)         = 6
fstat64(6, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x4001d000
read(6, "MemTotal:       516920 kB\nMemFre"..., 1024) = 491
read(6, "", 1024)                       = 0
close(6)                                = 0
munmap(0x4001d000, 4096)                = 0
open("/proc/mounts", O_RDONLY)          = 6
fstat64(6, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x4001d000
read(6, "rootfs / rootfs rw 0 0\n/dev/root"..., 1024) = 314
read(6, "", 1024)                       = 0
close(6)                                = 0
munmap(0x4001d000, 4096)                = 0
statfs("/", {f_type="REISERFS_SUPER_MAGIC", f_bsize=4096, f_blocks=786466, f_bfree=120154, f_files=4294967295, f_ffree=4294967295, f_namelen=255}) = 0
statfs("/poola", {f_type="REISERFS_SUPER_MAGIC", f_bsize=4096, f_blocks=2477941, f_bfree=892388, f_files=4294967295, f_ffree=4294967295, f_namelen=255}) = 0
statfs("/poole", {f_type="REISERFS_SUPER_MAGIC", f_bsize=4096, f_blocks=8870498, f_bfree=2468598, f_files=4294967295, f_ffree=4294967295, f_namelen=255}) = 0
statfs("/boot", {f_type="EXT2_SUPER_MAGIC", f_bsize=1024, f_blocks=63925, f_bfree=21000, f_files=16560, f_ffree=14904, f_namelen=255}) = 0
statfs("/tmp", {f_type=0x1021994, f_bsize=4096, f_blocks=192000, f_bfree=191685, f_files=64615, f_ffree=64593, f_namelen=255}) = 0
statfs("/poolg", {f_type="REISERFS_SUPER_MAGIC", f_bsize=4096, f_blocks=8870624, f_bfree=2371206, f_files=4294967295, f_ffree=4294967295, f_namelen=255}) = 0
statfs("/root2", {f_type="EXT2_SUPER_MAGIC", f_bsize=4096, f_blocks=774823, f_bfree=137303, f_files=393600, f_ffree=261747, f_namelen=255}) = 0
gettimeofday({1031571077, 639770}, NULL) = 0
write(3, "8\2\5\0!\0`\2\4@\0\0\0\0\0\0\'\0`\2J\0\5\0k\1`\2!\0`\2"..., 1900) = 1900
ioctl(3, 0x541b, [0])                   = 0
poll([{fd=3, events=POLLIN}, {fd=4, events=POLLIN}], 2, 260) = 0
gettimeofday({1031571077, 916658}, NULL) = 0
time([1031571077])                      = 1031571077
open("/proc/stat", O_RDONLY)            = 6
fstat64(6, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x4001d000
read(6, "cpu  418649 1309524 315285 22714"..., 1024) = 591
close(6)                                = 0
munmap(0x4001d000, 4096)                = 0
open("/proc/loadavg", O_RDONLY)         = 6
fstat64(6, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x4001d000
read(6, "3.27 2.32 3.38 4/132 14540\n", 1024) = 27
close(6)                                = 0
munmap(0x4001d000, 4096)                = 0
open("/proc/net/dev", O_RDONLY)         = 6
fstat64(6, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x4001d000
read(6, "Inter-|   Receive               "..., 1024) = 938
read(6, "", 1024)                       = 0
close(6)                                = 0
munmap(0x4001d000, 4096)                = 0
gettimeofday({1031571077, 920415}, NULL) = 0
write(3, "F\2\5\0\213\0`\2$\0`\2\0\0\0\0\5\0\6\0>\0\7\0\211\0`\2"..., 192) = 192
ioctl(3, 0x541b, [0])                   = 0
poll([{fd=3, events=POLLIN}, {fd=4, events=POLLIN}], 2, 473) = 0
gettimeofday({1031571078, 396278}, NULL) = 0
time([1031571078])                      = 1031571078
open("/proc/stat", O_RDONLY)            = 6
fstat64(6, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x4001d000
read(6, "cpu  418653 1309567 315286 22714"..., 1024) = 591
close(6)                                = 0
munmap(0x4001d000, 4096)                = 0
open("/proc/loadavg", O_RDONLY)         = 6
fstat64(6, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x4001d000
read(6, "3.27 2.32 3.38 3/132 14540\n", 1024) = 27
close(6)                                = 0
munmap(0x4001d000, 4096)                = 0
write(3, "8\2\5\0!\0`\2\4@\0\0\0\0\0\0)\0`\2J\0\5\0m\0`\2!\0`\2"..., 2048) = 2048
open("/proc/net/tcp", O_RDONLY)         = 6
fstat64(6, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x4001d000
read(6, "  sl  local_address rem_address "..., 1024) = 1024
read(6, "                         \n   6: "..., 1024) = 1024
read(6, "dc00040 3000 0 0 2 -1           "..., 1024) = 1024
read(6, "000     0        0 6460 1 da6dfc"..., 1024) = 1024
read(6, "00000000 00:00000000 00000000  1"..., 1024) = 1024
read(6, "0100007F:8001 01 00000000:000000"..., 1024) = 1024
read(6, "     \n  40: 0100007F:866E 010000"..., 1024) = 1024
read(6, "-1                             \n", 1024) = 32
read(6, "", 1024)                       = 0
close(6)                                = 0
munmap(0x4001d000, 4096)                = 0
open("/proc/net/tcp6", O_RDONLY)        = -1 ENOENT (No such file or directory)
open("/proc/net/dev", O_RDONLY)         = 6
fstat64(6, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x4001d000
read(6, "Inter-|   Receive               "..., 1024) = 938
read(6, "", 1024)                       = 0
close(6)                                = 0
munmap(0x4001d000, 4096)                = 0
writev(3, [{"8\2\5\0!\0`\2\4@\0\0\0\0\0\0\'\0`\2J\0\5\0k\1`\2!\0`\2"..., 2048}, {"\227\320\357\0", 4}], 2) = 2052
open("/proc/net/route", O_RDONLY)       = 6
fstat64(6, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x4001d000
read(6, "Iface\tDestination\tGateway \tFlags"..., 1024) = 512
read(6, "", 1024)                       = 0
close(6)                                = 0
munmap(0x4001d000, 4096)                = 0
time(NULL)                              = 1031571078
open("/proc/meminfo", O_RDONLY)         = 6
fstat64(6, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x4001d000
read(6, "MemTotal:       516920 kB\nMemFre"..., 1024) = 491
read(6, "", 1024)                       = 0
close(6)                                = 0
munmap(0x4001d000, 4096)                = 0
gettimeofday({1031571078, 614278}, NULL) = 0
write(3, "J\2\5\0\320\2`\2!\0`\2\2\0\f\0\1\0000\0>\0\7\0\320\2`\2"..., 404) = 404
ioctl(3, 0x541b, [0])                   = 0
poll([{fd=3, events=POLLIN}, {fd=4, events=POLLIN}], 2, 258) = 0
gettimeofday({1031571078, 875241}, NULL) = 0
time([1031571078])                      = 1031571078
open("/proc/stat", O_RDONLY)            = 6
fstat64(6, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x4001d000
read(6, "cpu  418657 1309592 315306 22714"..., 1024) = 591
close(6)                                = 0
munmap(0x4001d000, 4096)                = 0
open("/proc/loadavg", O_RDONLY)         = 6
fstat64(6, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x4001d000
read(6, "3.27 2.32 3.38 2/132 14540\n", 1024) = 27
close(6)                                = 0
munmap(0x4001d000, 4096)                = 0
open("/proc/net/dev", O_RDONLY)         = 6
fstat64(6, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x4001d000
read(6, "Inter-|   Receive               "..., 1024) = 938
read(6, "", 1024)                       = 0
close(6)                                = 0
munmap(0x4001d000, 4096)                = 0
gettimeofday({1031571078, 879754}, NULL) = 0
write(3, "F\2\5\0\213\0`\2$\0`\2\0\0\0\0\5\0\6\0>\0\7\0\211\0`\2"..., 700) = 700
ioctl(3, 0x541b, [0])                   = 0
poll( <unfinished ...>


* Re: [PATCH] modified segq for 2.5
  2002-09-09  9:38 ` Andrew Morton
  2002-09-09 11:40   ` Ed Tomlinson
@ 2002-09-09 13:10   ` Rik van Riel
  2002-09-09 19:03     ` Andrew Morton
  2002-09-09 22:46   ` Daniel Phillips
  2 siblings, 1 reply; 28+ messages in thread
From: Rik van Riel @ 2002-09-09 13:10 UTC (permalink / raw)
  To: Andrew Morton; +Cc: William Lee Irwin III, sfkaplan, linux-mm

On Mon, 9 Sep 2002, Andrew Morton wrote:

> I fiddled with it a bit:  did you forget to move the write(2) pages
> to the inactive list?  I changed it to do that at IO completion.
> It had little effect.  Probably should be looking at the page state
> before doing that.

Hmmm indeed, I forgot this.  Note that IO completion state is
too late, since then you'll have already pushed other pages
out to the inactive list...

> The inactive list was smaller with this patch.  Around 10%
> of allocatable memory usually.

It should be a bit bigger than this, I think.  If it isn't
something may be going wrong ;)

> I like the way in which the patch improves the reclaim success rate.
> It went from 50% to 80 or 90%.

That should help reduce the randomizing of the inactive list ;)

> It worries me that the inactive list is so small.  But I need to
> test it more.

It's actually ok, though a larger inactive list might help with
some workloads (or make the system worse with some others?).

> (This patch looks a lot like NRU - what's the difference?)

For mapped pages, it basically is NRU.  For normal cache pages,
references while on the active list don't count, they will still
get evicted. Only references while on the inactive list can save
such a page.

What this means is that (in clock terminology) the handspread
for non-mapped cache pages is much smaller than for mapped pages.
With an inactive list size of 10%, the handspread for mapped pages
is about 10 times as wide as that for non-mapped pages, giving the
mapped pages a bit of an advantage over the cache...
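
To put rough numbers on that (using the 10% figure above): with 100
pages of RAM and a 10-page inactive list, an unmapped cache page has
to be re-referenced within the ~10-page inactive window to survive,
while a mapped page effectively gets the whole ~100-page sweep of
active plus inactive before it can be reclaimed.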

regards,

Rik
-- 
Bravely reimplemented by the knights who say "NIH".

http://www.surriel.com/		http://distro.conectiva.com/

Spamtraps of the month:  september@surriel.com trac@trac.org


* Re: [PATCH] modified segq for 2.5
  2002-09-09 11:40   ` Ed Tomlinson
@ 2002-09-09 17:10     ` William Lee Irwin III
  2002-09-09 18:58     ` Andrew Morton
  1 sibling, 0 replies; 28+ messages in thread
From: William Lee Irwin III @ 2002-09-09 17:10 UTC (permalink / raw)
  To: Ed Tomlinson; +Cc: Andrew Morton, Rik van Riel, sfkaplan, linux-mm

On Mon, Sep 09, 2002 at 07:40:16AM -0400, Ed Tomlinson wrote:
> Second item.  Do you run gkrelmon when doing your tests?  If not, please
> install it and watch it slowly start to eat resources.  This morning (uptime 
> I think we have something we can improve here.  I have included an strace
> of one (and a bit) update cycle.
> This was with 33-mm5 with your variant of slabasap.

strace -r to get relative timestamps.  I've seen some issues where tasks
suck progressively more cpu over time and the box gets unusable, leading
most notably to 30+s or longer fork/exit latencies.  Still no idea what's
going wrong when it does, though.


Cheers,
Bill

* Re: [PATCH] modified segq for 2.5
  2002-09-09 11:40   ` Ed Tomlinson
  2002-09-09 17:10     ` William Lee Irwin III
@ 2002-09-09 18:58     ` Andrew Morton
  1 sibling, 0 replies; 28+ messages in thread
From: Andrew Morton @ 2002-09-09 18:58 UTC (permalink / raw)
  To: Ed Tomlinson; +Cc: Rik van Riel, William Lee Irwin III, sfkaplan, linux-mm

Ed Tomlinson wrote:
> 
> On September 9, 2002 05:38 am, Andrew Morton wrote:
> 
> > With nonblocking-vm and slabasap, the test took 150 seconds.
> > Removing slabasap took it down to 98 seconds.  The slab rework
> > seemed to leave an extra megabyte average in cache.  Which is not
> > to say that the algorithms in there are wrong, but perhaps we should
> > push it a bit harder if there's swapout pressure.
> 
> Andrew, one simple change that will make slabasap try harder is to
> use only inactive pages when calculating the ratio.
> 
> unsigned int nr_used_zone_pages(void)
> {
>         unsigned int pages = 0;
>         struct zone *zone;
> 
>         for_each_zone(zone)
>                 pages += zone->nr_inactive;
> 
>         return pages;
> }
> 
> This will make it closer to slablru which used the inactive list.

hmm.  Well, if we are to be honest about the "account for seeks" thing
then perhaps we should double-count swap activity - a swapout plus
a swapin is two units of seekiness.  So consider add_to_swap()
to be worth two page scans.  Maybe the same for swap_writepage().

That should increase pressure on slab when anon pages are being
victimised.  Ditto for dirty MAP_SHARED, I guess.
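
Something along these lines, say (illustrative only - the
KERNEL_STAT_ADD macro and the choice of counter are assumptions,
not code from any posted patch):

	/* in add_to_swap(), and perhaps swap_writepage(): a swapout
	 * implies an eventual swapin, so charge two units of seek
	 * cost toward the pressure that drives slab reclaim */
	KERNEL_STAT_ADD(pgscan, 2);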

> Second item.  Do you run gkrelmon when doing your tests?  If not, please
> install it and watch it slowly start to eat resources.  This morning (uptime
> 12hr) it was using 31% of CPU.  Stopping and starting it did not change this.
> I think we have something we can improve here.  I have included an strace
> of one (and a bit) update cycle.

I was running gkrellm for a while.  Is that the same thing?  I didn't
see anything untoward in there.  It seems to update at 10Hz or more,
so it's fairly expensive.  But no obvious increase in load across time.

It seems that the CPU load accounting in 2.5 is a bit odd; perhaps
as a result of the HZ changes.  Certainly it is hard to make comparisons
with 2.4 based upon it.  Probably one needs to equalise the HZ settings
to make a useful comparison.

Anyway.  Could you please run the kernel profiler, see where the time
is being spent?  Just add `profile=1' to the kernel boot line and
use this:

readprofile -r
sleep 30
readprofile -v -m /boot/System.map | sort -n +2 | tail -40

(If readprofile screws up, edit your System.map and remove all
the lines containing " w " and " W ")

* Re: [PATCH] modified segq for 2.5
  2002-09-09 13:10   ` Rik van Riel
@ 2002-09-09 19:03     ` Andrew Morton
  2002-09-09 19:25       ` Rik van Riel
  0 siblings, 1 reply; 28+ messages in thread
From: Andrew Morton @ 2002-09-09 19:03 UTC (permalink / raw)
  To: Rik van Riel; +Cc: William Lee Irwin III, sfkaplan, linux-mm

Rik van Riel wrote:
> 
> On Mon, 9 Sep 2002, Andrew Morton wrote:
> 
> > I fiddled with it a bit:  did you forget to move the write(2) pages
> > to the inactive list?  I changed it to do that at IO completion.
> > It had little effect.  Probably should be looking at the page state
> > before doing that.
> 
> Hmmm indeed, I forgot this.  Note that IO completion state is
> too late, since then you'll have already pushed other pages
> out to the inactive list...

OK.  So how would you like to handle those pages?

> > The inactive list was smaller with this patch.  Around 10%
> > of allocatable memory usually.
> 
> It should be a bit bigger than this, I think.  If it isn't
> something may be going wrong ;)

Well the working set _was_ large.  Sure, we'll be running refill_inactive
a lot.  But spending some CPU in there with this sort of workload is the
right thing to do, if it ends up in better replacement decisions.  So
it doesn't seem to be a problem per-se?

(It's soaking CPU when the VM isn't adding value which offends me ;))


Generally, where do you want to go with this code?

* Re: [PATCH] modified segq for 2.5
  2002-09-09 19:03     ` Andrew Morton
@ 2002-09-09 19:25       ` Rik van Riel
  2002-09-09 19:55         ` Andrew Morton
  2002-09-09 20:51         ` Andrew Morton
  0 siblings, 2 replies; 28+ messages in thread
From: Rik van Riel @ 2002-09-09 19:25 UTC (permalink / raw)
  To: Andrew Morton; +Cc: William Lee Irwin III, sfkaplan, linux-mm

On Mon, 9 Sep 2002, Andrew Morton wrote:
> Rik van Riel wrote:
> > On Mon, 9 Sep 2002, Andrew Morton wrote:
> >
> > > I fiddled with it a bit:  did you forget to move the write(2) pages
> > > to the inactive list?  I changed it to do that at IO completion.
> > > It had little effect.  Probably should be looking at the page state
> > > before doing that.
> >
> > Hmmm indeed, I forgot this.  Note that IO completion state is
> > too late, since then you'll have already pushed other pages
> > out to the inactive list...
>
> OK.  So how would you like to handle those pages?

Move them to the inactive list the moment we're done writing
them, that is, the moment we move on to the next page. We
wouldn't want to move the last page from /var/log/messages to
the inactive list all the time ;)

> > > The inactive list was smaller with this patch.  Around 10%
> > > of allocatable memory usually.
> >
> > It should be a bit bigger than this, I think.  If it isn't
> > something may be going wrong ;)
>
> Well the working set _was_ large.  Sure, we'll be running refill_inactive
> a lot.  But spending some CPU in there with this sort of workload is the
> right thing to do, if it ends up in better replacement decisions.  So
> it doesn't seem to be a problem per-se?

OK, in that case there's no problem.  If the working set
really does take 90% of RAM that's a good thing to know ;)

> Generally, where do you want to go with this code?

If this code turns out to be more predictable and better
or equal performance to use-once, I'd like to see it in
the kernel.  Use-once seems just too hard to tune right
for all workloads.

regards,

Rik
-- 
Bravely reimplemented by the knights who say "NIH".

http://www.surriel.com/		http://distro.conectiva.com/

Spamtraps of the month:  september@surriel.com trac@trac.org


* Re: [PATCH] modified segq for 2.5
  2002-09-09 19:25       ` Rik van Riel
@ 2002-09-09 19:55         ` Andrew Morton
  2002-09-09 20:03           ` Rik van Riel
  2002-09-09 20:51         ` Andrew Morton
  1 sibling, 1 reply; 28+ messages in thread
From: Andrew Morton @ 2002-09-09 19:55 UTC (permalink / raw)
  To: Rik van Riel; +Cc: William Lee Irwin III, sfkaplan, linux-mm

Rik van Riel wrote:
> 
> On Mon, 9 Sep 2002, Andrew Morton wrote:
> > Rik van Riel wrote:
> > > On Mon, 9 Sep 2002, Andrew Morton wrote:
> > >
> > > > I fiddled with it a bit:  did you forget to move the write(2) pages
> > > > to the inactive list?  I changed it to do that at IO completion.
> > > > It had little effect.  Probably should be looking at the page state
> > > > before doing that.
> > >
> > > Hmmm indeed, I forgot this.  Note that IO completion state is
> > > too late, since then you'll have already pushed other pages
> > > out to the inactive list...
> >
> > OK.  So how would you like to handle those pages?
> 
> Move them to the inactive list the moment we're done writing
> them, that is, the moment we move on to the next page. We
> wouldn't want to move the last page from /var/log/messages to
> the inactive list all the time ;)

That's easy.

> > > > The inactive list was smaller with this patch.  Around 10%
> > > > of allocatable memory usually.
> > >
> > > It should be a bit bigger than this, I think.  If it isn't
> > > something may be going wrong ;)
> >
> > Well the working set _was_ large.  Sure, we'll be running refill_inactive
> > a lot.  But spending some CPU in there with this sort of workload is the
> > right thing to do, if it ends up in better replacement decisions.  So
> > it doesn't seem to be a problem per-se?
> 
> OK, in that case there's no problem.  If the working set
> really does take 90% of RAM that's a good thing to know ;)

The working set appears to be 100.000% of RAM, hence the wild
swings in throughput when you give or take half a meg.
 
> > Generally, where do you want to go with this code?
> 
> If this code turns out to be more predictable and better
> or equal performance to use-once, I'd like to see it in
> the kernel.  Use-once seems just too hard to tune right
> for all workloads.
> 

gack.  How do we judge that, without waiting a month and
measuring the complaint level?  (Here I go again).

* Re: [PATCH] modified segq for 2.5
  2002-09-09 19:55         ` Andrew Morton
@ 2002-09-09 20:03           ` Rik van Riel
  0 siblings, 0 replies; 28+ messages in thread
From: Rik van Riel @ 2002-09-09 20:03 UTC (permalink / raw)
  To: Andrew Morton; +Cc: William Lee Irwin III, sfkaplan, linux-mm

On Mon, 9 Sep 2002, Andrew Morton wrote:

> > OK, in that case there's no problem.  If the working set
> > really does take 90% of RAM that's a good thing to know ;)
>
> The working set appears to be 100.000% of RAM, hence the wild
> swings in throughput when you give or take half a meg.

In that case some form of load control should kick in:
when the working set no longer fits in RAM we should
degrade gracefully instead of just breaking down.

Implementing load control is not an exercise that
should be left to most readers, however ;)

> > > Generally, where do you want to go with this code?
> >
> > If this code turns out to be more predictable and better
> > or equal performance to use-once, I'd like to see it in
> > the kernel.  Use-once seems just too hard to tune right
> > for all workloads.
>
> gack.  How do we judge that, without waiting a month and
> measuring the complaint level?  (Here I go again).

Beats me. We have reasoning and trying the thing on our own
systems, but there don't seem to be any tools to measure
what you want to know...

regards,

Rik
-- 
Bravely reimplemented by the knights who say "NIH".

http://www.surriel.com/		http://distro.conectiva.com/

Spamtraps of the month:  september@surriel.com trac@trac.org


* Re: [PATCH] modified segq for 2.5
  2002-09-09 19:25       ` Rik van Riel
  2002-09-09 19:55         ` Andrew Morton
@ 2002-09-09 20:51         ` Andrew Morton
  2002-09-09 20:57           ` Andrew Morton
                             ` (2 more replies)
  1 sibling, 3 replies; 28+ messages in thread
From: Andrew Morton @ 2002-09-09 20:51 UTC (permalink / raw)
  To: Rik van Riel; +Cc: William Lee Irwin III, sfkaplan, linux-mm

Rik van Riel wrote:
> 
> ...
> > > Hmmm indeed, I forgot this.  Note that IO completion state is
> > > too late, since then you'll have already pushed other pages
> > > out to the inactive list...
> >
> > OK.  So how would you like to handle those pages?
> 
> Move them to the inactive list the moment we're done writing
> them, that is, the moment we move on to the next page. We
> wouldn't want to move the last page from /var/log/messages to
> the inactive list all the time ;)

The moment "who" has done writing them?  Some writeout
comes in via shrink_foo() and a ton of writeout comes in
via balance_dirty_pages(), pdflush, etc.

Do we need to distinguish between the various contexts?

* Re: [PATCH] modified segq for 2.5
  2002-09-09 20:51         ` Andrew Morton
@ 2002-09-09 20:57           ` Andrew Morton
  2002-09-09 21:09           ` Rik van Riel
  2002-09-09 22:49           ` William Lee Irwin III
  2 siblings, 0 replies; 28+ messages in thread
From: Andrew Morton @ 2002-09-09 20:57 UTC (permalink / raw)
  To: Rik van Riel, William Lee Irwin III, sfkaplan, linux-mm

Andrew Morton wrote:
> 
> Rik van Riel wrote:
> >
> > ...
> > > > Hmmm indeed, I forgot this.  Note that IO completion state is
> > > > too late, since then you'll have already pushed other pages
> > > > out to the inactive list...
> > >
> > > OK.  So how would you like to handle those pages?
> >
> > Move them to the inactive list the moment we're done writing
> > them, that is, the moment we move on to the next page. We
> > wouldn't want to move the last page from /var/log/messages to
> > the inactive list all the time ;)
> 
> The moment "who" has done writing them?  Some writeout
> comes in via shrink_foo() and a ton of writeout comes in
> via balance_dirty_pages(), pdflush, etc.
> 
> Do we need to distinguish between the various contexts?

Forget I said that.

I added this:

--- 2.5.34/fs/mpage.c~segq	Mon Sep  9 13:53:25 2002
+++ 2.5.34-akpm/fs/mpage.c	Mon Sep  9 13:54:07 2002
@@ -583,10 +583,9 @@ mpage_writepages(struct address_space *m
 				bio = mpage_writepage(bio, page, get_block,
 						&last_block_in_bio, &ret);
 			}
-			if ((current->flags & PF_MEMALLOC) &&
-					!PageActive(page) && PageLRU(page)) {
+			if (PageActive(page) && PageLRU(page)) {
 				if (!pagevec_add(&pvec, page))
-					pagevec_deactivate_inactive(&pvec);
+					pagevec_deactivate_active(&pvec);
 				page = NULL;
 			}
 			if (ret == -EAGAIN && page) {
@@ -612,7 +611,7 @@ mpage_writepages(struct address_space *m
 	 * Leave any remaining dirty pages on ->io_pages
 	 */
 	write_unlock(&mapping->page_lock);
-	pagevec_deactivate_inactive(&pvec);
+	pagevec_deactivate_active(&pvec);
 	if (bio)
 		mpage_bio_submit(WRITE, bio);
 	return ret;

* Re: [PATCH] modified segq for 2.5
  2002-09-09 20:51         ` Andrew Morton
  2002-09-09 20:57           ` Andrew Morton
@ 2002-09-09 21:09           ` Rik van Riel
  2002-09-09 21:52             ` Andrew Morton
  2002-09-09 22:49           ` William Lee Irwin III
  2 siblings, 1 reply; 28+ messages in thread
From: Rik van Riel @ 2002-09-09 21:09 UTC (permalink / raw)
  To: Andrew Morton; +Cc: William Lee Irwin III, sfkaplan, linux-mm

On Mon, 9 Sep 2002, Andrew Morton wrote:
> Rik van Riel wrote:

> > Move them to the inactive list the moment we're done writing
> > them, that is, the moment we move on to the next page. We
>
> The moment "who" has done writing them?  Some writeout
> comes in via shrink_foo() and a ton of writeout comes in
> via balance_dirty_pages(), pdflush, etc.

generic_file_write: once that function moves beyond the last
byte of the page, onto the next page, we can be pretty sure
it's done writing to this page.

Pages where it always does partial writes, like buffer cache,
database indices, etc. will stay in memory for a longer time.
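
A sketch of where that could hook into the generic_file_write loop
(hypothetical placement and variable names, not part of any posted
patch; here `offset' is the offset within the page and `bytes' the
amount written in this iteration):

	/* after commit_write(): if the write ran all the way to the
	 * end of the page we are streaming past it, so push it
	 * toward the inactive list */
	if (offset + bytes == PAGE_CACHE_SIZE) {
		ClearPageReferenced(page);
		deactivate_page(page);
	}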

regards,

Rik
-- 
Bravely reimplemented by the knights who say "NIH".

http://www.surriel.com/		http://distro.conectiva.com/

Spamtraps of the month:  september@surriel.com trac@trac.org


* Re: [PATCH] modified segq for 2.5
  2002-09-09 21:09           ` Rik van Riel
@ 2002-09-09 21:52             ` Andrew Morton
  2002-09-09 22:41               ` Rik van Riel
  0 siblings, 1 reply; 28+ messages in thread
From: Andrew Morton @ 2002-09-09 21:52 UTC (permalink / raw)
  To: Rik van Riel; +Cc: William Lee Irwin III, sfkaplan, linux-mm

Rik van Riel wrote:
> 
> On Mon, 9 Sep 2002, Andrew Morton wrote:
> > Rik van Riel wrote:
> 
> > > Move them to the inactive list the moment we're done writing
> > > them, that is, the moment we move on to the next page. We
> >
> > The moment "who" has done writing them?  Some writeout
> > comes in via shrink_foo() and a ton of writeout comes in
> > via balance_dirty_pages(), pdflush, etc.
> 
> generic_file_write: once that function moves beyond the last
> byte of the page, onto the next page, we can be pretty sure
> it's done writing to this page.

Oh.  So why don't we just start those new pages out on the
inactive list?

I fear that this change will result in us encountering more dirty
pages on the inactive list.  It could be that moving them onto the
inactive list when IO is started is a good compromise - that will
happen pretty darn quick if the system is under dirty pressure
anyway.

Do we remove the SetPageReferenced() in generic_file_write?

* Re: [PATCH] modified segq for 2.5
  2002-09-09 21:52             ` Andrew Morton
@ 2002-09-09 22:41               ` Rik van Riel
  2002-09-10  0:17                 ` Daniel Phillips
  0 siblings, 1 reply; 28+ messages in thread
From: Rik van Riel @ 2002-09-09 22:41 UTC (permalink / raw)
  To: Andrew Morton; +Cc: William Lee Irwin III, sfkaplan, linux-mm

On Mon, 9 Sep 2002, Andrew Morton wrote:

> > generic_file_write: once that function moves beyond the last
> > byte of the page, onto the next page, we can be pretty sure
> > it's done writing to this page.
>
> Oh.  So why don't we just start those new pages out on the
> inactive list?

I guess that should work, combined with a re-dropping of
pages when we're doing sequential writes.

> I fear that this change will result in us encountering more dirty
> pages on the inactive list.

If that's a problem, something is seriously fucked with
the VM ;)

> Do we remove the SetPageReferenced() in generic_file_write?

Good question, I think we'll want to SetPageReferenced() when
we do a partial write but ClearPageReferenced() when we've
"written past the end" of the page.

regards,

Rik
-- 
Bravely reimplemented by the knights who say "NIH".

http://www.surriel.com/		http://distro.conectiva.com/

Spamtraps of the month:  september@surriel.com trac@trac.org


* Re: [PATCH] modified segq for 2.5
  2002-09-09  9:38 ` Andrew Morton
  2002-09-09 11:40   ` Ed Tomlinson
  2002-09-09 13:10   ` Rik van Riel
@ 2002-09-09 22:46   ` Daniel Phillips
  2002-09-09 22:58     ` Andrew Morton
  2 siblings, 1 reply; 28+ messages in thread
From: Daniel Phillips @ 2002-09-09 22:46 UTC (permalink / raw)
  To: Andrew Morton, Rik van Riel; +Cc: William Lee Irwin III, sfkaplan, linux-mm

On Monday 09 September 2002 11:38, Andrew Morton wrote:
> One thing this patch did do was to speed up the initial untar of
> the kernel source - 50 seconds down to 25.  That'll be due to not
> having so much dirt on the inactive list.  The "nonblocking page
> reclaim" code (needs a better name...)

Nonblocking kswapd, no?  Perhaps 'kscand' would be a better name, now.

> ...does that in 18 secs.

Woohoo!  I didn't think it would make *that* much difference, did you
dig into why?

My reason for wanting nonblocking kswapd has always been to be able to
untangle the multiple-simultaneous-scanners mess, which we are now in
a good position to do.  Erm, it never occurred to me it would be as easy
as checking whether the page *might* block and skipping it if so.

-- 
Daniel

* Re: [PATCH] modified segq for 2.5
  2002-09-09 20:51         ` Andrew Morton
  2002-09-09 20:57           ` Andrew Morton
  2002-09-09 21:09           ` Rik van Riel
@ 2002-09-09 22:49           ` William Lee Irwin III
  2002-09-09 22:54             ` Rik van Riel
  2 siblings, 1 reply; 28+ messages in thread
From: William Lee Irwin III @ 2002-09-09 22:49 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Rik van Riel, sfkaplan, linux-mm

Rik van Riel wrote:
>> Move them to the inactive list the moment we're done writing
>> them, that is, the moment we move on to the next page. We
>> wouldn't want to move the last page from /var/log/messages to
>> the inactive list all the time ;)

On Mon, Sep 09, 2002 at 01:51:35PM -0700, Andrew Morton wrote:
> The moment "who" is done writing them?  Some writeout
> comes in via shrink_foo() and a ton of writeout comes in
> via balance_dirty_pages(), pdflush, etc.
> Do we need to distinguish between the various contexts?

Ideally some distinction would be nice, even if only to distinguish I/O
demanded to be done directly by the workload from background writeback
and/or readahead.


Cheers,
Bill

* Re: [PATCH] modified segq for 2.5
  2002-09-09 22:49           ` William Lee Irwin III
@ 2002-09-09 22:54             ` Rik van Riel
  2002-09-09 23:32               ` William Lee Irwin III
  0 siblings, 1 reply; 28+ messages in thread
From: Rik van Riel @ 2002-09-09 22:54 UTC (permalink / raw)
  To: William Lee Irwin III; +Cc: Andrew Morton, sfkaplan, linux-mm

On Mon, 9 Sep 2002, William Lee Irwin III wrote:

> Ideally some distinction would be nice, even if only to distinguish I/O
> demanded to be done directly by the workload from background writeback
> and/or readahead.

OK, are we talking about page replacement, or does queue scanning
have priority over the quality of page replacement? ;)


Rik
-- 
Bravely reimplemented by the knights who say "NIH".

http://www.surriel.com/		http://distro.conectiva.com/

Spamtraps of the month:  september@surriel.com trac@trac.org


* Re: [PATCH] modified segq for 2.5
  2002-09-09 22:46   ` Daniel Phillips
@ 2002-09-09 22:58     ` Andrew Morton
  2002-09-09 23:40       ` William Lee Irwin III
  2002-09-10  1:50       ` Daniel Phillips
  0 siblings, 2 replies; 28+ messages in thread
From: Andrew Morton @ 2002-09-09 22:58 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Rik van Riel, William Lee Irwin III, sfkaplan, linux-mm

Daniel Phillips wrote:
> 
> On Monday 09 September 2002 11:38, Andrew Morton wrote:
> > One thing this patch did do was to speed up the initial untar of
> > the kernel source - 50 seconds down to 25.  That'll be due to not
> > having so much dirt on the inactive list.  The "nonblocking page
> > reclaim" code (needs a better name...)
> 
> Nonblocking kswapd, no?  Perhaps 'kscand' would be a better name, now.

Well, it blocks still.  But it doesn't block on "this particular
request queue" or on "that particular page ending IO".  It
blocks on "any queue putting back a write request".   Which is
basically equivalent to blocking on "a bunch of pages came clean".
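
In sketch form (the names are invented, this is not what's in the
tree; the block layer would kick the wakeup whenever any queue takes
back a write request):

	static DECLARE_WAIT_QUEUE_HEAD(write_congestion_wq);

	/* block layer: a queue just put back a write request */
	void note_write_request_putback(void)
	{
		if (waitqueue_active(&write_congestion_wq))
			wake_up(&write_congestion_wq);
	}

	/* reclaim: wait for "a bunch of pages came clean" */
	void wait_for_writeback_progress(long timeout)
	{
		DECLARE_WAITQUEUE(wait, current);

		add_wait_queue(&write_congestion_wq, &wait);
		set_current_state(TASK_UNINTERRUPTIBLE);
		schedule_timeout(timeout);
		remove_wait_queue(&write_congestion_wq, &wait);
	}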

This logic is too global at present.  It really needs to be per-zone,
to fix an oom problem which you-know-who managed to trigger.  All
ZONE_NORMAL is dirty, we keep on getting woken up by IO completion in
ZONE_HIGHMEM, we end up scanning enough ZONE_NORMAL pages to conclude
that we're oom.  (Plus I reduced the maximum-scan-before-oom by 2.5x)

Then again, Bill had twiddled the dirty memory thresholds
to permit 12G of dirty ZONE_HIGHMEM.

> > ...does that in 18 secs.
> 
> Woohoo!  I didn't think it would make *that* much difference; did you
> dig into why?

That's nuthin.  Some tests are 10-50 times faster.  Tests like
trying to compile something while some other process is doing a
bunch of big file writes.
 
> My reason for wanting nonblocking kswapd has always been to be able to
> untangle the multiple-simultaneous-scanners mess, which we are now in
> a good position to do.  Erm, it never occurred to me it would be as easy
> as checking whether the page *might* block and skipping it if so.
> 

Skipping is dumb.  It shouldn't have been on that list in the
first place.

* Re: [PATCH] modified segq for 2.5
  2002-09-09 22:54             ` Rik van Riel
@ 2002-09-09 23:32               ` William Lee Irwin III
  2002-09-09 23:53                 ` Rik van Riel
  0 siblings, 1 reply; 28+ messages in thread
From: William Lee Irwin III @ 2002-09-09 23:32 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Andrew Morton, sfkaplan, linux-mm

On Mon, 9 Sep 2002, William Lee Irwin III wrote:
>> Ideally some distinction would be nice, even if only to distinguish I/O
>> demanded to be done directly by the workload from background writeback
>> and/or readahead.

On Mon, Sep 09, 2002 at 07:54:29PM -0300, Rik van Riel wrote:
> OK, are we talking about page replacement, or does queue scanning
> have priority over the quality of page replacement? ;)

This is relatively tangential. The concern expressed has more to do
with VM writeback starving workload-issued I/O than page replacement.


Cheers,
Bill

* Re: [PATCH] modified segq for 2.5
  2002-09-09 22:58     ` Andrew Morton
@ 2002-09-09 23:40       ` William Lee Irwin III
  2002-09-10  0:02         ` Andrew Morton
  2002-09-10  1:50       ` Daniel Phillips
  1 sibling, 1 reply; 28+ messages in thread
From: William Lee Irwin III @ 2002-09-09 23:40 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Daniel Phillips, Rik van Riel, sfkaplan, linux-mm

On Mon, Sep 09, 2002 at 03:58:06PM -0700, Andrew Morton wrote:
> This logic is too global at present.  It really needs to be per-zone,
> to fix an oom problem which you-know-who managed to trigger.  All
> ZONE_NORMAL is dirty, we keep on getting woken up by IO completion in
> ZONE_HIGHMEM, we end up scanning enough ZONE_NORMAL pages to conclude
> that we're oom.  (Plus I reduced the maximum-scan-before-oom by 2.5x)
> Then again, Bill had twiddled the dirty memory thresholds
> to permit 12G of dirty ZONE_HIGHMEM.

This seemed to work fine when I just tweaked problem areas to use
__GFP_NOKILL. mempool was fixed by the __GFP_FS checks, but
generic_file_read(), generic_file_write(), the rest of filemap.c,
slab allocations, and allocating file descriptor tables for poll() and
select() appeared to generate OOM when it seemed to me that failing
system calls with -ENOMEM was a better alternative than shooting tasks.

After doing that, the system was able to do just fine until the disk
driver oopsed. Given the lack of forward progress on the driver front
due to basically nobody we know knowing or caring about that device
and the mempool issue triggered by bounce buffering already being fixed,
I've obtained a replacement and am just chucking the isp1020 out the
window. I'm also hunting for a (non-Emulex!) FC adapter so I can get
more interesting dbench results from non-clockwork disks. =)


Cheers,
Bill

* Re: [PATCH] modified segq for 2.5
  2002-09-09 23:32               ` William Lee Irwin III
@ 2002-09-09 23:53                 ` Rik van Riel
  0 siblings, 0 replies; 28+ messages in thread
From: Rik van Riel @ 2002-09-09 23:53 UTC (permalink / raw)
  To: William Lee Irwin III; +Cc: Andrew Morton, sfkaplan, linux-mm

On Mon, 9 Sep 2002, William Lee Irwin III wrote:
> On Mon, 9 Sep 2002, William Lee Irwin III wrote:
> >> Ideally some distinction would be nice, even if only to distinguish I/O
> >> demanded to be done directly by the workload from background writeback
> >> and/or readahead.
>
> On Mon, Sep 09, 2002 at 07:54:29PM -0300, Rik van Riel wrote:
> > OK, are we talking about page replacement, or does queue scanning
> > have priority over the quality of page replacement? ;)
>
> This is relatively tangential. The concern expressed has more to do
> with VM writeback starving workload-issued I/O than page replacement.

If that happens, the asynchronous writeback threshold should be
lower. Maybe we could even tune this dynamically ...

Compromising on page replacement is generally a Bad Idea(tm) because
page faults are expensive, very expensive.

Rik
-- 
Bravely reimplemented by the knights who say "NIH".

http://www.surriel.com/		http://distro.conectiva.com/

Spamtraps of the month:  september@surriel.com trac@trac.org


* Re: [PATCH] modified segq for 2.5
  2002-09-09 23:40       ` William Lee Irwin III
@ 2002-09-10  0:02         ` Andrew Morton
  2002-09-10  0:21           ` William Lee Irwin III
  0 siblings, 1 reply; 28+ messages in thread
From: Andrew Morton @ 2002-09-10  0:02 UTC (permalink / raw)
  To: William Lee Irwin III; +Cc: Daniel Phillips, Rik van Riel, sfkaplan, linux-mm

William Lee Irwin III wrote:
> 
> On Mon, Sep 09, 2002 at 03:58:06PM -0700, Andrew Morton wrote:
> > This logic is too global at present.  It really needs to be per-zone,
> > to fix an oom problem which you-know-who managed to trigger.  All
> > ZONE_NORMAL is dirty, we keep on getting woken up by IO completion in
> > ZONE_HIGHMEM, we end up scanning enough ZONE_NORMAL pages to conclude
> > that we're oom.  (Plus I reduced the maximum-scan-before-oom by 2.5x)
> > Then again, Bill had twiddled the dirty memory thresholds
> > to permit 12G of dirty ZONE_HIGHMEM.
> 
> This seemed to work fine when I just tweaked problem areas to use
> __GFP_NOKILL. mempool was fixed by the __GFP_FS checks, but
> generic_file_read(), generic_file_write(), the rest of filemap.c,
> slab allocations, and allocating file descriptor tables for poll() and
> select() appeared to generate OOM when it appeared to me that failing
> system calls with -ENOMEM was a better alternative than shooting tasks.

But clearly there is reclaimable pagecache down there; we just
have to wait for it.  No idea why you'd get an oom on ZONE_HIGHMEM,
but when I have a few more gigs I might be able to say.

Anyway, it's all too much scanning.

You'll probably find that segq helps by accident.  I installed
SEGQ (and the shrink-slab-harder-if-mapped-pages-are-encountered)
on my desktop here.  Initial indications are that SEGQ kicks butt.

* Re: [PATCH] modified segq for 2.5
  2002-09-09 22:41               ` Rik van Riel
@ 2002-09-10  0:17                 ` Daniel Phillips
  0 siblings, 0 replies; 28+ messages in thread
From: Daniel Phillips @ 2002-09-10  0:17 UTC (permalink / raw)
  To: Rik van Riel, Andrew Morton; +Cc: William Lee Irwin III, sfkaplan, linux-mm

On Tuesday 10 September 2002 00:41, Rik van Riel wrote:
> On Mon, 9 Sep 2002, Andrew Morton wrote:
> > Do we remove the SetPageReferenced() in generic_file_write?
> 
> Good question, I think we'll want to SetPageReferenced() when
> we do a partial write but ClearPageReferenced() when we've
> "written past the end" of the page.

There's no substitute for the real thing: a short delay queue where we treat 
all references as a single reference.  In generic_file_write, a page goes 
onto this list immediately on instantiation.  On exit from the delay queue we 
unconditionally clear the referenced bit and use the rmap list to discard the 
pte referenced bits, then move the page to the inactive list.

From there, a second reference will rescue the page to the hot end of the
active list.  Faulted-in pages, including swapped-in pages, mmaped pages and 
zeroed anon pages, take the same path as file IO pages.
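
As a sketch of the exit path (zone->delay_list and the per-zone
layout are assumptions, and page_referenced() here stands in for
whatever rmap walk tests and clears the pte bits):

	static void delay_queue_exit(struct zone *zone)
	{
		struct page *page = list_entry(zone->delay_list.prev,
					       struct page, lru);

		list_del(&page->lru);
		ClearPageReferenced(page);	/* forget the window's refs */
		page_referenced(page);		/* rmap walk clears pte bits */
		list_add(&page->lru, &zone->inactive_list);
	}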

A reminder of why we're going to all this effort in the first place: it's to 
distinguish automatically between streaming IO and repeated use of data.  
With the improvements described here, we will additionally be able to detect 
used-once anon pages, which would include execute-once.

Because of readahead, generic_file_read has to work a little differently.  
Ideally, we'd have a time-ordered readahead list and when the readahead 
heuristics accidentally get too aggressive, we can cannibalize the future end
of the list (and pour some cold water on the readahead thingy).  A crude 
approximation of that behavior is just to have a readahead FIFO, and an even 
cruder approximation is to use the inactive list for this purpose.  
Unfortunately, the latter is too crude, because not-yet-used-readahead pages 
have to have a higher priority than just-used pages, otherwise the former 
will be recovered before the latter, which is not what we want.

In any event, each page that passes under the read head of generic_file_read 
goes to the hot end of the delay queue, and from there behaves just like 
other kinds of pages.

Attention has to be paid to balancing the aggressiveness of readahead against 
the refill_inactive scanning rate.  These move in opposite directions in 
response to memory pressure.

One could argue that program text is inherently more valuable than allocated 
data or file cache, in which case it may want its own inactive list, so that 
we can reclaim program text vs other kinds of data at different rates.  The 
relative rates could depend on the relative instantiation rates (which 
includes the faulting rate and the file IO cache page creation rate).  
However, I'd like to see how well the crude presumption of equality works, 
and besides, it's less work that way.  (So ignore this paragraph, please.)

As far as zones go, the route of least resistance is to make both the delay 
queue and the readahead list per-zone, and since that means it's also 
per-node, numa people should like it.

On the testing front, one useful cross-check is to determine whether hot 
spots in code are correctly detected.  After running a while under mixed 
program activity and file IO, we should see that the hot spots as determined 
by a profiler (or cooked by a test program) have in fact moved to the active 
list, while initialization code has been evicted.

All of the above is O(1).

-- 
Daniel

* Re: [PATCH] modified segq for 2.5
  2002-09-10  0:02         ` Andrew Morton
@ 2002-09-10  0:21           ` William Lee Irwin III
  2002-09-10  1:13             ` Andrew Morton
  0 siblings, 1 reply; 28+ messages in thread
From: William Lee Irwin III @ 2002-09-10  0:21 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Daniel Phillips, Rik van Riel, sfkaplan, linux-mm

William Lee Irwin III wrote:
>> This seemed to work fine when I just tweaked problem areas to use
>> __GFP_NOKILL. mempool was fixed by the __GFP_FS checks, but
>> generic_file_read(), generic_file_write(), the rest of filemap.c,
>> slab allocations, and allocating file descriptor tables for poll() and
>> select() appeared to generate OOM when it seemed to me that failing
>> system calls with -ENOMEM was a better alternative than shooting tasks.

On Mon, Sep 09, 2002 at 05:02:31PM -0700, Andrew Morton wrote:
> But clearly there is reclaimable pagecache down there; we just
> have to wait for it.  No idea why you'd get an oom on ZONE_HIGHMEM,
> but when I have a few more gigs I might be able to say.
> Anyway, it's all too much scanning.

Well, there was no swap, and most things were dirty. Not sure about the
rest. I was miffed by "Something tells it there's no memory and it
shoots tasks instead of returning -ENOMEM to userspace in a syscall?"
Saying "no" to the task allocating seems better than shooting tasks to
me. out_of_memory() being called too early sounds bad, too, though.


On Mon, Sep 09, 2002 at 05:02:31PM -0700, Andrew Morton wrote:
> You'll probably find that segq helps by accident.  I installed
> SEGQ (and the shrink-slab-harder-if-mapped-pages-are-encountered)
> on my desktop here.  Initial indications are that SEGQ kicks butt.

It seems to be a nice strategy a priori. It's good to hear initial
indications of the advantages coming out in practice. Something to
bench soon for sure.


Cheers,
Bill

* Re: [PATCH] modified segq for 2.5
  2002-09-10  0:21           ` William Lee Irwin III
@ 2002-09-10  1:13             ` Andrew Morton
  0 siblings, 0 replies; 28+ messages in thread
From: Andrew Morton @ 2002-09-10  1:13 UTC (permalink / raw)
  To: William Lee Irwin III; +Cc: Daniel Phillips, Rik van Riel, sfkaplan, linux-mm

William Lee Irwin III wrote:
> 
> William Lee Irwin III wrote:
> >> This seemed to work fine when I just tweaked problem areas to use
> >> __GFP_NOKILL. mempool was fixed by the __GFP_FS checks, but
> >> generic_file_read(), generic_file_write(), the rest of filemap.c,
> >> slab allocations, and allocating file descriptor tables for poll() and
> >> select() appeared to generate OOM when it seemed to me that failing
> >> system calls with -ENOMEM was a better alternative than shooting tasks.
> 
> On Mon, Sep 09, 2002 at 05:02:31PM -0700, Andrew Morton wrote:
> > But clearly there is reclaimable pagecache down there; we just
> > have to wait for it.  No idea why you'd get an oom on ZONE_HIGHMEM,
> > but when I have a few more gigs I might be able to say.
> > Anyway, it's all too much scanning.
> 
> Well, there was no swap, and most things were dirty. Not sure about the
> rest. I was miffed by "Something tells it there's no memory and it
> shoots tasks instead of returning -ENOMEM to userspace in a syscall?"
> Saying "no" to the task allocating seems better than shooting tasks to
> me. out_of_memory() being called too early sounds bad, too, though.

If there is dirty memory or memory under writeback then
going oom or returning NULL is a bug.

It's just a search problem, and not a very complex one.  Per-zone
dirty accounting, per-zone throttling and a separate
known-to-be-unreclaimable list should fix it up.  Give me
a few days to find a motivated moment...
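
The per-zone state that implies is small (struct and field names
invented):

	struct zone_dirty_state {
		unsigned long	 nr_dirty;	/* per-zone dirty accounting */
		unsigned long	 nr_writeback;	/* per-zone throttling input */
		struct list_head unreclaimable;	/* known-unreclaimable pages */
	};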

* Re: [PATCH] modified segq for 2.5
  2002-09-09 22:58     ` Andrew Morton
  2002-09-09 23:40       ` William Lee Irwin III
@ 2002-09-10  1:50       ` Daniel Phillips
  2002-09-10  2:02         ` Rik van Riel
  1 sibling, 1 reply; 28+ messages in thread
From: Daniel Phillips @ 2002-09-10  1:50 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Rik van Riel, William Lee Irwin III, sfkaplan, linux-mm

On Tuesday 10 September 2002 00:58, Andrew Morton wrote:
> Daniel Phillips wrote:
> > 
> > On Monday 09 September 2002 11:38, Andrew Morton wrote:
> > > One thing this patch did do was to speed up the initial untar of
> > > the kernel source - 50 seconds down to 25.  That'll be due to not
> > > having so much dirt on the inactive list.  The "nonblocking page
> > > reclaim" code (needs a better name...)
> > 
> > Nonblocking kswapd, no?  Perhaps 'kscand' would be a better name, now.
> 
> Well, it blocks still.  But it doesn't block on "this particular
> request queue" or on "that particular page ending IO".  It
> blocks on "any queue putting back a write request".   Which is
> basically equivalent to blocking on "a bunch of pages came clean".

It's not that far from being truly nonblocking, which would be a useful 
property.  Instead of calling ->writepage, just bump the page to the front of 
the pdlist (getting deja vu here).  Move locked pages off to a locked list 
and let them rehabilitate themselves asynchronously (since we can now do lru 
list moves inside interrupts).  If necessary, fall back to scanning the 
locked list for pages that slipped through the cracks, though it may be 
possible to make things airtight so that never happens.
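
Roughly, at the cold end of the inactive list (pdlist and locked_list
are invented names, reclaim_page() stands for the normal freeing path,
and IO completion would move pages back off locked_list):

	/* reclassify instead of blocking */
	if (PageLocked(page)) {
		/* under IO: park it, completion rescues it */
		list_del(&page->lru);
		list_add(&page->lru, &zone->locked_list);
	} else if (PageDirty(page)) {
		/* don't call ->writepage, let pdflush find it */
		list_del(&page->lru);
		list_add(&page->lru, &zone->pdlist);
	} else {
		reclaim_page(page);
	}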

What other ways for kswapd to block are there?  Buffers may be locked; a 
similar strategy applies, which is one reason why buffer state should not be 
opaque to the vfs.  ->releasepage is a can of worms, at which I'm looking 
suspiciously.

> Skipping is dumb.  It shouldn't have been on that list in the
> first place.

Sure, it's not the only way to skin the cat.  Anyway, skipping isn't so dumb 
that we haven't been doing it for years.

-- 
Daniel

* Re: [PATCH] modified segq for 2.5
  2002-09-10  1:50       ` Daniel Phillips
@ 2002-09-10  2:02         ` Rik van Riel
  0 siblings, 0 replies; 28+ messages in thread
From: Rik van Riel @ 2002-09-10  2:02 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Andrew Morton, William Lee Irwin III, sfkaplan, linux-mm

On Tue, 10 Sep 2002, Daniel Phillips wrote:

> > Skipping is dumb.  It shouldn't have been on that list in the
> > first place.
>
> Sure, it's not the only way to skin the cat.  Anyway, skipping isn't so
> dumb that we haven't been doing it for years.

Skipping might even be the correct thing to do, if we leave
the pages on the inactive list in strict LRU order instead
of wrapping them over to the other end of the list...
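
That is, walk with a cursor instead of rotating; untested sketch,
locking elided, reclaim_page() standing in for the freeing path:

	struct list_head *entry = zone->inactive_list.prev;

	while (max_scan-- && entry != &zone->inactive_list) {
		struct page *page = list_entry(entry, struct page, lru);
		entry = entry->prev;

		if (PageLocked(page) || PageDirty(page))
			continue;	/* skipped pages keep their LRU position */

		reclaim_page(page);
	}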

regards,

Rik
-- 
Bravely reimplemented by the knights who say "NIH".

http://www.surriel.com/		http://distro.conectiva.com/

Spamtraps of the month:  september@surriel.com trac@trac.org


Thread overview: 28+ messages
2002-08-15 14:24 [PATCH] modified segq for 2.5 Rik van Riel
2002-09-09  9:38 ` Andrew Morton
2002-09-09 11:40   ` Ed Tomlinson
2002-09-09 17:10     ` William Lee Irwin III
2002-09-09 18:58     ` Andrew Morton
2002-09-09 13:10   ` Rik van Riel
2002-09-09 19:03     ` Andrew Morton
2002-09-09 19:25       ` Rik van Riel
2002-09-09 19:55         ` Andrew Morton
2002-09-09 20:03           ` Rik van Riel
2002-09-09 20:51         ` Andrew Morton
2002-09-09 20:57           ` Andrew Morton
2002-09-09 21:09           ` Rik van Riel
2002-09-09 21:52             ` Andrew Morton
2002-09-09 22:41               ` Rik van Riel
2002-09-10  0:17                 ` Daniel Phillips
2002-09-09 22:49           ` William Lee Irwin III
2002-09-09 22:54             ` Rik van Riel
2002-09-09 23:32               ` William Lee Irwin III
2002-09-09 23:53                 ` Rik van Riel
2002-09-09 22:46   ` Daniel Phillips
2002-09-09 22:58     ` Andrew Morton
2002-09-09 23:40       ` William Lee Irwin III
2002-09-10  0:02         ` Andrew Morton
2002-09-10  0:21           ` William Lee Irwin III
2002-09-10  1:13             ` Andrew Morton
2002-09-10  1:50       ` Daniel Phillips
2002-09-10  2:02         ` Rik van Riel
