linux-mm.kvack.org archive mirror
* [PATCH/RFT 0/5] CLOCK-Pro page replacement
@ 2005-08-10 20:02 Rik van Riel
  2005-08-10 20:02 ` [PATCH/RFT 1/5] " Rik van Riel
                   ` (4 more replies)
  0 siblings, 5 replies; 21+ messages in thread
From: Rik van Riel @ 2005-08-10 20:02 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel

Here it is, the result of many months of thinking and a few
all-nighters.  CLOCK-Pro page replacement is an algorithm
designed to keep those pages on the active list that were
referenced "most frequently, recently", i.e. the pages that
have the smallest distance between their last two references.

I had to make some changes to the algorithm in order to
reduce the space overhead of keeping track of non-resident
pages, as well as work in a multi-zone VM.

The algorithm still needs lots of testing, and probably tuning:
- should new anonymous pages start out on the active or
  the inactive list?
- is this implementation of the algorithm buggy?
- are there performance regressions?

I have only done very rudimentary testing of the algorithm
here, and while it appears to be behaving as expected, I do
not know whether the expected behaviour is the right thing...

I think I have acted on all the feedback people have given
me on the non-resident pages patch set.

Any comments, observations, etc. are appreciated.

-- 
All Rights Reversed
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: dont@kvack.org


* [PATCH/RFT 1/5] CLOCK-Pro page replacement
  2005-08-10 20:02 [PATCH/RFT 0/5] CLOCK-Pro page replacement Rik van Riel
@ 2005-08-10 20:02 ` Rik van Riel
  2005-08-10 20:02 ` [PATCH/RFT 2/5] " Rik van Riel
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 21+ messages in thread
From: Rik van Riel @ 2005-08-10 20:02 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel

[-- Attachment #1: nonresident --]
[-- Type: text/plain, Size: 8982 bytes --]

Track non-resident pages through a simple hashing scheme.  This
limits the space overhead to one u32 per page, or roughly 0.1% of
memory, and makes each lookup cost a single cache miss.

Aside from telling whether or not a page was recently evicted, the
scheme also gives a reasonable estimate of how many other pages
have been evicted since that page was.

Signed-off-by: Rik van Riel <riel@redhat.com>

Index: linux-2.6.12-vm/include/linux/swap.h
===================================================================
--- linux-2.6.12-vm.orig/include/linux/swap.h
+++ linux-2.6.12-vm/include/linux/swap.h
@@ -153,6 +153,11 @@ extern void out_of_memory(unsigned int _
 /* linux/mm/memory.c */
 extern void swapin_readahead(swp_entry_t, unsigned long, struct vm_area_struct *);
 
+/* linux/mm/nonresident.c */
+extern int remember_page(struct address_space *, unsigned long);
+extern int recently_evicted(struct address_space *, unsigned long);
+extern void init_nonresident(void);
+
 /* linux/mm/page_alloc.c */
 extern unsigned long totalram_pages;
 extern unsigned long totalhigh_pages;
@@ -288,6 +293,11 @@ static inline swp_entry_t get_swap_page(
 #define grab_swap_token()  do { } while(0)
 #define has_swap_token(x) 0
 
+/* linux/mm/nonresident.c */
+#define init_nonresident()	do { } while (0)
+#define remember_page(x,y)	0
+#define recently_evicted(x,y)	0
+
 #endif /* CONFIG_SWAP */
 #endif /* __KERNEL__*/
 #endif /* _LINUX_SWAP_H */
Index: linux-2.6.12-vm/init/main.c
===================================================================
--- linux-2.6.12-vm.orig/init/main.c
+++ linux-2.6.12-vm/init/main.c
@@ -47,6 +47,7 @@
 #include <linux/rmap.h>
 #include <linux/mempolicy.h>
 #include <linux/key.h>
+#include <linux/swap.h>
 
 #include <asm/io.h>
 #include <asm/bugs.h>
@@ -488,6 +489,7 @@ asmlinkage void __init start_kernel(void
 	}
 #endif
 	vfs_caches_init_early();
+	init_nonresident();
 	mem_init();
 	kmem_cache_init();
 	numa_policy_init();
Index: linux-2.6.12-vm/mm/Makefile
===================================================================
--- linux-2.6.12-vm.orig/mm/Makefile
+++ linux-2.6.12-vm/mm/Makefile
@@ -12,7 +12,8 @@ obj-y			:= bootmem.o filemap.o mempool.o
 			   readahead.o slab.o swap.o truncate.o vmscan.o \
 			   prio_tree.o $(mmu-y)
 
-obj-$(CONFIG_SWAP)	+= page_io.o swap_state.o swapfile.o thrash.o
+obj-$(CONFIG_SWAP)	+= page_io.o swap_state.o swapfile.o thrash.o \
+			   nonresident.o
 obj-$(CONFIG_HUGETLBFS)	+= hugetlb.o
 obj-$(CONFIG_NUMA) 	+= mempolicy.o
 obj-$(CONFIG_SHMEM) += shmem.o
Index: linux-2.6.12-vm/mm/nonresident.c
===================================================================
--- /dev/null
+++ linux-2.6.12-vm/mm/nonresident.c
@@ -0,0 +1,162 @@
+/*
+ * mm/nonresident.c
+ * (C) 2004,2005 Red Hat, Inc
+ * Written by Rik van Riel <riel@redhat.com>
+ * Released under the GPL, see the file COPYING for details.
+ *
+ * Keeps track of whether a non-resident page was recently evicted
+ * and should be immediately promoted to the active list. This also
+ * helps automatically tune the inactive target.
+ *
+ * The pageout code stores a recently evicted page in this cache
+ * by calling remember_page(mapping/mm, index/vaddr) and can look
+ * it up in the cache by calling recently_evicted() with the same
+ * arguments.
+ *
+ * Note that there is no way to invalidate entries after e.g. a
+ * truncate or exit; stale entries simply fall out of the
+ * non-resident set through normal replacement.
+ */
+#include <linux/mm.h>
+#include <linux/cache.h>
+#include <linux/spinlock.h>
+#include <linux/bootmem.h>
+#include <linux/hash.h>
+#include <linux/prefetch.h>
+#include <linux/kernel.h>
+
+/* Number of non-resident pages per hash bucket. Never smaller than 15. */
+#if (L1_CACHE_BYTES < 64)
+#define NR_BUCKET_BYTES 64
+#else
+#define NR_BUCKET_BYTES L1_CACHE_BYTES
+#endif
+#define NUM_NR ((NR_BUCKET_BYTES - sizeof(atomic_t))/sizeof(u32))
+
+struct nr_bucket
+{
+	atomic_t hand;
+	u32 page[NUM_NR];
+} ____cacheline_aligned;
+
+/* The non-resident page hash table. */
+static struct nr_bucket * nonres_table;
+static unsigned int nonres_shift;
+static unsigned int nonres_mask;
+
+static struct nr_bucket * nr_hash(void * mapping, unsigned long index)
+{
+	unsigned long bucket;
+	unsigned long hash;
+
+	hash = hash_ptr(mapping, BITS_PER_LONG);
+	hash = 37 * hash + hash_long(index, BITS_PER_LONG);
+	bucket = hash & nonres_mask;
+
+	return nonres_table + bucket;
+}
+
+static u32 nr_cookie(struct address_space * mapping, unsigned long index)
+{
+	unsigned long cookie = hash_ptr(mapping, BITS_PER_LONG);
+	cookie = 37 * cookie + hash_long(index, BITS_PER_LONG);
+
+	if (mapping->host) {
+		cookie = 37 * cookie + hash_long(mapping->host->i_ino, BITS_PER_LONG);
+	}
+
+	return (u32)(cookie >> (BITS_PER_LONG - 32));
+}
+
+int recently_evicted(struct address_space * mapping, unsigned long index)
+{
+	struct nr_bucket * nr_bucket;
+	int distance;
+	u32 wanted;
+	int i;
+
+	prefetch(mapping->host);
+	nr_bucket = nr_hash(mapping, index);
+
+	prefetch(nr_bucket);
+	wanted = nr_cookie(mapping, index);
+
+	for (i = 0; i < NUM_NR; i++) {
+		if (nr_bucket->page[i] == wanted) {
+			nr_bucket->page[i] = 0;
+			/* Return the distance between entry and clock hand. */
+			distance = atomic_read(&nr_bucket->hand) + NUM_NR - i;
+			distance = (distance % NUM_NR) + 1;
+			return distance * (1 << nonres_shift);
+		}
+	}
+
+	return -1;
+}
+
+int remember_page(struct address_space * mapping, unsigned long index)
+{
+	struct nr_bucket * nr_bucket;
+	u32 nrpage;
+	int i;
+
+	prefetch(mapping->host);
+	nr_bucket = nr_hash(mapping, index);
+
+	prefetchw(nr_bucket);
+	nrpage = nr_cookie(mapping, index);
+
+	/* Atomically find the next array index. */
+	preempt_disable();
+  retry:
+	i = atomic_inc_return(&nr_bucket->hand);
+	if (unlikely(i >= NUM_NR)) {
+		if (i == NUM_NR)
+			atomic_set(&nr_bucket->hand, -1);
+		goto retry;
+	}
+	preempt_enable();
+
+	/* Statistics may want to know whether the entry was in use. */
+	return xchg(&nr_bucket->page[i], nrpage);
+}
+
+/*
+ * For interactive workloads, we remember about as many non-resident pages
+ * as we have actual memory pages.  For server workloads with large inter-
+ * reference distances we could benefit from remembering more.
+ */
+static __initdata unsigned long nonresident_factor = 1;
+void __init init_nonresident(void)
+{
+	int target;
+	int i;
+
+	/*
+	 * Calculate the non-resident hash bucket target. Use a power of
+	 * two for the division because alloc_large_system_hash rounds up.
+	 */
+	target = nr_all_pages * nonresident_factor;
+	target /= (sizeof(struct nr_bucket) / sizeof(u32));
+
+	nonres_table = alloc_large_system_hash("Non-resident page tracking",
+					sizeof(struct nr_bucket),
+					target,
+					0,
+					HASH_EARLY | HASH_HIGHMEM,
+					&nonres_shift,
+					&nonres_mask,
+					0);
+
+	for (i = 0; i < (1 << nonres_shift); i++)
+		atomic_set(&nonres_table[i].hand, 0);
+}
+
+static int __init set_nonresident_factor(char * str)
+{
+	if (!str)
+		return 0;
+	nonresident_factor = simple_strtoul(str, &str, 0);
+	return 1;
+}
+__setup("nonresident_factor=", set_nonresident_factor);
Index: linux-2.6.12-vm/mm/vmscan.c
===================================================================
--- linux-2.6.12-vm.orig/mm/vmscan.c
+++ linux-2.6.12-vm/mm/vmscan.c
@@ -509,6 +509,7 @@ static int shrink_list(struct list_head 
 #ifdef CONFIG_SWAP
 		if (PageSwapCache(page)) {
 			swp_entry_t swap = { .val = page->private };
+			remember_page(&swapper_space, page->private);
 			__delete_from_swap_cache(page);
 			write_unlock_irq(&mapping->tree_lock);
 			swap_free(swap);
@@ -517,6 +518,7 @@ static int shrink_list(struct list_head 
 		}
 #endif /* CONFIG_SWAP */
 
+		remember_page(page->mapping, page->index);
 		__remove_from_page_cache(page);
 		write_unlock_irq(&mapping->tree_lock);
 		__put_page(page);
Index: linux-2.6.12-vm/mm/filemap.c
===================================================================
--- linux-2.6.12-vm.orig/mm/filemap.c
+++ linux-2.6.12-vm/mm/filemap.c
@@ -400,6 +400,7 @@ int add_to_page_cache_lru(struct page *p
 				pgoff_t offset, int gfp_mask)
 {
 	int ret = add_to_page_cache(page, mapping, offset, gfp_mask);
+	recently_evicted(mapping, offset);
 	if (ret == 0)
 		lru_cache_add(page);
 	return ret;
Index: linux-2.6.12-vm/mm/swap_state.c
===================================================================
--- linux-2.6.12-vm.orig/mm/swap_state.c
+++ linux-2.6.12-vm/mm/swap_state.c
@@ -344,6 +344,8 @@ struct page *read_swap_cache_async(swp_e
 				break;		/* Out of memory */
 		}
 
+		recently_evicted(&swapper_space, entry.val);
+
 		/*
 		 * Associate the page with swap entry in the swap cache.
 		 * May fail (-ENOENT) if swap entry has been freed since



* [PATCH/RFT 2/5] CLOCK-Pro page replacement
  2005-08-10 20:02 [PATCH/RFT 0/5] CLOCK-Pro page replacement Rik van Riel
  2005-08-10 20:02 ` [PATCH/RFT 1/5] " Rik van Riel
@ 2005-08-10 20:02 ` Rik van Riel
  2005-08-10 20:27   ` David S. Miller, Rik van Riel
  2005-08-10 20:02 ` [PATCH/RFT 3/5] " Rik van Riel
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 21+ messages in thread
From: Rik van Riel @ 2005-08-10 20:02 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel

[-- Attachment #1: nonresident-stats --]
[-- Type: text/plain, Size: 4723 bytes --]

Prints a histogram of refault distances in /proc/refaults.  This
allows somebody to estimate how much extra memory a memory-starved
system would need in order to run better.

It can also help with the evaluation of page replacement algorithms,
by identifying which algorithm needs the least extra memory to fit
a given workload.

Signed-off-by: Rik van Riel <riel@redhat.com>

Index: linux-2.6.12-vm/fs/proc/proc_misc.c
===================================================================
--- linux-2.6.12-vm.orig/fs/proc/proc_misc.c
+++ linux-2.6.12-vm/fs/proc/proc_misc.c
@@ -219,6 +219,20 @@ static struct file_operations fragmentat
 	.release	= seq_release,
 };
 
+extern struct seq_operations refaults_op;
+static int refaults_open(struct inode *inode, struct file *file)
+{
+	(void)inode;
+	return seq_open(file, &refaults_op);
+}
+
+static struct file_operations refaults_file_operations = {
+	.open		= refaults_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= seq_release,
+};
+
 static int version_read_proc(char *page, char **start, off_t off,
 				 int count, int *eof, void *data)
 {
@@ -588,6 +602,7 @@ void __init proc_misc_init(void)
 	create_seq_entry("interrupts", 0, &proc_interrupts_operations);
 	create_seq_entry("slabinfo",S_IWUSR|S_IRUGO,&proc_slabinfo_operations);
 	create_seq_entry("buddyinfo",S_IRUGO, &fragmentation_file_operations);
+	create_seq_entry("refaults",S_IRUGO, &refaults_file_operations);
 	create_seq_entry("vmstat",S_IRUGO, &proc_vmstat_file_operations);
 	create_seq_entry("diskstats", 0, &proc_diskstats_operations);
 #ifdef CONFIG_MODULES
Index: linux-2.6.12-vm/mm/nonresident.c
===================================================================
--- linux-2.6.12-vm.orig/mm/nonresident.c
+++ linux-2.6.12-vm/mm/nonresident.c
@@ -24,6 +24,7 @@
 #include <linux/hash.h>
 #include <linux/prefetch.h>
 #include <linux/kernel.h>
+#include <linux/percpu.h>
 
 /* Number of non-resident pages per hash bucket. Never smaller than 15. */
 #if (L1_CACHE_BYTES < 64)
@@ -39,6 +40,9 @@ struct nr_bucket
 	u32 page[NUM_NR];
 } ____cacheline_aligned;
 
+/* Histogram for non-resident refault hits. [NUM_NR] means "not found". */
+DEFINE_PER_CPU(unsigned long[NUM_NR+1], refault_histogram);
+
 /* The non-resident page hash table. */
 static struct nr_bucket * nonres_table;
 static unsigned int nonres_shift;
@@ -86,11 +90,14 @@ int recently_evicted(struct address_spac
 			nr_bucket->page[i] = 0;
 			/* Return the distance between entry and clock hand. */
 			distance = atomic_read(&nr_bucket->hand) + NUM_NR - i;
-			distance = (distance % NUM_NR) + 1;
-			return distance * (1 << nonres_shift);
+			distance = distance % NUM_NR;
+			__get_cpu_var(refault_histogram)[distance]++;
+			return (distance + 1) * (1 << nonres_shift);
 		}
 	}
 
+	/* If this page was evicted, it was longer ago than our history. */
+	__get_cpu_var(refault_histogram)[NUM_NR]++;
 	return -1;
 }
 
@@ -160,3 +167,68 @@ static int __init set_nonresident_factor
 	return 1;
 }
 __setup("nonresident_factor=", set_nonresident_factor);
+
+#ifdef CONFIG_PROC_FS
+
+#include <linux/seq_file.h>
+
+static void *frag_start(struct seq_file *m, loff_t *pos)
+{
+	if (*pos < 0 || *pos > NUM_NR)
+		return NULL;
+
+	m->private = (void *)(unsigned long)*pos;
+
+	return pos;
+}
+
+static void *frag_next(struct seq_file *m, void *arg, loff_t *pos)
+{
+	if (*pos < NUM_NR) {
+		(*pos)++;
+		m->private = (void *)((unsigned long)m->private + 1);
+		return pos;
+	}
+	return NULL;
+}
+
+static void frag_stop(struct seq_file *m, void *arg)
+{
+}
+
+unsigned long get_refault_stat(unsigned long index)
+{
+	unsigned long total = 0;
+	int cpu;
+
+	for_each_online_cpu(cpu) {
+		total += per_cpu(refault_histogram, cpu)[index];
+	}
+	return total;
+}
+
+static int frag_show(struct seq_file *m, void *arg)
+{
+	unsigned long index = (unsigned long)m->private;
+	unsigned long upper = ((unsigned long)index + 1) << nonres_shift;
+	unsigned long lower = (unsigned long)index << nonres_shift;
+	unsigned long hits = get_refault_stat(index);
+
+	if (index == 0)
+		seq_printf(m, "     Refault distance          Hits\n");
+
+	if (index < NUM_NR)
+		seq_printf(m, "%9lu - %9lu     %9lu\n", lower, upper, hits);
+	else
+		seq_printf(m, " New/Beyond %9lu     %9lu\n", lower, hits);
+
+	return 0;
+}
+
+struct seq_operations refaults_op = {
+	.start  = frag_start,
+	.next   = frag_next,
+	.stop   = frag_stop,
+	.show   = frag_show,
+};
+#endif /* CONFIG_PROC_FS */



* [PATCH/RFT 3/5] CLOCK-Pro page replacement
  2005-08-10 20:02 [PATCH/RFT 0/5] CLOCK-Pro page replacement Rik van Riel
  2005-08-10 20:02 ` [PATCH/RFT 1/5] " Rik van Riel
  2005-08-10 20:02 ` [PATCH/RFT 2/5] " Rik van Riel
@ 2005-08-10 20:02 ` Rik van Riel
  2005-08-10 20:02 ` [PATCH/RFT 4/5] " Rik van Riel
  2005-08-10 20:02 ` [PATCH/RFT 5/5] " Rik van Riel
  4 siblings, 0 replies; 21+ messages in thread
From: Rik van Riel @ 2005-08-10 20:02 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel

[-- Attachment #1: useonce-cleanup --]
[-- Type: text/plain, Size: 6450 bytes --]

Simplify the use-once code.  I have not benchmarked this change yet,
but I expect it to have little impact on most workloads.  It gets rid
of some magic code though, which is nice.

Signed-off-by: Rik van Riel <riel@surriel.com>

Index: linux-2.6.12-vm/include/linux/page-flags.h
===================================================================
--- linux-2.6.12-vm.orig/include/linux/page-flags.h
+++ linux-2.6.12-vm/include/linux/page-flags.h
@@ -75,7 +75,9 @@
 #define PG_mappedtodisk		17	/* Has blocks allocated on-disk */
 #define PG_reclaim		18	/* To be reclaimed asap */
 #define PG_nosave_free		19	/* Free, should not be written */
+
 #define PG_uncached		20	/* Page has been mapped as uncached */
+#define PG_new			21	/* Newly allocated page */
 
 /*
  * Global page accounting.  One instance per CPU.  Only unsigned longs are
@@ -306,6 +308,11 @@ extern void __mod_page_state(unsigned of
 #define SetPageUncached(page)	set_bit(PG_uncached, &(page)->flags)
 #define ClearPageUncached(page)	clear_bit(PG_uncached, &(page)->flags)
 
+#define PageNew(page)		test_bit(PG_new, &(page)->flags)
+#define SetPageNew(page)	set_bit(PG_new, &(page)->flags)
+#define ClearPageNew(page)	clear_bit(PG_new, &(page)->flags)
+#define TestClearPageNew(page)	test_and_clear_bit(PG_new, &(page)->flags)
+
 struct page;	/* forward declaration */
 
 int test_clear_page_dirty(struct page *page);
Index: linux-2.6.12-vm/mm/filemap.c
===================================================================
--- linux-2.6.12-vm.orig/mm/filemap.c
+++ linux-2.6.12-vm/mm/filemap.c
@@ -383,6 +383,7 @@ int add_to_page_cache(struct page *page,
 		if (!error) {
 			page_cache_get(page);
 			SetPageLocked(page);
+			SetPageNew(page);
 			page->mapping = mapping;
 			page->index = offset;
 			mapping->nrpages++;
@@ -723,7 +724,6 @@ void do_generic_mapping_read(struct addr
 	unsigned long offset;
 	unsigned long last_index;
 	unsigned long next_index;
-	unsigned long prev_index;
 	loff_t isize;
 	struct page *cached_page;
 	int error;
@@ -732,7 +732,6 @@ void do_generic_mapping_read(struct addr
 	cached_page = NULL;
 	index = *ppos >> PAGE_CACHE_SHIFT;
 	next_index = index;
-	prev_index = ra.prev_page;
 	last_index = (*ppos + desc->count + PAGE_CACHE_SIZE-1) >> PAGE_CACHE_SHIFT;
 	offset = *ppos & ~PAGE_CACHE_MASK;
 
@@ -779,13 +778,7 @@ page_ok:
 		if (mapping_writably_mapped(mapping))
 			flush_dcache_page(page);
 
-		/*
-		 * When (part of) the same page is read multiple times
-		 * in succession, only mark it as accessed the first time.
-		 */
-		if (prev_index != index)
-			mark_page_accessed(page);
-		prev_index = index;
+		mark_page_accessed(page);
 
 		/*
 		 * Ok, we have the page, and it's up-to-date, so
Index: linux-2.6.12-vm/mm/shmem.c
===================================================================
--- linux-2.6.12-vm.orig/mm/shmem.c
+++ linux-2.6.12-vm/mm/shmem.c
@@ -1525,11 +1525,8 @@ static void do_shmem_file_read(struct fi
 			 */
 			if (mapping_writably_mapped(mapping))
 				flush_dcache_page(page);
-			/*
-			 * Mark the page accessed if we read the beginning.
-			 */
-			if (!offset)
-				mark_page_accessed(page);
+
+			mark_page_accessed(page);
 		} else
 			page = ZERO_PAGE(0);
 
Index: linux-2.6.12-vm/mm/swap.c
===================================================================
--- linux-2.6.12-vm.orig/mm/swap.c
+++ linux-2.6.12-vm/mm/swap.c
@@ -115,19 +115,11 @@ void fastcall activate_page(struct page 
 
 /*
  * Mark a page as having seen activity.
- *
- * inactive,unreferenced	->	inactive,referenced
- * inactive,referenced		->	active,unreferenced
- * active,unreferenced		->	active,referenced
  */
 void fastcall mark_page_accessed(struct page *page)
 {
-	if (!PageActive(page) && PageReferenced(page) && PageLRU(page)) {
-		activate_page(page);
-		ClearPageReferenced(page);
-	} else if (!PageReferenced(page)) {
+	if (!PageReferenced(page))
 		SetPageReferenced(page);
-	}
 }
 
 EXPORT_SYMBOL(mark_page_accessed);
@@ -157,6 +149,7 @@ void fastcall lru_cache_add_active(struc
 	if (!pagevec_add(pvec, page))
 		__pagevec_lru_add_active(pvec);
 	put_cpu_var(lru_add_active_pvecs);
+	ClearPageNew(page);
 }
 
 void lru_add_drain(void)
Index: linux-2.6.12-vm/mm/vmscan.c
===================================================================
--- linux-2.6.12-vm.orig/mm/vmscan.c
+++ linux-2.6.12-vm/mm/vmscan.c
@@ -225,27 +225,6 @@ static int shrink_slab(unsigned long sca
 	return 0;
 }
 
-/* Called without lock on whether page is mapped, so answer is unstable */
-static inline int page_mapping_inuse(struct page *page)
-{
-	struct address_space *mapping;
-
-	/* Page is in somebody's page tables. */
-	if (page_mapped(page))
-		return 1;
-
-	/* Be more reluctant to reclaim swapcache than pagecache */
-	if (PageSwapCache(page))
-		return 1;
-
-	mapping = page_mapping(page);
-	if (!mapping)
-		return 0;
-
-	/* File is mmap'd by somebody? */
-	return mapping_mapped(mapping);
-}
-
 static inline int is_page_cache_freeable(struct page *page)
 {
 	return page_count(page) - !!PagePrivate(page) == 2;
@@ -398,9 +377,13 @@ static int shrink_list(struct list_head 
 			goto keep_locked;
 
 		referenced = page_referenced(page, 1, sc->priority <= 0);
-		/* In active use or really unfreeable?  Activate it. */
-		if (referenced && page_mapping_inuse(page))
+
+		if (referenced) {
+			/* New page. Wait and see if it gets used again... */
+			if (TestClearPageNew(page))
+				goto keep_locked;
 			goto activate_locked;
+		}
 
 #ifdef CONFIG_SWAP
 		/*
@@ -694,6 +677,7 @@ refill_inactive_zone(struct zone *zone, 
 	long mapped_ratio;
 	long distress;
 	long swap_tendency;
+	int referenced;
 
 	lru_add_drain();
 	spin_lock_irq(&zone->lru_lock);
@@ -738,10 +722,10 @@ refill_inactive_zone(struct zone *zone, 
 		cond_resched();
 		page = lru_to_page(&l_hold);
 		list_del(&page->lru);
+		referenced = page_referenced(page, 0, sc->priority <= 0);
 		if (page_mapped(page)) {
-			if (!reclaim_mapped ||
-			    (total_swap_pages == 0 && PageAnon(page)) ||
-			    page_referenced(page, 0, sc->priority <= 0)) {
+			if (referenced || !reclaim_mapped ||
+			    (total_swap_pages == 0 && PageAnon(page))) {
 				list_add(&page->lru, &l_active);
 				continue;
 			}



* [PATCH/RFT 4/5] CLOCK-Pro page replacement
  2005-08-10 20:02 [PATCH/RFT 0/5] CLOCK-Pro page replacement Rik van Riel
                   ` (2 preceding siblings ...)
  2005-08-10 20:02 ` [PATCH/RFT 3/5] " Rik van Riel
@ 2005-08-10 20:02 ` Rik van Riel
  2005-08-10 20:31   ` David S. Miller, Rik van Riel
  2005-08-10 23:22   ` Marcelo Tosatti
  2005-08-10 20:02 ` [PATCH/RFT 5/5] " Rik van Riel
  4 siblings, 2 replies; 21+ messages in thread
From: Rik van Riel @ 2005-08-10 20:02 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel

[-- Attachment #1: clockpro --]
[-- Type: text/plain, Size: 11028 bytes --]

Implement an approximation to Song Jiang's CLOCK-Pro page replacement
algorithm.  The algorithm has been extended to handle multiple memory
zones, which in turn required changes to how the active list size
target is adjusted.

TODO:
 - verify that things work as expected
 - figure out where to put new anonymous pages

More information can be found at:
 - http://www.cs.wm.edu/hpcs/WWW/HTML/publications/abs05-3.html
 - http://linux-mm.org/wiki/ClockProApproximation

Signed-off-by: Rik van Riel <riel@redhat.com>

Index: linux-2.6.12-vm/include/linux/mmzone.h
===================================================================
--- linux-2.6.12-vm.orig/include/linux/mmzone.h
+++ linux-2.6.12-vm/include/linux/mmzone.h
@@ -143,6 +143,8 @@ struct zone {
 	unsigned long		nr_inactive;
 	unsigned long		pages_scanned;	   /* since last reclaim */
 	int			all_unreclaimable; /* All pages pinned */
+	unsigned long		active_limit;
+	unsigned long		active_scanned;
 
 	/*
 	 * prev_priority holds the scanning priority for this zone.  It is
Index: linux-2.6.12-vm/include/linux/swap.h
===================================================================
--- linux-2.6.12-vm.orig/include/linux/swap.h
+++ linux-2.6.12-vm/include/linux/swap.h
@@ -154,10 +154,15 @@ extern void out_of_memory(unsigned int _
 extern void swapin_readahead(swp_entry_t, unsigned long, struct vm_area_struct *);
 
 /* linux/mm/nonresident.c */
-extern int remember_page(struct address_space *, unsigned long);
+extern int do_remember_page(struct address_space *, unsigned long);
 extern int recently_evicted(struct address_space *, unsigned long);
 extern void init_nonresident(void);
 
+/* linux/mm/clockpro.c */
+extern void remember_page(struct page *, struct address_space *, unsigned long);
+extern int page_is_hot(struct page *, struct address_space *, unsigned long);
+DECLARE_PER_CPU(unsigned long, evicted_pages);
+
 /* linux/mm/page_alloc.c */
 extern unsigned long totalram_pages;
 extern unsigned long totalhigh_pages;
@@ -298,6 +303,9 @@ static inline swp_entry_t get_swap_page(
-#define remember_page(x,y)	0
+#define remember_page(x,y,z)	do { } while (0)
 #define recently_evicted(x,y)	0
 
+/* linux/mm/clockpro.c */
+#define page_is_hot(x,y,z)	0
+
 #endif /* CONFIG_SWAP */
 #endif /* __KERNEL__*/
 #endif /* _LINUX_SWAP_H */
Index: linux-2.6.12-vm/mm/Makefile
===================================================================
--- linux-2.6.12-vm.orig/mm/Makefile
+++ linux-2.6.12-vm/mm/Makefile
@@ -13,7 +13,7 @@ obj-y			:= bootmem.o filemap.o mempool.o
 			   prio_tree.o $(mmu-y)
 
 obj-$(CONFIG_SWAP)	+= page_io.o swap_state.o swapfile.o thrash.o \
-			   nonresident.o
+			   nonresident.o clockpro.o
 obj-$(CONFIG_HUGETLBFS)	+= hugetlb.o
 obj-$(CONFIG_NUMA) 	+= mempolicy.o
 obj-$(CONFIG_SHMEM) += shmem.o
Index: linux-2.6.12-vm/mm/clockpro.c
===================================================================
--- /dev/null
+++ linux-2.6.12-vm/mm/clockpro.c
@@ -0,0 +1,102 @@
+/*
+ * mm/clockpro.c
+ * (C) 2005 Red Hat, Inc
+ * Written by Rik van Riel <riel@redhat.com>
+ * Released under the GPL, see the file COPYING for details.
+ *
+ * Helper functions to implement CLOCK-Pro page replacement policy.
+ * For details see: http://linux-mm.org/wiki/AdvancedPageReplacement
+ */
+#include <linux/mm.h>
+#include <linux/mmzone.h>
+#include <linux/swap.h>
+
+DEFINE_PER_CPU(unsigned long, evicted_pages);
+static unsigned long get_evicted(void)
+{
+	unsigned long total = 0;
+	int cpu;
+
+	for_each_online_cpu(cpu)
+		total += per_cpu(evicted_pages, cpu);
+
+	return total;
+}
+
+static unsigned long estimate_pageable_memory(void)
+{
+	static unsigned long next_check;
+	static unsigned long total;
+	unsigned long active, inactive, free;
+
+	if (time_after(jiffies, next_check)) {
+		get_zone_counts(&active, &inactive, &free);
+		total = active + inactive + free;
+		next_check = jiffies + HZ/10;
+	}
+
+	return total;
+}
+
+static void decay_clockpro_variables(void)
+{
+	struct zone * zone;
+	int cpu;
+
+	for_each_online_cpu(cpu)
+		per_cpu(evicted_pages, cpu) /= 2;
+
+	for_each_zone(zone)
+		zone->active_scanned /= 2;
+}
+
+int page_is_hot(struct page * page, struct address_space * mapping,
+		unsigned long index)
+{
+	unsigned long long distance;
+	unsigned long long evicted;
+	int refault_distance;
+	struct zone *zone;
+
+	/* Was the page recently evicted ? */
+	refault_distance = recently_evicted(mapping, index);
+	if (refault_distance < 0)
+		return 0;
+
+	distance = estimate_pageable_memory() + refault_distance;
+	evicted = get_evicted();
+	zone = page_zone(page);
+
+	/* Only consider recent history for the calculation below. */
+	if (unlikely(evicted > distance))
+		decay_clockpro_variables();
+
+	/*
+	 * Estimate whether the inter-reference distance of the tested
+	 * page is smaller than the inter-reference distance of the
+	 * oldest page on the active list.
+	 *
+	 *  distance        zone->nr_active
+	 * ---------- <  ----------------------
+	 *  evicted       zone->active_scanned
+	 */
+	if (distance * zone->active_scanned < evicted * zone->nr_active) {
+		if (zone->active_limit > zone->present_pages / 8)
+			zone->active_limit--;
+		return 1;
+	}
+
+	/* Increase the active limit more slowly. */
+	if ((evicted & 1) && zone->active_limit < zone->present_pages * 7 / 8)
+		zone->active_limit++;
+	return 0;
+}
+
+void remember_page(struct page * page, struct address_space * mapping,
+		unsigned long index)
+{
+	struct zone * zone = page_zone(page);
+	if (do_remember_page(mapping, index) && (index & 1) &&
+			zone->active_limit < zone->present_pages * 7 / 8)
+		zone->active_limit++;
+}
Index: linux-2.6.12-vm/mm/filemap.c
===================================================================
--- linux-2.6.12-vm.orig/mm/filemap.c
+++ linux-2.6.12-vm/mm/filemap.c
@@ -401,9 +401,12 @@ int add_to_page_cache_lru(struct page *p
 				pgoff_t offset, int gfp_mask)
 {
 	int ret = add_to_page_cache(page, mapping, offset, gfp_mask);
-	recently_evicted(mapping, offset);
-	if (ret == 0)
-		lru_cache_add(page);
+	if (ret == 0) {
+		if (page_is_hot(page, mapping, offset))
+			lru_cache_add_active(page);
+		else
+			lru_cache_add(page);
+	}
 	return ret;
 }
 
Index: linux-2.6.12-vm/mm/nonresident.c
===================================================================
--- linux-2.6.12-vm.orig/mm/nonresident.c
+++ linux-2.6.12-vm/mm/nonresident.c
@@ -25,6 +25,7 @@
 #include <linux/prefetch.h>
 #include <linux/kernel.h>
 #include <linux/percpu.h>
+#include <linux/swap.h>
 
 /* Number of non-resident pages per hash bucket. Never smaller than 15. */
 #if (L1_CACHE_BYTES < 64)
@@ -101,7 +102,7 @@ int recently_evicted(struct address_spac
 	return -1;
 }
 
-int remember_page(struct address_space * mapping, unsigned long index)
+int do_remember_page(struct address_space * mapping, unsigned long index)
 {
 	struct nr_bucket * nr_bucket;
 	u32 nrpage;
@@ -125,6 +126,7 @@ int remember_page(struct address_space *
 	preempt_enable();
 
 	/* Statistics may want to know whether the entry was in use. */
+	__get_cpu_var(evicted_pages)++;
 	return xchg(&nr_bucket->page[i], nrpage);
 }
 
Index: linux-2.6.12-vm/mm/page_alloc.c
===================================================================
--- linux-2.6.12-vm.orig/mm/page_alloc.c
+++ linux-2.6.12-vm/mm/page_alloc.c
@@ -1715,6 +1715,7 @@ static void __init free_area_init_core(s
 		zone->nr_scan_inactive = 0;
 		zone->nr_active = 0;
 		zone->nr_inactive = 0;
+		zone->active_limit = zone->present_pages * 2 / 3;
 		if (!size)
 			continue;
 
Index: linux-2.6.12-vm/mm/swap_state.c
===================================================================
--- linux-2.6.12-vm.orig/mm/swap_state.c
+++ linux-2.6.12-vm/mm/swap_state.c
@@ -323,6 +323,7 @@ struct page *read_swap_cache_async(swp_e
 			struct vm_area_struct *vma, unsigned long addr)
 {
 	struct page *found_page, *new_page = NULL;
+	int active;
 	int err;
 
 	do {
@@ -344,7 +345,7 @@ struct page *read_swap_cache_async(swp_e
 				break;		/* Out of memory */
 		}
 
-		recently_evicted(&swapper_space, entry.val);
+		active = page_is_hot(new_page, &swapper_space, entry.val);
 
 		/*
 		 * Associate the page with swap entry in the swap cache.
@@ -361,7 +362,10 @@ struct page *read_swap_cache_async(swp_e
 			/*
 			 * Initiate read into locked page and return.
 			 */
-			lru_cache_add_active(new_page);
+			if (active) {
+				lru_cache_add_active(new_page);
+			} else
+				lru_cache_add(new_page);
 			swap_readpage(NULL, new_page);
 			return new_page;
 		}
Index: linux-2.6.12-vm/mm/vmscan.c
===================================================================
--- linux-2.6.12-vm.orig/mm/vmscan.c
+++ linux-2.6.12-vm/mm/vmscan.c
@@ -355,12 +355,14 @@ static int shrink_list(struct list_head 
 	while (!list_empty(page_list)) {
 		struct address_space *mapping;
 		struct page *page;
+		struct zone *zone;
 		int may_enter_fs;
 		int referenced;
 
 		cond_resched();
 
 		page = lru_to_page(page_list);
+		zone = page_zone(page);
 		list_del(&page->lru);
 
 		if (TestSetPageLocked(page))
@@ -492,7 +494,7 @@ static int shrink_list(struct list_head 
 #ifdef CONFIG_SWAP
 		if (PageSwapCache(page)) {
 			swp_entry_t swap = { .val = page->private };
-			remember_page(&swapper_space, page->private);
+			remember_page(page, &swapper_space, page->private);
 			__delete_from_swap_cache(page);
 			write_unlock_irq(&mapping->tree_lock);
 			swap_free(swap);
@@ -501,7 +503,7 @@ static int shrink_list(struct list_head 
 		}
 #endif /* CONFIG_SWAP */
 
-		remember_page(page->mapping, page->index);
+		remember_page(page, page->mapping, page->index);
 		__remove_from_page_cache(page);
 		write_unlock_irq(&mapping->tree_lock);
 		__put_page(page);
@@ -684,6 +686,7 @@ refill_inactive_zone(struct zone *zone, 
 	pgmoved = isolate_lru_pages(nr_pages, &zone->active_list,
 				    &l_hold, &pgscanned);
 	zone->pages_scanned += pgscanned;
+	zone->active_scanned += pgscanned;
 	zone->nr_active -= pgmoved;
 	spin_unlock_irq(&zone->lru_lock);
 
@@ -799,10 +802,15 @@ shrink_zone(struct zone *zone, struct sc
 	unsigned long nr_inactive;
 
 	/*
-	 * Add one to `nr_to_scan' just to make sure that the kernel will
-	 * slowly sift through the active list.
+	 * Scan the active list if we have too many active pages.
+	 * The limit is automatically adjusted through refaults
+	 * measuring how well the VM did in the past.
 	 */
-	zone->nr_scan_active += (zone->nr_active >> sc->priority) + 1;
+	if (zone->nr_active > zone->active_limit)
+		zone->nr_scan_active += zone->nr_active - zone->active_limit;
+	else if (sc->priority < DEF_PRIORITY - 2)
+		zone->nr_scan_active += (zone->nr_active >> sc->priority) + 1;
+
 	nr_active = zone->nr_scan_active;
 	if (nr_active >= sc->swap_cluster_max)
 		zone->nr_scan_active = 0;

--
-- 
All Rights Reversed
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH/RFT 5/5] CLOCK-Pro page replacement
  2005-08-10 20:02 [PATCH/RFT 0/5] CLOCK-Pro page replacement Rik van Riel
                   ` (3 preceding siblings ...)
  2005-08-10 20:02 ` [PATCH/RFT 4/5] " Rik van Riel
@ 2005-08-10 20:02 ` Rik van Riel
  2005-08-11 22:08   ` Song Jiang
  4 siblings, 1 reply; 21+ messages in thread
From: Rik van Riel @ 2005-08-10 20:02 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel

[-- Attachment #1: clockpro-stats --]
[-- Type: text/plain, Size: 2510 bytes --]

Export the active limit statistic through /proc.  We may want to
export some more CLOCK-Pro statistics in the future, but I'm not
sure yet which ones.

Signed-off-by: Rik van Riel

Index: linux-2.6.12-vm/fs/proc/proc_misc.c
===================================================================
--- linux-2.6.12-vm.orig/fs/proc/proc_misc.c
+++ linux-2.6.12-vm/fs/proc/proc_misc.c
@@ -125,11 +125,13 @@ static int meminfo_read_proc(char *page,
 	unsigned long free;
 	unsigned long committed;
 	unsigned long allowed;
+	unsigned long active_limit;
 	struct vmalloc_info vmi;
 	long cached;
 
 	get_page_state(&ps);
 	get_zone_counts(&active, &inactive, &free);
+	active_limit = get_active_limit();
 
 /*
  * display in kilobytes.
@@ -158,6 +160,7 @@ static int meminfo_read_proc(char *page,
 		"SwapCached:   %8lu kB\n"
 		"Active:       %8lu kB\n"
 		"Inactive:     %8lu kB\n"
+		"ActiveLimit:  %8lu kB\n"
 		"HighTotal:    %8lu kB\n"
 		"HighFree:     %8lu kB\n"
 		"LowTotal:     %8lu kB\n"
@@ -181,6 +184,7 @@ static int meminfo_read_proc(char *page,
 		K(total_swapcache_pages),
 		K(active),
 		K(inactive),
+		K(active_limit),
 		K(i.totalhigh),
 		K(i.freehigh),
 		K(i.totalram-i.totalhigh),
Index: linux-2.6.12-vm/include/linux/swap.h
===================================================================
--- linux-2.6.12-vm.orig/include/linux/swap.h
+++ linux-2.6.12-vm/include/linux/swap.h
@@ -161,6 +161,7 @@ extern void init_nonresident(void);
 /* linux/mm/clockpro.c */
 extern void remember_page(struct page *, struct address_space *, unsigned long);
 extern int page_is_hot(struct page *, struct address_space *, unsigned long);
+extern unsigned long get_active_limit(void);
 DECLARE_PER_CPU(unsigned long, evicted_pages);
 
 /* linux/mm/page_alloc.c */
Index: linux-2.6.12-vm/mm/clockpro.c
===================================================================
--- linux-2.6.12-vm.orig/mm/clockpro.c
+++ linux-2.6.12-vm/mm/clockpro.c
@@ -100,3 +100,14 @@ void remember_page(struct page * page, s
 			zone->active_limit < zone->present_pages * 7 / 8)
 		zone->active_limit++;
 }
+
+unsigned long get_active_limit(void)
+{
+	unsigned long total = 0;
+	struct zone * zone;
+	
+	for_each_zone(zone)
+		total += zone->active_limit;
+
+	return total;
+}

--
-- 
All Rights Reversed

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH/RFT 2/5] CLOCK-Pro page replacement
  2005-08-10 20:02 ` [PATCH/RFT 2/5] " Rik van Riel
@ 2005-08-10 20:27   ` David S. Miller, Rik van Riel
  2005-08-10 20:38     ` Rik van Riel
  0 siblings, 1 reply; 21+ messages in thread
From: David S. Miller, Rik van Riel @ 2005-08-10 20:27 UTC (permalink / raw)
  To: riel; +Cc: linux-mm, linux-kernel

> --- linux-2.6.12-vm.orig/fs/proc/proc_misc.c
> +++ linux-2.6.12-vm/fs/proc/proc_misc.c
> @@ -219,6 +219,20 @@ static struct file_operations fragmentat
>  	.release	= seq_release,
>  };
>  
> +extern struct seq_operations refaults_op;

Please put this in linux/mm.h or similar, so that we'll get proper
type checking of the definition in nonresident.c

Otherwise looks great.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH/RFT 4/5] CLOCK-Pro page replacement
  2005-08-10 20:02 ` [PATCH/RFT 4/5] " Rik van Riel
@ 2005-08-10 20:31   ` David S. Miller, Rik van Riel
  2005-08-18  0:38     ` Andrew Morton
  2005-08-10 23:22   ` Marcelo Tosatti
  1 sibling, 1 reply; 21+ messages in thread
From: David S. Miller, Rik van Riel @ 2005-08-10 20:31 UTC (permalink / raw)
  To: riel; +Cc: linux-mm, linux-kernel

> +DEFINE_PER_CPU(unsigned long, evicted_pages);

DEFINE_PER_CPU() needs an explicit initializer to work
around some bugs in gcc-2.95, wherein on some platforms
if you let it end up as a BSS candidate it won't end up
in the per-cpu section properly.
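
Concretely, the workaround amounts to a one-line change along these lines (a sketch of the idiom, not a line taken from the patch):

```c
/* Explicit initializer: keeps gcc-2.95 from treating the variable as
 * a BSS candidate, so it lands in the .data.percpu section as intended. */
DEFINE_PER_CPU(unsigned long, evicted_pages) = 0;
```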

I'm actually happy you made this mistake, as it forced me
to audit the whole current 2.6.x tree; there are a few
missing cases in there which I'll fix up and send to Linus.
:-)

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH/RFT 2/5] CLOCK-Pro page replacement
  2005-08-10 20:27   ` David S. Miller, Rik van Riel
@ 2005-08-10 20:38     ` Rik van Riel
  0 siblings, 0 replies; 21+ messages in thread
From: Rik van Riel @ 2005-08-10 20:38 UTC (permalink / raw)
  To: David S. Miller; +Cc: linux-mm, linux-kernel

On Wed, 10 Aug 2005, David S. Miller wrote:
> From: Rik van Riel <riel@redhat.com>
> Date: Wed, 10 Aug 2005 16:02:18 -0400
> 
> > --- linux-2.6.12-vm.orig/fs/proc/proc_misc.c
> > +++ linux-2.6.12-vm/fs/proc/proc_misc.c
> > @@ -219,6 +219,20 @@ static struct file_operations fragmentat
> >  	.release	= seq_release,
> >  };
> >  
> > +extern struct seq_operations refaults_op;
> 
> Please put this in linux/mm.h or similar, so that we'll get proper
> type checking of the definition in nonresident.c

The reason it is in fs/proc/proc_misc.c is that the rest of
these definitions live there.

I agree with you, though; it would be a good thing if they
moved into a header file.

-- 
All Rights Reversed

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH/RFT 4/5] CLOCK-Pro page replacement
  2005-08-10 20:02 ` [PATCH/RFT 4/5] " Rik van Riel
  2005-08-10 20:31   ` David S. Miller, Rik van Riel
@ 2005-08-10 23:22   ` Marcelo Tosatti
  2005-08-11  0:06     ` Rik van Riel
  1 sibling, 1 reply; 21+ messages in thread
From: Marcelo Tosatti @ 2005-08-10 23:22 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-mm, linux-kernel

Hi Rik,

First of all, this is very nice! The code is amazingly easy to read.

Now the usual ranting:

You change the rate of active list scanning, which I suppose won't
change the current reclaiming behaviour much (at least not on the
"stress system to death" tests which most folks use to test page
replacement policies). I'll do some STP benchmarking.

But the fundamental metric for page replacement decision continues to be
recency alone.

IMHO much deeper surgery is needed: actually use inter-reference
distance as the metric for page replacement decision.

As we discussed, I've got an ARC variant working, but from what I
gather so far it's not as simple as I'd imagined. Direct replacement
from the active list seems to screw up most "stress system to death"
workloads, increasing major page faults.

We still lack a set of well-analyzed, pertinent VM tests...

On Wed, Aug 10, 2005 at 04:02:20PM -0400, Rik van Riel wrote:
> Implement an approximation to Song Jiang's CLOCK-Pro page replacement
> algorithm.  The algorithm has been extended to handle multiple memory
> zones and, consequently, needed some changes in the active page limit
> readjustment.
> 
> TODO:
>  - verify that things work as expected
>  - figure out where to put new anonymous pages
> 
> More information can be found at:
>  - http://www.cs.wm.edu/hpcs/WWW/HTML/publications/abs05-3.html
>  - http://linux-mm.org/wiki/ClockProApproximation
> 
> Signed-off-by: Rik van Riel <riel@redhat.com>
> 
> Index: linux-2.6.12-vm/include/linux/mmzone.h
> ===================================================================
> --- linux-2.6.12-vm.orig/include/linux/mmzone.h
> +++ linux-2.6.12-vm/include/linux/mmzone.h
> @@ -143,6 +143,8 @@ struct zone {
>  	unsigned long		nr_inactive;
>  	unsigned long		pages_scanned;	   /* since last reclaim */
>  	int			all_unreclaimable; /* All pages pinned */
> +	unsigned long		active_limit;
> +	unsigned long		active_scanned;
>  
>  	/*
>  	 * prev_priority holds the scanning priority for this zone.  It is
> Index: linux-2.6.12-vm/include/linux/swap.h
> ===================================================================
> --- linux-2.6.12-vm.orig/include/linux/swap.h
> +++ linux-2.6.12-vm/include/linux/swap.h
> @@ -154,10 +154,15 @@ extern void out_of_memory(unsigned int _
>  extern void swapin_readahead(swp_entry_t, unsigned long, struct vm_area_struct *);
>  
>  /* linux/mm/nonresident.c */
> -extern int remember_page(struct address_space *, unsigned long);
> +extern int do_remember_page(struct address_space *, unsigned long);
>  extern int recently_evicted(struct address_space *, unsigned long);
>  extern void init_nonresident(void);
>  
> +/* linux/mm/clockpro.c */
> +extern void remember_page(struct page *, struct address_space *, unsigned long);
> +extern int page_is_hot(struct page *, struct address_space *, unsigned long);
> +DECLARE_PER_CPU(unsigned long, evicted_pages);
> +
>  /* linux/mm/page_alloc.c */
>  extern unsigned long totalram_pages;
>  extern unsigned long totalhigh_pages;
> @@ -298,6 +303,9 @@ static inline swp_entry_t get_swap_page(
>  #define remember_page(x,y)	0
>  #define recently_evicted(x,y)	0
>  
> +/* linux/mm/clockpro.c */
> +#define page_is_hot(x,y,z)	0
> +
>  #endif /* CONFIG_SWAP */
>  #endif /* __KERNEL__*/
>  #endif /* _LINUX_SWAP_H */
> Index: linux-2.6.12-vm/mm/Makefile
> ===================================================================
> --- linux-2.6.12-vm.orig/mm/Makefile
> +++ linux-2.6.12-vm/mm/Makefile
> @@ -13,7 +13,7 @@ obj-y			:= bootmem.o filemap.o mempool.o
>  			   prio_tree.o $(mmu-y)
>  
>  obj-$(CONFIG_SWAP)	+= page_io.o swap_state.o swapfile.o thrash.o \
> -			   nonresident.o
> +			   nonresident.o clockpro.o
>  obj-$(CONFIG_HUGETLBFS)	+= hugetlb.o
>  obj-$(CONFIG_NUMA) 	+= mempolicy.o
>  obj-$(CONFIG_SHMEM) += shmem.o
> Index: linux-2.6.12-vm/mm/clockpro.c
> ===================================================================
> --- /dev/null
> +++ linux-2.6.12-vm/mm/clockpro.c
> @@ -0,0 +1,102 @@
> +/*
> + * mm/clockpro.c
> + * (C) 2005 Red Hat, Inc
> + * Written by Rik van Riel <riel@redhat.com>
> + * Released under the GPL, see the file COPYING for details.
> + *
> + * Helper functions to implement CLOCK-Pro page replacement policy.
> + * For details see: http://linux-mm.org/wiki/AdvancedPageReplacement
> + */
> +#include <linux/mm.h>
> +#include <linux/mmzone.h>
> +#include <linux/swap.h>
> +
> +DEFINE_PER_CPU(unsigned long, evicted_pages);
> +static unsigned long get_evicted(void)
> +{
> +	unsigned long total = 0;
> +	int cpu;
> +
> +	for (cpu = first_cpu(cpu_online_map); cpu < NR_CPUS; cpu++)
> +		total += per_cpu(evicted_pages, cpu);
> +
> +	return total;
> +}
> +
> +static unsigned long estimate_pageable_memory(void)
> +{
> +	static unsigned long next_check;
> +	static unsigned long total;
> +	unsigned long active, inactive, free;
> +
> +	if (time_after(jiffies, next_check)) {
> +		get_zone_counts(&active, &inactive, &free);
> +		total = active + inactive + free;
> +		next_check = jiffies + HZ/10;
> +	}
> +
> +	return total;
> +}
> +
> +static void decay_clockpro_variables(void)
> +{
> +	struct zone * zone;
> +	int cpu;
> +
> +	for (cpu = first_cpu(cpu_online_map); cpu < NR_CPUS; cpu++)
> +		per_cpu(evicted_pages, cpu) /= 2;
> +
> +	for_each_zone(zone)
> +		zone->active_scanned /= 2;
> +}
> +
> +int page_is_hot(struct page * page, struct address_space * mapping,
> +		unsigned long index)
> +{
> +	unsigned long long distance;
> +	unsigned long long evicted;
> +	int refault_distance;
> +	struct zone *zone;
> +
> +	/* Was the page recently evicted ? */
> +	refault_distance = recently_evicted(mapping, index);
> +	if (refault_distance < 0)
> +		return 0;
> +
> +	distance = estimate_pageable_memory() + refault_distance;
> +	evicted = get_evicted();
> +	zone = page_zone(page);
> +
> +	/* Only consider recent history for the calculation below. */
> +	if (unlikely(evicted > distance))
> +		decay_clockpro_variables();
> +
> +	/*
> +	 * Estimate whether the inter-reference distance of the tested
> +	 * page is smaller than the inter-reference distance of the
> +	 * oldest page on the active list.
> +	 *
> +	 *  distance        zone->nr_active
> +	 * ---------- <  ----------------------
> +	 *  evicted       zone->active_scanned
> +	 */
> +	if (distance * zone->active_scanned < evicted * zone->nr_active) {
> +		if (zone->active_limit > zone->present_pages / 8)
> +			zone->active_limit--;
> +		return 1;
> +	}
> +
> +	/* Increase the active limit more slowly. */
> +	if ((evicted & 1) && zone->active_limit < zone->present_pages * 7 / 8)
> +		zone->active_limit++;
> +	return 0;
> +}
> +
> +void remember_page(struct page * page, struct address_space * mapping,
> +		unsigned long index)
> +{
> +	struct zone * zone = page_zone(page);
> +	if (do_remember_page(mapping, index) && (index & 1) &&
> +			zone->active_limit < zone->present_pages * 7 / 8)
> +		zone->active_limit++;
> +}
> Index: linux-2.6.12-vm/mm/filemap.c
> ===================================================================
> --- linux-2.6.12-vm.orig/mm/filemap.c
> +++ linux-2.6.12-vm/mm/filemap.c
> @@ -401,9 +401,12 @@ int add_to_page_cache_lru(struct page *p
>  				pgoff_t offset, int gfp_mask)
>  {
>  	int ret = add_to_page_cache(page, mapping, offset, gfp_mask);
> -	recently_evicted(mapping, offset);
> -	if (ret == 0)
> -		lru_cache_add(page);
> +	if (ret == 0) {
> +		if (page_is_hot(page, mapping, offset))
> +			lru_cache_add_active(page);
> +		else
> +			lru_cache_add(page);
> +	}
>  	return ret;
>  }
>  
> Index: linux-2.6.12-vm/mm/nonresident.c
> ===================================================================
> --- linux-2.6.12-vm.orig/mm/nonresident.c
> +++ linux-2.6.12-vm/mm/nonresident.c
> @@ -25,6 +25,7 @@
>  #include <linux/prefetch.h>
>  #include <linux/kernel.h>
>  #include <linux/percpu.h>
> +#include <linux/swap.h>
>  
>  /* Number of non-resident pages per hash bucket. Never smaller than 15. */
>  #if (L1_CACHE_BYTES < 64)
> @@ -101,7 +102,7 @@ int recently_evicted(struct address_spac
>  	return -1;
>  }
>  
> -int remember_page(struct address_space * mapping, unsigned long index)
> +int do_remember_page(struct address_space * mapping, unsigned long index)
>  {
>  	struct nr_bucket * nr_bucket;
>  	u32 nrpage;
> @@ -125,6 +126,7 @@ int remember_page(struct address_space *
>  	preempt_enable();
>  
>  	/* Statistics may want to know whether the entry was in use. */
> +	__get_cpu_var(evicted_pages)++;
>  	return xchg(&nr_bucket->page[i], nrpage);
>  }
>  
> Index: linux-2.6.12-vm/mm/page_alloc.c
> ===================================================================
> --- linux-2.6.12-vm.orig/mm/page_alloc.c
> +++ linux-2.6.12-vm/mm/page_alloc.c
> @@ -1715,6 +1715,7 @@ static void __init free_area_init_core(s
>  		zone->nr_scan_inactive = 0;
>  		zone->nr_active = 0;
>  		zone->nr_inactive = 0;
> +		zone->active_limit = zone->present_pages * 2 / 3;
>  		if (!size)
>  			continue;
>  
> Index: linux-2.6.12-vm/mm/swap_state.c
> ===================================================================
> --- linux-2.6.12-vm.orig/mm/swap_state.c
> +++ linux-2.6.12-vm/mm/swap_state.c
> @@ -323,6 +323,7 @@ struct page *read_swap_cache_async(swp_e
>  			struct vm_area_struct *vma, unsigned long addr)
>  {
>  	struct page *found_page, *new_page = NULL;
> +	int active;
>  	int err;
>  
>  	do {
> @@ -344,7 +345,7 @@ struct page *read_swap_cache_async(swp_e
>  				break;		/* Out of memory */
>  		}
>  
> -		recently_evicted(&swapper_space, entry.val);
> +		active = page_is_hot(new_page, &swapper_space, entry.val);
>  
>  		/*
>  		 * Associate the page with swap entry in the swap cache.
> @@ -361,7 +362,10 @@ struct page *read_swap_cache_async(swp_e
>  			/*
>  			 * Initiate read into locked page and return.
>  			 */
> -			lru_cache_add_active(new_page);
> +			if (active) {
> +				lru_cache_add_active(new_page);
> +			} else
> +				lru_cache_add(new_page);
>  			swap_readpage(NULL, new_page);
>  			return new_page;
>  		}
> Index: linux-2.6.12-vm/mm/vmscan.c
> ===================================================================
> --- linux-2.6.12-vm.orig/mm/vmscan.c
> +++ linux-2.6.12-vm/mm/vmscan.c
> @@ -355,12 +355,14 @@ static int shrink_list(struct list_head 
>  	while (!list_empty(page_list)) {
>  		struct address_space *mapping;
>  		struct page *page;
> +		struct zone *zone;
>  		int may_enter_fs;
>  		int referenced;
>  
>  		cond_resched();
>  
>  		page = lru_to_page(page_list);
> +		zone = page_zone(page);
>  		list_del(&page->lru);
>  
>  		if (TestSetPageLocked(page))
> @@ -492,7 +494,7 @@ static int shrink_list(struct list_head 
>  #ifdef CONFIG_SWAP
>  		if (PageSwapCache(page)) {
>  			swp_entry_t swap = { .val = page->private };
> -			remember_page(&swapper_space, page->private);
> +			remember_page(page, &swapper_space, page->private);
>  			__delete_from_swap_cache(page);
>  			write_unlock_irq(&mapping->tree_lock);
>  			swap_free(swap);
> @@ -501,7 +503,7 @@ static int shrink_list(struct list_head 
>  		}
>  #endif /* CONFIG_SWAP */
>  
> -		remember_page(page->mapping, page->index);
> +		remember_page(page, page->mapping, page->index);
>  		__remove_from_page_cache(page);
>  		write_unlock_irq(&mapping->tree_lock);
>  		__put_page(page);
> @@ -684,6 +686,7 @@ refill_inactive_zone(struct zone *zone, 
>  	pgmoved = isolate_lru_pages(nr_pages, &zone->active_list,
>  				    &l_hold, &pgscanned);
>  	zone->pages_scanned += pgscanned;
> +	zone->active_scanned += pgscanned;
>  	zone->nr_active -= pgmoved;
>  	spin_unlock_irq(&zone->lru_lock);
>  
> @@ -799,10 +802,15 @@ shrink_zone(struct zone *zone, struct sc
>  	unsigned long nr_inactive;
>  
>  	/*
> -	 * Add one to `nr_to_scan' just to make sure that the kernel will
> -	 * slowly sift through the active list.
> +	 * Scan the active list if we have too many active pages.
> +	 * The limit is automatically adjusted through refaults
> +	 * measuring how well the VM did in the past.
>  	 */
> -	zone->nr_scan_active += (zone->nr_active >> sc->priority) + 1;
> +	if (zone->nr_active > zone->active_limit)
> +		zone->nr_scan_active += zone->nr_active - zone->active_limit;
> +	else if (sc->priority < DEF_PRIORITY - 2)
> +		zone->nr_scan_active += (zone->nr_active >> sc->priority) + 1;
> +
>  	nr_active = zone->nr_scan_active;
>  	if (nr_active >= sc->swap_cluster_max)
>  		zone->nr_scan_active = 0;
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH/RFT 4/5] CLOCK-Pro page replacement
  2005-08-10 23:22   ` Marcelo Tosatti
@ 2005-08-11  0:06     ` Rik van Riel
  0 siblings, 0 replies; 21+ messages in thread
From: Rik van Riel @ 2005-08-11  0:06 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: linux-mm, linux-kernel

On Wed, 10 Aug 2005, Marcelo Tosatti wrote:

> First of all, this is very nice! The code is amazingly easy to read.

Thank you.

> You change the rate of active list scanning, which I suppose won't
> change the current reclaiming behaviour much (at least not on the
> "stress system to death" tests which most folks use to test page
> replacement policies). I'll do some STP benchmarking.
> 
> But the fundamental metric for page replacement decision continues
> to be recency alone.
>
> IMHO much deeper surgery is needed: actually use inter-reference
> distance as the metric for page replacement decision.

Actually, inter-reference distance is what triggers whether the
active list gets scanned in addition to the inactive list.

The inter-reference distance also determines on which list a
page gets allocated.

I agree the code probably needs tweaking in this respect, though.

> As we talked, I've got an ARC variant working, but from what I gather
> so far its not as simple as I've imagined. Direct replacement from the
> active list seems to screw up most "stress system to death" workloads,
> increasing major pagefaults.

I'm not surprised!

If a page is not accessed frequently enough to be on the active
list, chances are it is still accessed more frequently than
many pages on the inactive list, and evicting it is the
wrong thing to do.

I suspect that ARC and CAR/CART are better suited to databases
than to general-purpose OSes.  The reason databases are
probably different is that they tend to have large indexes,
which are accessed more frequently than most data.  This may
lead to a stricter "separation" between hot and cold pages.

> Still lack a set of well analyzed pertinent VM tests... 

Agreed - I really am not sure how to properly test replacement
algorithms.

-- 
All Rights Reversed

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH/RFT 5/5] CLOCK-Pro page replacement
  2005-08-10 20:02 ` [PATCH/RFT 5/5] " Rik van Riel
@ 2005-08-11 22:08   ` Song Jiang
  2005-08-12  1:22     ` Rik van Riel
  0 siblings, 1 reply; 21+ messages in thread
From: Song Jiang @ 2005-08-11 22:08 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-mm, linux-kernel

My current test focuses on the looping case, where I repeatedly 
scan a file whose size is larger than the memory size but less 
than twice the memory size. My initial results are as follows:


My machine has 2GB of memory.
The size of the file to be scanned is 2.5GB.
I looped 4 times. The times and associated disk bandwidths
for each loop are below:

loop 0 time = 34.229424s bandwidth = 76.58MB/s
loop 1 time = 37.574041s bandwidth = 69.76MB/s
loop 2 time = 38.181791s bandwidth = 68.65MB/s
loop 3 time = 38.141794s bandwidth = 68.72MB/s

This shows that the current patches do no better than the 
original kernel, which notoriously underperforms on this case: 
no matter how many times the file is accessed, there are no 
hits at all. CLOCK-Pro, by contrast, is supposed to do better, 
because part of the file can be protected on the active list 
and get a decent number of hits.

Here is the relevant part of /proc/meminfo:

Active:          11356 kB
Inactive:      1994400 kB

So no file pages are promoted into the active list, just
as in the original kernel.

Here is the output from /proc/refaults:

    Refault distance          Hits
         0 -     32768           192
    32768 -     65536           269
    65536 -     98304           447
    98304 -    131072           603
   131072 -    163840          1087
   163840 -    196608           909
   196608 -    229376           558
   229376 -    262144           404
   262144 -    294912           287
   294912 -    327680           191
   327680 -    360448            79
   360448 -    393216            68
   393216 -    425984            41
   425984 -    458752            45
   458752 -    491520            31
New/Beyond    491520          2443

In the statistics, we do see many hits at a distance of around 
150,000 pages. Taking the inactive list size (1.9GB) into account, 
this position corresponds to the file size. However, if everything
happened as expected, all the hits should occur at that
distance. Unfortunately, there are also many hits listed as
New/Beyond. Since "Beyond" entries should not be there, are they
all "New"? Furthermore, I didn't see where the refault_histogram 
statistics get reset, though they almost stop increasing after
the first run. Can you show me where that happens?


   Song Jiang
   at LANL


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH/RFT 5/5] CLOCK-Pro page replacement
  2005-08-11 22:08   ` Song Jiang
@ 2005-08-12  1:22     ` Rik van Riel
  0 siblings, 0 replies; 21+ messages in thread
From: Rik van Riel @ 2005-08-12  1:22 UTC (permalink / raw)
  To: Song Jiang; +Cc: linux-mm, linux-kernel

On Thu, 11 Aug 2005, Song Jiang wrote:

> My machine has 2GB memory.
> The size of the file to be scanned is 2.5GB.

> Meanwhile, Clock-Pro is supposed to do a better job, because
> part of the file can be protected in the active list and get
> a decent number of hits.

> Active:          11356 kB
> Inactive:      1994400 kB

There is an error somewhere in my implementation of Clock-Pro.

I have made some tweaks but haven't found a proper fix yet.

The problem is that if I tilt things too far in favor of the
evicted pages, then pages on the active list may get replaced
by pages that have the _same_ inter-reference distance, which
would result in similarly bad behaviour.

Eyeballs on vmscan.c, nonresident.c and clockpro.c would be
very much appreciated ;)

> Here is from /proc/refaults:     
> 
>     Refault distance          Hits
>          0 -     32768           192
>     32768 -     65536           269
>     65536 -     98304           447
>     98304 -    131072           603
>    131072 -    163840          1087
>    163840 -    196608           909
>    196608 -    229376           558
>    229376 -    262144           404
>    262144 -    294912           287
>    294912 -    327680           191
>    327680 -    360448            79
>    360448 -    393216            68
>    393216 -    425984            41
>    425984 -    458752            45
>    458752 -    491520            31
> New/Beyond    491520          2443
> 
> In the statistics, we do see many hits at a distance of around 
> 150,000 pages. Taking the inactive list size (1.9GB) into account, 
> this position corresponds to the file size. However, if everything
> happened as expected, all the hits should occur at that
> distance. Unfortunately, there are also many hits listed as
> New/Beyond. Since "Beyond" entries should not be there, are they
> all "New"? Furthermore, I didn't see where the refault_histogram 
> statistics get reset, though they almost stop increasing after
> the first run. Can you show me where that happens?

Currently the statistics never get reset.

The number of "new/beyond" sounds about right for the
startup of a fully running Linux system.

-- 
All Rights Reversed

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH/RFT 4/5] CLOCK-Pro page replacement
  2005-08-10 20:31   ` David S. Miller, Rik van Riel
@ 2005-08-18  0:38     ` Andrew Morton
  2005-08-18  2:48       ` David S. Miller, Andrew Morton
  0 siblings, 1 reply; 21+ messages in thread
From: Andrew Morton @ 2005-08-18  0:38 UTC (permalink / raw)
  To: David S. Miller; +Cc: riel, linux-mm, linux-kernel

"David S. Miller" <davem@davemloft.net> wrote:
>
> > +DEFINE_PER_CPU(unsigned long, evicted_pages);
> 
> DEFINE_PER_CPU() needs an explicit initializer to work
> around some bugs in gcc-2.95, wherein on some platforms
> if you let it end up as a BSS candidate it won't end up
> in the per-cpu section properly.
> 
> I'm actually happy you made this mistake as it forced me
> to audit the whole current 2.6.x tree, and there are a few
> missing cases in there which I'll fix up and send to Linus.

I'm pretty sure we fixed that somehow.  But I forget how.

* Re: [PATCH/RFT 4/5] CLOCK-Pro page replacement
  2005-08-18  0:38     ` Andrew Morton
@ 2005-08-18  2:48       ` David S. Miller, Andrew Morton
  2005-08-18  4:05         ` Andrew Morton
  0 siblings, 1 reply; 21+ messages in thread
From: David S. Miller, Andrew Morton @ 2005-08-18  2:48 UTC (permalink / raw)
  To: akpm; +Cc: riel, linux-mm, linux-kernel

> I'm pretty sure we fixed that somehow.  But I forget how.

I wish you could remember :-)  I honestly don't think we did.
The DEFINE_PER_CPU() definition still looks the same, and the
way the .data.percpu section is laid out in vmlinux.lds.S
is still the same as well.

The places which are not handled currently are in rarely used
areas such as IPVS, some S390 drivers, and some other
platform-specific code (likely platforms where the gcc problem
in question never existed).

I do note two important spots where the initialization is not
present, the loopback driver statistics and the scsi_done_q.
Hmmm...

If we are handling it somehow, that would be nice to know for
certain, because we could thus remove all of the ugly
initializers.

* Re: [PATCH/RFT 4/5] CLOCK-Pro page replacement
  2005-08-18  2:48       ` David S. Miller, Andrew Morton
@ 2005-08-18  4:05         ` Andrew Morton
  2005-08-18  4:48           ` David S. Miller, Andrew Morton
  0 siblings, 1 reply; 21+ messages in thread
From: Andrew Morton @ 2005-08-18  4:05 UTC (permalink / raw)
  To: David S. Miller; +Cc: riel, linux-mm, linux-kernel, Rusty Russell

"David S. Miller" <davem@davemloft.net> wrote:
>
> From: Andrew Morton <akpm@osdl.org>
> Date: Wed, 17 Aug 2005 17:38:18 -0700
> 
> > I'm pretty sure we fixed that somehow.  But I forget how.
> 
> I wish you could remember :-)  I honestly don't think we did.
> The DEFINE_PER_CPU() definition still looks the same, and the
> way the .data.percpu section is laid out in vmlinux.lds.S
> is still the same as well.

Argh, can't remember, can't find it with archive grep.  I just have a
mental note that it got fixed somehow.  Perhaps by uprevving the compiler
version?  We certainly have a ton of uninitialised DEFINE_PER_CPUs in there
nowadays and people's kernels aren't crashing.

Rusty, do you recall if/how we fixed the
DEFINE_PER_CPU-needs-explicit-initialisation thing?

* Re: [PATCH/RFT 4/5] CLOCK-Pro page replacement
  2005-08-18  4:05         ` Andrew Morton
@ 2005-08-18  4:48           ` David S. Miller, Andrew Morton
  2005-08-19  7:03             ` Rusty Russell
  0 siblings, 1 reply; 21+ messages in thread
From: David S. Miller, Andrew Morton @ 2005-08-18  4:48 UTC (permalink / raw)
  To: akpm; +Cc: riel, linux-mm, linux-kernel, rusty

> Perhaps by uprevving the compiler version?

Can't be, we definitely support gcc-2.95 and that compiler
definitely has the bug on sparc64.


* Re: [PATCH/RFT 4/5] CLOCK-Pro page replacement
  2005-08-18  4:48           ` David S. Miller, Andrew Morton
@ 2005-08-19  7:03             ` Rusty Russell
  2005-08-19  7:10               ` Andrew Morton
  0 siblings, 1 reply; 21+ messages in thread
From: Rusty Russell @ 2005-08-19  7:03 UTC (permalink / raw)
  To: David S. Miller; +Cc: akpm, riel, linux-mm, linux-kernel

On Wed, 2005-08-17 at 21:48 -0700, David S. Miller wrote:
> From: Andrew Morton <akpm@osdl.org>
> Date: Wed, 17 Aug 2005 21:05:32 -0700
> 
> > Perhaps by uprevving the compiler version?
> 
> Can't be, we definitely support gcc-2.95 and that compiler
> definitely has the bug on sparc64.

I believe we just ignored sparc64.  That usually works for solving these
kind of bugs. 8)

Rusty.
-- 
A bad analogy is like a leaky screwdriver -- Richard Braakman


* Re: [PATCH/RFT 4/5] CLOCK-Pro page replacement
  2005-08-19  7:03             ` Rusty Russell
@ 2005-08-19  7:10               ` Andrew Morton
  2005-08-19  7:27                 ` Rusty Russell
  0 siblings, 1 reply; 21+ messages in thread
From: Andrew Morton @ 2005-08-19  7:10 UTC (permalink / raw)
  To: Rusty Russell; +Cc: davem, riel, linux-mm, linux-kernel

Rusty Russell <rusty@rustcorp.com.au> wrote:
>
> On Wed, 2005-08-17 at 21:48 -0700, David S. Miller wrote:
> > From: Andrew Morton <akpm@osdl.org>
> > Date: Wed, 17 Aug 2005 21:05:32 -0700
> > 
> > > Perhaps by uprevving the compiler version?
> > 
> > Can't be, we definitely support gcc-2.95 and that compiler
> > definitely has the bug on sparc64.
> 
> I believe we just ignored sparc64.  That usually works for solving these
> kind of bugs. 8)

heh.  iirc, it was demonstrable on x86 also.

Dunno, it beats me.  But it is the case that we now have lots of
uninitialised DEFINE_PER_CPUs and nobody's crashing.  hm..

* Re: [PATCH/RFT 4/5] CLOCK-Pro page replacement
  2005-08-19  7:10               ` Andrew Morton
@ 2005-08-19  7:27                 ` Rusty Russell
  2005-08-19 13:04                   ` Horst von Brand
  0 siblings, 1 reply; 21+ messages in thread
From: Rusty Russell @ 2005-08-19  7:27 UTC (permalink / raw)
  To: Andrew Morton; +Cc: davem, riel, linux-mm, linux-kernel

On Fri, 2005-08-19 at 00:10 -0700, Andrew Morton wrote:
> Rusty Russell <rusty@rustcorp.com.au> wrote:
> > I believe we just ignored sparc64.  That usually works for solving these
> > kind of bugs. 8)
> 
> heh.  iirc, it was demonstrable on x86 also.

No.  gcc-2.95 on Sparc64 put uninitialized vars into the BSS, ignoring
the __attribute__((section(".data.percpu"))) directive.  x86 certainly
doesn't have this problem; I just tested it with 2.95.

Really, it's Sparc64 + gcc-2.95.  Send an urgent telegram to the user
telling them to upgrade.

Rusty.
-- 
A bad analogy is like a leaky screwdriver -- Richard Braakman


* Re: [PATCH/RFT 4/5] CLOCK-Pro page replacement
  2005-08-19  7:27                 ` Rusty Russell
@ 2005-08-19 13:04                   ` Horst von Brand
  0 siblings, 0 replies; 21+ messages in thread
From: Horst von Brand @ 2005-08-19 13:04 UTC (permalink / raw)
  To: Rusty Russell; +Cc: Andrew Morton, davem, riel, linux-mm, linux-kernel

Rusty Russell <rusty@rustcorp.com.au> wrote:
> On Fri, 2005-08-19 at 00:10 -0700, Andrew Morton wrote:
> > Rusty Russell <rusty@rustcorp.com.au> wrote:
> > > I believe we just ignored sparc64.  That usually works for solving these
> > > kind of bugs. 8)
> > 
> > heh.  iirc, it was demonstrable on x86 also.
> 
> No.  gcc-2.95 on Sparc64 put uninitialized vars into the BSS, ignoring
> the __attribute__((section(".data.percpu"))) directive.  x86 certainly
> doesn't have this problem; I just tested it with 2.95.
> 
> Really, it's Sparc64 + gcc-2.95.  Send an urgent telegram to the user
> telling them to upgrade.

I recently asked if gcc-2.95 was really still supported, and was told that
it is in common use for its speed...
-- 
Dr. Horst H. von Brand                   User #22616 counter.li.org
Departamento de Informatica                     Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria              +56 32 654239
Casilla 110-V, Valparaiso, Chile                Fax:  +56 32 797513

end of thread, other threads:[~2005-08-19 13:04 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-08-10 20:02 [PATCH/RFT 0/5] CLOCK-Pro page replacement Rik van Riel
2005-08-10 20:02 ` [PATCH/RFT 1/5] " Rik van Riel
2005-08-10 20:02 ` [PATCH/RFT 2/5] " Rik van Riel
2005-08-10 20:27   ` David S. Miller, Rik van Riel
2005-08-10 20:38     ` Rik van Riel
2005-08-10 20:02 ` [PATCH/RFT 3/5] " Rik van Riel
2005-08-10 20:02 ` [PATCH/RFT 4/5] " Rik van Riel
2005-08-10 20:31   ` David S. Miller, Rik van Riel
2005-08-18  0:38     ` Andrew Morton
2005-08-18  2:48       ` David S. Miller, Andrew Morton
2005-08-18  4:05         ` Andrew Morton
2005-08-18  4:48           ` David S. Miller, Andrew Morton
2005-08-19  7:03             ` Rusty Russell
2005-08-19  7:10               ` Andrew Morton
2005-08-19  7:27                 ` Rusty Russell
2005-08-19 13:04                   ` Horst von Brand
2005-08-10 23:22   ` Marcelo Tosatti
2005-08-11  0:06     ` Rik van Riel
2005-08-10 20:02 ` [PATCH/RFT 5/5] " Rik van Riel
2005-08-11 22:08   ` Song Jiang
2005-08-12  1:22     ` Rik van Riel
