[patch][rfc] 2.6.23-rc1 mm: NUMA replicated pagecache

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [patch][rfc] 2.6.23-rc1 mm: NUMA replicated pagecache
@ 2007-07-27  8:42 Nick Piggin
  2007-07-27 14:30 ` Lee Schermerhorn
  2007-08-08 20:25 ` Lee Schermerhorn
  0 siblings, 2 replies; 18+ messages in thread
From: Nick Piggin @ 2007-07-27  8:42 UTC (permalink / raw)
  To: Linux Memory Management List, Linux Kernel Mailing List
  Cc: Joachim Deguara, Lee Schermerhorn

Hi,

Just got a bit of time to take another look at the replicated pagecache
patch. The nopage vs invalidate race and clear_page_dirty_for_io fixes
gives me more confidence in the locking now; the new ->fault API makes
MAP_SHARED write faults much more efficient; and a few bugs were found
and fixed.

More stats were added: *repl* in /proc/vmstat. Survives some kbuilding
tests...

--

Page-based NUMA pagecache replication.

This is a scheme for page replication replicates read-only pagecache pages
opportunistically, at pagecache lookup time (at points where we know the
page is being looked up for read only).

The page will be replicated if it resides on a different node to what the
requesting CPU is on. Also, the original page must meet some conditions:
it must be clean, uptodate, not under writeback, and not have an elevated
refcount or filesystem private data. However it is allowed to be mapped
into pagetables.

Replication is done at the pagecache level, where a replicated pagecache
(inode,offset) key will have a special bit set in its radix-tree entry,
which tells us the entry points to a descriptor rather than a page.

This descriptor (struct pcache_desc) has another radix-tree which is keyed by
node. The pagecache gains an (optional) 3rd dimension!

Pagecache lookups which are not explicitly denoted as being read-only are
treaded as writes, and they collapse the replication before proceeding.
Writes into pagetables are caught by using the same mechanism as dirty page
throttling uses, and also collapse the replication.

After collapsing a replication, all process page tables are unmapped, so that
any processes mapping discarded pages will refault in the correct one.

/proc/vmstat has nr_repl_pages, which is the number of _additional_ pages
replicated, beyond the first.

Status:
- Lee showed that ~10s (1%) user time was cut off a kernel compile benchmark
  on his 4 node 16-way box.

Todo:
- find_get_page locking semantics are slightly changed. This doesn't appear
  to be a problem but I need to have a more thorough look.
- Would like to be able to control replication via userspace, and maybe
  even internally to the kernel.
- Ideally, reclaim might reclaim replicated pages preferentially, however
  I aim to be _minimally_ intrusive, and this conflicts with that.
- More correctness testing.
- Eventually, have to look at playing nicely with migration.
- radix-tree nodes start using up a large amount of memory. Try to improve.
  (eg. different data structure, smaller tree, or don't load master immediately).

Index: linux-2.6/include/linux/mm_types.h
===================================================================
--- linux-2.6.orig/include/linux/mm_types.h
+++ linux-2.6/include/linux/mm_types.h
@@ -5,6 +5,8 @@
 #include <linux/threads.h>
 #include <linux/list.h>
 #include <linux/spinlock.h>
+#include <linux/radix-tree.h>
+#include <linux/nodemask.h>
 
 struct address_space;
 
@@ -80,4 +82,10 @@ struct page {
 #endif /* WANT_PAGE_VIRTUAL */
 };
 
+struct pcache_desc {
+	struct page *master;
+	nodemask_t nodes_present;
+	struct radix_tree_root page_tree;
+};
+
 #endif /* _LINUX_MM_TYPES_H */
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -593,16 +593,13 @@ void fastcall __lock_page_nosync(struct 
  * Is there a pagecache struct page at the given (mapping, offset) tuple?
  * If yes, increment its refcount and return it; if no, return NULL.
  */
-struct page * find_get_page(struct address_space *mapping, unsigned long offset)
+struct page *find_get_page(struct address_space *mapping, unsigned long offset)
 {
 	struct page *page;
 
 	read_lock_irq(&mapping->tree_lock);
 	page = radix_tree_lookup(&mapping->page_tree, offset);
-	if (page)
-		page_cache_get(page);
-	read_unlock_irq(&mapping->tree_lock);
-	return page;
+	return get_unreplicated_page(mapping, offset, page);
 }
 EXPORT_SYMBOL(find_get_page);
 
@@ -621,26 +618,16 @@ struct page *find_lock_page(struct addre
 {
 	struct page *page;
 
-	read_lock_irq(&mapping->tree_lock);
 repeat:
-	page = radix_tree_lookup(&mapping->page_tree, offset);
+	page = find_get_page(mapping, offset);
 	if (page) {
-		page_cache_get(page);
-		if (TestSetPageLocked(page)) {
-			read_unlock_irq(&mapping->tree_lock);
-			__lock_page(page);
-			read_lock_irq(&mapping->tree_lock);
-
-			/* Has the page been truncated while we slept? */
-			if (unlikely(page->mapping != mapping ||
-				     page->index != offset)) {
-				unlock_page(page);
-				page_cache_release(page);
-				goto repeat;
-			}
+		lock_page(page);
+		if (unlikely(page->mapping != mapping)) {
+			unlock_page(page);
+			page_cache_release(page);
+			goto repeat;
 		}
 	}
-	read_unlock_irq(&mapping->tree_lock);
 	return page;
 }
 EXPORT_SYMBOL(find_lock_page);
@@ -709,15 +696,12 @@ EXPORT_SYMBOL(find_or_create_page);
 unsigned find_get_pages(struct address_space *mapping, pgoff_t start,
 			    unsigned int nr_pages, struct page **pages)
 {
-	unsigned int i;
 	unsigned int ret;
 
 	read_lock_irq(&mapping->tree_lock);
 	ret = radix_tree_gang_lookup(&mapping->page_tree,
 				(void **)pages, start, nr_pages);
-	for (i = 0; i < ret; i++)
-		page_cache_get(pages[i]);
-	read_unlock_irq(&mapping->tree_lock);
+	get_unreplicated_pages(mapping, pages, ret);
 	return ret;
 }
 
@@ -745,11 +729,9 @@ unsigned find_get_pages_contig(struct ad
 	for (i = 0; i < ret; i++) {
 		if (pages[i]->mapping == NULL || pages[i]->index != index)
 			break;
-
-		page_cache_get(pages[i]);
-		index++;
 	}
-	read_unlock_irq(&mapping->tree_lock);
+
+	get_unreplicated_pages(mapping, pages, i);
 	return i;
 }
 EXPORT_SYMBOL(find_get_pages_contig);
@@ -768,17 +750,18 @@ EXPORT_SYMBOL(find_get_pages_contig);
 unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index,
 			int tag, unsigned int nr_pages, struct page **pages)
 {
-	unsigned int i;
 	unsigned int ret;
 
 	read_lock_irq(&mapping->tree_lock);
+	/*
+	 * Don't need to check for replicated pages, because dirty
+	 * and writeback pages should never be replicated.
+	 */
 	ret = radix_tree_gang_lookup_tag(&mapping->page_tree,
 				(void **)pages, *index, nr_pages, tag);
-	for (i = 0; i < ret; i++)
-		page_cache_get(pages[i]);
 	if (ret)
 		*index = pages[ret - 1]->index + 1;
-	read_unlock_irq(&mapping->tree_lock);
+	get_unreplicated_pages(mapping, pages, ret);
 	return ret;
 }
 EXPORT_SYMBOL(find_get_pages_tag);
@@ -892,7 +875,7 @@ void do_generic_mapping_read(struct addr
 
 		cond_resched();
 find_page:
-		page = find_get_page(mapping, index);
+		page = find_get_page_readonly(mapping, index);
 		if (!page) {
 			page_cache_sync_readahead(mapping,
 					&ra, filp,
@@ -1021,7 +1004,8 @@ readpage:
 			unlock_page(page);
 		}
 
-		goto page_ok;
+		page_cache_release(page);
+		goto find_page;
 
 readpage_error:
 		/* UHHUH! A synchronous read error occurred. Report it */
@@ -1306,6 +1290,14 @@ static int fastcall page_cache_read(stru
 
 #define MMAP_LOTSAMISS  (100)
 
+static struct page *find_lock_page_write(struct address_space *mapping, pgoff_t index, int write)
+{
+	if (write)
+		return find_lock_page(mapping, index);
+	else
+		return find_lock_page_readonly(mapping, index);
+}
+
 /**
  * filemap_fault - read in file data for page fault handling
  * @vma:	vma in which the fault was taken
@@ -1342,7 +1334,7 @@ int filemap_fault(struct vm_area_struct 
 	 * Do we have something in the page cache already?
 	 */
 retry_find:
-	page = find_lock_page(mapping, vmf->pgoff);
+	page = find_lock_page_write(mapping, vmf->pgoff, vmf->flags & FAULT_FLAG_WRITE);
 	/*
 	 * For sequential accesses, we use the generic readahead logic.
 	 */
@@ -1350,7 +1342,7 @@ retry_find:
 		if (!page) {
 			page_cache_sync_readahead(mapping, ra, file,
 							   vmf->pgoff, 1);
-			page = find_lock_page(mapping, vmf->pgoff);
+			page = find_lock_page_write(mapping, vmf->pgoff, vmf->flags & FAULT_FLAG_WRITE);
 			if (!page)
 				goto no_cached_page;
 		}
@@ -1389,7 +1381,7 @@ retry_find:
 				start = vmf->pgoff - ra_pages / 2;
 			do_page_cache_readahead(mapping, file, start, ra_pages);
 		}
-		page = find_lock_page(mapping, vmf->pgoff);
+		page = find_lock_page_write(mapping, vmf->pgoff, vmf->flags & FAULT_FLAG_WRITE);
 		if (!page)
 			goto no_cached_page;
 	}
Index: linux-2.6/mm/internal.h
===================================================================
--- linux-2.6.orig/mm/internal.h
+++ linux-2.6/mm/internal.h
@@ -12,6 +12,7 @@
 #define __MM_INTERNAL_H
 
 #include <linux/mm.h>
+#include <linux/pagemap.h>
 
 static inline void set_page_count(struct page *page, int v)
 {
@@ -37,4 +38,62 @@ static inline void __put_page(struct pag
 extern void fastcall __init __free_pages_bootmem(struct page *page,
 						unsigned int order);
 
+#ifdef CONFIG_REPLICATION
+extern int reclaim_replicated_page(struct address_space *mapping,
+		struct page *page);
+extern struct page *get_unreplicated_page(struct address_space *mapping,
+				unsigned long offset, struct page *page);
+extern void get_unreplicated_pages(struct address_space *mapping,
+				struct page **pages, int nr);
+extern struct page *find_get_page_readonly(struct address_space *mapping,
+						unsigned long offset);
+extern struct page *find_lock_page_readonly(struct address_space *mapping,
+						unsigned long offset);
+int page_write_fault_retry(struct page *page);
+#else
+
+static inline int reclaim_replicated_page(struct address_space *mapping,
+		struct page *page)
+{
+	BUG();
+	return 0;
+}
+
+static inline struct page *get_unreplicated_page(struct address_space *mapping,
+				unsigned long offset, struct page *page)
+{
+	if (page)
+		page_cache_get(page);
+	read_unlock_irq(&mapping->tree_lock);
+	return page;
+}
+
+static inline void get_unreplicated_pages(struct address_space *mapping,
+				struct page **pages, int nr)
+{
+	int i;
+	for (i = 0; i < nr; i++)
+		page_cache_get(pages[i]);
+	read_unlock_irq(&mapping->tree_lock);
+}
+
+static inline struct page *find_get_page_readonly(struct address_space *mapping,
+						unsigned long offset)
+{
+	return find_get_page(mapping, offset);
+}
+
+static inline struct page *find_lock_page_readonly(struct address_space *mapping,
+						unsigned long offset)
+{
+	return find_lock_page(mapping, offset);
+}
+
+static inline int page_write_fault_retry(struct page *page)
+{
+	return 0;
+}
+
+#endif
+
 #endif
Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c
+++ linux-2.6/mm/vmscan.c
@@ -368,6 +368,7 @@ int remove_mapping(struct address_space 
 	BUG_ON(!PageLocked(page));
 	BUG_ON(mapping != page_mapping(page));
 
+again:
 	write_lock_irq(&mapping->tree_lock);
 	/*
 	 * The non racy check for a busy page.
@@ -409,7 +410,11 @@ int remove_mapping(struct address_space 
 		return 1;
 	}
 
-	__remove_from_page_cache(page);
+	if (PageReplicated(page)) {
+		if (reclaim_replicated_page(mapping, page))
+			goto again;
+	} else
+		__remove_from_page_cache(page);
 	write_unlock_irq(&mapping->tree_lock);
 	__put_page(page);
 	return 1;
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -1047,6 +1047,12 @@ extern void show_mem(void);
 extern void si_meminfo(struct sysinfo * val);
 extern void si_meminfo_node(struct sysinfo *val, int nid);
 
+#ifdef CONFIG_REPLICATION
+extern void replication_init(void);
+#else
+static inline void replication_init(void) {}
+#endif
+
 #ifdef CONFIG_NUMA
 extern void setup_per_cpu_pageset(void);
 #else
Index: linux-2.6/init/main.c
===================================================================
--- linux-2.6.orig/init/main.c
+++ linux-2.6/init/main.c
@@ -615,6 +615,7 @@ asmlinkage void __init start_kernel(void
 	kmem_cache_init();
 	setup_per_cpu_pageset();
 	numa_policy_init();
+	replication_init();
 	if (late_time_init)
 		late_time_init();
 	calibrate_delay();
Index: linux-2.6/include/linux/mmzone.h
===================================================================
--- linux-2.6.orig/include/linux/mmzone.h
+++ linux-2.6/include/linux/mmzone.h
@@ -63,6 +63,9 @@ enum zone_stat_item {
 	NR_FILE_MAPPED,	/* pagecache pages mapped into pagetables.
 			   only modified from process context */
 	NR_FILE_PAGES,
+#ifdef CONFIG_REPLICATION
+	NR_REPL_PAGES,
+#endif
 	NR_FILE_DIRTY,
 	NR_WRITEBACK,
 	/* Second 128 byte cacheline */
Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c
+++ linux-2.6/mm/vmstat.c
@@ -482,6 +482,9 @@ static const char * const vmstat_text[] 
 	"nr_anon_pages",
 	"nr_mapped",
 	"nr_file_pages",
+#ifdef CONFIG_REPLICATION
+	"nr_repl_pages",
+#endif
 	"nr_dirty",
 	"nr_writeback",
 	"nr_slab_reclaimable",
@@ -515,6 +518,11 @@ static const char * const vmstat_text[] 
 	"pgfault",
 	"pgmajfault",
 
+#ifdef CONFIG_REPLICATION
+	"pgreplicated",
+	"pgreplicazap",
+#endif
+
 	TEXTS_FOR_ZONES("pgrefill")
 	TEXTS_FOR_ZONES("pgsteal")
 	TEXTS_FOR_ZONES("pgscan_kswapd")
Index: linux-2.6/mm/Kconfig
===================================================================
--- linux-2.6.orig/mm/Kconfig
+++ linux-2.6/mm/Kconfig
@@ -152,6 +152,17 @@ config MIGRATION
 	  example on NUMA systems to put pages nearer to the processors accessing
 	  the page.
 
+#
+# support for NUMA pagecache replication
+#
+config REPLICATION
+	bool "Pagecache replication"
+	def_bool n
+	depends on NUMA
+	help
+	  Enables NUMA pagecache page replication
+
+
 config RESOURCES_64BIT
 	bool "64 bit Memory and IO resources (EXPERIMENTAL)" if (!64BIT && EXPERIMENTAL)
 	default 64BIT
Index: linux-2.6/mm/Makefile
===================================================================
--- linux-2.6.orig/mm/Makefile
+++ linux-2.6/mm/Makefile
@@ -29,4 +29,4 @@ obj-$(CONFIG_FS_XIP) += filemap_xip.o
 obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_SMP) += allocpercpu.o
 obj-$(CONFIG_QUICKLIST) += quicklist.o
-
+obj-$(CONFIG_REPLICATION) += replication.o
Index: linux-2.6/mm/replication.c
===================================================================
--- /dev/null
+++ linux-2.6/mm/replication.c
@@ -0,0 +1,609 @@
+/*
+ *	linux/mm/replication.c
+ *
+ * NUMA pagecache replication
+ *
+ * Copyright (C) 2007  Nick Piggin, SuSE Labs
+ */
+#include <linux/init.h>
+#include <linux/mm.h>
+#include <linux/mmzone.h>
+#include <linux/swap.h>
+#include <linux/fs.h>
+#include <linux/pagemap.h>
+#include <linux/page-flags.h>
+#include <linux/pagevec.h>
+#include <linux/gfp.h>
+#include <linux/slab.h>
+#include <linux/radix-tree.h>
+#include <linux/spinlock.h>
+
+#include "internal.h"
+
+static struct kmem_cache *pcache_desc_cachep __read_mostly;
+
+void __init replication_init(void)
+{
+	pcache_desc_cachep = kmem_cache_create("pcache_desc",
+			sizeof(struct pcache_desc), 0, SLAB_PANIC, NULL);
+}
+
+static struct pcache_desc *alloc_pcache_desc(void)
+{
+	struct pcache_desc *ret;
+
+	/* NOIO because find_get_page_readonly may be called in the IO path */
+	ret = kmem_cache_alloc(pcache_desc_cachep, GFP_NOIO);
+	if (ret) {
+		memset(ret, 0, sizeof(struct pcache_desc));
+		/* XXX: should use non-atomic preloads */
+		INIT_RADIX_TREE(&ret->page_tree, GFP_ATOMIC);
+	}
+	return ret;
+}
+
+static void free_pcache_desc(struct pcache_desc *pcd)
+{
+	kmem_cache_free(pcache_desc_cachep, pcd);
+}
+
+/*
+ * Free the struct pcache_desc, and all slaves. The pagecache refcount is
+ * retained for the master (because presumably we're collapsing the replication.
+ *
+ * Returns 1 if any of the slaves had a non-zero mapcount (in which case, we'll
+ * have to unmap them), otherwise returns 0.
+ */
+static int release_pcache_desc(struct pcache_desc *pcd)
+{
+	int ret = 0;
+	int i;
+
+	for_each_node_mask(i, pcd->nodes_present) {
+		struct page *page;
+
+		page = radix_tree_delete(&pcd->page_tree, i);
+		BUG_ON(!page);
+		if (page != pcd->master) {
+			BUG_ON(PageDirty(page));
+			BUG_ON(!PageUptodate(page));
+			BUG_ON(!PageReplicated(page));
+			BUG_ON(PagePrivate(page));
+			ClearPageReplicated(page);
+			count_vm_event(PGREPLICAZAP);
+			page->mapping = NULL;
+			dec_zone_page_state(page, NR_REPL_PAGES);
+
+			if (page_mapped(page))
+				ret = 1; /* tell caller to unmap the ptes */
+
+			page_cache_release(page);
+		}
+	}
+	{
+		void *ptr;
+		BUG_ON(radix_tree_gang_lookup(&pcd->page_tree, &ptr, 0, 1) != 0);
+	}
+	free_pcache_desc(pcd);
+
+	return ret;
+}
+
+#define PCACHE_DESC_BIT	2 /* 1 is used internally by the radix-tree */
+
+static inline int __is_pcache_desc(void *ptr)
+{
+	if ((unsigned long)ptr & PCACHE_DESC_BIT)
+		return 1;
+	return 0;
+}
+
+static inline int is_pcache_desc(void *ptr)
+{
+	/* debugging */
+	if ((unsigned long)ptr & PCACHE_DESC_BIT) {
+		struct pcache_desc *pcd;
+		pcd = (struct pcache_desc *)((unsigned long)ptr & ~PCACHE_DESC_BIT);
+		BUG_ON(!PageReplicated(pcd->master));
+	} else {
+		struct page *page = ptr;
+		BUG_ON(PageReplicated(page));
+	}
+	return __is_pcache_desc(ptr);
+}
+
+static inline struct pcache_desc *ptr_to_pcache_desc(void *ptr)
+{
+	BUG_ON(!__is_pcache_desc(ptr));
+	return (struct pcache_desc *)((unsigned long)ptr & ~PCACHE_DESC_BIT);
+}
+
+static inline void *pcache_desc_to_ptr(struct pcache_desc *pcd)
+{
+	BUG_ON(__is_pcache_desc(pcd));
+	return (void *)((unsigned long)pcd | PCACHE_DESC_BIT);
+}
+
+/*
+ * Must be called with the page locked and tree_lock held to give a non-racy
+ * answer.
+ */
+static int should_replicate_pcache(struct page *page, struct address_space *mapping, unsigned long offset, int nid)
+{
+	umode_t mode;
+
+	if (unlikely(PageSwapCache(page)))
+		return 0;
+
+	if (nid == page_to_nid(page))
+		return 0;
+
+	if (page_count(page) != 2 + page_mapcount(page))
+		return 0;
+	smp_rmb();
+	if (!PageUptodate(page) || PageDirty(page) || PageWriteback(page))
+		return 0;
+
+	if (!PagePrivate(page))
+		return 1;
+
+	mode = mapping->host->i_mode;
+	if (S_ISREG(mode) || S_ISBLK(mode))
+		return 1;
+
+	return 0;
+}
+
+/*
+ * Try to convert pagecache coordinate (mapping, offset) (with page residing)
+ * into a replicated pagecache.
+ *
+ * Returns 1 if we leave with a successfully converted pagecache. Otherwise 0.
+ * (note, that return value is racy, so it is a hint only)
+ */
+static int try_to_replicate_pcache(struct page *page, struct address_space *mapping, unsigned long offset)
+{
+	int page_node;
+	void **pslot;
+	struct pcache_desc *pcd;
+	int ret = 0;
+
+	lock_page(page);
+	if (unlikely(!page->mapping))
+		goto out;
+
+	/* Already been replicated? Return yes! */
+	if (PageReplicated(page)) {
+		ret = 1;
+		goto out;
+	}
+
+	pcd = alloc_pcache_desc();
+	if (!pcd)
+		goto out;
+
+	page_node = page_to_nid(page);
+	if (radix_tree_insert(&pcd->page_tree, page_node, page))
+		goto out_pcd;
+	pcd->master = page;
+	node_set(page_node, pcd->nodes_present);
+
+	write_lock_irq(&mapping->tree_lock);
+
+	/* The non-racy check */
+	if (unlikely(!should_replicate_pcache(page, mapping, offset,
+							numa_node_id())))
+		goto out_lock;
+
+	pslot = radix_tree_lookup_slot(&mapping->page_tree, offset);
+
+	/*
+	 * The page is being held in pagecache and kept unreplicated because
+	 * it is locked. The following bugchecks.
+	 */
+	BUG_ON(!pslot);
+	BUG_ON(PageReplicated(page));
+	BUG_ON(page != radix_tree_deref_slot(pslot));
+	BUG_ON(is_pcache_desc(radix_tree_deref_slot(pslot)));
+	SetPageReplicated(page);
+	radix_tree_replace_slot(pslot, pcache_desc_to_ptr(pcd));
+	ret = 1;
+
+out_lock:
+	write_unlock_irq(&mapping->tree_lock);
+out_pcd:
+	if (ret == 0)
+		free_pcache_desc(pcd);
+out:
+	unlock_page(page);
+	return ret;
+}
+
+/*
+ * Called with tree_lock held for write, and (mapping, offset) guaranteed to be
+ * replicated. Drops tree_lock.
+ */
+static void __unreplicate_pcache(struct address_space *mapping, unsigned long offset, void **pslot)
+{
+	struct pcache_desc *pcd;
+	struct page *page;
+
+	pcd = ptr_to_pcache_desc(radix_tree_deref_slot(pslot));
+
+	page = pcd->master;
+	BUG_ON(PageDirty(page));
+	BUG_ON(!PageUptodate(page));
+	BUG_ON(!PageReplicated(page));
+	ClearPageReplicated(page);
+
+	radix_tree_replace_slot(pslot, page);
+
+	write_unlock_irq(&mapping->tree_lock);
+
+	/*
+	 * XXX: this actually changes all the find_get_pages APIs, so
+	 * we might want to just coax unmap_mapping_range into not
+	 * sleeping instead.
+	 */
+	might_sleep();
+
+	if (release_pcache_desc(pcd)) {
+		/* release_pcache_desc saw some mapped slaves */
+		unmap_mapping_range(mapping, (loff_t)offset<<PAGE_CACHE_SHIFT,
+					PAGE_CACHE_SIZE, 0);
+	}
+}
+
+/*
+ * Collapse pagecache coordinate (mapping, offset) into a non-replicated
+ * state. Must not fail.
+ */
+static void unreplicate_pcache(struct address_space *mapping, unsigned long offset, int locked)
+{
+	void **pslot;
+
+	if (!locked)
+		write_lock_irq(&mapping->tree_lock);
+
+	pslot = radix_tree_lookup_slot(&mapping->page_tree, offset);
+
+	/* Gone? Success */
+	if (unlikely(!pslot)) {
+		write_unlock_irq(&mapping->tree_lock);
+		return;
+	}
+
+	/* Already been un-replicated? Success */
+	if (unlikely(!is_pcache_desc(radix_tree_deref_slot(pslot)))) {
+		write_unlock_irq(&mapping->tree_lock);
+		return;
+	}
+
+	__unreplicate_pcache(mapping, offset, pslot);
+}
+
+/*
+ * Insert a newly replicated page into (mapping, offset) at node nid.
+ * Called without tree_lock. May not be successful.
+ *
+ * Returns 1 on success, otherwise 0.
+ */
+static int insert_replicated_page(struct page *page, struct address_space *mapping, unsigned long offset, int nid)
+{
+	void **pslot;
+	struct pcache_desc *pcd;
+
+	BUG_ON(PageReplicated(page));
+	BUG_ON(!PageUptodate(page));
+
+	write_lock_irq(&mapping->tree_lock);
+	pslot = radix_tree_lookup_slot(&mapping->page_tree, offset);
+
+	/* Truncated? */
+	if (unlikely(!pslot))
+		goto failed;
+
+	/* Not replicated? */
+	if (unlikely(!is_pcache_desc(radix_tree_deref_slot(pslot))))
+		goto failed;
+
+	pcd = ptr_to_pcache_desc(radix_tree_deref_slot(pslot));
+
+	if (unlikely(node_isset(nid, pcd->nodes_present)))
+		goto failed;
+
+	if (radix_tree_insert(&pcd->page_tree, nid, page))
+		goto failed;
+	node_set(nid, pcd->nodes_present);
+	count_vm_event(PGREPLICATED);
+	SetPageReplicated(page); /* XXX: could rework to use non-atomic */
+
+	page->mapping = mapping;
+	page->index = offset;
+
+	page_cache_get(page); /* pagecache ref */
+	__inc_zone_page_state(page, NR_REPL_PAGES);
+	write_unlock_irq(&mapping->tree_lock);
+
+	lru_cache_add(page);
+
+	return 1;
+
+failed:
+	write_unlock_irq(&mapping->tree_lock);
+	return 0;
+}
+
+/*
+ * Removes a replicated (not master) page. Called with tree_lock held for write
+ */
+static void __remove_replicated_page(struct pcache_desc *pcd, struct page *page,
+			struct address_space *mapping, unsigned long offset)
+{
+	int nid = page_to_nid(page);
+	BUG_ON(page == pcd->master);
+	BUG_ON(!node_isset(nid, pcd->nodes_present));
+	BUG_ON(radix_tree_delete(&pcd->page_tree, nid) != page);
+	node_clear(nid, pcd->nodes_present);
+	BUG_ON(!PageReplicated(page));
+	ClearPageReplicated(page);
+	count_vm_event(PGREPLICAZAP);
+	page->mapping = NULL;
+	__dec_zone_page_state(page, NR_REPL_PAGES);
+}
+
+/*
+ * Reclaim a replicated page. Called with tree_lock held for write and the
+ * page locked.
+ * Drops tree_lock and returns 1 and the caller should retry. Otherwise
+ * retains the tree_lock and returns 0 if successful.
+ */
+int reclaim_replicated_page(struct address_space *mapping, struct page *page)
+{
+	void **pslot;
+	struct pcache_desc *pcd;
+	unsigned long offset = page->index;
+
+	BUG_ON(PagePrivate(page));
+	BUG_ON(!PageReplicated(page));
+	pslot = radix_tree_lookup_slot(&mapping->page_tree, offset);
+	pcd = ptr_to_pcache_desc(radix_tree_deref_slot(pslot));
+	if (page == pcd->master) {
+		if (nodes_weight(pcd->nodes_present) == 1) {
+			__unreplicate_pcache(mapping, offset, pslot);
+			return 1;
+		} else {
+			/* promote one of the slaves to master */
+			struct page *new_master;
+			int nid, new_nid;
+
+			nid = page_to_nid(page);
+			new_nid = next_node(nid, pcd->nodes_present);
+			if (new_nid == MAX_NUMNODES)
+				new_nid = first_node(pcd->nodes_present);
+			BUG_ON(new_nid == nid);
+			new_master = radix_tree_lookup(&pcd->page_tree, new_nid);
+			BUG_ON(!new_master);
+			BUG_ON(new_master == page);
+
+			if (PageError(page))
+				SetPageError(new_master);
+			if (PageChecked(page))
+				SetPageChecked(new_master);
+			if (PageMappedToDisk(page))
+				SetPageMappedToDisk(new_master);
+
+			pcd->master = new_master;
+			/* now fall through and remove the old master */
+		}
+	}
+	__remove_replicated_page(pcd, page, mapping, offset);
+	return 0;
+}
+
+/*
+ * Try to create a replica of page at the given nid.
+ * Called without any locks held. page has its refcount elevated.
+ * Returns the newly replicated page with an elevated refcount on
+ * success, or NULL on failure.
+ */
+static struct page *try_to_create_replica(struct address_space *mapping,
+			unsigned long offset, struct page *page, int nid)
+{
+	struct page *repl_page;
+
+	repl_page = alloc_pages_node(nid, mapping_gfp_mask(mapping) |
+					  __GFP_THISNODE | __GFP_NORETRY, 0);
+	if (!repl_page)
+		return NULL;
+
+	copy_highpage(repl_page, page);
+	flush_dcache_page(repl_page);
+	SetPageUptodate(repl_page); /* XXX: can use nonatomic */
+
+	if (!insert_replicated_page(repl_page, mapping, offset, nid)) {
+		page_cache_release(repl_page);
+		return NULL;
+	}
+
+	return repl_page;
+}
+
+/**
+ * find_get_page - find and get a page reference
+ * @mapping: the address_space to search
+ * @offset: the page index
+ *
+ * Is there a pagecache struct page at the given (mapping, offset) tuple?
+ * If yes, increment its refcount and return it; if no, return NULL.
+ */
+struct page *find_get_page_readonly(struct address_space *mapping,
+						unsigned long offset)
+{
+	int nid;
+	struct page *page;
+
+retry:
+	read_lock_irq(&mapping->tree_lock);
+	nid = numa_node_id();
+	page = radix_tree_lookup(&mapping->page_tree, offset);
+	if (!page)
+		goto out;
+
+	if (is_pcache_desc(page)) {
+		struct pcache_desc *pcd;
+		pcd = ptr_to_pcache_desc(page);
+		if (!node_isset(nid, pcd->nodes_present)) {
+			int src_nid;
+			struct page *new_page;
+
+			src_nid = next_node(nid, pcd->nodes_present);
+			if (src_nid == MAX_NUMNODES)
+				src_nid = first_node(pcd->nodes_present);
+			page = radix_tree_lookup(&pcd->page_tree, src_nid);
+			BUG_ON(!page);
+			page_cache_get(page);
+			read_unlock_irq(&mapping->tree_lock);
+
+			new_page = try_to_create_replica(mapping, offset, page, nid);
+			if (new_page) {
+				page_cache_release(page);
+				page = new_page;
+			}
+		} else {
+			page = radix_tree_lookup(&pcd->page_tree, nid);
+			page_cache_get(page);
+			read_unlock_irq(&mapping->tree_lock);
+		}
+		BUG_ON(!page);
+		return page;
+
+	}
+
+	page_cache_get(page);
+
+	if (should_replicate_pcache(page, mapping, offset, nid)) {
+		read_unlock_irq(&mapping->tree_lock);
+		if (try_to_replicate_pcache(page, mapping, offset)) {
+			page_cache_release(page);
+			goto retry;
+		}
+		return page;
+	}
+
+out:
+	read_unlock_irq(&mapping->tree_lock);
+	return page;
+}
+
+struct page *find_lock_page_readonly(struct address_space *mapping,
+						unsigned long offset)
+{
+	struct page *page;
+
+again:
+	page = find_get_page_readonly(mapping, offset);
+	if (page) {
+		lock_page(page);
+		if (page->mapping)
+			return page;
+		unlock_page(page);
+		goto again;
+	}
+	return NULL;
+}
+
+/*
+ * Takes a page at the given (mapping, offset), and returns an unreplicated
+ * page with elevated refcount.
+ *
+ * Called with tree_lock held for read, drops tree_lock.
+ */
+struct page *get_unreplicated_page(struct address_space *mapping,
+				unsigned long offset, struct page *page)
+{
+	if (page) {
+		if (is_pcache_desc(page)) {
+			struct pcache_desc *pcd;
+
+			pcd = ptr_to_pcache_desc(page);
+			page = pcd->master;
+			page_cache_get(page);
+			read_unlock_irq(&mapping->tree_lock);
+
+			unreplicate_pcache(mapping, offset, 0);
+
+			return page;
+		}
+
+		page_cache_get(page);
+	}
+	read_unlock_irq(&mapping->tree_lock);
+	might_sleep();
+
+	return page;
+}
+
+void get_unreplicated_pages(struct address_space *mapping, struct page **pages,
+					int nr)
+{
+	unsigned long offsets[PAGEVEC_SIZE];
+	int i, replicas;
+
+	/*
+	 * XXX: really need to prevent this at the find_get_pages API
+	 */
+	BUG_ON(nr > PAGEVEC_SIZE);
+
+	replicas = 0;
+	for (i = 0; i < nr; i++) {
+		struct page *page = pages[i];
+
+		if (is_pcache_desc(page)) {
+			struct pcache_desc *pcd;
+			pcd = ptr_to_pcache_desc(page);
+			page = pcd->master;
+			offsets[replicas++] = page->index;
+			pages[i] = page;
+		}
+
+		page_cache_get(page);
+	}
+	read_unlock_irq(&mapping->tree_lock);
+	might_sleep();
+
+	for (i = 0; i < replicas; i++)
+		unreplicate_pcache(mapping, offsets[i], 0);
+}
+
+/*
+ * Collapse a possible page replication. The page is held unreplicated by
+ * the elevated refcount on the passed-in page.
+ */
+int page_write_fault_retry(struct page *page)
+{
+	struct address_space *mapping;
+	pgoff_t offset;
+
+	if (!PageReplicated(page)) {
+		/* The elevated page refcount will hold off replication */
+		return 0;
+	}
+
+	/* Truncate would remove pte and get noticed by caller anyway... */
+	mapping = page->mapping;
+	if (!mapping)
+		return 1;
+
+	write_lock_irq(&mapping->tree_lock);
+	if (page->mapping != mapping) {
+		write_unlock_irq(&mapping->tree_lock);
+		return 1;
+	}
+
+	offset = page->index;
+	unreplicate_pcache(mapping, offset, 1);
+
+	return 1;
+}
+
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -60,6 +60,8 @@
 #include <linux/swapops.h>
 #include <linux/elf.h>
 
+#include "internal.h"
+
 #ifndef CONFIG_NEED_MULTIPLE_NODES
 /* use the per-pgdat data instead for discontigmem - mbligh */
 unsigned long max_mapnr;
@@ -1661,7 +1663,10 @@ static int do_wp_page(struct mm_struct *
 		 * read-only shared pages can get COWed by
 		 * get_user_pages(.write=1, .force=1).
 		 */
-		if (vma->vm_ops && vma->vm_ops->page_mkwrite) {
+#ifndef CONFIG_REPLICATION
+		if (vma->vm_ops && vma->vm_ops->page_mkwrite)
+#endif
+		{
 			/*
 			 * Notify the address space that the page is about to
 			 * become writable so that it can prohibit this or wait
@@ -1673,6 +1678,18 @@ static int do_wp_page(struct mm_struct *
 			page_cache_get(old_page);
 			pte_unmap_unlock(page_table, ptl);
 
+			/*
+			 * XXX: this could just be run under ptl and unmap
+			 * just the single pte and let the replication collapse
+			 * get done by the next page fault.
+			 */
+			if (page_write_fault_retry(old_page)) {
+				page_cache_release(old_page);
+				return 0;
+			}
+#ifdef CONFIG_REPLICATION
+			if (vma->vm_ops && vma->vm_ops->page_mkwrite)
+#endif
 			if (vma->vm_ops->page_mkwrite(vma, old_page) < 0)
 				goto unwritable_page;
 
@@ -1688,8 +1705,14 @@ static int do_wp_page(struct mm_struct *
 			if (!pte_same(*page_table, orig_pte))
 				goto unlock;
 		}
+
 		dirty_page = old_page;
-		get_page(dirty_page);
+		/*
+	 	 * This extra ref also holds off replication after the mapcount
+		 * is elevated, until after the page is set dirty and the ref
+		 * dropped. Similarly for __do_fault.
+		 */
+		page_cache_get(dirty_page);
 		reuse = 1;
 	}
 
@@ -1775,7 +1798,7 @@ unlock:
 		 */
 		wait_on_page_locked(dirty_page);
 		set_page_dirty_balance(dirty_page);
-		put_page(dirty_page);
+		page_cache_release(dirty_page);
 	}
 	return ret;
 oom:
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -921,7 +921,12 @@ int clear_page_dirty_for_io(struct page 
 	BUG_ON(!PageLocked(page));
 
 	ClearPageReclaim(page);
+
+#ifndef CONFIG_REPLICATION
 	if (mapping && mapping_cap_account_dirty(mapping)) {
+#else
+	if (mapping) {
+#endif
 		/*
 		 * Yes, Virginia, this is indeed insane.
 		 *
Index: linux-2.6/include/linux/page-flags.h
===================================================================
--- linux-2.6.orig/include/linux/page-flags.h
+++ linux-2.6/include/linux/page-flags.h
@@ -90,6 +90,8 @@
 #define PG_reclaim		17	/* To be reclaimed asap */
 #define PG_buddy		19	/* Page is free, on buddy lists */
 
+#define PG_replicated		20	/* Page is replicated pagecache */
+
 /* PG_readahead is only used for file reads; PG_reclaim is only for writes */
 #define PG_readahead		PG_reclaim /* Reminder to do async read-ahead */
 
@@ -144,8 +146,8 @@ static inline void SetPageUptodate(struc
 #define ClearPageUptodate(page)	clear_bit(PG_uptodate, &(page)->flags)
 
 #define PageDirty(page)		test_bit(PG_dirty, &(page)->flags)
-#define SetPageDirty(page)	set_bit(PG_dirty, &(page)->flags)
-#define TestSetPageDirty(page)	test_and_set_bit(PG_dirty, &(page)->flags)
+#define SetPageDirty(page)	do { BUG_ON(PageReplicated(page)); set_bit(PG_dirty, &(page)->flags); } while (0)
+#define TestSetPageDirty(page)	({ BUG_ON(PageReplicated(page)); test_and_set_bit(PG_dirty, &(page)->flags); })
 #define ClearPageDirty(page)	clear_bit(PG_dirty, &(page)->flags)
 #define __ClearPageDirty(page)	__clear_bit(PG_dirty, &(page)->flags)
 #define TestClearPageDirty(page) test_and_clear_bit(PG_dirty, &(page)->flags)
@@ -194,15 +196,23 @@ static inline void SetPageUptodate(struc
  * risky: they bypass page accounting.
  */
 #define PageWriteback(page)	test_bit(PG_writeback, &(page)->flags)
-#define TestSetPageWriteback(page) test_and_set_bit(PG_writeback,	\
-							&(page)->flags)
-#define TestClearPageWriteback(page) test_and_clear_bit(PG_writeback,	\
-							&(page)->flags)
+#define TestSetPageWriteback(page) ({ BUG_ON(PageReplicated(page)); test_and_set_bit(PG_writeback, &(page)->flags); })
+#define TestClearPageWriteback(page) \
+		test_and_clear_bit(PG_writeback, &(page)->flags)
 
 #define PageBuddy(page)		test_bit(PG_buddy, &(page)->flags)
 #define __SetPageBuddy(page)	__set_bit(PG_buddy, &(page)->flags)
 #define __ClearPageBuddy(page)	__clear_bit(PG_buddy, &(page)->flags)
 
+#ifdef CONFIG_REPLICATION
+#define PageReplicated(page)	test_bit(PG_replicated, &(page)->flags)
+#define __SetPageReplicated(page) do { BUG_ON(PageDirty(page) || PageWriteback(page)); __set_bit(PG_replicated, &(page)->flags); } while (0)
+#define SetPageReplicated(page)	do { BUG_ON(PageDirty(page) || PageWriteback(page)); set_bit(PG_replicated, &(page)->flags); } while (0)
+#define ClearPageReplicated(page) clear_bit(PG_replicated, &(page)->flags)
+#else
+#define PageReplicated(page)	0
+#endif
+
 #define PageMappedToDisk(page)	test_bit(PG_mappedtodisk, &(page)->flags)
 #define SetPageMappedToDisk(page) set_bit(PG_mappedtodisk, &(page)->flags)
 #define ClearPageMappedToDisk(page) clear_bit(PG_mappedtodisk, &(page)->flags)
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -215,7 +215,8 @@ static void bad_page(struct page *page)
 			1 << PG_slab    |
 			1 << PG_swapcache |
 			1 << PG_writeback |
-			1 << PG_buddy );
+			1 << PG_buddy |
+			1 << PG_replicated );
 	set_page_count(page, 0);
 	reset_page_mapcount(page);
 	page->mapping = NULL;
@@ -451,7 +452,8 @@ static inline int free_pages_check(struc
 			1 << PG_swapcache |
 			1 << PG_writeback |
 			1 << PG_reserved |
-			1 << PG_buddy ))))
+			1 << PG_buddy |
+			1 << PG_replicated))))
 		bad_page(page);
 	if (PageDirty(page))
 		__ClearPageDirty(page);
@@ -600,7 +602,8 @@ static int prep_new_page(struct page *pa
 			1 << PG_swapcache |
 			1 << PG_writeback |
 			1 << PG_reserved |
-			1 << PG_buddy ))))
+			1 << PG_buddy |
+			1 << PG_replicated ))))
 		bad_page(page);
 
 	/*
Index: linux-2.6/include/linux/vmstat.h
===================================================================
--- linux-2.6.orig/include/linux/vmstat.h
+++ linux-2.6/include/linux/vmstat.h
@@ -31,6 +31,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PS
 		FOR_ALL_ZONES(PGALLOC),
 		PGFREE, PGACTIVATE, PGDEACTIVATE,
 		PGFAULT, PGMAJFAULT,
+		PGREPLICATED, PGREPLICAZAP,
 		FOR_ALL_ZONES(PGREFILL),
 		FOR_ALL_ZONES(PGSTEAL),
 		FOR_ALL_ZONES(PGSCAN_KSWAPD),
Index: linux-2.6/include/linux/pagemap.h
===================================================================
--- linux-2.6.orig/include/linux/pagemap.h
+++ linux-2.6/include/linux/pagemap.h
@@ -96,6 +96,17 @@ unsigned find_get_pages_contig(struct ad
 unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index,
 			int tag, unsigned int nr_pages, struct page **pages);
 
+static inline int probe_page(struct address_space *mapping, pgoff_t pgoff)
+{
+	int ret;
+
+	rcu_read_lock();
+	ret = !!radix_tree_lookup(&mapping->page_tree, pgoff);
+	rcu_read_unlock();
+
+	return ret;
+}
+
 /*
  * Returns locked page at given index in given cache, creating it if needed.
  */
Index: linux-2.6/mm/shmem.c
===================================================================
--- linux-2.6.orig/mm/shmem.c
+++ linux-2.6/mm/shmem.c
@@ -1221,17 +1221,13 @@ repeat:
 			goto repeat;
 		}
 	} else if (sgp == SGP_READ && !filepage) {
+		int page;
+
 		shmem_swp_unmap(entry);
-		filepage = find_get_page(mapping, idx);
-		if (filepage &&
-		    (!PageUptodate(filepage) || TestSetPageLocked(filepage))) {
-			spin_unlock(&info->lock);
-			wait_on_page_locked(filepage);
-			page_cache_release(filepage);
-			filepage = NULL;
-			goto repeat;
-		}
+		page = probe_page(mapping, idx);
 		spin_unlock(&info->lock);
+		if (page)
+			goto repeat;
 	} else {
 		shmem_swp_unmap(entry);
 		sbinfo = SHMEM_SB(inode->i_sb);

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [patch][rfc] 2.6.23-rc1 mm: NUMA replicated pagecache
  2007-07-27  8:42 [patch][rfc] 2.6.23-rc1 mm: NUMA replicated pagecache Nick Piggin
@ 2007-07-27 14:30 ` Lee Schermerhorn
  2007-07-30  3:16   ` Nick Piggin
  2007-08-08 20:25 ` Lee Schermerhorn
  1 sibling, 1 reply; 18+ messages in thread
From: Lee Schermerhorn @ 2007-07-27 14:30 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linux Memory Management List, Linux Kernel Mailing List,
	Joachim Deguara, Eric Whitney

On Fri, 2007-07-27 at 10:42 +0200, Nick Piggin wrote:
> Hi,
> 
> Just got a bit of time to take another look at the replicated pagecache
> patch. The nopage vs invalidate race and clear_page_dirty_for_io fixes
> gives me more confidence in the locking now; the new ->fault API makes
> MAP_SHARED write faults much more efficient; and a few bugs were found
> and fixed.
> 
> More stats were added: *repl* in /proc/vmstat. Survives some kbuilding
> tests...
> 
> --
> 
> Page-based NUMA pagecache replication.
<snip really big patch!>

Hi, Nick.

Glad to see you're back on this.  It's been on my list, but delayed by
other patch streams...

As I mentioned to you in prior mail, I want to try to integrate this
atop my "auto/lazy migration" patches, such that when a task moves to a
new node, we remove just that task's pte ref's to page cache pages
[along with all refs to anon pages, as I do now] so that the task will
take a fault on next touch and either use an existing local copy or
replicate the page at that time.  Unfortunately, that's in the queue
behind the memoryless node patches and my stalled shared policy patches,
among other things :-(.

Also, what kernel is this patch against?  Diffs just say "linux-2.6".
Somewhat ambiguous...

Lee

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [patch][rfc] 2.6.23-rc1 mm: NUMA replicated pagecache
  2007-07-27 14:30 ` Lee Schermerhorn
@ 2007-07-30  3:16   ` Nick Piggin
  2007-07-30 16:29     ` Lee Schermerhorn
  0 siblings, 1 reply; 18+ messages in thread
From: Nick Piggin @ 2007-07-30  3:16 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Linux Memory Management List, Linux Kernel Mailing List,
	Joachim Deguara, Eric Whitney

On Fri, Jul 27, 2007 at 10:30:47AM -0400, Lee Schermerhorn wrote:
> On Fri, 2007-07-27 at 10:42 +0200, Nick Piggin wrote:
> > Hi,
> > 
> > Just got a bit of time to take another look at the replicated pagecache
> > patch. The nopage vs invalidate race and clear_page_dirty_for_io fixes
> > gives me more confidence in the locking now; the new ->fault API makes
> > MAP_SHARED write faults much more efficient; and a few bugs were found
> > and fixed.
> > 
> > More stats were added: *repl* in /proc/vmstat. Survives some kbuilding
> > tests...
> > 
> > --
> > 
> > Page-based NUMA pagecache replication.
> <snip really big patch!>
> 
> Hi, Nick.
> 
> Glad to see you're back on this.  It's been on my list, but delayed by
> other patch streams...

Yeah, thought I should keep it alive :) Patch is against 2.6.23-rc1.

 
> As I mentioned to you in prior mail, I want to try to integrate this
> atop my "auto/lazy migration" patches, such that when a task moves to a
> new node, we remove just that task's pte ref's to page cache pages
> [along with all refs to anon pages, as I do now] so that the task will
> take a fault on next touch and either use an existing local copy or
> replicate the page at that time.  Unfortunately, that's in the queue
> behind the memoryless node patches and my stalled shared policy patches,
> among other things :-(.

That's OK. It will likely be a long process to get any of this in...
As you know, replicated currently needs some of your automigration
infrastructure in order to get ptes pointing to the right places
after a task migration. I'd like to try some experiments with them on
a larger system, once you get time to update your patchset...

Thanks,
Nick

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [patch][rfc] 2.6.23-rc1 mm: NUMA replicated pagecache
  2007-07-30  3:16   ` Nick Piggin
@ 2007-07-30 16:29     ` Lee Schermerhorn
  0 siblings, 0 replies; 18+ messages in thread
From: Lee Schermerhorn @ 2007-07-30 16:29 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linux Memory Management List, Linux Kernel Mailing List,
	Joachim Deguara, Eric Whitney

On Mon, 2007-07-30 at 05:16 +0200, Nick Piggin wrote:
> On Fri, Jul 27, 2007 at 10:30:47AM -0400, Lee Schermerhorn wrote:
> > On Fri, 2007-07-27 at 10:42 +0200, Nick Piggin wrote:
> > > Hi,
> > > 
> > > Just got a bit of time to take another look at the replicated pagecache
> > > patch. The nopage vs invalidate race and clear_page_dirty_for_io fixes
> > > gives me more confidence in the locking now; the new ->fault API makes
> > > MAP_SHARED write faults much more efficient; and a few bugs were found
> > > and fixed.
> > > 
> > > More stats were added: *repl* in /proc/vmstat. Survives some kbuilding
> > > tests...
> > > 
> > > --
> > > 
> > > Page-based NUMA pagecache replication.
> > <snip really big patch!>
> > 
> > Hi, Nick.
> > 
> > Glad to see you're back on this.  It's been on my list, but delayed by
> > other patch streams...
> 
> Yeah, thought I should keep it alive :) Patch is against 2.6.23-rc1.

D'Oh!  :-(  You could have just said "Read the subject line, Lee!"
> 
>  
> > As I mentioned to you in prior mail, I want to try to integrate this
> > atop my "auto/lazy migration" patches, such that when a task moves to a
> > new node, we remove just that task's pte ref's to page cache pages
> > [along with all refs to anon pages, as I do now] so that the task will
> > take a fault on next touch and either use an existing local copy or
> > replicate the page at that time.  Unfortunately, that's in the queue
> > behind the memoryless node patches and my stalled shared policy patches,
> > among other things :-(.
> 
> That's OK. It will likely be a long process to get any of this in...
> As you know, replicated currently needs some of your automigration
> infrastructure in order to get ptes pointing to the right places
> after a task migration. I'd like to try some experiments with them on
> a larger system, once you get time to update your patchset...

I'll try to make a pass this week, maybe next...

Lee

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [patch][rfc] 2.6.23-rc1 mm: NUMA replicated pagecache
  2007-07-27  8:42 [patch][rfc] 2.6.23-rc1 mm: NUMA replicated pagecache Nick Piggin
  2007-07-27 14:30 ` Lee Schermerhorn
@ 2007-08-08 20:25 ` Lee Schermerhorn
  2007-08-10 21:08   ` Lee Schermerhorn
  1 sibling, 1 reply; 18+ messages in thread
From: Lee Schermerhorn @ 2007-08-08 20:25 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linux Memory Management List, Linux Kernel Mailing List,
	Joachim Deguara, Christoph Lameter, Mel Gorman

On Fri, 2007-07-27 at 10:42 +0200, Nick Piggin wrote:
> Hi,
> 
> Just got a bit of time to take another look at the replicated pagecache
> patch. The nopage vs invalidate race and clear_page_dirty_for_io fixes
> gives me more confidence in the locking now; the new ->fault API makes
> MAP_SHARED write faults much more efficient; and a few bugs were found
> and fixed.
> 
> More stats were added: *repl* in /proc/vmstat. Survives some kbuilding
> tests...
> 

Sending this out to give Nick an update and to give the list a
heads up on what I've found so far with the replication patch.

I have rebased Nick's recent pagecache replication patch against
2.6.23-rc1-mm2, atop my memory policy and auto/lazy migration
patch sets.  These include:

+ shared policy
+ migrate-on-fault a.k.a. lazy migration
+ auto-migration - trigger lazy migration on inter-node task
                   task migration
+ migration cache - pseudo-swap cache for parking unmapped
                    anon pages awaiting migrate-on-fault

I added a couple of patches to fix up the interaction of replication
with migration [discussed more below] and a per cpuset control to
enable/disable replication.  The latter allowed me to boot successfully
and to survive any bugs encountered by restricting the effects to 
tasks in the test cpuset with replication enabled.  That was the
theory, anyway :-).  Mostly worked...

Rather than spam the list, I've placed the entire quilt series that
I'm testing, less the 23-rc1 and 23-rc1-mm2 patches, at:

	http://free.linux.hp.com/~lts/Patches/Replication/

It's the 070808 tarball.

I plan to measure the effects on performance with various combinations
of these features enabled.  First, however, I ran into one problem that
required me to investigate further.  In the migrate-on-fault set, I've
introduced a function named "migrate_page_unmap_only()".  It parallels
Christoph's "migrate_pages()" but for lazy migration, it just removes
the pte mappings from the selected pages so that they will incur a fault
on next touch and be migrated to the node specified by policy, if
necessary and "easy" to do.  [don't want to try too hard, as this is 
just a performance optimization.  supposed to be, anyway.]

In migrate_page_unmap_only(), I had a BUG_ON to catch [non-anon] pages
with a NULL page_mapping().  I never hit this in my testing until I
added in the page replication.  To investigate, I took the opportunity
to update my mmtrace instrumentation.   I added a few trace points for
Nick's replication functions and replaced the BUG_ON with a trace
point and skipped pages w/ a NULL mapping.  The kernel patches are in
the patch tarball at the link above.  The user space tools are available
at:

	http://free.linux.hp.com/~lts/Tools/mmtrace-latest.tar.gz

A rather large tarball containing formatted traces from a usex run
that hit the NULL mapping trace point is also available from the
replication patches directory linked above.  I've extracted traces
related to the "bug check" and annotated them--also in the tarball.
See the README.

So what's happening?

I think I'm hitting a race between the page replication code when it
"unreplicates" a page and a task that references one of the replicas
attempting to unmap that replica for lazy migration.  When "unreplicating"
a page, the replication patch nulls out all of the mappings for the 
"slave pages", without locking the pages or otherwise coordinating with
other possible accesses to the page, and then calls unmap_mapping_range()
to unmap them.  Meanwhile, these pages are still referenced by various tasks'
page tables.  

One interesting thing I see in the traces is that, in the couple of
instances I looked at, the attempt to unmap [migrate_pages_unmap_only()]
came approximately a second after the __unreplicate_pcache() call that
apparently nulled out the mapping.  I.e., the slave page remained
referenced by the task's page table for almost a second after unreplication.
Nick does have a comment about unmap_mapping_range() sleeping, but a
second seems like a long time.

I don't know whether this is a real problem or not.  I removed the 
BUG_ON and now just skip pages with NULL mapping.  They're being removed
anyway.  I'm running a stress test now, and haven't seen any obvious
problems yet.  I do have concerns, tho'.  Page migration assumes that
if it can successfully isolate a page from the LRU and lock it, that it
has pretty much exclusive access.

Direct migration [Christoph's implementation] is a bit stricter regarding
reference and map counts, unless "MOVE_ALL" is specified.  In my lazy
migration patches, I want to be able to unmap pages with multiple pte
references [currently a per cpuset tunable threshold] to test the
performance impact of trying harder to unmap vs being able to migrate
fewer pages.  

I'm also seeing a lot of "thrashing"--pages being repeatedly replicated
and unreplicated on every other fault to the page.  I haven't investigated
how long the intervals are between the faults, so maybe the faulting
tasks are getting a good deal of usage of the page between faults.

Other Considerations

I figured that direct migration should not try to migrate a replicated
page [to a new node] because this would mess up Nick's tracking of
slave pages in the pcache_descriptor.  Don't know what the effects
would be, but I added a test to skip replicated pages in migrate_pages().

I didn't want to filter these pages in migrate_page_add() because I
want to be able to unmap at least the current task's pte references
for lazy migration, so that the task will fault on next touch and
use/create a local replica.  [Patch "under consideration".]  
However, I think that migrate_page_add() is too early, because the
page could become replicated after we check.  In fact, where I've
placed the check in migrate_pages() is too early.  Needs to be 
moved into unmap_and_move() after the page lock is obtained.  Replication
DOES lock the page to replicate it.  We'll need to add some checks
after "try_to_replicate_pcache()" obtains the page lock to ensure
that it hasn't been migrated away.  Or, maybe the checks in 
should_replicate_pcache() already handle this?

One also might want to migrate a page to evacuate memory--either for
hotplug or to consolidate contiguous memory to make more higher order
pages available.  In these cases, we might want to handle replicated
pages by just removing the local replica and using a remote copy.

More as the story unfolds.  Film at 11...

Lee

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [patch][rfc] 2.6.23-rc1 mm: NUMA replicated pagecache
  2007-08-08 20:25 ` Lee Schermerhorn
@ 2007-08-10 21:08   ` Lee Schermerhorn
  2007-08-13  7:43     ` Nick Piggin
  0 siblings, 1 reply; 18+ messages in thread
From: Lee Schermerhorn @ 2007-08-10 21:08 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linux Memory Management List, Linux Kernel Mailing List,
	Joachim Deguara, Christoph Lameter, Mel Gorman, Eric Whitney

On Wed, 2007-08-08 at 16:25 -0400, Lee Schermerhorn wrote:
> On Fri, 2007-07-27 at 10:42 +0200, Nick Piggin wrote:
> > Hi,
> > 
> > Just got a bit of time to take another look at the replicated pagecache
> > patch. The nopage vs invalidate race and clear_page_dirty_for_io fixes
> > gives me more confidence in the locking now; the new ->fault API makes
> > MAP_SHARED write faults much more efficient; and a few bugs were found
> > and fixed.
> > 
> > More stats were added: *repl* in /proc/vmstat. Survives some kbuilding
> > tests...
> > 
> 
> Sending this out to give Nick an update and to give the list a
> heads up on what I've found so far with the replication patch.
> 
> I have rebased Nick's recent pagecache replication patch against
> 2.6.23-rc1-mm2, atop my memory policy and auto/lazy migration
> patch sets.  These include:
> 
> + shared policy
> + migrate-on-fault a.k.a. lazy migration
> + auto-migration - trigger lazy migration on inter-node task
>                    task migration
> + migration cache - pseudo-swap cache for parking unmapped
>                     anon pages awaiting migrate-on-fault
> 
> I added a couple of patches to fix up the interaction of replication
> with migration [discussed more below] and a per cpuset control to
> enable/disable replication.  The latter allowed me to boot successfully
> and to survive any bugs encountered by restricting the effects to 
> tasks in the test cpuset with replication enabled.  That was the
> theory, anyway :-).  Mostly worked...

After I sent out the last update, I ran a usex job mix overnight ~19.5 hours.
When I came in the next morning, the console window was full of soft lockups
on various cpus with varions stack traces.  /var/log/messages showed 142, in
all.

I've placed the soft lockup reports from /var/log/messages in the Replication
directory on free.linux:

	http://free.linux.hp.com/~lts/Patches/Replication.

The lockups appeared in several places in the traces I looked at.  Here's a
couple of examples:

+ unlink_file_vma() from free_pgtables() during task exit:
	mapping->i_mmap_lock ???

+ smp_call_function() from ia64_global_tlb_purge().
	  Maybe the 'call_lock' in arch/ia64/kernel/smp.c ?
  Traces show us getting to here in one of 2 ways:

  1) try_to_unmap* during auto task migration [migrate_pages_unmap_only()...]

  2) from zap_page_range() when __unreplicate_pcache() calls unmap_mapping_range.

+ get_page_from_freelist -> zone_lru_lock?

An interesting point:  all of the soft lockup messages said that the cpu was
locked for 11s.  Ring any bells?

I should note that I was trying to unmap all mappings to the file backed pages
on internode task migration, instead of just the current task's pte's.  However,
I was only attempting this on pages with  mapcount <= 4.  So, I don't think I 
was looping trying to unmap pages with mapcounts of several 10s--such as I see
on some page cache pages in my traces.

Today, after rebasing to 23-rc2-mm2, I added a patch to unmap only the current
task's ptes for ALL !anon pages, regardless of mapcount.  I've started the test
again and will let it run over the weekend--or as long as it stays up, which 
ever is shorter :-).

I put a tarball with the rebased series in the Replication directory linked
above, in case you're interested.  I haven't added the patch description for
the new patch yet, but it's pretty simple.  Maybe even correct.

----

Unrelated to the lockups  [I think]:

I forgot to look before I rebooted, but earlier the previous evening, I checked
the vmstats and at that point [~1.5 hours into the test] we had done ~4.88 million
replications and ~4.8 million "zaps" [collapse of replicated page].  That's around
98% zaps.  Do we need some filter in the fault path to reduce the "thrashing"--if
that's what I'm seeing.  

A while back I took a look at the Virtual Iron page replication patch.  They had
set VM_DENY_WRITE when mapping shared executable segments, and only replicated pages
in those VMAs.  Maybe 'DENY_WRITE isn't exactly what we want.  Possibly set another
flag for shared executables, if we can detect them, and any shared mapping that has
no writable mappings ?

I'll try to remember to check the replication statistics after the currently
running test.  If the system stays up, that is.  A quick look < 10 minutes into
the test shows that zaps are now ~84% of replications.  Also, ~47k replicated pages
out of ~287K file pages.

Lee

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [patch][rfc] 2.6.23-rc1 mm: NUMA replicated pagecache
  2007-08-10 21:08   ` Lee Schermerhorn
@ 2007-08-13  7:43     ` Nick Piggin
  2007-08-13 14:05       ` Lee Schermerhorn
  2007-09-11 20:52       ` Update: [Automatic] NUMA replicated pagecache on 2.6.23-rc4-mm1 Lee Schermerhorn
  0 siblings, 2 replies; 18+ messages in thread
From: Nick Piggin @ 2007-08-13  7:43 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Linux Memory Management List, Linux Kernel Mailing List,
	Joachim Deguara, Christoph Lameter, Mel Gorman, Eric Whitney

On Fri, Aug 10, 2007 at 05:08:18PM -0400, Lee Schermerhorn wrote:
> On Wed, 2007-08-08 at 16:25 -0400, Lee Schermerhorn wrote:
> > On Fri, 2007-07-27 at 10:42 +0200, Nick Piggin wrote:
> > > Hi,
> > > 
> > > Just got a bit of time to take another look at the replicated pagecache
> > > patch. The nopage vs invalidate race and clear_page_dirty_for_io fixes
> > > gives me more confidence in the locking now; the new ->fault API makes
> > > MAP_SHARED write faults much more efficient; and a few bugs were found
> > > and fixed.
> > > 
> > > More stats were added: *repl* in /proc/vmstat. Survives some kbuilding
> > > tests...
> > > 
> > 
> > Sending this out to give Nick an update and to give the list a
> > heads up on what I've found so far with the replication patch.
> > 
> > I have rebased Nick's recent pagecache replication patch against
> > 2.6.23-rc1-mm2, atop my memory policy and auto/lazy migration
> > patch sets.  These include:
> > 
> > + shared policy
> > + migrate-on-fault a.k.a. lazy migration
> > + auto-migration - trigger lazy migration on inter-node task
> >                    task migration
> > + migration cache - pseudo-swap cache for parking unmapped
> >                     anon pages awaiting migrate-on-fault
> > 
> > I added a couple of patches to fix up the interaction of replication
> > with migration [discussed more below] and a per cpuset control to
> > enable/disable replication.  The latter allowed me to boot successfully
> > and to survive any bugs encountered by restricting the effects to 
> > tasks in the test cpuset with replication enabled.  That was the
> > theory, anyway :-).  Mostly worked...
> 
> After I sent out the last update, I ran a usex job mix overnight ~19.5 hours.
> When I came in the next morning, the console window was full of soft lockups
> on various cpus with varions stack traces.  /var/log/messages showed 142, in
> all.
> 
> I've placed the soft lockup reports from /var/log/messages in the Replication
> directory on free.linux:
> 
> 	http://free.linux.hp.com/~lts/Patches/Replication.
> 
> The lockups appeared in several places in the traces I looked at.  Here's a
> couple of examples:
> 
> + unlink_file_vma() from free_pgtables() during task exit:
> 	mapping->i_mmap_lock ???
> 
> + smp_call_function() from ia64_global_tlb_purge().
> 	  Maybe the 'call_lock' in arch/ia64/kernel/smp.c ?
>   Traces show us getting to here in one of 2 ways:
> 
>   1) try_to_unmap* during auto task migration [migrate_pages_unmap_only()...]
> 
>   2) from zap_page_range() when __unreplicate_pcache() calls unmap_mapping_range.
> 
> + get_page_from_freelist -> zone_lru_lock?
> 
> An interesting point:  all of the soft lockup messages said that the cpu was
> locked for 11s.  Ring any bells?

Hi Lee,

Am sick with the flu for the past few days, so I haven't done much more
work here, but I'll just add some (not very useful) comments....

The get_page_from_freelist hang is quite strange. It would be zone->lock,
which shouldn't have too much contention...

Replication may be putting more stress on some locks. It will cause more
tlb flushing that can not be batched well, which could cause the call_lock
to get hotter. Then i_mmap_lock is held over tlb flushing, so it will
inherit the latency from call_lock. (If this is the case, we could
potentially extend the tlb flushing API slightly to cope better with
unmapping of pages from multiple mm's, but that comes way down the track
when/if replication proves itself!).

tlb flushing AFAIKS should not do the IPI unless it is deadling with a
multithreaded mm... does usex use threads?


> I should note that I was trying to unmap all mappings to the file backed pages
> on internode task migration, instead of just the current task's pte's.  However,
> I was only attempting this on pages with  mapcount <= 4.  So, I don't think I 
> was looping trying to unmap pages with mapcounts of several 10s--such as I see
> on some page cache pages in my traces.

Replication teardown would still have to unmap all... but that shouldn't
particularly be any worse than, say, page reclaim (except I guess that it
could occur more often).

 
> Today, after rebasing to 23-rc2-mm2, I added a patch to unmap only the current
> task's ptes for ALL !anon pages, regardless of mapcount.  I've started the test
> again and will let it run over the weekend--or as long as it stays up, which 
> ever is shorter :-).

Ah, so it does eventually die? Any hints of why?

> 
> I put a tarball with the rebased series in the Replication directory linked
> above, in case you're interested.  I haven't added the patch description for
> the new patch yet, but it's pretty simple.  Maybe even correct.
> 
> ----
> 
> Unrelated to the lockups  [I think]:
> 
> I forgot to look before I rebooted, but earlier the previous evening, I checked
> the vmstats and at that point [~1.5 hours into the test] we had done ~4.88 million
> replications and ~4.8 million "zaps" [collapse of replicated page].  That's around
> 98% zaps.  Do we need some filter in the fault path to reduce the "thrashing"--if
> that's what I'm seeing.  

Yep. The current replication patch is very much only infrastructure at
this stage (and is good for stress testing). I feel sure that heuristics
and perhaps tunables would be needed to make the most of it.


> A while back I took a look at the Virtual Iron page replication patch.  They had
> set VM_DENY_WRITE when mapping shared executable segments, and only replicated pages
> in those VMAs.  Maybe 'DENY_WRITE isn't exactly what we want.  Possibly set another
> flag for shared executables, if we can detect them, and any shared mapping that has
> no writable mappings ?

mapping_writably_mapped would be a good one to try. That may be too
broad in some corner cases where we do want occasionally-written files
or even parts of files to be replicated, but if we were ever to enable
CONFIG_REPLICATION by default, I imagine mapping_writably_mapped would
be the default heuristic.

Still, I appreciate the testing of the "thrashing" case, because with
the mapping_writably_mapped heuristic, it is likely that bugs could
remain lurking even in production workloads on huge systems (because
they will hardly ever get unreplicated).

 
> I'll try to remember to check the replication statistics after the currently
> running test.  If the system stays up, that is.  A quick look < 10 minutes into
> the test shows that zaps are now ~84% of replications.  Also, ~47k replicated pages
> out of ~287K file pages.

Yeah I guess it can be a little misleading: as time approaches infinity,
zaps will probably approach replications. But that doesn't tell you how
long a replica stayed around and usefully fed CPUs with local memory...

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [patch][rfc] 2.6.23-rc1 mm: NUMA replicated pagecache
  2007-08-13  7:43     ` Nick Piggin
@ 2007-08-13 14:05       ` Lee Schermerhorn
  2007-08-14  2:08         ` Nick Piggin
  2007-09-11 20:52       ` Update: [Automatic] NUMA replicated pagecache on 2.6.23-rc4-mm1 Lee Schermerhorn
  1 sibling, 1 reply; 18+ messages in thread
From: Lee Schermerhorn @ 2007-08-13 14:05 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linux Memory Management List, Linux Kernel Mailing List,
	Joachim Deguara, Christoph Lameter, Mel Gorman, Eric Whitney

On Mon, 2007-08-13 at 09:43 +0200, Nick Piggin wrote:
> On Fri, Aug 10, 2007 at 05:08:18PM -0400, Lee Schermerhorn wrote:
> > On Wed, 2007-08-08 at 16:25 -0400, Lee Schermerhorn wrote:
> > > On Fri, 2007-07-27 at 10:42 +0200, Nick Piggin wrote:
> > > > Hi,
> > > > 
> > > > Just got a bit of time to take another look at the replicated pagecache
> > > > patch. The nopage vs invalidate race and clear_page_dirty_for_io fixes
> > > > gives me more confidence in the locking now; the new ->fault API makes
> > > > MAP_SHARED write faults much more efficient; and a few bugs were found
> > > > and fixed.
> > > > 
> > > > More stats were added: *repl* in /proc/vmstat. Survives some kbuilding
> > > > tests...
> > > > 

<snip>
> 
> Hi Lee,
> 
> Am sick with the flu for the past few days, so I haven't done much more
> work here, but I'll just add some (not very useful) comments....
> 
> The get_page_from_freelist hang is quite strange. It would be zone->lock,
> which shouldn't have too much contention...
> 
> Replication may be putting more stress on some locks. It will cause more
> tlb flushing that can not be batched well, which could cause the call_lock
> to get hotter. Then i_mmap_lock is held over tlb flushing, so it will
> inherit the latency from call_lock. (If this is the case, we could
> potentially extend the tlb flushing API slightly to cope better with
> unmapping of pages from multiple mm's, but that comes way down the track
> when/if replication proves itself!).
> 
> tlb flushing AFAIKS should not do the IPI unless it is deadling with a
> multithreaded mm... does usex use threads?

Yes.  Apparently, there are some tests, perhaps some of the /usr/bin
apps that get run repeatedly, that are multi-threaded.  This job mix
caught a number of races in my auto-migration patches when
multi-threaded tasks race in the page fault paths.

More below...

> 
> 
> > I should note that I was trying to unmap all mappings to the file backed pages
> > on internode task migration, instead of just the current task's pte's.  However,
> > I was only attempting this on pages with  mapcount <= 4.  So, I don't think I 
> > was looping trying to unmap pages with mapcounts of several 10s--such as I see
> > on some page cache pages in my traces.
> 
> Replication teardown would still have to unmap all... but that shouldn't
> particularly be any worse than, say, page reclaim (except I guess that it
> could occur more often).
> 
>  
> > Today, after rebasing to 23-rc2-mm2, I added a patch to unmap only the current
> > task's ptes for ALL !anon pages, regardless of mapcount.  I've started the test
> > again and will let it run over the weekend--or as long as it stays up, which 
> > ever is shorter :-).
> 
> Ah, so it does eventually die? Any hints of why?

No, doesn't die--as in panic.  I was just commenting that I'd leave it
running ...  However [:-(], it DID hang again.  The test window said
that the tests ran for 62h:28m before the screen stopped updating.  In
another window, I was running a script to snap the replication and #
file pages vmstats, along with a timestamp, every 10 minutes.  That
stopped reporting stats at about 7:30am on Saturday--about 14h:30m into
the test.  It still wrote the timestamps [date command] until around 7am
this morning [Monday]--or ~62 hours into test.

So, I do have ~14 hours of replication stats that I can send you or plot
up...

Re: the hang:  again, console was scrolling soft lockups continuously.
Checking the messages file, I see hangs in copy_process(),
smp_call_function [as in prev test], vma_link [from mmap], ...

I also see a number of NaT ["not a thing"] consumptions--ia64 specific
error, probably invalid pointer deref--in swapin_readahead, which my
patches hack.  These might be the cause of the fork/mmap hangs.

Didn't see that in the 8-9Aug runs, so it might be a result of continued
operation after other hangs/problems; or a botch in the rebase to
rc2-mm2.  In any case, I have some work to do there...

> 
> > 
> > I put a tarball with the rebased series in the Replication directory linked
> > above, in case you're interested.  I haven't added the patch description for
> > the new patch yet, but it's pretty simple.  Maybe even correct.
> > 
> > ----
> > 
> > Unrelated to the lockups  [I think]:
> > 
> > I forgot to look before I rebooted, but earlier the previous evening, I checked
> > the vmstats and at that point [~1.5 hours into the test] we had done ~4.88 million
> > replications and ~4.8 million "zaps" [collapse of replicated page].  That's around
> > 98% zaps.  Do we need some filter in the fault path to reduce the "thrashing"--if
> > that's what I'm seeing.  
> 
> Yep. The current replication patch is very much only infrastructure at
> this stage (and is good for stress testing). I feel sure that heuristics
> and perhaps tunables would be needed to make the most of it.

Yeah.  I have some ideas to try...

At the end of the 14.5 hours when it stopped dumping vmstats, we were at
~95% zaps.

> 
> 
> > A while back I took a look at the Virtual Iron page replication patch.  They had
> > set VM_DENY_WRITE when mapping shared executable segments, and only replicated pages
> > in those VMAs.  Maybe 'DENY_WRITE isn't exactly what we want.  Possibly set another
> > flag for shared executables, if we can detect them, and any shared mapping that has
> > no writable mappings ?
> 
> mapping_writably_mapped would be a good one to try. That may be too
> broad in some corner cases where we do want occasionally-written files
> or even parts of files to be replicated, but if we were ever to enable
> CONFIG_REPLICATION by default, I imagine mapping_writably_mapped would
> be the default heuristic.
> 
> Still, I appreciate the testing of the "thrashing" case, because with
> the mapping_writably_mapped heuristic, it is likely that bugs could
> remain lurking even in production workloads on huge systems (because
> they will hardly ever get unreplicated).
> 
>  
> > I'll try to remember to check the replication statistics after the currently
> > running test.  If the system stays up, that is.  A quick look < 10 minutes into
> > the test shows that zaps are now ~84% of replications.  Also, ~47k replicated pages
> > out of ~287K file pages.
> 
> Yeah I guess it can be a little misleading: as time approaches infinity,
> zaps will probably approach replications. But that doesn't tell you how
> long a replica stayed around and usefully fed CPUs with local memory...

May be able to capture that info with a more invasive patch -- e.g., add
a timestamp to the page struct.  I'll think about it.

And, I'll keep you posted.  Not sure how much time I'll be able to
dedicate to this patch stream.  Got a few others I need to get back
to...

Later,
Lee


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [patch][rfc] 2.6.23-rc1 mm: NUMA replicated pagecache
  2007-08-13 14:05       ` Lee Schermerhorn
@ 2007-08-14  2:08         ` Nick Piggin
  0 siblings, 0 replies; 18+ messages in thread
From: Nick Piggin @ 2007-08-14  2:08 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Linux Memory Management List, Linux Kernel Mailing List,
	Joachim Deguara, Christoph Lameter, Mel Gorman, Eric Whitney

On Mon, Aug 13, 2007 at 10:05:01AM -0400, Lee Schermerhorn wrote:
> On Mon, 2007-08-13 at 09:43 +0200, Nick Piggin wrote:
> > 
> > Replication may be putting more stress on some locks. It will cause more
> > tlb flushing that can not be batched well, which could cause the call_lock
> > to get hotter. Then i_mmap_lock is held over tlb flushing, so it will
> > inherit the latency from call_lock. (If this is the case, we could
> > potentially extend the tlb flushing API slightly to cope better with
> > unmapping of pages from multiple mm's, but that comes way down the track
> > when/if replication proves itself!).
> > 
> > tlb flushing AFAIKS should not do the IPI unless it is deadling with a
> > multithreaded mm... does usex use threads?
> 
> Yes.  Apparently, there are some tests, perhaps some of the /usr/bin
> apps that get run repeatedly, that are multi-threaded.  This job mix
> caught a number of races in my auto-migration patches when
> multi-threaded tasks race in the page fault paths.
> 
> More below...

Hmm, come to think of it: I'm a bit mistaken. The replica zaps will often
to be coming from _other_ CPUs, so they will require an IPI regardless of
whether they are threaded or not.

The generic ia64 tlb flushing code also does a really bad job at flushing one
'mm' from another: it uses the single-threaded smp_call_function and broadcasts
IPIs (and TLB invalidates) to ALL CPUs, regardless of the cpu_vm_mask of the
target process. So you have a multiplicative problem with call_lock.

I think this path could be significantly optimised... but it's a bit nasty
to be playing around with the TLB flushing code while trying to test
something else :P

Can we make a simple change to smp_flush_tlb_all to do
smp_flush_tlb_cpumask(cpu_online_map), rather than on_each_cpu()? At least
then it will use the direct IPI vector and avoid call_lock.

> > Ah, so it does eventually die? Any hints of why?
> 
> No, doesn't die--as in panic.  I was just commenting that I'd leave it
> running ...  However [:-(], it DID hang again.  The test window said
> that the tests ran for 62h:28m before the screen stopped updating.  In
> another window, I was running a script to snap the replication and #
> file pages vmstats, along with a timestamp, every 10 minutes.  That
> stopped reporting stats at about 7:30am on Saturday--about 14h:30m into
> the test.  It still wrote the timestamps [date command] until around 7am
> this morning [Monday]--or ~62 hours into test.
> 
> So, I do have ~14 hours of replication stats that I can send you or plot
> up...

If you think it could be useful, sure.

> Re: the hang:  again, console was scrolling soft lockups continuously.
> Checking the messages file, I see hangs in copy_process(),
> smp_call_function [as in prev test], vma_link [from mmap], ...

I don't suppose it should hang even if it is encountering 10s delays on
call_lock.... but I wonder how it would go with the tlb flush change.
With luck, it would add more concurrency and make it hang _faster_ ;)

> > Yeah I guess it can be a little misleading: as time approaches infinity,
> > zaps will probably approach replications. But that doesn't tell you how
> > long a replica stayed around and usefully fed CPUs with local memory...
> 
> May be able to capture that info with a more invasive patch -- e.g., add
> a timestamp to the page struct.  I'll think about it.

Yeah that actually could be a good approach. You could make a histogram
of lifetimes which would be a decent metric to start tuning with. Ideally
you'd also want to record some context of what caused the zap and the status
of the file, but it may be difficult to get a good S/N on those metrics.

> And, I'll keep you posted.  Not sure how much time I'll be able to
> dedicate to this patch stream.  Got a few others I need to get back
> to...

Thanks, I appreciate it. I'm pretty much in the same boat, just spending a
bit of time on it here and there.

Thanks,
Nick

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Update:  [Automatic] NUMA replicated pagecache on 2.6.23-rc4-mm1
  2007-08-13  7:43     ` Nick Piggin
  2007-08-13 14:05       ` Lee Schermerhorn
@ 2007-09-11 20:52       ` Lee Schermerhorn
  2007-09-12  1:52         ` Balbir Singh
  1 sibling, 1 reply; 18+ messages in thread
From: Lee Schermerhorn @ 2007-09-11 20:52 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linux Memory Management List, balbir, Joachim Deguara,
	Christoph Lameter, Mel Gorman, Eric Whitney

[Balbir:  see notes re:  replication and memory controller below]

A quick update:  I have rebased the automatic/lazy page migration and
replication patches to 23-rc4-mm1.  If interested, you can find the
entire series that I push in the '070911' tarball at:

	http://free.linux.hp.com/~lts/Patches/Replication/

I haven't gotten around to some of the things you suggested to address
the soft lockups. etc.  I just wanted to keep the patches up to date.  

In the process of doing a quick sanity test, I encountered an issue with
replication and the new memory controller patches.  I had built the
kernel with the memory controller enabled.  I encountered a panic in
reclaim, while attempting to "drop caches", because replication was not
"charging" the replicated pages and reclaim tried to deref a null
"page_container" pointer.  [!!! new member in page struct !!!]

I added code to try_to_create_replica(), __remove_replicated_page() and
release_pcache_desc() to charge/uncharge where I thought appropriate
[replication patch # 02].  That seemed to solve the panic during drop
caches triggered reclaim.  However, when I tried a more stressful load,
I hit another panic ["NaT Consumption" == ia64-ese for invalid pointer
deref, I think] in shrink_active_list() called from direct reclaim.
Still to be investigated.  I wanted to give you and Balbir a heads up
about the interaction of memory controllers with page replication.

Later,
Lee

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Update:  [Automatic] NUMA replicated pagecache on 2.6.23-rc4-mm1
  2007-09-11 20:52       ` Update: [Automatic] NUMA replicated pagecache on 2.6.23-rc4-mm1 Lee Schermerhorn
@ 2007-09-12  1:52         ` Balbir Singh
  2007-09-12 13:48           ` Lee Schermerhorn
  0 siblings, 1 reply; 18+ messages in thread
From: Balbir Singh @ 2007-09-12  1:52 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Nick Piggin, Linux Memory Management List, Joachim Deguara,
	Christoph Lameter, Mel Gorman, Eric Whitney

Lee Schermerhorn wrote:
> [Balbir:  see notes re:  replication and memory controller below]
> 
> A quick update:  I have rebased the automatic/lazy page migration and
> replication patches to 23-rc4-mm1.  If interested, you can find the
> entire series that I push in the '070911' tarball at:
> 
> 	http://free.linux.hp.com/~lts/Patches/Replication/
> 
> I haven't gotten around to some of the things you suggested to address
> the soft lockups. etc.  I just wanted to keep the patches up to date.  
> 
> In the process of doing a quick sanity test, I encountered an issue with
> replication and the new memory controller patches.  I had built the
> kernel with the memory controller enabled.  I encountered a panic in
> reclaim, while attempting to "drop caches", because replication was not
> "charging" the replicated pages and reclaim tried to deref a null
> "page_container" pointer.  [!!! new member in page struct !!!]
> 
> I added code to try_to_create_replica(), __remove_replicated_page() and
> release_pcache_desc() to charge/uncharge where I thought appropriate
> [replication patch # 02].  That seemed to solve the panic during drop
> caches triggered reclaim.  However, when I tried a more stressful load,
> I hit another panic ["NaT Consumption" == ia64-ese for invalid pointer
> deref, I think] in shrink_active_list() called from direct reclaim.
> Still to be investigated.  I wanted to give you and Balbir a heads up
> about the interaction of memory controllers with page replication.
> 

Hi, Lee,

Thanks for testing the memory controller with page replication. I do
have some questions on the problem you are seeing

Did you see the problem with direct reclaim or container reclaim?
drop_caches calls remove_mapping(), which should eventually call
the uncharge routine. We have some sanity checks in there.

We do try to see at several places if the page->page_container is NULL
and check for it. I'll look at your patches to see if there are any
changes to the reclaim logic. I tried looking for the oops you
mentioned, but could not find it in your directory, I saw the soft
lockup logs though. Do you still have the oops saved somewhere?

I think the fix you have is correct and makes things works, but it
worries me that in direct reclaim we dereference the page_container
pointer without the page belonging to a container? What are the
properties of replicated pages? Are they assumed to be exact
replicas (struct page mappings, page_container expected to be the
same for all replicated pages) of the replicated page?


> Later,
> Lee
> 
-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Update:  [Automatic] NUMA replicated pagecache on 2.6.23-rc4-mm1
  2007-09-12  1:52         ` Balbir Singh
@ 2007-09-12 13:48           ` Lee Schermerhorn
  2007-09-12 14:08             ` Balbir Singh
  0 siblings, 1 reply; 18+ messages in thread
From: Lee Schermerhorn @ 2007-09-12 13:48 UTC (permalink / raw)
  To: balbir
  Cc: Nick Piggin, Linux Memory Management List, Joachim Deguara,
	Christoph Lameter, Mel Gorman, Eric Whitney

On Wed, 2007-09-12 at 07:22 +0530, Balbir Singh wrote:
> Lee Schermerhorn wrote:
> > [Balbir:  see notes re:  replication and memory controller below]
> > 
> > A quick update:  I have rebased the automatic/lazy page migration and
> > replication patches to 23-rc4-mm1.  If interested, you can find the
> > entire series that I push in the '070911' tarball at:
> > 
> > 	http://free.linux.hp.com/~lts/Patches/Replication/
> > 
> > I haven't gotten around to some of the things you suggested to address
> > the soft lockups. etc.  I just wanted to keep the patches up to date.  
> > 
> > In the process of doing a quick sanity test, I encountered an issue with
> > replication and the new memory controller patches.  I had built the
> > kernel with the memory controller enabled.  I encountered a panic in
> > reclaim, while attempting to "drop caches", because replication was not
> > "charging" the replicated pages and reclaim tried to deref a null
> > "page_container" pointer.  [!!! new member in page struct !!!]
> > 
> > I added code to try_to_create_replica(), __remove_replicated_page() and
> > release_pcache_desc() to charge/uncharge where I thought appropriate
> > [replication patch # 02].  That seemed to solve the panic during drop
> > caches triggered reclaim.  However, when I tried a more stressful load,
> > I hit another panic ["NaT Consumption" == ia64-ese for invalid pointer
> > deref, I think] in shrink_active_list() called from direct reclaim.
> > Still to be investigated.  I wanted to give you and Balbir a heads up
> > about the interaction of memory controllers with page replication.
> > 
> 
> Hi, Lee,
> 
> Thanks for testing the memory controller with page replication. I do
> have some questions on the problem you are seeing
> 
> Did you see the problem with direct reclaim or container reclaim?
> drop_caches calls remove_mapping(), which should eventually call
> the uncharge routine. We have some sanity checks in there.

Sorry.  This one wasn't in reclaim.  It was from the fault path, via
activate page.  The bug in reclaim occurred after I "fixed" page
replication to charge for replicated pages, thus adding the
page_container.  The second panic resulted from bad pointer ref in
shrink_active_list() from direct reclaim.

[abbreviated] stack traces attached below.

I took a look at an assembly language objdump and it appears that the
bad pointer deref occurred in the "while (!list_empty(&l_inactive))"
loop.  I see that there is also a mem_container_move_lists() call there.
I will try to rerun the workload on an unpatched 23-rc4-mm1 today to see
if it's reproducible there.  I can believe that this is a race between
replication [possibly "unreplicate"] and vmscan.  I don't know what type
of protection, if any, we have against that.  

> 
> We do try to see at several places if the page->page_container is NULL
> and check for it. I'll look at your patches to see if there are any
> changes to the reclaim logic. I tried looking for the oops you
> mentioned, but could not find it in your directory, I saw the soft
> lockup logs though. Do you still have the oops saved somewhere?
> 
> I think the fix you have is correct and makes things works, but it
> worries me that in direct reclaim we dereference the page_container
> pointer without the page belonging to a container? What are the
> properties of replicated pages? Are they assumed to be exact
> replicas (struct page mappings, page_container expected to be the
> same for all replicated pages) of the replicated page?

Before "fix"

Running spol+lpm+repl patches on 23-rc4-mm1.  kernel build test
echo 1 >/proc/sys/vm/drop_caches
Then [perhaps a coincidence]:

Unable to handle kernel NULL pointer dereference (address 0000000000000008)
cc1[23366]: Oops 11003706212352 [1]
Modules linked in: sunrpc binfmt_misc fan dock sg thermal processor container button sr_mod scsi_wait_scan ehci_hcd ohci_hcd uhci_hcd usbcore

Pid: 23366, CPU 6, comm:                  cc1
<snip>
 [<a000000100191a30>] __mem_container_move_lists+0x50/0x100
                                sp=e0000720449a7d60 bsp=e0000720449a1040
 [<a000000100192570>] mem_container_move_lists+0x50/0x80
                                sp=e0000720449a7d60 bsp=e0000720449a1010
 [<a0000001001382b0>] activate_page+0x1d0/0x220
                                sp=e0000720449a7d60 bsp=e0000720449a0fd0
 [<a0000001001389c0>] mark_page_accessed+0xe0/0x160
                                sp=e0000720449a7d60 bsp=e0000720449a0fb0
 [<a000000100125f30>] filemap_fault+0x390/0x840
                                sp=e0000720449a7d60 bsp=e0000720449a0f10
 [<a000000100146870>] __do_fault+0xd0/0xbc0
                                sp=e0000720449a7d60 bsp=e0000720449a0e90
 [<a00000010014b8e0>] handle_mm_fault+0x280/0x1540
                                sp=e0000720449a7d90 bsp=e0000720449a0e00
 [<a000000100071940>] ia64_do_page_fault+0x600/0xa80
                                sp=e0000720449a7da0 bsp=e0000720449a0da0
 [<a00000010000b5c0>] ia64_leave_kernel+0x0/0x270
                                sp=e0000720449a7e30 bsp=e0000720449a0da0


After "fix:"

Running "usex" [unix systems exerciser] load, with kernel build, io tests,
vm tests, memtoy "lock" tests, ...

as[15608]: NaT consumption 2216203124768 [1]
Modules linked in: sunrpc binfmt_misc fan dock sg container thermal button processor sr_mod scsi_wait_scan ehci_hcd ohci_hcd uhci_hcd usbcore

Pid: 15608, CPU 8, comm:                   as
<snip>
 [<a00000010000b5c0>] ia64_leave_kernel+0x0/0x270
                                sp=e00007401f53fab0 bsp=e00007401f539238
 [<a00000010013b4a0>] shrink_active_list+0x160/0xe80
                                sp=e00007401f53fc80 bsp=e00007401f539158
 [<a00000010013e780>] shrink_zone+0x240/0x280
                                sp=e00007401f53fd40 bsp=e00007401f539100
 [<a00000010013fec0>] zone_reclaim+0x3c0/0x580
                                sp=e00007401f53fd40 bsp=e00007401f539098
 [<a000000100130950>] get_page_from_freelist+0xb30/0x1360
                                sp=e00007401f53fd80 bsp=e00007401f538f08
 [<a000000100131310>] __alloc_pages+0xd0/0x620
                                sp=e00007401f53fd80 bsp=e00007401f538e38
 [<a000000100173240>] alloc_page_pol+0x100/0x180
                                sp=e00007401f53fd90 bsp=e00007401f538e08
 [<a0000001001733b0>] alloc_page_vma+0xf0/0x120
                                sp=e00007401f53fd90 bsp=e00007401f538dc8
 [<a00000010014bda0>] handle_mm_fault+0x740/0x1540
                                sp=e00007401f53fd90 bsp=e00007401f538d38
 [<a000000100071940>] ia64_do_page_fault+0x600/0xa80
                                sp=e00007401f53fda0 bsp=e00007401f538ce0
 [<a00000010000b5c0>] ia64_leave_kernel+0x0/0x270
                                sp=e00007401f53fe30 bsp=e00007401f538ce0


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Update:  [Automatic] NUMA replicated pagecache on 2.6.23-rc4-mm1
  2007-09-12 13:48           ` Lee Schermerhorn
@ 2007-09-12 14:08             ` Balbir Singh
  2007-09-12 15:09               ` Kernel Panic - 2.6.23-rc4-mm1 ia64 - was Re: Update: [Automatic] NUMA replicated pagecache Lee Schermerhorn
  0 siblings, 1 reply; 18+ messages in thread
From: Balbir Singh @ 2007-09-12 14:08 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Nick Piggin, Linux Memory Management List, Joachim Deguara,
	Christoph Lameter, Mel Gorman, Eric Whitney

Lee Schermerhorn wrote:
> On Wed, 2007-09-12 at 07:22 +0530, Balbir Singh wrote:
>> Lee Schermerhorn wrote:
>>> [Balbir:  see notes re:  replication and memory controller below]
>>>
>>> A quick update:  I have rebased the automatic/lazy page migration and
>>> replication patches to 23-rc4-mm1.  If interested, you can find the
>>> entire series that I push in the '070911' tarball at:
>>>
>>> 	http://free.linux.hp.com/~lts/Patches/Replication/
>>>
>>> I haven't gotten around to some of the things you suggested to address
>>> the soft lockups. etc.  I just wanted to keep the patches up to date.  
>>>
>>> In the process of doing a quick sanity test, I encountered an issue with
>>> replication and the new memory controller patches.  I had built the
>>> kernel with the memory controller enabled.  I encountered a panic in
>>> reclaim, while attempting to "drop caches", because replication was not
>>> "charging" the replicated pages and reclaim tried to deref a null
>>> "page_container" pointer.  [!!! new member in page struct !!!]
>>>
>>> I added code to try_to_create_replica(), __remove_replicated_page() and
>>> release_pcache_desc() to charge/uncharge where I thought appropriate
>>> [replication patch # 02].  That seemed to solve the panic during drop
>>> caches triggered reclaim.  However, when I tried a more stressful load,
>>> I hit another panic ["NaT Consumption" == ia64-ese for invalid pointer
>>> deref, I think] in shrink_active_list() called from direct reclaim.
>>> Still to be investigated.  I wanted to give you and Balbir a heads up
>>> about the interaction of memory controllers with page replication.
>>>
>> Hi, Lee,
>>
>> Thanks for testing the memory controller with page replication. I do
>> have some questions on the problem you are seeing
>>
>> Did you see the problem with direct reclaim or container reclaim?
>> drop_caches calls remove_mapping(), which should eventually call
>> the uncharge routine. We have some sanity checks in there.
> 
> Sorry.  This one wasn't in reclaim.  It was from the fault path, via
> activate page.  The bug in reclaim occurred after I "fixed" page
> replication to charge for replicated pages, thus adding the
> page_container.  The second panic resulted from bad pointer ref in
> shrink_active_list() from direct reclaim.
> 
> [abbreviated] stack traces attached below.
> 
> I took a look at an assembly language objdump and it appears that the
> bad pointer deref occurred in the "while (!list_empty(&l_inactive))"
> loop.  I see that there is also a mem_container_move_lists() call there.
> I will try to rerun the workload on an unpatched 23-rc4-mm1 today to see
> if it's reproducible there.  I can believe that this is a race between
> replication [possibly "unreplicate"] and vmscan.  I don't know what type
> of protection, if any, we have against that.  
> 


Thanks, the stack trace makes sense now. So basically, we have a case
where a page is on the zone LRU, but does not belong to any container,
which is why we do indeed need your first fix (to charge/uncharge) the
pages on replication/removal.

>> We do try to see at several places if the page->page_container is NULL
>> and check for it. I'll look at your patches to see if there are any
>> changes to the reclaim logic. I tried looking for the oops you
>> mentioned, but could not find it in your directory, I saw the soft
>> lockup logs though. Do you still have the oops saved somewhere?
>>
>> I think the fix you have is correct and makes things works, but it
>> worries me that in direct reclaim we dereference the page_container
>> pointer without the page belonging to a container? What are the
>> properties of replicated pages? Are they assumed to be exact
>> replicas (struct page mappings, page_container expected to be the
>> same for all replicated pages) of the replicated page?
> 
> Before "fix"
> 
> Running spol+lpm+repl patches on 23-rc4-mm1.  kernel build test
> echo 1 >/proc/sys/vm/drop_caches
> Then [perhaps a coincidence]:
> 
> Unable to handle kernel NULL pointer dereference (address 0000000000000008)
> cc1[23366]: Oops 11003706212352 [1]
> Modules linked in: sunrpc binfmt_misc fan dock sg thermal processor container button sr_mod scsi_wait_scan ehci_hcd ohci_hcd uhci_hcd usbcore
> 
> Pid: 23366, CPU 6, comm:                  cc1
> <snip>
>  [<a000000100191a30>] __mem_container_move_lists+0x50/0x100
>                                 sp=e0000720449a7d60 bsp=e0000720449a1040
>  [<a000000100192570>] mem_container_move_lists+0x50/0x80
>                                 sp=e0000720449a7d60 bsp=e0000720449a1010
>  [<a0000001001382b0>] activate_page+0x1d0/0x220
>                                 sp=e0000720449a7d60 bsp=e0000720449a0fd0
>  [<a0000001001389c0>] mark_page_accessed+0xe0/0x160
>                                 sp=e0000720449a7d60 bsp=e0000720449a0fb0
>  [<a000000100125f30>] filemap_fault+0x390/0x840
>                                 sp=e0000720449a7d60 bsp=e0000720449a0f10
>  [<a000000100146870>] __do_fault+0xd0/0xbc0
>                                 sp=e0000720449a7d60 bsp=e0000720449a0e90
>  [<a00000010014b8e0>] handle_mm_fault+0x280/0x1540
>                                 sp=e0000720449a7d90 bsp=e0000720449a0e00
>  [<a000000100071940>] ia64_do_page_fault+0x600/0xa80
>                                 sp=e0000720449a7da0 bsp=e0000720449a0da0
>  [<a00000010000b5c0>] ia64_leave_kernel+0x0/0x270
>                                 sp=e0000720449a7e30 bsp=e0000720449a0da0
> 
> 
> After "fix:"
> 
> Running "usex" [unix systems exerciser] load, with kernel build, io tests,
> vm tests, memtoy "lock" tests, ...
> 

Wow! thats a real stress, thanks for putting the controller through
this. How long is it before the system panics? BTW, is NaT NULL Address
Translation? Does this problem go away with the memory controller
disabled?

> as[15608]: NaT consumption 2216203124768 [1]
> Modules linked in: sunrpc binfmt_misc fan dock sg container thermal button processor sr_mod scsi_wait_scan ehci_hcd ohci_hcd uhci_hcd usbcore
> 
> Pid: 15608, CPU 8, comm:                   as
> <snip>
>  [<a00000010000b5c0>] ia64_leave_kernel+0x0/0x270
>                                 sp=e00007401f53fab0 bsp=e00007401f539238
>  [<a00000010013b4a0>] shrink_active_list+0x160/0xe80
>                                 sp=e00007401f53fc80 bsp=e00007401f539158
>  [<a00000010013e780>] shrink_zone+0x240/0x280
>                                 sp=e00007401f53fd40 bsp=e00007401f539100
>  [<a00000010013fec0>] zone_reclaim+0x3c0/0x580
>                                 sp=e00007401f53fd40 bsp=e00007401f539098
>  [<a000000100130950>] get_page_from_freelist+0xb30/0x1360
>                                 sp=e00007401f53fd80 bsp=e00007401f538f08
>  [<a000000100131310>] __alloc_pages+0xd0/0x620
>                                 sp=e00007401f53fd80 bsp=e00007401f538e38
>  [<a000000100173240>] alloc_page_pol+0x100/0x180
>                                 sp=e00007401f53fd90 bsp=e00007401f538e08
>  [<a0000001001733b0>] alloc_page_vma+0xf0/0x120
>                                 sp=e00007401f53fd90 bsp=e00007401f538dc8
>  [<a00000010014bda0>] handle_mm_fault+0x740/0x1540
>                                 sp=e00007401f53fd90 bsp=e00007401f538d38
>  [<a000000100071940>] ia64_do_page_fault+0x600/0xa80
>                                 sp=e00007401f53fda0 bsp=e00007401f538ce0
>  [<a00000010000b5c0>] ia64_leave_kernel+0x0/0x270
>                                 sp=e00007401f53fe30 bsp=e00007401f538ce0
> 
> 

Interesting, I don't see a memory controller function in the stack
trace, but I'll double check to see if I can find some silly race
condition in there.

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Kernel Panic - 2.6.23-rc4-mm1 ia64 - was Re: Update:  [Automatic] NUMA replicated pagecache ...
  2007-09-12 14:08             ` Balbir Singh
@ 2007-09-12 15:09               ` Lee Schermerhorn
  2007-09-12 15:41                 ` Andy Whitcroft
  0 siblings, 1 reply; 18+ messages in thread
From: Lee Schermerhorn @ 2007-09-12 15:09 UTC (permalink / raw)
  To: balbir, Andrew Morton
  Cc: Nick Piggin, Linux Memory Management List, Joachim Deguara,
	Christoph Lameter, Mel Gorman, Eric Whitney, linux-kernel

On Wed, 2007-09-12 at 19:38 +0530, Balbir Singh wrote:
> Lee Schermerhorn wrote:
> > On Wed, 2007-09-12 at 07:22 +0530, Balbir Singh wrote:
> >> Lee Schermerhorn wrote:
> >>> [Balbir:  see notes re:  replication and memory controller below]
> >>>
> >>> A quick update:  I have rebased the automatic/lazy page migration and
> >>> replication patches to 23-rc4-mm1.  If interested, you can find the
> >>> entire series that I push in the '070911' tarball at:
> >>>
> >>> 	http://free.linux.hp.com/~lts/Patches/Replication/
> >>>
> >>> I haven't gotten around to some of the things you suggested to address
> >>> the soft lockups. etc.  I just wanted to keep the patches up to date.  
> >>>
> >>> In the process of doing a quick sanity test, I encountered an issue with
> >>> replication and the new memory controller patches.  I had built the
> >>> kernel with the memory controller enabled.  I encountered a panic in
> >>> reclaim, while attempting to "drop caches", because replication was not
> >>> "charging" the replicated pages and reclaim tried to deref a null
> >>> "page_container" pointer.  [!!! new member in page struct !!!]
> >>>
> >>> I added code to try_to_create_replica(), __remove_replicated_page() and
> >>> release_pcache_desc() to charge/uncharge where I thought appropriate
> >>> [replication patch # 02].  That seemed to solve the panic during drop
> >>> caches triggered reclaim.  However, when I tried a more stressful load,
> >>> I hit another panic ["NaT Consumption" == ia64-ese for invalid pointer
> >>> deref, I think] in shrink_active_list() called from direct reclaim.
> >>> Still to be investigated.  I wanted to give you and Balbir a heads up
> >>> about the interaction of memory controllers with page replication.
> >>>
> >> Hi, Lee,
> >>
> >> Thanks for testing the memory controller with page replication. I do
> >> have some questions on the problem you are seeing
> >>
> >> Did you see the problem with direct reclaim or container reclaim?
> >> drop_caches calls remove_mapping(), which should eventually call
> >> the uncharge routine. We have some sanity checks in there.
> > 
> > Sorry.  This one wasn't in reclaim.  It was from the fault path, via
> > activate page.  The bug in reclaim occurred after I "fixed" page
> > replication to charge for replicated pages, thus adding the
> > page_container.  The second panic resulted from bad pointer ref in
> > shrink_active_list() from direct reclaim.
> > 
> > [abbreviated] stack traces attached below.
> > 
> > I took a look at an assembly language objdump and it appears that the
> > bad pointer deref occurred in the "while (!list_empty(&l_inactive))"
> > loop.  I see that there is also a mem_container_move_lists() call there.
> > I will try to rerun the workload on an unpatched 23-rc4-mm1 today to see
> > if it's reproducible there.  I can believe that this is a race between
> > replication [possibly "unreplicate"] and vmscan.  I don't know what type
> > of protection, if any, we have against that.  
> > 
> 
> 
> Thanks, the stack trace makes sense now. So basically, we have a case
> where a page is on the zone LRU, but does not belong to any container,
> which is why we do indeed need your first fix (to charge/uncharge) the
> pages on replication/removal.
> 
> >> We do try to see at several places if the page->page_container is NULL
> >> and check for it. I'll look at your patches to see if there are any
> >> changes to the reclaim logic. I tried looking for the oops you
> >> mentioned, but could not find it in your directory, I saw the soft
> >> lockup logs though. Do you still have the oops saved somewhere?
> >>
> >> I think the fix you have is correct and makes things works, but it
> >> worries me that in direct reclaim we dereference the page_container
> >> pointer without the page belonging to a container? What are the
> >> properties of replicated pages? Are they assumed to be exact
> >> replicas (struct page mappings, page_container expected to be the
> >> same for all replicated pages) of the replicated page?
> > 
> > Before "fix"
> > 
> > Running spol+lpm+repl patches on 23-rc4-mm1.  kernel build test
> > echo 1 >/proc/sys/vm/drop_caches
> > Then [perhaps a coincidence]:
> > 
> > Unable to handle kernel NULL pointer dereference (address 0000000000000008)
> > cc1[23366]: Oops 11003706212352 [1]
> > Modules linked in: sunrpc binfmt_misc fan dock sg thermal processor container button sr_mod scsi_wait_scan ehci_hcd ohci_hcd uhci_hcd usbcore
> > 
> > Pid: 23366, CPU 6, comm:                  cc1
> > <snip>
> >  [<a000000100191a30>] __mem_container_move_lists+0x50/0x100
> >                                 sp=e0000720449a7d60 bsp=e0000720449a1040
> >  [<a000000100192570>] mem_container_move_lists+0x50/0x80
> >                                 sp=e0000720449a7d60 bsp=e0000720449a1010
> >  [<a0000001001382b0>] activate_page+0x1d0/0x220
> >                                 sp=e0000720449a7d60 bsp=e0000720449a0fd0
> >  [<a0000001001389c0>] mark_page_accessed+0xe0/0x160
> >                                 sp=e0000720449a7d60 bsp=e0000720449a0fb0
> >  [<a000000100125f30>] filemap_fault+0x390/0x840
> >                                 sp=e0000720449a7d60 bsp=e0000720449a0f10
> >  [<a000000100146870>] __do_fault+0xd0/0xbc0
> >                                 sp=e0000720449a7d60 bsp=e0000720449a0e90
> >  [<a00000010014b8e0>] handle_mm_fault+0x280/0x1540
> >                                 sp=e0000720449a7d90 bsp=e0000720449a0e00
> >  [<a000000100071940>] ia64_do_page_fault+0x600/0xa80
> >                                 sp=e0000720449a7da0 bsp=e0000720449a0da0
> >  [<a00000010000b5c0>] ia64_leave_kernel+0x0/0x270
> >                                 sp=e0000720449a7e30 bsp=e0000720449a0da0
> > 
> > 
> > After "fix:"
> > 
> > Running "usex" [unix systems exerciser] load, with kernel build, io tests,
> > vm tests, memtoy "lock" tests, ...
> > 
> 
> Wow! thats a real stress, thanks for putting the controller through
> this. How long is it before the system panics? BTW, is NaT NULL Address
> Translation? Does this problem go away with the memory controller
> disabled?

System panics within a few seconds of starting the test.

NaT == Not a Thing.  Kernel reports null pointer deref as such.  I
believe that NaT Consumption errors come from attempting to deref a
non-NULL pointer that points at non-existent memory.

I tried the workload again with an "unpatched kernel" -- i.e., no
automatic page migration nor replication, nor any other of my
experimental patches.  Still happens with memory controller configured
-- same stack trace.

Then I tried an unpatched 23-rc4-mm1 with memory controller NOT
configured, still panic'ed, but with a different symptom:  first a soft
lockup, then a NULL pointer deref--apparently in soft lockup detection
code.  Panics because it OOPses in interrupt handler.

Tried again, same kernel--mem controller unconfig'd:  this time I got
the original stack trace--NaT Consumption in shrink_active_list().
Then, softlockup with NULL pointer deref therein.  It's the null pointer
deref that causes the panic:  "Aiee, killing interrupt handler!"

So, maybe memory controller is "off the hook".

I guess I need to check the lists for 23-rc4-mm1 hot fixes, and try to
bisect rc4-mm1.

> 
> > as[15608]: NaT consumption 2216203124768 [1]
> > Modules linked in: sunrpc binfmt_misc fan dock sg container thermal button processor sr_mod scsi_wait_scan ehci_hcd ohci_hcd uhci_hcd usbcore
> > 
> > Pid: 15608, CPU 8, comm:                   as
> > <snip>
> >  [<a00000010000b5c0>] ia64_leave_kernel+0x0/0x270
> >                                 sp=e00007401f53fab0 bsp=e00007401f539238
> >  [<a00000010013b4a0>] shrink_active_list+0x160/0xe80
> >                                 sp=e00007401f53fc80 bsp=e00007401f539158
> >  [<a00000010013e780>] shrink_zone+0x240/0x280
> >                                 sp=e00007401f53fd40 bsp=e00007401f539100
> >  [<a00000010013fec0>] zone_reclaim+0x3c0/0x580
> >                                 sp=e00007401f53fd40 bsp=e00007401f539098
> >  [<a000000100130950>] get_page_from_freelist+0xb30/0x1360
> >                                 sp=e00007401f53fd80 bsp=e00007401f538f08
> >  [<a000000100131310>] __alloc_pages+0xd0/0x620
> >                                 sp=e00007401f53fd80 bsp=e00007401f538e38
> >  [<a000000100173240>] alloc_page_pol+0x100/0x180
> >                                 sp=e00007401f53fd90 bsp=e00007401f538e08
> >  [<a0000001001733b0>] alloc_page_vma+0xf0/0x120
> >                                 sp=e00007401f53fd90 bsp=e00007401f538dc8
> >  [<a00000010014bda0>] handle_mm_fault+0x740/0x1540
> >                                 sp=e00007401f53fd90 bsp=e00007401f538d38
> >  [<a000000100071940>] ia64_do_page_fault+0x600/0xa80
> >                                 sp=e00007401f53fda0 bsp=e00007401f538ce0
> >  [<a00000010000b5c0>] ia64_leave_kernel+0x0/0x270
> >                                 sp=e00007401f53fe30 bsp=e00007401f538ce0
> > 
> > 
> 
> Interesting, I don't see a memory controller function in the stack
> trace, but I'll double check to see if I can find some silly race
> condition in there.

right.  I noticed that after I sent the mail.  

Also, config available at:
http://free.linux.hp.com/~lts/Temp/config-2.6.23-rc4-mm1-gwydyr-nomemcont



Later,
Lee

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Kernel Panic - 2.6.23-rc4-mm1 ia64 - was Re: Update:  [Automatic] NUMA replicated pagecache ...
  2007-09-12 15:09               ` Kernel Panic - 2.6.23-rc4-mm1 ia64 - was Re: Update: [Automatic] NUMA replicated pagecache Lee Schermerhorn
@ 2007-09-12 15:41                 ` Andy Whitcroft
  2007-09-12 17:04                   ` Lee Schermerhorn
  2007-09-12 19:46                   ` [PATCH] " Lee Schermerhorn
  0 siblings, 2 replies; 18+ messages in thread
From: Andy Whitcroft @ 2007-09-12 15:41 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: balbir, Andrew Morton, Nick Piggin, Linux Memory Management List,
	Joachim Deguara, Christoph Lameter, Mel Gorman, Eric Whitney,
	linux-kernel

On Wed, Sep 12, 2007 at 11:09:47AM -0400, Lee Schermerhorn wrote:

> > Interesting, I don't see a memory controller function in the stack
> > trace, but I'll double check to see if I can find some silly race
> > condition in there.
> 
> right.  I noticed that after I sent the mail.  
> 
> Also, config available at:
> http://free.linux.hp.com/~lts/Temp/config-2.6.23-rc4-mm1-gwydyr-nomemcont

Be interested to know the outcome of any bisect you do.  Given its
tripping in reclaim.

What size of box is this?  Wondering if we have anything big enough to
test with.

-apw

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Kernel Panic - 2.6.23-rc4-mm1 ia64 - was Re: Update: [Automatic] NUMA replicated pagecache ...
  2007-09-12 15:41                 ` Andy Whitcroft
@ 2007-09-12 17:04                   ` Lee Schermerhorn
  2007-09-12 19:46                   ` [PATCH] " Lee Schermerhorn
  1 sibling, 0 replies; 18+ messages in thread
From: Lee Schermerhorn @ 2007-09-12 17:04 UTC (permalink / raw)
  To: Andy Whitcroft
  Cc: balbir, Andrew Morton, Nick Piggin, Linux Memory Management List,
	Joachim Deguara, Christoph Lameter, Mel Gorman, Eric Whitney,
	linux-kernel

On Wed, 2007-09-12 at 16:41 +0100, Andy Whitcroft wrote:
> On Wed, Sep 12, 2007 at 11:09:47AM -0400, Lee Schermerhorn wrote:
> 
> > > Interesting, I don't see a memory controller function in the stack
> > > trace, but I'll double check to see if I can find some silly race
> > > condition in there.
> > 
> > right.  I noticed that after I sent the mail.  
> > 
> > Also, config available at:
> > http://free.linux.hp.com/~lts/Temp/config-2.6.23-rc4-mm1-gwydyr-nomemcont
> 
> Be interested to know the outcome of any bisect you do.  Given its
> tripping in reclaim.

FYI:  doesn't seem to fail with 23-rc6.  

> 
> What size of box is this?  Wondering if we have anything big enough to
> test with.

This is a 16-cpu, 4-node, 32GB HP rx8620.  The test load that I'm
running is Dave Anderson's "usex" with a custom test script that runs:

5 built-in usex IO tests to a separate file system on a SCSI disk.
1 built-in usex IO rate test -- to/from same disk/fs.
1 POV ray tracing app--just because I had it :-)
1 script that does "find / -type f | xargs strings >/dev/null" to
pollute the page cache.
2 memtoy scripts to allocate various size anon segments--up to 20GB--
and mlock() them down to force reclaim.
1 32-way parallel kernel build
3 1GB random vm tests
3 1GB sequential vm tests
9 built-in usex "bin" tests--these run a series of programs
from /usr/bin to simulate users doing random things.  Not really random,
tho'.  Just walks a table of commands sequentially.

This load beats up on the system fairly heavily.

I can package up the usex input script and the other associated scripts
that it invokes, if you're interested.  Let me know...

Lee

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH] Re: Kernel Panic - 2.6.23-rc4-mm1 ia64 - was Re: Update: [Automatic] NUMA replicated pagecache ...
  2007-09-12 15:41                 ` Andy Whitcroft
  2007-09-12 17:04                   ` Lee Schermerhorn
@ 2007-09-12 19:46                   ` Lee Schermerhorn
  2007-09-12 21:23                     ` Balbir Singh
  1 sibling, 1 reply; 18+ messages in thread
From: Lee Schermerhorn @ 2007-09-12 19:46 UTC (permalink / raw)
  To: Andy Whitcroft, balbir, Andrew Morton
  Cc: Nick Piggin, Linux Memory Management List, Joachim Deguara,
	Christoph Lameter, Mel Gorman, Eric Whitney, linux-kernel

On Wed, 2007-09-12 at 16:41 +0100, Andy Whitcroft wrote:
> On Wed, Sep 12, 2007 at 11:09:47AM -0400, Lee Schermerhorn wrote:
> 
> > > Interesting, I don't see a memory controller function in the stack
> > > trace, but I'll double check to see if I can find some silly race
> > > condition in there.
> > 
> > right.  I noticed that after I sent the mail.  
> > 
> > Also, config available at:
> > http://free.linux.hp.com/~lts/Temp/config-2.6.23-rc4-mm1-gwydyr-nomemcont
> 
> Be interested to know the outcome of any bisect you do.  Given its
> tripping in reclaim.

Problem isolated to memory controller patches.  This patch seems to fix
this particular problem.  I've only run the test for a few minutes with
and without memory controller configured, but I did observe reclaim
kicking in several times.  W/o this patch, system would panic as soon as
I entered direct/zone reclaim--less than a minute.

Lee
--------------------------------

PATCH 2.6.23-rc4-mm1 Memory Controller:  initialize all scan_controls'
	isolate_pages member.

We need to initialize all scan_controls' isolate_pages member.
Otherwise, shrink_active_list() attempts to execute at undefined
location.

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>

 mm/vmscan.c |    2 ++
 1 file changed, 2 insertions(+)

Index: Linux/mm/vmscan.c
===================================================================
--- Linux.orig/mm/vmscan.c	2007-09-10 13:22:21.000000000 -0400
+++ Linux/mm/vmscan.c	2007-09-12 15:30:27.000000000 -0400
@@ -1758,6 +1758,7 @@ unsigned long shrink_all_memory(unsigned
 		.swap_cluster_max = nr_pages,
 		.may_writepage = 1,
 		.swappiness = vm_swappiness,
+		.isolate_pages = isolate_pages_global,
 	};
 
 	current->reclaim_state = &reclaim_state;
@@ -1941,6 +1942,7 @@ static int __zone_reclaim(struct zone *z
 					SWAP_CLUSTER_MAX),
 		.gfp_mask = gfp_mask,
 		.swappiness = vm_swappiness,
+		.isolate_pages = isolate_pages_global,
 	};
 	unsigned long slab_reclaimable;
 


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] Re: Kernel Panic - 2.6.23-rc4-mm1 ia64 - was Re: Update: [Automatic] NUMA replicated pagecache ...
  2007-09-12 19:46                   ` [PATCH] " Lee Schermerhorn
@ 2007-09-12 21:23                     ` Balbir Singh
  0 siblings, 0 replies; 18+ messages in thread
From: Balbir Singh @ 2007-09-12 21:23 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Andy Whitcroft, Andrew Morton, Nick Piggin,
	Linux Memory Management List, Joachim Deguara, Christoph Lameter,
	Mel Gorman, Eric Whitney, linux-kernel

Lee Schermerhorn wrote:
> On Wed, 2007-09-12 at 16:41 +0100, Andy Whitcroft wrote:
>> On Wed, Sep 12, 2007 at 11:09:47AM -0400, Lee Schermerhorn wrote:
>>
>>>> Interesting, I don't see a memory controller function in the stack
>>>> trace, but I'll double check to see if I can find some silly race
>>>> condition in there.
>>> right.  I noticed that after I sent the mail.  
>>>
>>> Also, config available at:
>>> http://free.linux.hp.com/~lts/Temp/config-2.6.23-rc4-mm1-gwydyr-nomemcont
>> Be interested to know the outcome of any bisect you do.  Given its
>> tripping in reclaim.
> 
> Problem isolated to memory controller patches.  This patch seems to fix
> this particular problem.  I've only run the test for a few minutes with
> and without memory controller configured, but I did observe reclaim
> kicking in several times.  W/o this patch, system would panic as soon as
> I entered direct/zone reclaim--less than a minute.
> 

Thanks, excellent catch! The patch looks sane.  Thanks for your help in
sorting this issue out. Hmm.. that means I never hit direct/zone reclaim
in my tests (I'll make a mental note to enhance my test cases to cover
this scenario).

> Lee
> --------------------------------
> 
> PATCH 2.6.23-rc4-mm1 Memory Controller:  initialize all scan_controls'
> 	isolate_pages member.
> 
> We need to initialize all scan_controls' isolate_pages member.
> Otherwise, shrink_active_list() attempts to execute at undefined
> location.
> 
> Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>
> 
>  mm/vmscan.c |    2 ++
>  1 file changed, 2 insertions(+)
> 
> Index: Linux/mm/vmscan.c
> ===================================================================
> --- Linux.orig/mm/vmscan.c	2007-09-10 13:22:21.000000000 -0400
> +++ Linux/mm/vmscan.c	2007-09-12 15:30:27.000000000 -0400
> @@ -1758,6 +1758,7 @@ unsigned long shrink_all_memory(unsigned
>  		.swap_cluster_max = nr_pages,
>  		.may_writepage = 1,
>  		.swappiness = vm_swappiness,
> +		.isolate_pages = isolate_pages_global,
>  	};
> 
>  	current->reclaim_state = &reclaim_state;
> @@ -1941,6 +1942,7 @@ static int __zone_reclaim(struct zone *z
>  					SWAP_CLUSTER_MAX),
>  		.gfp_mask = gfp_mask,
>  		.swappiness = vm_swappiness,
> +		.isolate_pages = isolate_pages_global,
>  	};
>  	unsigned long slab_reclaimable;
> 
> 
> 


-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2007-09-12 21:26 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-07-27  8:42 [patch][rfc] 2.6.23-rc1 mm: NUMA replicated pagecache Nick Piggin
2007-07-27 14:30 ` Lee Schermerhorn
2007-07-30  3:16   ` Nick Piggin
2007-07-30 16:29     ` Lee Schermerhorn
2007-08-08 20:25 ` Lee Schermerhorn
2007-08-10 21:08   ` Lee Schermerhorn
2007-08-13  7:43     ` Nick Piggin
2007-08-13 14:05       ` Lee Schermerhorn
2007-08-14  2:08         ` Nick Piggin
2007-09-11 20:52       ` Update: [Automatic] NUMA replicated pagecache on 2.6.23-rc4-mm1 Lee Schermerhorn
2007-09-12  1:52         ` Balbir Singh
2007-09-12 13:48           ` Lee Schermerhorn
2007-09-12 14:08             ` Balbir Singh
2007-09-12 15:09               ` Kernel Panic - 2.6.23-rc4-mm1 ia64 - was Re: Update: [Automatic] NUMA replicated pagecache Lee Schermerhorn
2007-09-12 15:41                 ` Andy Whitcroft
2007-09-12 17:04                   ` Lee Schermerhorn
2007-09-12 19:46                   ` [PATCH] " Lee Schermerhorn
2007-09-12 21:23                     ` Balbir Singh

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox