* [PATCH 0/2] Pzone based CKRM memory resource controller
@ 2006-01-19 8:04 KUROSAWA Takahiro
2006-01-19 8:04 ` [PATCH 1/2] Add the pzone KUROSAWA Takahiro
` (2 more replies)
0 siblings, 3 replies; 32+ messages in thread
From: KUROSAWA Takahiro @ 2006-01-19 8:04 UTC (permalink / raw)
To: ckrm-tech; +Cc: linux-mm, KUROSAWA Takahiro
(Changed the mail format into LKML-style and added linux-mm list to Cc:.
These patches are almost the same as what I sent to ckrm-tech@lists.sf.net
this week.)
The pzone (pseudo zone) based memory resource controller is yet
another implementation of the CKRM memory resource controller.
The existing CKRM memory resource controller counts the number of
pages that are allocated for tasks in a class in order to guarantee
and limit memory resources. This requires changes to the existing
code for page allocation and page reclaim.
This memory resource controller takes a different approach, aiming for
less impact on the existing Linux kernel code. The pzone is
introduced to reserve a specified number of pages from an existing
zone. A pzone reuses the existing zone structure but adds several
members. This keeps the impact on the memory management code small;
our memory resource controller requires neither special LRU lists of
pages nor an additional member in the page structure. It also
requires no changes to the algorithms of the memory management
system.
Tasks in a class allocate pages through a zonelist that consists of
pzones. The memory resource guarantee is achieved by preventing tasks
in other classes from allocating pages from those pzones. The limit on
the number of pages a class holds is enforced by allowing page
allocations only from the pzones and disabling page allocations from
the conventional zones.
Thus, pages are accounted to the class of the task that calls
__alloc_pages(). The resource guarantee and the limit are handled as
the same value; a user-space daemon could be introduced to separate
the guarantee from the limit.
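For illustration, here is a minimal sketch of this allocation scheme
(not part of the patches; pzone_create() and the single-entry zonelist
follow the interfaces added in patch 1, while the helper function and
the page counts are hypothetical):

/*
 * Illustrative sketch only: reserve pages for a class and allocate
 * from the reservation.  With no fallback zones in the zonelist, the
 * reservation also acts as the limit.
 */
static struct page *class_alloc_example(void)
{
	static struct zonelist class_zonelist;
	struct zone *pz;

	/* Reserve 4096 pages from node 0's ZONE_NORMAL for the class. */
	pz = pzone_create(NODE_DATA(0)->node_zones + ZONE_NORMAL,
			  "example_class", 4096);
	if (!pz)
		return NULL;

	memset(&class_zonelist, 0, sizeof(class_zonelist));
	class_zonelist.zones[0] = pz;

	/* Tasks in the class allocate order-0 pages only from the pzone. */
	return __alloc_pages(GFP_USER, 0, &class_zonelist);
}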
The current implementation doesn't move the resource accounting when a
task's class is changed. Moving the accounting could be implemented
with Christoph Lameter's page migration patches.
The patches are against linux-2.6.15; the first introduces pzones and
the second implements the memory resource controller on top of them.
These patches are not adequately tested yet; they are still under
development and need further work.
Regards,
KUROSAWA, Takahiro
* [PATCH 1/2] Add the pzone
2006-01-19 8:04 [PATCH 0/2] Pzone based CKRM memory resource controller KUROSAWA Takahiro
@ 2006-01-19 8:04 ` KUROSAWA Takahiro
2006-01-19 18:04 ` Andy Whitcroft
2006-01-20 7:08 ` KAMEZAWA Hiroyuki
2006-01-19 8:04 ` [PATCH 2/2] Add CKRM memory resource controller using pzones KUROSAWA Takahiro
2006-01-31 2:30 ` [PATCH 0/8] Pzone based CKRM memory resource controller KUROSAWA Takahiro
2 siblings, 2 replies; 32+ messages in thread
From: KUROSAWA Takahiro @ 2006-01-19 8:04 UTC (permalink / raw)
To: ckrm-tech; +Cc: linux-mm, KUROSAWA Takahiro
This patch implements the pzone (pseudo zone). A pzone can be used
to reserve pages in a zone. Pzones are implemented by extending the
zone structure and behave almost the same as conventional zones:
pzones can be specified in a zonelist passed to __alloc_pages(), and
the vmscan code works on pzones with only a few modifications.
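A hedged usage sketch of the interfaces introduced below (not part of
the patch; error handling is abbreviated and the calling context is
hypothetical):

	struct zone *parent = NODE_DATA(0)->node_zones + ZONE_NORMAL;
	struct zone *pz;

	pz = pzone_create(parent, "demo", 1024); /* reserve 1024 pages */
	if (pz) {
		/* allocations go through a zonelist holding pz (see cover letter) */
		pzone_set_numpages(pz, 2048);	/* grow the reservation */
		pzone_destroy(pz);	/* caller ensures no remaining references */
	}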
Signed-off-by: KUROSAWA Takahiro <kurosawa@valinux.co.jp>
---
include/linux/gfp.h | 3
include/linux/mm.h | 49 ++
include/linux/mmzone.h | 118 ++++++
include/linux/swap.h | 2
mm/Kconfig | 6
mm/page_alloc.c | 845 +++++++++++++++++++++++++++++++++++++++++++++----
mm/shmem.c | 2
mm/vmscan.c | 75 +++-
8 files changed, 1020 insertions(+), 80 deletions(-)
diff -urNp a/include/linux/gfp.h b/include/linux/gfp.h
--- a/include/linux/gfp.h 2006-01-03 12:21:10.000000000 +0900
+++ b/include/linux/gfp.h 2006-01-19 15:23:42.000000000 +0900
@@ -47,6 +47,7 @@ struct vm_area_struct;
#define __GFP_ZERO ((__force gfp_t)0x8000u)/* Return zeroed page on success */
#define __GFP_NOMEMALLOC ((__force gfp_t)0x10000u) /* Don't use emergency reserves */
#define __GFP_HARDWALL ((__force gfp_t)0x20000u) /* Enforce hardwall cpuset memory allocs */
+#define __GFP_NOLRU ((__force gfp_t)0x40000u) /* GFP_USER but will not be in LRU lists */
#define __GFP_BITS_SHIFT 20 /* Room for 20 __GFP_FOO bits */
#define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
@@ -55,7 +56,7 @@ struct vm_area_struct;
#define GFP_LEVEL_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS| \
__GFP_COLD|__GFP_NOWARN|__GFP_REPEAT| \
__GFP_NOFAIL|__GFP_NORETRY|__GFP_NO_GROW|__GFP_COMP| \
- __GFP_NOMEMALLOC|__GFP_HARDWALL)
+ __GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_NOLRU)
#define GFP_ATOMIC (__GFP_HIGH)
#define GFP_NOIO (__GFP_WAIT)
diff -urNp a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h 2006-01-03 12:21:10.000000000 +0900
+++ b/include/linux/mm.h 2006-01-19 15:23:00.000000000 +0900
@@ -397,6 +397,12 @@ void put_page(struct page *page);
* with space for node: | SECTION | NODE | ZONE | ... | FLAGS |
* no space for node: | SECTION | ZONE | ... | FLAGS |
*/
+
+#ifdef CONFIG_PSEUDO_ZONE
+#define PZONE_BIT_WIDTH 1
+#else
+#define PZONE_BIT_WIDTH 0
+#endif
#ifdef CONFIG_SPARSEMEM
#define SECTIONS_WIDTH SECTIONS_SHIFT
#else
@@ -405,14 +411,15 @@ void put_page(struct page *page);
#define ZONES_WIDTH ZONES_SHIFT
-#if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT <= FLAGS_RESERVED
+#if PZONE_BIT_WIDTH+SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT <= FLAGS_RESERVED
#define NODES_WIDTH NODES_SHIFT
#else
#define NODES_WIDTH 0
#endif
-/* Page flags: | [SECTION] | [NODE] | ZONE | ... | FLAGS | */
-#define SECTIONS_PGOFF ((sizeof(unsigned long)*8) - SECTIONS_WIDTH)
+/* Page flags: | [PZONE] | [SECTION] | [NODE] | ZONE | ... | FLAGS | */
+#define PZONE_BIT_PGOFF ((sizeof(unsigned long)*8) - PZONE_BIT_WIDTH)
+#define SECTIONS_PGOFF (PZONE_BIT_PGOFF - SECTIONS_WIDTH)
#define NODES_PGOFF (SECTIONS_PGOFF - NODES_WIDTH)
#define ZONES_PGOFF (NODES_PGOFF - ZONES_WIDTH)
@@ -431,6 +438,7 @@ void put_page(struct page *page);
* sections we define the shift as 0; that plus a 0 mask ensures
* the compiler will optimise away reference to them.
*/
+#define PZONE_BIT_PGSHIFT (PZONE_BIT_PGOFF * (PZONE_BIT_WIDTH != 0))
#define SECTIONS_PGSHIFT (SECTIONS_PGOFF * (SECTIONS_WIDTH != 0))
#define NODES_PGSHIFT (NODES_PGOFF * (NODES_WIDTH != 0))
#define ZONES_PGSHIFT (ZONES_PGOFF * (ZONES_WIDTH != 0))
@@ -443,10 +451,11 @@ void put_page(struct page *page);
#endif
#define ZONETABLE_PGSHIFT ZONES_PGSHIFT
-#if SECTIONS_WIDTH+NODES_WIDTH+ZONES_WIDTH > FLAGS_RESERVED
-#error SECTIONS_WIDTH+NODES_WIDTH+ZONES_WIDTH > FLAGS_RESERVED
+#if PZONE_BIT_WIDTH+SECTIONS_WIDTH+NODES_WIDTH+ZONES_WIDTH > FLAGS_RESERVED
+#error PZONE_BIT_WIDTH+SECTIONS_WIDTH+NODES_WIDTH+ZONES_WIDTH > FLAGS_RESERVED
#endif
+#define PZONE_BIT_MASK ((1UL << PZONE_BIT_WIDTH) - 1)
#define ZONES_MASK ((1UL << ZONES_WIDTH) - 1)
#define NODES_MASK ((1UL << NODES_WIDTH) - 1)
#define SECTIONS_MASK ((1UL << SECTIONS_WIDTH) - 1)
@@ -454,12 +463,38 @@ void put_page(struct page *page);
static inline unsigned long page_zonenum(struct page *page)
{
- return (page->flags >> ZONES_PGSHIFT) & ZONES_MASK;
+ return (page->flags >> ZONES_PGSHIFT) & (ZONES_MASK | PZONE_BIT_MASK);
}
struct zone;
extern struct zone *zone_table[];
+#ifdef CONFIG_PSEUDO_ZONE
+static inline int page_in_pzone(struct page *page)
+{
+ return (page->flags >> PZONE_BIT_PGSHIFT) & PZONE_BIT_MASK;
+}
+
+static inline struct zone *page_zone(struct page *page)
+{
+ int idx;
+
+ idx = (page->flags >> ZONETABLE_PGSHIFT) & ZONETABLE_MASK;
+ if (page_in_pzone(page))
+ return pzone_table[idx].zone;
+ return zone_table[idx];
+}
+
+static inline unsigned long page_to_nid(struct page *page)
+{
+ return page_zone(page)->zone_pgdat->node_id;
+}
+#else
+static inline int page_in_pzone(struct page *page)
+{
+ return 0;
+}
+
static inline struct zone *page_zone(struct page *page)
{
return zone_table[(page->flags >> ZONETABLE_PGSHIFT) &
@@ -473,6 +508,8 @@ static inline unsigned long page_to_nid(
else
return page_zone(page)->zone_pgdat->node_id;
}
+#endif
+
static inline unsigned long page_to_section(struct page *page)
{
return (page->flags >> SECTIONS_PGSHIFT) & SECTIONS_MASK;
diff -urNp a/include/linux/mmzone.h b/include/linux/mmzone.h
--- a/include/linux/mmzone.h 2006-01-03 12:21:10.000000000 +0900
+++ b/include/linux/mmzone.h 2006-01-19 15:23:00.000000000 +0900
@@ -111,6 +111,15 @@ struct zone {
/* Fields commonly accessed by the page allocator */
unsigned long free_pages;
unsigned long pages_min, pages_low, pages_high;
+
+#ifdef CONFIG_PSEUDO_ZONE
+ /* Pseudo zone members: children list is protected by nr_zones_lock */
+ struct zone *parent;
+ struct list_head children;
+ struct list_head sibling;
+ int pzone_idx;
+#endif
+
/*
* We don't know if the memory that we're going to allocate will be freeable
* or/and it will be released eventually, so to avoid totally wasting several
@@ -336,7 +345,71 @@ unsigned long __init node_memmap_size_by
/*
* zone_idx() returns 0 for the ZONE_DMA zone, 1 for the ZONE_NORMAL zone, etc.
*/
-#define zone_idx(zone) ((zone) - (zone)->zone_pgdat->node_zones)
+#define zone_idx(zone) (real_zone(zone) - (zone)->zone_pgdat->node_zones)
+
+#ifdef CONFIG_PSEUDO_ZONE
+#define MAX_NR_PZONES 1024
+
+struct pzone_table {
+ struct zone *zone;
+ struct list_head list;
+};
+
+extern struct pzone_table pzone_table[];
+
+void read_lock_nr_zones(void);
+void read_unlock_nr_zones(void);
+struct zone *pzone_create(struct zone *z, char *name, int npages);
+void pzone_destroy(struct zone *z);
+int pzone_set_numpages(struct zone *z, int npages);
+
+static inline void zone_init_pzone_link(struct zone *z)
+{
+ z->parent = NULL;
+ INIT_LIST_HEAD(&z->children);
+ INIT_LIST_HEAD(&z->sibling);
+ z->pzone_idx = -1;
+}
+
+static inline int zone_is_pseudo(struct zone *z)
+{
+ return (z->parent != NULL);
+}
+
+static inline struct zone *real_zone(struct zone *z)
+{
+ if (z->parent)
+ return z->parent;
+ return z;
+}
+
+static inline struct zone *pzone_next_in_zone(struct zone *z)
+{
+ if (zone_is_pseudo(z)) {
+ if (z->sibling.next == &z->parent->children)
+ z = NULL;
+ else
+ z = list_entry(z->sibling.next, struct zone, sibling);
+ } else {
+ if (list_empty(&z->children))
+ z = NULL;
+ else
+ z = list_entry(z->children.next, struct zone, sibling);
+ }
+
+ return z;
+}
+
+#else
+#define MAX_NR_PZONES 0
+
+static inline void read_lock_nr_zones(void) {}
+static inline void read_unlock_nr_zones(void) {}
+static inline void zone_init_pzone_link(struct zone *z) {}
+
+static inline int zone_is_pseudo(struct zone *z) { return 0; }
+static inline struct zone *real_zone(struct zone *z) { return z; }
+#endif
/**
* for_each_pgdat - helper macro to iterate over all nodes
@@ -360,6 +433,19 @@ static inline struct zone *next_zone(str
{
pg_data_t *pgdat = zone->zone_pgdat;
+#ifdef CONFIG_PSEUDO_ZONE
+ if (zone_is_pseudo(zone)) {
+ if (zone->sibling.next != &zone->parent->children)
+ return list_entry(zone->sibling.next, struct zone,
+ sibling);
+ else
+ zone = zone->parent;
+ } else {
+ if (!list_empty(&zone->children))
+ return list_entry(zone->children.next, struct zone,
+ sibling);
+ }
+#endif
if (zone < pgdat->node_zones + MAX_NR_ZONES - 1)
zone++;
else if (pgdat->pgdat_next) {
@@ -371,6 +457,31 @@ static inline struct zone *next_zone(str
return zone;
}
+static inline struct zone *next_zone_in_node(struct zone *zone, int len)
+{
+ pg_data_t *pgdat = zone->zone_pgdat;
+
+#ifdef CONFIG_PSEUDO_ZONE
+ if (zone_is_pseudo(zone)) {
+ if (zone->sibling.next != &zone->parent->children)
+ return list_entry(zone->sibling.next, struct zone,
+ sibling);
+ else
+ zone = zone->parent;
+ } else {
+ if (!list_empty(&zone->children))
+ return list_entry(zone->children.next, struct zone,
+ sibling);
+ }
+#endif
+ if (zone < pgdat->node_zones + len - 1)
+ zone++;
+ else
+ zone = NULL;
+
+ return zone;
+}
+
/**
* for_each_zone - helper macro to iterate over all memory zones
* @zone - pointer to struct zone variable
@@ -389,6 +500,9 @@ static inline struct zone *next_zone(str
#define for_each_zone(zone) \
for (zone = pgdat_list->node_zones; zone; zone = next_zone(zone))
+#define for_each_zone_in_node(zone, pgdat, len) \
+ for (zone = pgdat->node_zones; zone; zone = next_zone_in_node(zone, len))
+
static inline int is_highmem_idx(int idx)
{
return (idx == ZONE_HIGHMEM);
@@ -406,11 +520,13 @@ static inline int is_normal_idx(int idx)
*/
static inline int is_highmem(struct zone *zone)
{
+ zone = real_zone(zone);
return zone == zone->zone_pgdat->node_zones + ZONE_HIGHMEM;
}
static inline int is_normal(struct zone *zone)
{
+ zone = real_zone(zone);
return zone == zone->zone_pgdat->node_zones + ZONE_NORMAL;
}
diff -urNp a/include/linux/swap.h b/include/linux/swap.h
--- a/include/linux/swap.h 2006-01-03 12:21:10.000000000 +0900
+++ b/include/linux/swap.h 2006-01-19 15:23:00.000000000 +0900
@@ -171,6 +171,8 @@ extern int rotate_reclaimable_page(struc
extern void swap_setup(void);
/* linux/mm/vmscan.c */
+extern int isolate_lru_pages(int, struct list_head *, struct list_head *,
+ int *);
extern int try_to_free_pages(struct zone **, gfp_t);
extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
extern int shrink_all_memory(int);
diff -urNp a/mm/Kconfig b/mm/Kconfig
--- a/mm/Kconfig 2006-01-03 12:21:10.000000000 +0900
+++ b/mm/Kconfig 2006-01-19 15:24:13.000000000 +0900
@@ -132,3 +132,9 @@ config SPLIT_PTLOCK_CPUS
default "4096" if ARM && !CPU_CACHE_VIPT
default "4096" if PARISC && !PA20
default "4"
+
+config PSEUDO_ZONE
+ bool "Pseudo zone support"
+ help
+ This option provides pseudo zone creation from a non-pseudo zone.
+ Pseudo zones could be used for memory resource management.
diff -urNp a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c 2006-01-03 12:21:10.000000000 +0900
+++ b/mm/page_alloc.c 2006-01-19 15:23:00.000000000 +0900
@@ -309,6 +309,14 @@ static inline void __free_pages_bulk (st
BUG_ON(bad_range(zone, page));
zone->free_pages += order_size;
+
+ /*
+ * Do not concatenate a page in the pzone.
+ * Order>0 pages are never allocated from pzones (so far?).
+ */
+ if (unlikely(page_in_pzone(page)))
+ goto skip_buddy;
+
while (order < MAX_ORDER-1) {
unsigned long combined_idx;
struct free_area *area;
@@ -321,6 +329,7 @@ static inline void __free_pages_bulk (st
break;
if (!page_is_buddy(buddy, order))
break; /* Move the buddy up one level. */
+ BUG_ON(page_zone(page) != page_zone(buddy));
list_del(&buddy->lru);
area = zone->free_area + order;
area->nr_free--;
@@ -330,6 +339,8 @@ static inline void __free_pages_bulk (st
order++;
}
set_page_order(page, order);
+
+skip_buddy: /* Keep order and PagePrivate unset for pzone pages. */
list_add(&page->lru, &zone->free_area[order].free_list);
zone->free_area[order].nr_free++;
}
@@ -565,6 +576,7 @@ void drain_remote_pages(void)
unsigned long flags;
local_irq_save(flags);
+ read_lock_nr_zones();
for_each_zone(zone) {
struct per_cpu_pageset *pset;
@@ -582,30 +594,37 @@ void drain_remote_pages(void)
&pcp->list, 0);
}
}
+ read_unlock_nr_zones();
local_irq_restore(flags);
}
#endif
-#if defined(CONFIG_PM) || defined(CONFIG_HOTPLUG_CPU)
-static void __drain_pages(unsigned int cpu)
+#if defined(CONFIG_PM) || defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_PSEUDO_ZONE)
+static void __drain_zone_pages(struct zone *zone, int cpu)
{
- struct zone *zone;
+ struct per_cpu_pageset *pset;
int i;
- for_each_zone(zone) {
- struct per_cpu_pageset *pset;
-
- pset = zone_pcp(zone, cpu);
- for (i = 0; i < ARRAY_SIZE(pset->pcp); i++) {
- struct per_cpu_pages *pcp;
+ pset = zone_pcp(zone, cpu);
+ for (i = 0; i < ARRAY_SIZE(pset->pcp); i++) {
+ struct per_cpu_pages *pcp;
- pcp = &pset->pcp[i];
- pcp->count -= free_pages_bulk(zone, pcp->count,
- &pcp->list, 0);
- }
+ pcp = &pset->pcp[i];
+ pcp->count -= free_pages_bulk(zone, pcp->count,
+ &pcp->list, 0);
}
}
-#endif /* CONFIG_PM || CONFIG_HOTPLUG_CPU */
+
+static void __drain_pages(unsigned int cpu)
+{
+ struct zone *zone;
+
+ read_lock_nr_zones();
+ for_each_zone(zone)
+ __drain_zone_pages(zone, cpu);
+ read_unlock_nr_zones();
+}
+#endif /* CONFIG_PM || CONFIG_HOTPLUG_CPU || CONFIG_PSEUDO_ZONE */
#ifdef CONFIG_PM
@@ -1080,8 +1099,10 @@ unsigned int nr_free_pages(void)
unsigned int sum = 0;
struct zone *zone;
+ read_lock_nr_zones();
for_each_zone(zone)
sum += zone->free_pages;
+ read_unlock_nr_zones();
return sum;
}
@@ -1331,6 +1352,7 @@ void show_free_areas(void)
unsigned long free;
struct zone *zone;
+ read_lock_nr_zones();
for_each_zone(zone) {
show_node(zone);
printk("%s per-cpu:", zone->name);
@@ -1427,6 +1449,7 @@ void show_free_areas(void)
spin_unlock_irqrestore(&zone->lock, flags);
printk("= %lukB\n", K(total));
}
+ read_unlock_nr_zones();
show_swap_cache_info();
}
@@ -1836,6 +1859,7 @@ static int __devinit process_zones(int c
{
struct zone *zone, *dzone;
+ read_lock_nr_zones();
for_each_zone(zone) {
zone->pageset[cpu] = kmalloc_node(sizeof(struct per_cpu_pageset),
@@ -1845,6 +1869,7 @@ static int __devinit process_zones(int c
setup_pageset(zone->pageset[cpu], zone_batchsize(zone));
}
+ read_unlock_nr_zones();
return 0;
bad:
@@ -1854,6 +1879,7 @@ bad:
kfree(dzone->pageset[cpu]);
dzone->pageset[cpu] = NULL;
}
+ read_unlock_nr_zones();
return -ENOMEM;
}
@@ -1862,12 +1888,14 @@ static inline void free_zone_pagesets(in
#ifdef CONFIG_NUMA
struct zone *zone;
+ read_lock_nr_zones();
for_each_zone(zone) {
struct per_cpu_pageset *pset = zone_pcp(zone, cpu);
zone_pcp(zone, cpu) = NULL;
kfree(pset);
}
+ read_unlock_nr_zones();
#endif
}
@@ -2006,6 +2034,7 @@ static void __init free_area_init_core(s
zone->temp_priority = zone->prev_priority = DEF_PRIORITY;
+ zone_init_pzone_link(zone);
zone_pcp_init(zone);
INIT_LIST_HEAD(&zone->active_list);
INIT_LIST_HEAD(&zone->inactive_list);
@@ -2111,11 +2140,11 @@ static int frag_show(struct seq_file *m,
{
pg_data_t *pgdat = (pg_data_t *)arg;
struct zone *zone;
- struct zone *node_zones = pgdat->node_zones;
unsigned long flags;
int order;
- for (zone = node_zones; zone - node_zones < MAX_NR_ZONES; ++zone) {
+ read_lock_nr_zones();
+ for_each_zone_in_node(zone, pgdat, MAX_NR_ZONES) {
if (!zone->present_pages)
continue;
@@ -2126,6 +2155,7 @@ static int frag_show(struct seq_file *m,
spin_unlock_irqrestore(&zone->lock, flags);
seq_putc(m, '\n');
}
+ read_unlock_nr_zones();
return 0;
}
@@ -2143,10 +2173,10 @@ static int zoneinfo_show(struct seq_file
{
pg_data_t *pgdat = arg;
struct zone *zone;
- struct zone *node_zones = pgdat->node_zones;
unsigned long flags;
- for (zone = node_zones; zone - node_zones < MAX_NR_ZONES; zone++) {
+ read_lock_nr_zones();
+ for_each_zone_in_node(zone, pgdat, MAX_NR_ZONES) {
int i;
if (!zone->present_pages)
@@ -2234,6 +2264,7 @@ static int zoneinfo_show(struct seq_file
spin_unlock_irqrestore(&zone->lock, flags);
seq_putc(m, '\n');
}
+ read_unlock_nr_zones();
return 0;
}
@@ -2414,6 +2445,45 @@ static void setup_per_zone_lowmem_reserv
}
}
+static void setup_zone_pages_min(struct zone *zone, unsigned long lowmem_pages)
+{
+ unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
+ unsigned long flags;
+ unsigned long tmp;
+
+ spin_lock_irqsave(&zone->lru_lock, flags);
+ tmp = (pages_min * zone->present_pages) / lowmem_pages;
+ if (is_highmem(zone)) {
+ /*
+ * __GFP_HIGH and PF_MEMALLOC allocations usually don't
+ * need highmem pages, so cap pages_min to a small
+ * value here.
+ *
+ * The (pages_high-pages_low) and (pages_low-pages_min)
+ * deltas controls asynch page reclaim, and so should
+ * not be capped for highmem.
+ */
+ int min_pages;
+
+ min_pages = zone->present_pages / 1024;
+ if (min_pages < SWAP_CLUSTER_MAX)
+ min_pages = SWAP_CLUSTER_MAX;
+ if (min_pages > 128)
+ min_pages = 128;
+ zone->pages_min = min_pages;
+ } else {
+ /*
+ * If it's a lowmem zone, reserve a number of pages
+ * proportionate to the zone's size.
+ */
+ zone->pages_min = tmp;
+ }
+
+ zone->pages_low = zone->pages_min + tmp / 4;
+ zone->pages_high = zone->pages_min + tmp / 2;
+ spin_unlock_irqrestore(&zone->lru_lock, flags);
+}
+
/*
* setup_per_zone_pages_min - called when min_free_kbytes changes. Ensures
* that the pages_{min,low,high} values for each zone are set correctly
@@ -2421,51 +2491,19 @@ static void setup_per_zone_lowmem_reserv
*/
void setup_per_zone_pages_min(void)
{
- unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
unsigned long lowmem_pages = 0;
struct zone *zone;
- unsigned long flags;
+ read_lock_nr_zones();
/* Calculate total number of !ZONE_HIGHMEM pages */
for_each_zone(zone) {
if (!is_highmem(zone))
lowmem_pages += zone->present_pages;
}
- for_each_zone(zone) {
- unsigned long tmp;
- spin_lock_irqsave(&zone->lru_lock, flags);
- tmp = (pages_min * zone->present_pages) / lowmem_pages;
- if (is_highmem(zone)) {
- /*
- * __GFP_HIGH and PF_MEMALLOC allocations usually don't
- * need highmem pages, so cap pages_min to a small
- * value here.
- *
- * The (pages_high-pages_low) and (pages_low-pages_min)
- * deltas controls asynch page reclaim, and so should
- * not be capped for highmem.
- */
- int min_pages;
-
- min_pages = zone->present_pages / 1024;
- if (min_pages < SWAP_CLUSTER_MAX)
- min_pages = SWAP_CLUSTER_MAX;
- if (min_pages > 128)
- min_pages = 128;
- zone->pages_min = min_pages;
- } else {
- /*
- * If it's a lowmem zone, reserve a number of pages
- * proportionate to the zone's size.
- */
- zone->pages_min = tmp;
- }
-
- zone->pages_low = zone->pages_min + tmp / 4;
- zone->pages_high = zone->pages_min + tmp / 2;
- spin_unlock_irqrestore(&zone->lru_lock, flags);
- }
+ for_each_zone(zone)
+ setup_zone_pages_min(zone, lowmem_pages);
+ read_unlock_nr_zones();
}
/*
@@ -2629,3 +2667,702 @@ void *__init alloc_large_system_hash(con
return table;
}
+
+#ifdef CONFIG_PSEUDO_ZONE
+
+#include <linux/mm_inline.h>
+
+struct pzone_table pzone_table[MAX_NR_PZONES];
+EXPORT_SYMBOL(pzone_table);
+
+static struct list_head pzone_freelist = LIST_HEAD_INIT(pzone_freelist);
+
+/*
+ * Protection between pzone_destroy() and pzone list lookups.
+ * These routines don't guard references from zonelists used in the page
+ * allocator.
+ * pzone maintainer (i.e. the class support routine) should remove the pzone
+ * from a zonelist (and probably make sure that there are no tasks in
+ * that class), then destroy the pzone.
+ */
+static spinlock_t nr_zones_lock = SPIN_LOCK_UNLOCKED;
+static int zones_readers = 0;
+static DECLARE_WAIT_QUEUE_HEAD(zones_waitqueue);
+
+static struct workqueue_struct *pzone_drain_wq;
+static DEFINE_PER_CPU(struct work_struct, pzone_drain_work);
+
+void read_lock_nr_zones(void)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&nr_zones_lock, flags);
+ zones_readers++;
+ spin_unlock_irqrestore(&nr_zones_lock, flags);
+}
+
+void read_unlock_nr_zones(void)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&nr_zones_lock, flags);
+ zones_readers--;
+ if ((zones_readers == 0) && waitqueue_active(&zones_waitqueue))
+ wake_up(&zones_waitqueue);
+ spin_unlock_irqrestore(&nr_zones_lock, flags);
+}
+
+static void write_lock_nr_zones(unsigned long *flagsp)
+{
+ DEFINE_WAIT(wait);
+
+ spin_lock_irqsave(&nr_zones_lock, *flagsp);
+ while (zones_readers) {
+ spin_unlock_irqrestore(&nr_zones_lock, *flagsp);
+ prepare_to_wait(&zones_waitqueue, &wait,
+ TASK_UNINTERRUPTIBLE);
+ schedule();
+ finish_wait(&zones_waitqueue, &wait);
+ spin_lock_irqsave(&nr_zones_lock, *flagsp);
+ }
+}
+
+static void write_unlock_nr_zones(unsigned long *flagsp)
+{
+ spin_unlock_irqrestore(&nr_zones_lock, *flagsp);
+}
+
+static int pzone_table_register(struct zone *z)
+{
+ struct pzone_table *t;
+ unsigned long flags;
+
+ write_lock_nr_zones(&flags);
+ if (list_empty(&pzone_freelist)) {
+ write_unlock_nr_zones(&flags);
+ return -ENOMEM;
+ }
+
+ t = list_entry(pzone_freelist.next, struct pzone_table, list);
+ list_del(&t->list);
+ z->pzone_idx = t - pzone_table;
+ t->zone = z;
+ write_unlock_nr_zones(&flags);
+
+ return 0;
+}
+
+static void pzone_table_unregister(struct zone *z)
+{
+ struct pzone_table *t;
+ unsigned long flags;
+
+ write_lock_nr_zones(&flags);
+ t = &pzone_table[z->pzone_idx];
+ t->zone = NULL;
+ list_add(&t->list, &pzone_freelist);
+ write_unlock_nr_zones(&flags);
+}
+
+static void pzone_parent_register(struct zone *z, struct zone *parent)
+{
+ unsigned long flags;
+
+ write_lock_nr_zones(&flags);
+ list_add(&z->sibling, &parent->children);
+ write_unlock_nr_zones(&flags);
+}
+
+static void pzone_parent_unregister(struct zone *z)
+{
+ unsigned long flags;
+
+ write_lock_nr_zones(&flags);
+ list_del(&z->sibling);
+ write_unlock_nr_zones(&flags);
+}
+
+/*
+ * pzone alloc/free routines
+ */
+#ifdef CONFIG_NUMA
+static int pzone_setup_pagesets(struct zone *z)
+{
+ struct per_cpu_pageset *pageset;
+ int batch;
+ int nid;
+ int i;
+
+ zone_pcp_init(z);
+
+ nid = z->zone_pgdat->node_id;
+ batch = zone_batchsize(z);
+
+ lock_cpu_hotplug();
+ for_each_online_cpu(i) {
+ pageset = kmalloc_node(sizeof(*pageset), GFP_KERNEL, nid);
+ if (!pageset)
+ goto bad;
+ z->pageset[i] = pageset;
+ setup_pageset(pageset, batch);
+ }
+ unlock_cpu_hotplug();
+
+ return 0;
+bad:
+ for (i = 0; i < NR_CPUS; i++) {
+ if (z->pageset[i] != &boot_pageset[i])
+ kfree(z->pageset[i]);
+ z->pageset[i] = NULL;
+ }
+ unlock_cpu_hotplug();
+
+ return -ENOMEM;
+}
+
+static void pzone_free_pagesets(struct zone *z)
+{
+ int i;
+
+ for (i = 0; i < NR_CPUS; i++) {
+ if (z->pageset[i] && (zone_pcp(z, i) != &boot_pageset[i])) {
+ BUG_ON(zone_pcp(z, i)->pcp[0].count != 0);
+ BUG_ON(zone_pcp(z, i)->pcp[1].count != 0);
+ kfree(zone_pcp(z, i));
+ }
+ zone_pcp(z, i) = NULL;
+ }
+}
+#else /* !CONFIG_NUMA */
+static inline int pzone_setup_pagesets(struct zone *z)
+{
+ int batch;
+ int i;
+
+ batch = zone_batchsize(z);
+ for (i = 0; i < NR_CPUS; i++)
+ setup_pageset(zone_pcp(z, i), batch);
+
+ return 0;
+}
+
+static inline void pzone_free_pagesets(struct zone *z)
+{
+ int i;
+
+ for (i = 0; i < NR_CPUS; i++) {
+ BUG_ON(zone_pcp(z, i)->pcp[0].count != 0);
+ BUG_ON(zone_pcp(z, i)->pcp[1].count != 0);
+ }
+}
+#endif /* CONFIG_NUMA */
+
+static inline void pzone_setup_page_flags(struct zone *z,
+ struct page *page)
+{
+ page->flags &= ~(ZONETABLE_MASK << ZONETABLE_PGSHIFT);
+ page->flags |= ((unsigned long)z->pzone_idx << ZONETABLE_PGSHIFT);
+ page->flags |= 1UL << PZONE_BIT_PGSHIFT;
+}
+
+static inline void pzone_restore_page_flags(struct zone *parent,
+ struct page *page)
+{
+ set_page_links(page, zone_idx(parent), parent->zone_pgdat->node_id,
+ page_to_pfn(page));
+ page->flags &= ~(1UL << PZONE_BIT_PGSHIFT);
+}
+
+/*
+ * pzone_bad_range(): implemented for debugging in place of bad_range()
+ * in order to distinguish what caused a crash.
+ */
+static int pzone_bad_range(struct zone *zone, struct page *page)
+{
+ if (page_to_pfn(page) >= zone->zone_start_pfn + zone->spanned_pages)
+ BUG();
+ if (page_to_pfn(page) < zone->zone_start_pfn)
+ BUG();
+#ifdef CONFIG_HOLES_IN_ZONE
+ if (!pfn_valid(page_to_pfn(page)))
+ BUG();
+#endif
+ if (zone != page_zone(page))
+ BUG();
+ return 0;
+}
+
+static void pzone_drain(void *arg)
+{
+ lru_add_drain();
+}
+
+static void pzone_punt_drain(void *arg)
+{
+ struct work_struct *wp;
+
+ wp = &get_cpu_var(pzone_drain_work);
+ PREPARE_WORK(wp, pzone_drain, arg);
+ /* queue_work() checks whether the work is used or not. */
+ queue_work(pzone_drain_wq, wp);
+ put_cpu_var(pzone_drain_work);
+}
+
+static void pzone_flush_percpu(void *arg)
+{
+ struct zone *z = arg;
+ unsigned long flags;
+ int cpu;
+
+ /*
+ * lru_add_drain() must not be called from interrupt context
+ * (LRU pagevecs are interrupt unsafe).
+ */
+
+ local_irq_save(flags);
+ cpu = smp_processor_id();
+ pzone_punt_drain(arg);
+ __drain_zone_pages(z, cpu);
+ local_irq_restore(flags);
+}
+
+static int pzone_flush_lru(struct zone *z, struct zone *parent,
+ struct list_head *clist, unsigned long *cnr,
+ int block)
+{
+ unsigned long flags;
+ struct page *page;
+ struct list_head list;
+ int n, moved, scan;
+
+ INIT_LIST_HEAD(&list);
+
+ spin_lock_irqsave(&z->lru_lock, flags);
+ n = isolate_lru_pages(*cnr, clist, &list, &scan);
+ *cnr -= n;
+ spin_unlock_irqrestore(&z->lru_lock, flags);
+
+ moved = 0;
+ while (!list_empty(&list) && n-- > 0) {
+ page = list_entry(list.prev, struct page, lru);
+ list_del(&page->lru);
+
+ if (block) {
+ lock_page(page);
+ wait_on_page_writeback(page);
+ } else {
+ if (TestSetPageLocked(page))
+ goto goaround;
+
+ /* Make sure the writeback bit stays clear. */
+ if (PageWriteback(page))
+ goto goaround_pagelocked;
+ }
+
+ /* Now we can safely modify the flags field. */
+ pzone_restore_page_flags(parent, page);
+ unlock_page(page);
+
+ spin_lock_irqsave(&parent->lru_lock, flags);
+ if (TestSetPageLRU(page))
+ BUG();
+
+ __put_page(page);
+ if (PageActive(page))
+ add_page_to_active_list(parent, page);
+ else
+ add_page_to_inactive_list(parent, page);
+ spin_unlock_irqrestore(&parent->lru_lock, flags);
+
+ moved++;
+ continue;
+
+goaround_pagelocked:
+ unlock_page(page);
+goaround:
+ spin_lock_irqsave(&z->lru_lock, flags);
+ __put_page(page);
+ if (TestSetPageLRU(page))
+ BUG();
+ list_add(&page->lru, clist);
+ ++*cnr;
+ spin_unlock_irqrestore(&z->lru_lock, flags);
+ }
+
+ return moved;
+}
+
+static void pzone_flush_free_area(struct zone *z)
+{
+ struct free_area *area;
+ struct page *page;
+ struct list_head list;
+ unsigned long flags;
+ int order;
+
+ INIT_LIST_HEAD(&list);
+
+ spin_lock_irqsave(&z->lock, flags);
+ area = &z->free_area[0];
+ while (!list_empty(&area->free_list)) {
+ page = list_entry(area->free_list.next, struct page, lru);
+ list_del(&page->lru);
+ area->nr_free--;
+ z->free_pages--;
+ z->present_pages--;
+ spin_unlock_irqrestore(&z->lock, flags);
+ pzone_restore_page_flags(z->parent, page);
+ pzone_bad_range(z->parent, page);
+ list_add(&page->lru, &list);
+ free_pages_bulk(z->parent, 1, &list, 0);
+
+ spin_lock_irqsave(&z->lock, flags);
+ }
+
+ BUG_ON(area->nr_free != 0);
+ spin_unlock_irqrestore(&z->lock, flags);
+
+ /* currently pzones support order-0 only; do a sanity check. */
+ spin_lock_irqsave(&z->lock, flags);
+ for (order = 1; order < MAX_ORDER; order++) {
+ area = &z->free_area[order];
+ BUG_ON(area->nr_free != 0);
+ }
+ spin_unlock_irqrestore(&z->lock, flags);
+}
+
+static int pzone_is_empty(struct zone *z)
+{
+ unsigned long flags;
+ int ret = 0;
+ int i;
+
+ spin_lock_irqsave(&z->lock, flags);
+ ret += z->present_pages;
+ ret += z->free_pages;
+ ret += z->free_area[0].nr_free;
+
+ /* would better use smp_call_function for scanning pcp. */
+ for (i = 0; i < NR_CPUS; i++) {
+#ifdef CONFIG_NUMA
+ if (!zone_pcp(z, i) || (zone_pcp(z, i) == &boot_pageset[i]))
+ continue;
+#endif
+ ret += zone_pcp(z, i)->pcp[0].count;
+ ret += zone_pcp(z, i)->pcp[1].count;
+ }
+ spin_unlock_irqrestore(&z->lock, flags);
+
+ spin_lock_irqsave(&z->lru_lock, flags);
+ ret += z->nr_active;
+ ret += z->nr_inactive;
+ spin_unlock_irqrestore(&z->lru_lock, flags);
+
+ return ret == 0;
+}
+
+struct zone *pzone_create(struct zone *parent, char *name, int npages)
+{
+ struct zonelist zonelist;
+ struct zone *z;
+ struct page *page;
+ struct list_head *l;
+ unsigned long flags;
+ int len;
+ int i;
+
+ if (npages > parent->present_pages)
+ return NULL;
+
+ z = kmalloc_node(sizeof(*z), GFP_KERNEL, parent->zone_pgdat->node_id);
+ if (!z)
+ goto bad1;
+ memset(z, 0, sizeof(*z));
+
+ z->present_pages = z->free_pages = npages;
+ z->parent = parent;
+
+ spin_lock_init(&z->lock);
+ spin_lock_init(&z->lru_lock);
+ INIT_LIST_HEAD(&z->active_list);
+ INIT_LIST_HEAD(&z->inactive_list);
+
+ INIT_LIST_HEAD(&z->children);
+ INIT_LIST_HEAD(&z->sibling);
+
+ z->zone_pgdat = parent->zone_pgdat;
+ z->zone_mem_map = parent->zone_mem_map;
+ z->zone_start_pfn = parent->zone_start_pfn;
+ z->spanned_pages = parent->spanned_pages;
+ z->temp_priority = z->prev_priority = DEF_PRIORITY;
+
+ /* use wait_table of parents. */
+ z->wait_table = parent->wait_table;
+ z->wait_table_size = parent->wait_table_size;
+ z->wait_table_bits = parent->wait_table_bits;
+
+ len = strlen(name);
+ z->name = kmalloc_node(len + 1, GFP_KERNEL,
+ parent->zone_pgdat->node_id);
+ if (!z->name)
+ goto bad2;
+ strcpy(z->name, name);
+
+ if (pzone_setup_pagesets(z) < 0)
+ goto bad3;
+
+ /* no lowmem for the pseudo zone. leave lowmem_reserve all-0. */
+
+ zone_init_free_lists(z->zone_pgdat, z, z->spanned_pages);
+
+ /* setup a fake zonelist for allocating pages only from the parent. */
+ memset(&zonelist, 0, sizeof(zonelist));
+ zonelist.zones[0] = parent;
+ for (i = 0; i < npages; i++) {
+ page = __alloc_pages(GFP_KERNEL, 0, &zonelist);
+ if (!page)
+ goto bad4;
+ set_page_count(page, 0);
+ list_add(&page->lru, &z->free_area[0].free_list);
+ z->free_area[0].nr_free++;
+ }
+
+ if (pzone_table_register(z))
+ goto bad4;
+
+ list_for_each(l, &z->free_area[0].free_list) {
+ page = list_entry(l, struct page, lru);
+ pzone_setup_page_flags(z, page);
+ }
+
+ spin_lock_irqsave(&parent->lock, flags);
+ parent->present_pages -= npages;
+ spin_unlock_irqrestore(&parent->lock, flags);
+
+ setup_per_zone_pages_min();
+ setup_per_zone_lowmem_reserve();
+ pzone_parent_register(z, parent);
+
+ return z;
+bad4:
+ while (!list_empty(&z->free_area[0].free_list)) {
+ page = list_entry(z->free_area[0].free_list.next,
+ struct page, lru);
+ list_del(&page->lru);
+ pzone_restore_page_flags(parent, page);
+ set_page_count(page, 1);
+ __free_pages(page, 0);
+ }
+
+ pzone_free_pagesets(z);
+bad3:
+ if (z->name)
+ kfree(z->name);
+bad2:
+ kfree(z);
+bad1:
+ setup_per_zone_pages_min();
+ setup_per_zone_lowmem_reserve();
+
+ return NULL;
+}
+
+#define PZONE_FLUSH_LOOP_COUNT 8
+
+/*
+ * Destroy a pseudo zone.  The caller must make sure that no one still
+ * references this pseudo zone.
+ */
+void pzone_destroy(struct zone *z)
+{
+ struct zone *parent;
+ unsigned long flags;
+ unsigned long present;
+ int freed;
+ int retrycnt = 0;
+
+ parent = z->parent;
+ present = z->present_pages;
+ pzone_parent_unregister(z);
+retry:
+ /* drain pages in per-cpu pageset to free_area */
+ smp_call_function(pzone_flush_percpu, z, 0, 1);
+ pzone_flush_percpu(z);
+
+ /* drain pages in the LRU list. */
+ freed = pzone_flush_lru(z, parent, &z->active_list, &z->nr_active,
+ retrycnt > 0);
+ spin_lock_irqsave(&z->lock, flags);
+ z->present_pages -= freed;
+ spin_unlock_irqrestore(&z->lock, flags);
+
+ freed = pzone_flush_lru(z, parent, &z->inactive_list, &z->nr_inactive,
+ retrycnt > 0);
+ spin_lock_irqsave(&z->lock, flags);
+ z->present_pages -= freed;
+ spin_unlock_irqrestore(&z->lock, flags);
+
+ pzone_flush_free_area(z);
+
+ if (!pzone_is_empty(z)) {
+ retrycnt++;
+ if (retrycnt > PZONE_FLUSH_LOOP_COUNT) {
+ BUG();
+ } else {
+ flush_workqueue(pzone_drain_wq);
+ set_current_state(TASK_UNINTERRUPTIBLE);
+ schedule_timeout(HZ);
+ goto retry;
+ }
+ }
+
+ spin_lock_irqsave(&parent->lock, flags);
+ parent->present_pages += present;
+ spin_unlock_irqrestore(&parent->lock, flags);
+
+ flush_workqueue(pzone_drain_wq);
+ pzone_table_unregister(z);
+ pzone_free_pagesets(z);
+ kfree(z->name);
+ kfree(z);
+
+ setup_per_zone_pages_min();
+ setup_per_zone_lowmem_reserve();
+}
+
+extern int shrink_zone_memory(struct zone *zone, int nr_pages);
+
+static int pzone_move_free_pages(struct zone *dst, struct zone *src,
+ int npages)
+{
+ struct zonelist zonelist;
+ struct list_head pagelist;
+ struct page *page;
+ unsigned long flags;
+ int err;
+ int i;
+
+ err = 0;
+ spin_lock_irqsave(&src->lock, flags);
+ if (npages > src->present_pages)
+ err = -ENOMEM;
+ spin_unlock_irqrestore(&src->lock, flags);
+ if (err)
+ return err;
+
+ smp_call_function(pzone_flush_percpu, src, 0, 1);
+ pzone_flush_percpu(src);
+
+ INIT_LIST_HEAD(&pagelist);
+ memset(&zonelist, 0, sizeof(zonelist));
+ zonelist.zones[0] = src;
+ for (i = 0; i < npages; i++) {
+ /*
+ * XXX to prevent myself from being arrested by oom-killer...
+ * should be replaced with cleaner code.
+ */
+ if (src->free_pages < npages - i) {
+ shrink_zone_memory(src, npages - i);
+ smp_call_function(pzone_flush_percpu, src, 0, 1);
+ pzone_flush_percpu(src);
+ blk_congestion_wait(WRITE, HZ/50);
+ }
+
+ page = __alloc_pages(GFP_KERNEL, 0, &zonelist);
+ if (!page) {
+ err = -ENOMEM;
+ goto bad;
+ }
+ list_add(&page->lru, &pagelist);
+ }
+
+ while (!list_empty(&pagelist)) {
+ page = list_entry(pagelist.next, struct page, lru);
+ list_del(&page->lru);
+ if (zone_is_pseudo(dst))
+ pzone_setup_page_flags(dst, page);
+ else
+ pzone_restore_page_flags(dst, page);
+
+ set_page_count(page, 1);
+ spin_lock_irqsave(&dst->lock, flags);
+ dst->present_pages++;
+ spin_unlock_irqrestore(&dst->lock, flags);
+ __free_pages(page, 0);
+ }
+
+ spin_lock_irqsave(&src->lock, flags);
+ src->present_pages -= npages;
+ spin_unlock_irqrestore(&src->lock, flags);
+
+ return 0;
+bad:
+ while (!list_empty(&pagelist)) {
+ page = list_entry(pagelist.next, struct page, lru);
+ list_del(&page->lru);
+ __free_pages(page, 0);
+ }
+
+ return err;
+}
+
+int pzone_set_numpages(struct zone *z, int npages)
+{
+ struct zone *src, *dst;
+ unsigned long flags;
+ int err;
+ int n;
+
+ /*
+ * This function must not be called concurrently; the caller must
+ * make sure of that.
+ */
+ if (z->present_pages == npages) {
+ return 0;
+ } else if (z->present_pages > npages) {
+ n = z->present_pages - npages;
+ src = z;
+ dst = z->parent;
+ } else {
+ n = npages - z->present_pages;
+ src = z->parent;
+ dst = z;
+ }
+
+ /* XXX Preventing oom-killer from complaining */
+ spin_lock_irqsave(&z->lock, flags);
+ z->pages_min = z->pages_low = z->pages_high = 0;
+ spin_unlock_irqrestore(&z->lock, flags);
+
+ err = pzone_move_free_pages(dst, src, n);
+ setup_per_zone_pages_min();
+ setup_per_zone_lowmem_reserve();
+
+ return err;
+}
+
+static int pzone_init(void)
+{
+ struct work_struct *wp;
+ int i;
+
+ pzone_drain_wq = create_workqueue("pzone");
+ if (!pzone_drain_wq) {
+ printk(KERN_ERR "pzone: create_workqueue failed.\n");
+ return -ENOMEM;
+ }
+
+ for (i = 0; i < NR_CPUS; i++) {
+ wp = &per_cpu(pzone_drain_work, i);
+ INIT_WORK(wp, pzone_drain, NULL);
+ }
+
+ for (i = 0; i < MAX_NR_PZONES; i++)
+ list_add_tail(&pzone_table[i].list, &pzone_freelist);
+
+ return 0;
+}
+
+__initcall(pzone_init);
+
+#endif /* CONFIG_PSEUDO_ZONE */
diff -urNp a/mm/shmem.c b/mm/shmem.c
--- a/mm/shmem.c 2006-01-03 12:21:10.000000000 +0900
+++ b/mm/shmem.c 2006-01-19 15:23:00.000000000 +0900
@@ -366,7 +366,7 @@ static swp_entry_t *shmem_swp_alloc(stru
}
spin_unlock(&info->lock);
- page = shmem_dir_alloc(mapping_gfp_mask(inode->i_mapping) | __GFP_ZERO);
+ page = shmem_dir_alloc(mapping_gfp_mask(inode->i_mapping) | __GFP_ZERO | __GFP_NOLRU);
if (page)
set_page_private(page, 0);
spin_lock(&info->lock);
diff -urNp a/mm/vmscan.c b/mm/vmscan.c
--- a/mm/vmscan.c 2006-01-03 12:21:10.000000000 +0900
+++ b/mm/vmscan.c 2006-01-19 15:23:00.000000000 +0900
@@ -591,8 +591,8 @@ keep:
*
* returns how many pages were moved onto *@dst.
*/
-static int isolate_lru_pages(int nr_to_scan, struct list_head *src,
- struct list_head *dst, int *scanned)
+int isolate_lru_pages(int nr_to_scan, struct list_head *src,
+ struct list_head *dst, int *scanned)
{
int nr_taken = 0;
struct page *page;
@@ -1047,6 +1047,7 @@ static int balance_pgdat(pg_data_t *pgda
int priority;
int i;
int total_scanned, total_reclaimed;
+ struct zone *zone;
struct reclaim_state *reclaim_state = current->reclaim_state;
struct scan_control sc;
@@ -1060,11 +1061,8 @@ loop_again:
inc_page_state(pageoutrun);
- for (i = 0; i < pgdat->nr_zones; i++) {
- struct zone *zone = pgdat->node_zones + i;
-
+ for_each_zone_in_node(zone, pgdat, pgdat->nr_zones)
zone->temp_priority = DEF_PRIORITY;
- }
for (priority = DEF_PRIORITY; priority >= 0; priority--) {
int end_zone = 0; /* Inclusive. 0 = ZONE_DMA */
@@ -1082,7 +1080,24 @@ loop_again:
* zone which needs scanning
*/
for (i = pgdat->nr_zones - 1; i >= 0; i--) {
- struct zone *zone = pgdat->node_zones + i;
+#ifdef CONFIG_PSEUDO_ZONE
+ for (zone = pgdat->node_zones + i; zone;
+ zone = pzone_next_in_zone(zone)) {
+ if (zone->present_pages == 0)
+ continue;
+
+ if (zone->all_unreclaimable &&
+ priority != DEF_PRIORITY)
+ continue;
+
+ if (!zone_watermark_ok(zone, order,
+ zone->pages_high, 0, 0)) {
+ end_zone = i;
+ goto scan;
+ }
+ }
+#else /* !CONFIG_PSEUDO_ZONE */
+ zone = pgdat->node_zones + i;
if (zone->present_pages == 0)
continue;
@@ -1096,17 +1111,15 @@ loop_again:
end_zone = i;
goto scan;
}
+#endif /* !CONFIG_PSEUDO_ZONE */
}
goto out;
} else {
end_zone = pgdat->nr_zones - 1;
}
scan:
- for (i = 0; i <= end_zone; i++) {
- struct zone *zone = pgdat->node_zones + i;
-
+ for_each_zone_in_node(zone, pgdat, end_zone)
lru_pages += zone->nr_active + zone->nr_inactive;
- }
/*
* Now scan the zone in the dma->highmem direction, stopping
@@ -1117,8 +1130,7 @@ scan:
* pages behind kswapd's direction of progress, which would
* cause too much scanning of the lower zones.
*/
- for (i = 0; i <= end_zone; i++) {
- struct zone *zone = pgdat->node_zones + i;
+ for_each_zone_in_node(zone, pgdat, end_zone) {
int nr_slab;
if (zone->present_pages == 0)
@@ -1183,11 +1195,9 @@ scan:
break;
}
out:
- for (i = 0; i < pgdat->nr_zones; i++) {
- struct zone *zone = pgdat->node_zones + i;
-
+ for_each_zone_in_node(zone, pgdat, pgdat->nr_zones)
zone->prev_priority = zone->temp_priority;
- }
+
if (!all_zones_ok) {
cond_resched();
goto loop_again;
@@ -1261,7 +1271,9 @@ static int kswapd(void *p)
}
finish_wait(&pgdat->kswapd_wait, &wait);
+ read_lock_nr_zones();
balance_pgdat(pgdat, 0, order);
+ read_unlock_nr_zones();
}
return 0;
}
@@ -1316,6 +1328,35 @@ int shrink_all_memory(int nr_pages)
}
#endif
+#ifdef CONFIG_PSEUDO_ZONE
+int shrink_zone_memory(struct zone *zone, int nr_pages)
+{
+ struct scan_control sc;
+
+ sc.gfp_mask = GFP_KERNEL;
+ sc.may_writepage = 1;
+ sc.may_swap = 1;
+ sc.nr_mapped = read_page_state(nr_mapped);
+ sc.nr_scanned = 0;
+ sc.nr_reclaimed = 0;
+ sc.priority = 0;
+
+ if (nr_pages < SWAP_CLUSTER_MAX)
+ sc.swap_cluster_max = nr_pages;
+ else
+ sc.swap_cluster_max = SWAP_CLUSTER_MAX;
+
+ sc.nr_to_reclaim = sc.swap_cluster_max;
+ sc.nr_to_scan = sc.swap_cluster_max;
+ sc.nr_mapped = total_memory; /* XXX to make vmscan aggressive */
+ refill_inactive_zone(zone, &sc);
+ sc.nr_to_scan = sc.swap_cluster_max;
+ shrink_cache(zone, &sc);
+
+ return sc.nr_reclaimed;
+}
+#endif
+
#ifdef CONFIG_HOTPLUG_CPU
/* It's optimal to keep kswapds on the same CPUs as their memory, but
not required for correctness. So if the last cpu in a node goes
* [PATCH 2/2] Add CKRM memory resource controller using pzones
2006-01-19 8:04 [PATCH 0/2] Pzone based CKRM memory resource controller KUROSAWA Takahiro
2006-01-19 8:04 ` [PATCH 1/2] Add the pzone KUROSAWA Takahiro
@ 2006-01-19 8:04 ` KUROSAWA Takahiro
2006-01-31 2:30 ` [PATCH 0/8] Pzone based CKRM memory resource controller KUROSAWA Takahiro
2 siblings, 0 replies; 32+ messages in thread
From: KUROSAWA Takahiro @ 2006-01-19 8:04 UTC (permalink / raw)
To: ckrm-tech; +Cc: linux-mm, KUROSAWA Takahiro
This patch implements the CKRM memory resource controller using
pzones. It requires a CKRM-patched source tree.
CKRM patches can be obtained from
http://sourceforge.net/project/showfiles.php?group_id=85838&package_id=163747
The CKRM patches require a configfs-patched source tree:
http://oss.oracle.com/projects/ocfs2/dist/files/patches/2.6.15-rc5/2005-12-14/01_configfs.patch
Signed-off-by: MAEDA Naoaki <maeda.naoaki@jp.fujitsu.com>
---
include/linux/gfp.h | 31 ++
mm/Kconfig | 8
mm/Makefile | 2
mm/mem_rc_pzone.c | 597 ++++++++++++++++++++++++++++++++++++++++++++++++++++
mm/mempolicy.c | 10
5 files changed, 645 insertions(+), 3 deletions(-)
diff -urNp b/include/linux/gfp.h c/include/linux/gfp.h
--- b/include/linux/gfp.h 2006-01-17 10:25:44.000000000 +0900
+++ c/include/linux/gfp.h 2006-01-17 10:04:53.000000000 +0900
@@ -104,12 +104,43 @@ static inline void arch_free_page(struct
extern struct page *
FASTCALL(__alloc_pages(gfp_t, unsigned int, struct zonelist *));
+#ifdef CONFIG_MEM_RC
+static inline int mem_rc_available(gfp_t gfp_mask, unsigned int order)
+{
+ gfp_mask &= GFP_LEVEL_MASK & ~__GFP_HIGHMEM;
+ return gfp_mask == GFP_USER && order == 0;
+}
+
+extern struct page *alloc_page_mem_rc(int nid, gfp_t gfp_mask);
+extern struct zonelist *mem_rc_get_zonelist(int nd, gfp_t gfp_mask,
+ unsigned int order);
+#else
+static inline int mem_rc_available(gfp_t gfp_mask, unsigned int order)
+{
+ return 0;
+}
+
+static inline struct page *alloc_page_mem_rc(int nid, gfp_t gfp_mask)
+{
+ return NULL;
+}
+
+static inline struct zonelist *mem_rc_get_zonelist(int nd, gfp_t gfp_mask,
+ unsigned int order)
+{
+ return NULL;
+}
+#endif
+
static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
unsigned int order)
{
if (unlikely(order >= MAX_ORDER))
return NULL;
+ if (mem_rc_available(gfp_mask, order))
+ return alloc_page_mem_rc(nid, gfp_mask);
+
return __alloc_pages(gfp_mask, order,
NODE_DATA(nid)->node_zonelists + gfp_zone(gfp_mask));
}
diff -urNp b/mm/Kconfig c/mm/Kconfig
--- b/mm/Kconfig 2006-01-17 10:12:56.000000000 +0900
+++ c/mm/Kconfig 2006-01-17 10:05:26.000000000 +0900
@@ -138,3 +138,11 @@ config PSEUDO_ZONE
help
This option provides pseudo zone creation from a non-pseudo zone.
Pseudo zones could be used for memory resource management.
+
+config MEM_RC
+ bool "Memory resource controller"
+ select PSEUDO_ZONE
+ depends on CPUMETER || CKRM
+ help
+ This option lets you control memory resources by using the
+ pseudo zone.
diff -urNp b/mm/Makefile c/mm/Makefile
--- b/mm/Makefile 2006-01-17 10:13:22.000000000 +0900
+++ c/mm/Makefile 2006-01-17 10:04:53.000000000 +0900
@@ -20,3 +20,5 @@ obj-$(CONFIG_SHMEM) += shmem.o
obj-$(CONFIG_TINY_SHMEM) += tiny-shmem.o
obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
obj-$(CONFIG_FS_XIP) += filemap_xip.o
+
+obj-$(CONFIG_MEM_RC) += mem_rc_pzone.o
diff -urNp b/mm/mem_rc_pzone.c c/mm/mem_rc_pzone.c
--- b/mm/mem_rc_pzone.c 1970-01-01 09:00:00.000000000 +0900
+++ c/mm/mem_rc_pzone.c 2006-01-17 10:09:46.000000000 +0900
@@ -0,0 +1,597 @@
+/*
+ * mm/mem_rc_pzone.c
+ *
+ * Memory resource controller by using pzones.
+ *
+ * Copyright 2005 FUJITSU LIMITED
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file COPYING in the main directory of the Linux
+ * distribution for more details.
+ */
+
+#include <linux/config.h>
+#include <linux/stddef.h>
+#include <linux/compiler.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/mm.h>
+#include <linux/mmzone.h>
+#include <linux/cpuset.h>
+#include <linux/bitops.h>
+#include <linux/cpumask.h>
+#include <linux/nodemask.h>
+#include <linux/ckrm_rc.h>
+
+#include <asm/semaphore.h>
+
+#define MEM_RC_METER_BASE 100
+#define MEM_RC_METER_TO_PAGES(_rcd, _node, _zidx, _val) \
+ ((_rcd)->zone_pages[(_node)][(_zidx)] * (_val) / MEM_RC_METER_BASE)
+
+struct mem_rc_domain {
+ struct semaphore sem;
+ nodemask_t nodes;
+ unsigned long *zone_pages[MAX_NUMNODES];
+};
+
+struct mem_rc {
+ unsigned long guarantee;
+ struct mem_rc_domain *rcd;
+ struct zone **zones[MAX_NUMNODES];
+ struct zonelist *zonelists[MAX_NUMNODES];
+};
+
+
+struct ckrm_mem {
+ struct ckrm_class *class; /* the class I belong to */
+ struct ckrm_class *parent; /* parent of the class above. */
+ struct ckrm_shares shares;
+ spinlock_t cnt_lock; /* always grab parent's lock before child's */
+ struct mem_rc *mem_rc; /* mem resource controller */
+ int cnt_total_guarantee; /* total guarantee behind the class */
+};
+
+static struct mem_rc_domain *grcd; /* system wide resource controller domain */
+static struct ckrm_res_ctlr rcbs; /* resource controller callback structure */
+
+static void mem_rc_destroy_rcdomain(void *arg)
+{
+ struct mem_rc_domain *rcd = arg;
+ int node;
+
+ for_each_node_mask(node, rcd->nodes) {
+ if (rcd->zone_pages[node])
+ kfree(rcd->zone_pages[node]);
+ }
+
+ kfree(rcd);
+}
+
+static void *mem_rc_create_rcdomain(struct cpuset *cs,
+ cpumask_t cpus, nodemask_t mems)
+{
+ struct mem_rc_domain *rcd;
+ struct zone *z;
+ pg_data_t *pgdat;
+ unsigned long *pp;
+ int i, node, allocn;
+
+ allocn = first_node(mems);
+ rcd = kmalloc_node(sizeof(*rcd), GFP_KERNEL, allocn);
+ if (!rcd)
+ return NULL;
+
+ memset(rcd, 0, sizeof(*rcd));
+
+ init_MUTEX(&rcd->sem);
+ rcd->nodes = mems;
+ for_each_node_mask(node, mems) {
+ pgdat = NODE_DATA(node);
+
+ pp = kmalloc_node(sizeof(unsigned long) * MAX_NR_ZONES,
+ GFP_KERNEL, allocn);
+ if (!pp)
+ goto failed;
+
+ rcd->zone_pages[node] = pp;
+ for (i = 0; i < MAX_NR_ZONES; i++) {
+ if (i == ZONE_DMA) {
+ pp[i] = 0;
+ continue;
+ }
+ z = pgdat->node_zones + i;
+ pp[i] = z->present_pages;
+ }
+ }
+
+ return rcd;
+
+failed:
+ mem_rc_destroy_rcdomain(rcd);
+
+ return NULL;
+}
+
+
+static void *mem_rc_create(void *arg, char *name)
+{
+ struct mem_rc_domain *rcd = arg;
+ struct mem_rc *mr;
+ struct zonelist *zl, *zl_ref;
+ struct zone *parent, *z, *z_ref;
+ pg_data_t *pgdat;
+ int node, allocn;
+ int i, j;
+
+ allocn = first_node(rcd->nodes);
+ mr = kmalloc_node(sizeof(*mr), GFP_KERNEL, allocn);
+ if (!mr)
+ return NULL;
+
+ memset(mr, 0, sizeof(*mr));
+
+ down(&rcd->sem);
+ mr->rcd = rcd;
+ for_each_node_mask(node, rcd->nodes) {
+ pgdat = NODE_DATA(node);
+
+ mr->zones[node]
+ = kmalloc_node(sizeof(*mr->zones[node]) * MAX_NR_ZONES,
+ GFP_KERNEL, allocn);
+ if (!mr->zones[node])
+ goto failed;
+
+ memset(mr->zones[node], 0,
+ sizeof(*mr->zones[node]) * MAX_NR_ZONES);
+
+ mr->zonelists[node]
+ = kmalloc_node(sizeof(*mr->zonelists[node]),
+ GFP_KERNEL, allocn);
+ if (!mr->zonelists[node])
+ goto failed;
+
+ memset(mr->zonelists[node], 0, sizeof(*mr->zonelists[node]));
+
+ for (i = 0; i < MAX_NR_ZONES; i++) {
+ parent = pgdat->node_zones + i;
+ if (rcd->zone_pages[node][i] == 0)
+ continue;
+
+ z = pzone_create(parent, name, 0);
+ if (!z)
+ goto failed;
+ mr->zones[node][i] = z;
+ }
+ }
+
+ for_each_node_mask(node, rcd->nodes) {
+ /* NORMAL and DMA zones are also in the HIGHMEM zonelist. */
+ zl_ref = NODE_DATA(node)->node_zonelists + __GFP_HIGHMEM;
+ zl = mr->zonelists[node];
+
+ for (j = i = 0; i < ARRAY_SIZE(zl_ref->zones); i++) {
+ z_ref = zl_ref->zones[i];
+ if (!z_ref)
+ break;
+
+ z = mr->zones[node][zone_idx(z_ref)];
+ if (!z)
+ continue;
+ zl->zones[j++] = z;
+ }
+ zl->zones[j] = NULL;
+ }
+ up(&rcd->sem);
+
+ return mr;
+
+failed:
+ for_each_node_mask(node, rcd->nodes) {
+ if (mr->zonelists[node])
+ kfree(mr->zonelists[node]);
+
+ if (!mr->zones[node])
+ continue;
+
+ for (i = 0; i < MAX_NR_ZONES; i++) {
+ z = mr->zones[node][i];
+ if (!z)
+ continue;
+ pzone_destroy(z);
+ }
+ kfree(mr->zones[node]);
+ }
+ up(&rcd->sem);
+ kfree(mr);
+
+ return NULL;
+}
+
+static void mem_rc_destroy(void *p)
+{
+ struct mem_rc *mr = p;
+ struct mem_rc_domain *rcd = mr->rcd;
+ struct zone *z;
+ int node, i;
+
+ down(&rcd->sem);
+ for (node = 0; node < MAX_NUMNODES; node++) {
+ if (mr->zonelists[node])
+ kfree(mr->zonelists[node]);
+
+ if (!mr->zones[node])
+ continue;
+
+ for (i = 0; i < MAX_NR_ZONES; i++) {
+ z = mr->zones[node][i];
+ if (z)
+ pzone_destroy(z);
+ mr->zones[node][i] = NULL;
+ }
+ kfree(mr->zones[node]);
+ }
+ up(&rcd->sem);
+
+ kfree(mr);
+}
+
+static int mem_rc_set_guar(void *ctldata, unsigned long val)
+{
+ struct mem_rc *mr = ctldata;
+ struct mem_rc_domain *rcd = mr->rcd;
+ struct zone *z;
+ nodemask_t nodes_done;
+ int err;
+ int node;
+ int i;
+
+ down(&rcd->sem);
+ nodes_clear(nodes_done);
+ for_each_node_mask(node, rcd->nodes) {
+ for (i = 0; i < MAX_NR_ZONES; i++) {
+ z = mr->zones[node][i];
+ if (!z)
+ continue;
+
+ err = pzone_set_numpages(z,
+ MEM_RC_METER_TO_PAGES(rcd,
+ node, i, val));
+ if (err)
+ goto undo;
+ }
+ node_set(node, nodes_done);
+ }
+
+ mr->guarantee = val;
+ up(&rcd->sem);
+
+ return 0;
+
+undo:
+ for (i--; i >= 0; i--)
+ pzone_set_numpages(z, MEM_RC_METER_TO_PAGES(rcd, node, i,
+ mr->guarantee));
+
+ for_each_node_mask(node, nodes_done) {
+ for (i = 0; i < MAX_NR_ZONES; i++) {
+ z = mr->zones[node][i];
+ if (!z)
+ continue;
+
+ pzone_set_numpages(z,
+ MEM_RC_METER_TO_PAGES(rcd,
+ node, i, mr->guarantee));
+ }
+ }
+ up(&rcd->sem);
+
+ return err;
+}
+
+static int mem_rc_get_cur(void *ctldata, unsigned long *valp)
+{
+ struct mem_rc *mr = ctldata;
+ struct mem_rc_domain *rcd = mr->rcd;
+ struct zone *z;
+ unsigned long total, used;
+ int node;
+ int i;
+
+ total = used = 0;
+ for_each_node_mask(node, rcd->nodes) {
+ for (i = 0; i < MAX_NR_ZONES; i++) {
+ z = mr->zones[node][i];
+ if (!z)
+ continue;
+ total += z->present_pages;
+ used += z->present_pages - z->free_pages;
+ }
+ }
+
+ if (total > 0)
+ *valp = mr->guarantee * used / total;
+ else
+ *valp = 0;
+
+ return 0;
+}
+
+struct mem_rc *mem_rc_get(task_t *tsk)
+{
+ struct ckrm_class *class = tsk->class;
+ struct ckrm_mem *res;
+
+ if (unlikely(class == NULL))
+ return NULL;
+
+ res = ckrm_get_res_class(class, rcbs.resid, struct ckrm_mem);
+
+ if (unlikely(res == NULL))
+ return NULL;
+
+ return res->mem_rc;
+}
+EXPORT_SYMBOL(mem_rc_get);
+
+struct page *alloc_page_mem_rc(int nid, gfp_t gfpmask)
+{
+ struct mem_rc *mr;
+
+ mr = mem_rc_get(current);
+ if (!mr)
+ return __alloc_pages(gfpmask, 0,
+ NODE_DATA(nid)->node_zonelists
+ + (gfpmask & GFP_ZONEMASK));
+
+ return __alloc_pages(gfpmask, 0, mr->zonelists[nid]);
+}
+EXPORT_SYMBOL(alloc_page_mem_rc);
+
+struct zonelist *mem_rc_get_zonelist(int nd, gfp_t gfpmask,
+ unsigned int order)
+{
+ struct mem_rc *mr;
+
+ if (!mem_rc_available(gfpmask, order))
+ return NULL;
+
+ mr = mem_rc_get(current);
+ if (!mr)
+ return NULL;
+
+ return mr->zonelists[nd];
+}
+
+static void mem_rc_set_guarantee(struct ckrm_mem *res, int val)
+{
+ int rc;
+
+ if (res->mem_rc == NULL)
+ return;
+
+ res->mem_rc->guarantee = val;
+ rc = mem_rc_set_guar(res->mem_rc, (unsigned long)val);
+ if (rc)
+ printk("mem_rc_set_guar failed, err = %d\n", rc);
+}
+
+static void mem_res_initcls_one(struct ckrm_mem * res)
+{
+ res->shares.my_guarantee = 0;
+ res->shares.my_limit = CKRM_SHARE_DONTCARE;
+ res->shares.total_guarantee = CKRM_SHARE_DFLT_TOTAL_GUARANTEE;
+ res->shares.max_limit = CKRM_SHARE_DONTCARE;
+ res->shares.unused_guarantee = CKRM_SHARE_DFLT_TOTAL_GUARANTEE;
+ res->cnt_total_guarantee = 0;
+
+ return;
+}
+
+static void *mem_res_alloc(struct ckrm_class *class,
+ struct ckrm_class *parent)
+{
+ struct ckrm_mem *res;
+
+ res = kmalloc(sizeof(struct ckrm_mem), GFP_ATOMIC);
+
+ if (res) {
+ memset(res, 0, sizeof(struct ckrm_mem));
+ res->class = class;
+ res->parent = parent;
+ mem_res_initcls_one(res);
+ res->cnt_lock = SPIN_LOCK_UNLOCKED;
+ if (!parent) { /* root class */
+ res->cnt_total_guarantee = CKRM_SHARE_DFLT_TOTAL_GUARANTEE;
+ res->shares.my_guarantee = CKRM_SHARE_DONTCARE;
+ } else {
+ res->mem_rc = (struct mem_rc *)mem_rc_create(grcd, class->name);
+ if (res->mem_rc == NULL)
+ printk(KERN_ERR "mem_rc_create failed\n");
+ }
+ } else {
+ printk(KERN_ERR
+ "mem_res_alloc: failed GFP_ATOMIC alloc\n");
+ }
+ return res;
+}
+
+static void mem_res_free(void *my_res)
+{
+ struct ckrm_mem *res = my_res, *parres;
+ u64 temp = 0;
+
+ if (!res)
+ return;
+
+ parres = ckrm_get_res_class(res->parent, rcbs.resid, struct ckrm_mem);
+ /* return child's guarantee to parent class */
+ spin_lock(&parres->cnt_lock);
+ ckrm_child_guarantee_changed(&parres->shares, res->shares.my_guarantee, 0);
+ if (parres->shares.total_guarantee) {
+ temp = (u64) parres->shares.unused_guarantee
+ * parres->cnt_total_guarantee;
+ do_div(temp, parres->shares.total_guarantee);
+ }
+ mem_rc_set_guarantee(parres, temp);
+ spin_unlock(&parres->cnt_lock);
+
+ mem_rc_destroy(res->mem_rc);
+ kfree(res);
+ return;
+}
+
+static void
+recalc_and_propagate(struct ckrm_mem * res)
+{
+ struct ckrm_class *child = NULL;
+ struct ckrm_mem *parres, *childres;
+ u64 cnt_total = 0, cnt_guar = 0;
+
+ parres = ckrm_get_res_class(res->parent, rcbs.resid, struct ckrm_mem);
+
+ if (parres) {
+ struct ckrm_shares *par = &parres->shares;
+ struct ckrm_shares *self = &res->shares;
+
+ /* calculate total and current guarantee */
+ if (par->total_guarantee && self->total_guarantee) {
+ cnt_total = (u64) self->my_guarantee
+ * parres->cnt_total_guarantee;
+ do_div(cnt_total, par->total_guarantee);
+ cnt_guar = (u64) self->unused_guarantee * cnt_total;
+ do_div(cnt_guar, self->total_guarantee);
+ }
+ mem_rc_set_guarantee(res, (int) cnt_guar);
+ res->cnt_total_guarantee = (int ) cnt_total;
+ }
+
+ /* propagate to children */
+ ckrm_lock_hier(res->class);
+ while ((child = ckrm_get_next_child(res->class, child)) != NULL) {
+ childres =
+ ckrm_get_res_class(child, rcbs.resid, struct ckrm_mem);
+ if (childres) {
+ spin_lock(&childres->cnt_lock);
+ recalc_and_propagate(childres);
+ spin_unlock(&childres->cnt_lock);
+ }
+ }
+ ckrm_unlock_hier(res->class);
+ return;
+}
+
+static int mem_set_share_values(void *my_res, struct ckrm_shares *new)
+{
+ struct ckrm_mem *parres, *res = my_res;
+ struct ckrm_shares *cur = &res->shares, *par;
+ int rc = -EINVAL;
+ u64 temp = 0;
+
+ if (!res)
+ return rc;
+
+ if (res->parent) {
+ parres =
+ ckrm_get_res_class(res->parent, rcbs.resid, struct ckrm_mem);
+ spin_lock(&parres->cnt_lock);
+ spin_lock(&res->cnt_lock);
+ par = &parres->shares;
+ } else {
+ spin_lock(&res->cnt_lock);
+ par = NULL;
+ parres = NULL;
+ }
+
+ rc = ckrm_set_shares(new, cur, par);
+
+ if (rc)
+ goto share_err;
+
+ if (parres) {
+ /* adjust parent's unused guarantee */
+ if (par->total_guarantee) {
+ temp = (u64) par->unused_guarantee
+ * parres->cnt_total_guarantee;
+ do_div(temp, par->total_guarantee);
+ }
+ mem_rc_set_guarantee(parres, temp);
+ } else {
+ /* adjust root class's unused guarantee */
+ temp = (u64) cur->unused_guarantee
+ * CKRM_SHARE_DFLT_TOTAL_GUARANTEE;
+ do_div(temp, cur->total_guarantee);
+ mem_rc_set_guarantee(res, temp);
+ }
+ recalc_and_propagate(res);
+
+share_err:
+ spin_unlock(&res->cnt_lock);
+ if (res->parent)
+ spin_unlock(&parres->cnt_lock);
+ return rc;
+}
+
+static int mem_get_share_values(void *my_res, struct ckrm_shares *shares)
+{
+ struct ckrm_mem *res = my_res;
+
+ if (!res)
+ return -EINVAL;
+ *shares = res->shares;
+ return 0;
+}
+
+static ssize_t mem_show_stats(void *my_res, char *buf)
+{
+ struct ckrm_mem *res = my_res;
+ unsigned long val;
+ ssize_t i;
+
+ if (!res)
+ return -EINVAL;
+
+ if (res->mem_rc == NULL)
+ return 0;
+
+ mem_rc_get_cur(res->mem_rc, &val);
+ i = sprintf(buf, "mem:current=%lu\n", val);
+ return i;
+}
+
+static struct ckrm_res_ctlr rcbs = {
+ .res_name = "mem",
+ .resid = -1,
+ .res_alloc = mem_res_alloc,
+ .res_free = mem_res_free,
+ .set_share_values = mem_set_share_values,
+ .get_share_values = mem_get_share_values,
+ .show_stats = mem_show_stats,
+};
+
+static void init_global_rcd(void)
+{
+ grcd = (struct mem_rc_domain *) mem_rc_create_rcdomain((struct cpuset *)NULL, cpu_online_map, node_online_map);
+ if (grcd == NULL)
+ printk("mem_rc_create_rcdomain failed\n");
+}
+
+int __init init_ckrm_mem_res(void)
+{
+ init_global_rcd();
+ if (rcbs.resid == CKRM_NO_RES) {
+ ckrm_register_res_ctlr(&rcbs);
+ }
+ return 0;
+}
+
+void __exit exit_ckrm_mem_res(void)
+{
+ ckrm_unregister_res_ctlr(&rcbs);
+ mem_rc_destroy_rcdomain(grcd);
+}
+
+module_init(init_ckrm_mem_res);
+module_exit(exit_ckrm_mem_res);
+
+MODULE_LICENSE("GPL");
diff -urNp b/mm/mempolicy.c c/mm/mempolicy.c
--- b/mm/mempolicy.c 2006-01-03 12:21:10.000000000 +0900
+++ c/mm/mempolicy.c 2006-01-17 10:04:53.000000000 +0900
@@ -726,8 +726,10 @@ get_vma_policy(struct task_struct *task,
}
/* Return a zonelist representing a mempolicy */
-static struct zonelist *zonelist_policy(gfp_t gfp, struct mempolicy *policy)
+static struct zonelist *zonelist_policy(gfp_t gfp, int order,
+ struct mempolicy *policy)
{
+ struct zonelist *zl;
int nd;
switch (policy->policy) {
@@ -746,6 +748,8 @@ static struct zonelist *zonelist_policy(
case MPOL_INTERLEAVE: /* should not happen */
case MPOL_DEFAULT:
nd = numa_node_id();
+ if ((zl = mem_rc_get_zonelist(nd, gfp, order)) != NULL)
+ return zl;
break;
default:
nd = 0;
@@ -844,7 +848,7 @@ alloc_page_vma(gfp_t gfp, struct vm_area
}
return alloc_page_interleave(gfp, 0, nid);
}
- return __alloc_pages(gfp, 0, zonelist_policy(gfp, pol));
+ return __alloc_pages(gfp, 0, zonelist_policy(gfp, 0, pol));
}
/**
@@ -876,7 +880,7 @@ struct page *alloc_pages_current(gfp_t g
pol = &default_policy;
if (pol->policy == MPOL_INTERLEAVE)
return alloc_page_interleave(gfp, order, interleave_nodes(pol));
- return __alloc_pages(gfp, order, zonelist_policy(gfp, pol));
+ return __alloc_pages(gfp, order, zonelist_policy(gfp, order, pol));
}
EXPORT_SYMBOL(alloc_pages_current);
--
* Re: [PATCH 1/2] Add the pzone
2006-01-19 8:04 ` [PATCH 1/2] Add the pzone KUROSAWA Takahiro
@ 2006-01-19 18:04 ` Andy Whitcroft
2006-01-19 23:42 ` KUROSAWA Takahiro
2006-01-20 7:08 ` KAMEZAWA Hiroyuki
1 sibling, 1 reply; 32+ messages in thread
From: Andy Whitcroft @ 2006-01-19 18:04 UTC (permalink / raw)
To: KUROSAWA Takahiro; +Cc: ckrm-tech, linux-mm
KUROSAWA Takahiro wrote:
> This patch implements the pzone (pseudo zone). A pzone can be used
> for reserving pages in a zone. Pzones are implemented by extending
> the zone structure and act almost the same as the conventional zones;
> we can specify pzones in a zonelist for __alloc_pages() and the vmscan
> code works on pzones with few modifications.
>
> Signed-off-by: KUROSAWA Takahiro <kurosawa@valinux.co.jp>
[...]
> -/* Page flags: | [SECTION] | [NODE] | ZONE | ... | FLAGS | */
> -#define SECTIONS_PGOFF ((sizeof(unsigned long)*8) - SECTIONS_WIDTH)
> +/* Page flags: | [PZONE] | [SECTION] | [NODE] | ZONE | ... | FLAGS | */
> +#define PZONE_BIT_PGOFF ((sizeof(unsigned long)*8) - PZONE_BIT_WIDTH)
> +#define SECTIONS_PGOFF (PZONE_BIT_PGOFF - SECTIONS_WIDTH)
> #define NODES_PGOFF (SECTIONS_PGOFF - NODES_WIDTH)
> #define ZONES_PGOFF (NODES_PGOFF - ZONES_WIDTH)
In general this PZONE bit is really a part of the zone number. Much of
the order of these bits is chosen to obtain the cheapest extraction of
the most used bits, particularly the node/zone combination or the section
number on the left. I would say put the PZONE_BIT next to ZONE,
probably to the right of it. [See below for more reasons to put it
there.]
> @@ -431,6 +438,7 @@ void put_page(struct page *page);
> * sections we define the shift as 0; that plus a 0 mask ensures
> * the compiler will optimise away reference to them.
> */
> +#define PZONE_BIT_PGSHIFT (PZONE_BIT_PGOFF * (PZONE_BIT_WIDTH != 0))
> #define SECTIONS_PGSHIFT (SECTIONS_PGOFF * (SECTIONS_WIDTH != 0))
> #define NODES_PGSHIFT (NODES_PGOFF * (NODES_WIDTH != 0))
> #define ZONES_PGSHIFT (ZONES_PGOFF * (ZONES_WIDTH != 0))
> @@ -443,10 +451,11 @@ void put_page(struct page *page);
> #endif
> #define ZONETABLE_PGSHIFT ZONES_PGSHIFT
>
> -#if SECTIONS_WIDTH+NODES_WIDTH+ZONES_WIDTH > FLAGS_RESERVED
> -#error SECTIONS_WIDTH+NODES_WIDTH+ZONES_WIDTH > FLAGS_RESERVED
> +#if PZONE_BIT_WIDTH+SECTIONS_WIDTH+NODES_WIDTH+ZONES_WIDTH > FLAGS_RESERVED
> +#error PZONE_BIT_WIDTH+SECTIONS_WIDTH+NODES_WIDTH+ZONES_WIDTH > FLAGS_RESERVED
> #endif
Do we have any bits left in the reserve on 32-bit machines? The reserve
at last look was only 8 bits, and there was little if any headroom in the
rest of the flags word to extend it; if memory serves, at least 22 of the
24 remaining bits were accounted for. Has this been tested on any such
machines?
> +#define PZONE_BIT_MASK ((1UL << PZONE_BIT_WIDTH) - 1)
> #define ZONES_MASK ((1UL << ZONES_WIDTH) - 1)
> #define NODES_MASK ((1UL << NODES_WIDTH) - 1)
> #define SECTIONS_MASK ((1UL << SECTIONS_WIDTH) - 1)
[...]
> +#ifdef CONFIG_PSEUDO_ZONE
> +static inline int page_in_pzone(struct page *page)
> +{
> + return (page->flags >> PZONE_BIT_PGSHIFT) & PZONE_BIT_MASK;
> +}
> +
> +static inline struct zone *page_zone(struct page *page)
> +{
> + int idx;
> +
> + idx = (page->flags >> ZONETABLE_PGSHIFT) & ZONETABLE_MASK;
> + if (page_in_pzone(page))
> + return pzone_table[idx].zone;
> + return zone_table[idx];
> +}
Could we not do this all without changing the structure of the zone
table at all, by placing the PZONE_BIT either to the left or right of the
ZONE in the flags? Then ZONETABLE_MASK could be extended to cover it
when pzones are enabled, and the pzones could be added to zone_table
instead of their own pzone_table. This would mean the code above could
be left unmodified and much simpler.
> +
> +static inline unsigned long page_to_nid(struct page *page)
> +{
> + return page_zone(page)->zone_pgdat->node_id;
> +}
[...]
> +#ifdef CONFIG_PSEUDO_ZONE
> +#define MAX_NR_PZONES 1024
You seem to be allowing for 1024 pzones here? But in
pzone_setup_page_flags() you place the pzone_idx (an offset into the
pzone_table) into the ZONE field of the page flags. This field is
typically only two bits wide. I don't see it being increased in this
patch, nor is there generally space for it to get much bigger, at least
not on 32-bit kernels (see the comments about bits earlier).
#define ZONES_SHIFT 2 /* ceil(log2(MAX_NR_ZONES)) */
Cheers.
-apw
--
* Re: [PATCH 1/2] Add the pzone
2006-01-19 18:04 ` Andy Whitcroft
@ 2006-01-19 23:42 ` KUROSAWA Takahiro
2006-01-20 9:17 ` Andy Whitcroft
0 siblings, 1 reply; 32+ messages in thread
From: KUROSAWA Takahiro @ 2006-01-19 23:42 UTC (permalink / raw)
To: Andy Whitcroft; +Cc: ckrm-tech, linux-mm
On Thu, 19 Jan 2006 18:04:43 +0000
Andy Whitcroft <apw@shadowen.org> wrote:
> > -/* Page flags: | [SECTION] | [NODE] | ZONE | ... | FLAGS | */
> > -#define SECTIONS_PGOFF ((sizeof(unsigned long)*8) - SECTIONS_WIDTH)
> > +/* Page flags: | [PZONE] | [SECTION] | [NODE] | ZONE | ... | FLAGS | */
> > +#define PZONE_BIT_PGOFF ((sizeof(unsigned long)*8) - PZONE_BIT_WIDTH)
> > +#define SECTIONS_PGOFF (PZONE_BIT_PGOFF - SECTIONS_WIDTH)
> > #define NODES_PGOFF (SECTIONS_PGOFF - NODES_WIDTH)
> > #define ZONES_PGOFF (NODES_PGOFF - ZONES_WIDTH)
>
> In general this PZONE bit is really a part of the zone number. Much of
> the order of these bits is chosen to obtain the cheapest extraction of
> the most used bits, particularly the node/zone conbination or section
> number on the left. I would say put the PZONE_BIT next to ZONE
> probabally to the right of it? [See below for more reasons to put it
> there.]
Thanks for the comments. It looks much better to put PZONE_BIT to
that place.
> > @@ -431,6 +438,7 @@ void put_page(struct page *page);
> > * sections we define the shift as 0; that plus a 0 mask ensures
> > * the compiler will optimise away reference to them.
> > */
> > +#define PZONE_BIT_PGSHIFT (PZONE_BIT_PGOFF * (PZONE_BIT_WIDTH != 0))
> > #define SECTIONS_PGSHIFT (SECTIONS_PGOFF * (SECTIONS_WIDTH != 0))
> > #define NODES_PGSHIFT (NODES_PGOFF * (NODES_WIDTH != 0))
> > #define ZONES_PGSHIFT (ZONES_PGOFF * (ZONES_WIDTH != 0))
> > @@ -443,10 +451,11 @@ void put_page(struct page *page);
> > #endif
> > #define ZONETABLE_PGSHIFT ZONES_PGSHIFT
> >
> > -#if SECTIONS_WIDTH+NODES_WIDTH+ZONES_WIDTH > FLAGS_RESERVED
> > -#error SECTIONS_WIDTH+NODES_WIDTH+ZONES_WIDTH > FLAGS_RESERVED
> > +#if PZONE_BIT_WIDTH+SECTIONS_WIDTH+NODES_WIDTH+ZONES_WIDTH > FLAGS_RESERVED
> > +#error PZONE_BIT_WIDTH+SECTIONS_WIDTH+NODES_WIDTH+ZONES_WIDTH > FLAGS_RESERVED
> > #endif
>
> Do we have any bits left in the reserve on 32 bit machines? The reserve
> at last look was only 8 bits and there was little if any headroom in the
> rest of the flags word to extend it; if memory serves at least 22 of the
> 24 remaining bits was accounted for. Has this been tested on any such
> machines?
At least it compiles and works in a non-NUMA i386 configuration.
But I haven't tested it with CONFIG_NUMA or CONFIG_SPARSEMEM enabled.
> > +
> > +static inline unsigned long page_to_nid(struct page *page)
> > +{
> > + return page_zone(page)->zone_pgdat->node_id;
> > +}
> [...]
> > +#ifdef CONFIG_PSEUDO_ZONE
> > +#define MAX_NR_PZONES 1024
>
> You seem to be allowing for 1024 pzone's here? But in
> pzone_setup_page_flags() you place the pzone_idx (an offset into the
> pzone_table) into the ZONE field of the page flags. This field is
> typically only two bits wide? I don't see this being increased in this
> patch, nor is there space for it generally to get much bigger not on 32
> bit kernels anyhow (see comments about bits earlier)?
pzone_idx isn't placed in the ZONE field. The flags field of pzone pages
is as follows:
Page flags: | [PZONE] | [pzone-idx] | ZONE | ... | FLAGS |
For pzones, the node number should be obtained from the parent zone.
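To spell that out, a rough sketch (illustrative only) of how the fields are
read back for a pzone page, using the masks and shifts as they appear in the
patch:

	/* the PZONE bit says whether the index field holds a pzone index */
	pzone = (page->flags >> PZONE_BIT_PGSHIFT) & PZONE_BIT_MASK;
	if (pzone) {
		/* pzone-idx reuses the bits that normally hold NODE/SECTION */
		idx  = (page->flags >> ZONETABLE_PGSHIFT) & ZONETABLE_MASK;
		zone = pzone_table[idx].zone;
		/* the node is then taken from the pzone's parent zone */
	}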
Thanks,
--
KUROSAWA, Takahiro
--
* Re: [PATCH 1/2] Add the pzone
2006-01-19 8:04 ` [PATCH 1/2] Add the pzone KUROSAWA Takahiro
2006-01-19 18:04 ` Andy Whitcroft
@ 2006-01-20 7:08 ` KAMEZAWA Hiroyuki
2006-01-20 8:22 ` KUROSAWA Takahiro
1 sibling, 1 reply; 32+ messages in thread
From: KAMEZAWA Hiroyuki @ 2006-01-20 7:08 UTC (permalink / raw)
To: KUROSAWA Takahiro; +Cc: ckrm-tech, linux-mm
KUROSAWA Takahiro wrote:
> This patch implements the pzone (pseudo zone). A pzone can be used
> for reserving pages in a zone. Pzones are implemented by extending
> the zone structure and act almost the same as the conventional zones;
> we can specify pzones in a zonelist for __alloc_pages() and the vmscan
> code works on pzones with few modifications.
>
> Signed-off-by: KUROSAWA Takahiro <kurosawa@valinux.co.jp>
>
> ---
> include/linux/gfp.h | 3
> include/linux/mm.h | 49 ++
> include/linux/mmzone.h | 118 ++++++
> include/linux/swap.h | 2
> mm/Kconfig | 6
> mm/page_alloc.c | 845 +++++++++++++++++++++++++++++++++++++++++++++----
> mm/shmem.c | 2
> mm/vmscan.c | 75 +++-
> 8 files changed, 1020 insertions(+), 80 deletions(-)
Could you divide this *large* patch into several pieces?
It looks like you don't want to use the functions based on zones, the buddy system, LRU lists, etc.
I think what you want is just a hierarchical memory allocator.
Why do you modify the zones and make the code complicated?
Can your memory allocator be implemented like mempool or hugetlb?
They are not so invasive.
Bye,
-- Kame
--
* Re: [PATCH 1/2] Add the pzone
2006-01-20 7:08 ` KAMEZAWA Hiroyuki
@ 2006-01-20 8:22 ` KUROSAWA Takahiro
2006-01-20 8:30 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 32+ messages in thread
From: KUROSAWA Takahiro @ 2006-01-20 8:22 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: ckrm-tech, linux-mm
On Fri, 20 Jan 2006 16:08:28 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > include/linux/gfp.h | 3
> > include/linux/mm.h | 49 ++
> > include/linux/mmzone.h | 118 ++++++
> > include/linux/swap.h | 2
> > mm/Kconfig | 6
> > mm/page_alloc.c | 845 +++++++++++++++++++++++++++++++++++++++++++++----
> > mm/shmem.c | 2
> > mm/vmscan.c | 75 +++-
> > 8 files changed, 1020 insertions(+), 80 deletions(-)
> Could you divide this *large* patch to several pieces ?
Ok, I'll split the patch.
> It looks like you don't want to use the functions based on zones, the buddy system, LRU lists, etc.
> I think what you want is just a hierarchical memory allocator.
> Why do you modify the zones and make the code complicated?
> Can your memory allocator be implemented like mempool or hugetlb?
> They are not so invasive.
mempool and hugetlb require their own shrinking code, don't they?
I guess that we would need routines like those in mm/vmscan.c if we are
going to shrink user pages. Instead, I'd like to reuse the shrinking
code in mm/vmscan.c.
Thanks,
--
KUROSAWA, Takahiro
--
* Re: [PATCH 1/2] Add the pzone
2006-01-20 8:22 ` KUROSAWA Takahiro
@ 2006-01-20 8:30 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 32+ messages in thread
From: KAMEZAWA Hiroyuki @ 2006-01-20 8:30 UTC (permalink / raw)
To: KUROSAWA Takahiro; +Cc: ckrm-tech, linux-mm
KUROSAWA Takahiro wrote:
>> It looks like you don't want to use the functions based on zones, the buddy system, LRU lists, etc.
>> I think what you want is just a hierarchical memory allocator.
>> Why do you modify the zones and make the code complicated?
>> Can your memory allocator be implemented like mempool or hugetlb?
>> They are not so invasive.
>
> mempool and hugetlb require their own shrinking code, don't they?
> I guess that we would need the routines like mm/vmscan.c if we are
> going to shrink user pages. Instead, I'd like to reuse the shrinking
> code in mm/vmscan.c.
>
I think you can reuse shrink_list() at least.
The code duplication you're afraid of can be kept small.
I think calling shrink_list() from your own shrinking function will give
good control over CKRM's memory regions.
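Roughly something like this (just a sketch; the scan_control fields shown
and making shrink_list() callable from outside mm/vmscan.c are assumptions):

	/* hypothetical per-class shrinker built on top of shrink_list() */
	static int ckrm_shrink_class_pages(struct list_head *page_list,
					   gfp_t gfp_mask)
	{
		struct scan_control sc = {
			.gfp_mask      = gfp_mask,
			.may_writepage = 1,
			.may_swap      = 1,
		};

		/* page_list holds pages already isolated from the class */
		return shrink_list(page_list, &sc);
	}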
-- Kame
--
* Re: [PATCH 1/2] Add the pzone
2006-01-19 23:42 ` KUROSAWA Takahiro
@ 2006-01-20 9:17 ` Andy Whitcroft
0 siblings, 0 replies; 32+ messages in thread
From: Andy Whitcroft @ 2006-01-20 9:17 UTC (permalink / raw)
To: KUROSAWA Takahiro; +Cc: ckrm-tech, linux-mm
KUROSAWA Takahiro wrote:
> On Thu, 19 Jan 2006 18:04:43 +0000
> Andy Whitcroft <apw@shadowen.org> wrote:
>
>
>>>-/* Page flags: | [SECTION] | [NODE] | ZONE | ... | FLAGS | */
>>>-#define SECTIONS_PGOFF ((sizeof(unsigned long)*8) - SECTIONS_WIDTH)
>>>+/* Page flags: | [PZONE] | [SECTION] | [NODE] | ZONE | ... | FLAGS | */
>>>+#define PZONE_BIT_PGOFF ((sizeof(unsigned long)*8) - PZONE_BIT_WIDTH)
>>>+#define SECTIONS_PGOFF (PZONE_BIT_PGOFF - SECTIONS_WIDTH)
>>> #define NODES_PGOFF (SECTIONS_PGOFF - NODES_WIDTH)
>>> #define ZONES_PGOFF (NODES_PGOFF - ZONES_WIDTH)
>>
>>In general this PZONE bit is really a part of the zone number. Much of
>>the order of these bits is chosen to obtain the cheapest extraction of
>>the most used bits, particularly the node/zone combination or the section
>>number on the left. I would say put the PZONE_BIT next to ZONE,
>>probably to the right of it. [See below for more reasons to put it
>>there.]
>
>
> Thanks for the comments. It looks much better to put PZONE_BIT to
> that place.
>
>
>>>@@ -431,6 +438,7 @@ void put_page(struct page *page);
>>> * sections we define the shift as 0; that plus a 0 mask ensures
>>> * the compiler will optimise away reference to them.
>>> */
>>>+#define PZONE_BIT_PGSHIFT (PZONE_BIT_PGOFF * (PZONE_BIT_WIDTH != 0))
>>> #define SECTIONS_PGSHIFT (SECTIONS_PGOFF * (SECTIONS_WIDTH != 0))
>>> #define NODES_PGSHIFT (NODES_PGOFF * (NODES_WIDTH != 0))
>>> #define ZONES_PGSHIFT (ZONES_PGOFF * (ZONES_WIDTH != 0))
>>>@@ -443,10 +451,11 @@ void put_page(struct page *page);
>>> #endif
>>> #define ZONETABLE_PGSHIFT ZONES_PGSHIFT
>>>
>>>-#if SECTIONS_WIDTH+NODES_WIDTH+ZONES_WIDTH > FLAGS_RESERVED
>>>-#error SECTIONS_WIDTH+NODES_WIDTH+ZONES_WIDTH > FLAGS_RESERVED
>>>+#if PZONE_BIT_WIDTH+SECTIONS_WIDTH+NODES_WIDTH+ZONES_WIDTH > FLAGS_RESERVED
>>>+#error PZONE_BIT_WIDTH+SECTIONS_WIDTH+NODES_WIDTH+ZONES_WIDTH > FLAGS_RESERVED
>>> #endif
>>
>>Do we have any bits left in the reserve on 32 bit machines? The reserve
>>at last look was only 8 bits and there was little if any headroom in the
>>rest of the flags word to extend it; if memory serves at least 22 of the
>>24 remaining bits was accounted for. Has this been tested on any such
>>machines?
>
>
> At least it does compile and work on non-NUMA i386 configuration.
> But I haven't tested with CONFIG_NUMA or CONFIG_SPARSEMEM enabled.
>
>
>>>+
>>>+static inline unsigned long page_to_nid(struct page *page)
>>>+{
>>>+ return page_zone(page)->zone_pgdat->node_id;
>>>+}
>>
>>[...]
>>
>>>+#ifdef CONFIG_PSEUDO_ZONE
>>>+#define MAX_NR_PZONES 1024
>>
>>You seem to be allowing for 1024 pzones here? But in
>>pzone_setup_page_flags() you place the pzone_idx (an offset into the
>>pzone_table) into the ZONE field of the page flags. This field is
>>typically only two bits wide. I don't see it being increased in this
>>patch, nor is there generally space for it to get much bigger, at least
>>not on 32-bit kernels (see the comments about bits earlier).
>
>
> pzone_idx isn't placed on the ZONE field. The flags field of pzone pages
> is as follows:
>
> Page flags: | [PZONE] | [pzone-idx] | ZONE | ... | FLAGS |
>
> For pzones, the node number should be obtained from parent zone.
So you are in effect replacing the NODE element with the PZONE-IDX
field. Firstly, you haven't changed the format itself to do that. If it
were sensible to do that (see below for other issues) I would suggest
you simply extend the ZONE to encompass that entire area and use the
higher zone 'numbers' to represent the pzones.
Problems with this include:
1) The ZONE numbers for the standard zones are not unique in a NUMA
system; we need the NUMA node number to make them unique. We assume we
can locate the zone directly from the struct page, and currently that is
done using the zonetable, indexed by the (NODE,ZONE) tuple or the
(SECTION,ZONE) tuple. Extending the ZONE over the NODE/SECTION element
and eliminating those will prevent both NUMA and SPARSEMEM from working.
2) On 32-bit architectures this space in total is FLAGS_RESERVED, a max
of 9 bits (in -mm at least). You can't wedge three bits of ZONE, one bit
of PZONE and 10 bits of pzone-idx into that space; it simply doesn't fit.
Even if you just merge the pzone and zone together, there aren't 10 bits
available.
All in all it looks like this approach isn't going to work well for
32-bit machines with either NUMA or SPARSEMEM.
-apw
--
* [PATCH 0/8] Pzone based CKRM memory resource controller
2006-01-19 8:04 [PATCH 0/2] Pzone based CKRM memory resource controller KUROSAWA Takahiro
2006-01-19 8:04 ` [PATCH 1/2] Add the pzone KUROSAWA Takahiro
2006-01-19 8:04 ` [PATCH 2/2] Add CKRM memory resource controller using pzones KUROSAWA Takahiro
@ 2006-01-31 2:30 ` KUROSAWA Takahiro
2006-01-31 2:30 ` [PATCH 1/8] Add the __GFP_NOLRU flag KUROSAWA Takahiro
` (9 more replies)
2 siblings, 10 replies; 32+ messages in thread
From: KUROSAWA Takahiro @ 2006-01-31 2:30 UTC (permalink / raw)
To: ckrm-tech; +Cc: linux-mm, KUROSAWA Takahiro
I've split the patches into smaller pieces in order to increase
readability. The core part of the patchset is the fifth one with
the subject "Add the pzone_create() function."
Changes since the last post:
* Fixed a bug where pages allocated with __GFP_COLD were incorrectly handled.
* Moved the PZONE bit in the page flags next to the zone number bits in
 order to keep the changes required by pzones smaller.
* Moved the nr_zones locking functions outside of CONFIG_PSEUDO_ZONE
 because they are not directly related to pzones.
Thanks,
--
KUROSAWA, Takahiro
--
* [PATCH 1/8] Add the __GFP_NOLRU flag
2006-01-31 2:30 ` [PATCH 0/8] Pzone based CKRM memory resource controller KUROSAWA Takahiro
@ 2006-01-31 2:30 ` KUROSAWA Takahiro
2006-01-31 18:18 ` [ckrm-tech] " Dave Hansen
2006-01-31 2:30 ` [PATCH 2/8] Keep the number of zones while zone iterator loop KUROSAWA Takahiro
` (8 subsequent siblings)
9 siblings, 1 reply; 32+ messages in thread
From: KUROSAWA Takahiro @ 2006-01-31 2:30 UTC (permalink / raw)
To: ckrm-tech; +Cc: linux-mm, KUROSAWA Takahiro
This patch adds the __GFP_NOLRU flag. This option should be used
for GFP_USER/GFP_HIGHUSER page allocations that are not maintained
in the zone LRU lists.
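For example, such a caller would allocate roughly like this (sketch only;
the allocation site is hypothetical):

	/* a user-style page managed by the caller, kept off the zone LRU */
	page = alloc_page(GFP_HIGHUSER | __GFP_NOLRU);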
Signed-off-by: KUROSAWA Takahiro <kurosawa@valinux.co.jp>
---
include/linux/gfp.h | 3 ++-
mm/shmem.c | 2 +-
2 files changed, 3 insertions(+), 2 deletions(-)
diff -urNp a/include/linux/gfp.h b/include/linux/gfp.h
--- a/include/linux/gfp.h 2006-01-03 12:21:10.000000000 +0900
+++ b/include/linux/gfp.h 2006-01-26 19:14:54.000000000 +0900
@@ -47,6 +47,7 @@ struct vm_area_struct;
#define __GFP_ZERO ((__force gfp_t)0x8000u)/* Return zeroed page on success */
#define __GFP_NOMEMALLOC ((__force gfp_t)0x10000u) /* Don't use emergency reserves */
#define __GFP_HARDWALL ((__force gfp_t)0x20000u) /* Enforce hardwall cpuset memory allocs */
+#define __GFP_NOLRU ((__force gfp_t)0x40000u) /* GFP_USER but will not be in LRU lists */
#define __GFP_BITS_SHIFT 20 /* Room for 20 __GFP_FOO bits */
#define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
@@ -55,7 +56,7 @@ struct vm_area_struct;
#define GFP_LEVEL_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS| \
__GFP_COLD|__GFP_NOWARN|__GFP_REPEAT| \
__GFP_NOFAIL|__GFP_NORETRY|__GFP_NO_GROW|__GFP_COMP| \
- __GFP_NOMEMALLOC|__GFP_HARDWALL)
+ __GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_NOLRU)
#define GFP_ATOMIC (__GFP_HIGH)
#define GFP_NOIO (__GFP_WAIT)
diff -urNp a/mm/shmem.c b/mm/shmem.c
--- a/mm/shmem.c 2006-01-03 12:21:10.000000000 +0900
+++ b/mm/shmem.c 2006-01-26 19:14:54.000000000 +0900
@@ -366,7 +366,7 @@ static swp_entry_t *shmem_swp_alloc(stru
}
spin_unlock(&info->lock);
- page = shmem_dir_alloc(mapping_gfp_mask(inode->i_mapping) | __GFP_ZERO);
+ page = shmem_dir_alloc(mapping_gfp_mask(inode->i_mapping) | __GFP_ZERO | __GFP_NOLRU);
if (page)
set_page_private(page, 0);
spin_lock(&info->lock);
--
* [PATCH 2/8] Keep the number of zones while zone iterator loop
2006-01-31 2:30 ` [PATCH 0/8] Pzone based CKRM memory resource controller KUROSAWA Takahiro
2006-01-31 2:30 ` [PATCH 1/8] Add the __GFP_NOLRU flag KUROSAWA Takahiro
@ 2006-01-31 2:30 ` KUROSAWA Takahiro
2006-01-31 2:30 ` [PATCH 3/8] Add for_each_zone_in_node macro KUROSAWA Takahiro
` (7 subsequent siblings)
9 siblings, 0 replies; 32+ messages in thread
From: KUROSAWA Takahiro @ 2006-01-31 2:30 UTC (permalink / raw)
To: ckrm-tech; +Cc: linux-mm, KUROSAWA Takahiro
This patch adds locking functions that restrict the addition and removal
of zones while zones are being looked up by for_each_zone etc. This is
required for pzones because, with pzones, zones are added and removed
dynamically.
for_each_zone and its family should be surrounded by
read_lock_nr_zones() and read_unlock_nr_zones(). Code that adds or
removes zones should call write_lock_nr_zones() and write_unlock_nr_zones().
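For example (sketch only; do_something() and add_or_remove_zones() stand in
for the real work):

	/* reader side: keep the set of zones stable across the walk */
	read_lock_nr_zones();
	for_each_zone(zone)
		do_something(zone);
	read_unlock_nr_zones();

	/* writer side, e.g. pzone creation or removal */
	write_lock_nr_zones(&flags);
	add_or_remove_zones();
	write_unlock_nr_zones(&flags);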
Signed-off-by: KUROSAWA Takahiro <kurosawa@valinux.co.jp>
---
include/linux/mmzone.h | 4 ++
mm/page_alloc.c | 68 +++++++++++++++++++++++++++++++++++++++++++++++++
mm/vmscan.c | 2 +
3 files changed, 74 insertions(+)
diff -urNp linux-2.6.15/include/linux/mmzone.h a/include/linux/mmzone.h
--- linux-2.6.15/include/linux/mmzone.h 2006-01-03 12:21:10.000000000 +0900
+++ a/include/linux/mmzone.h 2006-01-27 10:32:47.000000000 +0900
@@ -322,6 +322,10 @@ void build_all_zonelists(void);
void wakeup_kswapd(struct zone *zone, int order);
int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
int classzone_idx, int alloc_flags);
+void read_lock_nr_zones(void);
+void read_unlock_nr_zones(void);
+void write_lock_nr_zones(unsigned long *flagsp);
+void write_unlock_nr_zones(unsigned long *flagsp);
#ifdef CONFIG_HAVE_MEMORY_PRESENT
void memory_present(int nid, unsigned long start, unsigned long end);
diff -urNp linux-2.6.15/mm/page_alloc.c a/mm/page_alloc.c
--- linux-2.6.15/mm/page_alloc.c 2006-01-03 12:21:10.000000000 +0900
+++ a/mm/page_alloc.c 2006-01-27 10:38:39.000000000 +0900
@@ -565,6 +565,7 @@ void drain_remote_pages(void)
unsigned long flags;
local_irq_save(flags);
+ read_lock_nr_zones();
for_each_zone(zone) {
struct per_cpu_pageset *pset;
@@ -582,6 +583,7 @@ void drain_remote_pages(void)
&pcp->list, 0);
}
}
+ read_unlock_nr_zones();
local_irq_restore(flags);
}
#endif
@@ -592,6 +594,7 @@ static void __drain_pages(unsigned int c
struct zone *zone;
int i;
+ read_lock_nr_zones();
for_each_zone(zone) {
struct per_cpu_pageset *pset;
@@ -604,6 +607,7 @@ static void __drain_pages(unsigned int c
&pcp->list, 0);
}
}
+ read_unlock_nr_zones();
}
#endif /* CONFIG_PM || CONFIG_HOTPLUG_CPU */
@@ -1080,8 +1084,10 @@ unsigned int nr_free_pages(void)
unsigned int sum = 0;
struct zone *zone;
+ read_lock_nr_zones();
for_each_zone(zone)
sum += zone->free_pages;
+ read_unlock_nr_zones();
return sum;
}
@@ -1331,6 +1337,7 @@ void show_free_areas(void)
unsigned long free;
struct zone *zone;
+ read_lock_nr_zones();
for_each_zone(zone) {
show_node(zone);
printk("%s per-cpu:", zone->name);
@@ -1427,6 +1434,7 @@ void show_free_areas(void)
spin_unlock_irqrestore(&zone->lock, flags);
printk("= %lukB\n", K(total));
}
+ read_unlock_nr_zones();
show_swap_cache_info();
}
@@ -1836,6 +1844,7 @@ static int __devinit process_zones(int c
{
struct zone *zone, *dzone;
+ read_lock_nr_zones();
for_each_zone(zone) {
zone->pageset[cpu] = kmalloc_node(sizeof(struct per_cpu_pageset),
@@ -1845,6 +1854,7 @@ static int __devinit process_zones(int c
setup_pageset(zone->pageset[cpu], zone_batchsize(zone));
}
+ read_unlock_nr_zones();
return 0;
bad:
@@ -1854,6 +1864,7 @@ bad:
kfree(dzone->pageset[cpu]);
dzone->pageset[cpu] = NULL;
}
+ read_unlock_nr_zones();
return -ENOMEM;
}
@@ -1862,12 +1873,14 @@ static inline void free_zone_pagesets(in
#ifdef CONFIG_NUMA
struct zone *zone;
+ read_lock_nr_zones();
for_each_zone(zone) {
struct per_cpu_pageset *pset = zone_pcp(zone, cpu);
zone_pcp(zone, cpu) = NULL;
kfree(pset);
}
+ read_unlock_nr_zones();
#endif
}
@@ -2115,6 +2128,7 @@ static int frag_show(struct seq_file *m,
unsigned long flags;
int order;
+ read_lock_nr_zones();
for (zone = node_zones; zone - node_zones < MAX_NR_ZONES; ++zone) {
if (!zone->present_pages)
continue;
@@ -2126,6 +2140,7 @@ static int frag_show(struct seq_file *m,
spin_unlock_irqrestore(&zone->lock, flags);
seq_putc(m, '\n');
}
+ read_unlock_nr_zones();
return 0;
}
@@ -2146,6 +2161,7 @@ static int zoneinfo_show(struct seq_file
struct zone *node_zones = pgdat->node_zones;
unsigned long flags;
+ read_lock_nr_zones();
for (zone = node_zones; zone - node_zones < MAX_NR_ZONES; zone++) {
int i;
@@ -2234,6 +2250,7 @@ static int zoneinfo_show(struct seq_file
spin_unlock_irqrestore(&zone->lock, flags);
seq_putc(m, '\n');
}
+ read_unlock_nr_zones();
return 0;
}
@@ -2426,6 +2443,7 @@ void setup_per_zone_pages_min(void)
struct zone *zone;
unsigned long flags;
+ read_lock_nr_zones();
/* Calculate total number of !ZONE_HIGHMEM pages */
for_each_zone(zone) {
if (!is_highmem(zone))
@@ -2466,6 +2484,7 @@ void setup_per_zone_pages_min(void)
zone->pages_high = zone->pages_min + tmp / 2;
spin_unlock_irqrestore(&zone->lru_lock, flags);
}
+ read_unlock_nr_zones();
}
/*
@@ -2629,3 +2648,52 @@ void *__init alloc_large_system_hash(con
return table;
}
+
+/*
+ * Avoiding addition/removal of zones while looking up zones by
+ * for_each_zone etc. These routines don't guard references from zonelists
+ * used in the page allocator.
+ */
+static spinlock_t nr_zones_lock = SPIN_LOCK_UNLOCKED;
+static int zones_readers = 0;
+static DECLARE_WAIT_QUEUE_HEAD(zones_waitqueue);
+
+void read_lock_nr_zones(void)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&nr_zones_lock, flags);
+ zones_readers++;
+ spin_unlock_irqrestore(&nr_zones_lock, flags);
+}
+
+void read_unlock_nr_zones(void)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&nr_zones_lock, flags);
+ zones_readers--;
+ if ((zones_readers == 0) && waitqueue_active(&zones_waitqueue))
+ wake_up(&zones_waitqueue);
+ spin_unlock_irqrestore(&nr_zones_lock, flags);
+}
+
+void write_lock_nr_zones(unsigned long *flagsp)
+{
+ DEFINE_WAIT(wait);
+
+ spin_lock_irqsave(&nr_zones_lock, *flagsp);
+ while (zones_readers) {
+ spin_unlock_irqrestore(&nr_zones_lock, *flagsp);
+ prepare_to_wait(&zones_waitqueue, &wait,
+ TASK_UNINTERRUPTIBLE);
+ schedule();
+ finish_wait(&zones_waitqueue, &wait);
+ spin_lock_irqsave(&nr_zones_lock, *flagsp);
+ }
+}
+
+void write_unlock_nr_zones(unsigned long *flagsp)
+{
+ spin_unlock_irqrestore(&nr_zones_lock, *flagsp);
+}
diff -urNp linux-2.6.15/mm/vmscan.c a/mm/vmscan.c
--- linux-2.6.15/mm/vmscan.c 2006-01-03 12:21:10.000000000 +0900
+++ a/mm/vmscan.c 2006-01-27 10:32:47.000000000 +0900
@@ -1261,7 +1261,9 @@ static int kswapd(void *p)
}
finish_wait(&pgdat->kswapd_wait, &wait);
+ read_lock_nr_zones();
balance_pgdat(pgdat, 0, order);
+ read_unlock_nr_zones();
}
return 0;
}
--
* [PATCH 3/8] Add for_each_zone_in_node macro
2006-01-31 2:30 ` [PATCH 0/8] Pzone based CKRM memory resource controller KUROSAWA Takahiro
2006-01-31 2:30 ` [PATCH 1/8] Add the __GFP_NOLRU flag KUROSAWA Takahiro
2006-01-31 2:30 ` [PATCH 2/8] Keep the number of zones while zone iterator loop KUROSAWA Takahiro
@ 2006-01-31 2:30 ` KUROSAWA Takahiro
2006-01-31 2:30 ` [PATCH 4/8] Extract zone specific routines as functions KUROSAWA Takahiro
` (6 subsequent siblings)
9 siblings, 0 replies; 32+ messages in thread
From: KUROSAWA Takahiro @ 2006-01-31 2:30 UTC (permalink / raw)
To: ckrm-tech; +Cc: linux-mm, KUROSAWA Takahiro
This patch adds the for_each_zone_in_node macro. This macro iterates
over each zone in the specified node.
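For example (sketch only):

	struct zone *zone;

	for_each_zone_in_node(zone, pgdat, MAX_NR_ZONES) {
		if (!zone->present_pages)
			continue;
		/* per-zone work, limited to this node */
	}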
Signed-off-by: KUROSAWA Takahiro <kurosawa@valinux.co.jp>
---
include/linux/mmzone.h | 15 +++++++++++++++
mm/page_alloc.c | 6 ++----
mm/vmscan.c | 20 ++++++--------------
3 files changed, 23 insertions(+), 18 deletions(-)
diff -urNp a/include/linux/mmzone.h b/include/linux/mmzone.h
--- a/include/linux/mmzone.h 2006-01-27 12:58:53.000000000 +0900
+++ b/include/linux/mmzone.h 2006-01-27 12:59:45.000000000 +0900
@@ -375,6 +375,18 @@ static inline struct zone *next_zone(str
return zone;
}
+static inline struct zone *next_zone_in_node(struct zone *zone, int len)
+{
+ pg_data_t *pgdat = zone->zone_pgdat;
+
+ if (zone < pgdat->node_zones + len - 1)
+ zone++;
+ else
+ zone = NULL;
+
+ return zone;
+}
+
/**
* for_each_zone - helper macro to iterate over all memory zones
* @zone - pointer to struct zone variable
@@ -393,6 +405,9 @@ static inline struct zone *next_zone(str
#define for_each_zone(zone) \
for (zone = pgdat_list->node_zones; zone; zone = next_zone(zone))
+#define for_each_zone_in_node(zone, pgdat, len) \
+ for (zone = pgdat->node_zones; zone; zone = next_zone_in_node(zone, len))
+
static inline int is_highmem_idx(int idx)
{
return (idx == ZONE_HIGHMEM);
diff -urNp a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c 2006-01-27 12:58:53.000000000 +0900
+++ b/mm/page_alloc.c 2006-01-27 12:59:45.000000000 +0900
@@ -2124,12 +2124,11 @@ static int frag_show(struct seq_file *m,
{
pg_data_t *pgdat = (pg_data_t *)arg;
struct zone *zone;
- struct zone *node_zones = pgdat->node_zones;
unsigned long flags;
int order;
read_lock_nr_zones();
- for (zone = node_zones; zone - node_zones < MAX_NR_ZONES; ++zone) {
+ for_each_zone_in_node(zone, pgdat, MAX_NR_ZONES) {
if (!zone->present_pages)
continue;
@@ -2158,11 +2157,10 @@ static int zoneinfo_show(struct seq_file
{
pg_data_t *pgdat = arg;
struct zone *zone;
- struct zone *node_zones = pgdat->node_zones;
unsigned long flags;
read_lock_nr_zones();
- for (zone = node_zones; zone - node_zones < MAX_NR_ZONES; zone++) {
+ for_each_zone_in_node(zone, pgdat, MAX_NR_ZONES) {
int i;
if (!zone->present_pages)
diff -urNp a/mm/vmscan.c b/mm/vmscan.c
--- a/mm/vmscan.c 2006-01-27 12:58:53.000000000 +0900
+++ b/mm/vmscan.c 2006-01-27 12:59:45.000000000 +0900
@@ -1047,6 +1047,7 @@ static int balance_pgdat(pg_data_t *pgda
int priority;
int i;
int total_scanned, total_reclaimed;
+ struct zone *zone;
struct reclaim_state *reclaim_state = current->reclaim_state;
struct scan_control sc;
@@ -1060,11 +1061,8 @@ loop_again:
inc_page_state(pageoutrun);
- for (i = 0; i < pgdat->nr_zones; i++) {
- struct zone *zone = pgdat->node_zones + i;
-
+ for_each_zone_in_node(zone, pgdat, pgdat->nr_zones)
zone->temp_priority = DEF_PRIORITY;
- }
for (priority = DEF_PRIORITY; priority >= 0; priority--) {
int end_zone = 0; /* Inclusive. 0 = ZONE_DMA */
@@ -1102,11 +1100,8 @@ loop_again:
end_zone = pgdat->nr_zones - 1;
}
scan:
- for (i = 0; i <= end_zone; i++) {
- struct zone *zone = pgdat->node_zones + i;
-
+ for_each_zone_in_node(zone, pgdat, end_zone)
lru_pages += zone->nr_active + zone->nr_inactive;
- }
/*
* Now scan the zone in the dma->highmem direction, stopping
@@ -1117,8 +1112,7 @@ scan:
* pages behind kswapd's direction of progress, which would
* cause too much scanning of the lower zones.
*/
- for (i = 0; i <= end_zone; i++) {
- struct zone *zone = pgdat->node_zones + i;
+ for_each_zone_in_node(zone, pgdat, end_zone) {
int nr_slab;
if (zone->present_pages == 0)
@@ -1183,11 +1177,9 @@ scan:
break;
}
out:
- for (i = 0; i < pgdat->nr_zones; i++) {
- struct zone *zone = pgdat->node_zones + i;
-
+ for_each_zone_in_node(zone, pgdat, pgdat->nr_zones)
zone->prev_priority = zone->temp_priority;
- }
+
if (!all_zones_ok) {
cond_resched();
goto loop_again;
--
* [PATCH 4/8] Extract zone specific routines as functions
2006-01-31 2:30 ` [PATCH 0/8] Pzone based CKRM memory resource controller KUROSAWA Takahiro
` (2 preceding siblings ...)
2006-01-31 2:30 ` [PATCH 3/8] Add for_each_zone_in_node macro KUROSAWA Takahiro
@ 2006-01-31 2:30 ` KUROSAWA Takahiro
2006-01-31 2:30 ` [PATCH 5/8] Add the pzone_create() function KUROSAWA Takahiro
` (5 subsequent siblings)
9 siblings, 0 replies; 32+ messages in thread
From: KUROSAWA Takahiro @ 2006-01-31 2:30 UTC (permalink / raw)
To: ckrm-tech; +Cc: linux-mm, KUROSAWA Takahiro
This patch extracts the per-zone parts of __drain_pages() and
setup_per_zone_pages_min() into separate functions. The extracted
functions will be used by the pzone code.
Signed-off-by: KUROSAWA Takahiro <kurosawa@valinux.co.jp>
---
page_alloc.c | 111 +++++++++++++++++++++++++++++++----------------------------
1 file changed, 60 insertions(+), 51 deletions(-)
diff -urNp a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c 2006-01-27 15:34:05.000000000 +0900
+++ b/mm/page_alloc.c 2006-01-27 15:29:03.000000000 +0900
@@ -588,28 +599,32 @@ void drain_remote_pages(void)
}
#endif
-#if defined(CONFIG_PM) || defined(CONFIG_HOTPLUG_CPU)
-static void __drain_pages(unsigned int cpu)
+#if defined(CONFIG_PM) || defined(CONFIG_HOTPLUG_CPU)
+static void __drain_zone_pages(struct zone *zone, int cpu)
{
- struct zone *zone;
+ struct per_cpu_pageset *pset;
int i;
- read_lock_nr_zones();
- for_each_zone(zone) {
- struct per_cpu_pageset *pset;
-
- pset = zone_pcp(zone, cpu);
- for (i = 0; i < ARRAY_SIZE(pset->pcp); i++) {
- struct per_cpu_pages *pcp;
+ pset = zone_pcp(zone, cpu);
+ for (i = 0; i < ARRAY_SIZE(pset->pcp); i++) {
+ struct per_cpu_pages *pcp;
- pcp = &pset->pcp[i];
- pcp->count -= free_pages_bulk(zone, pcp->count,
- &pcp->list, 0);
- }
+ pcp = &pset->pcp[i];
+ pcp->count -= free_pages_bulk(zone, pcp->count,
+ &pcp->list, 0);
}
+}
+
+static void __drain_pages(unsigned int cpu)
+{
+ struct zone *zone;
+
+ read_lock_nr_zones();
+ for_each_zone(zone)
+ __drain_zone_pages(zone, cpu);
read_unlock_nr_zones();
}
-#endif /* CONFIG_PM || CONFIG_HOTPLUG_CPU */
+#endif /* CONFIG_PM || CONFIG_HOTPLUG_CPU */
#ifdef CONFIG_PM
@@ -2429,6 +2445,45 @@ static void setup_per_zone_lowmem_reserv
}
}
+static void setup_zone_pages_min(struct zone *zone, unsigned long lowmem_pages)
+{
+ unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
+ unsigned long flags;
+ unsigned long tmp;
+
+ spin_lock_irqsave(&zone->lru_lock, flags);
+ tmp = (pages_min * zone->present_pages) / lowmem_pages;
+ if (is_highmem(zone)) {
+ /*
+ * __GFP_HIGH and PF_MEMALLOC allocations usually don't
+ * need highmem pages, so cap pages_min to a small
+ * value here.
+ *
+ * The (pages_high-pages_low) and (pages_low-pages_min)
+ * deltas controls asynch page reclaim, and so should
+ * not be capped for highmem.
+ */
+ int min_pages;
+
+ min_pages = zone->present_pages / 1024;
+ if (min_pages < SWAP_CLUSTER_MAX)
+ min_pages = SWAP_CLUSTER_MAX;
+ if (min_pages > 128)
+ min_pages = 128;
+ zone->pages_min = min_pages;
+ } else {
+ /*
+ * If it's a lowmem zone, reserve a number of pages
+ * proportionate to the zone's size.
+ */
+ zone->pages_min = tmp;
+ }
+
+ zone->pages_low = zone->pages_min + tmp / 4;
+ zone->pages_high = zone->pages_min + tmp / 2;
+ spin_unlock_irqrestore(&zone->lru_lock, flags);
+}
+
/*
* setup_per_zone_pages_min - called when min_free_kbytes changes. Ensures
* that the pages_{min,low,high} values for each zone are set correctly
@@ -2436,10 +2491,8 @@ static void setup_per_zone_lowmem_reserv
*/
void setup_per_zone_pages_min(void)
{
- unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
unsigned long lowmem_pages = 0;
struct zone *zone;
- unsigned long flags;
read_lock_nr_zones();
/* Calculate total number of !ZONE_HIGHMEM pages */
@@ -2448,40 +2501,8 @@ void setup_per_zone_pages_min(void)
lowmem_pages += zone->present_pages;
}
- for_each_zone(zone) {
- unsigned long tmp;
- spin_lock_irqsave(&zone->lru_lock, flags);
- tmp = (pages_min * zone->present_pages) / lowmem_pages;
- if (is_highmem(zone)) {
- /*
- * __GFP_HIGH and PF_MEMALLOC allocations usually don't
- * need highmem pages, so cap pages_min to a small
- * value here.
- *
- * The (pages_high-pages_low) and (pages_low-pages_min)
- * deltas controls asynch page reclaim, and so should
- * not be capped for highmem.
- */
- int min_pages;
-
- min_pages = zone->present_pages / 1024;
- if (min_pages < SWAP_CLUSTER_MAX)
- min_pages = SWAP_CLUSTER_MAX;
- if (min_pages > 128)
- min_pages = 128;
- zone->pages_min = min_pages;
- } else {
- /*
- * If it's a lowmem zone, reserve a number of pages
- * proportionate to the zone's size.
- */
- zone->pages_min = tmp;
- }
-
- zone->pages_low = zone->pages_min + tmp / 4;
- zone->pages_high = zone->pages_min + tmp / 2;
- spin_unlock_irqrestore(&zone->lru_lock, flags);
- }
+ for_each_zone(zone)
+ setup_zone_pages_min(zone, lowmem_pages);
read_unlock_nr_zones();
}
--
* [PATCH 5/8] Add the pzone_create() function
2006-01-31 2:30 ` [PATCH 0/8] Pzone based CKRM memory resource controller KUROSAWA Takahiro
` (3 preceding siblings ...)
2006-01-31 2:30 ` [PATCH 4/8] Extract zone specific routines as functions KUROSAWA Takahiro
@ 2006-01-31 2:30 ` KUROSAWA Takahiro
2006-01-31 2:30 ` [PATCH 6/8] Add the pzone_destroy() function KUROSAWA Takahiro
` (4 subsequent siblings)
9 siblings, 0 replies; 32+ messages in thread
From: KUROSAWA Takahiro @ 2006-01-31 2:30 UTC (permalink / raw)
To: ckrm-tech; +Cc: linux-mm, KUROSAWA Takahiro
This patch implements creation of pzones. A pzone can be used
for reserving pages in a conventional zone. Pzones are implemented by
extending the zone structure and act almost the same as the conventional
zones; we can specify pzones in a zonelist for __alloc_pages() and the
vmscan code works on pzones with few modifications.
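For example, a caller could carve out a pzone and allocate only from it
roughly as follows (sketch only; the class name and page count are made up):

	struct zonelist zl;
	struct zone *pz;
	struct page *page;

	/* reserve 4096 pages from the parent zone for this class */
	pz = pzone_create(parent_zone, "classA", 4096);
	if (pz) {
		memset(&zl, 0, sizeof(zl));
		zl.zones[0] = pz;	/* allocate only from the pzone */
		page = __alloc_pages(GFP_HIGHUSER, 0, &zl);
	}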
Signed-off-by: KUROSAWA Takahiro <kurosawa@valinux.co.jp>
---
include/linux/mm.h | 49 ++++++++-
include/linux/mmzone.h | 97 +++++++++++++++++
mm/Kconfig | 6 +
mm/page_alloc.c | 266 ++++++++++++++++++++++++++++++++++++++++++++++++-
mm/vmscan.c | 20 +++
5 files changed, 430 insertions(+), 8 deletions(-)
diff -urNp a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h 2006-01-03 12:21:10.000000000 +0900
+++ b/include/linux/mm.h 2006-01-30 14:31:30.000000000 +0900
@@ -397,6 +397,12 @@ void put_page(struct page *page);
* with space for node: | SECTION | NODE | ZONE | ... | FLAGS |
* no space for node: | SECTION | ZONE | ... | FLAGS |
*/
+
+#ifdef CONFIG_PSEUDO_ZONE
+#define PZONE_BIT_WIDTH 1
+#else
+#define PZONE_BIT_WIDTH 0
+#endif
#ifdef CONFIG_SPARSEMEM
#define SECTIONS_WIDTH SECTIONS_SHIFT
#else
@@ -405,16 +411,21 @@ void put_page(struct page *page);
#define ZONES_WIDTH ZONES_SHIFT
-#if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT <= FLAGS_RESERVED
+#if PZONE_BIT_WIDTH+SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT <= FLAGS_RESERVED
#define NODES_WIDTH NODES_SHIFT
#else
#define NODES_WIDTH 0
#endif
-/* Page flags: | [SECTION] | [NODE] | ZONE | ... | FLAGS | */
+/*
+ * Page flags: | [SECTION] | [NODE] | ZONE | [PZONE(0)] | ... | FLAGS |
+ * If PZONE bit is 1, page flags are as follows:
+ * Page flags: | [PZONE index] | [PZONE(1)] | ... | FLAGS |
+ */
#define SECTIONS_PGOFF ((sizeof(unsigned long)*8) - SECTIONS_WIDTH)
#define NODES_PGOFF (SECTIONS_PGOFF - NODES_WIDTH)
#define ZONES_PGOFF (NODES_PGOFF - ZONES_WIDTH)
+#define PZONE_BIT_PGOFF (ZONES_PGOFF - PZONE_BIT_WIDTH)
/*
* We are going to use the flags for the page to node mapping if its in
@@ -434,6 +445,7 @@ void put_page(struct page *page);
#define SECTIONS_PGSHIFT (SECTIONS_PGOFF * (SECTIONS_WIDTH != 0))
#define NODES_PGSHIFT (NODES_PGOFF * (NODES_WIDTH != 0))
#define ZONES_PGSHIFT (ZONES_PGOFF * (ZONES_WIDTH != 0))
+#define PZONE_BIT_PGSHIFT (PZONE_BIT_PGOFF * (PZONE_BIT_WIDTH != 0))
/* NODE:ZONE or SECTION:ZONE is used to lookup the zone from a page. */
#if FLAGS_HAS_NODE
@@ -443,13 +455,14 @@ void put_page(struct page *page);
#endif
#define ZONETABLE_PGSHIFT ZONES_PGSHIFT
-#if SECTIONS_WIDTH+NODES_WIDTH+ZONES_WIDTH > FLAGS_RESERVED
-#error SECTIONS_WIDTH+NODES_WIDTH+ZONES_WIDTH > FLAGS_RESERVED
+#if SECTIONS_WIDTH+NODES_WIDTH+ZONES_WIDTH+PZONE_BIT_WIDTH > FLAGS_RESERVED
+#error SECTIONS_WIDTH+NODES_WIDTH+ZONES_WIDTH+PZONE_BIT_WIDTH > FLAGS_RESERVED
#endif
#define ZONES_MASK ((1UL << ZONES_WIDTH) - 1)
#define NODES_MASK ((1UL << NODES_WIDTH) - 1)
#define SECTIONS_MASK ((1UL << SECTIONS_WIDTH) - 1)
+#define PZONE_BIT_MASK ((1UL << PZONE_BIT_WIDTH) - 1)
#define ZONETABLE_MASK ((1UL << ZONETABLE_SHIFT) - 1)
static inline unsigned long page_zonenum(struct page *page)
@@ -460,6 +473,32 @@ static inline unsigned long page_zonenum
struct zone;
extern struct zone *zone_table[];
+#ifdef CONFIG_PSEUDO_ZONE
+static inline int page_in_pzone(struct page *page)
+{
+ return (page->flags >> PZONE_BIT_PGSHIFT) & PZONE_BIT_MASK;
+}
+
+static inline struct zone *page_zone(struct page *page)
+{
+ int idx;
+
+ idx = (page->flags >> ZONETABLE_PGSHIFT) & ZONETABLE_MASK;
+ if (page_in_pzone(page))
+ return pzone_table[idx].zone;
+ return zone_table[idx];
+}
+
+static inline unsigned long page_to_nid(struct page *page)
+{
+ return page_zone(page)->zone_pgdat->node_id;
+}
+#else
+static inline int page_in_pzone(struct page *page)
+{
+ return 0;
+}
+
static inline struct zone *page_zone(struct page *page)
{
return zone_table[(page->flags >> ZONETABLE_PGSHIFT) &
@@ -473,6 +512,8 @@ static inline unsigned long page_to_nid(
else
return page_zone(page)->zone_pgdat->node_id;
}
+#endif
+
static inline unsigned long page_to_section(struct page *page)
{
return (page->flags >> SECTIONS_PGSHIFT) & SECTIONS_MASK;
diff -urNp a/include/linux/mmzone.h b/include/linux/mmzone.h
--- a/include/linux/mmzone.h 2006-01-30 14:23:30.000000000 +0900
+++ b/include/linux/mmzone.h 2006-01-30 14:31:30.000000000 +0900
@@ -111,6 +111,15 @@ struct zone {
/* Fields commonly accessed by the page allocator */
unsigned long free_pages;
unsigned long pages_min, pages_low, pages_high;
+
+#ifdef CONFIG_PSEUDO_ZONE
+ /* Pseudo zone members: children list is protected by nr_zones_lock */
+ struct zone *parent;
+ struct list_head children;
+ struct list_head sibling;
+ int pzone_idx;
+#endif
+
/*
* We don't know if the memory that we're going to allocate will be freeable
* or/and it will be released eventually, so to avoid totally wasting several
@@ -340,7 +349,65 @@ unsigned long __init node_memmap_size_by
/*
* zone_idx() returns 0 for the ZONE_DMA zone, 1 for the ZONE_NORMAL zone, etc.
*/
-#define zone_idx(zone) ((zone) - (zone)->zone_pgdat->node_zones)
+#define zone_idx(zone) (real_zone(zone) - (zone)->zone_pgdat->node_zones)
+
+#ifdef CONFIG_PSEUDO_ZONE
+#define MAX_NR_PZONES 1024
+
+struct pzone_table {
+ struct zone *zone;
+ struct list_head list;
+};
+
+extern struct pzone_table pzone_table[];
+
+struct zone *pzone_create(struct zone *z, char *name, int npages);
+
+static inline void zone_init_pzone_link(struct zone *z)
+{
+ z->parent = NULL;
+ INIT_LIST_HEAD(&z->children);
+ INIT_LIST_HEAD(&z->sibling);
+ z->pzone_idx = -1;
+}
+
+static inline int zone_is_pseudo(struct zone *z)
+{
+ return (z->parent != NULL);
+}
+
+static inline struct zone *real_zone(struct zone *z)
+{
+ if (z->parent)
+ return z->parent;
+ return z;
+}
+
+static inline struct zone *pzone_next_in_zone(struct zone *z)
+{
+ if (zone_is_pseudo(z)) {
+ if (z->sibling.next == &z->parent->children)
+ z = NULL;
+ else
+ z = list_entry(z->sibling.next, struct zone, sibling);
+ } else {
+ if (list_empty(&z->children))
+ z = NULL;
+ else
+ z = list_entry(z->children.next, struct zone, sibling);
+ }
+
+ return z;
+}
+
+#else
+#define MAX_NR_PZONES 0
+
+static inline void zone_init_pzone_link(struct zone *z) {}
+
+static inline int zone_is_pseudo(struct zone *z) { return 0; }
+static inline struct zone *real_zone(struct zone *z) { return z; }
+#endif
/**
* for_each_pgdat - helper macro to iterate over all nodes
@@ -364,6 +431,19 @@ static inline struct zone *next_zone(str
{
pg_data_t *pgdat = zone->zone_pgdat;
+#ifdef CONFIG_PSEUDO_ZONE
+ if (zone_is_pseudo(zone)) {
+ if (zone->sibling.next != &zone->parent->children)
+ return list_entry(zone->sibling.next, struct zone,
+ sibling);
+ else
+ zone = zone->parent;
+ } else {
+ if (!list_empty(&zone->children))
+ return list_entry(zone->children.next, struct zone,
+ sibling);
+ }
+#endif
if (zone < pgdat->node_zones + MAX_NR_ZONES - 1)
zone++;
else if (pgdat->pgdat_next) {
@@ -379,6 +459,19 @@ static inline struct zone *next_zone_in_
{
pg_data_t *pgdat = zone->zone_pgdat;
+#ifdef CONFIG_PSEUDO_ZONE
+ if (zone_is_pseudo(zone)) {
+ if (zone->sibling.next != &zone->parent->children)
+ return list_entry(zone->sibling.next, struct zone,
+ sibling);
+ else
+ zone = zone->parent;
+ } else {
+ if (!list_empty(&zone->children))
+ return list_entry(zone->children.next, struct zone,
+ sibling);
+ }
+#endif
if (zone < pgdat->node_zones + len - 1)
zone++;
else
@@ -425,11 +518,13 @@ static inline int is_normal_idx(int idx)
*/
static inline int is_highmem(struct zone *zone)
{
+ zone = real_zone(zone);
return zone == zone->zone_pgdat->node_zones + ZONE_HIGHMEM;
}
static inline int is_normal(struct zone *zone)
{
+ zone = real_zone(zone);
return zone == zone->zone_pgdat->node_zones + ZONE_NORMAL;
}
diff -urNp a/mm/Kconfig b/mm/Kconfig
--- a/mm/Kconfig 2006-01-03 12:21:10.000000000 +0900
+++ b/mm/Kconfig 2006-01-30 14:31:30.000000000 +0900
@@ -132,3 +132,9 @@ config SPLIT_PTLOCK_CPUS
default "4096" if ARM && !CPU_CACHE_VIPT
default "4096" if PARISC && !PA20
default "4"
+
+config PSEUDO_ZONE
+ bool "Pseudo zone support"
+ help
+ This option provides pseudo zone creation from a non-pseudo zone.
+ Pseudo zones could be used for memory resource management.
diff -urNp a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c 2006-01-30 14:23:30.000000000 +0900
+++ b/mm/page_alloc.c 2006-01-30 14:31:30.000000000 +0900
@@ -309,6 +309,14 @@ static inline void __free_pages_bulk (st
BUG_ON(bad_range(zone, page));
zone->free_pages += order_size;
+
+ /*
+ * Do not concatenate a page in the pzone.
+ * Order>0 pages are never allocated from pzones (so far?).
+ */
+ if (unlikely(page_in_pzone(page)))
+ goto skip_buddy;
+
while (order < MAX_ORDER-1) {
unsigned long combined_idx;
struct free_area *area;
@@ -321,6 +329,7 @@ static inline void __free_pages_bulk (st
break;
if (!page_is_buddy(buddy, order))
break; /* Move the buddy up one level. */
+ BUG_ON(page_zone(page) != page_zone(buddy));
list_del(&buddy->lru);
area = zone->free_area + order;
area->nr_free--;
@@ -330,6 +339,8 @@ static inline void __free_pages_bulk (st
order++;
}
set_page_order(page, order);
+
+skip_buddy: /* Keep order and PagePrivate unset for pzone pages. */
list_add(&page->lru, &zone->free_area[order].free_list);
zone->free_area[order].nr_free++;
}
@@ -588,7 +599,7 @@ void drain_remote_pages(void)
}
#endif
-#if defined(CONFIG_PM) || defined(CONFIG_HOTPLUG_CPU)
+#if defined(CONFIG_PM) || defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_PSEUDO_ZONE)
static void __drain_zone_pages(struct zone *zone, int cpu)
{
struct per_cpu_pageset *pset;
@@ -613,7 +624,7 @@ static void __drain_pages(unsigned int c
__drain_zone_pages(zone, cpu);
read_unlock_nr_zones();
}
-#endif /* CONFIG_PM || CONFIG_HOTPLUG_CPU */
+#endif /* CONFIG_PM || CONFIG_HOTPLUG_CPU || CONFIG_PSEUDO_ZONE */
#ifdef CONFIG_PM
@@ -2023,6 +2034,7 @@ static void __init free_area_init_core(s
zone->temp_priority = zone->prev_priority = DEF_PRIORITY;
+ zone_init_pzone_link(zone);
zone_pcp_init(zone);
INIT_LIST_HEAD(&zone->active_list);
INIT_LIST_HEAD(&zone->inactive_list);
@@ -2704,3 +2716,253 @@ void write_unlock_nr_zones(unsigned long
{
spin_unlock_irqrestore(&nr_zones_lock, *flagsp);
}
+
+
+#ifdef CONFIG_PSEUDO_ZONE
+
+#include <linux/mm_inline.h>
+
+struct pzone_table pzone_table[MAX_NR_PZONES];
+EXPORT_SYMBOL(pzone_table);
+
+static struct list_head pzone_freelist = LIST_HEAD_INIT(pzone_freelist);
+
+static int pzone_table_register(struct zone *z)
+{
+ struct pzone_table *t;
+ unsigned long flags;
+
+ write_lock_nr_zones(&flags);
+ if (list_empty(&pzone_freelist)) {
+ write_unlock_nr_zones(&flags);
+ return -ENOMEM;
+ }
+
+ t = list_entry(pzone_freelist.next, struct pzone_table, list);
+ list_del(&t->list);
+ z->pzone_idx = t - pzone_table;
+ t->zone = z;
+ write_unlock_nr_zones(&flags);
+
+ return 0;
+}
+
+static void pzone_parent_register(struct zone *z, struct zone *parent)
+{
+ unsigned long flags;
+
+ write_lock_nr_zones(&flags);
+ list_add(&z->sibling, &parent->children);
+ write_unlock_nr_zones(&flags);
+}
+
+/*
+ * pzone alloc/free routines
+ */
+#ifdef CONFIG_NUMA
+static int pzone_setup_pagesets(struct zone *z)
+{
+ struct per_cpu_pageset *pageset;
+ int batch;
+ int nid;
+ int i;
+
+ zone_pcp_init(z);
+
+ nid = z->zone_pgdat->node_id;
+ batch = zone_batchsize(z);
+
+ lock_cpu_hotplug();
+ for_each_online_cpu(i) {
+ pageset = kmalloc_node(sizeof(*pageset), GFP_KERNEL, nid);
+ if (!pageset)
+ goto bad;
+ z->pageset[i] = pageset;
+ setup_pageset(pageset, batch);
+ }
+ unlock_cpu_hotplug();
+
+ return 0;
+bad:
+ for (i = 0; i < NR_CPUS; i++) {
+ if (z->pageset[i] != &boot_pageset[i])
+ kfree(z->pageset[i]);
+ z->pageset[i] = NULL;
+ }
+ unlock_cpu_hotplug();
+
+ return -ENOMEM;
+}
+
+static void pzone_free_pagesets(struct zone *z)
+{
+ int i;
+
+ for (i = 0; i < NR_CPUS; i++) {
+ if (z->pageset[i] && (zone_pcp(z, i) != &boot_pageset[i])) {
+ BUG_ON(zone_pcp(z, i)->pcp[0].count != 0);
+ BUG_ON(zone_pcp(z, i)->pcp[1].count != 0);
+ kfree(zone_pcp(z, i));
+ }
+ zone_pcp(z, i) = NULL;
+ }
+}
+#else /* !CONFIG_NUMA */
+static inline int pzone_setup_pagesets(struct zone *z)
+{
+ int batch;
+ int i;
+
+ batch = zone_batchsize(z);
+ for (i = 0; i < NR_CPUS; i++)
+ setup_pageset(zone_pcp(z, i), batch);
+
+ return 0;
+}
+
+static inline void pzone_free_pagesets(struct zone *z)
+{
+ int i;
+
+ for (i = 0; i < NR_CPUS; i++) {
+ BUG_ON(zone_pcp(z, i)->pcp[0].count != 0);
+ BUG_ON(zone_pcp(z, i)->pcp[1].count != 0);
+ }
+}
+#endif /* CONFIG_NUMA */
+
+static inline void pzone_setup_page_flags(struct zone *z,
+ struct page *page)
+{
+ page->flags &= ~(ZONETABLE_MASK << ZONETABLE_PGSHIFT);
+ page->flags |= ((unsigned long)z->pzone_idx << ZONETABLE_PGSHIFT);
+ page->flags |= 1UL << PZONE_BIT_PGSHIFT;
+}
+
+static inline void pzone_restore_page_flags(struct zone *parent,
+ struct page *page)
+{
+ set_page_links(page, zone_idx(parent), parent->zone_pgdat->node_id,
+ page_to_pfn(page));
+ page->flags &= ~(1UL << PZONE_BIT_PGSHIFT);
+}
+
+struct zone *pzone_create(struct zone *parent, char *name, int npages)
+{
+ struct zonelist zonelist;
+ struct zone *z;
+ struct page *page;
+ struct list_head *l;
+ unsigned long flags;
+ int len;
+ int i;
+
+ if (npages > parent->present_pages)
+ return NULL;
+
+ z = kmalloc_node(sizeof(*z), GFP_KERNEL, parent->zone_pgdat->node_id);
+ if (!z)
+ goto bad1;
+ memset(z, 0, sizeof(*z));
+
+ z->present_pages = z->free_pages = npages;
+ z->parent = parent;
+
+ spin_lock_init(&z->lock);
+ spin_lock_init(&z->lru_lock);
+ INIT_LIST_HEAD(&z->active_list);
+ INIT_LIST_HEAD(&z->inactive_list);
+
+ INIT_LIST_HEAD(&z->children);
+ INIT_LIST_HEAD(&z->sibling);
+
+ z->zone_pgdat = parent->zone_pgdat;
+ z->zone_mem_map = parent->zone_mem_map;
+ z->zone_start_pfn = parent->zone_start_pfn;
+ z->spanned_pages = parent->spanned_pages;
+ z->temp_priority = z->prev_priority = DEF_PRIORITY;
+
+ /* use wait_table of parents. */
+ z->wait_table = parent->wait_table;
+ z->wait_table_size = parent->wait_table_size;
+ z->wait_table_bits = parent->wait_table_bits;
+
+ len = strlen(name);
+ z->name = kmalloc_node(len + 1, GFP_KERNEL,
+ parent->zone_pgdat->node_id);
+ if (!z->name)
+ goto bad2;
+ strcpy(z->name, name);
+
+ if (pzone_setup_pagesets(z) < 0)
+ goto bad3;
+
+ /* no lowmem for the pseudo zone. leave lowmem_reserve all-0. */
+
+ zone_init_free_lists(z->zone_pgdat, z, z->spanned_pages);
+
+ /* setup a fake zonelist for allocating pages only from the parent. */
+ memset(&zonelist, 0, sizeof(zonelist));
+ zonelist.zones[0] = parent;
+ for (i = 0; i < npages; i++) {
+ page = __alloc_pages(GFP_KERNEL, 0, &zonelist);
+ if (!page)
+ goto bad4;
+ set_page_count(page, 0);
+ list_add(&page->lru, &z->free_area[0].free_list);
+ z->free_area[0].nr_free++;
+ }
+
+ if (pzone_table_register(z))
+ goto bad4;
+
+ list_for_each(l, &z->free_area[0].free_list) {
+ page = list_entry(l, struct page, lru);
+ pzone_setup_page_flags(z, page);
+ }
+
+ spin_lock_irqsave(&parent->lock, flags);
+ parent->present_pages -= npages;
+ spin_unlock_irqrestore(&parent->lock, flags);
+
+ setup_per_zone_pages_min();
+ setup_per_zone_lowmem_reserve();
+ pzone_parent_register(z, parent);
+
+ return z;
+bad4:
+ while (!list_empty(&z->free_area[0].free_list)) {
+ page = list_entry(z->free_area[0].free_list.next,
+ struct page, lru);
+ list_del(&page->lru);
+ pzone_restore_page_flags(parent, page);
+ set_page_count(page, 1);
+ __free_pages(page, 0);
+ }
+
+ pzone_free_pagesets(z);
+bad3:
+ if (z->name)
+ kfree(z->name);
+bad2:
+ kfree(z);
+bad1:
+ setup_per_zone_pages_min();
+ setup_per_zone_lowmem_reserve();
+
+ return NULL;
+}
+
+static int pzone_init(void)
+{
+ int i;
+
+ for (i = 0; i < MAX_NR_PZONES; i++)
+ list_add_tail(&pzone_table[i].list, &pzone_freelist);
+
+ return 0;
+}
+
+__initcall(pzone_init);
+
+#endif /* CONFIG_PSEUDO_ZONE */
diff -urNp a/mm/vmscan.c b/mm/vmscan.c
--- a/mm/vmscan.c 2006-01-30 14:23:30.000000000 +0900
+++ b/mm/vmscan.c 2006-01-30 14:31:30.000000000 +0900
@@ -1080,7 +1080,24 @@ loop_again:
* zone which needs scanning
*/
for (i = pgdat->nr_zones - 1; i >= 0; i--) {
- struct zone *zone = pgdat->node_zones + i;
+#ifdef CONFIG_PSEUDO_ZONE
+ for (zone = pgdat->node_zones + i; zone;
+ zone = pzone_next_in_zone(zone)) {
+ if (zone->present_pages == 0)
+ continue;
+
+ if (zone->all_unreclaimable &&
+ priority != DEF_PRIORITY)
+ continue;
+
+ if (!zone_watermark_ok(zone, order,
+ zone->pages_high, 0, 0)) {
+ end_zone = i;
+ goto scan;
+ }
+ }
+#else /* !CONFIG_PSEUDO_ZONE */
+ zone = pgdat->node_zones + i;
if (zone->present_pages == 0)
continue;
@@ -1094,6 +1111,7 @@ loop_again:
end_zone = i;
goto scan;
}
+#endif /* !CONFIG_PSEUDO_ZONE */
}
goto out;
} else {
^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH 6/8] Add the pzone_destroy() function
2006-01-31 2:30 ` [PATCH 0/8] Pzone based CKRM memory resource controller KUROSAWA Takahiro
` (4 preceding siblings ...)
2006-01-31 2:30 ` [PATCH 5/8] Add the pzone_create() function KUROSAWA Takahiro
@ 2006-01-31 2:30 ` KUROSAWA Takahiro
2006-01-31 2:30 ` [PATCH 7/8] Make the number of pages in pzones resizable KUROSAWA Takahiro
` (3 subsequent siblings)
9 siblings, 0 replies; 32+ messages in thread
From: KUROSAWA Takahiro @ 2006-01-31 2:30 UTC (permalink / raw)
To: ckrm-tech; +Cc: linux-mm, KUROSAWA Takahiro
This patch implements destruction of pzones. Pages in a destroyed
pzone are returned to the parent zone (the zone from which the pzone
was created).
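(Editorial sketch, not part of the patch: the intended pairing of
pzone_create() and pzone_destroy(), assuming node 0's ZONE_NORMAL as the
parent; the zone name, the 1024-page size and the initcall wrapper are
illustrative only.)

#include <linux/init.h>
#include <linux/errno.h>
#include <linux/mmzone.h>

/* Sketch only: reserve 1024 pages from node 0's ZONE_NORMAL as a pzone,
 * then destroy it so the pages drain back into the parent zone.
 */
static int __init pzone_example(void)
{
	struct zone *parent = NODE_DATA(0)->node_zones + ZONE_NORMAL;
	struct zone *pz;

	pz = pzone_create(parent, "example", 1024);
	if (!pz)
		return -ENOMEM;

	/* ... put pz into a private zonelist and allocate from it ... */

	pzone_destroy(pz);	/* pages are returned to ZONE_NORMAL */
	return 0;
}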
Signed-off-by: KUROSAWA Takahiro <kurosawa@valinux.co.jp>
---
include/linux/mmzone.h | 1
include/linux/swap.h | 2
mm/page_alloc.c | 287 +++++++++++++++++++++++++++++++++++++++++++++++++
mm/vmscan.c | 4
4 files changed, 292 insertions(+), 2 deletions(-)
diff -urNp a/include/linux/mmzone.h b/include/linux/mmzone.h
--- a/include/linux/mmzone.h 2006-01-30 14:33:44.000000000 +0900
+++ b/include/linux/mmzone.h 2006-01-30 14:34:39.000000000 +0900
@@ -362,6 +362,7 @@ struct pzone_table {
extern struct pzone_table pzone_table[];
struct zone *pzone_create(struct zone *z, char *name, int npages);
+void pzone_destroy(struct zone *z);
static inline void zone_init_pzone_link(struct zone *z)
{
diff -urNp a/include/linux/swap.h b/include/linux/swap.h
--- a/include/linux/swap.h 2006-01-03 12:21:10.000000000 +0900
+++ b/include/linux/swap.h 2006-01-30 11:23:03.000000000 +0900
@@ -171,6 +171,8 @@ extern int rotate_reclaimable_page(struc
extern void swap_setup(void);
/* linux/mm/vmscan.c */
+extern int isolate_lru_pages(int, struct list_head *, struct list_head *,
+ int *);
extern int try_to_free_pages(struct zone **, gfp_t);
extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
extern int shrink_all_memory(int);
diff -urNp a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c 2006-01-30 14:33:44.000000000 +0900
+++ b/mm/page_alloc.c 2006-01-30 14:34:39.000000000 +0900
@@ -2727,6 +2727,9 @@ EXPORT_SYMBOL(pzone_table);
static struct list_head pzone_freelist = LIST_HEAD_INIT(pzone_freelist);
+static struct workqueue_struct *pzone_drain_wq;
+static DEFINE_PER_CPU(struct work_struct, pzone_drain_work);
+
static int pzone_table_register(struct zone *z)
{
struct pzone_table *t;
@@ -2747,6 +2750,18 @@ static int pzone_table_register(struct z
return 0;
}
+static void pzone_table_unregister(struct zone *z)
+{
+ struct pzone_table *t;
+ unsigned long flags;
+
+ write_lock_nr_zones(&flags);
+ t = &pzone_table[z->pzone_idx];
+ t->zone = NULL;
+ list_add(&t->list, &pzone_freelist);
+ write_unlock_nr_zones(&flags);
+}
+
static void pzone_parent_register(struct zone *z, struct zone *parent)
{
unsigned long flags;
@@ -2756,6 +2771,15 @@ static void pzone_parent_register(struct
write_unlock_nr_zones(&flags);
}
+static void pzone_parent_unregister(struct zone *z)
+{
+ unsigned long flags;
+
+ write_lock_nr_zones(&flags);
+ list_del(&z->sibling);
+ write_unlock_nr_zones(&flags);
+}
+
/*
* pzone alloc/free routines
*/
@@ -2847,6 +2871,194 @@ static inline void pzone_restore_page_fl
page->flags &= ~(1UL << PZONE_BIT_PGSHIFT);
}
+/*
+ * pzone_bad_range(): implemented for debugging instead of bad_range()
+ * in order to distinguish what causes the crash.
+ */
+static int pzone_bad_range(struct zone *zone, struct page *page)
+{
+ if (page_to_pfn(page) >= zone->zone_start_pfn + zone->spanned_pages)
+ BUG();
+ if (page_to_pfn(page) < zone->zone_start_pfn)
+ BUG();
+#ifdef CONFIG_HOLES_IN_ZONE
+ if (!pfn_valid(page_to_pfn(page)))
+ BUG();
+#endif
+ if (zone != page_zone(page))
+ BUG();
+ return 0;
+}
+
+static void pzone_drain(void *arg)
+{
+ lru_add_drain();
+}
+
+static void pzone_punt_drain(void *arg)
+{
+ struct work_struct *wp;
+
+ wp = &get_cpu_var(pzone_drain_work);
+ PREPARE_WORK(wp, pzone_drain, arg);
+ /* queue_work() checks whether the work is used or not. */
+ queue_work(pzone_drain_wq, wp);
+ put_cpu_var(pzone_drain_work);
+}
+
+static void pzone_flush_percpu(void *arg)
+{
+ struct zone *z = arg;
+ unsigned long flags;
+ int cpu;
+
+ /*
+ * lru_add_drain() must not be called from interrupt context
+ * (LRU pagevecs are interrupt unsafe).
+ */
+
+ local_irq_save(flags);
+ cpu = smp_processor_id();
+ pzone_punt_drain(arg);
+ __drain_zone_pages(z, cpu);
+ local_irq_restore(flags);
+}
+
+static int pzone_flush_lru(struct zone *z, struct zone *parent,
+ struct list_head *clist, unsigned long *cnr,
+ int block)
+{
+ unsigned long flags;
+ struct page *page;
+ struct list_head list;
+ int n, moved, scan;
+
+ INIT_LIST_HEAD(&list);
+
+ spin_lock_irqsave(&z->lru_lock, flags);
+ n = isolate_lru_pages(*cnr, clist, &list, &scan);
+ *cnr -= n;
+ spin_unlock_irqrestore(&z->lru_lock, flags);
+
+ moved = 0;
+ while (!list_empty(&list) && n-- > 0) {
+ page = list_entry(list.prev, struct page, lru);
+ list_del(&page->lru);
+
+ if (block) {
+ lock_page(page);
+ wait_on_page_writeback(page);
+ } else {
+ if (TestSetPageLocked(page))
+ goto goaround;
+
+ /* Make sure the writeback bit is kept zero. */
+ if (PageWriteback(page))
+ goto goaround_pagelocked;
+ }
+
+ /* Now we can safely modify the flags field. */
+ pzone_restore_page_flags(parent, page);
+ unlock_page(page);
+
+ spin_lock_irqsave(&parent->lru_lock, flags);
+ if (TestSetPageLRU(page))
+ BUG();
+
+ __put_page(page);
+ if (PageActive(page))
+ add_page_to_active_list(parent, page);
+ else
+ add_page_to_inactive_list(parent, page);
+ spin_unlock_irqrestore(&parent->lru_lock, flags);
+
+ moved++;
+ continue;
+
+goaround_pagelocked:
+ unlock_page(page);
+goaround:
+ spin_lock_irqsave(&z->lru_lock, flags);
+ __put_page(page);
+ if (TestSetPageLRU(page))
+ BUG();
+ list_add(&page->lru, clist);
+ ++*cnr;
+ spin_unlock_irqrestore(&z->lru_lock, flags);
+ }
+
+ return moved;
+}
+
+static void pzone_flush_free_area(struct zone *z)
+{
+ struct free_area *area;
+ struct page *page;
+ struct list_head list;
+ unsigned long flags;
+ int order;
+
+ INIT_LIST_HEAD(&list);
+
+ spin_lock_irqsave(&z->lock, flags);
+ area = &z->free_area[0];
+ while (!list_empty(&area->free_list)) {
+ page = list_entry(area->free_list.next, struct page, lru);
+ list_del(&page->lru);
+ area->nr_free--;
+ z->free_pages--;
+ z->present_pages--;
+ spin_unlock_irqrestore(&z->lock, flags);
+ pzone_restore_page_flags(z->parent, page);
+ pzone_bad_range(z->parent, page);
+ list_add(&page->lru, &list);
+ free_pages_bulk(z->parent, 1, &list, 0);
+
+ spin_lock_irqsave(&z->lock, flags);
+ }
+
+ BUG_ON(area->nr_free != 0);
+ spin_unlock_irqrestore(&z->lock, flags);
+
+ /* currently the pzone supports order-0 only. do a sanity check. */
+ spin_lock_irqsave(&z->lock, flags);
+ for (order = 1; order < MAX_ORDER; order++) {
+ area = &z->free_area[order];
+ BUG_ON(area->nr_free != 0);
+ }
+ spin_unlock_irqrestore(&z->lock, flags);
+}
+
+static int pzone_is_empty(struct zone *z)
+{
+ unsigned long flags;
+ int ret = 0;
+ int i;
+
+ spin_lock_irqsave(&z->lock, flags);
+ ret += z->present_pages;
+ ret += z->free_pages;
+ ret += z->free_area[0].nr_free;
+
+ /* would better use smp_call_function for scanning pcp. */
+ for (i = 0; i < NR_CPUS; i++) {
+#ifdef CONFIG_NUMA
+ if (!zone_pcp(z, i) || (zone_pcp(z, i) == &boot_pageset[i]))
+ continue;
+#endif
+ ret += zone_pcp(z, i)->pcp[0].count;
+ ret += zone_pcp(z, i)->pcp[1].count;
+ }
+ spin_unlock_irqrestore(&z->lock, flags);
+
+ spin_lock_irqsave(&z->lru_lock, flags);
+ ret += z->nr_active;
+ ret += z->nr_inactive;
+ spin_unlock_irqrestore(&z->lru_lock, flags);
+
+ return ret == 0;
+}
+
struct zone *pzone_create(struct zone *parent, char *name, int npages)
{
struct zonelist zonelist;
@@ -2953,10 +3165,85 @@ bad1:
return NULL;
}
+#define PZONE_FLUSH_LOOP_COUNT 8
+
+/*
+ * destroying pseudo zone. the caller should make sure that no one references
+ * this pseudo zone.
+ */
+void pzone_destroy(struct zone *z)
+{
+ struct zone *parent;
+ unsigned long flags;
+ unsigned long present;
+ int freed;
+ int retrycnt = 0;
+
+ parent = z->parent;
+ present = z->present_pages;
+ pzone_parent_unregister(z);
+retry:
+ /* drain pages in per-cpu pageset to free_area */
+ smp_call_function(pzone_flush_percpu, z, 0, 1);
+ pzone_flush_percpu(z);
+
+ /* drain pages in the LRU list. */
+ freed = pzone_flush_lru(z, parent, &z->active_list, &z->nr_active,
+ retrycnt > 0);
+ spin_lock_irqsave(&z->lock, flags);
+ z->present_pages -= freed;
+ spin_unlock_irqrestore(&z->lock, flags);
+
+ freed = pzone_flush_lru(z, parent, &z->inactive_list, &z->nr_inactive,
+ retrycnt > 0);
+ spin_lock_irqsave(&z->lock, flags);
+ z->present_pages -= freed;
+ spin_unlock_irqrestore(&z->lock, flags);
+
+ pzone_flush_free_area(z);
+
+ if (!pzone_is_empty(z)) {
+ retrycnt++;
+ if (retrycnt > PZONE_FLUSH_LOOP_COUNT) {
+ BUG();
+ } else {
+ flush_workqueue(pzone_drain_wq);
+ set_current_state(TASK_UNINTERRUPTIBLE);
+ schedule_timeout(HZ);
+ goto retry;
+ }
+ }
+
+ spin_lock_irqsave(&parent->lock, flags);
+ parent->present_pages += present;
+ spin_unlock_irqrestore(&parent->lock, flags);
+
+ flush_workqueue(pzone_drain_wq);
+ pzone_table_unregister(z);
+ pzone_free_pagesets(z);
+ kfree(z->name);
+ kfree(z);
+
+ setup_per_zone_pages_min();
+ setup_per_zone_lowmem_reserve();
+}
+
static int pzone_init(void)
{
+ struct work_struct *wp;
int i;
+ pzone_drain_wq = create_workqueue("pzone");
+ if (!pzone_drain_wq) {
+ printk(KERN_ERR "pzone: create_workqueue failed.\n");
+ return -ENOMEM;
+ }
+
+ for (i = 0; i < NR_CPUS; i++) {
+ wp = &per_cpu(pzone_drain_work, i);
+ INIT_WORK(wp, pzone_drain, NULL);
+ }
+
for (i = 0; i < MAX_NR_PZONES; i++)
list_add_tail(&pzone_table[i].list, &pzone_freelist);
diff -urNp a/mm/vmscan.c b/mm/vmscan.c
--- a/mm/vmscan.c 2006-01-30 14:33:44.000000000 +0900
+++ b/mm/vmscan.c 2006-01-30 14:34:39.000000000 +0900
@@ -591,8 +591,8 @@ keep:
*
* returns how many pages were moved onto *@dst.
*/
-static int isolate_lru_pages(int nr_to_scan, struct list_head *src,
- struct list_head *dst, int *scanned)
+int isolate_lru_pages(int nr_to_scan, struct list_head *src,
+ struct list_head *dst, int *scanned)
{
int nr_taken = 0;
struct page *page;
^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH 7/8] Make the number of pages in pzones resizable
2006-01-31 2:30 ` [PATCH 0/8] Pzone based CKRM memory resource controller KUROSAWA Takahiro
` (5 preceding siblings ...)
2006-01-31 2:30 ` [PATCH 6/8] Add the pzone_destroy() function KUROSAWA Takahiro
@ 2006-01-31 2:30 ` KUROSAWA Takahiro
2006-01-31 2:30 ` [PATCH 8/8] Add a CKRM memory resource controller using pzones KUROSAWA Takahiro
` (2 subsequent siblings)
9 siblings, 0 replies; 32+ messages in thread
From: KUROSAWA Takahiro @ 2006-01-31 2:30 UTC (permalink / raw)
To: ckrm-tech; +Cc: linux-mm, KUROSAWA Takahiro
This patch makes the number of pages in the pzones resizable by adding
the pzone_set_numpages() function.
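(Editorial sketch, not from the patch: a caller holding a pzone pointer
from an earlier pzone_create() could resize it roughly like this; the
helper name and the warning message are illustrative only.)

#include <linux/kernel.h>
#include <linux/mmzone.h>

/* Sketch only: grow or shrink an existing pzone to a new target size. */
static int resize_pzone(struct zone *pz, int new_pages)
{
	int err;

	err = pzone_set_numpages(pz, new_pages);
	if (err)
		printk(KERN_WARNING "pzone: resize to %d pages failed (%d)\n",
		       new_pages, err);
	return err;
}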
Signed-off-by: KUROSAWA Takahiro <kurosawa@valinux.co.jp>
---
include/linux/mmzone.h | 1
mm/page_alloc.c | 111 +++++++++++++++++++++++++++++++++++++++++++++++++
mm/vmscan.c | 29 ++++++++++++
3 files changed, 141 insertions(+)
diff -urNp a/include/linux/mmzone.h b/include/linux/mmzone.h
--- a/include/linux/mmzone.h 2006-01-27 15:30:45.000000000 +0900
+++ b/include/linux/mmzone.h 2006-01-27 15:14:37.000000000 +0900
@@ -363,6 +363,7 @@ extern struct pzone_table pzone_table[];
struct zone *pzone_create(struct zone *z, char *name, int npages);
void pzone_destroy(struct zone *z);
+int pzone_set_numpages(struct zone *z, int npages);
static inline void zone_init_pzone_link(struct zone *z)
{
diff -urNp a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c 2006-01-27 15:29:03.000000000 +0900
+++ b/mm/page_alloc.c 2006-01-27 15:14:37.000000000 +0900
@@ -3228,6 +3228,117 @@ retry:
setup_per_zone_lowmem_reserve();
}
+extern int shrink_zone_memory(struct zone *zone, int nr_pages);
+
+static int pzone_move_free_pages(struct zone *dst, struct zone *src,
+ int npages)
+{
+ struct zonelist zonelist;
+ struct list_head pagelist;
+ struct page *page;
+ unsigned long flags;
+ int err;
+ int i;
+
+ err = 0;
+ spin_lock_irqsave(&src->lock, flags);
+ if (npages > src->present_pages)
+ err = -ENOMEM;
+ spin_unlock_irqrestore(&src->lock, flags);
+ if (err)
+ return err;
+
+ smp_call_function(pzone_flush_percpu, src, 0, 1);
+ pzone_flush_percpu(src);
+
+ INIT_LIST_HEAD(&pagelist);
+ memset(&zonelist, 0, sizeof(zonelist));
+ zonelist.zones[0] = src;
+ for (i = 0; i < npages; i++) {
+ /*
+ * XXX to prevent myself from being arrested by oom-killer...
+ * should be replaced to the cleaner code.
+ */
+ if (src->free_pages < npages - i) {
+ shrink_zone_memory(src, npages - i);
+ smp_call_function(pzone_flush_percpu, src, 0, 1);
+ pzone_flush_percpu(src);
+ blk_congestion_wait(WRITE, HZ/50);
+ }
+
+ page = __alloc_pages(GFP_KERNEL, 0, &zonelist);
+ if (!page) {
+ err = -ENOMEM;
+ goto bad;
+ }
+ list_add(&page->lru, &pagelist);
+ }
+
+ while (!list_empty(&pagelist)) {
+ page = list_entry(pagelist.next, struct page, lru);
+ list_del(&page->lru);
+ if (zone_is_pseudo(dst))
+ pzone_setup_page_flags(dst, page);
+ else
+ pzone_restore_page_flags(dst, page);
+
+ set_page_count(page, 1);
+ spin_lock_irqsave(&dst->lock, flags);
+ dst->present_pages++;
+ spin_unlock_irqrestore(&dst->lock, flags);
+ __free_pages(page, 0);
+ }
+
+ spin_lock_irqsave(&src->lock, flags);
+ src->present_pages -= npages;
+ spin_unlock_irqrestore(&src->lock, flags);
+
+ return 0;
+bad:
+ while (!list_empty(&pagelist)) {
+ page = list_entry(pagelist.next, struct page, lru);
+ list_del(&page->lru);
+ __free_pages(page, 0);
+ }
+
+ return err;
+}
+
+int pzone_set_numpages(struct zone *z, int npages)
+{
+ struct zone *src, *dst;
+ unsigned long flags;
+ int err;
+ int n;
+
+ /*
+ * This function must not be called simultaneously so far.
+ * The caller should make sure that.
+ */
+ if (z->present_pages == npages) {
+ return 0;
+ } else if (z->present_pages > npages) {
+ n = z->present_pages - npages;
+ src = z;
+ dst = z->parent;
+ } else {
+ n = npages - z->present_pages;
+ src = z->parent;
+ dst = z;
+ }
+
+ /* XXX Preventing oom-killer from complaining */
+ spin_lock_irqsave(&z->lock, flags);
+ z->pages_min = z->pages_low = z->pages_high = 0;
+ spin_unlock_irqrestore(&z->lock, flags);
+
+ err = pzone_move_free_pages(dst, src, n);
+ setup_per_zone_pages_min();
+ setup_per_zone_lowmem_reserve();
+
+ return err;
+}
+
static int pzone_init(void)
{
struct work_struct *wp;
diff -urNp a/mm/vmscan.c b/mm/vmscan.c
--- a/mm/vmscan.c 2006-01-27 15:29:03.000000000 +0900
+++ b/mm/vmscan.c 2006-01-27 15:14:37.000000000 +0900
@@ -1328,6 +1328,35 @@ int shrink_all_memory(int nr_pages)
}
#endif
+#ifdef CONFIG_PSEUDO_ZONE
+int shrink_zone_memory(struct zone *zone, int nr_pages)
+{
+ struct scan_control sc;
+
+ sc.gfp_mask = GFP_KERNEL;
+ sc.may_writepage = 1;
+ sc.may_swap = 1;
+ sc.nr_mapped = read_page_state(nr_mapped);
+ sc.nr_scanned = 0;
+ sc.nr_reclaimed = 0;
+ sc.priority = 0;
+
+ if (nr_pages < SWAP_CLUSTER_MAX)
+ sc.swap_cluster_max = nr_pages;
+ else
+ sc.swap_cluster_max = SWAP_CLUSTER_MAX;
+
+ sc.nr_to_reclaim = sc.swap_cluster_max;
+ sc.nr_to_scan = sc.swap_cluster_max;
+ sc.nr_mapped = total_memory; /* XXX to make vmscan aggressive */
+ refill_inactive_zone(zone, &sc);
+ sc.nr_to_scan = sc.swap_cluster_max;
+ shrink_cache(zone, &sc);
+
+ return sc.nr_reclaimed;
+}
+#endif
+
#ifdef CONFIG_HOTPLUG_CPU
/* It's optimal to keep kswapds on the same CPUs as their memory, but
not required for correctness. So if the last cpu in a node goes
^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH 8/8] Add a CKRM memory resource controller using pzones
2006-01-31 2:30 ` [PATCH 0/8] Pzone based CKRM memory resource controller KUROSAWA Takahiro
` (6 preceding siblings ...)
2006-01-31 2:30 ` [PATCH 7/8] Make the number of pages in pzones resizable KUROSAWA Takahiro
@ 2006-01-31 2:30 ` KUROSAWA Takahiro
2006-02-01 2:58 ` [ckrm-tech] [PATCH 0/8] Pzone based CKRM memory resource controller chandra seetharaman
2006-02-01 3:07 ` chandra seetharaman
9 siblings, 0 replies; 32+ messages in thread
From: KUROSAWA Takahiro @ 2006-01-31 2:30 UTC (permalink / raw)
To: ckrm-tech; +Cc: linux-mm, KUROSAWA Takahiro
This patch implements the CKRM memory resource controller using
pzones. It requires a CKRM-patched source tree.
CKRM patches can be obtained from
http://sourceforge.net/project/showfiles.php?group_id=85838&package_id=163747
The CKRM patches require the configfs patch:
http://oss.oracle.com/projects/ocfs2/dist/files/patches/2.6.15-rc5/2005-12-14/01_configfs.patch
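(Editorial sketch, not from the patch: the share-to-pages translation this
controller uses, following the MEM_RC_METER_TO_PAGES() macro with a
MEM_RC_METER_BASE of 100; the helper name is illustrative only.)

/* Sketch only: a guarantee of 'guar' (out of 100) against a zone that
 * originally had 'zone_pages' pages reserves this many pages for the
 * class's pzone in that zone.
 */
static unsigned long guarantee_to_pages(unsigned long zone_pages,
					unsigned long guar)
{
	return zone_pages * guar / 100;	/* MEM_RC_METER_BASE == 100 */
}

For example, a guarantee of 50 on a 262144-page zone reserves 131072
pages, i.e. half of that zone.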
Signed-off-by: MAEDA Naoaki <maeda.naoaki@jp.fujitsu.com>
---
include/linux/gfp.h | 31 ++
mm/Kconfig | 8
mm/Makefile | 2
mm/mem_rc_pzone.c | 632 ++++++++++++++++++++++++++++++++++++++++++++++++++++
mm/mempolicy.c | 10
5 files changed, 680 insertions(+), 3 deletions(-)
diff -urNp b/include/linux/gfp.h c/include/linux/gfp.h
--- b/include/linux/gfp.h 2006-01-26 18:08:29.000000000 +0900
+++ c/include/linux/gfp.h 2006-01-26 17:52:06.000000000 +0900
@@ -104,12 +104,43 @@ static inline void arch_free_page(struct
extern struct page *
FASTCALL(__alloc_pages(gfp_t, unsigned int, struct zonelist *));
+#ifdef CONFIG_MEM_RC
+static inline int mem_rc_available(gfp_t gfp_mask, unsigned int order)
+{
+ gfp_mask &= GFP_LEVEL_MASK & ~(__GFP_HIGHMEM | __GFP_COLD);
+ return gfp_mask == GFP_USER && order == 0;
+}
+
+extern struct page *alloc_page_mem_rc(int nid, gfp_t gfp_mask);
+extern struct zonelist *mem_rc_get_zonelist(int nd, gfp_t gfp_mask,
+ unsigned int order);
+#else
+static inline int mem_rc_available(gfp_t gfp_mask, unsigned int order)
+{
+ return 0;
+}
+
+static inline struct page *alloc_page_mem_rc(int nid, gfp_t gfp_mask)
+{
+ return NULL;
+}
+
+static inline struct zonelist *mem_rc_get_zonelist(int nd, gfp_t gfp_mask,
+ unsigned int order)
+{
+ return NULL;
+}
+#endif
+
static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
unsigned int order)
{
if (unlikely(order >= MAX_ORDER))
return NULL;
+ if (mem_rc_available(gfp_mask, order))
+ return alloc_page_mem_rc(nid, gfp_mask);
+
return __alloc_pages(gfp_mask, order,
NODE_DATA(nid)->node_zonelists + gfp_zone(gfp_mask));
}
diff -urNp b/mm/Kconfig c/mm/Kconfig
--- b/mm/Kconfig 2006-01-26 18:09:01.000000000 +0900
+++ c/mm/Kconfig 2006-01-26 18:07:11.000000000 +0900
@@ -138,3 +138,11 @@ config PSEUDO_ZONE
help
This option provides pseudo zone creation from a non-pseudo zone.
Pseudo zones could be used for memory resource management.
+
+config MEM_RC
+ bool "Memory resource controller"
+ select PSEUDO_ZONE
+ depends on CPUMETER || CKRM
+ help
+ This option will let you control the memory resource by using
+ the pseudo zone.
diff -urNp b/mm/Makefile c/mm/Makefile
--- b/mm/Makefile 2006-01-26 18:09:11.000000000 +0900
+++ c/mm/Makefile 2006-01-26 17:52:06.000000000 +0900
@@ -20,3 +20,5 @@ obj-$(CONFIG_SHMEM) += shmem.o
obj-$(CONFIG_TINY_SHMEM) += tiny-shmem.o
obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
obj-$(CONFIG_FS_XIP) += filemap_xip.o
+
+obj-$(CONFIG_MEM_RC) += mem_rc_pzone.o
diff -urNp b/mm/mem_rc_pzone.c c/mm/mem_rc_pzone.c
--- b/mm/mem_rc_pzone.c 1970-01-01 09:00:00.000000000 +0900
+++ c/mm/mem_rc_pzone.c 2006-01-26 18:07:11.000000000 +0900
@@ -0,0 +1,632 @@
+/*
+ * mm/mem_rc_pzone.c
+ *
+ * Memory resource controller by using pzones.
+ *
+ * Copyright 2005 FUJITSU LIMITED
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file COPYING in the main directory of the Linux
+ * distribution for more details.
+ */
+
+#include <linux/config.h>
+#include <linux/stddef.h>
+#include <linux/compiler.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/mm.h>
+#include <linux/mmzone.h>
+#include <linux/cpuset.h>
+#include <linux/bitops.h>
+#include <linux/cpumask.h>
+#include <linux/nodemask.h>
+#include <linux/ckrm_rc.h>
+
+#include <asm/semaphore.h>
+
+#define MEM_RC_METER_BASE 100
+#define MEM_RC_METER_TO_PAGES(_rcd, _node, _zidx, _val) \
+ ((_rcd)->zone_pages[(_node)][(_zidx)] * (_val) / MEM_RC_METER_BASE)
+
+struct mem_rc_domain {
+ struct semaphore sem;
+ nodemask_t nodes;
+ unsigned long *zone_pages[MAX_NUMNODES];
+};
+
+struct mem_rc {
+ unsigned long guarantee;
+ struct mem_rc_domain *rcd;
+ struct zone **zones[MAX_NUMNODES];
+ struct zonelist *zonelists[MAX_NUMNODES];
+};
+
+
+struct ckrm_mem {
+ struct ckrm_class *class; /* the class I belong to */
+ struct ckrm_class *parent; /* parent of the class above. */
+ struct ckrm_shares shares;
+ spinlock_t cnt_lock; /* always grab parent's lock before child's */
+ struct mem_rc *mem_rc; /* mem resource controller */
+ int cnt_total_guarantee; /* total guarantee behind the class */
+};
+
+static struct mem_rc_domain *grcd; /* system wide resource controller domain */
+static struct ckrm_res_ctlr rcbs; /* resource controller callback structure */
+
+static void mem_rc_destroy_rcdomain(void *arg)
+{
+ struct mem_rc_domain *rcd = arg;
+ int node;
+
+ for_each_node_mask(node, rcd->nodes) {
+ if (rcd->zone_pages[node])
+ kfree(rcd->zone_pages[node]);
+ }
+
+ kfree(rcd);
+}
+
+static void *mem_rc_create_rcdomain(struct cpuset *cs,
+ cpumask_t cpus, nodemask_t mems)
+{
+ struct mem_rc_domain *rcd;
+ struct zone *z;
+ pg_data_t *pgdat;
+ unsigned long *pp;
+ int i, node, allocn;
+
+ allocn = first_node(mems);
+ rcd = kmalloc_node(sizeof(*rcd), GFP_KERNEL, allocn);
+ if (!rcd)
+ return NULL;
+
+ memset(rcd, 0, sizeof(*rcd));
+
+ init_MUTEX(&rcd->sem);
+ rcd->nodes = mems;
+ for_each_node_mask(node, mems) {
+ pgdat = NODE_DATA(node);
+
+ pp = kmalloc_node(sizeof(unsigned long) * MAX_NR_ZONES,
+ GFP_KERNEL, allocn);
+ if (!pp)
+ goto failed;
+
+ rcd->zone_pages[node] = pp;
+ for (i = 0; i < MAX_NR_ZONES; i++) {
+ if (i == ZONE_DMA) {
+ pp[i] = 0;
+ continue;
+ }
+ z = pgdat->node_zones + i;
+ pp[i] = z->present_pages;
+ }
+ }
+
+ return rcd;
+
+failed:
+ mem_rc_destroy_rcdomain(rcd);
+
+ return NULL;
+}
+
+
+static void *mem_rc_create(void *arg, char *name)
+{
+ struct mem_rc_domain *rcd = arg;
+ struct mem_rc *mr;
+ struct zonelist *zl, *zl_ref;
+ struct zone *parent, *z, *z_ref;
+ pg_data_t *pgdat;
+ int node, allocn;
+ int i, j;
+
+ allocn = first_node(rcd->nodes);
+ mr = kmalloc_node(sizeof(*mr), GFP_KERNEL, allocn);
+ if (!mr)
+ return NULL;
+
+ memset(mr, 0, sizeof(*mr));
+
+ down(&rcd->sem);
+ mr->rcd = rcd;
+ for_each_node_mask(node, rcd->nodes) {
+ pgdat = NODE_DATA(node);
+
+ mr->zones[node]
+ = kmalloc_node(sizeof(*mr->zones[node]) * MAX_NR_ZONES,
+ GFP_KERNEL, allocn);
+ if (!mr->zones[node])
+ goto failed;
+
+ memset(mr->zones[node], 0,
+ sizeof(*mr->zones[node]) * MAX_NR_ZONES);
+
+ mr->zonelists[node]
+ = kmalloc_node(sizeof(*mr->zonelists[node]),
+ GFP_KERNEL, allocn);
+ if (!mr->zonelists[node])
+ goto failed;
+
+ memset(mr->zonelists[node], 0, sizeof(*mr->zonelists[node]));
+
+ for (i = 0; i < MAX_NR_ZONES; i++) {
+ parent = pgdat->node_zones + i;
+ if (rcd->zone_pages[node][i] == 0)
+ continue;
+
+ z = pzone_create(parent, name, 0);
+ if (!z)
+ goto failed;
+ mr->zones[node][i] = z;
+ }
+ }
+
+ for_each_node_mask(node, rcd->nodes) {
+ /* NORMAL zones and DMA zones also in HIGHMEM zonelist. */
+ zl_ref = NODE_DATA(node)->node_zonelists + __GFP_HIGHMEM;
+ zl = mr->zonelists[node];
+
+ for (j = i = 0; i < ARRAY_SIZE(zl_ref->zones); i++) {
+ z_ref = zl_ref->zones[i];
+ if (!z_ref)
+ break;
+
+ z = mr->zones[node][zone_idx(z_ref)];
+ if (!z)
+ continue;
+ zl->zones[j++] = z;
+ }
+ zl->zones[j] = NULL;
+ }
+ up(&rcd->sem);
+
+ return mr;
+
+failed:
+ for_each_node_mask(node, rcd->nodes) {
+ if (mr->zonelists[node])
+ kfree(mr->zonelists[node]);
+
+ if (!mr->zones[node])
+ continue;
+
+ for (i = 0; i < MAX_NR_ZONES; i++) {
+ z = mr->zones[node][i];
+ if (!z)
+ continue;
+ pzone_destroy(z);
+ }
+ kfree(mr->zones[node]);
+ }
+ up(&rcd->sem);
+ kfree(mr);
+
+ return NULL;
+}
+
+static void mem_rc_destroy(void *p)
+{
+ struct mem_rc *mr = p;
+ struct mem_rc_domain *rcd = mr->rcd;
+ struct zone *z;
+ int node, i;
+
+ down(&rcd->sem);
+ for (node = 0; node < MAX_NUMNODES; node++) {
+ if (mr->zonelists[node])
+ kfree(mr->zonelists[node]);
+
+ if (!mr->zones[node])
+ continue;
+
+ for (i = 0; i < MAX_NR_ZONES; i++) {
+ z = mr->zones[node][i];
+ if (z)
+ pzone_destroy(z);
+ mr->zones[node][i] = NULL;
+ }
+ kfree(mr->zones[node]);
+ }
+ up(&rcd->sem);
+
+ kfree(mr);
+}
+
+static int mem_rc_set_guar(void *ctldata, unsigned long val)
+{
+ struct mem_rc *mr = ctldata;
+ struct mem_rc_domain *rcd = mr->rcd;
+ struct zone *z;
+ nodemask_t nodes_done;
+ int err;
+ int node;
+ int i;
+
+ down(&rcd->sem);
+ nodes_clear(nodes_done);
+ for_each_node_mask(node, rcd->nodes) {
+ for (i = 0; i < MAX_NR_ZONES; i++) {
+ z = mr->zones[node][i];
+ if (!z)
+ continue;
+
+ err = pzone_set_numpages(z,
+ MEM_RC_METER_TO_PAGES(rcd,
+ node, i, val));
+ if (err)
+ goto undo;
+ }
+ node_set(node, nodes_done);
+ }
+
+ mr->guarantee = val;
+ up(&rcd->sem);
+
+ return 0;
+
+undo:
+ for (i--; i >= 0; i--)
+ pzone_set_numpages(z, MEM_RC_METER_TO_PAGES(rcd, node, i,
+ mr->guarantee));
+
+ for_each_node_mask(node, nodes_done) {
+ for (i = 0; i < MAX_NR_ZONES; i++) {
+ z = mr->zones[node][i];
+ if (!z)
+ continue;
+
+ pzone_set_numpages(z,
+ MEM_RC_METER_TO_PAGES(rcd,
+ node, i, mr->guarantee));
+ }
+ }
+ up(&rcd->sem);
+
+ return err;
+}
+
+static int mem_rc_get_cur(void *ctldata, unsigned long *valp)
+{
+ struct mem_rc *mr = ctldata;
+ struct mem_rc_domain *rcd = mr->rcd;
+ struct zone *z;
+ unsigned long total, used;
+ int node;
+ int i;
+
+ total = used = 0;
+ for_each_node_mask(node, rcd->nodes) {
+ for (i = 0; i < MAX_NR_ZONES; i++) {
+ z = mr->zones[node][i];
+ if (!z)
+ continue;
+ total += z->present_pages;
+ used += z->present_pages - z->free_pages;
+ }
+ }
+
+ if (total > 0)
+ *valp = mr->guarantee * used / total;
+ else
+ *valp = 0;
+
+ return 0;
+}
+
+#if 0
+static struct cpumeter_ctlr mem_rc_ctlr = {
+ .name = "mem",
+ .create_rcdomain = mem_rc_create_rcdomain,
+ .destroy_rcdomain = mem_rc_destroy_rcdomain,
+ .create = mem_rc_create,
+ .destroy = mem_rc_destroy,
+ .set_guar = mem_rc_set_guar,
+ .set_lim = NULL,
+ .get_cur = mem_rc_get_cur,
+};
+
+int mem_rc_init(void)
+{
+ int err;
+
+ err = cpumeter_register_controller(&mem_rc_ctlr);
+ if (err)
+ printk(KERN_ERR "mem_rc: register to cpumeter failed\n");
+ else
+ printk(KERN_INFO
+ "mem_rc: memory resource controller registered.\n");
+
+ return err;
+}
+
+__initcall(mem_rc_init);
+
+void *mem_rc_get(task_t *tsk)
+{
+ return cpumeter_get_controller_data(tsk->cpuset, &mem_rc_ctlr);
+}
+
+#endif
+
+struct mem_rc *mem_rc_get(task_t *tsk)
+{
+ struct ckrm_class *class = tsk->class;
+ struct ckrm_mem *res;
+
+ if (unlikely(class == NULL))
+ return NULL;
+
+ res = ckrm_get_res_class(class, rcbs.resid, struct ckrm_mem);
+
+ if (unlikely(res == NULL))
+ return NULL;
+
+ return res->mem_rc;
+}
+EXPORT_SYMBOL(mem_rc_get);
+
+struct page *alloc_page_mem_rc(int nid, gfp_t gfpmask)
+{
+ struct mem_rc *mr;
+
+ mr = mem_rc_get(current);
+ if (!mr)
+ return __alloc_pages(gfpmask, 0,
+ NODE_DATA(nid)->node_zonelists
+ + (gfpmask & GFP_ZONEMASK));
+
+ return __alloc_pages(gfpmask, 0, mr->zonelists[nid]);
+}
+EXPORT_SYMBOL(alloc_page_mem_rc);
+
+struct zonelist *mem_rc_get_zonelist(int nd, gfp_t gfpmask,
+ unsigned int order)
+{
+ struct mem_rc *mr;
+
+ if (!mem_rc_available(gfpmask, order))
+ return NULL;
+
+ mr = mem_rc_get(current);
+ if (!mr)
+ return NULL;
+
+ return mr->zonelists[nd];
+}
+
+static void mem_rc_set_guarantee(struct ckrm_mem *res, int val)
+{
+ int rc;
+
+ if (res->mem_rc == NULL)
+ return;
+
+ res->mem_rc->guarantee = val;
+ rc = mem_rc_set_guar(res->mem_rc, (unsigned long)val);
+ if (rc)
+ printk("mem_rc_set_guar failed, err = %d\n", rc);
+}
+
+static void mem_res_initcls_one(struct ckrm_mem * res)
+{
+ res->shares.my_guarantee = 0;
+ res->shares.my_limit = CKRM_SHARE_DONTCARE;
+ res->shares.total_guarantee = CKRM_SHARE_DFLT_TOTAL_GUARANTEE;
+ res->shares.max_limit = CKRM_SHARE_DONTCARE;
+ res->shares.unused_guarantee = CKRM_SHARE_DFLT_TOTAL_GUARANTEE;
+ res->cnt_total_guarantee = 0;
+
+ return;
+}
+
+static void *mem_res_alloc(struct ckrm_class *class,
+ struct ckrm_class *parent)
+{
+ struct ckrm_mem *res;
+
+ res = kmalloc(sizeof(struct ckrm_mem), GFP_ATOMIC);
+
+ if (res) {
+ memset(res, 0, sizeof(struct ckrm_mem));
+ res->class = class;
+ res->parent = parent;
+ mem_res_initcls_one(res);
+ res->cnt_lock = SPIN_LOCK_UNLOCKED;
+ if (!parent) { /* root class */
+ res->cnt_total_guarantee = CKRM_SHARE_DFLT_TOTAL_GUARANTEE;
+ res->shares.my_guarantee = CKRM_SHARE_DONTCARE;
+ } else {
+ res->mem_rc = (struct mem_rc *)mem_rc_create(grcd, class->name);
+ if (res->mem_rc == NULL)
+ printk(KERN_ERR "mem_rc_create failed\n");
+ }
+ } else {
+ printk(KERN_ERR
+ "mem_res_alloc: failed GFP_ATOMIC alloc\n");
+ }
+ return res;
+}
+
+static void mem_res_free(void *my_res)
+{
+ struct ckrm_mem *res = my_res, *parres;
+ u64 temp = 0;
+
+ if (!res)
+ return;
+
+ parres = ckrm_get_res_class(res->parent, rcbs.resid, struct ckrm_mem);
+ /* return child's guarantee to parent class */
+ spin_lock(&parres->cnt_lock);
+ ckrm_child_guarantee_changed(&parres->shares, res->shares.my_guarantee, 0);
+ if (parres->shares.total_guarantee) {
+ temp = (u64) parres->shares.unused_guarantee
+ * parres->cnt_total_guarantee;
+ do_div(temp, parres->shares.total_guarantee);
+ }
+ mem_rc_set_guarantee(parres, temp);
+ spin_unlock(&parres->cnt_lock);
+
+ mem_rc_destroy(res->mem_rc);
+ kfree(res);
+ return;
+}
+
+static void
+recalc_and_propagate(struct ckrm_mem * res)
+{
+ struct ckrm_class *child = NULL;
+ struct ckrm_mem *parres, *childres;
+ u64 cnt_total = 0, cnt_guar = 0;
+
+ parres = ckrm_get_res_class(res->parent, rcbs.resid, struct ckrm_mem);
+
+ if (parres) {
+ struct ckrm_shares *par = &parres->shares;
+ struct ckrm_shares *self = &res->shares;
+
+ /* calculate total and current guarantee */
+ if (par->total_guarantee && self->total_guarantee) {
+ cnt_total = (u64) self->my_guarantee
+ * parres->cnt_total_guarantee;
+ do_div(cnt_total, par->total_guarantee);
+ cnt_guar = (u64) self->unused_guarantee * cnt_total;
+ do_div(cnt_guar, self->total_guarantee);
+ }
+ mem_rc_set_guarantee(res, (int) cnt_guar);
+ res->cnt_total_guarantee = (int ) cnt_total;
+ }
+
+ /* propagate to children */
+ ckrm_lock_hier(res->class);
+ while ((child = ckrm_get_next_child(res->class, child)) != NULL) {
+ childres =
+ ckrm_get_res_class(child, rcbs.resid, struct ckrm_mem);
+ if (childres) {
+ spin_lock(&childres->cnt_lock);
+ recalc_and_propagate(childres);
+ spin_unlock(&childres->cnt_lock);
+ }
+ }
+ ckrm_unlock_hier(res->class);
+ return;
+}
+
+static int mem_set_share_values(void *my_res, struct ckrm_shares *new)
+{
+ struct ckrm_mem *parres, *res = my_res;
+ struct ckrm_shares *cur = &res->shares, *par;
+ int rc = -EINVAL;
+ u64 temp = 0;
+
+ if (!res)
+ return rc;
+
+ if (res->parent) {
+ parres =
+ ckrm_get_res_class(res->parent, rcbs.resid, struct ckrm_mem);
+ spin_lock(&parres->cnt_lock);
+ spin_lock(&res->cnt_lock);
+ par = &parres->shares;
+ } else {
+ spin_lock(&res->cnt_lock);
+ par = NULL;
+ parres = NULL;
+ }
+
+ rc = ckrm_set_shares(new, cur, par);
+
+ if (rc)
+ goto share_err;
+
+ if (parres) {
+ /* adjust parent's unused guarantee */
+ if (par->total_guarantee) {
+ temp = (u64) par->unused_guarantee
+ * parres->cnt_total_guarantee;
+ do_div(temp, par->total_guarantee);
+ }
+ mem_rc_set_guarantee(parres, temp);
+ } else {
+ /* adjust root class's unused guarantee */
+ temp = (u64) cur->unused_guarantee
+ * CKRM_SHARE_DFLT_TOTAL_GUARANTEE;
+ do_div(temp, cur->total_guarantee);
+ mem_rc_set_guarantee(res, temp);
+ }
+ recalc_and_propagate(res);
+
+share_err:
+ spin_unlock(&res->cnt_lock);
+ if (res->parent)
+ spin_unlock(&parres->cnt_lock);
+ return rc;
+}
+
+static int mem_get_share_values(void *my_res, struct ckrm_shares *shares)
+{
+ struct ckrm_mem *res = my_res;
+
+ if (!res)
+ return -EINVAL;
+ *shares = res->shares;
+ return 0;
+}
+
+static ssize_t mem_show_stats(void *my_res, char *buf)
+{
+ struct ckrm_mem *res = my_res;
+ unsigned long val;
+ ssize_t i;
+
+ if (!res)
+ return -EINVAL;
+
+ if (res->mem_rc == NULL)
+ return 0;
+
+ mem_rc_get_cur(res->mem_rc, &val);
+ i = sprintf(buf, "mem:current=%ld\n", val);
+ return i;
+}
+
+static struct ckrm_res_ctlr rcbs = {
+ .res_name = "mem",
+ .resid = -1,
+ .res_alloc = mem_res_alloc,
+ .res_free = mem_res_free,
+ .set_share_values = mem_set_share_values,
+ .get_share_values = mem_get_share_values,
+ .show_stats = mem_show_stats,
+};
+
+static void init_global_rcd(void)
+{
+ grcd = (struct mem_rc_domain *) mem_rc_create_rcdomain((struct cpuset *)NULL, cpu_online_map, node_online_map);
+ if (grcd == NULL)
+ printk("mem_rc_create_rcdomain failed\n");
+}
+
+int __init init_ckrm_mem_res(void)
+{
+ init_global_rcd();
+ if (rcbs.resid == CKRM_NO_RES) {
+ ckrm_register_res_ctlr(&rcbs);
+ }
+ return 0;
+}
+
+void __exit exit_ckrm_mem_res(void)
+{
+ ckrm_unregister_res_ctlr(&rcbs);
+ mem_rc_destroy_rcdomain(grcd);
+}
+
+module_init(init_ckrm_mem_res);
+module_exit(exit_ckrm_mem_res);
+
+MODULE_LICENSE("GPL");
diff -urNp b/mm/mempolicy.c c/mm/mempolicy.c
--- b/mm/mempolicy.c 2006-01-03 12:21:10.000000000 +0900
+++ c/mm/mempolicy.c 2006-01-26 17:52:06.000000000 +0900
@@ -726,8 +726,10 @@ get_vma_policy(struct task_struct *task,
}
/* Return a zonelist representing a mempolicy */
-static struct zonelist *zonelist_policy(gfp_t gfp, struct mempolicy *policy)
+static struct zonelist *zonelist_policy(gfp_t gfp, int order,
+ struct mempolicy *policy)
{
+ struct zonelist *zl;
int nd;
switch (policy->policy) {
@@ -746,6 +748,8 @@ static struct zonelist *zonelist_policy(
case MPOL_INTERLEAVE: /* should not happen */
case MPOL_DEFAULT:
nd = numa_node_id();
+ if ((zl = mem_rc_get_zonelist(nd, gfp, order)) != NULL)
+ return zl;
break;
default:
nd = 0;
@@ -844,7 +848,7 @@ alloc_page_vma(gfp_t gfp, struct vm_area
}
return alloc_page_interleave(gfp, 0, nid);
}
- return __alloc_pages(gfp, 0, zonelist_policy(gfp, pol));
+ return __alloc_pages(gfp, 0, zonelist_policy(gfp, 0, pol));
}
/**
@@ -876,7 +880,7 @@ struct page *alloc_pages_current(gfp_t g
pol = &default_policy;
if (pol->policy == MPOL_INTERLEAVE)
return alloc_page_interleave(gfp, order, interleave_nodes(pol));
- return __alloc_pages(gfp, order, zonelist_policy(gfp, pol));
+ return __alloc_pages(gfp, order, zonelist_policy(gfp, order, pol));
}
EXPORT_SYMBOL(alloc_pages_current);
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [ckrm-tech] [PATCH 1/8] Add the __GFP_NOLRU flag
2006-01-31 2:30 ` [PATCH 1/8] Add the __GFP_NOLRU flag KUROSAWA Takahiro
@ 2006-01-31 18:18 ` Dave Hansen
2006-02-01 5:06 ` KUROSAWA Takahiro
0 siblings, 1 reply; 32+ messages in thread
From: Dave Hansen @ 2006-01-31 18:18 UTC (permalink / raw)
To: KUROSAWA Takahiro; +Cc: ckrm-tech, linux-mm
On Tue, 2006-01-31 at 11:30 +0900, KUROSAWA Takahiro wrote:
> This patch adds the __GFP_NOLRU flag. This option should be used
> for GFP_USER/GFP_HIGHUSER page allocations that are not maintained
> in the zone LRU lists.
Is this simply to mark pages which will never end up in the LRU? Why is
this important?
-- Dave
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [ckrm-tech] [PATCH 0/8] Pzone based CKRM memory resource controller
2006-01-31 2:30 ` [PATCH 0/8] Pzone based CKRM memory resource controller KUROSAWA Takahiro
` (7 preceding siblings ...)
2006-01-31 2:30 ` [PATCH 8/8] Add a CKRM memory resource controller using pzones KUROSAWA Takahiro
@ 2006-02-01 2:58 ` chandra seetharaman
2006-02-01 5:39 ` KUROSAWA Takahiro
2006-02-01 3:07 ` chandra seetharaman
9 siblings, 1 reply; 32+ messages in thread
From: chandra seetharaman @ 2006-02-01 2:58 UTC (permalink / raw)
To: KUROSAWA Takahiro; +Cc: ckrm-tech, linux-mm
Kurosawa,
I like the idea of multiple controllers for a resource. Users will have
options to choose from. Thanks for doing it.
I have a few questions:
- how are shared pages handled ?
- what is the plan to support "limit" ?
- can you provide more information in stats ?
- is it designed to work with cpumeter alone (i.e without ckrm) ?
comment/suggestion:
- IMO, moving pages from a class at the time of reclassification would be
the right thing to do. Maybe we have to add a pointer to Chris' patch
and make sure it works as we expect.
- instead of adding the pseudo zone related code to the core memory
files, you can put them in a separate file.
- Documentation on how to configure and use it would be good.
regards,
chandra
On Tue, 2006-01-31 at 11:30 +0900, KUROSAWA Takahiro wrote:
> I've split the patches into smaller pieces in order to increase
> readability. The core part of the patchset is the fifth one with
> the subject "Add the pzone_create() function."
>
> Changes since the last post:
> * Fixed a bug that pages allocated with __GFP_COLD are incorrectly handled.
> * Moved the PZONE bit in page flags next to the zone number bits in
> order to make changes by pzones smaller.
> * Moved the nr_zones locking functions outside of the CONFIG_PSEUDO_ZONE
> because they are not directly related to pzones.
>
> Thanks,
>
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [ckrm-tech] [PATCH 0/8] Pzone based CKRM memory resource controller
2006-01-31 2:30 ` [PATCH 0/8] Pzone based CKRM memory resource controller KUROSAWA Takahiro
` (8 preceding siblings ...)
2006-02-01 2:58 ` [ckrm-tech] [PATCH 0/8] Pzone based CKRM memory resource controller chandra seetharaman
@ 2006-02-01 3:07 ` chandra seetharaman
2006-02-01 5:54 ` KUROSAWA Takahiro
2006-02-03 1:33 ` KUROSAWA Takahiro
9 siblings, 2 replies; 32+ messages in thread
From: chandra seetharaman @ 2006-02-01 3:07 UTC (permalink / raw)
To: KUROSAWA Takahiro; +Cc: ckrm-tech, linux-mm
Hi KUROSAWA,
I tried to use the controller but am having some problems.
- Created class a,
- set guarantee to 50 (with the parent having 100, I expected class a to get
50% of the memory in the system).
- moved my shell to class a.
- Issued a make in the kernel tree.
It consistently fails with
-----------
make: getcwd: : Cannot allocate memory
Makefile:313: /scripts/Kbuild.include: No such file or directory
Makefile:532: /arch/i386/Makefile: No such file or directory
Can't open perl script "/scripts/setlocalversion": No such file or
directory
make: *** No rule to make target `/arch/i386/Makefile'. Stop.
-----------
Note that the compilation succeeds if I move my shell to the default
class.
I got a oops too:
------------------------------
kernel BUG at mm/page_alloc.c:1074!
invalid operand: 0000 [#1]
SMP
Modules linked in:
CPU: 1
EIP: 0060:[<c013768d>] Not tainted VLI
EFLAGS: 00010256 (2.6.15n)
EIP is at __free_pages+0x17/0x42
eax: 00000000 ebx: 00000000 ecx: c17f8b80 edx: c17f8b80
esi: f7c85578 edi: c1931e20 ebp: c1931a20 esp: d9799f98
ds: 007b es: 007b ss: 0068
Process make (pid: 12576, threadinfo=d9798000 task=f6324530)
Stack: c1931e20 c01637d1 ffc5c000 0000001b bfe6c930 bfe6c930 00001000
d9798000
c01026fb bfe6c930 00001000 40143f0c bfe6c930 00001000 bfe6c098
000000b7
0000007b c010007b 000000b7 ffffe410 00000073 00000286 bfe6c06c
0000007b
Call Trace:
[<c01637d1>] sys_getcwd+0x17f/0x18a
[<c01026fb>] sysenter_past_esp+0x54/0x79
Code: 4b 78 0e 8b 56 04 8b 44 9e 08 e8 da f8 ff ff eb ef 5b 5e c3 53 89
c1 89 d3 89 c2 8b 00 f6 c4 40 74 03 8b 51 0c 8b 42 04 40 75 08 <0f> 0b
32 04 45 72 30 c0 f0 83 41 04 ff 0f 98 c0 84 c0 74 15 85
-------------------------------------
Note: "if (put_page_testzero(page)) {" is line 1074 in my source tree
Also, I do not see a mem= line in the stats file for the default class.
chandra
On Tue, 2006-01-31 at 11:30 +0900, KUROSAWA Takahiro wrote:
> I've split the patches into smaller pieces in order to increase
> readability. The core part of the patchset is the fifth one with
> the subject "Add the pzone_create() function."
>
> Changes since the last post:
> * Fixed a bug that pages allocated with __GFP_COLD are incorrectly handled.
> * Moved the PZONE bit in page flags next to the zone number bits in
> order to make changes by pzones smaller.
> * Moved the nr_zones locking functions outside of the CONFIG_PSEUDO_ZONE
> because they are not directly related to pzones.
>
> Thanks,
>
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [ckrm-tech] [PATCH 1/8] Add the __GFP_NOLRU flag
2006-01-31 18:18 ` [ckrm-tech] " Dave Hansen
@ 2006-02-01 5:06 ` KUROSAWA Takahiro
0 siblings, 0 replies; 32+ messages in thread
From: KUROSAWA Takahiro @ 2006-02-01 5:06 UTC (permalink / raw)
To: Dave Hansen; +Cc: ckrm-tech, linux-mm
Hi,
On Tue, 31 Jan 2006 10:18:53 -0800
Dave Hansen <haveblue@us.ibm.com> wrote:
> > This patch adds the __GFP_NOLRU flag. This option should be used
> > for GFP_USER/GFP_HIGHUSER page allocations that are not maintained
> > in the zone LRU lists.
>
> Is this simply to mark pages which will never end up in the LRU? Why is
> this important?
The resource controller assumes that pages in pzones are linked to
the LRU lists or the free lists in order to simplify the cleanup of
pzones in classes. Cleaning up a pzone needs to know all the pages
that belong to it. So the resource controller is designed so that
pages which will never end up on the LRU are not allocated from pzones.
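(Editorial sketch, not from the original mail: given the flag's stated
purpose, a caller allocating a user page that will never be put on the
LRU would flag the allocation roughly like this so it is not served from
a pzone. __GFP_NOLRU is the flag added by patch 1/8; the helper name is
illustrative only.)

#include <linux/gfp.h>

/* Sketch only: an order-0 GFP_USER allocation whose page never goes
 * onto the LRU is flagged so that it bypasses the pzone zonelists.
 */
static struct page *alloc_non_lru_user_page(void)
{
	return alloc_page(GFP_USER | __GFP_NOLRU);
}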
--
KUROSAWA, Takahiro
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [ckrm-tech] [PATCH 0/8] Pzone based CKRM memory resource controller
2006-02-01 2:58 ` [ckrm-tech] [PATCH 0/8] Pzone based CKRM memory resource controller chandra seetharaman
@ 2006-02-01 5:39 ` KUROSAWA Takahiro
2006-02-01 6:16 ` Hirokazu Takahashi
2006-02-02 1:26 ` chandra seetharaman
0 siblings, 2 replies; 32+ messages in thread
From: KUROSAWA Takahiro @ 2006-02-01 5:39 UTC (permalink / raw)
To: sekharan; +Cc: ckrm-tech, linux-mm
Chandra,
On Tue, 31 Jan 2006 18:58:18 -0800
chandra seetharaman <sekharan@us.ibm.com> wrote:
> I like the idea of multiple controllers for a resource. Users will have
> options to choose from. Thanks for doing it.
You are welcome. Thanks for the comments.
> I have few questions:
> - how are shared pages handled ?
Shared pages are accounted to the class whose task allocated them.
This behavior is different from that of the memory resource
controller in CKRM.
> - what is the plan to support "limit" ?
To be honest, I don't have any specific plan to support "limit" currently.
Probably a userspace daemon that enlarges "guarantee" up to the specified
"limit" could provide it, because "guarantee" in the pzone-based
memory resource controller also works as a "limit".
> - can you provide more information in stats ?
Ok, I'll do that.
> - is it designed to work with cpumeter alone (i.e without ckrm) ?
Maybe it works with cpumeter.
> comment/suggestion:
> - IMO, moving pages from a class at time of reclassification would be
> the right thing to do. May be we have to add a pointer to Chris patch
> and make sure it works as we expect.
>
> - instead of adding the pseudo zone related code to the core memory
> files, you can put them in a separate file.
That's right. But I guess that several static functions in
mm/page_alloc.c would need to be exported.
> - Documentation on how to configure and use it would be good.
I think so too. I'll write some documents.
Thanks,
--
KUROSAWA, Takahiro
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [ckrm-tech] [PATCH 0/8] Pzone based CKRM memory resource controller
2006-02-01 3:07 ` chandra seetharaman
@ 2006-02-01 5:54 ` KUROSAWA Takahiro
2006-02-03 1:33 ` KUROSAWA Takahiro
1 sibling, 0 replies; 32+ messages in thread
From: KUROSAWA Takahiro @ 2006-02-01 5:54 UTC (permalink / raw)
To: sekharan; +Cc: ckrm-tech, linux-mm
On Tue, 31 Jan 2006 19:07:35 -0800
chandra seetharaman <sekharan@us.ibm.com> wrote:
> I tried to use the controller but having some problems.
>
> - Created class a,
> - set guarantee to 50(with parent having 100, i expected class a to get
> 50% of memory in the system).
> - moved my shell to class a.
> - Issued a make in the kernel tree.
> It consistently fails with
> -----------
> make: getcwd: : Cannot allocate memory
> Makefile:313: /scripts/Kbuild.include: No such file or directory
> Makefile:532: /arch/i386/Makefile: No such file or directory
> Can't open perl script "/scripts/setlocalversion": No such file or
> directory
> make: *** No rule to make target `/arch/i386/Makefile'. Stop.
> -----------
> Note that the compilation succeeds if I move my shell to the default
> class.
Hmm... That should be a bug in the pzones, because the default class
has the same number of pages as class a.
Could you show me the output of "cat /proc/zoneinfo" after setting up
the class a? The information would help me debugging.
> I got a oops too:
> ------------------------------
> kernel BUG at mm/page_alloc.c:1074!
> invalid operand: 0000 [#1]
> SMP
> Modules linked in:
> CPU: 1
> EIP: 0060:[<c013768d>] Not tainted VLI
> EFLAGS: 00010256 (2.6.15n)
> EIP is at __free_pages+0x17/0x42
> eax: 00000000 ebx: 00000000 ecx: c17f8b80 edx: c17f8b80
> esi: f7c85578 edi: c1931e20 ebp: c1931a20 esp: d9799f98
> ds: 007b es: 007b ss: 0068
> Process make (pid: 12576, threadinfo=d9798000 task=f6324530)
> Stack: c1931e20 c01637d1 ffc5c000 0000001b bfe6c930 bfe6c930 00001000
> d9798000
> c01026fb bfe6c930 00001000 40143f0c bfe6c930 00001000 bfe6c098
> 000000b7
> 0000007b c010007b 000000b7 ffffe410 00000073 00000286 bfe6c06c
> 0000007b
> Call Trace:
> [<c01637d1>] sys_getcwd+0x17f/0x18a
> [<c01026fb>] sysenter_past_esp+0x54/0x79
> Code: 4b 78 0e 8b 56 04 8b 44 9e 08 e8 da f8 ff ff eb ef 5b 5e c3 53 89
> c1 89 d3 89 c2 8b 00 f6 c4 40 74 03 8b 51 0c 8b 42 04 40 75 08 <0f> 0b
> 32 04 45 72 30 c0 f0 83 41 04 ff 0f 98 c0 84 c0 74 15 85
> -------------------------------------
> Note: "if (put_page_testzero(page)) {" is line 1074 in my source tree
>
> Also, I do not see a mem= line in the stats file for the default class.
My source tree has the same line. I'll investigate the oops.
Thanks for the report,
--
KUROSAWA, Takahiro
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [ckrm-tech] [PATCH 0/8] Pzone based CKRM memory resource controller
2006-02-01 5:39 ` KUROSAWA Takahiro
@ 2006-02-01 6:16 ` Hirokazu Takahashi
2006-02-02 1:26 ` chandra seetharaman
1 sibling, 0 replies; 32+ messages in thread
From: Hirokazu Takahashi @ 2006-02-01 6:16 UTC (permalink / raw)
To: sekharan; +Cc: kurosawa, ckrm-tech, linux-mm
Hi Chandra,
>You are welcome. Thanks for the comments.
>
>> I have few questions:
>> - how are shared pages handled ?
>
>Shared pages are accounted to the class that a task in it allocate
>the pages. This behavior is different from the memory resource
>controller in CKRM.
>
>> - what is the plan to support "limit" ?
>
>To be honest, I don't have any specific idea to support "limit" currently.
>Probably the userspace daemon that enlarge "guarantee" to the specified
>"limit" might support the "limit", because "guarantee" in the pzone based
>memory resource controller also works as "limit".
I think that this should be placed outside of Kurosawa's controller.
That would be a kernel thread between the CKRM API and the controller,
which collects statistics about the pzones and resizes them
between "guarantee" and "limit". A userspace daemon would also be OK.
Separating the pzone-balancing code from the VM will keep the code simple.
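(Editorial sketch, not from the original mail: one shape such a userspace
balancer could take; read_class_usage(), read_class_guarantee() and
set_class_guarantee() are placeholders for whatever CKRM/configfs
interface would actually be used, and the 90% threshold is arbitrary.)

#include <unistd.h>

/* Placeholders for the real CKRM/configfs accessors. */
extern unsigned long read_class_usage(const char *cls);
extern unsigned long read_class_guarantee(const char *cls);
extern int set_class_guarantee(const char *cls, unsigned long val);

/* Sketch only: grow a class's reservation toward 'limit' while the
 * class is using more than 90% of its current reservation.
 */
static void balance_class(const char *cls, unsigned long limit)
{
	for (;;) {
		unsigned long guar = read_class_guarantee(cls);
		unsigned long used = read_class_usage(cls);

		if (used * 10 > guar * 9 && guar < limit)
			set_class_guarantee(cls, guar + (limit - guar + 1) / 2);

		sleep(1);
	}
}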
>> - can you provide more information in stats ?
>
>Ok, I'll do that.
>
>> - is it designed to work with cpumeter alone (i.e without ckrm) ?
>
>Maybe it works with cpumeter.
>
>> comment/suggestion:
>> - IMO, moving pages from a class at time of reclassification would be
>> the right thing to do. May be we have to add a pointer to Chris patch
>> and make sure it works as we expect.
>>
>> - instead of adding the pseudo zone related code to the core memory
>> files, you can put them in a separate file.
>
>That's right. But I guess that several static functions in
>mm/page_alloc.c would need to be exported.
>
>> - Documentation on how to configure and use it would be good.
>
>I think so too. I'll write some documents.
Thanks,
Hirokazu Takahashi.
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [ckrm-tech] [PATCH 0/8] Pzone based CKRM memory resource controller
2006-02-01 5:39 ` KUROSAWA Takahiro
2006-02-01 6:16 ` Hirokazu Takahashi
@ 2006-02-02 1:26 ` chandra seetharaman
2006-02-02 3:54 ` KUROSAWA Takahiro
1 sibling, 1 reply; 32+ messages in thread
From: chandra seetharaman @ 2006-02-02 1:26 UTC (permalink / raw)
To: KUROSAWA Takahiro; +Cc: ckrm-tech, linux-mm
On Wed, 2006-02-01 at 14:39 +0900, KUROSAWA Takahiro wrote:
> Chandra,
>
> On Tue, 31 Jan 2006 18:58:18 -0800
> chandra seetharaman <sekharan@us.ibm.com> wrote:
>
> > I like the idea of multiple controllers for a resource. Users will have
> > options to choose from. Thanks for doing it.
>
> You are welcome. Thanks for the comments.
>
> > I have few questions:
> > - how are shared pages handled ?
>
> Shared pages are accounted to the class that a task in it allocate
> the pages. This behavior is different from the memory resource
> controller in CKRM.
So all others get free access? That may not be a good option. Shared pages
either have to be accounted separately or shared between the classes
that use them.
The current memrc also charges the class that brings the page in. Valerie
is in the process of making changes so that shared pages belong to a
separate class.
>
> > - what is the plan to support "limit" ?
>
> To be honest, I don't have any specific idea to support "limit" currently.
> Probably the userspace daemon that enlarge "guarantee" to the specified
> "limit" might support the "limit", because "guarantee" in the pzone based
> memory resource controller also works as "limit".
I am not able to visualize how this will work.
In simple terms, the sum of guarantees should _not_ exceed the amount of
available memory, but the sum of limits _can_ exceed the amount of
available memory. As far as I understand your implementation, guarantee is
translated to present_pages of the pseudo zone (and is subtracted from
the parent's present_pages). How can one set the limit to be the same as
the guarantee?
>
> > - can you provide more information in stats ?
>
> Ok, I'll do that.
>
> > - is it designed to work with cpumeter alone (i.e without ckrm) ?
>
> Maybe it works with cpumeter.
Have you tested it without CKRM (I mean, only with cpumeter)?
>
> > comment/suggestion:
> > - IMO, moving pages from a class at time of reclassification would be
> > the right thing to do. May be we have to add a pointer to Chris patch
> > and make sure it works as we expect.
> >
> > - instead of adding the pseudo zone related code to the core memory
> > files, you can put them in a separate file.
>
> That's right. But I guess that several static functions in
> mm/page_alloc.c would need to be exported.
It will be a lot cleaner.
>
> > - Documentation on how to configure and use it would be good.
>
> I think so too. I'll write some documents.
>
> Thanks,
>
* Re: [ckrm-tech] [PATCH 0/8] Pzone based CKRM memory resource controller
2006-02-02 1:26 ` chandra seetharaman
@ 2006-02-02 3:54 ` KUROSAWA Takahiro
2006-02-03 0:37 ` chandra seetharaman
0 siblings, 1 reply; 32+ messages in thread
From: KUROSAWA Takahiro @ 2006-02-02 3:54 UTC (permalink / raw)
To: sekharan; +Cc: ckrm-tech, linux-mm
On Wed, 01 Feb 2006 17:26:00 -0800
chandra seetharaman <sekharan@us.ibm.com> wrote:
> > > - what is the plan to support "limit" ?
> >
> > To be honest, I don't have any specific idea to support "limit" currently.
> > Probably the userspace daemon that enlarge "guarantee" to the specified
> > "limit" might support the "limit", because "guarantee" in the pzone based
> > memory resource controller also works as "limit".
>
> I am not able to visualize how this will work.
>
> In simple terms, sum of guarantees should _not_ exceed the amount of
> available memory but, sum of limits _can_ exceed the amount of available
> memory. As far as i understand your implementation, guarantee is
> translated to present_pages of the pseudo zone (and is subtracted from
> paren't present_pages). How can one set limit to be same as guarantee ?
The number of pages in the pseudo zones can also be considered the limit,
because tasks in a class can't allocate beyond the number of pages that
are reserved for the pseudo zones.
> > > - can you provide more information in stats ?
> >
> > Ok, I'll do that.
> >
> > > - is it designed to work with cpumeter alone (i.e without ckrm) ?
> >
> > Maybe it works with cpumeter.
>
> have you tested it without ckrm (i mean only with cpumeter)
The patches I sent don't include a harness for cpumeter, so they can't
run with cpumeter yet. I suppose a little work is needed to modify
mm/mem_rc_pzone.c to work with cpumeter, because the file was originally
written for cpumeter.
--
KUROSAWA, Takahiro
* Re: [ckrm-tech] [PATCH 0/8] Pzone based CKRM memory resource controller
2006-02-02 3:54 ` KUROSAWA Takahiro
@ 2006-02-03 0:37 ` chandra seetharaman
2006-02-03 0:51 ` KUROSAWA Takahiro
0 siblings, 1 reply; 32+ messages in thread
From: chandra seetharaman @ 2006-02-03 0:37 UTC (permalink / raw)
To: KUROSAWA Takahiro; +Cc: ckrm-tech, linux-mm
On Thu, 2006-02-02 at 12:54 +0900, KUROSAWA Takahiro wrote:
> On Wed, 01 Feb 2006 17:26:00 -0800
> chandra seetharaman <sekharan@us.ibm.com> wrote:
>
> > > > - what is the plan to support "limit" ?
> > >
> > > To be honest, I don't have any specific idea to support "limit" currently.
> > > Probably the userspace daemon that enlarge "guarantee" to the specified
> > > "limit" might support the "limit", because "guarantee" in the pzone based
> > > memory resource controller also works as "limit".
> >
> > I am not able to visualize how this will work.
> >
> > In simple terms, sum of guarantees should _not_ exceed the amount of
> > available memory but, sum of limits _can_ exceed the amount of available
> > memory. As far as i understand your implementation, guarantee is
> > translated to present_pages of the pseudo zone (and is subtracted from
> > paren't present_pages). How can one set limit to be same as guarantee ?
>
> The number of pages in the pseudo zones can also be considered as limit
> because tasks in a class can't allocate beyond the number of the pages
> that are allocated to the pseudo zones.
Yes, but that is true only when the limit and guarantee are the same.
Consider the following scenario:
A system with 1024MB of memory.
I want to create 6 classes:
- 4 of which have a guarantee of 128MB and a limit of 512MB
- 2 of which have a guarantee of 256MB and a limit of 768MB
We cannot do this with this memrc. Can you explain how a userspace
program can help me do this?
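Spelling out the arithmetic of this scenario: the guarantees already consume
all of memory, while the limits oversubscribe it, which a pure reservation
model with guarantee == limit cannot express:

/* Sums for the 6-class example above on a 1024MB machine. */
#include <stdio.h>

int main(void)
{
        long total_mb = 1024;
        long sum_guarantee = 4 * 128 + 2 * 256;   /* = 1024MB */
        long sum_limit = 4 * 512 + 2 * 768;       /* = 3584MB */

        printf("sum of guarantees: %ldMB (<= %ldMB of RAM, OK)\n",
               sum_guarantee, total_mb);
        printf("sum of limits:     %ldMB (> %ldMB of RAM, oversubscribed)\n",
               sum_limit, total_mb);
        printf("with guarantee == limit, no class could ever grow past its guarantee\n");
        return 0;
}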
thanks,
chandra
* Re: [ckrm-tech] [PATCH 0/8] Pzone based CKRM memory resource controller
2006-02-03 0:37 ` chandra seetharaman
@ 2006-02-03 0:51 ` KUROSAWA Takahiro
2006-02-03 1:01 ` chandra seetharaman
0 siblings, 1 reply; 32+ messages in thread
From: KUROSAWA Takahiro @ 2006-02-03 0:51 UTC (permalink / raw)
To: sekharan; +Cc: ckrm-tech, linux-mm
On Thu, 02 Feb 2006 16:37:37 -0800
chandra seetharaman <sekharan@us.ibm.com> wrote:
> > > > > - what is the plan to support "limit" ?
> > > >
> > > > To be honest, I don't have any specific idea to support "limit" currently.
> > > > Probably the userspace daemon that enlarge "guarantee" to the specified
> > > > "limit" might support the "limit", because "guarantee" in the pzone based
> > > > memory resource controller also works as "limit".
> > >
> > > I am not able to visualize how this will work.
> > >
> > > In simple terms, sum of guarantees should _not_ exceed the amount of
> > > available memory but, sum of limits _can_ exceed the amount of available
> > > memory. As far as i understand your implementation, guarantee is
> > > translated to present_pages of the pseudo zone (and is subtracted from
> > > paren't present_pages). How can one set limit to be same as guarantee ?
> >
> > The number of pages in the pseudo zones can also be considered as limit
> > because tasks in a class can't allocate beyond the number of the pages
> > that are allocated to the pseudo zones.
>
> Yes. but, it is true only when limit and guarantee are the same.
>
> Consider the following scenario:
> A system with 1024MB of memory.
>
> I want to create 6 classes:
> - 4 of which has guarantee of 128MB and limit of 512MB
> - 2 of which has guarantee of 256MB and limit of 768MB
>
> We cannot do this with this memrc. Can you explain how a userspace
> program can help me do this.
Our memrc with a userspace program doesn't help in this case.
If you'd like to set up classes like this, you can select the memory
resource controller in the current CKRM.
* Re: [ckrm-tech] [PATCH 0/8] Pzone based CKRM memory resource controller
2006-02-03 0:51 ` KUROSAWA Takahiro
@ 2006-02-03 1:01 ` chandra seetharaman
0 siblings, 0 replies; 32+ messages in thread
From: chandra seetharaman @ 2006-02-03 1:01 UTC (permalink / raw)
To: KUROSAWA Takahiro; +Cc: ckrm-tech, linux-mm
On Fri, 2006-02-03 at 09:51 +0900, KUROSAWA Takahiro wrote:
> On Thu, 02 Feb 2006 16:37:37 -0800
> chandra seetharaman <sekharan@us.ibm.com> wrote:
>
> > > > > > - what is the plan to support "limit" ?
> > > > >
> > > > > To be honest, I don't have any specific idea to support "limit" currently.
> > > > > Probably the userspace daemon that enlarge "guarantee" to the specified
> > > > > "limit" might support the "limit", because "guarantee" in the pzone based
> > > > > memory resource controller also works as "limit".
> > > >
> > > > I am not able to visualize how this will work.
> > > >
> > > > In simple terms, sum of guarantees should _not_ exceed the amount of
> > > > available memory but, sum of limits _can_ exceed the amount of available
> > > > memory. As far as i understand your implementation, guarantee is
> > > > translated to present_pages of the pseudo zone (and is subtracted from
> > > > paren't present_pages). How can one set limit to be same as guarantee ?
> > >
> > > The number of pages in the pseudo zones can also be considered as limit
> > > because tasks in a class can't allocate beyond the number of the pages
> > > that are allocated to the pseudo zones.
> >
> > Yes. but, it is true only when limit and guarantee are the same.
> >
> > Consider the following scenario:
> > A system with 1024MB of memory.
> >
> > I want to create 6 classes:
> > - 4 of which has guarantee of 128MB and limit of 512MB
> > - 2 of which has guarantee of 256MB and limit of 768MB
> >
> > We cannot do this with this memrc. Can you explain how a userspace
> > program can help me do this.
>
> Our memrc with a userspace program doesn't help this case.
> If you'd like to setup classes like this, you can select the memory
> resource controller in current CKRM.
That is how we intended guarantee and limit to work. What was your
understanding, and what can one do through the userspace support?
>
* Re: [ckrm-tech] [PATCH 0/8] Pzone based CKRM memory resource controller
2006-02-01 3:07 ` chandra seetharaman
2006-02-01 5:54 ` KUROSAWA Takahiro
@ 2006-02-03 1:33 ` KUROSAWA Takahiro
2006-02-03 9:37 ` KUROSAWA Takahiro
1 sibling, 1 reply; 32+ messages in thread
From: KUROSAWA Takahiro @ 2006-02-03 1:33 UTC (permalink / raw)
To: sekharan; +Cc: ckrm-tech, linux-mm
[-- Attachment #1: Type: text/plain, Size: 966 bytes --]
On Tue, 31 Jan 2006 19:07:35 -0800
chandra seetharaman <sekharan@us.ibm.com> wrote:
> I tried to use the controller but am having some problems.
>
> - Created class a,
> - set guarantee to 50 (with parent having 100, I expected class a to get
> 50% of memory in the system).
> - moved my shell to class a.
> - Issued a make in the kernel tree.
> It consistently fails with
> -----------
> make: getcwd: : Cannot allocate memory
> Makefile:313: /scripts/Kbuild.include: No such file or directory
> Makefile:532: /arch/i386/Makefile: No such file or directory
> Can't open perl script "/scripts/setlocalversion": No such file or
> directory
> make: *** No rule to make target `/arch/i386/Makefile'. Stop.
> -----------
> Note that the compilation succeeds if I move my shell to the default
> class.
I could reproduce this problem. Could you try the attached patch?
> I got an oops too:
I haven't been able to reproduce the oops so far. Does this oops also occur every time?
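For readers skimming the diff below, the core of the fix as I read it: the
class previously kept a single zonelist per node, built from the node's
HIGHMEM zonelist, while after the patch there is one zonelist per
(node, GFP zone type), selected by gfpmask & GFP_ZONEMASK just like the
kernel's own node_zonelists. A stand-alone toy of that indexing (constants
and names are simplified stand-ins, not the kernel's):

/* Toy model of per-(node, zone type) zonelist selection. */
#include <stdio.h>

#define MAX_NUMNODES   2
#define GFP_ZONETYPES  4
#define GFP_ZONEMASK   0x3

static const char *zonelists[MAX_NUMNODES][GFP_ZONETYPES] = {
        { "node0/list0", "node0/list1", "node0/list2", "node0/list3" },
        { "node1/list0", "node1/list1", "node1/list2", "node1/list3" },
};

static const char *pick_zonelist(int nid, unsigned int gfpmask)
{
        /* The zone-type bits of the gfp mask select which per-node
         * zonelist to use. */
        return zonelists[nid][gfpmask & GFP_ZONEMASK];
}

int main(void)
{
        printf("%s\n", pick_zonelist(0, 0x2));
        printf("%s\n", pick_zonelist(1, 0x1));
        return 0;
}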
[-- Attachment #2: memrc-pzone-gfp-fix.diff --]
[-- Type: text/plain, Size: 3792 bytes --]
Index: mm/mem_rc_pzone.c
===================================================================
RCS file: /cvsroot/ckrm/memrc-pzone/mm/mem_rc_pzone.c,v
retrieving revision 1.9
diff -u -p -r1.9 mem_rc_pzone.c
--- mm/mem_rc_pzone.c 19 Jan 2006 05:40:13 -0000 1.9
+++ mm/mem_rc_pzone.c 3 Feb 2006 01:01:22 -0000
@@ -38,7 +38,7 @@ struct mem_rc {
unsigned long guarantee;
struct mem_rc_domain *rcd;
struct zone **zones[MAX_NUMNODES];
- struct zonelist *zonelists[MAX_NUMNODES];
+ struct zonelist *zonelists[MAX_NUMNODES][GFP_ZONETYPES];
};
@@ -109,7 +109,7 @@ static void *mem_rc_create(void *arg, st
struct zone *parent, *z, *z_ref;
pg_data_t *pgdat;
int node, allocn;
- int i, j;
+ int i, j, k;
allocn = first_node(rcd->nodes);
mr = kmalloc_node(sizeof(*mr), GFP_KERNEL, allocn);
@@ -132,13 +132,16 @@ static void *mem_rc_create(void *arg, st
memset(mr->zones[node], 0,
sizeof(*mr->zones[node]) * MAX_NR_ZONES);
- mr->zonelists[node]
- = kmalloc_node(sizeof(*mr->zonelists[node]),
- GFP_KERNEL, allocn);
- if (!mr->zonelists[node])
- goto failed;
+ for (i = 0; i < GFP_ZONETYPES; i++) {
+ mr->zonelists[node][i]
+ = kmalloc_node(sizeof(*mr->zonelists[node]),
+ GFP_KERNEL, allocn);
+ if (!mr->zonelists[node][i])
+ goto failed;
- memset(mr->zonelists[node], 0, sizeof(*mr->zonelists[node]));
+ memset(mr->zonelists[node][i], 0,
+ sizeof(*mr->zonelists[node][i]));
+ }
for (i = 0; i < MAX_NR_ZONES; i++) {
parent = pgdat->node_zones + i;
@@ -153,21 +156,22 @@ static void *mem_rc_create(void *arg, st
}
for_each_node_mask(node, rcd->nodes) {
- /* NORMAL zones and DMA zones also in HIGHMEM zonelist. */
- zl_ref = NODE_DATA(node)->node_zonelists + __GFP_HIGHMEM;
- zl = mr->zonelists[node];
-
- for (j = i = 0; i < ARRAY_SIZE(zl_ref->zones); i++) {
- z_ref = zl_ref->zones[i];
- if (!z_ref)
- break;
-
- z = mr->zones[node][zone_idx(z_ref)];
- if (!z)
- continue;
- zl->zones[j++] = z;
+ for (i = 0; i < GFP_ZONETYPES; i++) {
+ zl_ref = NODE_DATA(node)->node_zonelists + i;
+ zl = mr->zonelists[node][i];
+
+ for (j = k = 0; k < ARRAY_SIZE(zl_ref->zones); k++) {
+ z_ref = zl_ref->zones[k];
+ if (!z_ref)
+ break;
+
+ z = mr->zones[node][zone_idx(z_ref)];
+ if (!z)
+ continue;
+ zl->zones[j++] = z;
+ }
+ zl->zones[j] = NULL;
}
- zl->zones[j] = NULL;
}
up(&rcd->sem);
@@ -175,8 +179,10 @@ static void *mem_rc_create(void *arg, st
failed:
for_each_node_mask(node, rcd->nodes) {
- if (mr->zonelists[node])
- kfree(mr->zonelists[node]);
+ for (i = 0; i < GFP_ZONETYPES; i++) {
+ if (mr->zonelists[node][i])
+ kfree(mr->zonelists[node][i]);
+ }
if (!mr->zones[node])
continue;
@@ -204,8 +210,10 @@ static void mem_rc_destroy(void *p)
down(&rcd->sem);
for (node = 0; node < MAX_NUMNODES; node++) {
- if (mr->zonelists[node])
- kfree(mr->zonelists[node]);
+ for (i = 0; i < GFP_ZONETYPES; i++) {
+ if (mr->zonelists[node][i])
+ kfree(mr->zonelists[node][i]);
+ }
if (!mr->zones[node])
continue;
@@ -341,14 +349,15 @@ EXPORT_SYMBOL(mem_rc_get);
struct page *alloc_page_mem_rc(int nid, gfp_t gfpmask)
{
struct mem_rc *mr;
+ gfp_t zoneidx = gfpmask & GFP_ZONEMASK;
mr = mem_rc_get(current);
if (!mr)
return __alloc_pages(gfpmask, 0,
NODE_DATA(nid)->node_zonelists
- + (gfpmask & GFP_ZONEMASK));
+ + zoneidx);
- return __alloc_pages(gfpmask, 0, mr->zonelists[nid]);
+ return __alloc_pages(gfpmask, 0, mr->zonelists[nid][zoneidx]);
}
EXPORT_SYMBOL(alloc_page_mem_rc);
@@ -364,5 +373,5 @@ struct zonelist *mem_rc_get_zonelist(int
if (!mr)
return NULL;
- return mr->zonelists[nd];
+ return mr->zonelists[nd][gfpmask & GFP_ZONEMASK];
}
* Re: [ckrm-tech] [PATCH 0/8] Pzone based CKRM memory resource controller
2006-02-03 1:33 ` KUROSAWA Takahiro
@ 2006-02-03 9:37 ` KUROSAWA Takahiro
0 siblings, 0 replies; 32+ messages in thread
From: KUROSAWA Takahiro @ 2006-02-03 9:37 UTC (permalink / raw)
To: sekharan; +Cc: ckrm-tech, linux-mm
[-- Attachment #1: Type: text/plain, Size: 1201 bytes --]
On Fri, 3 Feb 2006 10:33:58 +0900
KUROSAWA Takahiro <kurosawa@valinux.co.jp> wrote:
> On Tue, 31 Jan 2006 19:07:35 -0800
> chandra seetharaman <sekharan@us.ibm.com> wrote:
>
> > I tried to use the controller but having some problems.
> >
> > - Created class a,
> > - set guarantee to 50(with parent having 100, i expected class a to get
> > 50% of memory in the system).
> > - moved my shell to class a.
> > - Issued a make in the kernel tree.
> > It consistently fails with
> > -----------
> > make: getcwd: : Cannot allocate memory
> > Makefile:313: /scripts/Kbuild.include: No such file or directory
> > Makefile:532: /arch/i386/Makefile: No such file or directory
> > Can't open perl script "/scripts/setlocalversion": No such file or
> > directory
> > make: *** No rule to make target `/arch/i386/Makefile'. Stop.
> > -----------
> > Note that the compilation succeeds if I move my shell to the default
> > class.
>
> I could reproduce this problem. Could you try the attached patch?
I'm sorry, the patch attached to my previous mail has a severe bug.
Could you try this patch instead?
Also, the code still doesn't work with preemption enabled, because of a
locking problem that remains so far...
[-- Attachment #2: memrc-pzone-gfp-fix2.diff --]
[-- Type: text/plain, Size: 3817 bytes --]
Index: mm/mem_rc_pzone.c
===================================================================
RCS file: /cvsroot/ckrm/memrc-pzone/mm/mem_rc_pzone.c,v
retrieving revision 1.9
diff -u -p -r1.9 mem_rc_pzone.c
--- mm/mem_rc_pzone.c 19 Jan 2006 05:40:13 -0000 1.9
+++ mm/mem_rc_pzone.c 3 Feb 2006 08:30:15 -0000
@@ -38,7 +38,7 @@ struct mem_rc {
unsigned long guarantee;
struct mem_rc_domain *rcd;
struct zone **zones[MAX_NUMNODES];
- struct zonelist *zonelists[MAX_NUMNODES];
+ struct zonelist *zonelists[MAX_NUMNODES][GFP_ZONETYPES];
};
@@ -109,7 +109,7 @@ static void *mem_rc_create(void *arg, st
struct zone *parent, *z, *z_ref;
pg_data_t *pgdat;
int node, allocn;
- int i, j;
+ int i, j, k;
allocn = first_node(rcd->nodes);
mr = kmalloc_node(sizeof(*mr), GFP_KERNEL, allocn);
@@ -132,13 +132,16 @@ static void *mem_rc_create(void *arg, st
memset(mr->zones[node], 0,
sizeof(*mr->zones[node]) * MAX_NR_ZONES);
- mr->zonelists[node]
- = kmalloc_node(sizeof(*mr->zonelists[node]),
- GFP_KERNEL, allocn);
- if (!mr->zonelists[node])
- goto failed;
+ for (i = 0; i < GFP_ZONETYPES; i++) {
+ mr->zonelists[node][i]
+ = kmalloc_node(sizeof(*mr->zonelists[node][i]),
+ GFP_KERNEL, allocn);
+ if (!mr->zonelists[node][i])
+ goto failed;
- memset(mr->zonelists[node], 0, sizeof(*mr->zonelists[node]));
+ memset(mr->zonelists[node][i], 0,
+ sizeof(*mr->zonelists[node][i]));
+ }
for (i = 0; i < MAX_NR_ZONES; i++) {
parent = pgdat->node_zones + i;
@@ -153,21 +156,22 @@ static void *mem_rc_create(void *arg, st
}
for_each_node_mask(node, rcd->nodes) {
- /* NORMAL zones and DMA zones also in HIGHMEM zonelist. */
- zl_ref = NODE_DATA(node)->node_zonelists + __GFP_HIGHMEM;
- zl = mr->zonelists[node];
-
- for (j = i = 0; i < ARRAY_SIZE(zl_ref->zones); i++) {
- z_ref = zl_ref->zones[i];
- if (!z_ref)
- break;
-
- z = mr->zones[node][zone_idx(z_ref)];
- if (!z)
- continue;
- zl->zones[j++] = z;
+ for (i = 0; i < GFP_ZONETYPES; i++) {
+ zl_ref = NODE_DATA(node)->node_zonelists + i;
+ zl = mr->zonelists[node][i];
+
+ for (j = k = 0; k < ARRAY_SIZE(zl_ref->zones); k++) {
+ z_ref = zl_ref->zones[k];
+ if (!z_ref)
+ break;
+
+ z = mr->zones[z_ref->zone_pgdat->node_id][zone_idx(z_ref)];
+ if (!z)
+ continue;
+ zl->zones[j++] = z;
+ }
+ zl->zones[j] = NULL;
}
- zl->zones[j] = NULL;
}
up(&rcd->sem);
@@ -175,8 +179,10 @@ static void *mem_rc_create(void *arg, st
failed:
for_each_node_mask(node, rcd->nodes) {
- if (mr->zonelists[node])
- kfree(mr->zonelists[node]);
+ for (i = 0; i < GFP_ZONETYPES; i++) {
+ if (mr->zonelists[node][i])
+ kfree(mr->zonelists[node][i]);
+ }
if (!mr->zones[node])
continue;
@@ -204,8 +210,10 @@ static void mem_rc_destroy(void *p)
down(&rcd->sem);
for (node = 0; node < MAX_NUMNODES; node++) {
- if (mr->zonelists[node])
- kfree(mr->zonelists[node]);
+ for (i = 0; i < GFP_ZONETYPES; i++) {
+ if (mr->zonelists[node][i])
+ kfree(mr->zonelists[node][i]);
+ }
if (!mr->zones[node])
continue;
@@ -341,14 +349,15 @@ EXPORT_SYMBOL(mem_rc_get);
struct page *alloc_page_mem_rc(int nid, gfp_t gfpmask)
{
struct mem_rc *mr;
+ gfp_t zoneidx = gfpmask & GFP_ZONEMASK;
mr = mem_rc_get(current);
if (!mr)
return __alloc_pages(gfpmask, 0,
NODE_DATA(nid)->node_zonelists
- + (gfpmask & GFP_ZONEMASK));
+ + zoneidx);
- return __alloc_pages(gfpmask, 0, mr->zonelists[nid]);
+ return __alloc_pages(gfpmask, 0, mr->zonelists[nid][zoneidx]);
}
EXPORT_SYMBOL(alloc_page_mem_rc);
@@ -364,5 +373,5 @@ struct zonelist *mem_rc_get_zonelist(int
if (!mr)
return NULL;
- return mr->zonelists[nd];
+ return mr->zonelists[nd][gfpmask & GFP_ZONEMASK];
}
Thread overview: 32+ messages
2006-01-19 8:04 [PATCH 0/2] Pzone based CKRM memory resource controller KUROSAWA Takahiro
2006-01-19 8:04 ` [PATCH 1/2] Add the pzone KUROSAWA Takahiro
2006-01-19 18:04 ` Andy Whitcroft
2006-01-19 23:42 ` KUROSAWA Takahiro
2006-01-20 9:17 ` Andy Whitcroft
2006-01-20 7:08 ` KAMEZAWA Hiroyuki
2006-01-20 8:22 ` KUROSAWA Takahiro
2006-01-20 8:30 ` KAMEZAWA Hiroyuki
2006-01-19 8:04 ` [PATCH 2/2] Add CKRM memory resource controller using pzones KUROSAWA Takahiro
2006-01-31 2:30 ` [PATCH 0/8] Pzone based CKRM memory resource controller KUROSAWA Takahiro
2006-01-31 2:30 ` [PATCH 1/8] Add the __GFP_NOLRU flag KUROSAWA Takahiro
2006-01-31 18:18 ` [ckrm-tech] " Dave Hansen
2006-02-01 5:06 ` KUROSAWA Takahiro
2006-01-31 2:30 ` [PATCH 2/8] Keep the number of zones while zone iterator loop KUROSAWA Takahiro
2006-01-31 2:30 ` [PATCH 3/8] Add for_each_zone_in_node macro KUROSAWA Takahiro
2006-01-31 2:30 ` [PATCH 4/8] Extract zone specific routines as functions KUROSAWA Takahiro
2006-01-31 2:30 ` [PATCH 5/8] Add the pzone_create() function KUROSAWA Takahiro
2006-01-31 2:30 ` [PATCH 6/8] Add the pzone_destroy() function KUROSAWA Takahiro
2006-01-31 2:30 ` [PATCH 7/8] Make the number of pages in pzones resizable KUROSAWA Takahiro
2006-01-31 2:30 ` [PATCH 8/8] Add a CKRM memory resource controller using pzones KUROSAWA Takahiro
2006-02-01 2:58 ` [ckrm-tech] [PATCH 0/8] Pzone based CKRM memory resource controller chandra seetharaman
2006-02-01 5:39 ` KUROSAWA Takahiro
2006-02-01 6:16 ` Hirokazu Takahashi
2006-02-02 1:26 ` chandra seetharaman
2006-02-02 3:54 ` KUROSAWA Takahiro
2006-02-03 0:37 ` chandra seetharaman
2006-02-03 0:51 ` KUROSAWA Takahiro
2006-02-03 1:01 ` chandra seetharaman
2006-02-01 3:07 ` chandra seetharaman
2006-02-01 5:54 ` KUROSAWA Takahiro
2006-02-03 1:33 ` KUROSAWA Takahiro
2006-02-03 9:37 ` KUROSAWA Takahiro