linux-mm.kvack.org archive mirror
* [PATCH] memory hotremoval for linux-2.6.7 [0/16]
@ 2004-07-14 13:41 Hirokazu Takahashi
  2004-07-14 14:02 ` [PATCH] memory hotremoval for linux-2.6.7 [1/16] Hirokazu Takahashi
                   ` (15 more replies)
  0 siblings, 16 replies; 17+ messages in thread
From: Hirokazu Takahashi @ 2004-07-14 13:41 UTC (permalink / raw)
  To: linux-kernel, lhms-devel; +Cc: linux-mm

Hi,

I'm pleased to say I've cleaned up the memory hotremoval patch
Mr. Iwamoto implemented. Part of the ugly code is gone.

Main changes are:

  - Renamed remap to mmigrate, as the name remap was already used
    for other functionality.

  - Made some of the memory hotremoval code shared with the swapout code.

  - Added many comments describing the design of the memory hotremoval code.

  - Added a basic function, try_to_migrate_pages(), to support memsections.
    It keeps picking a suitable page in a specified section and migrating
    it while pages remain in the section (a rough usage sketch follows
    below).
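
Here is a rough sketch (not part of the patches) of how the interface
is meant to be driven. The selector callback below is made up for
illustration; try_to_migrate_pages(), MIGRATE_ANYNODE and
steal_page_from_lru() are all introduced by the patches that follow.

	/* pick any page in the zone that can be isolated from the LRU */
	static struct page *pick_any_page(struct zone *zone, void *arg)
	{
		struct page *page, *page2;

		/* zone->lru_lock is held by try_to_migrate_pages() */
		list_for_each_entry_safe(page, page2,
					 &zone->inactive_list, lru)
			if (steal_page_from_lru(zone, page) != NULL)
				return page;
		return NULL;
	}

	/* migrate everything the selector finds, to any node */
	nr_failed = try_to_migrate_pages(zone, MIGRATE_ANYNODE,
					 pick_any_page, NULL);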

The patches are against linux-2.6.7.

Note that some of the patches are bug fixes; without them, hugetlbpage
migration won't work.

Thanks,
Hirokazu Takahashi.


* Re: [PATCH] memory hotremoval for linux-2.6.7 [1/16]
  2004-07-14 13:41 [PATCH] memory hotremoval for linux-2.6.7 [0/16] Hirokazu Takahashi
@ 2004-07-14 14:02 ` Hirokazu Takahashi
  2004-07-14 14:02 ` [PATCH] memory hotremoval for linux-2.6.7 [2/16] Hirokazu Takahashi
                   ` (14 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: Hirokazu Takahashi @ 2004-07-14 14:02 UTC (permalink / raw)
  To: linux-kernel, lhms-devel; +Cc: linux-mm


--- linux-2.6.7.ORG/include/linux/mm_inline.h	Sat Jul 10 12:42:43 2032
+++ linux-2.6.7/include/linux/mm_inline.h	Sat Jul 10 12:34:19 2032
@@ -38,3 +38,42 @@ del_page_from_lru(struct zone *zone, str
 		zone->nr_inactive--;
 	}
 }
+
+static inline struct page *
+steal_page_from_lru(struct zone *zone, struct page *page)
+{
+	if (!TestClearPageLRU(page))
+		BUG();
+	list_del(&page->lru);
+	if (get_page_testone(page)) {
+		/*
+		 * It was already free!  release_pages() or put_page()
+		 * are about to remove it from the LRU and free it. So
+		 * put the refcount back and put the page back on the
+		 * LRU
+		 */
+		__put_page(page);
+		SetPageLRU(page);
+		if (PageActive(page))
+			list_add(&page->lru, &zone->active_list);
+		else
+			list_add(&page->lru, &zone->inactive_list);
+		return NULL;
+	}
+	if (PageActive(page))
+		zone->nr_active--;
+	else
+		zone->nr_inactive--;
+	return page;
+}
+
+static inline void
+putback_page_to_lru(struct zone *zone, struct page *page)
+{
+	if (TestSetPageLRU(page))
+		BUG();
+	if (PageActive(page))
+		add_page_to_active_list(zone, page);
+	else
+		add_page_to_inactive_list(zone, page);
+}
--- linux-2.6.7.ORG/mm/vmscan.c	Sat Jul 10 12:42:43 2032
+++ linux-2.6.7/mm/vmscan.c	Sat Jul 10 12:41:29 2032
@@ -557,23 +557,11 @@ static void shrink_cache(struct zone *zo
 
 			prefetchw_prev_lru_page(page,
 						&zone->inactive_list, flags);
-
-			if (!TestClearPageLRU(page))
-				BUG();
-			list_del(&page->lru);
-			if (get_page_testone(page)) {
-				/*
-				 * It is being freed elsewhere
-				 */
-				__put_page(page);
-				SetPageLRU(page);
-				list_add(&page->lru, &zone->inactive_list);
+			if (steal_page_from_lru(zone, page) == NULL)
 				continue;
-			}
 			list_add(&page->lru, &page_list);
 			nr_taken++;
 		}
-		zone->nr_inactive -= nr_taken;
 		zone->pages_scanned += nr_taken;
 		spin_unlock_irq(&zone->lru_lock);
 
@@ -596,13 +584,8 @@ static void shrink_cache(struct zone *zo
 		 */
 		while (!list_empty(&page_list)) {
 			page = lru_to_page(&page_list);
-			if (TestSetPageLRU(page))
-				BUG();
 			list_del(&page->lru);
-			if (PageActive(page))
-				add_page_to_active_list(zone, page);
-			else
-				add_page_to_inactive_list(zone, page);
+			putback_page_to_lru(zone, page);
 			if (!pagevec_add(&pvec, page)) {
 				spin_unlock_irq(&zone->lru_lock);
 				__pagevec_release(&pvec);
@@ -655,26 +638,12 @@ refill_inactive_zone(struct zone *zone, 
 	while (pgscanned < nr_pages && !list_empty(&zone->active_list)) {
 		page = lru_to_page(&zone->active_list);
 		prefetchw_prev_lru_page(page, &zone->active_list, flags);
-		if (!TestClearPageLRU(page))
-			BUG();
-		list_del(&page->lru);
-		if (get_page_testone(page)) {
-			/*
-			 * It was already free!  release_pages() or put_page()
-			 * are about to remove it from the LRU and free it. So
-			 * put the refcount back and put the page back on the
-			 * LRU
-			 */
-			__put_page(page);
-			SetPageLRU(page);
-			list_add(&page->lru, &zone->active_list);
-		} else {
+		if (steal_page_from_lru(zone, page) != NULL) {
 			list_add(&page->lru, &l_hold);
 			pgmoved++;
 		}
 		pgscanned++;
 	}
-	zone->nr_active -= pgmoved;
 	spin_unlock_irq(&zone->lru_lock);
 
 	/*
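
The two helpers factored out above are meant to be used under
zone->lru_lock.  A minimal usage sketch (not part of the patch, and
assuming the inactive list is not empty) looks like this:

	spin_lock_irq(&zone->lru_lock);
	page = lru_to_page(&zone->inactive_list);
	if (steal_page_from_lru(zone, page) != NULL) {
		/* page is off the LRU and we hold an extra reference */
		spin_unlock_irq(&zone->lru_lock);

		/* ... operate on the isolated page ... */

		spin_lock_irq(&zone->lru_lock);
		putback_page_to_lru(zone, page);
		spin_unlock_irq(&zone->lru_lock);
		page_cache_release(page);	/* drop the stolen reference */
	} else {
		/* already being freed; the helper put it back for us */
		spin_unlock_irq(&zone->lru_lock);
	}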

* Re: [PATCH] memory hotremoval for linux-2.6.7 [2/16]
  2004-07-14 13:41 [PATCH] memory hotremoval for linux-2.6.7 [0/16] Hirokazu Takahashi
  2004-07-14 14:02 ` [PATCH] memory hotremoval for linux-2.6.7 [1/16] Hirokazu Takahashi
@ 2004-07-14 14:02 ` Hirokazu Takahashi
  2004-07-14 14:03 ` [PATCH] memory hotremoval for linux-2.6.7 [3/16] Hirokazu Takahashi
                   ` (13 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: Hirokazu Takahashi @ 2004-07-14 14:02 UTC (permalink / raw)
  To: linux-kernel, lhms-devel; +Cc: linux-mm

--- linux-2.6.7.ORG/include/linux/swap.h	Sat Jul 10 12:30:17 2032
+++ linux-2.6.7/include/linux/swap.h	Sat Jul 10 13:47:57 2032
@@ -174,6 +174,17 @@ extern void swap_setup(void);
 /* linux/mm/vmscan.c */
 extern int try_to_free_pages(struct zone **, unsigned int, unsigned int);
 extern int shrink_all_memory(int);
+typedef enum {
+	/* failed to write page out, page is locked */
+	PAGE_KEEP,
+	/* move page to the active list, page is locked */
+	PAGE_ACTIVATE,
+	/* page has been sent to the disk successfully, page is unlocked */
+	PAGE_SUCCESS,
+	/* page is clean and locked */
+	PAGE_CLEAN,
+} pageout_t;
+extern pageout_t pageout(struct page *, struct address_space *);
 extern int vm_swappiness;
 
 #ifdef CONFIG_MMU
--- linux-2.6.7.ORG/mm/vmscan.c	Sat Jul 10 15:13:47 2032
+++ linux-2.6.7/mm/vmscan.c	Sat Jul 10 13:48:42 2032
@@ -236,22 +241,10 @@ static void handle_write_error(struct ad
 	unlock_page(page);
 }
 
-/* possible outcome of pageout() */
-typedef enum {
-	/* failed to write page out, page is locked */
-	PAGE_KEEP,
-	/* move page to the active list, page is locked */
-	PAGE_ACTIVATE,
-	/* page has been sent to the disk successfully, page is unlocked */
-	PAGE_SUCCESS,
-	/* page is clean and locked */
-	PAGE_CLEAN,
-} pageout_t;
-
 /*
  * pageout is called by shrink_list() for each dirty page. Calls ->writepage().
  */
-static pageout_t pageout(struct page *page, struct address_space *mapping)
+pageout_t pageout(struct page *page, struct address_space *mapping)
 {
 	/*
 	 * If the page is dirty, only perform writeback if that write
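
With pageout() exported, an outside caller is expected to handle its
return value roughly as sketched below (not part of the patch); the
locking rules follow the pageout_t comments above, i.e. only
PAGE_SUCCESS returns with the page unlocked:

	/* page is locked and dirty on entry */
	switch (pageout(page, page_mapping(page))) {
	case PAGE_SUCCESS:
		/* writeback was started and the page came back unlocked */
		lock_page(page);
		break;
	case PAGE_ACTIVATE:	/* move it to the active list; still locked */
	case PAGE_KEEP:		/* write failed; still locked */
	case PAGE_CLEAN:	/* nothing to write; still locked */
		break;
	}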

* [PATCH] memory hotremoval for linux-2.6.7 [3/16]
  2004-07-14 13:41 [PATCH] memory hotremoval for linux-2.6.7 [0/16] Hirokazu Takahashi
  2004-07-14 14:02 ` [PATCH] memory hotremoval for linux-2.6.7 [1/16] Hirokazu Takahashi
  2004-07-14 14:02 ` [PATCH] memory hotremoval for linux-2.6.7 [2/16] Hirokazu Takahashi
@ 2004-07-14 14:03 ` Hirokazu Takahashi
  2004-07-14 14:03 ` [PATCH] memory hotremoval for linux-2.6.7 [4/16] Hirokazu Takahashi
                   ` (12 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: Hirokazu Takahashi @ 2004-07-14 14:03 UTC (permalink / raw)
  To: linux-kernel, lhms-devel; +Cc: linux-mm

--- linux-2.6.7.ORG/arch/i386/Kconfig	Sun Jul 11 10:05:04 2032
+++ linux-2.6.7/arch/i386/Kconfig	Sun Jul 11 10:04:58 2032
@@ -734,9 +734,19 @@ comment "NUMA (NUMA-Q) requires SMP, 64G
 comment "NUMA (Summit) requires SMP, 64GB highmem support, ACPI"
 	depends on X86_SUMMIT && (!HIGHMEM64G || !ACPI)
 
+config MEMHOTPLUG
+	bool "Memory hotplug test"
+	depends on !X86_PAE
+	default n
+
+config MEMHOTPLUG_BLKSIZE
+	int "Size of a memory hotplug unit (in MB, must be multiple of 256)."
+	range 256 1024
+	depends on MEMHOTPLUG
+
 config DISCONTIGMEM
 	bool
-	depends on NUMA
+	depends on NUMA || MEMHOTPLUG
 	default y
 
 config HAVE_ARCH_BOOTMEM_NODE
--- linux-2.6.7.ORG/include/linux/gfp.h	Sun Jul 11 10:05:04 2032
+++ linux-2.6.7/include/linux/gfp.h	Sat Jul 10 19:37:22 2032
@@ -11,9 +11,10 @@ struct vm_area_struct;
 /*
  * GFP bitmasks..
  */
-/* Zone modifiers in GFP_ZONEMASK (see linux/mmzone.h - low two bits) */
-#define __GFP_DMA	0x01
-#define __GFP_HIGHMEM	0x02
+/* Zone modifiers in GFP_ZONEMASK (see linux/mmzone.h - low three bits) */
+#define __GFP_DMA		0x01
+#define __GFP_HIGHMEM		0x02
+#define __GFP_HOTREMOVABLE	0x03
 
 /*
  * Action modifiers - doesn't change the zoning
@@ -51,7 +52,7 @@ struct vm_area_struct;
 #define GFP_NOFS	(__GFP_WAIT | __GFP_IO)
 #define GFP_KERNEL	(__GFP_WAIT | __GFP_IO | __GFP_FS)
 #define GFP_USER	(__GFP_WAIT | __GFP_IO | __GFP_FS)
-#define GFP_HIGHUSER	(__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HIGHMEM)
+#define GFP_HIGHUSER	(__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HIGHMEM | __GFP_HOTREMOVABLE)
 
 /* Flag - indicates that the buffer will be suitable for DMA.  Ignored on some
    platforms, used as appropriate on others */
--- linux-2.6.7.ORG/include/linux/mmzone.h	Sun Jul 11 10:05:04 2032
+++ linux-2.6.7/include/linux/mmzone.h	Sun Jul 11 10:04:13 2032
@@ -65,8 +65,10 @@ struct per_cpu_pageset {
 #define ZONE_DMA		0
 #define ZONE_NORMAL		1
 #define ZONE_HIGHMEM		2
+#define ZONE_HOTREMOVABLE	3	/* only for zonelists */
 
 #define MAX_NR_ZONES		3	/* Sync this with ZONES_SHIFT */
+#define MAX_NR_ZONELISTS	4
 #define ZONES_SHIFT		2	/* ceil(log2(MAX_NR_ZONES)) */
 
 #define GFP_ZONEMASK	0x03
@@ -225,7 +227,7 @@ struct zonelist {
 struct bootmem_data;
 typedef struct pglist_data {
 	struct zone node_zones[MAX_NR_ZONES];
-	struct zonelist node_zonelists[MAX_NR_ZONES];
+	struct zonelist node_zonelists[MAX_NR_ZONELISTS];
 	int nr_zones;
 	struct page *node_mem_map;
 	struct bootmem_data *bdata;
@@ -237,6 +239,7 @@ typedef struct pglist_data {
 	struct pglist_data *pgdat_next;
 	wait_queue_head_t       kswapd_wait;
 	struct task_struct *kswapd;
+	char removable, enabled;
 } pg_data_t;
 
 #define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)
--- linux-2.6.7.ORG/include/linux/page-flags.h	Sun Jul 11 10:05:04 2032
+++ linux-2.6.7/include/linux/page-flags.h	Sun Jul 11 10:04:13 2032
@@ -78,6 +78,8 @@
 
 #define PG_anon			20	/* Anonymous: anon_vma in mapping */
 
+#define PG_again		21
+
 
 /*
  * Global page accounting.  One instance per CPU.  Only unsigned longs are
@@ -297,6 +299,10 @@ extern unsigned long __read_page_state(u
 #define PageCompound(page)	test_bit(PG_compound, &(page)->flags)
 #define SetPageCompound(page)	set_bit(PG_compound, &(page)->flags)
 #define ClearPageCompound(page)	clear_bit(PG_compound, &(page)->flags)
+
+#define PageAgain(page)	test_bit(PG_again, &(page)->flags)
+#define SetPageAgain(page)	set_bit(PG_again, &(page)->flags)
+#define ClearPageAgain(page)	clear_bit(PG_again, &(page)->flags)
 
 #define PageAnon(page)		test_bit(PG_anon, &(page)->flags)
 #define SetPageAnon(page)	set_bit(PG_anon, &(page)->flags)
--- linux-2.6.7.ORG/include/linux/rmap.h	Sun Jul 11 10:05:04 2032
+++ linux-2.6.7/include/linux/rmap.h	Sat Jul 10 19:37:22 2032
@@ -96,7 +96,7 @@ static inline void page_dup_rmap(struct 
  * Called from mm/vmscan.c to handle paging out
  */
 int page_referenced(struct page *);
-int try_to_unmap(struct page *);
+int try_to_unmap(struct page *, struct list_head *);
 
 #else	/* !CONFIG_MMU */
 
@@ -105,7 +105,7 @@ int try_to_unmap(struct page *);
 #define anon_vma_link(vma)	do {} while (0)
 
 #define page_referenced(page)	TestClearPageReferenced(page)
-#define try_to_unmap(page)	SWAP_FAIL
+#define try_to_unmap(page, force)	SWAP_FAIL
 
 #endif	/* CONFIG_MMU */
 
--- linux-2.6.7.ORG/mm/Makefile	Sun Jul 11 10:05:04 2032
+++ linux-2.6.7/mm/Makefile	Sat Jul 10 19:37:22 2032
@@ -15,3 +15,5 @@ obj-y			:= bootmem.o filemap.o mempool.o
 obj-$(CONFIG_SWAP)	+= page_io.o swap_state.o swapfile.o
 obj-$(CONFIG_HUGETLBFS)	+= hugetlb.o
 obj-$(CONFIG_NUMA) 	+= mempolicy.o
+
+obj-$(CONFIG_MEMHOTPLUG) += memhotplug.o
--- linux-2.6.7.ORG/mm/filemap.c	Sun Jul 11 10:05:04 2032
+++ linux-2.6.7/mm/filemap.c	Sat Jul 10 19:37:22 2032
@@ -250,7 +250,8 @@ int filemap_write_and_wait(struct addres
 int add_to_page_cache(struct page *page, struct address_space *mapping,
 		pgoff_t offset, int gfp_mask)
 {
-	int error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
+	int error = radix_tree_preload((gfp_mask & ~GFP_ZONEMASK) |
+	    ((gfp_mask & GFP_ZONEMASK) == __GFP_DMA ? __GFP_DMA : 0));
 
 	if (error == 0) {
 		spin_lock_irq(&mapping->tree_lock);
@@ -495,6 +496,7 @@ repeat:
 				page_cache_release(page);
 				goto repeat;
 			}
+			BUG_ON(PageAgain(page));
 		}
 	}
 	spin_unlock_irq(&mapping->tree_lock);
@@ -738,6 +740,8 @@ page_not_up_to_date:
 			goto page_ok;
 		}
 
+		BUG_ON(PageAgain(page));
+
 readpage:
 		/* ... and start the actual read. The read will unlock the page. */
 		error = mapping->a_ops->readpage(filp, page);
@@ -1206,6 +1210,8 @@ page_not_uptodate:
 		goto success;
 	}
 
+	BUG_ON(PageAgain(page));
+
 	if (!mapping->a_ops->readpage(file, page)) {
 		wait_on_page_locked(page);
 		if (PageUptodate(page))
@@ -1314,6 +1320,8 @@ page_not_uptodate:
 		goto success;
 	}
 
+	BUG_ON(PageAgain(page));
+
 	if (!mapping->a_ops->readpage(file, page)) {
 		wait_on_page_locked(page);
 		if (PageUptodate(page))
@@ -1518,6 +1526,8 @@ retry:
 		unlock_page(page);
 		goto out;
 	}
+	BUG_ON(PageAgain(page));
+
 	err = filler(data, page);
 	if (err < 0) {
 		page_cache_release(page);
--- linux-2.6.7.ORG/mm/memory.c	Sun Jul 11 10:05:04 2032
+++ linux-2.6.7/mm/memory.c	Sun Jul 11 10:04:42 2032
@@ -1305,6 +1305,7 @@ static int do_swap_page(struct mm_struct
 
 	pte_unmap(page_table);
 	spin_unlock(&mm->page_table_lock);
+again:
 	page = lookup_swap_cache(entry);
 	if (!page) {
  		swapin_readahead(entry, address, vma);
@@ -1332,6 +1333,12 @@ static int do_swap_page(struct mm_struct
 
 	mark_page_accessed(page);
 	lock_page(page);
+	if (PageAgain(page)) {
+		unlock_page(page);
+		page_cache_release(page);
+		goto again;
+	}
+	BUG_ON(PageAgain(page));
 
 	/*
 	 * Back out if somebody else faulted in this pte while we
--- linux-2.6.7.ORG/mm/page_alloc.c	Sun Jul 11 10:05:04 2032
+++ linux-2.6.7/mm/page_alloc.c	Sun Jul 11 10:04:58 2032
@@ -25,6 +25,7 @@
 #include <linux/module.h>
 #include <linux/suspend.h>
 #include <linux/pagevec.h>
+#include <linux/memhotplug.h>
 #include <linux/blkdev.h>
 #include <linux/slab.h>
 #include <linux/notifier.h>
@@ -231,6 +232,7 @@ static inline void free_pages_check(cons
 			1 << PG_maplock |
 			1 << PG_anon    |
 			1 << PG_swapcache |
+			1 << PG_again |
 			1 << PG_writeback )))
 		bad_page(function, page);
 	if (PageDirty(page))
@@ -341,12 +343,13 @@ static void prep_new_page(struct page *p
 			1 << PG_maplock |
 			1 << PG_anon    |
 			1 << PG_swapcache |
+			1 << PG_again |
 			1 << PG_writeback )))
 		bad_page(__FUNCTION__, page);
 
 	page->flags &= ~(1 << PG_uptodate | 1 << PG_error |
 			1 << PG_referenced | 1 << PG_arch_1 |
-			1 << PG_checked | 1 << PG_mappedtodisk);
+			1 << PG_checked | 1 << PG_mappedtodisk | 1 << PG_again);
 	page->private = 0;
 	set_page_refs(page, order);
 }
@@ -404,7 +407,7 @@ static int rmqueue_bulk(struct zone *zon
 	return allocated;
 }
 
-#if defined(CONFIG_PM) || defined(CONFIG_HOTPLUG_CPU)
+#if defined(CONFIG_PM) || defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_MEMHOTPLUG)
 static void __drain_pages(unsigned int cpu)
 {
 	struct zone *zone;
@@ -447,7 +450,9 @@ int is_head_of_free_region(struct page *
 	spin_unlock_irqrestore(&zone->lock, flags);
         return 0;
 }
+#endif
 
+#if defined(CONFIG_SOFTWARE_SUSPEND) || defined(CONFIG_MEMHOTPLUG)
 /*
  * Spill all of this CPU's per-cpu pages back into the buddy allocator.
  */
@@ -847,7 +852,8 @@ unsigned int nr_free_pages(void)
 	struct zone *zone;
 
 	for_each_zone(zone)
-		sum += zone->free_pages;
+		if (zone->zone_pgdat->enabled)
+			sum += zone->free_pages;
 
 	return sum;
 }
@@ -860,7 +866,8 @@ unsigned int nr_used_zone_pages(void)
 	struct zone *zone;
 
 	for_each_zone(zone)
-		pages += zone->nr_active + zone->nr_inactive;
+		if (zone->zone_pgdat->enabled)
+			pages += zone->nr_active + zone->nr_inactive;
 
 	return pages;
 }
@@ -887,6 +894,8 @@ static unsigned int nr_free_zone_pages(i
 		struct zone **zonep = zonelist->zones;
 		struct zone *zone;
 
+		if (!pgdat->enabled)
+			continue;
 		for (zone = *zonep++; zone; zone = *zonep++) {
 			unsigned long size = zone->present_pages;
 			unsigned long high = zone->pages_high;
@@ -921,7 +930,8 @@ unsigned int nr_free_highpages (void)
 	unsigned int pages = 0;
 
 	for_each_pgdat(pgdat)
-		pages += pgdat->node_zones[ZONE_HIGHMEM].free_pages;
+		if (pgdat->enabled)
+			pages += pgdat->node_zones[ZONE_HIGHMEM].free_pages;
 
 	return pages;
 }
@@ -1171,13 +1181,21 @@ void show_free_areas(void)
 /*
  * Builds allocation fallback zone lists.
  */
-static int __init build_zonelists_node(pg_data_t *pgdat, struct zonelist *zonelist, int j, int k)
+static int build_zonelists_node(pg_data_t *pgdat, struct zonelist *zonelist, int j, int k)
 {
+
+	if (!pgdat->enabled)
+		return j;
+	if (k != ZONE_HOTREMOVABLE &&
+	    pgdat->removable)
+		return j;
+
 	switch (k) {
 		struct zone *zone;
 	default:
 		BUG();
 	case ZONE_HIGHMEM:
+	case ZONE_HOTREMOVABLE:
 		zone = pgdat->node_zones + ZONE_HIGHMEM;
 		if (zone->present_pages) {
 #ifndef CONFIG_HIGHMEM
@@ -1304,24 +1322,48 @@ static void __init build_zonelists(pg_da
 
 #else	/* CONFIG_NUMA */
 
-static void __init build_zonelists(pg_data_t *pgdat)
+static void build_zonelists(pg_data_t *pgdat)
 {
 	int i, j, k, node, local_node;
+	int hotremovable;
+#ifdef CONFIG_MEMHOTPLUG
+	struct zone *zone;
+#endif
 
 	local_node = pgdat->node_id;
-	for (i = 0; i < MAX_NR_ZONES; i++) {
+	for (i = 0; i < MAX_NR_ZONELISTS; i++) {
 		struct zonelist *zonelist;
 
 		zonelist = pgdat->node_zonelists + i;
-		memset(zonelist, 0, sizeof(*zonelist));
+		/* memset(zonelist, 0, sizeof(*zonelist)); */
 
 		j = 0;
 		k = ZONE_NORMAL;
-		if (i & __GFP_HIGHMEM)
+		hotremovable = 0;
+		switch (i) {
+		default:
+			BUG();
+			return;
+		case 0:
+			k = ZONE_NORMAL;
+			break;
+		case __GFP_HIGHMEM:
 			k = ZONE_HIGHMEM;
-		if (i & __GFP_DMA)
+			break;
+		case __GFP_DMA:
 			k = ZONE_DMA;
+			break;
+		case __GFP_HOTREMOVABLE:
+#ifdef CONFIG_MEMHOTPLUG
+			k = ZONE_HIGHMEM;
+#else
+			k = ZONE_HOTREMOVABLE;
+#endif
+			hotremovable = 1;
+			break;
+		}
 
+#ifndef CONFIG_MEMHOTPLUG
  		j = build_zonelists_node(pgdat, zonelist, j, k);
  		/*
  		 * Now we build the zonelist so that it contains the zones
@@ -1335,19 +1377,54 @@ static void __init build_zonelists(pg_da
  			j = build_zonelists_node(NODE_DATA(node), zonelist, j, k);
  		for (node = 0; node < local_node; node++)
  			j = build_zonelists_node(NODE_DATA(node), zonelist, j, k);
- 
-		zonelist->zones[j] = NULL;
-	}
+#else
+		while (hotremovable >= 0) {
+			for(; k >= 0; k--) {
+				zone = pgdat->node_zones + k;
+				for (node = local_node; ;) {
+					if (NODE_DATA(node) == NULL ||
+					    !NODE_DATA(node)->enabled ||
+					    (!!NODE_DATA(node)->removable) !=
+					    (!!hotremovable))
+						goto next;
+					zone = NODE_DATA(node)->node_zones + k;
+					if (zone->present_pages)
+						zonelist->zones[j++] = zone;
+				next:
+					node = (node + 1) % numnodes;
+					if (node == local_node)
+						break;
+				}
+			}
+			if (hotremovable) {
+				/* place non-hotremovable after hotremovable */
+				k = ZONE_HIGHMEM;
+			}
+			hotremovable--;
+		}
+#endif
+		BUG_ON(j > sizeof(zonelist->zones) /
+		    sizeof(zonelist->zones[0]) - 1);
+		for(; j < sizeof(zonelist->zones) /
+		    sizeof(zonelist->zones[0]); j++)
+			zonelist->zones[j] = NULL;
+  	} 
 }
 
 #endif	/* CONFIG_NUMA */
 
-void __init build_all_zonelists(void)
+#ifdef CONFIG_MEMHOTPLUG
+void
+#else
+void __init
+#endif
+build_all_zonelists(void)
 {
 	int i;
 
 	for(i = 0 ; i < numnodes ; i++)
-		build_zonelists(NODE_DATA(i));
+		if (NODE_DATA(i) != NULL)
+			build_zonelists(NODE_DATA(i));
 	printk("Built %i zonelists\n", numnodes);
 }
 
@@ -1419,7 +1496,7 @@ static void __init calculate_zone_totalp
  * up by free_all_bootmem() once the early boot process is
  * done. Non-atomic initialization, single-pass.
  */
-void __init memmap_init_zone(struct page *start, unsigned long size, int nid,
+void memmap_init_zone(struct page *start, unsigned long size, int nid,
 		unsigned long zone, unsigned long start_pfn)
 {
 	struct page *page;
@@ -1457,10 +1534,13 @@ static void __init free_area_init_core(s
 	int cpu, nid = pgdat->node_id;
 	struct page *lmem_map = pgdat->node_mem_map;
 	unsigned long zone_start_pfn = pgdat->node_start_pfn;
+#ifdef CONFIG_MEMHOTPLUG
+	int cold = !nid;
+#endif	
 
 	pgdat->nr_zones = 0;
 	init_waitqueue_head(&pgdat->kswapd_wait);
-	
+
 	for (j = 0; j < MAX_NR_ZONES; j++) {
 		struct zone *zone = pgdat->node_zones + j;
 		unsigned long size, realsize;
@@ -1530,6 +1610,13 @@ static void __init free_area_init_core(s
 		zone->wait_table_size = wait_table_size(size);
 		zone->wait_table_bits =
 			wait_table_bits(zone->wait_table_size);
+#ifdef CONFIG_MEMHOTPLUG
+		if (!cold)
+			zone->wait_table = (wait_queue_head_t *)
+				kmalloc(zone->wait_table_size
+				* sizeof(wait_queue_head_t), GFP_KERNEL);
+		else
+#endif
 		zone->wait_table = (wait_queue_head_t *)
 			alloc_bootmem_node(pgdat, zone->wait_table_size
 						* sizeof(wait_queue_head_t));
@@ -1584,6 +1671,13 @@ static void __init free_area_init_core(s
 			 */
 			bitmap_size = (size-1) >> (i+4);
 			bitmap_size = LONG_ALIGN(bitmap_size+1);
+#ifdef CONFIG_MEMHOTPLUG
+			if (!cold) {
+			zone->free_area[i].map = 
+			  (unsigned long *)kmalloc(bitmap_size, GFP_KERNEL);
+			memset(zone->free_area[i].map, 0, bitmap_size);
+			} else
+#endif
 			zone->free_area[i].map = 
 			  (unsigned long *) alloc_bootmem_node(pgdat, bitmap_size);
 		}
@@ -1901,7 +1995,7 @@ static void setup_per_zone_protection(vo
  *	that the pages_{min,low,high} values for each zone are set correctly 
  *	with respect to min_free_kbytes.
  */
-static void setup_per_zone_pages_min(void)
+void setup_per_zone_pages_min(void)
 {
 	unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
 	unsigned long lowmem_pages = 0;
--- linux-2.6.7.ORG/mm/rmap.c	Sun Jul 11 10:05:04 2032
+++ linux-2.6.7/mm/rmap.c	Sat Jul 10 19:37:22 2032
@@ -30,6 +30,7 @@
 #include <linux/pagemap.h>
 #include <linux/swap.h>
 #include <linux/swapops.h>
+#include <linux/memhotplug.h>
 #include <linux/slab.h>
 #include <linux/init.h>
 #include <linux/rmap.h>
@@ -421,7 +422,8 @@ void page_remove_rmap(struct page *page)
  * Subfunctions of try_to_unmap: try_to_unmap_one called
  * repeatedly from either try_to_unmap_anon or try_to_unmap_file.
  */
-static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma)
+static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
+    struct list_head *force)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long address;
@@ -429,6 +431,9 @@ static int try_to_unmap_one(struct page 
 	pmd_t *pmd;
 	pte_t *pte;
 	pte_t pteval;
+#ifdef CONFIG_MEMHOTPLUG
+	struct page_va_list *vlist;
+#endif
 	int ret = SWAP_AGAIN;
 
 	if (!mm->rss)
@@ -466,8 +471,22 @@ static int try_to_unmap_one(struct page 
 	 */
 	if ((vma->vm_flags & (VM_LOCKED|VM_RESERVED)) ||
 			ptep_test_and_clear_young(pte)) {
-		ret = SWAP_FAIL;
-		goto out_unmap;
+		if (force == NULL || vma->vm_flags & VM_RESERVED) {
+			ret = SWAP_FAIL;
+			goto out_unmap;
+		}
+#ifdef CONFIG_MEMHOTPLUG
+		vlist = kmalloc(sizeof(struct page_va_list), GFP_KERNEL);
+		atomic_inc(&mm->mm_count);
+		vlist->mm = mmgrab(mm);
+		if (vlist->mm == NULL) {
+			mmdrop(mm);
+			kfree(vlist);
+		} else {
+			vlist->addr = address;
+			list_add(&vlist->list, force);
+		}
+#endif
 	}
 
 	/*
@@ -620,7 +639,7 @@ out_unlock:
 	return SWAP_AGAIN;
 }
 
-static inline int try_to_unmap_anon(struct page *page)
+static inline int try_to_unmap_anon(struct page *page, struct list_head *force)
 {
 	struct anon_vma *anon_vma = (struct anon_vma *) page->mapping;
 	struct vm_area_struct *vma;
@@ -629,7 +648,7 @@ static inline int try_to_unmap_anon(stru
 	spin_lock(&anon_vma->lock);
 	BUG_ON(list_empty(&anon_vma->head));
 	list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
-		ret = try_to_unmap_one(page, vma);
+		ret = try_to_unmap_one(page, vma, force);
 		if (ret == SWAP_FAIL || !page->mapcount)
 			break;
 	}
@@ -649,7 +668,7 @@ static inline int try_to_unmap_anon(stru
  * The spinlock address_space->i_mmap_lock is tried.  If it can't be gotten,
  * return a temporary error.
  */
-static inline int try_to_unmap_file(struct page *page)
+static inline int try_to_unmap_file(struct page *page, struct list_head *force)
 {
 	struct address_space *mapping = page->mapping;
 	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
@@ -666,7 +685,7 @@ static inline int try_to_unmap_file(stru
 
 	while ((vma = vma_prio_tree_next(vma, &mapping->i_mmap,
 					&iter, pgoff, pgoff)) != NULL) {
-		ret = try_to_unmap_one(page, vma);
+		ret = try_to_unmap_one(page, vma, force);
 		if (ret == SWAP_FAIL || !page->mapcount)
 			goto out;
 	}
@@ -760,7 +779,7 @@ out:
  * SWAP_AGAIN	- we missed a trylock, try again later
  * SWAP_FAIL	- the page is unswappable
  */
-int try_to_unmap(struct page *page)
+int try_to_unmap(struct page *page, struct list_head *force)
 {
 	int ret;
 
@@ -769,9 +788,9 @@ int try_to_unmap(struct page *page)
 	BUG_ON(!page->mapcount);
 
 	if (PageAnon(page))
-		ret = try_to_unmap_anon(page);
+		ret = try_to_unmap_anon(page, force);
 	else
-		ret = try_to_unmap_file(page);
+		ret = try_to_unmap_file(page, force);
 
 	if (!page->mapcount) {
 		if (page_test_and_clear_dirty(page))
--- linux-2.6.7.ORG/mm/swapfile.c	Sun Jul 11 10:05:04 2032
+++ linux-2.6.7/mm/swapfile.c	Sat Jul 10 19:37:22 2032
@@ -662,6 +662,7 @@ static int try_to_unuse(unsigned int typ
 		 */
 		swap_map = &si->swap_map[i];
 		entry = swp_entry(type, i);
+	again:
 		page = read_swap_cache_async(entry, NULL, 0);
 		if (!page) {
 			/*
@@ -696,6 +697,11 @@ static int try_to_unuse(unsigned int typ
 		wait_on_page_locked(page);
 		wait_on_page_writeback(page);
 		lock_page(page);
+		if (PageAgain(page)) {
+			unlock_page(page);
+			page_cache_release(page);
+			goto again;
+		}
 		wait_on_page_writeback(page);
 
 		/*
@@ -804,6 +810,7 @@ static int try_to_unuse(unsigned int typ
 
 			swap_writepage(page, &wbc);
 			lock_page(page);
+			BUG_ON(PageAgain(page));
 			wait_on_page_writeback(page);
 		}
 		if (PageSwapCache(page)) {
--- linux-2.6.7.ORG/mm/truncate.c	Sun Jul 11 10:05:04 2032
+++ linux-2.6.7/mm/truncate.c	Sat Jul 10 19:37:22 2032
@@ -132,6 +132,8 @@ void truncate_inode_pages(struct address
 			next++;
 			if (TestSetPageLocked(page))
 				continue;
+			/* no PageAgain(page) check; page->mapping check
+			 * is done in truncate_complete_page */
 			if (PageWriteback(page)) {
 				unlock_page(page);
 				continue;
@@ -165,6 +167,24 @@ void truncate_inode_pages(struct address
 			struct page *page = pvec.pages[i];
 
 			lock_page(page);
+			if (page->mapping == NULL) {
+				/* XXX Is page->index still valid? */
+				unsigned long index = page->index;
+				int again = PageAgain(page);
+
+				unlock_page(page);
+				put_page(page);
+				page = find_lock_page(mapping, index);
+				if (page == NULL) {
+					BUG_ON(again);
+					/* XXX */
+					if (page->index > next)
+						next = page->index;
+					next++;
+				}
+				BUG_ON(!again);
+				pvec.pages[i] = page;
+			}
 			wait_on_page_writeback(page);
 			if (page->index > next)
 				next = page->index;
@@ -257,14 +277,29 @@ void invalidate_inode_pages2(struct addr
 			struct page *page = pvec.pages[i];
 
 			lock_page(page);
-			if (page->mapping == mapping) {	/* truncate race? */
-				wait_on_page_writeback(page);
-				next = page->index + 1;
-				if (page_mapped(page))
-					clear_page_dirty(page);
-				else
-					invalidate_complete_page(mapping, page);
+			while (page->mapping != mapping) {
+				struct page *newpage;
+				unsigned long index = page->index;
+
+				BUG_ON(page->mapping != NULL);
+
+				unlock_page(page);
+				newpage = find_lock_page(mapping, index);
+				if (page == newpage) {
+					put_page(page);
+					break;
+				}
+				BUG_ON(!PageAgain(page));
+				pvec.pages[i] = newpage;
+				put_page(page);
+				page = newpage;
 			}
+			wait_on_page_writeback(page);
+			next = page->index + 1;
+			if (page_mapped(page))
+				clear_page_dirty(page);
+			else
+				invalidate_complete_page(mapping, page);
 			unlock_page(page);
 		}
 		pagevec_release(&pvec);
--- linux-2.6.7.ORG/mm/vmscan.c	Sun Jul 11 10:05:04 2032
+++ linux-2.6.7/mm/vmscan.c	Sat Jul 10 19:37:22 2032
@@ -32,6 +32,7 @@
 #include <linux/topology.h>
 #include <linux/cpu.h>
 #include <linux/notifier.h>
+#include <linux/kthread.h>
 
 #include <asm/pgalloc.h>
 #include <asm/tlbflush.h>
@@ -387,7 +388,7 @@ static int shrink_list(struct list_head 
 		 * processes. Try to unmap it here.
 		 */
 		if (page_mapped(page) && mapping) {
-			switch (try_to_unmap(page)) {
+			switch (try_to_unmap(page, NULL)) {
 			case SWAP_FAIL:
 				page_map_unlock(page);
 				goto activate_locked;
@@ -1091,6 +1092,8 @@ int kswapd(void *p)
 		if (current->flags & PF_FREEZE)
 			refrigerator(PF_FREEZE);
 		prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
+		if (kthread_should_stop())
+			return 0;
 		schedule();
 		finish_wait(&pgdat->kswapd_wait, &wait);
 
@@ -1173,5 +1176,15 @@ static int __init kswapd_init(void)
 	hotcpu_notifier(cpu_callback, 0);
 	return 0;
 }
+
+#ifdef CONFIG_MEMHOTPLUG
+void
+kswapd_start_one(pg_data_t *pgdat)
+{
+	pgdat->kswapd = kthread_create(kswapd, pgdat, "kswapd%d",
+	    pgdat->node_id);
+	total_memory = nr_free_pagecache_pages();
+}
+#endif
 
 module_init(kswapd_init)
--- linux-2.6.7.ORG/include/linux/memhotplug.h	Sun Jul 11 10:05:04 2032
+++ linux-2.6.7/include/linux/memhotplug.h	Sun Jul 11 10:11:51 2032
@@ -0,0 +1,32 @@
+#ifndef _LINUX_MEMHOTPLUG_H
+#define _LINUX_MEMHOTPLUG_H
+
+#include <linux/config.h>
+#include <linux/mm.h>
+
+#ifdef __KERNEL__
+
+struct page_va_list {
+	struct mm_struct *mm;
+	unsigned long addr;
+	struct list_head list;
+};
+
+struct mmigrate_operations {
+	struct page * (*mmigrate_alloc_page)(int);
+	int (*mmigrate_free_page)(struct page *);
+	int (*mmigrate_copy_page)(struct page *, struct page *);
+	int (*mmigrate_lru_add_page)(struct page *, int);
+	int (*mmigrate_release_buffers)(struct page *);
+	int (*mmigrate_prepare)(struct page *page, int fastmode);
+	int (*mmigrate_stick_page)(struct list_head *vlist);
+};
+
+extern int mmigrated(void *p);
+extern int mmigrate_onepage(struct page *, int, int, struct mmigrate_operations *);
+extern int try_to_migrate_pages(struct zone *, int, struct page * (*)(struct zone *, void *), void *);
+
+#define MIGRATE_ANYNODE  (-1)
+
+#endif /* __KERNEL__ */
+#endif /* _LINUX_MEMHOTPLUG_H */
--- linux-2.6.7.ORG/mm/memhotplug.c	Sun Jul 11 10:05:04 2032
+++ linux-2.6.7/mm/memhotplug.c	Sun Jul 11 10:12:48 2032
@@ -0,0 +1,817 @@
+/*
+ *  linux/mm/memhotplug.c
+ *
+ *  Support of memory hotplug
+ *
+ *  Authors:	Toshihiro Iwamoto, <iwamoto@valinux.co.jp>
+ *		Hirokazu Takahashi, <taka@valinux.co.jp>
+ */
+
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/swap.h>
+#include <linux/pagemap.h>
+#include <linux/init.h>
+#include <linux/highmem.h>
+#include <linux/writeback.h>
+#include <linux/buffer_head.h>
+#include <linux/mm_inline.h>
+#include <linux/rmap.h>
+#include <linux/memhotplug.h>
+
+#ifdef CONFIG_KDB
+#include <linux/kdb.h>
+#endif
+
+/*
+ * The following flow is a way to migrate an oldpage.
+ *  1. allocate a newpage.
+ *  2. lock the newpage and don't set PG_uptodate flag on it.
+ *  3. modify the oldpage entry in the corresponding radix tree with the
+ *     newpage.
+ *  4. clear all PTEs that refer to the oldpage.
+ *  5. wait until all references on the oldpage have gone.
+ *  6. copy from the oldpage to the newpage.
+ *  7. set PG_uptodate flag of the newpage.
+ *  8. release the oldpage.
+ *  9. unlock the newpage and wakeup all waiters.
+ *
+ *
+ *   address_space                 oldpage
+ *   +-----------+               +---------+
+ *   |           |               |         |        +-----+
+ *   | page_tree------+  -- X -->|         |<-- X --| PTE |.....
+ *   |           |    |          |PG_uptodate       +-----+
+ *   |           |    |          +---------+           :
+ *   +-----------+    |                                :
+ *                    |            newpage         pagefaults
+ *                    |          +---------+           :
+ *                    +--------->|PG_locked|   ........:
+ *                               |         | Blocked
+ *                               |         |   ...........system calls
+ *                               +---------+
+ *
+ *
+ * The key point is to block accesses to the page under operation by
+ * modifying the radix tree. After the radix tree has been modified, no new
+ * accesses go to the oldpage.  They will be redirected to the newpage, which
+ * will be blocked until the data is ready because it is locked and not
+ * up to date. Remember that dropping PG_uptodate is important to block
+ * all read accesses, including system call accesses and page fault accesses.
+ *
+ * By this approach, pages in the swapcache are handled in the same way as
+ * pages in the pagecache, since both are on radix trees.
+ * And any kind of page in the pagecache can be migrated even if it is
+ * not associated with backing store, like pages in sysfs, in ramdisk
+ * and so on. We can migrate all pages on the LRU in the same way.
+ */
+
+
+static void
+print_buffer(struct page* page)
+{
+	struct address_space* mapping = page_mapping(page);
+	struct buffer_head *bh, *head;
+
+	spin_lock(&mapping->private_lock);
+	bh = head = page_buffers(page);
+	printk("buffers:");
+	do {
+		printk(" %lx %d", bh->b_state, atomic_read(&bh->b_count));
+
+		bh = bh->b_this_page;
+	} while (bh != head);
+	printk("\n");
+	spin_unlock(&mapping->private_lock);
+}
+
+/*
+ * Make pages on the "vlist" mapped, or they may be freed
+ * even though they are mlocked.
+ */
+static int
+stick_mlocked_page(struct list_head *vlist)
+{
+	struct page_va_list *v1, *v2;
+	struct vm_area_struct *vma;
+	int error;
+
+	list_for_each_entry_safe(v1, v2, vlist, list) {
+		list_del(&v1->list);
+		down_read(&v1->mm->mmap_sem);
+		vma = find_vma(v1->mm, v1->addr);
+		if (vma == NULL || !(vma->vm_flags & VM_LOCKED))
+			goto out;
+		error = get_user_pages(current, v1->mm, v1->addr, PAGE_SIZE,
+		    (vma->vm_flags & VM_WRITE) != 0, 0, NULL, NULL);
+	out:
+		up_read(&v1->mm->mmap_sem);
+		mmput(v1->mm);
+		kfree(v1);
+	}
+	return 0;
+}
+
+/* helper function for mmigrate_onepage */
+#define	REMAPPREP_WB		1
+#define	REMAPPREP_BUFFER	2
+
+/*
+ * Try to free buffers if "page" has them.
+ * 
+ * TODO:
+ * 	It would be possible to migrate a page without pageout
+ *	if an address_space had a page migration method.
+ */
+static int
+mmigrate_preparepage(struct page *page, int fastmode)
+{
+	struct address_space *mapping;
+	int waitcnt = fastmode ? 0 : 100;
+	int res = -REMAPPREP_BUFFER;
+
+	BUG_ON(!PageLocked(page));
+
+	mapping = page_mapping(page);
+
+	if (!PagePrivate(page) && PageWriteback(page) &&
+	    !PageSwapCache(page)) {
+		printk("mmigrate_preparepage: mapping %p page %p\n",
+		    page->mapping, page);
+		return -REMAPPREP_WB;
+	}
+
+	/*
+	 * TODO: wait_on_page_writeback() would be better if it supports
+	 *       timeout.
+	 */
+	while (PageWriteback(page)) {
+		if (!waitcnt)
+			return -REMAPPREP_WB;
+		__set_current_state(TASK_INTERRUPTIBLE);
+		schedule_timeout(10);
+		__set_current_state(TASK_RUNNING);
+		waitcnt--;
+	}
+	if (PagePrivate(page)) {
+		if (PageDirty(page)) {
+			switch(pageout(page, mapping)) {
+			case PAGE_ACTIVATE:
+				res = -REMAPPREP_WB;
+				waitcnt = 1;
+			case PAGE_KEEP:
+			case PAGE_CLEAN:
+				break;
+			case PAGE_SUCCESS:
+				lock_page(page);
+				mapping = page_mapping(page);
+				if (!PagePrivate(page))
+					return 0;
+			}
+		}
+
+		while (1) {
+			if (try_to_release_page(page, GFP_KERNEL))
+				break;
+			if (!waitcnt)
+				return res;
+			__set_current_state(TASK_INTERRUPTIBLE);
+			schedule_timeout(10);
+			__set_current_state(TASK_RUNNING);
+			waitcnt--;
+			if (!waitcnt)
+				print_buffer(page);
+		}
+	}
+	return 0;
+}
+
+/*
+ * Assign a swap entry to an anonymous page if it doesn't have one yet,
+ * so that it can be handled like one in the page cache.
+ */
+static struct address_space *
+make_page_mapped(struct page *page)
+{
+	if (!page_mapped(page)) {
+		if (page_count(page) > 1)
+			printk("page %p not mapped: count %d\n",
+			    page, page_count(page));
+		return NULL;
+	}
+	/* The page is an anon page.  Allocate its swap entry. */
+	page_map_unlock(page);
+	add_to_swap(page);
+	page_map_lock(page);
+	return page_mapping(page);
+}
+
+/*
+ * Replace "page" with "newpage" on the radix tree.  After that, all
+ * new access to "page" will be redirected to "newpage" and it
+ * will be blocked until migrating has been done.
+ */
+static int
+radix_tree_replace_pages(struct page *page, struct page *newpage,
+			 struct address_space *mapping)
+{
+	if (radix_tree_preload(GFP_KERNEL))
+		return -1;
+
+	if (PagePrivate(page)) /* XXX */
+		BUG();
+
+	/* should {__add_to,__remove_from}_page_cache be used instead? */
+	spin_lock_irq(&mapping->tree_lock);
+	if (mapping != page_mapping(page))
+		printk("mapping changed %p -> %p, page %p\n",
+		    mapping, page_mapping(page), page);
+	if (radix_tree_delete(&mapping->page_tree, page_index(page)) == NULL) {
+		/* Page truncated. */
+		spin_unlock_irq(&mapping->tree_lock);
+		radix_tree_preload_end();
+		return -1;
+	}
+	/* Don't __put_page(page) here.  Truncate may be in progress. */
+	newpage->flags |= page->flags & ~(1 << PG_uptodate) &
+	    ~(1 << PG_highmem) & ~(1 << PG_anon) &
+	    ~(1 << PG_maplock) &
+	    ~(1 << PG_active) & ~(~0UL << NODEZONE_SHIFT);
+
+	radix_tree_insert(&mapping->page_tree, page_index(page), newpage);
+	page_cache_get(newpage);
+	newpage->index = page->index;
+	if  (PageSwapCache(page))
+		newpage->private = page->private;
+	else
+		newpage->mapping = page->mapping;
+	spin_unlock_irq(&mapping->tree_lock);
+	radix_tree_preload_end();
+	return 0;
+}
+
+/*
+ * Remove all PTE mappings to "page".
+ */
+static int
+unmap_page(struct page *page, struct list_head *vlist)
+{
+	int error = SWAP_SUCCESS;
+
+	page_map_lock(page);
+	while (page_mapped(page) &&
+	    (error = try_to_unmap(page, vlist)) == SWAP_AGAIN) {
+		/*
+		 * There may be race condition, just wait for a while
+		 */
+		page_map_unlock(page);
+		__set_current_state(TASK_INTERRUPTIBLE);
+		schedule_timeout(1);
+		__set_current_state(TASK_RUNNING);
+		page_map_lock(page);
+	}
+	page_map_unlock(page);
+	if (error == SWAP_FAIL) {
+		/* either during mremap or mlocked */
+		return -1;
+	}
+	return 0;
+}
+
+/*
+ * Wait for "page" to become free.  Usually this function waits until
+ * the page count drops to 2.  For a truncated page, it waits until
+ * the count drops to 1.
+ * Returns: 0 on success, 1 on page truncation, -1 on error.
+ */
+static int
+wait_on_page_freeable(struct page *page, struct address_space *mapping,
+			struct list_head *vlist, unsigned long index,
+			int nretry, struct mmigrate_operations *ops)
+{
+	struct address_space *mapping1;
+	void *p;
+	int truncated = 0;
+wait_again:
+	while ((truncated + page_count(page)) > 2) {
+		if (nretry <= 0)
+			return -1;
+		/*
+		 * No lock needed while waiting for the page count to drop.
+		 * Yield CPU to other accesses which may have to lock the
+		 * page to proceed.
+		 */
+		unlock_page(page);
+
+		/*
+		 * Wait until all references have gone.
+		 */
+		while ((truncated + page_count(page)) > 2) {
+			nretry--;
+			current->state = TASK_INTERRUPTIBLE;
+			schedule_timeout(1);
+			if ((nretry % 5000) == 0) {
+				printk("mmigrate_onepage: still waiting on %p %d\n", page, nretry);
+				break;
+			}
+			/*
+			 * Another remaining access to the page may reassign
+			 * buffers or make it mapped again.
+			 */
+			if (PagePrivate(page) || page_mapped(page))
+				break;		/* see below */
+		}
+
+		lock_page(page);
+		BUG_ON(page_count(page) == 0);
+		mapping1 = page_mapping(page);
+		if (mapping != mapping1 && mapping1 != NULL)
+			printk("mapping changed %p -> %p, page %p\n",
+			    mapping, mapping1, page);
+
+		/*
+		 * Free buffers of the page which may have been
+		 * reassigned.
+		 */
+		if (PagePrivate(page))
+			ops->mmigrate_release_buffers(page);
+
+		/*
+		 * Clear all PTE mappings to the page as it may have
+		 * been mapped again.
+		 */
+		unmap_page(page, vlist);
+	}
+	if (PageReclaim(page) || PageWriteback(page) || PagePrivate(page))
+#ifdef CONFIG_KDB
+		KDB_ENTER();
+#else
+		BUG();
+#endif
+	if (page_count(page) == 1)
+		/* page has been truncated. */
+		return 1;
+	spin_lock_irq(&mapping->tree_lock);
+	p = radix_tree_lookup(&mapping->page_tree, index);
+	spin_unlock_irq(&mapping->tree_lock);
+	if (p == NULL) {
+		BUG_ON(page->mapping != NULL);
+		truncated = 1;
+		BUG_ON(page_mapping(page) != NULL);
+		goto wait_again;
+	}
+	
+	return 0;
+}
+
+/*
+ * A file which "page" belongs to has been truncated.  Free both pages.
+ */
+static void
+free_truncated_pages(struct page *page, struct page *newpage,
+			 struct address_space *mapping)
+{
+	void *p;
+	BUG_ON(page_mapping(page) != NULL);
+	put_page(newpage);
+	if (page_count(newpage) != 1) {
+		printk("newpage count %d != 1, %p\n",
+		    page_count(newpage), newpage);
+		BUG();
+	}
+	newpage->mapping = page->mapping = NULL;
+	ClearPageActive(page);
+	ClearPageActive(newpage);
+	ClearPageSwapCache(page);
+	ClearPageSwapCache(newpage);
+	unlock_page(page);
+	unlock_page(newpage);
+	put_page(newpage);
+}
+
+/*
+ * Roll back a page migration.
+ *
+ * In some cases, a page migration needs to be rolled back and to
+ * be retried later. This is a bit tricky because it is likely that some
+ * processes have already looked up the radix tree and are waiting for its
+ * lock. Such processes need to discard the newpage and look up the radix
+ * tree again, as the newpage is now invalid.
+ * A new page flag (PG_again) is used for that purpose.
+ *
+ *   1. Roll back the radix tree change.
+ *   2. Set PG_again flag of the newpage and unlock it.
+ *   3. Woken-up processes see the PG_again bit and look up the radix
+ *      tree again.
+ *   4. Wait until the page count of the newpage falls to 1 (for the
+ *      migrated process).
+ *   5. Rollback is complete. The newpage can be freed.
+ */
+static int
+radix_tree_rewind_page(struct page *page, struct page *newpage,
+		 struct address_space *mapping)
+{
+	int waitcnt;
+	pgoff_t index;
+
+	/*
+	 * Try to unwind by notifying waiters.  If someone misbehaves,
+	 * we die.
+	 */
+	if (radix_tree_preload(GFP_KERNEL))
+		BUG();
+	/* should {__add_to,__remove_from}_page_cache be used instead? */
+	spin_lock_irq(&mapping->tree_lock);
+	index = page_index(page);
+	if (radix_tree_delete(&mapping->page_tree, index) == NULL)
+		/* Hold extra count to handle truncate */
+		page_cache_get(newpage);
+	radix_tree_insert(&mapping->page_tree, index, page);
+	/* no page_cache_get(page); needed */
+	spin_unlock_irq(&mapping->tree_lock);
+	radix_tree_preload_end();
+
+	/*
+	 * PG_again flag notifies waiters that this newpage isn't what
+	 * the waiters expect.
+	 */
+	SetPageAgain(newpage);
+	newpage->mapping = NULL;
+	/* XXX unmap needed?  No, it shouldn't.  Handled by fault handlers. */
+	unlock_page(newpage);
+
+	/*
+	 *  Some accesses may be blocked on the newpage. Wait until the
+	 *  accesses have gone.
+	 */ 
+	waitcnt = HZ;
+	for(; page_count(newpage) > 2; waitcnt--) {
+		current->state = TASK_INTERRUPTIBLE;
+		schedule_timeout(10);
+		if (waitcnt == 0) {
+			printk("You are hosed.\n");
+			printk("newpage %p flags %lx %d %d, page %p flags %lx %d\n",
+			    newpage, newpage->flags, page_count(newpage),
+			    newpage->mapcount,
+			    page, page->flags, page_count(page));
+			BUG();
+		}
+	}
+
+	BUG_ON(PageUptodate(newpage));
+	ClearPageDirty(newpage);
+	ClearPageActive(newpage);
+	spin_lock_irq(&mapping->tree_lock);
+	if (page_count(newpage) == 1) {
+		printk("newpage %p truncated. page %p\n", newpage, page);
+		BUG();
+	}
+	spin_unlock_irq(&mapping->tree_lock);
+	unlock_page(page);
+	BUG_ON(page_count(newpage) != 2);
+	ClearPageAgain(newpage);
+	__put_page(newpage);
+	return 1;
+}
+
+/*
+ * Allocate a new page from a specified node.
+ */
+static struct page *
+mmigrate_alloc_page(int nid)
+{
+	if (nid == MIGRATE_ANYNODE)
+		return alloc_page(GFP_HIGHUSER);
+	else
+		return alloc_pages_node(nid, GFP_HIGHUSER, 0);
+}
+
+/*
+ * Release "page" into the Buddy allocator.
+ */
+static int
+mmigrate_free_page(struct page *page)
+{
+	BUG_ON(page_count(page) != 1);
+	put_page(page);
+	return 0;
+}
+
+/*
+ * Copy data from "from" to "to".
+ */
+static int
+mmigrate_copy_page(struct page *to, struct page *from)
+{
+	copy_highpage(to, from);
+	return 0;
+}
+
+/*
+ * Insert "page" into the LRU.
+ */
+static int
+mmigrate_lru_add_page(struct page *page, int active)
+{
+	if (active)
+		lru_cache_add_active(page);
+	else
+		lru_cache_add(page);
+	return 0;
+}
+
+static int
+mmigrate_release_buffer(struct page *page)
+{
+	try_to_release_page(page, GFP_KERNEL);
+	return 0;
+}
+
+/*
+ * This is a migrate-operations for regular pages which include
+ * anonymous pages, pages in the pagecache and pages in the swapcache.
+ */
+static struct mmigrate_operations mmigrate_ops = {
+	.mmigrate_alloc_page	= mmigrate_alloc_page,
+        .mmigrate_free_page	= mmigrate_free_page,
+        .mmigrate_copy_page	= mmigrate_copy_page,
+        .mmigrate_lru_add_page	= mmigrate_lru_add_page,
+        .mmigrate_release_buffers = mmigrate_release_buffer,
+        .mmigrate_prepare	= mmigrate_preparepage,
+        .mmigrate_stick_page	= stick_mlocked_page
+};
+
+/*
+ * Try to migrate one page.  Returns non-zero on failure.
+ */
+int mmigrate_onepage(struct page *page, int nodeid, int fastmode,
+				struct mmigrate_operations *ops)
+{
+	struct page *newpage;
+	struct address_space *mapping;
+	LIST_HEAD(vlist);
+	int nretry = fastmode ? HZ/50: HZ*10; /* XXXX */
+
+	if ((newpage = ops->mmigrate_alloc_page(nodeid)) == NULL)
+		return -ENOMEM;
+
+	/*
+	 * Make sure that the newpage is locked and not up-to-date
+	 * during the page migration, so that it's guaranteed that all
+	 * accesses to the newpage, including read accesses, will be
+	 * blocked until everything has become ok.
+	 *
+	 * Unlike in the case of the swapout mechanism, all accesses to
+	 * the page, both read accesses and write accesses, have to be
+	 * blocked since both the oldpage and the newpage exist at the
+	 * same time and the newpage contains invalid data while some
+	 * references to the oldpage remain.
+	 *
+	 * FYI, the swap code allows read accesses during swapping as the
+	 * content of the page is valid and it will never be freed
+	 * while some references to it exist. Write access is also
+	 * possible during swapping; it will pull the page back and
+	 * modify it even if it's under I/O.
+	 */
+	if (TestSetPageLocked(newpage))
+		BUG();
+
+	lock_page(page);
+
+	if (ops->mmigrate_prepare && ops->mmigrate_prepare(page, fastmode))
+		goto radixfail;
+
+	/*
+	 * Put the page in a radix tree if it isn't in the tree yet,
+	 * so that all pages can be handled on radix trees and move
+	 * them in the same way.
+	 */
+	page_map_lock(page);
+	if (PageAnon(page) && !PageSwapCache(page))
+		make_page_mapped(page);
+	mapping = page_mapping(page);
+	page_map_unlock(page);
+
+	if (mapping == NULL)
+		goto radixfail;
+
+	/*
+	 * Replace the oldpage with the newpage in the radix tree,
+	 * after that the newpage can catch all access requests to the
+	 * oldpage instead.
+	 * 
+	 * We cannot leave the oldpage locked in the radix tree because:
+	 *   - It cannot block read access while PG_uptodate is set, and
+	 *     PG_uptodate cannot be cleared since that would mean the
+	 *     data in the page is invalid.
+	 *   - Some accesses cannot be finished if someone is holding the
+	 *     lock as they may require the lock to handle the oldpage.
+	 *   - It's hard to determine when the page can be freed if there
+	 *     remain references to the oldpage.
+	 */
+	if (radix_tree_replace_pages(page, newpage, mapping))
+		goto radixfail;
+
+	/*
+	 * With cleared PTEs, any access via PTEs to the oldpages can
+	 * be caught and blocked in a pagefault handler.
+	 */
+	if (unmap_page(page, &vlist))
+		goto unmapfail;
+	if (PagePrivate(page))
+		printk("buffer reappeared\n");
+
+	/*
+	 * We can't proceed if there remain some references on the oldpage.
+	 * 
+	 * This code may sometimes fail because:
+	 *     A page may be grabbed twice in the same transaction. During
+	 *     the page migration, a transaction which has already got
+	 *     the oldpage tries to grab the newpage; this causes a deadlock.
+	 *
+	 *     The transaction believes both pages are the same, but an access
+	 *     to the newpage is blocked until the oldpage is released.
+	 *
+	 *     Renaming a file in the same directory is a good example.
+	 *     It grabs the same page for the directory twice.
+	 *
+	 *     In this case, try to migrate the page later.
+	 */
+	switch (wait_on_page_freeable(page, mapping, &vlist, page_index(newpage), nretry, ops)) {
+	case 1:
+		/* truncated */
+		free_truncated_pages(page, newpage, mapping);
+		ops->mmigrate_free_page(page);
+		return 0;
+	case -1:
+		/* failed */
+		goto unmapfail;
+	}
+	
+	BUG_ON(mapping != page_mapping(page));
+
+	ops->mmigrate_copy_page(newpage, page);
+
+	if (PageDirty(page))
+		set_page_dirty(newpage);
+	page->mapping = NULL;
+	unlock_page(page);
+	__put_page(page);
+
+	/*
+	 * Finally, the newpage has become ready!
+	 */
+	SetPageUptodate(newpage);
+
+	if (ops->mmigrate_lru_add_page)
+		ops->mmigrate_lru_add_page(newpage, PageActive(page));
+	ClearPageActive(page);
+	ClearPageSwapCache(page);
+
+	ops->mmigrate_free_page(page);
+
+	/*
+ 	 * Wake up all waiters which have been waiting for completion
+	 * of the page migration.
+	 */
+	unlock_page(newpage);
+
+	/*
+	 * Mlock the newpage if the oldpage had been mlocked.
+	 */
+	if (ops->mmigrate_stick_page)
+		ops->mmigrate_stick_page(&vlist);
+	page_cache_release(newpage);
+	return 0;
+
+unmapfail:
+	/*
+	 * Roll back all operations.
+	 */
+	radix_tree_rewind_page(page, newpage, mapping);
+	if (ops->mmigrate_stick_page)
+		ops->mmigrate_stick_page(&vlist);
+	ClearPageActive(newpage);
+	ClearPageSwapCache(newpage);
+	ops->mmigrate_free_page(newpage);
+	return 1;
+
+radixfail:
+	unlock_page(page);
+	unlock_page(newpage);
+	if (ops->mmigrate_stick_page)
+		ops->mmigrate_stick_page(&vlist);
+	ops->mmigrate_free_page(newpage);
+	return 1;
+}
+
+static struct work_struct lru_drain_wq[NR_CPUS];
+static void
+lru_drain_schedule(void *p)
+{
+	int cpu = get_cpu();
+
+	schedule_work(&lru_drain_wq[cpu]);
+	put_cpu();
+}
+
+/*
+ * Find an appropriate page to be migrated on the LRU lists.
+ */
+
+static struct page *
+get_target_page(struct zone *zone, void *arg)
+{
+	struct page *page, *page2;
+	list_for_each_entry_safe(page, page2, &zone->inactive_list, lru) {
+		if (steal_page_from_lru(zone, page) == NULL)
+			continue;
+		return page;
+	}
+	list_for_each_entry_safe(page, page2, &zone->active_list, lru) {
+		if (steal_page_from_lru(zone, page) == NULL)
+			continue;
+		return page;
+	}
+	return NULL;
+}
+
+int try_to_migrate_pages(struct zone *zone, int destnode,
+		struct page * (*func)(struct zone *, void *), void *arg)
+{
+	struct page *page, *page2;
+	int nr_failed = 0;
+	LIST_HEAD(failedp);
+
+	while(nr_failed < 100) {
+		spin_lock_irq(&zone->lru_lock);
+		page = (*func)(zone, arg);
+		spin_unlock_irq(&zone->lru_lock);
+		if (page == NULL)
+			break;
+		if (PageLocked(page) ||
+		    mmigrate_onepage(page, destnode, 1, &mmigrate_ops)) {
+			nr_failed++;
+			list_add(&page->lru, &failedp);
+		}
+	}
+
+	nr_failed = 0;
+	list_for_each_entry_safe(page, page2, &failedp, lru) {
+		list_del(&page->lru);
+		if ( /* !PageLocked(page) && */
+		    !mmigrate_onepage(page, destnode, 0, &mmigrate_ops)) {
+			continue;
+		}
+		nr_failed++;
+		spin_lock_irq(&zone->lru_lock);
+		putback_page_to_lru(zone, page);
+		spin_unlock_irq(&zone->lru_lock);
+		page_cache_release(page);
+	}
+	return nr_failed;
+}
+
+/*
+ * The migrate-daemon, started as a kernel thread on demand.
+ * 
+ * This migrates all pages in a specified zone one by one. It traverses
+ * the LRU lists of the zone and tries to migrate each page. It doesn't
+ * matter if the page is in the pagecache or in the swapcache or anonymous.
+ * 
+ * TODO:
+ *   Memsection support. The following code assumes that a whole zone is
+ *   going to be removed. You can replace get_target_page() with
+ *   a proper function if you want to remove part of memory in a zone.
+ */
+static DECLARE_MUTEX(mmigrated_sem);
+int mmigrated(void *p)
+{
+	struct zone *zone = p;
+	int nr_failed = 0;
+	LIST_HEAD(failedp);
+
+	daemonize("migrate%d", zone->zone_start_pfn);
+	current->flags |= PF_KSWAPD;	/*  It's fake */
+	if (down_trylock(&mmigrated_sem)) {
+		printk("mmigrated already running\n");
+		return 0;
+	}
+	on_each_cpu(lru_drain_schedule, NULL, 1, 1);
+	nr_failed = try_to_migrate_pages(zone, MIGRATE_ANYNODE, get_target_page, NULL);
+/* 	if (nr_failed) */
+/* 		goto retry; */
+	on_each_cpu(lru_drain_schedule, NULL, 1, 1);
+	up(&mmigrated_sem);
+	return 0;
+}
+
+static int __init mmigrated_init(void)
+{
+	int i;
+
+	for(i = 0; i < NR_CPUS; i++)
+		INIT_WORK(&lru_drain_wq[i], (void (*)(void *))lru_add_drain, NULL);
+	return 0;
+}
+
+module_init(mmigrated_init);
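
How mmigrated() gets kicked off is not part of this patch.  Since it
calls daemonize() itself, a hotplug driver could start it for a zone
to be offlined with something as simple as the sketch below (the
surrounding driver code is hypothetical):

	/* start the migration daemon for a zone that is going away */
	kernel_thread(mmigrated, zone, CLONE_KERNEL);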

* [PATCH] memory hotremoval for linux-2.6.7 [4/16]
  2004-07-14 13:41 [PATCH] memory hotremoval for linux-2.6.7 [0/16] Hirokazu Takahashi
                   ` (2 preceding siblings ...)
  2004-07-14 14:03 ` [PATCH] memory hotremoval for linux-2.6.7 [3/16] Hirokazu Takahashi
@ 2004-07-14 14:03 ` Hirokazu Takahashi
  2004-07-14 14:03 ` [PATCH] memory hotremoval for linux-2.6.7 [5/16] Hirokazu Takahashi
                   ` (11 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: Hirokazu Takahashi @ 2004-07-14 14:03 UTC (permalink / raw)
  To: linux-kernel, lhms-devel; +Cc: linux-mm

$Id: va-emulation_memhotplug.patch,v 1.23 2004/06/17 08:19:45 iwamoto Exp $

diff -dpur linux-2.6.7/arch/i386/Kconfig linux-2.6.7-mh/arch/i386/Kconfig
--- linux-2.6.7/arch/i386/Kconfig	Thu Mar 11 11:55:22 2004
+++ linux-2.6.7-mh/arch/i386/Kconfig	Thu Apr  1 14:46:19 2004
@@ -736,7 +736,7 @@ config DISCONTIGMEM
 
 config HAVE_ARCH_BOOTMEM_NODE
 	bool
-	depends on NUMA
+	depends on NUMA || MEMHOTPLUG
 	default y
 
 config HIGHPTE
diff -dpur linux-2.6.7/arch/i386/mm/discontig.c linux-2.6.7-mh/arch/i386/mm/discontig.c
--- linux-2.6.7/arch/i386/mm/discontig.c	Sun Apr  4 12:37:23 2004
+++ linux-2.6.7-mh/arch/i386/mm/discontig.c	Tue Apr 27 17:41:22 2004
@@ -64,6 +64,7 @@ unsigned long node_end_pfn[MAX_NUMNODES]
 extern unsigned long find_max_low_pfn(void);
 extern void find_max_pfn(void);
 extern void one_highpage_init(struct page *, int, int);
+static unsigned long calculate_blk_remap_pages(void);
 
 extern struct e820map e820;
 extern unsigned long init_pg_tables_end;
@@ -111,6 +112,51 @@ int __init get_memcfg_numa_flat(void)
 	return 1;
 }
 
+int __init get_memcfg_numa_blks(void)
+{
+	int i, pfn;
+
+	printk("NUMA - single node, flat memory mode, but broken in several blocks\n");
+
+	/* Run the memory configuration and find the top of memory. */
+	find_max_pfn();
+	if (max_pfn & (PTRS_PER_PTE - 1)) {
+		pfn = max_pfn & ~(PTRS_PER_PTE - 1);
+		printk("Rounding down maxpfn %ld -> %d\n", max_pfn, pfn);
+		max_pfn = pfn;
+	}
+	for(i = 0; i < MAX_NUMNODES; i++) {
+		pfn = PFN_DOWN(CONFIG_MEMHOTPLUG_BLKSIZE << 20) * i;
+		node_start_pfn[i]  = pfn;
+		printk("node %d start %d\n", i, pfn);
+		pfn += PFN_DOWN(CONFIG_MEMHOTPLUG_BLKSIZE << 20);
+		if (pfn < max_pfn)
+			node_end_pfn[i]	  = pfn;
+		else {
+			node_end_pfn[i]	  = max_pfn;
+			i++;
+			printk("total %d blocks, max %ld\n", i, max_pfn);
+			break;
+		}
+	}
+
+	printk("physnode_map");
+	/* Needed for pfn_to_nid */
+	for (pfn = node_start_pfn[0]; pfn <= max_pfn;
+	       pfn += PAGES_PER_ELEMENT)
+	{
+		physnode_map[pfn / PAGES_PER_ELEMENT] =
+		    pfn / PFN_DOWN(CONFIG_MEMHOTPLUG_BLKSIZE << 20);
+		printk(" %d", physnode_map[pfn / PAGES_PER_ELEMENT]);
+	}
+	printk("\n");
+
+	node_set_online(0);
+	numnodes = i;
+	
+	return 1;
+}
+
 /*
  * Find the highest page frame number we have available for the node
  */
@@ -132,11 +178,21 @@ static void __init find_max_pfn_node(int
  * Allocate memory for the pg_data_t via a crude pre-bootmem method
  * We ought to relocate these onto their own node later on during boot.
  */
-static void __init allocate_pgdat(int nid)
+static void allocate_pgdat(int nid)
 {
-	if (nid)
+	if (nid) {
+#ifndef CONFIG_MEMHOTPLUG
 		NODE_DATA(nid) = (pg_data_t *)node_remap_start_vaddr[nid];
-	else {
+#else
+		int remapsize;
+		unsigned long addr;
+
+		remapsize = calculate_blk_remap_pages();
+		addr = (unsigned long)(pfn_to_kaddr(max_low_pfn +
+		    (nid - 1) * remapsize));
+		NODE_DATA(nid) = (void *)addr;
+#endif
+	} else {
 		NODE_DATA(nid) = (pg_data_t *)(__va(min_low_pfn << PAGE_SHIFT));
 		min_low_pfn += PFN_UP(sizeof(pg_data_t));
 		memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
@@ -185,6 +241,7 @@ static void __init register_bootmem_low_
 
 void __init remap_numa_kva(void)
 {
+#ifndef CONFIG_MEMHOTPLUG
 	void *vaddr;
 	unsigned long pfn;
 	int node;
@@ -197,6 +254,7 @@ void __init remap_numa_kva(void)
 				PAGE_KERNEL_LARGE);
 		}
 	}
+#endif
 }
 
 static unsigned long calculate_numa_remap_pages(void)
@@ -227,6 +285,21 @@ static unsigned long calculate_numa_rema
 	return reserve_pages;
 }
 
+static unsigned long calculate_blk_remap_pages(void)
+{
+	unsigned long size;
+
+	/* calculate the size of the mem_map needed in bytes */
+	size = (PFN_DOWN(CONFIG_MEMHOTPLUG_BLKSIZE << 20) + 1)
+		* sizeof(struct page) + sizeof(pg_data_t);
+	/* convert size to large (pmd size) pages, rounding up */
+	size = (size + LARGE_PAGE_BYTES - 1) / LARGE_PAGE_BYTES;
+	/* now the roundup is correct, convert to PAGE_SIZE pages */
+	size = size * PTRS_PER_PTE;
+
+	return size;
+}
+
 unsigned long __init setup_memory(void)
 {
 	int nid;
@@ -234,13 +307,14 @@ unsigned long __init setup_memory(void)
 	unsigned long reserve_pages;
 
 	get_memcfg_numa();
-	reserve_pages = calculate_numa_remap_pages();
+	reserve_pages = calculate_blk_remap_pages() * (numnodes - 1);
 
 	/* partially used pages are not usable - thus round upwards */
 	system_start_pfn = min_low_pfn = PFN_UP(init_pg_tables_end);
 
 	find_max_pfn();
-	system_max_low_pfn = max_low_pfn = find_max_low_pfn();
+	system_max_low_pfn = max_low_pfn = (find_max_low_pfn() &
+	    ~(PTRS_PER_PTE - 1));
 #ifdef CONFIG_HIGHMEM
 	highstart_pfn = highend_pfn = max_pfn;
 	if (max_pfn > system_max_low_pfn)
@@ -256,14 +330,19 @@ unsigned long __init setup_memory(void)
 
 	printk("Low memory ends at vaddr %08lx\n",
 			(ulong) pfn_to_kaddr(max_low_pfn));
+#ifdef CONFIG_MEMHOTPLUG
+	for (nid = 1; nid < numnodes; nid++)
+		NODE_DATA(nid) = NULL;
+	nid = 0;
+	{
+#else
 	for (nid = 0; nid < numnodes; nid++) {
+#endif
 		node_remap_start_vaddr[nid] = pfn_to_kaddr(
-			highstart_pfn - node_remap_offset[nid]);
+			max_low_pfn + calculate_blk_remap_pages() * nid);
 		allocate_pgdat(nid);
-		printk ("node %d will remap to vaddr %08lx - %08lx\n", nid,
-			(ulong) node_remap_start_vaddr[nid],
-			(ulong) pfn_to_kaddr(highstart_pfn
-			    - node_remap_offset[nid] + node_remap_size[nid]));
+		printk ("node %d will remap to vaddr %08lx - \n", nid,
+			(ulong) node_remap_start_vaddr[nid]);
 	}
 	printk("High memory starts at vaddr %08lx\n",
 			(ulong) pfn_to_kaddr(highstart_pfn));
@@ -275,9 +354,12 @@ unsigned long __init setup_memory(void)
 	/*
 	 * Initialize the boot-time allocator (with low memory only):
 	 */
-	bootmap_size = init_bootmem_node(NODE_DATA(0), min_low_pfn, 0, system_max_low_pfn);
+	bootmap_size = init_bootmem_node(NODE_DATA(0), min_low_pfn, 0,
+	    (system_max_low_pfn > node_end_pfn[0]) ?
+	    node_end_pfn[0] : system_max_low_pfn);
 
-	register_bootmem_low_pages(system_max_low_pfn);
+	register_bootmem_low_pages((system_max_low_pfn > node_end_pfn[0]) ?
+	    node_end_pfn[0] : system_max_low_pfn);
 
 	/*
 	 * Reserve the bootmem bitmap itself as well. We do this in two
@@ -342,14 +424,26 @@ void __init zone_sizes_init(void)
 	 * Clobber node 0's links and NULL out pgdat_list before starting.
 	 */
 	pgdat_list = NULL;
-	for (nid = numnodes - 1; nid >= 0; nid--) {       
+#ifndef CONFIG_MEMHOTPLUG
+	for (nid = numnodes - 1; nid >= 0; nid--) {
+#else
+	nid = 0;
+	{
+#endif
 		if (nid)
 			memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
+		if (nid == 0)
+			NODE_DATA(nid)->enabled = 1;
 		NODE_DATA(nid)->pgdat_next = pgdat_list;
 		pgdat_list = NODE_DATA(nid);
 	}
 
+#ifdef CONFIG_MEMHOTPLUG
+	nid = 0;
+	{
+#else
 	for (nid = 0; nid < numnodes; nid++) {
+#endif
 		unsigned long zones_size[MAX_NR_ZONES] = {0, 0, 0};
 		unsigned long *zholes_size;
 		unsigned int max_dma;
@@ -368,14 +462,17 @@ void __init zone_sizes_init(void)
 		} else {
 			if (low < max_dma)
 				zones_size[ZONE_DMA] = low;
-			else {
+			else if (low <= high) {
 				BUG_ON(max_dma > low);
-				BUG_ON(low > high);
 				zones_size[ZONE_DMA] = max_dma;
 				zones_size[ZONE_NORMAL] = low - max_dma;
 #ifdef CONFIG_HIGHMEM
 				zones_size[ZONE_HIGHMEM] = high - low;
 #endif
+			} else {
+				BUG_ON(max_dma > low);
+				zones_size[ZONE_DMA] = max_dma;
+				zones_size[ZONE_NORMAL] = high - max_dma;
 			}
 		}
 		zholes_size = get_zholes_size(nid);
@@ -405,7 +502,11 @@ void __init set_highmem_pages_init(int b
 #ifdef CONFIG_HIGHMEM
 	int nid;
 
+#ifdef CONFIG_MEMHOTPLUG
+	for (nid = 0; nid < 1; nid++) {
+#else
 	for (nid = 0; nid < numnodes; nid++) {
+#endif
 		unsigned long node_pfn, node_high_size, zone_start_pfn;
 		struct page * zone_mem_map;
 		
@@ -423,12 +524,234 @@ void __init set_highmem_pages_init(int b
 #endif
 }
 
-void __init set_max_mapnr_init(void)
+void set_max_mapnr_init(void)
 {
 #ifdef CONFIG_HIGHMEM
+#ifndef CONFIG_MEMHOTPLUG
 	highmem_start_page = NODE_DATA(0)->node_zones[ZONE_HIGHMEM].zone_mem_map;
+#else
+	struct pglist_data *z = NULL;
+	int i;
+
+	for (i = 0; i < numnodes; i++) {
+		if (NODE_DATA(i) == NULL)
+			continue;
+		z = NODE_DATA(i);
+		highmem_start_page = z->node_zones[ZONE_HIGHMEM].zone_mem_map;
+		if (highmem_start_page != NULL)
+			break;
+	}
+	if (highmem_start_page == NULL)
+		highmem_start_page =
+		    z->node_zones[ZONE_NORMAL].zone_mem_map +
+		    z->node_zones[ZONE_NORMAL].spanned_pages;
+#endif
 	num_physpages = highend_pfn;
 #else
 	num_physpages = max_low_pfn;
 #endif
 }
+
+void
+plug_node(int nid)
+{
+	unsigned long zones_size[MAX_NR_ZONES] = {0, 0, 0};
+	unsigned long *zholes_size, addr, pfn;
+	unsigned long remapsize;
+	unsigned long flags;
+	int i, j;
+	struct page *node_mem_map, *page;
+	pg_data_t **pgdat;
+	struct mm_struct *mm;
+
+	unsigned long start = node_start_pfn[nid];
+	unsigned long high = node_end_pfn[nid];
+
+	BUG_ON(nid == 0);
+
+	allocate_pgdat(nid);
+
+	remapsize = calculate_blk_remap_pages();
+	addr = (unsigned long)(pfn_to_kaddr(max_low_pfn +
+	    (nid - 1) * remapsize));
+	
+	/* shrink size,
+	   which is done in calculate_numa_remap_pages() if normal NUMA */
+	high -= remapsize;
+	BUG_ON(start > high);
+
+	for(pfn = 0; pfn < remapsize; pfn += PTRS_PER_PTE)
+                set_pmd_pfn(addr + (pfn << PAGE_SHIFT), high + pfn,
+                    PAGE_KERNEL_LARGE);
+	spin_lock_irqsave(&pgd_lock, flags);
+	for (page = pgd_list; page; page = (struct page *)page->index) {
+		for(pfn = 0; pfn < remapsize; pfn += PTRS_PER_PTE) {
+			pgd_t *pgd;
+			pmd_t *pmd;
+
+			pgd = (pgd_t *)page_address(page) +
+			    pgd_index(addr + (pfn << PAGE_SHIFT));
+			pmd = pmd_offset(pgd, addr + (pfn << PAGE_SHIFT));
+			set_pmd(pmd, pfn_pmd(high + pfn, PAGE_KERNEL_LARGE));
+		}
+	}
+	spin_unlock_irqrestore(&pgd_lock, flags);
+	flush_tlb_all();
+
+	node_mem_map = (struct page *)((addr + sizeof(pg_data_t) +
+	    PAGE_SIZE - 1) & PAGE_MASK);
+	memset(node_mem_map, 0, (remapsize << PAGE_SHIFT) -
+	    ((char *)node_mem_map - (char *)addr));
+
+	printk("plug_node: %p %p\n", NODE_DATA(nid), node_mem_map);
+	memset(NODE_DATA(nid), 0, sizeof(*NODE_DATA(nid)));
+	printk("zeroed nodedata\n");
+
+	/* XXX defaults to hotremovable */ 
+	NODE_DATA(nid)->removable = 1;
+
+	BUG_ON(virt_to_phys((char *)MAX_DMA_ADDRESS) >> PAGE_SHIFT > start);
+	if (start <= max_low_pfn)
+		zones_size[ZONE_NORMAL] =
+		    (max_low_pfn > high ? high : max_low_pfn) - start;
+#ifdef CONFIG_HIGHMEM
+	if (high > max_low_pfn)
+		zones_size[ZONE_HIGHMEM] = high -
+		    ((start > max_low_pfn) ? start : max_low_pfn);
+#endif
+	zholes_size = get_zholes_size(nid);
+	free_area_init_node(nid, NODE_DATA(nid), node_mem_map, zones_size,
+	    start, zholes_size);
+
+	/* lock? */
+	for(pgdat = &pgdat_list; *pgdat; pgdat = &(*pgdat)->pgdat_next)
+		if ((*pgdat)->node_id > nid) {
+			NODE_DATA(nid)->pgdat_next = *pgdat;
+			*pgdat = NODE_DATA(nid);
+			break;
+		}
+	if (*pgdat == NULL)
+		*pgdat = NODE_DATA(nid);
+	{
+		struct zone *z;
+		for_each_zone (z)
+			printk("%p ", z);
+		printk("\n");
+	}
+	set_max_mapnr_init();
+
+	for(i = 0; i < MAX_NR_ZONES; i++) {
+		struct zone *z;
+		struct page *p;
+
+		z = &NODE_DATA(nid)->node_zones[i];
+
+		for(j = 0; j < z->spanned_pages; j++) {
+			p = &z->zone_mem_map[j];
+			ClearPageReserved(p);
+			if (i == ZONE_HIGHMEM)
+				set_bit(PG_highmem, &p->flags);
+			set_page_count(p, 1);
+			__free_page(p);
+		}
+	}
+	kswapd_start_one(NODE_DATA(nid));
+	setup_per_zone_pages_min();
+}
+
+void
+enable_node(int node)
+{
+	int i;
+	struct zone *z;
+
+	NODE_DATA(node)->enabled = 1;
+	build_all_zonelists();
+
+	for(i = 0; i < MAX_NR_ZONES; i++) {
+		z = zone_table[NODEZONE(node, i)];
+		totalram_pages += z->present_pages;
+		if (i == ZONE_HIGHMEM)
+			totalhigh_pages += z->present_pages;
+	}
+}
+
+void
+makepermanent_node(int node)
+{
+
+	NODE_DATA(node)->removable = 0;
+	build_all_zonelists();
+}
+	
+void
+disable_node(int node)
+{
+	int i;
+	struct zone *z;
+
+	NODE_DATA(node)->enabled = 0;
+	build_all_zonelists();
+
+	for(i = 0; i < MAX_NR_ZONES; i++) {
+		z = zone_table[NODEZONE(node, i)];
+		totalram_pages -= z->present_pages;
+		if (i == ZONE_HIGHMEM)
+			totalhigh_pages -= z->present_pages;
+	}
+}
+
+int
+unplug_node(int nid)
+{
+	int i;
+	struct zone *z;
+	pg_data_t *pgdat;
+	struct page *page;
+	unsigned long addr, pfn, remapsize;
+	unsigned long flags;
+
+	if (NODE_DATA(nid)->enabled)
+		return -1;
+	for(i = 0; i < MAX_NR_ZONES; i++) {
+		z = zone_table[NODEZONE(nid, i)];
+		if (z->present_pages != z->free_pages)
+			return -1;
+	}
+	kthread_stop(NODE_DATA(nid)->kswapd);
+
+	/* lock? */
+	for(pgdat = pgdat_list; pgdat; pgdat = pgdat->pgdat_next)
+		if (pgdat->pgdat_next == NODE_DATA(nid)) {
+			pgdat->pgdat_next = pgdat->pgdat_next->pgdat_next;
+			break;
+		}
+	BUG_ON(pgdat == NULL);
+
+	for(i = 0; i < MAX_NR_ZONES; i++)
+		zone_table[NODEZONE(nid, i)] = NULL;
+	NODE_DATA(nid) = NULL;
+
+	/* unmap node_mem_map */
+	remapsize = calculate_blk_remap_pages();
+	addr = (unsigned long)(pfn_to_kaddr(max_low_pfn +
+	    (nid - 1) * remapsize));
+	for(pfn = 0; pfn < remapsize; pfn += PTRS_PER_PTE)
+                set_pmd_pfn(addr + (pfn << PAGE_SHIFT), 0, __pgprot(0));
+	spin_lock_irqsave(&pgd_lock, flags);
+	for (page = pgd_list; page; page = (struct page *)page->index) {
+		for(pfn = 0; pfn < remapsize; pfn += PTRS_PER_PTE) {
+			pgd_t *pgd;
+			pmd_t *pmd;
+
+			pgd = (pgd_t *)page_address(page) +
+			    pgd_index(addr + (pfn << PAGE_SHIFT));
+			pmd = pmd_offset(pgd, addr + (pfn << PAGE_SHIFT));
+			pmd_clear(pmd);
+		}
+	}
+	spin_unlock_irqrestore(&pgd_lock, flags);
+	flush_tlb_all();
+
+	return 0;
+}
diff -dpur linux-2.6.7/arch/i386/mm/init.c linux-2.6.7-mh/arch/i386/mm/init.c
--- linux-2.6.7/arch/i386/mm/init.c	Thu Mar 11 11:55:37 2004
+++ linux-2.6.7-mh/arch/i386/mm/init.c	Wed Mar 31 19:38:26 2004
@@ -43,6 +43,7 @@
 
 DEFINE_PER_CPU(struct mmu_gather, mmu_gathers);
 unsigned long highstart_pfn, highend_pfn;
+extern unsigned long node_end_pfn[MAX_NUMNODES];
 
 static int do_test_wp_bit(void);
 
@@ -481,7 +482,11 @@ void __init mem_init(void)
 	totalram_pages += __free_all_bootmem();
 
 	reservedpages = 0;
+#ifdef CONFIG_MEMHOTPLUG
+	for (tmp = 0; tmp < node_end_pfn[0]; tmp++)
+#else
 	for (tmp = 0; tmp < max_low_pfn; tmp++)
+#endif
 		/*
 		 * Only count reserved RAM pages
 		 */
diff -dpur linux-2.6.7/include/asm-i386/mmzone.h linux-2.6.7-mh/include/asm-i386/mmzone.h
--- linux-2.6.7/include/asm-i386/mmzone.h	Thu Mar 11 11:55:27 2004
+++ linux-2.6.7-mh/include/asm-i386/mmzone.h	Wed Mar 31 19:38:26 2004
@@ -17,7 +17,9 @@
 		#include <asm/srat.h>
 	#endif
 #else /* !CONFIG_NUMA */
+#ifndef CONFIG_MEMHOTPLUG
 	#define get_memcfg_numa get_memcfg_numa_flat
+#endif
 	#define get_zholes_size(n) (0)
 #endif /* CONFIG_NUMA */
 
@@ -41,7 +43,7 @@ extern u8 physnode_map[];
 
 static inline int pfn_to_nid(unsigned long pfn)
 {
-#ifdef CONFIG_NUMA
+#if defined(CONFIG_NUMA) || defined (CONFIG_MEMHOTPLUG)
 	return(physnode_map[(pfn) / PAGES_PER_ELEMENT]);
 #else
 	return 0;
@@ -132,6 +134,10 @@ static inline int pfn_valid(int pfn)
 #endif
 
 extern int get_memcfg_numa_flat(void );
+#ifdef CONFIG_MEMHOTPLUG
+extern int get_memcfg_numa_blks(void);
+#endif
+
 /*
  * This allows any one NUMA architecture to be compiled
  * for, and still fall back to the flat function if it
@@ -144,6 +150,9 @@ static inline void get_memcfg_numa(void)
 		return;
 #elif CONFIG_ACPI_SRAT
 	if (get_memcfg_from_srat())
+		return;
+#elif CONFIG_MEMHOTPLUG
+	if (get_memcfg_numa_blks())
 		return;
 #endif
 
diff -dpur linux-2.6.7/include/asm-i386/numnodes.h linux-2.6.7-mh/include/asm-i386/numnodes.h
--- linux-2.6.7/include/asm-i386/numnodes.h	Thu Mar 11 11:55:23 2004
+++ linux-2.6.7-mh/include/asm-i386/numnodes.h	Wed Mar 31 19:38:26 2004
@@ -13,6 +13,8 @@
 /* Max 8 Nodes */
 #define NODES_SHIFT	3
 
-#endif /* CONFIG_X86_NUMAQ */
+#elif defined(CONFIG_MEMHOTPLUG)
+#define NODES_SHIFT	3
+#endif
 
 #endif /* _ASM_MAX_NUMNODES_H */
diff -dpur linux-2.6.7/mm/page_alloc.c linux-2.6.7-mh/mm/page_alloc.c
--- linux-2.6.7/mm/page_alloc.c	Thu Mar 11 11:55:22 2004
+++ linux-2.6.7-mh/mm/page_alloc.c	Thu Apr  1 16:54:26 2004
@@ -1177,7 +1177,12 @@ static inline unsigned long wait_table_b
 
 #define LONG_ALIGN(x) (((x)+(sizeof(long))-1)&~((sizeof(long))-1))
 
-static void __init calculate_zone_totalpages(struct pglist_data *pgdat,
+#ifdef CONFIG_MEMHOTPLUG
+static void
+#else
+static void __init
+#endif
+calculate_zone_totalpages(struct pglist_data *pgdat,
 		unsigned long *zones_size, unsigned long *zholes_size)
 {
 	unsigned long realtotalpages, totalpages = 0;
@@ -1231,8 +1236,13 @@ void __init memmap_init_zone(struct page
  *   - mark all memory queues empty
  *   - clear the memory bitmaps
  */
-static void __init free_area_init_core(struct pglist_data *pgdat,
-		unsigned long *zones_size, unsigned long *zholes_size)
+#ifdef CONFIG_MEMHOTPLUG
+static void
+#else
+static void __init
+#endif
+free_area_init_core(struct pglist_data *pgdat,
+	unsigned long *zones_size, unsigned long *zholes_size)
 {
 	unsigned long i, j;
 	const unsigned long zone_required_alignment = 1UL << (MAX_ORDER-1);
@@ -1371,7 +1381,12 @@ static void __init free_area_init_core(s
 	}
 }
 
-void __init free_area_init_node(int nid, struct pglist_data *pgdat,
+#ifdef CONFIG_MEMHOTPLUG
+void
+#else
+void __init
+#endif
+free_area_init_node(int nid, struct pglist_data *pgdat,
 		struct page *node_mem_map, unsigned long *zones_size,
 		unsigned long node_start_pfn, unsigned long *zholes_size)
 {
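
For orientation, the call order intended for the node lifecycle helpers added to discontig.c above can be read from the /proc test interface later in this series. A rough sketch under that reading, with invented wrapper names (example_online_node, example_offline_node):

static void example_online_node(int nid)
{
	plug_node(nid);		/* build pgdat and mem_map, free the new pages */
	enable_node(nid);	/* rebuild zonelists, account the memory */
}

static int example_offline_node(int nid)
{
	disable_node(nid);	/* stop further allocations from this node */
	/* kswapd/mmigrated must drain the zones before this can succeed */
	return unplug_node(nid);	/* -1 while any page is still in use */
}

unplug_node() also tears down the remapped node_mem_map, so nothing may reference the node's struct pages once it returns 0.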

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH] memory hotremoval for linux-2.6.7 [5/16]
  2004-07-14 13:41 [PATCH] memory hotremoval for linux-2.6.7 [0/16] Hirokazu Takahashi
                   ` (3 preceding siblings ...)
  2004-07-14 14:03 ` [PATCH] memory hotremoval for linux-2.6.7 [4/16] Hirokazu Takahashi
@ 2004-07-14 14:03 ` Hirokazu Takahashi
  2004-07-14 14:04 ` [PATCH] memory hotremoval for linux-2.6.7 [6/16] Hirokazu Takahashi
                   ` (10 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: Hirokazu Takahashi @ 2004-07-14 14:03 UTC (permalink / raw)
  To: linux-kernel, lhms-devel; +Cc: linux-mm


$Id: va-proc_memhotplug.patch,v 1.14 2004/07/06 09:05:54 taka Exp $

diff -dpur linux-2.6.7/mm/page_alloc.c linux-2.6.7-mh/mm/page_alloc.c
--- linux-2.6.7/mm/page_alloc.c.orig	2004-06-17 16:28:03.000000000 +0900
+++ linux-2.6.7-mh/mm/page_alloc.c	2004-06-17 16:28:34.000000000 +0900
@@ -32,6 +32,7 @@
 #include <linux/topology.h>
 #include <linux/sysctl.h>
 #include <linux/cpu.h>
+#include <linux/proc_fs.h>
 
 #include <asm/tlbflush.h>
 
@@ -2120,3 +2121,244 @@ int lower_zone_protection_sysctl_handler
 	setup_per_zone_protection();
 	return 0;
 }
+
+#ifdef CONFIG_MEMHOTPLUG
+static int mhtest_read(char *page, char **start, off_t off, int count,
+    int *eof, void *data)
+{
+	char *p;
+	int i, j, len;
+	const struct pglist_data *pgdat;
+	const struct zone *z;
+
+	p = page;
+	for(i = 0; i < numnodes; i++) {
+		pgdat = NODE_DATA(i);
+		if (pgdat == NULL)
+			continue;
+		len = sprintf(p, "Node %d %sabled %shotremovable\n", i,
+		    pgdat->enabled ? "en" : "dis",
+		    pgdat->removable ? "" : "non");
+		p += len;
+		for (j = 0; j < MAX_NR_ZONES; j++) {
+			z = &pgdat->node_zones[j];
+			if (! z->present_pages)
+				/* skip empty zone */
+				continue;
+			len = sprintf(p,
+			    "\t%s[%d]: free %ld, active %ld, present %ld\n",
+			    z->name, NODEZONE(i, j),
+			    z->free_pages, z->nr_active, z->present_pages);
+			p += len;
+		}
+		*p++ = '\n';
+	}
+	len = p - page;
+
+	if (len <= off + count)
+		*eof = 1;
+	*start = page + off;
+	len -= off;
+	if (len < 0)
+		len = 0;
+	if (len > count)
+		len = count;
+
+	return len;
+}
+
+static void mhtest_enable(int);
+static void mhtest_disable(int);
+static void mhtest_plug(int);
+static void mhtest_unplug(int);
+static void mhtest_purge(int);
+static void mhtest_remap(int);
+static void mhtest_active(int);
+static void mhtest_inuse(int);
+
+const static struct {
+	char *cmd;
+	void (*func)(int);
+	char zone_check;
+} mhtest_cmds[] = {
+	{ "disable", mhtest_disable, 0 },
+	{ "enable", mhtest_enable, 0 },
+	{ "plug", mhtest_plug, 0 },
+	{ "unplug", mhtest_unplug, 0 },
+	{ "purge", mhtest_purge, 1 },
+	{ "remap", mhtest_remap, 1 },
+	{ "active", mhtest_active, 1 },
+	{ "inuse", mhtest_inuse, 1 },
+	{ NULL, NULL }};
+
+static void
+mhtest_disable(int idx) {
+	int i, z;
+
+	printk("disable %d\n", idx);
+	/* XXX */
+	for (z = 0; z < MAX_NR_ZONES; z++) {
+		for (i = 0; i < NR_CPUS; i++) {
+			struct per_cpu_pages *pcp;
+
+			pcp = &zone_table[NODEZONE(idx, z)]->pageset[i].pcp[0];	/* hot */
+			pcp->low = pcp->high = 0;
+
+			pcp = &zone_table[NODEZONE(idx, z)]->pageset[i].pcp[1];	/* cold */
+			pcp->low = pcp->high = 0;
+		}
+		zone_table[NODEZONE(idx, z)]->pages_high =
+		    zone_table[NODEZONE(idx, z)]->present_pages;
+	}
+	disable_node(idx);
+}
+static void
+mhtest_enable(int idx) {
+	int i, z;
+
+	printk("enable %d\n", idx);
+	for (z = 0; z < MAX_NR_ZONES; z++) {
+		zone_table[NODEZONE(idx, z)]->pages_high = 
+		    zone_table[NODEZONE(idx, z)]->pages_min * 3;
+		/* XXX */
+		for (i = 0; i < NR_CPUS; i++) {
+			struct per_cpu_pages *pcp;
+
+			pcp = &zone_table[NODEZONE(idx, z)]->pageset[i].pcp[0];	/* hot */
+			pcp->low = 2 * pcp->batch;
+			pcp->high = 6 * pcp->batch;
+
+			pcp = &zone_table[NODEZONE(idx, z)]->pageset[i].pcp[1];	/* cold */
+			pcp->high = 2 * pcp->batch;
+		}
+	}
+	enable_node(idx);
+}
+
+static void
+mhtest_plug(int idx) {
+
+	if (NODE_DATA(idx) != NULL) {
+		printk("Already plugged\n");
+		return;
+	}
+	plug_node(idx);
+}
+
+static void
+mhtest_unplug(int idx) {
+
+	unplug_node(idx);
+}
+
+static void
+mhtest_purge(int idx)
+{
+	printk("purge %d\n", idx);
+	wake_up_interruptible(&zone_table[idx]->zone_pgdat->kswapd_wait);
+	/* XXX overkill, but who cares? */
+	on_each_cpu(drain_local_pages, NULL, 1, 1);
+}
+
+static void
+mhtest_remap(int idx) {
+
+	on_each_cpu(drain_local_pages, NULL, 1, 1);
+	kernel_thread(mmigrated, zone_table[idx], CLONE_KERNEL);
+}
+
+static void
+mhtest_active(int idx)
+{
+	struct list_head *l;
+	int i;
+
+	if (zone_table[idx] == NULL)
+		return;
+	spin_lock_irq(&zone_table[idx]->lru_lock);
+	i = 0;
+	list_for_each(l, &zone_table[idx]->active_list) {
+		printk(" %lx", (unsigned long)list_entry(l, struct page, lru));
+		i++;
+		if (i == 10)
+			break;
+	}
+	spin_unlock_irq(&zone_table[idx]->lru_lock);
+	printk("\n");
+}
+
+static void
+mhtest_inuse(int idx)
+{
+	int i;
+
+	if (zone_table[idx] == NULL)
+		return;
+	for(i = 0; i < zone_table[idx]->spanned_pages; i++)
+		if (page_count(&zone_table[idx]->zone_mem_map[i]))
+			printk(" %p", &zone_table[idx]->zone_mem_map[i]);
+	printk("\n");
+}
+
+static int mhtest_write(struct file *file, const char *buffer,
+    unsigned long count, void *data)
+{
+	int idx;
+	char buf[64], *p;
+	int i;
+
+	if (count > sizeof(buf) - 1)
+		count = sizeof(buf) - 1;
+	if (copy_from_user(buf, buffer, count))
+		return -EFAULT;
+
+	buf[count] = 0;
+
+	p = strchr(buf, ' ');
+	if (p == NULL)
+		goto out;
+
+	*p++ = '\0';
+	idx = (int)simple_strtoul(p, NULL, 0);
+
+	if (idx > MAX_NR_ZONES*MAX_NUMNODES) {
+		printk("Argument out of range\n");
+		goto out;
+	}
+
+	for(i = 0; ; i++) {
+		if (mhtest_cmds[i].cmd == NULL)
+			break;
+		if (strcmp(buf, mhtest_cmds[i].cmd) == 0) {
+			if (mhtest_cmds[i].zone_check) {
+				if (zone_table[idx] == NULL) {
+					printk("Zone %d not plugged\n", idx);
+					return count;
+				}
+			} else if (strcmp(buf, "plug") != 0 &&
+			    NODE_DATA(idx) == NULL) {
+				printk("Node %d not plugged\n", idx);
+				return count;
+			}
+			(mhtest_cmds[i].func)(idx);
+			break;
+		}
+	}
+out:
+	return count;
+}
+
+static int __init procmhtest_init(void)
+{
+	struct proc_dir_entry *entry;
+
+	entry = create_proc_entry("memhotplug", 0, NULL);
+	if (entry == NULL)
+		return -1;
+
+	entry->read_proc = &mhtest_read;
+	entry->write_proc = &mhtest_write;
+	return 0;
+}
+__initcall(procmhtest_init);
+#endif
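
A hypothetical userspace sketch of driving this test interface follows; it only restates what mhtest_write() above parses: one "<command> <index>" pair per write, where plug/unplug/enable/disable take a node id and purge/remap/active/inuse take a zone_table index (NODEZONE(node, zone)). Reading /proc/memhotplug reports the per-node and per-zone state.

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

static int mhtest_cmd(const char *cmd)
{
	int fd = open("/proc/memhotplug", O_WRONLY);
	ssize_t ret;

	if (fd < 0)
		return -1;
	ret = write(fd, cmd, strlen(cmd));	/* one command per write() */
	close(fd);
	return ret < 0 ? -1 : 0;
}

int main(void)
{
	mhtest_cmd("plug 1");	/* node id: set up pgdat and mem_map for node 1 */
	mhtest_cmd("enable 1");	/* node id: allow allocations from node 1 */
	return 0;
}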

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH] memory hotremoval for linux-2.6.7 [6/16]
  2004-07-14 13:41 [PATCH] memory hotremoval for linux-2.6.7 [0/16] Hirokazu Takahashi
                   ` (4 preceding siblings ...)
  2004-07-14 14:03 ` [PATCH] memory hotremoval for linux-2.6.7 [5/16] Hirokazu Takahashi
@ 2004-07-14 14:04 ` Hirokazu Takahashi
  2004-07-14 14:04 ` [PATCH] memory hotremoval for linux-2.6.7 [7/16] Hirokazu Takahashi
                   ` (9 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: Hirokazu Takahashi @ 2004-07-14 14:04 UTC (permalink / raw)
  To: linux-kernel, lhms-devel; +Cc: linux-mm

$Id: va-aio.patch,v 1.11 2004/06/17 08:19:45 iwamoto Exp $

--- linux-2.6.7.ORG/arch/i386/kernel/sys_i386.c.orig	2004-06-17 16:20:02.000000000 +0900
+++ linux-2.6.7/arch/i386/kernel/sys_i386.c	2004-06-17 16:20:12.000000000 +0900
@@ -70,7 +70,7 @@ asmlinkage long sys_mmap2(unsigned long 
 	unsigned long prot, unsigned long flags,
 	unsigned long fd, unsigned long pgoff)
 {
-	return do_mmap2(addr, len, prot, flags, fd, pgoff);
+	return do_mmap2(addr, len, prot, flags & ~MAP_IMMOVABLE, fd, pgoff);
 }
 
 /*
@@ -101,7 +101,8 @@ asmlinkage int old_mmap(struct mmap_arg_
 	if (a.offset & ~PAGE_MASK)
 		goto out;
 
-	err = do_mmap2(a.addr, a.len, a.prot, a.flags, a.fd, a.offset >> PAGE_SHIFT);
+	err = do_mmap2(a.addr, a.len, a.prot, a.flags & ~MAP_IMMOVABLE,
+	    a.fd, a.offset >> PAGE_SHIFT);
 out:
 	return err;
 }
--- linux-2.6.7.ORG/fs/aio.c	2004-06-17 16:20:02.000000000 +0900
+++ linux-2.6.7/fs/aio.c	2004-06-17 16:20:12.000000000 +0900
@@ -130,7 +130,8 @@ static int aio_setup_ring(struct kioctx 
 	dprintk("attempting mmap of %lu bytes\n", info->mmap_size);
 	down_write(&ctx->mm->mmap_sem);
 	info->mmap_base = do_mmap(NULL, 0, info->mmap_size, 
-				  PROT_READ|PROT_WRITE, MAP_ANON|MAP_PRIVATE,
+				  PROT_READ|PROT_WRITE,
+				  MAP_ANON|MAP_PRIVATE|MAP_IMMOVABLE,
 				  0);
 	if (IS_ERR((void *)info->mmap_base)) {
 		up_write(&ctx->mm->mmap_sem);
--- linux-2.6.7.ORG/include/asm-i386/mman.h	2004-06-17 16:20:02.000000000 +0900
+++ linux-2.6.7/include/asm-i386/mman.h	2004-06-17 16:20:12.000000000 +0900
@@ -22,6 +22,7 @@
 #define MAP_NORESERVE	0x4000		/* don't check for reservations */
 #define MAP_POPULATE	0x8000		/* populate (prefault) pagetables */
 #define MAP_NONBLOCK	0x10000		/* do not block on IO */
+#define MAP_IMMOVABLE	0x20000
 
 #define MS_ASYNC	1		/* sync memory asynchronously */
 #define MS_INVALIDATE	2		/* invalidate the caches */
--- linux-2.6.7.ORG/include/asm-ia64/mman.h	2004-06-17 16:20:02.000000000 +0900
+++ linux-2.6.7/include/asm-ia64/mman.h	2004-06-17 16:20:12.000000000 +0900
@@ -30,6 +30,7 @@
 #define MAP_NORESERVE	0x04000		/* don't check for reservations */
 #define MAP_POPULATE	0x08000		/* populate (prefault) pagetables */
 #define MAP_NONBLOCK	0x10000		/* do not block on IO */
+#define MAP_IMMOVABLE	0x20000
 
 #define MS_ASYNC	1		/* sync memory asynchronously */
 #define MS_INVALIDATE	2		/* invalidate the caches */
--- linux-2.6.7.ORG/include/linux/mm.h	2004-06-17 16:20:02.000000000 +0900
+++ linux-2.6.7/include/linux/mm.h	2004-06-17 16:20:12.000000000 +0900
@@ -134,6 +134,7 @@ struct vm_area_struct {
 #define VM_ACCOUNT	0x00100000	/* Is a VM accounted object */
 #define VM_HUGETLB	0x00400000	/* Huge TLB Page VM */
 #define VM_NONLINEAR	0x00800000	/* Is non-linear (remap_file_pages) */
+#define VM_IMMOVABLE	0x01000000	/* Don't place in hot removable area */
 
 #ifndef VM_STACK_DEFAULT_FLAGS		/* arch can override this */
 #define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS
--- linux-2.6.7.ORG/include/linux/mman.h	2004-06-17 16:20:02.000000000 +0900
+++ linux-2.6.7/include/linux/mman.h	2004-06-17 16:20:12.000000000 +0900
@@ -58,7 +58,11 @@ calc_vm_flag_bits(unsigned long flags)
 	return _calc_vm_trans(flags, MAP_GROWSDOWN,  VM_GROWSDOWN ) |
 	       _calc_vm_trans(flags, MAP_DENYWRITE,  VM_DENYWRITE ) |
 	       _calc_vm_trans(flags, MAP_EXECUTABLE, VM_EXECUTABLE) |
-	       _calc_vm_trans(flags, MAP_LOCKED,     VM_LOCKED    );
+	       _calc_vm_trans(flags, MAP_LOCKED,     VM_LOCKED    )
+#ifdef CONFIG_MEMHOTPLUG
+	     | _calc_vm_trans(flags, MAP_IMMOVABLE,  VM_IMMOVABLE )
+#endif
+		;
 }
 
 #endif /* _LINUX_MMAN_H */
--- linux-2.6.7.ORG/kernel/fork.c	2004-06-17 16:20:02.000000000 +0900
+++ linux-2.6.7/kernel/fork.c	2004-06-17 16:20:12.000000000 +0900
@@ -321,6 +321,9 @@ static inline int dup_mmap(struct mm_str
 			goto fail_nomem_policy;
 		vma_set_policy(tmp, pol);
 		tmp->vm_flags &= ~VM_LOCKED;
+#ifdef CONFIG_MEMHOTPLUG
+		tmp->vm_flags &= ~VM_IMMOVABLE;
+#endif
 		tmp->vm_mm = mm;
 		tmp->vm_next = NULL;
 		anon_vma_link(tmp);
--- linux-2.6.7.ORG/mm/memory.c	2004-06-17 16:20:02.000000000 +0900
+++ linux-2.6.7/mm/memory.c	2004-06-17 16:20:12.000000000 +0900
@@ -1069,7 +1069,13 @@ static int do_wp_page(struct mm_struct *
 
 	if (unlikely(anon_vma_prepare(vma)))
 		goto no_new_page;
-	new_page = alloc_page_vma(GFP_HIGHUSER, vma, address);
+#ifdef CONFIG_MEMHOTPLUG
+	if (vma->vm_flags & VM_IMMOVABLE)
+		new_page = alloc_page_vma(GFP_USER | __GFP_HIGHMEM,
+		    vma, address);
+	else
+#endif
+		new_page = alloc_page_vma(GFP_HIGHUSER, vma, address);
 	if (!new_page)
 		goto no_new_page;
 	copy_cow_page(old_page,new_page,address);
@@ -1412,6 +1418,12 @@ do_anonymous_page(struct mm_struct *mm, 
 
 		if (unlikely(anon_vma_prepare(vma)))
 			goto no_mem;
+#ifdef CONFIG_MEMHOTPLUG
+		if (vma->vm_flags & VM_IMMOVABLE)
+			page = alloc_page_vma(GFP_USER | __GFP_HIGHMEM,
+			    vma, addr);
+		else
+#endif
 		page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
 		if (!page)
 			goto no_mem;
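
Both syscall entry points above mask MAP_IMMOVABLE off, so only in-kernel callers can request it, which is what the fs/aio.c hunk does for its ring buffer. A minimal in-kernel sketch modeled on that hunk (map_pinned_buffer is an invented name, not part of the patch):

static unsigned long map_pinned_buffer(struct mm_struct *mm, size_t size)
{
	unsigned long addr;

	down_write(&mm->mmap_sem);
	/* MAP_IMMOVABLE becomes VM_IMMOVABLE, which is intended to keep
	   these pages out of hot-removable areas. */
	addr = do_mmap(NULL, 0, size, PROT_READ | PROT_WRITE,
		       MAP_ANON | MAP_PRIVATE | MAP_IMMOVABLE, 0);
	up_write(&mm->mmap_sem);
	return addr;		/* check with IS_ERR((void *)addr) on failure */
}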


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH] memory hotremoval for linux-2.6.7 [7/16]
  2004-07-14 13:41 [PATCH] memory hotremoval for linux-2.6.7 [0/16] Hirokazu Takahashi
                   ` (5 preceding siblings ...)
  2004-07-14 14:04 ` [PATCH] memory hotremoval for linux-2.6.7 [6/16] Hirokazu Takahashi
@ 2004-07-14 14:04 ` Hirokazu Takahashi
  2004-07-14 14:04 ` [PATCH] memory hotremoval for linux-2.6.7 [8/16] Hirokazu Takahashi
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: Hirokazu Takahashi @ 2004-07-14 14:04 UTC (permalink / raw)
  To: linux-kernel, lhms-devel; +Cc: linux-mm

$Id: va-shmem.patch,v 1.5 2004/04/14 06:36:05 iwamoto Exp $

--- linux-2.6.5.ORG/mm/shmem.c	Fri Apr  2 14:05:11 2032
+++ linux-2.6.5/mm/shmem.c	Fri Apr  2 14:43:37 2032
@@ -80,7 +80,13 @@ static inline struct page *shmem_dir_all
 	 * BLOCKS_PER_PAGE on indirect pages, assume PAGE_CACHE_SIZE:
 	 * might be reconsidered if it ever diverges from PAGE_SIZE.
 	 */
+#ifdef CONFIG_MEMHOTPLUG
+	return alloc_pages((gfp_mask & GFP_ZONEMASK) == __GFP_HOTREMOVABLE ? 
+	 	(gfp_mask & ~GFP_ZONEMASK) | __GFP_HIGHMEM : gfp_mask, 
+		    PAGE_CACHE_SHIFT-PAGE_SHIFT);
+#else
 	return alloc_pages(gfp_mask, PAGE_CACHE_SHIFT-PAGE_SHIFT);
+#endif
 }
 
 static inline void shmem_dir_free(struct page *page)

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH] memory hotremoval for linux-2.6.7 [8/16]
  2004-07-14 13:41 [PATCH] memory hotremoval for linux-2.6.7 [0/16] Hirokazu Takahashi
                   ` (6 preceding siblings ...)
  2004-07-14 14:04 ` [PATCH] memory hotremoval for linux-2.6.7 [7/16] Hirokazu Takahashi
@ 2004-07-14 14:04 ` Hirokazu Takahashi
  2004-07-14 14:05 ` [PATCH] memory hotremoval for linux-2.6.7 [9/16] Hirokazu Takahashi
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: Hirokazu Takahashi @ 2004-07-14 14:04 UTC (permalink / raw)
  To: linux-kernel, lhms-devel; +Cc: linux-mm

--- linux-2.6.7.ORG/include/linux/page-flags.h	Sun Jul 11 10:45:27 2032
+++ linux-2.6.7/include/linux/page-flags.h	Sun Jul 11 10:51:49 2032
@@ -79,6 +79,7 @@
 #define PG_anon			20	/* Anonymous: anon_vma in mapping */
 
 #define PG_again		21
+#define PG_booked		22
 
 
 /*
@@ -303,6 +304,10 @@ extern unsigned long __read_page_state(u
 #define PageAgain(page)	test_bit(PG_again, &(page)->flags)
 #define SetPageAgain(page)	set_bit(PG_again, &(page)->flags)
 #define ClearPageAgain(page)	clear_bit(PG_again, &(page)->flags)
+
+#define PageBooked(page)	test_bit(PG_booked, &(page)->flags)
+#define SetPageBooked(page)	set_bit(PG_booked, &(page)->flags)
+#define ClearPageBooked(page)	clear_bit(PG_booked, &(page)->flags)
 
 #define PageAnon(page)		test_bit(PG_anon, &(page)->flags)
 #define SetPageAnon(page)	set_bit(PG_anon, &(page)->flags)
--- linux-2.6.7.ORG/include/linux/mmzone.h	Sun Jul 11 10:45:27 2032
+++ linux-2.6.7/include/linux/mmzone.h	Sun Jul 11 10:51:49 2032
@@ -187,6 +187,9 @@ struct zone {
 	char			*name;
 	unsigned long		spanned_pages;	/* total size, including holes */
 	unsigned long		present_pages;	/* amount of memory (excluding holes) */
+	unsigned long		contig_pages_alloc_hint;
+	unsigned long		booked_pages;
+	long			scan_pages;
 } ____cacheline_maxaligned_in_smp;
 
 
--- linux-2.6.7.ORG/mm/page_alloc.c	Sun Jul 11 10:49:53 2032
+++ linux-2.6.7/mm/page_alloc.c	Sun Jul 11 10:53:04 2032
@@ -12,6 +12,7 @@
  *  Zone balancing, Kanoj Sarcar, SGI, Jan 2000
  *  Per cpu hot/cold page lists, bulk allocation, Martin J. Bligh, Sept 2002
  *          (lots of bits borrowed from Ingo Molnar & Andrew Morton)
+ *  Dynamic compound page allocation, Hirokazu Takahashi, Jul 2004
  */
 
 #include <linux/config.h>
@@ -25,6 +26,7 @@
 #include <linux/module.h>
 #include <linux/suspend.h>
 #include <linux/pagevec.h>
+#include <linux/mm_inline.h>
 #include <linux/memhotplug.h>
 #include <linux/blkdev.h>
 #include <linux/slab.h>
@@ -190,7 +192,11 @@ static inline void __free_pages_bulk (st
 		BUG();
 	index = page_idx >> (1 + order);
 
-	zone->free_pages -= mask;
+	if (!PageBooked(page))
+		zone->free_pages -= mask;
+	else {
+		zone->booked_pages -= mask;
+	}
 	while (mask + (1 << (MAX_ORDER-1))) {
 		struct page *buddy1, *buddy2;
 
@@ -209,6 +215,9 @@ static inline void __free_pages_bulk (st
 		buddy2 = base + page_idx;
 		BUG_ON(bad_range(zone, buddy1));
 		BUG_ON(bad_range(zone, buddy2));
+		if (PageBooked(buddy1) != PageBooked(buddy2)) {
+			break;
+		}
 		list_del(&buddy1->lru);
 		mask <<= 1;
 		area++;
@@ -371,7 +380,12 @@ static struct page *__rmqueue(struct zon
 		if (list_empty(&area->free_list))
 			continue;
 
-		page = list_entry(area->free_list.next, struct page, lru);
+		list_for_each_entry(page, &area->free_list, lru) {
+			if (!PageBooked(page))
+				goto gotit;
+		}
+		continue;
+gotit:
 		list_del(&page->lru);
 		index = page - zone->zone_mem_map;
 		if (current_order != MAX_ORDER-1)
@@ -503,6 +517,11 @@ static void fastcall free_hot_cold_page(
 	struct per_cpu_pages *pcp;
 	unsigned long flags;
 
+	if (PageBooked(page)) {
+		__free_pages_ok(page, 0);
+		return;
+	}
+
 	kernel_map_pages(page, 1, 0);
 	inc_page_state(pgfree);
 	free_pages_check(__FUNCTION__, page);
@@ -572,6 +591,225 @@ buffered_rmqueue(struct zone *zone, int 
 	return page;
 }
 
+#if defined(CONFIG_HUGETLB_PAGE) && defined(CONFIG_MEMHOTPLUG)
+/* 
+ * Check whether the page is freeable or not.
+ * It might not actually be freeable even if this function says OK,
+ * e.g. when the page is just being allocated.
+ * This check is almost sufficient but not perfect.
+ */
+static inline int is_page_freeable(struct page *page)
+{
+	return (page->mapping || page_mapped(page) || !page_count(page)) &&
+	    !(page->flags & (1<<PG_reserved|1<<PG_compound|1<<PG_booked|1<<PG_slab));
+}
+
+static inline int is_free_page(struct page *page)
+{
+	return !(page_mapped(page) ||
+		page->mapping != NULL ||
+		page_count(page) != 0 ||
+		(page->flags & (
+			1 << PG_reserved|
+			1 << PG_compound|
+			1 << PG_booked	|
+			1 << PG_lru	|
+			1 << PG_private |
+			1 << PG_locked	|
+			1 << PG_active	|
+			1 << PG_reclaim	|
+			1 << PG_dirty	|
+			1 << PG_slab	|
+			1 << PG_writeback )));
+}
+
+static int
+try_to_book_pages(struct zone *zone, struct page *page, unsigned int order)
+{
+	struct page	*p;
+	int booked_count = 0;
+	unsigned long	flags;
+
+	spin_lock_irqsave(&zone->lock, flags);
+
+	for (p = page; p < &page[1<<order]; p++) {
+		if (!is_page_freeable(p))
+			goto out;
+		if (is_free_page(p))
+			booked_count++;
+		SetPageBooked(p);
+	}
+
+	zone->booked_pages = booked_count;
+	zone->free_pages -= booked_count;
+
+	spin_unlock_irqrestore(&zone->lock, flags);
+	return 1;
+out:
+	for (p--; p >= page; p--) {
+		ClearPageBooked(p);
+	}
+	spin_unlock_irqrestore(&zone->lock, flags);
+	return 0;
+}
+
+/*
+ * Mark PG_booked on all pages in a specified section to reserve 
+ * for future use. These won't be reused until PG_booked is cleared.
+ */
+static struct page *
+book_pages(struct zone *zone, unsigned int gfp_mask, unsigned int order)
+{
+	unsigned long	num = 1<<order;
+	unsigned long	slot = zone->contig_pages_alloc_hint;
+	struct page	*page;
+	
+	slot = (slot + num - 1) & ~(num - 1);	/* align */
+
+	for ( ; zone->scan_pages > 0; slot += num) {
+		zone->scan_pages -= num;
+		if (slot + num > zone->present_pages)
+			slot = 0;
+		page = &zone->zone_mem_map[slot];
+		if (try_to_book_pages(zone, page, order)) {
+			zone->contig_pages_alloc_hint = slot + num;
+			return page;
+		}
+	}
+	return NULL;
+}
+
+static void
+unbook_pages(struct zone *zone, struct page *page, unsigned int order)
+{
+	struct page	*p;
+	for (p = page; p < &page[1<<order]; p++) {
+		ClearPageBooked(p);
+	}
+}
+
+struct sweepctl {
+	struct page *start;
+	struct page *end;
+	int rest;
+};
+
+/*
+ * Choose a page among the booked pages.
+ *
+ */
+static struct page*
+get_booked_page(struct zone *zone, void *arg)
+{
+	struct sweepctl *ctl = (struct sweepctl *)arg;
+	struct page *page = ctl->start;
+	struct page *end = ctl->end;
+
+	for (; page <= end; page++) {
+		if (!page_count(page) && !PageLRU(page))
+			continue;
+		if (!PageBooked(page)) {
+			printk(KERN_ERR "ERROR sweepout_pages: page:%p isn't booked.\n", page);
+		}
+		if (!PageLRU(page) || steal_page_from_lru(zone, page) == NULL) {
+			ctl->rest++;
+			continue;
+		}
+		ctl->start = page + 1;
+		return page;
+	}
+	ctl->start = end + 1;
+	return NULL;
+}
+
+/*
+ * sweepout_pages() might not work well as the booked pages 
+ * might include some unfreeable pages.
+ */
+static int
+sweepout_pages(struct zone *zone, struct page *page, int num)
+{
+	struct sweepctl ctl;
+	int failed = 0;
+	int retry = 0;
+again:
+	on_each_cpu((void (*)(void*))drain_local_pages, NULL, 1, 1);
+	ctl.start = page;
+	ctl.end = &page[num - 1];
+	ctl.rest = 0;
+	failed = try_to_migrate_pages(zone, MIGRATE_ANYNODE, get_booked_page, &ctl);
+
+	if (retry != failed || ctl.rest) {
+		retry = failed;
+		schedule_timeout(HZ/4);
+		/* Actually we should wait on the pages */
+		goto again;
+	}
+
+	on_each_cpu((void (*)(void*))drain_local_pages, NULL, 1, 1);
+	return failed;
+}
+
+/*
+ * Allocate contiguous pages even if pages are fragmented in zones.
+ * Page Migration mechanism helps to make enough space in them.
+ */
+static struct page *
+force_alloc_pages(unsigned int gfp_mask, unsigned int order,
+			struct zonelist *zonelist)
+{
+	struct zone **zones = zonelist->zones;
+	struct zone *zone;
+	struct page *page = NULL;
+	unsigned long flags;
+	int i;
+	int ret;
+
+	static DECLARE_MUTEX(bookedpage_sem);
+
+	down(&bookedpage_sem);
+
+	for (i = 0; zones[i] != NULL; i++) {
+		zone = zones[i];
+		zone->scan_pages = zone->present_pages;
+		while (zone->scan_pages > 0) {
+			page = book_pages(zone, gfp_mask, order);
+			if (!page)
+				break;
+			ret = sweepout_pages(zone, page, 1<<order);
+			if (ret) {
+				spin_lock_irqsave(&zone->lock, flags);
+				unbook_pages(zone, page, order);
+				page = NULL;
+
+				zone->free_pages += zone->booked_pages;
+				spin_unlock_irqrestore(&zone->lock, flags);
+				continue;
+			}
+			spin_lock_irqsave(&zone->lock, flags);
+			unbook_pages(zone, page, order);
+			zone->free_pages += zone->booked_pages;
+			page = __rmqueue(zone, order);
+			spin_unlock_irqrestore(&zone->lock, flags);
+			if (page) {
+				prep_compound_page(page, order);
+				up(&bookedpage_sem);
+				return page;
+			}
+		}
+	}
+	up(&bookedpage_sem);
+	return NULL;
+}
+#endif /* CONFIG_HUGETLB_PAGE */
+
+static inline int
+enough_pages(struct zone *zone, unsigned long min, const int wait)
+{
+	return (long)zone->free_pages - (long)min >= 0 ||
+		(!wait && (long)zone->free_pages - (long)zone->pages_high >= 0);
+}
+
 /*
  * This is the 'heart' of the zoned buddy allocator.
  *
@@ -623,8 +861,7 @@ __alloc_pages(unsigned int gfp_mask, uns
 		if (rt_task(p))
 			min -= z->pages_low >> 1;
 
-		if (z->free_pages >= min ||
-				(!wait && z->free_pages >= z->pages_high)) {
+		if (enough_pages(z, min, wait)) {
 			page = buffered_rmqueue(z, order, gfp_mask);
 			if (page) {
 				zone_statistics(zonelist, z);
@@ -648,8 +885,7 @@ __alloc_pages(unsigned int gfp_mask, uns
 		if (rt_task(p))
 			min -= z->pages_low >> 1;
 
-		if (z->free_pages >= min ||
-				(!wait && z->free_pages >= z->pages_high)) {
+		if (enough_pages(z, min, wait)) {
 			page = buffered_rmqueue(z, order, gfp_mask);
 			if (page) {
 				zone_statistics(zonelist, z);
@@ -694,8 +930,7 @@ rebalance:
 
 		min = (1UL << order) + z->protection[alloc_type];
 
-		if (z->free_pages >= min ||
-				(!wait && z->free_pages >= z->pages_high)) {
+		if (enough_pages(z, min, wait)) {
 			page = buffered_rmqueue(z, order, gfp_mask);
 			if (page) {
  				zone_statistics(zonelist, z);
@@ -703,6 +938,20 @@ rebalance:
 			}
 		}
 	}
+
+#if defined(CONFIG_HUGETLB_PAGE) && defined(CONFIG_MEMHOTPLUG)
+	/*
+	 * Defrag pages to allocate large contiguous pages
+	 *
+	 * FIXME: The following code will work only if CONFIG_HUGETLB_PAGE
+	 *        flag is on.
+	 */
+	if (order) {
+		page = force_alloc_pages(gfp_mask, order, zonelist);
+		if (page)
+			goto got_pg;
+	}
+#endif /* CONFIG_HUGETLB_PAGE */
 
 	/*
 	 * Don't let big-order allocations loop unless the caller explicitly
--- linux-2.6.7.ORG/mm/memhotplug.c	Sun Jul 11 10:45:27 2032
+++ linux-2.6.7/mm/memhotplug.c	Sun Jul 11 10:51:49 2032
@@ -240,7 +240,7 @@ radix_tree_replace_pages(struct page *pa
 	}
 	/* Don't __put_page(page) here.  Truncate may be in progress. */
 	newpage->flags |= page->flags & ~(1 << PG_uptodate) &
-	    ~(1 << PG_highmem) & ~(1 << PG_anon) &
+	    ~(1 << PG_highmem) & ~(1 << PG_anon) & ~(1 << PG_booked) &
 	    ~(1 << PG_maplock) &
 	    ~(1 << PG_active) & ~(~0UL << NODEZONE_SHIFT);
 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH] memory hotremoval for linux-2.6.7 [9/16]
  2004-07-14 13:41 [PATCH] memory hotremoval for linux-2.6.7 [0/16] Hirokazu Takahashi
                   ` (7 preceding siblings ...)
  2004-07-14 14:04 ` [PATCH] memory hotremoval for linux-2.6.7 [8/16] Hirokazu Takahashi
@ 2004-07-14 14:05 ` Hirokazu Takahashi
  2004-07-14 14:05 ` [PATCH] memory hotremoval for linux-2.6.7 [10/16] Hirokazu Takahashi
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: Hirokazu Takahashi @ 2004-07-14 14:05 UTC (permalink / raw)
  To: linux-kernel, lhms-devel; +Cc: linux-mm


--- linux-2.6.7.ORG/include/linux/hugetlb.h	Mon Jul  5 14:01:34 2032
+++ linux-2.6.7/include/linux/hugetlb.h	Mon Jul  5 14:00:53 2032
@@ -25,6 +25,8 @@ struct page *follow_huge_addr(struct mm_
 			      int write);
 struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address,
 				pmd_t *pmd, int write);
+extern int hugetlb_fault(struct mm_struct *, struct vm_area_struct *,
+				int, unsigned long);
 int is_aligned_hugepage_range(unsigned long addr, unsigned long len);
 int pmd_huge(pmd_t pmd);
 struct page *alloc_huge_page(void);
@@ -81,6 +83,7 @@ static inline unsigned long hugetlb_tota
 #define hugetlb_free_pgtables(tlb, prev, start, end) do { } while (0)
 #define alloc_huge_page()			({ NULL; })
 #define free_huge_page(p)			({ (void)(p); BUG(); })
+#define hugetlb_fault(mm, vma, write, addr)	0
 
 #ifndef HPAGE_MASK
 #define HPAGE_MASK	0		/* Keep the compiler happy */
--- linux-2.6.7.ORG/mm/memory.c	Mon Jul  5 14:01:34 2032
+++ linux-2.6.7/mm/memory.c	Mon Jul  5 13:55:53 2032
@@ -1683,7 +1683,7 @@ int handle_mm_fault(struct mm_struct *mm
 	inc_page_state(pgfault);
 
 	if (is_vm_hugetlb_page(vma))
-		return VM_FAULT_SIGBUS;	/* mapping truncation does this. */
+		return hugetlb_fault(mm, vma, write_access, address);
 
 	/*
 	 * We need the page table lock to synchronize with kswapd
--- linux-2.6.7.ORG/arch/i386/mm/hugetlbpage.c	Mon Jul  5 14:01:34 2032
+++ linux-2.6.7/arch/i386/mm/hugetlbpage.c	Mon Jul  5 14:02:37 2032
@@ -80,10 +80,12 @@ int copy_hugetlb_page_range(struct mm_st
 			goto nomem;
 		src_pte = huge_pte_offset(src, addr);
 		entry = *src_pte;
-		ptepage = pte_page(entry);
-		get_page(ptepage);
+		if (!pte_none(entry)) {
+			ptepage = pte_page(entry);
+			get_page(ptepage);
+			dst->rss += (HPAGE_SIZE / PAGE_SIZE);
+		}
 		set_pte(dst_pte, entry);
-		dst->rss += (HPAGE_SIZE / PAGE_SIZE);
 		addr += HPAGE_SIZE;
 	}
 	return 0;
@@ -111,6 +113,11 @@ follow_hugetlb_page(struct mm_struct *mm
 
 			pte = huge_pte_offset(mm, vaddr);
 
+			if (!pte || pte_none(*pte)) {
+				hugetlb_fault(mm, vma, 0, vaddr);
+				pte = huge_pte_offset(mm, vaddr);
+			}
+
 			/* hugetlb should be locked, and hence, prefaulted */
 			WARN_ON(!pte || pte_none(*pte));
 
@@ -198,6 +205,13 @@ follow_huge_pmd(struct mm_struct *mm, un
 	struct page *page;
 
 	page = pte_page(*(pte_t *)pmd);
+	if (!page) {
+		struct vm_area_struct *vma = find_vma(mm, address);
+		if (!vma)
+			return NULL;
+		hugetlb_fault(mm, vma, write, address);
+		page = pte_page(*(pte_t *)pmd);
+	}
 	if (page)
 		page += ((address & ~HPAGE_MASK) >> PAGE_SHIFT);
 	return page;
@@ -221,11 +235,71 @@ void unmap_hugepage_range(struct vm_area
 			continue;
 		page = pte_page(pte);
 		put_page(page);
+		mm->rss -= (HPAGE_SIZE / PAGE_SIZE);
 	}
-	mm->rss -= (end - start) >> PAGE_SHIFT;
 	flush_tlb_range(vma, start, end);
 }
 
+int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma, int write_access, unsigned long address)
+{
+	struct file *file = vma->vm_file;
+	struct address_space *mapping = file->f_dentry->d_inode->i_mapping;
+	struct page *page;
+	unsigned long idx;
+	pte_t *pte = huge_pte_alloc(mm, address);
+	int ret;
+
+	BUG_ON(vma->vm_start & ~HPAGE_MASK);
+	BUG_ON(vma->vm_end & ~HPAGE_MASK);
+
+	if (!pte) {
+		ret = VM_FAULT_SIGBUS;
+		goto out;
+	}
+
+	if (!pte_none(*pte)) {
+		ret = VM_FAULT_MINOR;
+		goto out;
+	}
+
+	idx = ((address - vma->vm_start) >> HPAGE_SHIFT)
+		+ (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
+again:
+	page = find_lock_page(mapping, idx);
+
+	if (!page) {
+		if (hugetlb_get_quota(mapping)) {
+			ret = VM_FAULT_SIGBUS;
+			goto out;
+		}
+		page = alloc_huge_page();
+		if (!page) {
+			hugetlb_put_quota(mapping);
+			ret = VM_FAULT_SIGBUS;
+			goto out;
+		}
+		ret = add_to_page_cache(page, mapping, idx, GFP_ATOMIC);
+		if (ret) {
+			hugetlb_put_quota(mapping);
+			put_page(page);
+			goto again;
+		}
+	}
+	spin_lock(&mm->page_table_lock);
+	if (pte_none(*pte)) {
+		set_huge_pte(mm, vma, page, pte, vma->vm_flags & VM_WRITE);
+		flush_tlb_page(vma, address);
+		update_mmu_cache(vma, address, *pte);
+	} else {
+		put_page(page);
+	}
+	spin_unlock(&mm->page_table_lock);
+	unlock_page(page);
+	ret = VM_FAULT_MINOR;
+out:
+	return ret;
+}
+
 int hugetlb_prefault(struct address_space *mapping, struct vm_area_struct *vma)
 {
 	struct mm_struct *mm = current->mm;
@@ -235,46 +309,26 @@ int hugetlb_prefault(struct address_spac
 	BUG_ON(vma->vm_start & ~HPAGE_MASK);
 	BUG_ON(vma->vm_end & ~HPAGE_MASK);
 
+#if 0
 	spin_lock(&mm->page_table_lock);
 	for (addr = vma->vm_start; addr < vma->vm_end; addr += HPAGE_SIZE) {
-		unsigned long idx;
-		pte_t *pte = huge_pte_alloc(mm, addr);
-		struct page *page;
-
-		if (!pte) {
-			ret = -ENOMEM;
-			goto out;
+		if (addr < vma->vm_start)
+			addr = vma->vm_start;
+		if (addr >= vma->vm_end) {
+			ret = 0;
+			break;
 		}
-		if (!pte_none(*pte))
-			continue;
-
-		idx = ((addr - vma->vm_start) >> HPAGE_SHIFT)
-			+ (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
-		page = find_get_page(mapping, idx);
-		if (!page) {
-			/* charge the fs quota first */
-			if (hugetlb_get_quota(mapping)) {
-				ret = -ENOMEM;
-				goto out;
-			}
-			page = alloc_huge_page();
-			if (!page) {
-				hugetlb_put_quota(mapping);
-				ret = -ENOMEM;
-				goto out;
-			}
-			ret = add_to_page_cache(page, mapping, idx, GFP_ATOMIC);
-			if (! ret) {
-				unlock_page(page);
-			} else {
-				hugetlb_put_quota(mapping);
-				free_huge_page(page);
-				goto out;
-			}
+		spin_unlock(&mm->page_table_lock);
+		ret = hugetlb_fault(mm, vma, 1, addr);
+		schedule();
+		spin_lock(&mm->page_table_lock);
+		if (ret == VM_FAULT_SIGBUS) {
+			ret = -ENOMEM;
+			break;
 		}
-		set_huge_pte(mm, vma, page, pte, vma->vm_flags & VM_WRITE);
+		ret = 0;
 	}
-out:
 	spin_unlock(&mm->page_table_lock);
+#endif
 	return ret;
 }
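
With the prefault loop disabled and faults handled on demand, a hugetlbfs mapping no longer has to be fully backed at mmap() time; a shortage of huge pages then shows up as SIGBUS on first access instead of an mmap() failure. A hypothetical userspace sketch (the /mnt/huge mount point and the 4 MB huge page size of non-PAE i386 are assumptions):

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#define HUGE_SZ	(4UL << 20)	/* one 4 MB huge page on non-PAE i386 */

int main(void)
{
	int fd = open("/mnt/huge/example", O_CREAT | O_RDWR, 0600);
	char *p;

	if (fd < 0)
		return 1;
	p = mmap(NULL, HUGE_SZ, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return 1;
	p[0] = 1;	/* first touch now goes through hugetlb_fault() */
	munmap(p, HUGE_SZ);
	close(fd);
	return 0;
}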

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH] memory hotremoval for linux-2.6.7 [10/16]
  2004-07-14 13:41 [PATCH] memory hotremoval for linux-2.6.7 [0/16] Hirokazu Takahashi
                   ` (8 preceding siblings ...)
  2004-07-14 14:05 ` [PATCH] memory hotremoval for linux-2.6.7 [9/16] Hirokazu Takahashi
@ 2004-07-14 14:05 ` Hirokazu Takahashi
  2004-07-14 14:05 ` [PATCH] memory hotremoval for linux-2.6.7 [11/16] Hirokazu Takahashi
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: Hirokazu Takahashi @ 2004-07-14 14:05 UTC (permalink / raw)
  To: linux-kernel, lhms-devel; +Cc: linux-mm

--- linux-2.6.7.ORG/mm/hugetlb.c	Thu Jun 17 15:17:51 2032
+++ linux-2.6.7/mm/hugetlb.c	Thu Jun 17 15:21:18 2032
@@ -15,8 +15,20 @@ const unsigned long hugetlb_zero = 0, hu
 static unsigned long nr_huge_pages, free_huge_pages;
 unsigned long max_huge_pages;
 static struct list_head hugepage_freelists[MAX_NUMNODES];
+static struct list_head hugepage_alllists[MAX_NUMNODES];
 static spinlock_t hugetlb_lock = SPIN_LOCK_UNLOCKED;
 
+static void register_huge_page(struct page *page)
+{
+	list_add(&page[1].lru,
+		&hugepage_alllists[page_zone(page)->zone_pgdat->node_id]);
+}
+
+static void unregister_huge_page(struct page *page)
+{
+	list_del(&page[1].lru);
+}
+
 static void enqueue_huge_page(struct page *page)
 {
 	list_add(&page->lru,
@@ -90,14 +102,17 @@ static int __init hugetlb_init(void)
 	unsigned long i;
 	struct page *page;
 
-	for (i = 0; i < MAX_NUMNODES; ++i)
+	for (i = 0; i < MAX_NUMNODES; ++i) {
 		INIT_LIST_HEAD(&hugepage_freelists[i]);
+		INIT_LIST_HEAD(&hugepage_alllists[i]);
+	}
 
 	for (i = 0; i < max_huge_pages; ++i) {
 		page = alloc_fresh_huge_page();
 		if (!page)
 			break;
 		spin_lock(&hugetlb_lock);
+		register_huge_page(page);
 		enqueue_huge_page(page);
 		spin_unlock(&hugetlb_lock);
 	}
@@ -139,6 +154,7 @@ static int try_to_free_low(unsigned long
 			if (PageHighMem(page))
 				continue;
 			list_del(&page->lru);
+			unregister_huge_page(page);
 			update_and_free_page(page);
 			--free_huge_pages;
 			if (!--count)
@@ -161,6 +177,7 @@ static unsigned long set_max_huge_pages(
 		if (!page)
 			return nr_huge_pages;
 		spin_lock(&hugetlb_lock);
+		register_huge_page(page);
 		enqueue_huge_page(page);
 		free_huge_pages++;
 		nr_huge_pages++;
@@ -174,6 +191,7 @@ static unsigned long set_max_huge_pages(
 		struct page *page = dequeue_huge_page();
 		if (!page)
 			break;
+		unregister_huge_page(page);
 		update_and_free_page(page);
 	}
 	spin_unlock(&hugetlb_lock);

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH] memory hotremoval for linux-2.6.7 [11/16]
  2004-07-14 13:41 [PATCH] memory hotremoval for linux-2.6.7 [0/16] Hirokazu Takahashi
                   ` (9 preceding siblings ...)
  2004-07-14 14:05 ` [PATCH] memory hotremoval for linux-2.6.7 [10/16] Hirokazu Takahashi
@ 2004-07-14 14:05 ` Hirokazu Takahashi
  2004-07-14 14:05 ` [BUG][PATCH] memory hotremoval for linux-2.6.7 [12/16] Hirokazu Takahashi
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: Hirokazu Takahashi @ 2004-07-14 14:05 UTC (permalink / raw)
  To: linux-kernel, lhms-devel; +Cc: linux-mm


--- linux-2.6.7.ORG/include/linux/hugetlb.h	Mon Jul  5 14:05:39 2032
+++ linux-2.6.7/include/linux/hugetlb.h	Mon Jul  5 14:06:19 2032
@@ -27,6 +27,7 @@ struct page *follow_huge_pmd(struct mm_s
 				pmd_t *pmd, int write);
 extern int hugetlb_fault(struct mm_struct *, struct vm_area_struct *,
 				int, unsigned long);
+int try_to_unmap_hugepage(struct page *page, struct vm_area_struct *vma, struct list_head *force);
 int is_aligned_hugepage_range(unsigned long addr, unsigned long len);
 int pmd_huge(pmd_t pmd);
 struct page *alloc_huge_page(void);
@@ -84,6 +85,7 @@ static inline unsigned long hugetlb_tota
 #define alloc_huge_page()			({ NULL; })
 #define free_huge_page(p)			({ (void)(p); BUG(); })
 #define hugetlb_fault(mm, vma, write, addr)	0
+#define try_to_unmap_hugepage(page, vma, force)	0
 
 #ifndef HPAGE_MASK
 #define HPAGE_MASK	0		/* Keep the compiler happy */
--- linux-2.6.7.ORG/mm/rmap.c	Mon Jul  5 14:01:22 2032
+++ linux-2.6.7/mm/rmap.c	Mon Jul  5 14:06:19 2032
@@ -27,6 +27,7 @@
  *   on the mm->page_table_lock
  */
 #include <linux/mm.h>
+#include <linux/hugetlb.h>
 #include <linux/pagemap.h>
 #include <linux/swap.h>
 #include <linux/swapops.h>
@@ -441,6 +442,13 @@ static int try_to_unmap_one(struct page 
 	address = vma_address(page, vma);
 	if (address == -EFAULT)
 		goto out;
+
+	/*
+	 * Is there any better way to check whether the page is
+	 * HugePage or not?
+	 */
+	if (vma && is_vm_hugetlb_page(vma))
+		return try_to_unmap_hugepage(page, vma, force);
 
 	/*
 	 * We need the page_table_lock to protect us from page faults,
--- linux-2.6.7.ORG/arch/i386/mm/hugetlbpage.c	Mon Jul  5 14:05:39 2032
+++ linux-2.6.7/arch/i386/mm/hugetlbpage.c	Mon Jul  5 14:06:19 2032
@@ -10,6 +10,7 @@
 #include <linux/mm.h>
 #include <linux/hugetlb.h>
 #include <linux/pagemap.h>
+#include <linux/rmap.h>
 #include <linux/smp_lock.h>
 #include <linux/slab.h>
 #include <linux/err.h>
@@ -83,6 +84,7 @@ int copy_hugetlb_page_range(struct mm_st
 		if (!pte_none(entry)) {
 			ptepage = pte_page(entry);
 			get_page(ptepage);
+			page_dup_rmap(ptepage);
 			dst->rss += (HPAGE_SIZE / PAGE_SIZE);
 		}
 		set_pte(dst_pte, entry);
@@ -234,6 +236,7 @@ void unmap_hugepage_range(struct vm_area
 		if (pte_none(pte))
 			continue;
 		page = pte_page(pte);
+		page_remove_rmap(page);
 		put_page(page);
 		mm->rss -= (HPAGE_SIZE / PAGE_SIZE);
 	}
@@ -288,6 +291,7 @@ again:
 	spin_lock(&mm->page_table_lock);
 	if (pte_none(*pte)) {
 		set_huge_pte(mm, vma, page, pte, vma->vm_flags & VM_WRITE);
+		page_add_file_rmap(page);
 		flush_tlb_page(vma, address);
 		update_mmu_cache(vma, address, *pte);
 	} else {
@@ -332,3 +336,87 @@ int hugetlb_prefault(struct address_spac
 #endif
 	return ret;
 }
+
+/*
+ * At what user virtual address is page expected in vma?
+ */
+static inline unsigned long
+huge_vma_address(struct page *page, struct vm_area_struct *vma)
+{
+	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	unsigned long address;
+
+	address = vma->vm_start + ((pgoff - vma->vm_pgoff) << HPAGE_SHIFT);
+	if (unlikely(address < vma->vm_start || address >= vma->vm_end)) {
+		/* page should be within any vma from prio_tree_next */
+		BUG_ON(!PageAnon(page));
+		return -EFAULT;
+	}
+	return address;
+}
+
+/*
+ * Try to clear the PTE which maps the hugepage.
+ */
+int try_to_unmap_hugepage(struct page *page, struct vm_area_struct *vma,
+				struct list_head *force)
+{
+	pte_t *pte;
+	pte_t pteval;
+	int ret = SWAP_AGAIN;
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long address;
+
+	address = huge_vma_address(page, vma);
+	if (address == -EFAULT)
+		goto out;
+
+	/*
+	 * We need the page_table_lock to protect us from page faults,
+	 * munmap, fork, etc...
+	 */
+	if (!spin_trylock(&mm->page_table_lock))
+		goto out;
+
+	pte = huge_pte_offset(mm, address);
+	if (!pte || pte_none(*pte))
+		goto out_unlock;
+	if (!pte_present(*pte))
+		goto out_unlock;
+
+	if (page_to_pfn(page) != pte_pfn(*pte))
+		goto out_unlock;
+
+	BUG_ON(!vma);
+
+#if 0 
+	if ((vma->vm_flags & (VM_LOCKED|VM_RESERVED)) ||
+			ptep_test_and_clear_young(pte)) {
+		ret = SWAP_FAIL;
+		goto out_unlock;
+	}
+#endif
+
+	/* Nuke the page table entry. */
+	flush_cache_page(vma, address);
+	pteval = ptep_get_and_clear(pte);
+	flush_tlb_range(vma, address, address + HPAGE_SIZE);
+
+	/* Move the dirty bit to the physical page now the pte is gone. */
+	if (pte_dirty(pteval))
+		set_page_dirty(page);
+
+	BUG_ON(PageAnon(page));
+
+	mm->rss -= (HPAGE_SIZE / PAGE_SIZE);
+	BUG_ON(!page->mapcount);
+	page->mapcount--;
+	page_cache_release(page);
+
+out_unlock:
+	spin_unlock(&mm->page_table_lock);
+
+out:
+	return ret;
+}
+
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [BUG][PATCH] memory hotremoval for linux-2.6.7 [12/16]
  2004-07-14 13:41 [PATCH] memory hotremoval for linux-2.6.7 [0/16] Hirokazu Takahashi
                   ` (10 preceding siblings ...)
  2004-07-14 14:05 ` [PATCH] memory hotremoval for linux-2.6.7 [11/16] Hirokazu Takahashi
@ 2004-07-14 14:05 ` Hirokazu Takahashi
  2004-07-14 14:06 ` [PATCH] memory hotremoval for linux-2.6.7 [13/16] Hirokazu Takahashi
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: Hirokazu Takahashi @ 2004-07-14 14:05 UTC (permalink / raw)
  To: linux-kernel, lhms-devel; +Cc: linux-mm

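try_to_free_low() deletes the entry it is iterating over: the loop body
calls list_del(&page->lru), but list_for_each_entry() advances by reading
page->lru.next after the body has run, and list_del() has already
poisoned that pointer.  Switching to list_for_each_entry_safe() fetches
the next entry before the body runs, so unlinking the current one is
safe.  A minimal sketch of the pattern (illustrative only, not part of
the patch below):

	struct page *page, *next;

	list_for_each_entry_safe(page, next, &hugepage_freelists[i], lru) {
		/*
		 * "next" was sampled before the body ran, so it stays
		 * valid even after "page" is unlinked and freed.
		 */
		list_del(&page->lru);
		update_and_free_page(page);
	}
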
--- linux-2.6.7/mm/hugetlb.c.save	Wed Jul  7 18:34:06 2032
+++ linux-2.6.7/mm/hugetlb.c	Wed Jul  7 18:35:10 2032
@@ -149,8 +149,8 @@ static int try_to_free_low(unsigned long
 {
 	int i;
 	for (i = 0; i < MAX_NUMNODES; ++i) {
-		struct page *page;
-		list_for_each_entry(page, &hugepage_freelists[i], lru) {
+		struct page *page, *page1;
+		list_for_each_entry_safe(page, page1, &hugepage_freelists[i], lru) {
 			if (PageHighMem(page))
 				continue;
 			list_del(&page->lru);
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH] memory hotremoval for linux-2.6.7 [13/16]
  2004-07-14 13:41 [PATCH] memory hotremoval for linux-2.6.7 [0/16] Hirokazu Takahashi
                   ` (11 preceding siblings ...)
  2004-07-14 14:05 ` [BUG][PATCH] memory hotremoval for linux-2.6.7 [12/16] Hirokazu Takahashi
@ 2004-07-14 14:06 ` Hirokazu Takahashi
  2004-07-14 14:06 ` [PATCH] memory hotremoval for linux-2.6.7 [14/16] Hirokazu Takahashi
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: Hirokazu Takahashi @ 2004-07-14 14:06 UTC (permalink / raw)
  To: linux-kernel, lhms-devel; +Cc: linux-mm


--- linux-2.6.7.ORG/include/linux/hugetlb.h	Sun Jul 11 11:33:45 2032
+++ linux-2.6.7/include/linux/hugetlb.h	Sun Jul 11 11:34:11 2032
@@ -28,6 +28,7 @@ struct page *follow_huge_pmd(struct mm_s
 extern int hugetlb_fault(struct mm_struct *, struct vm_area_struct *,
 				int, unsigned long);
 int try_to_unmap_hugepage(struct page *page, struct vm_area_struct *vma, struct list_head *force);
+int mmigrate_hugetlb_pages(struct zone *);
 int is_aligned_hugepage_range(unsigned long addr, unsigned long len);
 int pmd_huge(pmd_t pmd);
 struct page *alloc_huge_page(void);
@@ -86,6 +87,7 @@ static inline unsigned long hugetlb_tota
 #define free_huge_page(p)			({ (void)(p); BUG(); })
 #define hugetlb_fault(mm, vma, write, addr)	0
 #define try_to_unmap_hugepage(page, vma, force)	0
+#define mmigrate_hugetlb_pages(zone)		0
 
 #ifndef HPAGE_MASK
 #define HPAGE_MASK	0		/* Keep the compiler happy */
--- linux-2.6.7.ORG/arch/i386/mm/hugetlbpage.c	Sun Jul 11 11:33:45 2032
+++ linux-2.6.7/arch/i386/mm/hugetlbpage.c	Sun Jul 11 11:34:11 2032
@@ -288,6 +288,15 @@ again:
 			goto again;
 		}
 	}
+
+	if (page->mapping == NULL) {
+		BUG_ON(!PageAgain(page));
+		/* This page will go back to freelists[] */
+		put_page(page);	/* XXX */
+		unlock_page(page);
+		goto again;
+	}
+
 	spin_lock(&mm->page_table_lock);
 	if (pte_none(*pte)) {
 		set_huge_pte(mm, vma, page, pte, vma->vm_flags & VM_WRITE);
--- linux-2.6.7.ORG/mm/memhotplug.c	Sun Jul 11 12:56:53 2032
+++ linux-2.6.7/mm/memhotplug.c	Sun Jul 11 12:56:17 2032
@@ -17,6 +17,7 @@
 #include <linux/buffer_head.h>
 #include <linux/mm_inline.h>
 #include <linux/rmap.h>
+#include <linux/hugetlb.h>
 #include <linux/memhotplug.h>
 
 #ifdef CONFIG_KDB
@@ -841,6 +842,10 @@ int mmigrated(void *p)
 	current->flags |= PF_KSWAPD;	/*  It's fake */
 	if (down_trylock(&mmigrated_sem)) {
 		printk("mmigrated already running\n");
+		return 0;
+	}
+	if (mmigrate_hugetlb_pages(zone)) {
+		up(&mmigrated_sem);
 		return 0;
 	}
 	on_each_cpu(lru_drain_schedule, NULL, 1, 1);
--- linux-2.6.7.ORG/mm/hugetlb.c	Sun Jul 11 11:30:50 2032
+++ linux-2.6.7/mm/hugetlb.c	Sun Jul 11 13:14:25 2032
@@ -1,6 +1,7 @@
 /*
  * Generic hugetlb support.
  * (C) William Irwin, April 2004
+ * Support of memory hotplug for hugetlbpages, Hirokazu Takahashi, Jul 2004
  */
 #include <linux/gfp.h>
 #include <linux/list.h>
@@ -8,6 +9,8 @@
 #include <linux/module.h>
 #include <linux/mm.h>
 #include <linux/hugetlb.h>
+#include <linux/pagemap.h>
+#include <linux/memhotplug.h>
 #include <linux/sysctl.h>
 #include <linux/highmem.h>
 
@@ -58,6 +61,9 @@ static struct page *alloc_fresh_huge_pag
 {
 	static int nid = 0;
 	struct page *page;
+	struct pglist_data *pgdat;
+	while ((pgdat = NODE_DATA(nid)) == NULL || !pgdat->enabled)
+		nid = (nid + 1) % numnodes;
 	page = alloc_pages_node(nid, GFP_HIGHUSER|__GFP_COMP,
 					HUGETLB_PAGE_ORDER);
 	nid = (nid + 1) % numnodes;
@@ -91,6 +97,8 @@ struct page *alloc_huge_page(void)
 	free_huge_pages--;
 	spin_unlock(&hugetlb_lock);
 	set_page_count(page, 1);
+	page->flags &= ~(1 << PG_uptodate | 1 << PG_error |
+			 1 << PG_referenced | 1 << PG_again);
 	page[1].mapping = (void *)free_huge_page;
 	for (i = 0; i < (HPAGE_SIZE/PAGE_SIZE); ++i)
 		clear_highpage(&page[i]);
@@ -144,25 +152,36 @@ static void update_and_free_page(struct 
 	__free_pages(page, HUGETLB_PAGE_ORDER);
 }
 
-#ifdef CONFIG_HIGHMEM
-static int try_to_free_low(unsigned long count)
+static int
+try_to_free_hugepages(int idx, unsigned long count, struct zone *zone)
 {
-	int i;
-	for (i = 0; i < MAX_NUMNODES; ++i) {
-		struct page *page, *page1;
-		list_for_each_entry_safe(page, page1, &hugepage_freelists[i], lru) {
+	struct page *page, *page1;
+	list_for_each_entry_safe(page, page1, &hugepage_freelists[idx], lru) {
+		if (zone) {
+			if (page_zone(page) != zone)
+				continue;
+		} else {
 			if (PageHighMem(page))
 				continue;
-			list_del(&page->lru);
-			unregister_huge_page(page);
-			update_and_free_page(page);
-			--free_huge_pages;
-			if (!--count)
-				return 0;
 		}
+		list_del(&page->lru);
+		unregister_huge_page(page);
+		update_and_free_page(page);
+		--free_huge_pages;
+		if (!--count)
+			return 0;
 	}
 	return count;
 }
+
+#ifdef CONFIG_HIGHMEM
+static int try_to_free_low(unsigned long count)
+{
+	int i;
+	for (i = 0; i < MAX_NUMNODES; ++i)
+		count = try_to_free_hugepages(i, count, NULL);
+	return count;
+}
 #else
 static inline int try_to_free_low(unsigned long count)
 {
@@ -250,10 +269,8 @@ unsigned long hugetlb_total_pages(void)
 EXPORT_SYMBOL(hugetlb_total_pages);
 
 /*
- * We cannot handle pagefaults against hugetlb pages at all.  They cause
- * handle_mm_fault() to try to instantiate regular-sized pages in the
- * hugegpage VMA.  do_page_fault() is supposed to trap this, so BUG is we get
- * this far.
+ * hugetlb_nopage() is never called now that hugetlb_fault() has
+ * been implemented.
  */
 static struct page *hugetlb_nopage(struct vm_area_struct *vma,
 				unsigned long address, int *unused)
@@ -275,3 +292,200 @@ void zap_hugepage_range(struct vm_area_s
 	unmap_hugepage_range(vma, start, start + length);
 	spin_unlock(&mm->page_table_lock);
 }
+
+#ifdef CONFIG_MEMHOTPLUG
+static int copy_hugepage(struct page *to, struct page *from)
+{
+	int size;
+	for (size = 0; size < HPAGE_SIZE; size += PAGE_SIZE) {
+		copy_highpage(to, from);
+		to++;
+		from++;
+	}
+	return 0;
+}
+
+/*
+ * Allocate a hugepage from Buddy allocator directly.
+ */
+static struct page *
+hugepage_mmigrate_alloc(int nid)
+{
+	struct page *page;
+	/* 
+	 * TODO:
+	 * - NUMA-aware page allocation is required; we should allocate
+	 *   a hugepage from the node that the process depends on.
+	 * - New hugepages should be preallocated prior to migrating pages
+	 *   so that a shortage of memory is detected before migration starts.
+	 * - New hugepages should be allocated from the node specified by nid.
+	 */
+	page = alloc_fresh_huge_page();
+	
+	if (page == NULL) {
+		printk(KERN_WARNING "remap: Failed to allocate new hugepage\n");
+	} else {
+		spin_lock(&hugetlb_lock);
+		register_huge_page(page);
+		enqueue_huge_page(page);
+		free_huge_pages++;
+		nr_huge_pages++;
+		spin_unlock(&hugetlb_lock);
+	}
+	page = alloc_huge_page();
+	unregister_huge_page(page);	/* XXXX */
+	return page;
+}
+
+/*
+ * Free a hugepage into Buddy allocator directly.
+ */
+static int
+hugepage_delete(struct page *page)
+{
+	BUG_ON(page_count(page) != 1);
+	BUG_ON(page->mapping);
+
+	spin_lock(&hugetlb_lock);
+	page[1].mapping = NULL;
+	update_and_free_page(page);
+	spin_unlock(&hugetlb_lock);
+	return 0;
+}
+
+static int
+hugepage_register(struct page *page, int active)
+{
+	spin_lock(&hugetlb_lock);
+	register_huge_page(page);
+	spin_unlock(&hugetlb_lock);
+	return 0;
+}
+
+static int
+hugepage_release_buffer(struct page *page)
+{
+	BUG();
+	return -1;
+}
+
+/*
+ * Hugetlbpage migration is harder than regular page migration
+ * because hugetlbpages lack swap-related features. To make it
+ * possible, new features have been introduced:
+ *  - an rmap mechanism to unmap a hugetlbpage.
+ *  - a pagefault handler for hugetlbpages.
+ *  - a list on which all hugetlbpages are put, instead of the LRU
+ *    lists used for regular pages.
+ * With these features, hugetlbpages can be handled in the same way
+ * as regular pages.
+ * 
+ * The following is a flow to migrate hugetlbpages:
+ *  1. allocate a new hugetlbpage.
+ *    a. look for an appropriate section for a hugetlbpage.
 + *    b. migrate all pages in the section to another zone.
+ *    c. allocate it as a new hugetlbpage.
+ *  2. lock the new hugetlbpage and don't set PG_uptodate flag on it.
+ *  3. replace the old hugetlbpage's entry in the corresponding
+ *     radix tree on hugetlbfs with the new hugetlbpage.
+ *  4. clear all PTEs that refer to the old hugetlbpage.
+ *  5. wait until all references to the old hugetlbpage have gone.
+ *  6. copy from the old hugetlbpage to the new hugetlbpage.
+ *  7. set PG_uptodate flag of the new hugetlbpage.
+ *  8. release the old hugetlbpage into the Buddy allocator directly.
+ *  9. unlock the new hugetlbpage and wakeup all waiters.
+ *
+ * If a new access to a hugetlbpage under migration occurs, it will be
+ * blocked in the pagefault handler until the migration has completed.
+ *
+ *
+ * disabled+------+---------------------------+------+------+---
+ * zone    |      |       old hugepage        |      |      |
+ *         +------+-------------|-------------+------+------+---
+ *                              +--migrate
+ *                                    |
+ *                                    V
+ *                        <-- reserve new hugepage -->
+ *           page   page   page   page   page   page   page  
+ *         +------+------+------+------+------+------+------+---
+ * zone    |      |      |Booked|Booked|Booked|Booked|      |   
+ *         +------+------+--|---+--|---+------+------+------+---
+ *                          |      |
+ *                 migrate--+      +--------------+
+ *                    |                           |
+ * other   +------+---V--+------+------+---    migrate
+ * zones   |      |      |      |      |          |
+ *         +------+------+------+------+--        |
+ *         +------+------+------+------+------+---V--+------+---
+ *         |      |      |      |      |      |      |      |
+ *         +------+------+------+------+------+------+------+---
+ */
+
+static struct mmigrate_operations hugepage_mmigrate_ops = {
+	.mmigrate_alloc_page       = hugepage_mmigrate_alloc,
+	.mmigrate_free_page        = hugepage_delete,
+	.mmigrate_copy_page        = copy_hugepage,
+	.mmigrate_lru_add_page     = hugepage_register,
+	.mmigrate_release_buffers  = hugepage_release_buffer,
+	.mmigrate_prepare          = NULL,
+	.mmigrate_stick_page       = NULL
+};
+
+int mmigrate_hugetlb_pages(struct zone *zone)
+{
+	struct page *page, *page1, *map;
+	int idx = zone->zone_pgdat->node_id;
+	LIST_HEAD(templist);
+	int rest = 0;
+
+	/*
+	 *  Release unused hugetlbpages corresponding to the specified zone.
+	 */
+	spin_lock(&hugetlb_lock);
+	try_to_free_hugepages(idx, free_huge_pages, zone);
+	spin_unlock(&hugetlb_lock);
+/* 	max_huge_pages = set_max_huge_pages(max_huge_pages); */
+
+	/*
+	 * Look for all hugetlbpages corresponding to the specified zone.
+	 */
+	spin_lock(&hugetlb_lock);
+	list_for_each_entry_safe(page, map, &hugepage_alllists[idx], lru) {
+		/*
+		 * looking for all hugetlbpages corresponding to the
+		 * specified zone.
+		 */
+		if (page_zone(page) != zone)
+			continue;
+		page_cache_get(page-1);
+		unregister_huge_page(page-1);
+		list_add(&page->lru, &templist);
+	}
+	spin_unlock(&hugetlb_lock);
+
+	/*
+	 * Try to migrate the pages one by one.
+	 */
+	list_for_each_entry_safe(page1, map, &templist, lru) {
+		list_del(&page1->lru);
+		INIT_LIST_HEAD(&page1->lru);
+		page = page1 - 1;
+
+		if (page_count(page) <= 1 || page->mapping == NULL ||
+		    mmigrate_onepage(page, MIGRATE_ANYNODE, 0, &hugepage_mmigrate_ops)) {
+			/* free the page later */
+			spin_lock(&hugetlb_lock);
+			register_huge_page(page);
+			spin_unlock(&hugetlb_lock);
+			page_cache_release(page);
+			rest++;
+		}
+	}
+
+	/*
+	 *  Reallocate unused hugetlbpages.
+	 */
+	max_huge_pages = set_max_huge_pages(max_huge_pages);
+	return rest;
+}
+#endif /* CONFIG_MEMHOTPLUG */
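
For reference, the flow described in the comment above is driven by
mmigrate_onepage() (introduced earlier in this series, not shown here)
through the hugepage_mmigrate_ops callbacks.  A rough sketch of the
per-page sequence is given below; the function name and the nid
argument are purely illustrative, and steps 3-5 are indicated only by
comments:

	/* Illustrative sketch -- not the real mmigrate_onepage(). */
	static int hugepage_migrate_sketch(struct page *oldpage, int nid,
					   struct mmigrate_operations *ops)
	{
		struct page *newpage;

		/* 1. allocate a replacement hugepage */
		newpage = ops->mmigrate_alloc_page(nid);
		if (newpage == NULL)
			return -ENOMEM;

		/* 2. lock it; PG_uptodate stays clear so that faulting
		 *    tasks block until the copy has finished */
		lock_page(newpage);

		/* 3. replace oldpage with newpage in the hugetlbfs radix tree */
		/* 4. clear every PTE that still maps oldpage (rmap) */
		/* 5. wait until only our reference to oldpage remains */

		/* 6. copy the contents, one PAGE_SIZE chunk at a time */
		ops->mmigrate_copy_page(newpage, oldpage);

		/* 7. the new hugepage now carries valid data */
		SetPageUptodate(newpage);

		/* 8. give the old hugepage back to the buddy allocator */
		ops->mmigrate_free_page(oldpage);

		/* 9. wake up everybody blocked in hugetlb_fault() */
		unlock_page(newpage);
		return 0;
	}
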
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH] memory hotremoval for linux-2.6.7 [14/16]
  2004-07-14 13:41 [PATCH] memory hotremoval for linux-2.6.7 [0/16] Hirokazu Takahashi
                   ` (12 preceding siblings ...)
  2004-07-14 14:06 ` [PATCH] memory hotremoval for linux-2.6.7 [13/16] Hirokazu Takahashi
@ 2004-07-14 14:06 ` Hirokazu Takahashi
  2004-07-14 14:06 ` [BUG] [PATCH] memory hotremoval for linux-2.6.7 [15/16] Hirokazu Takahashi
  2004-07-14 14:06 ` [PATCH] memory hotremoval for linux-2.6.7 [16/16] Hirokazu Takahashi
  15 siblings, 0 replies; 17+ messages in thread
From: Hirokazu Takahashi @ 2004-07-14 14:06 UTC (permalink / raw)
  To: linux-kernel, lhms-devel; +Cc: linux-mm


--- linux-2.6.7.ORG/mm/page_alloc.c	Thu Jun 17 15:19:28 2032
+++ linux-2.6.7/mm/page_alloc.c	Thu Jun 17 15:26:37 2032
@@ -2386,6 +2386,8 @@ int lower_zone_protection_sysctl_handler
 }
 
 #ifdef CONFIG_MEMHOTPLUG
+extern int mhtest_hpage_read(char *p, int, int);
+
 static int mhtest_read(char *page, char **start, off_t off, int count,
     int *eof, void *data)
 {
@@ -2409,9 +2411,15 @@ static int mhtest_read(char *page, char 
 				/* skip empty zone */
 				continue;
 			len = sprintf(p,
-			    "\t%s[%d]: free %ld, active %ld, present %ld\n",
+			    "\t%s[%d]: free %ld, active %ld, present %ld",
 			    z->name, NODEZONE(i, j),
 			    z->free_pages, z->nr_active, z->present_pages);
+			p += len;
+#if defined(CONFIG_HUGETLB_PAGE) && defined(CONFIG_MEMHOTPLUG)
+			len = mhtest_hpage_read(p, i, j);
+			p += len;
+#endif
+			len = sprintf(p, "\n");
 			p += len;
 		}
 		*p++ = '\n';
--- linux-2.6.7.ORG/mm/hugetlb.c	Thu Jun 17 15:26:09 2032
+++ linux-2.6.7/mm/hugetlb.c	Thu Jun 17 15:26:37 2032
@@ -260,6 +260,24 @@ static unsigned long set_max_huge_pages(
 	return nr_huge_pages;
 }
 
+#ifdef CONFIG_MEMHOTPLUG
+int mhtest_hpage_read(char *p, int nodenum, int zonenum)
+{
+	struct page *page;
+	int total = 0;
+	int free = 0;
+	spin_lock(&hugetlb_lock);
+	list_for_each_entry(page, &hugepage_alllists[nodenum], lru) {
+		if (page_zonenum(page) == zonenum) total++;
+	}
+	list_for_each_entry(page, &hugepage_freelists[nodenum], lru) {
+		if (page_zonenum(page) == zonenum) free++;
+	}
+	spin_unlock(&hugetlb_lock);
+	return sprintf(p, " / HugePage free %d, total %d", free, total);
+}
+#endif
+
 #ifdef CONFIG_SYSCTL
 int hugetlb_sysctl_handler(struct ctl_table *table, int write,
 			   struct file *file, void __user *buffer,
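
With illustrative numbers, a zone line of the memory hotplug test /proc
output now looks like

	Normal[1]: free 12345, active 678, present 225280 / HugePage free 2, total 4

i.e. the existing free/active/present line simply gains a hugepage
summary for that zone instead of ending after the "present" field.
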
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [BUG] [PATCH] memory hotremoval for linux-2.6.7 [15/16]
  2004-07-14 13:41 [PATCH] memory hotremoval for linux-2.6.7 [0/16] Hirokazu Takahashi
                   ` (13 preceding siblings ...)
  2004-07-14 14:06 ` [PATCH] memory hotremoval for linux-2.6.7 [14/16] Hirokazu Takahashi
@ 2004-07-14 14:06 ` Hirokazu Takahashi
  2004-07-14 14:06 ` [PATCH] memory hotremoval for linux-2.6.7 [16/16] Hirokazu Takahashi
  15 siblings, 0 replies; 17+ messages in thread
From: Hirokazu Takahashi @ 2004-07-14 14:06 UTC (permalink / raw)
  To: linux-kernel, lhms-devel; +Cc: linux-mm


--- linux-2.6.7/fs/direct-io.c.ORG	Fri Jun 18 13:52:47 2032
+++ linux-2.6.7/fs/direct-io.c	Fri Jun 18 13:53:49 2032
@@ -411,7 +411,7 @@ static int dio_bio_complete(struct dio *
 		for (page_no = 0; page_no < bio->bi_vcnt; page_no++) {
 			struct page *page = bvec[page_no].bv_page;
 
-			if (dio->rw == READ)
+			if (dio->rw == READ && !PageCompound(page))
 				set_page_dirty_lock(page);
 			page_cache_release(page);
 		}
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH] memory hotremoval for linux-2.6.7 [16/16]
  2004-07-14 13:41 [PATCH] memory hotremoval for linux-2.6.7 [0/16] Hirokazu Takahashi
                   ` (14 preceding siblings ...)
  2004-07-14 14:06 ` [BUG] [PATCH] memory hotremoval for linux-2.6.7 [15/16] Hirokazu Takahashi
@ 2004-07-14 14:06 ` Hirokazu Takahashi
  15 siblings, 0 replies; 17+ messages in thread
From: Hirokazu Takahashi @ 2004-07-14 14:06 UTC (permalink / raw)
  To: linux-kernel, lhms-devel; +Cc: linux-mm

--- linux-2.6.7.ORG/fs/direct-io.c	Thu Jun 17 15:17:13 2032
+++ linux-2.6.7/fs/direct-io.c	Thu Jun 17 15:28:44 2032
@@ -27,6 +27,7 @@
 #include <linux/slab.h>
 #include <linux/highmem.h>
 #include <linux/pagemap.h>
+#include <linux/hugetlb.h>
 #include <linux/bio.h>
 #include <linux/wait.h>
 #include <linux/err.h>
@@ -110,7 +111,11 @@ struct dio {
 	 * Page queue.  These variables belong to dio_refill_pages() and
 	 * dio_get_page().
 	 */
+#ifndef CONFIG_HUGETLB_PAGE
 	struct page *pages[DIO_PAGES];	/* page buffer */
+#else
+	struct page *pages[HPAGE_SIZE/PAGE_SIZE];	/* page buffer */
+#endif
 	unsigned head;			/* next page to process */
 	unsigned tail;			/* last valid page + 1 */
 	int page_errors;		/* errno from get_user_pages() */
@@ -143,9 +148,20 @@ static int dio_refill_pages(struct dio *
 {
 	int ret;
 	int nr_pages;
+	struct vm_area_struct * vma;
 
-	nr_pages = min(dio->total_pages - dio->curr_page, DIO_PAGES);
 	down_read(&current->mm->mmap_sem);
+#ifdef CONFIG_HUGETLB_PAGE
+	vma = find_vma(current->mm, dio->curr_user_address);
+	if (vma && is_vm_hugetlb_page(vma)) {
+		unsigned long n = dio->curr_user_address & PAGE_MASK;
+		n = (n & ~HPAGE_MASK) >> PAGE_SHIFT;
+		n = HPAGE_SIZE/PAGE_SIZE - n;
+		nr_pages = min(dio->total_pages - dio->curr_page, (int)n);
+	} else
+#endif
+		nr_pages = min(dio->total_pages - dio->curr_page, DIO_PAGES);
+
 	ret = get_user_pages(
 		current,			/* Task for fault acounting */
 		current->mm,			/* whose pages? */
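
This patch sizes the pages[] array to hold a full hugepage worth of
subpages and, in dio_refill_pages(), caps a single refill so that it
never crosses a hugepage boundary.  A worked example, assuming 4MB
hugepages on i386 without PAE (PAGE_SHIFT = 12, HPAGE_SIZE/PAGE_SIZE =
1024) and a purely illustrative address:

	addr = dio->curr_user_address      = 0x40203456
	addr & PAGE_MASK                   = 0x40203000
	     ... & ~HPAGE_MASK             = 0x00203000  (offset within the hugepage)
	     ... >> PAGE_SHIFT             = 515         (subpages already passed)
	HPAGE_SIZE/PAGE_SIZE - 515         = 509         (subpages left in this hugepage)

so nr_pages is at most 509 for this call to get_user_pages(), and the
next refill starts exactly on the following hugepage boundary.
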
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread
