* [PATCH] memory hotremoval for linux-2.6.7 [0/16]
@ 2004-07-14 13:41 Hirokazu Takahashi
2004-07-14 14:02 ` [PATCH] memory hotremoval for linux-2.6.7 [1/16] Hirokazu Takahashi
` (15 more replies)
0 siblings, 16 replies; 17+ messages in thread
From: Hirokazu Takahashi @ 2004-07-14 13:41 UTC (permalink / raw)
To: linux-kernel, lhms-devel; +Cc: linux-mm
Hi,
I'm pleased to say I've cleaned up the memory hotremoval patch
that Mr. Iwamoto implemented. Some of the ugly code is gone.
Main changes are:
- Renamed "remap" to "mmigrate", since the name "remap" was already
  used for other functionality.
- Made some of the memory hotremoval code shared with the swapout code.
- Added many comments describing the design of the memory hotremoval code.
- Added a basic function, try_to_migrate_pages(), to support memory
  sections. It keeps picking a suitable page in the specified section
  and migrating it while pages remain in the section (a rough usage
  sketch follows below).
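To illustrate the intent, here is a rough, untested sketch of how the
section support is meant to be driven. The section_range type and the
get_page_in_section() callback are hypothetical and only mirror
get_target_page() from patch 3/16, restricted to a pfn range; the
try_to_migrate_pages() signature, MIGRATE_ANYNODE and
steal_page_from_lru() are the ones introduced by this series.

#include <linux/mm.h>
#include <linux/mm_inline.h>
#include <linux/memhotplug.h>

/* hypothetical helper describing one memory section */
struct section_range {
	unsigned long start_pfn;
	unsigned long end_pfn;
};

/*
 * Selector callback: invoked by try_to_migrate_pages() with
 * zone->lru_lock held, just like get_target_page(), but it only
 * returns pages whose pfn falls inside the section.
 */
static struct page *
get_page_in_section(struct zone *zone, void *arg)
{
	struct section_range *range = arg;
	struct page *page, *page2;

	list_for_each_entry_safe(page, page2, &zone->inactive_list, lru) {
		if (page_to_pfn(page) < range->start_pfn ||
		    page_to_pfn(page) >= range->end_pfn)
			continue;
		if (steal_page_from_lru(zone, page) == NULL)
			continue;
		return page;
	}
	list_for_each_entry_safe(page, page2, &zone->active_list, lru) {
		if (page_to_pfn(page) < range->start_pfn ||
		    page_to_pfn(page) >= range->end_pfn)
			continue;
		if (steal_page_from_lru(zone, page) == NULL)
			continue;
		return page;
	}
	return NULL;
}

/* migrate everything in [start_pfn, end_pfn) out of the section */
static int
drain_section(struct zone *zone, unsigned long start_pfn,
	      unsigned long end_pfn)
{
	struct section_range range = { start_pfn, end_pfn };

	/* MIGRATE_ANYNODE: the destination pages may come from any node */
	return try_to_migrate_pages(zone, MIGRATE_ANYNODE,
				    get_page_in_section, &range);
}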
The patches are against linux-2.6.7.
Note that some of the patches fix bugs; without them, hugetlb page
migration won't work.
Thanks,
Hirokazu Takahashi.
* Re: [PATCH] memory hotremoval for linux-2.6.7 [1/16]
2004-07-14 13:41 [PATCH] memory hotremoval for linux-2.6.7 [0/16] Hirokazu Takahashi
@ 2004-07-14 14:02 ` Hirokazu Takahashi
2004-07-14 14:02 ` [PATCH] memory hotremoval for linux-2.6.7 [2/16] Hirokazu Takahashi
` (14 subsequent siblings)
15 siblings, 0 replies; 17+ messages in thread
From: Hirokazu Takahashi @ 2004-07-14 14:02 UTC (permalink / raw)
To: linux-kernel, lhms-devel; +Cc: linux-mm
--- linux-2.6.7.ORG/include/linux/mm_inline.h Sat Jul 10 12:42:43 2032
+++ linux-2.6.7/include/linux/mm_inline.h Sat Jul 10 12:34:19 2032
@@ -38,3 +38,42 @@ del_page_from_lru(struct zone *zone, str
zone->nr_inactive--;
}
}
+
+static inline struct page *
+steal_page_from_lru(struct zone *zone, struct page *page)
+{
+ if (!TestClearPageLRU(page))
+ BUG();
+ list_del(&page->lru);
+ if (get_page_testone(page)) {
+ /*
+ * It was already free! release_pages() or put_page()
+ * are about to remove it from the LRU and free it. So
+ * put the refcount back and put the page back on the
+ * LRU
+ */
+ __put_page(page);
+ SetPageLRU(page);
+ if (PageActive(page))
+ list_add(&page->lru, &zone->active_list);
+ else
+ list_add(&page->lru, &zone->inactive_list);
+ return NULL;
+ }
+ if (PageActive(page))
+ zone->nr_active--;
+ else
+ zone->nr_inactive--;
+ return page;
+}
+
+static inline void
+putback_page_to_lru(struct zone *zone, struct page *page)
+{
+ if (TestSetPageLRU(page))
+ BUG();
+ if (PageActive(page))
+ add_page_to_active_list(zone, page);
+ else
+ add_page_to_inactive_list(zone, page);
+}
--- linux-2.6.7.ORG/mm/vmscan.c Sat Jul 10 12:42:43 2032
+++ linux-2.6.7/mm/vmscan.c Sat Jul 10 12:41:29 2032
@@ -557,23 +557,11 @@ static void shrink_cache(struct zone *zo
prefetchw_prev_lru_page(page,
&zone->inactive_list, flags);
-
- if (!TestClearPageLRU(page))
- BUG();
- list_del(&page->lru);
- if (get_page_testone(page)) {
- /*
- * It is being freed elsewhere
- */
- __put_page(page);
- SetPageLRU(page);
- list_add(&page->lru, &zone->inactive_list);
+ if (steal_page_from_lru(zone, page) == NULL)
continue;
- }
list_add(&page->lru, &page_list);
nr_taken++;
}
- zone->nr_inactive -= nr_taken;
zone->pages_scanned += nr_taken;
spin_unlock_irq(&zone->lru_lock);
@@ -596,13 +584,8 @@ static void shrink_cache(struct zone *zo
*/
while (!list_empty(&page_list)) {
page = lru_to_page(&page_list);
- if (TestSetPageLRU(page))
- BUG();
list_del(&page->lru);
- if (PageActive(page))
- add_page_to_active_list(zone, page);
- else
- add_page_to_inactive_list(zone, page);
+ putback_page_to_lru(zone, page);
if (!pagevec_add(&pvec, page)) {
spin_unlock_irq(&zone->lru_lock);
__pagevec_release(&pvec);
@@ -655,26 +638,12 @@ refill_inactive_zone(struct zone *zone,
while (pgscanned < nr_pages && !list_empty(&zone->active_list)) {
page = lru_to_page(&zone->active_list);
prefetchw_prev_lru_page(page, &zone->active_list, flags);
- if (!TestClearPageLRU(page))
- BUG();
- list_del(&page->lru);
- if (get_page_testone(page)) {
- /*
- * It was already free! release_pages() or put_page()
- * are about to remove it from the LRU and free it. So
- * put the refcount back and put the page back on the
- * LRU
- */
- __put_page(page);
- SetPageLRU(page);
- list_add(&page->lru, &zone->active_list);
- } else {
+ if (steal_page_from_lru(zone, page) != NULL) {
list_add(&page->lru, &l_hold);
pgmoved++;
}
pgscanned++;
}
- zone->nr_active -= pgmoved;
spin_unlock_irq(&zone->lru_lock);
/*
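For reference, an illustrative sketch (not part of the patch) of the
calling contract for the two new helpers: both steal_page_from_lru()
and putback_page_to_lru() expect the caller to hold zone->lru_lock,
and a successfully stolen page carries an extra reference that must be
dropped once the page has been put back, as try_to_migrate_pages() in
patch 3/16 does. migrate_or_putback() is a hypothetical wrapper.

#include <linux/mm.h>
#include <linux/mm_inline.h>
#include <linux/pagemap.h>

static int
migrate_or_putback(struct zone *zone, struct page *page)
{
	struct page *isolated;

	spin_lock_irq(&zone->lru_lock);
	/* NULL means the page was already on its way to being freed */
	isolated = steal_page_from_lru(zone, page);
	spin_unlock_irq(&zone->lru_lock);

	if (isolated == NULL)
		return -1;

	/* ... migrate or reclaim the isolated page here ... */

	spin_lock_irq(&zone->lru_lock);
	putback_page_to_lru(zone, page);	/* restores active/inactive placement */
	spin_unlock_irq(&zone->lru_lock);
	page_cache_release(page);		/* drop the reference taken by the steal */
	return 0;
}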
* Re: [PATCH] memory hotremoval for linux-2.6.7 [2/16]
2004-07-14 13:41 [PATCH] memory hotremoval for linux-2.6.7 [0/16] Hirokazu Takahashi
2004-07-14 14:02 ` [PATCH] memory hotremoval for linux-2.6.7 [1/16] Hirokazu Takahashi
@ 2004-07-14 14:02 ` Hirokazu Takahashi
2004-07-14 14:03 ` [PATCH] memory hotremoval for linux-2.6.7 [3/16] Hirokazu Takahashi
` (13 subsequent siblings)
15 siblings, 0 replies; 17+ messages in thread
From: Hirokazu Takahashi @ 2004-07-14 14:02 UTC (permalink / raw)
To: linux-kernel, lhms-devel; +Cc: linux-mm
--- linux-2.6.7.ORG/include/linux/swap.h Sat Jul 10 12:30:17 2032
+++ linux-2.6.7/include/linux/swap.h Sat Jul 10 13:47:57 2032
@@ -174,6 +174,17 @@ extern void swap_setup(void);
/* linux/mm/vmscan.c */
extern int try_to_free_pages(struct zone **, unsigned int, unsigned int);
extern int shrink_all_memory(int);
+typedef enum {
+ /* failed to write page out, page is locked */
+ PAGE_KEEP,
+ /* move page to the active list, page is locked */
+ PAGE_ACTIVATE,
+ /* page has been sent to the disk successfully, page is unlocked */
+ PAGE_SUCCESS,
+ /* page is clean and locked */
+ PAGE_CLEAN,
+} pageout_t;
+extern pageout_t pageout(struct page *, struct address_space *);
extern int vm_swappiness;
#ifdef CONFIG_MMU
--- linux-2.6.7.ORG/mm/vmscan.c Sat Jul 10 15:13:47 2032
+++ linux-2.6.7/mm/vmscan.c Sat Jul 10 13:48:42 2032
@@ -236,22 +241,10 @@ static void handle_write_error(struct ad
unlock_page(page);
}
-/* possible outcome of pageout() */
-typedef enum {
- /* failed to write page out, page is locked */
- PAGE_KEEP,
- /* move page to the active list, page is locked */
- PAGE_ACTIVATE,
- /* page has been sent to the disk successfully, page is unlocked */
- PAGE_SUCCESS,
- /* page is clean and locked */
- PAGE_CLEAN,
-} pageout_t;
-
/*
* pageout is called by shrink_list() for each dirty page. Calls ->writepage().
*/
-static pageout_t pageout(struct page *page, struct address_space *mapping)
+pageout_t pageout(struct page *page, struct address_space *mapping)
{
/*
* If the page is dirty, only perform writeback if that write
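Exporting pageout() together with the pageout_t return codes lets the
migration code in patch 3/16 write dirty pages back itself. For
illustration only, here is a minimal sketch of a caller that holds the
page lock and dispatches on the return codes, modelled on the switch in
mmigrate_preparepage(); write_back_one_page() is a hypothetical wrapper
and not part of the series.

#include <linux/mm.h>
#include <linux/swap.h>
#include <linux/pagemap.h>

static int
write_back_one_page(struct page *page)
{
	struct address_space *mapping = page_mapping(page);

	BUG_ON(!PageLocked(page));
	if (!PageDirty(page))
		return 0;

	switch (pageout(page, mapping)) {
	case PAGE_KEEP:		/* write failed, page is still locked */
	case PAGE_ACTIVATE:	/* better kept on the active list, still locked */
		return -1;
	case PAGE_SUCCESS:	/* I/O submitted, pageout() unlocked the page */
		lock_page(page);
		wait_on_page_writeback(page);
		return 0;
	case PAGE_CLEAN:	/* nothing to write, page is still locked */
		return 0;
	}
	return -1;
}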
* [PATCH] memory hotremoval for linux-2.6.7 [3/16]
2004-07-14 13:41 [PATCH] memory hotremoval for linux-2.6.7 [0/16] Hirokazu Takahashi
2004-07-14 14:02 ` [PATCH] memory hotremoval for linux-2.6.7 [1/16] Hirokazu Takahashi
2004-07-14 14:02 ` [PATCH] memory hotremoval for linux-2.6.7 [2/16] Hirokazu Takahashi
@ 2004-07-14 14:03 ` Hirokazu Takahashi
2004-07-14 14:03 ` [PATCH] memory hotremoval for linux-2.6.7 [4/16] Hirokazu Takahashi
` (12 subsequent siblings)
15 siblings, 0 replies; 17+ messages in thread
From: Hirokazu Takahashi @ 2004-07-14 14:03 UTC (permalink / raw)
To: linux-kernel, lhms-devel; +Cc: linux-mm
--- linux-2.6.7.ORG/arch/i386/Kconfig Sun Jul 11 10:05:04 2032
+++ linux-2.6.7/arch/i386/Kconfig Sun Jul 11 10:04:58 2032
@@ -734,9 +734,19 @@ comment "NUMA (NUMA-Q) requires SMP, 64G
comment "NUMA (Summit) requires SMP, 64GB highmem support, ACPI"
depends on X86_SUMMIT && (!HIGHMEM64G || !ACPI)
+config MEMHOTPLUG
+ bool "Memory hotplug test"
+ depends on !X86_PAE
+ default n
+
+config MEMHOTPLUG_BLKSIZE
+ int "Size of a memory hotplug unit (in MB, must be multiple of 256)."
+ range 256 1024
+ depends on MEMHOTPLUG
+
config DISCONTIGMEM
bool
- depends on NUMA
+ depends on NUMA || MEMHOTPLUG
default y
config HAVE_ARCH_BOOTMEM_NODE
--- linux-2.6.7.ORG/include/linux/gfp.h Sun Jul 11 10:05:04 2032
+++ linux-2.6.7/include/linux/gfp.h Sat Jul 10 19:37:22 2032
@@ -11,9 +11,10 @@ struct vm_area_struct;
/*
* GFP bitmasks..
*/
-/* Zone modifiers in GFP_ZONEMASK (see linux/mmzone.h - low two bits) */
-#define __GFP_DMA 0x01
-#define __GFP_HIGHMEM 0x02
+/* Zone modifiers in GFP_ZONEMASK (see linux/mmzone.h - low three bits) */
+#define __GFP_DMA 0x01
+#define __GFP_HIGHMEM 0x02
+#define __GFP_HOTREMOVABLE 0x03
/*
* Action modifiers - doesn't change the zoning
@@ -51,7 +52,7 @@ struct vm_area_struct;
#define GFP_NOFS (__GFP_WAIT | __GFP_IO)
#define GFP_KERNEL (__GFP_WAIT | __GFP_IO | __GFP_FS)
#define GFP_USER (__GFP_WAIT | __GFP_IO | __GFP_FS)
-#define GFP_HIGHUSER (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HIGHMEM)
+#define GFP_HIGHUSER (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HIGHMEM | __GFP_HOTREMOVABLE)
/* Flag - indicates that the buffer will be suitable for DMA. Ignored on some
platforms, used as appropriate on others */
--- linux-2.6.7.ORG/include/linux/mmzone.h Sun Jul 11 10:05:04 2032
+++ linux-2.6.7/include/linux/mmzone.h Sun Jul 11 10:04:13 2032
@@ -65,8 +65,10 @@ struct per_cpu_pageset {
#define ZONE_DMA 0
#define ZONE_NORMAL 1
#define ZONE_HIGHMEM 2
+#define ZONE_HOTREMOVABLE 3 /* only for zonelists */
#define MAX_NR_ZONES 3 /* Sync this with ZONES_SHIFT */
+#define MAX_NR_ZONELISTS 4
#define ZONES_SHIFT 2 /* ceil(log2(MAX_NR_ZONES)) */
#define GFP_ZONEMASK 0x03
@@ -225,7 +227,7 @@ struct zonelist {
struct bootmem_data;
typedef struct pglist_data {
struct zone node_zones[MAX_NR_ZONES];
- struct zonelist node_zonelists[MAX_NR_ZONES];
+ struct zonelist node_zonelists[MAX_NR_ZONELISTS];
int nr_zones;
struct page *node_mem_map;
struct bootmem_data *bdata;
@@ -237,6 +239,7 @@ typedef struct pglist_data {
struct pglist_data *pgdat_next;
wait_queue_head_t kswapd_wait;
struct task_struct *kswapd;
+ char removable, enabled;
} pg_data_t;
#define node_present_pages(nid) (NODE_DATA(nid)->node_present_pages)
--- linux-2.6.7.ORG/include/linux/page-flags.h Sun Jul 11 10:05:04 2032
+++ linux-2.6.7/include/linux/page-flags.h Sun Jul 11 10:04:13 2032
@@ -78,6 +78,8 @@
#define PG_anon 20 /* Anonymous: anon_vma in mapping */
+#define PG_again 21
+
/*
* Global page accounting. One instance per CPU. Only unsigned longs are
@@ -297,6 +299,10 @@ extern unsigned long __read_page_state(u
#define PageCompound(page) test_bit(PG_compound, &(page)->flags)
#define SetPageCompound(page) set_bit(PG_compound, &(page)->flags)
#define ClearPageCompound(page) clear_bit(PG_compound, &(page)->flags)
+
+#define PageAgain(page) test_bit(PG_again, &(page)->flags)
+#define SetPageAgain(page) set_bit(PG_again, &(page)->flags)
+#define ClearPageAgain(page) clear_bit(PG_again, &(page)->flags)
#define PageAnon(page) test_bit(PG_anon, &(page)->flags)
#define SetPageAnon(page) set_bit(PG_anon, &(page)->flags)
--- linux-2.6.7.ORG/include/linux/rmap.h Sun Jul 11 10:05:04 2032
+++ linux-2.6.7/include/linux/rmap.h Sat Jul 10 19:37:22 2032
@@ -96,7 +96,7 @@ static inline void page_dup_rmap(struct
* Called from mm/vmscan.c to handle paging out
*/
int page_referenced(struct page *);
-int try_to_unmap(struct page *);
+int try_to_unmap(struct page *, struct list_head *);
#else /* !CONFIG_MMU */
@@ -105,7 +105,7 @@ int try_to_unmap(struct page *);
#define anon_vma_link(vma) do {} while (0)
#define page_referenced(page) TestClearPageReferenced(page)
-#define try_to_unmap(page) SWAP_FAIL
+#define try_to_unmap(page, force) SWAP_FAIL
#endif /* CONFIG_MMU */
--- linux-2.6.7.ORG/mm/Makefile Sun Jul 11 10:05:04 2032
+++ linux-2.6.7/mm/Makefile Sat Jul 10 19:37:22 2032
@@ -15,3 +15,5 @@ obj-y := bootmem.o filemap.o mempool.o
obj-$(CONFIG_SWAP) += page_io.o swap_state.o swapfile.o
obj-$(CONFIG_HUGETLBFS) += hugetlb.o
obj-$(CONFIG_NUMA) += mempolicy.o
+
+obj-$(CONFIG_MEMHOTPLUG) += memhotplug.o
--- linux-2.6.7.ORG/mm/filemap.c Sun Jul 11 10:05:04 2032
+++ linux-2.6.7/mm/filemap.c Sat Jul 10 19:37:22 2032
@@ -250,7 +250,8 @@ int filemap_write_and_wait(struct addres
int add_to_page_cache(struct page *page, struct address_space *mapping,
pgoff_t offset, int gfp_mask)
{
- int error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
+ int error = radix_tree_preload((gfp_mask & ~GFP_ZONEMASK) |
+ ((gfp_mask & GFP_ZONEMASK) == __GFP_DMA ? __GFP_DMA : 0));
if (error == 0) {
spin_lock_irq(&mapping->tree_lock);
@@ -495,6 +496,7 @@ repeat:
page_cache_release(page);
goto repeat;
}
+ BUG_ON(PageAgain(page));
}
}
spin_unlock_irq(&mapping->tree_lock);
@@ -738,6 +740,8 @@ page_not_up_to_date:
goto page_ok;
}
+ BUG_ON(PageAgain(page));
+
readpage:
/* ... and start the actual read. The read will unlock the page. */
error = mapping->a_ops->readpage(filp, page);
@@ -1206,6 +1210,8 @@ page_not_uptodate:
goto success;
}
+ BUG_ON(PageAgain(page));
+
if (!mapping->a_ops->readpage(file, page)) {
wait_on_page_locked(page);
if (PageUptodate(page))
@@ -1314,6 +1320,8 @@ page_not_uptodate:
goto success;
}
+ BUG_ON(PageAgain(page));
+
if (!mapping->a_ops->readpage(file, page)) {
wait_on_page_locked(page);
if (PageUptodate(page))
@@ -1518,6 +1526,8 @@ retry:
unlock_page(page);
goto out;
}
+ BUG_ON(PageAgain(page));
+
err = filler(data, page);
if (err < 0) {
page_cache_release(page);
--- linux-2.6.7.ORG/mm/memory.c Sun Jul 11 10:05:04 2032
+++ linux-2.6.7/mm/memory.c Sun Jul 11 10:04:42 2032
@@ -1305,6 +1305,7 @@ static int do_swap_page(struct mm_struct
pte_unmap(page_table);
spin_unlock(&mm->page_table_lock);
+again:
page = lookup_swap_cache(entry);
if (!page) {
swapin_readahead(entry, address, vma);
@@ -1332,6 +1333,12 @@ static int do_swap_page(struct mm_struct
mark_page_accessed(page);
lock_page(page);
+ if (PageAgain(page)) {
+ unlock_page(page);
+ page_cache_release(page);
+ goto again;
+ }
+ BUG_ON(PageAgain(page));
/*
* Back out if somebody else faulted in this pte while we
--- linux-2.6.7.ORG/mm/page_alloc.c Sun Jul 11 10:05:04 2032
+++ linux-2.6.7/mm/page_alloc.c Sun Jul 11 10:04:58 2032
@@ -25,6 +25,7 @@
#include <linux/module.h>
#include <linux/suspend.h>
#include <linux/pagevec.h>
+#include <linux/memhotplug.h>
#include <linux/blkdev.h>
#include <linux/slab.h>
#include <linux/notifier.h>
@@ -231,6 +232,7 @@ static inline void free_pages_check(cons
1 << PG_maplock |
1 << PG_anon |
1 << PG_swapcache |
+ 1 << PG_again |
1 << PG_writeback )))
bad_page(function, page);
if (PageDirty(page))
@@ -341,12 +343,13 @@ static void prep_new_page(struct page *p
1 << PG_maplock |
1 << PG_anon |
1 << PG_swapcache |
+ 1 << PG_again |
1 << PG_writeback )))
bad_page(__FUNCTION__, page);
page->flags &= ~(1 << PG_uptodate | 1 << PG_error |
1 << PG_referenced | 1 << PG_arch_1 |
- 1 << PG_checked | 1 << PG_mappedtodisk);
+ 1 << PG_checked | 1 << PG_mappedtodisk | 1 << PG_again);
page->private = 0;
set_page_refs(page, order);
}
@@ -404,7 +407,7 @@ static int rmqueue_bulk(struct zone *zon
return allocated;
}
-#if defined(CONFIG_PM) || defined(CONFIG_HOTPLUG_CPU)
+#if defined(CONFIG_PM) || defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_MEMHOTPLUG)
static void __drain_pages(unsigned int cpu)
{
struct zone *zone;
@@ -447,7 +450,9 @@ int is_head_of_free_region(struct page *
spin_unlock_irqrestore(&zone->lock, flags);
return 0;
}
+#endif
+#if defined(CONFIG_SOFTWARE_SUSPEND) || defined(CONFIG_MEMHOTPLUG)
/*
* Spill all of this CPU's per-cpu pages back into the buddy allocator.
*/
@@ -847,7 +852,8 @@ unsigned int nr_free_pages(void)
struct zone *zone;
for_each_zone(zone)
- sum += zone->free_pages;
+ if (zone->zone_pgdat->enabled)
+ sum += zone->free_pages;
return sum;
}
@@ -860,7 +866,8 @@ unsigned int nr_used_zone_pages(void)
struct zone *zone;
for_each_zone(zone)
- pages += zone->nr_active + zone->nr_inactive;
+ if (zone->zone_pgdat->enabled)
+ pages += zone->nr_active + zone->nr_inactive;
return pages;
}
@@ -887,6 +894,8 @@ static unsigned int nr_free_zone_pages(i
struct zone **zonep = zonelist->zones;
struct zone *zone;
+ if (!pgdat->enabled)
+ continue;
for (zone = *zonep++; zone; zone = *zonep++) {
unsigned long size = zone->present_pages;
unsigned long high = zone->pages_high;
@@ -921,7 +930,8 @@ unsigned int nr_free_highpages (void)
unsigned int pages = 0;
for_each_pgdat(pgdat)
- pages += pgdat->node_zones[ZONE_HIGHMEM].free_pages;
+ if (pgdat->enabled)
+ pages += pgdat->node_zones[ZONE_HIGHMEM].free_pages;
return pages;
}
@@ -1171,13 +1181,21 @@ void show_free_areas(void)
/*
* Builds allocation fallback zone lists.
*/
-static int __init build_zonelists_node(pg_data_t *pgdat, struct zonelist *zonelist, int j, int k)
+static int build_zonelists_node(pg_data_t *pgdat, struct zonelist *zonelist, int j, int k)
{
+
+ if (!pgdat->enabled)
+ return j;
+ if (k != ZONE_HOTREMOVABLE &&
+ pgdat->removable)
+ return j;
+
switch (k) {
struct zone *zone;
default:
BUG();
case ZONE_HIGHMEM:
+ case ZONE_HOTREMOVABLE:
zone = pgdat->node_zones + ZONE_HIGHMEM;
if (zone->present_pages) {
#ifndef CONFIG_HIGHMEM
@@ -1304,24 +1322,48 @@ static void __init build_zonelists(pg_da
#else /* CONFIG_NUMA */
-static void __init build_zonelists(pg_data_t *pgdat)
+static void build_zonelists(pg_data_t *pgdat)
{
int i, j, k, node, local_node;
+ int hotremovable;
+#ifdef CONFIG_MEMHOTPLUG
+ struct zone *zone;
+#endif
local_node = pgdat->node_id;
- for (i = 0; i < MAX_NR_ZONES; i++) {
+ for (i = 0; i < MAX_NR_ZONELISTS; i++) {
struct zonelist *zonelist;
zonelist = pgdat->node_zonelists + i;
- memset(zonelist, 0, sizeof(*zonelist));
+ /* memset(zonelist, 0, sizeof(*zonelist)); */
j = 0;
k = ZONE_NORMAL;
- if (i & __GFP_HIGHMEM)
+ hotremovable = 0;
+ switch (i) {
+ default:
+ BUG();
+ return;
+ case 0:
+ k = ZONE_NORMAL;
+ break;
+ case __GFP_HIGHMEM:
k = ZONE_HIGHMEM;
- if (i & __GFP_DMA)
+ break;
+ case __GFP_DMA:
k = ZONE_DMA;
+ break;
+ case __GFP_HOTREMOVABLE:
+#ifdef CONFIG_MEMHOTPLUG
+ k = ZONE_HIGHMEM;
+#else
+ k = ZONE_HOTREMOVABLE;
+#endif
+ hotremovable = 1;
+ break;
+ }
+#ifndef CONFIG_MEMHOTPLUG
j = build_zonelists_node(pgdat, zonelist, j, k);
/*
* Now we build the zonelist so that it contains the zones
@@ -1335,19 +1377,54 @@ static void __init build_zonelists(pg_da
j = build_zonelists_node(NODE_DATA(node), zonelist, j, k);
for (node = 0; node < local_node; node++)
j = build_zonelists_node(NODE_DATA(node), zonelist, j, k);
-
- zonelist->zones[j] = NULL;
- }
+#else
+ while (hotremovable >= 0) {
+ for(; k >= 0; k--) {
+ zone = pgdat->node_zones + k;
+ for (node = local_node; ;) {
+ if (NODE_DATA(node) == NULL ||
+ !NODE_DATA(node)->enabled ||
+ (!!NODE_DATA(node)->removable) !=
+ (!!hotremovable))
+ goto next;
+ zone = NODE_DATA(node)->node_zones + k;
+ if (zone->present_pages)
+ zonelist->zones[j++] = zone;
+ next:
+ node = (node + 1) % numnodes;
+ if (node == local_node)
+ break;
+ }
+ }
+ if (hotremovable) {
+ /* place non-hotremovable after hotremovable */
+ k = ZONE_HIGHMEM;
+ }
+ hotremovable--;
+ }
+#endif
+ BUG_ON(j > sizeof(zonelist->zones) /
+ sizeof(zonelist->zones[0]) - 1);
+ for(; j < sizeof(zonelist->zones) /
+ sizeof(zonelist->zones[0]); j++)
+ zonelist->zones[j] = NULL;
+ }
}
#endif /* CONFIG_NUMA */
-void __init build_all_zonelists(void)
+#ifdef CONFIG_MEMHOTPLUG
+void
+#else
+void __init
+#endif
+build_all_zonelists(void)
{
int i;
for(i = 0 ; i < numnodes ; i++)
- build_zonelists(NODE_DATA(i));
+ if (NODE_DATA(i) != NULL)
+ build_zonelists(NODE_DATA(i));
printk("Built %i zonelists\n", numnodes);
}
@@ -1419,7 +1496,7 @@ static void __init calculate_zone_totalp
* up by free_all_bootmem() once the early boot process is
* done. Non-atomic initialization, single-pass.
*/
-void __init memmap_init_zone(struct page *start, unsigned long size, int nid,
+void memmap_init_zone(struct page *start, unsigned long size, int nid,
unsigned long zone, unsigned long start_pfn)
{
struct page *page;
@@ -1457,10 +1534,13 @@ static void __init free_area_init_core(s
int cpu, nid = pgdat->node_id;
struct page *lmem_map = pgdat->node_mem_map;
unsigned long zone_start_pfn = pgdat->node_start_pfn;
+#ifdef CONFIG_MEMHOTPLUG
+ int cold = !nid;
+#endif
pgdat->nr_zones = 0;
init_waitqueue_head(&pgdat->kswapd_wait);
-
+
for (j = 0; j < MAX_NR_ZONES; j++) {
struct zone *zone = pgdat->node_zones + j;
unsigned long size, realsize;
@@ -1530,6 +1610,13 @@ static void __init free_area_init_core(s
zone->wait_table_size = wait_table_size(size);
zone->wait_table_bits =
wait_table_bits(zone->wait_table_size);
+#ifdef CONFIG_MEMHOTPLUG
+ if (!cold)
+ zone->wait_table = (wait_queue_head_t *)
+ kmalloc(zone->wait_table_size
+ * sizeof(wait_queue_head_t), GFP_KERNEL);
+ else
+#endif
zone->wait_table = (wait_queue_head_t *)
alloc_bootmem_node(pgdat, zone->wait_table_size
* sizeof(wait_queue_head_t));
@@ -1584,6 +1671,13 @@ static void __init free_area_init_core(s
*/
bitmap_size = (size-1) >> (i+4);
bitmap_size = LONG_ALIGN(bitmap_size+1);
+#ifdef CONFIG_MEMHOTPLUG
+ if (!cold) {
+ zone->free_area[i].map =
+ (unsigned long *)kmalloc(bitmap_size, GFP_KERNEL);
+ memset(zone->free_area[i].map, 0, bitmap_size);
+ } else
+#endif
zone->free_area[i].map =
(unsigned long *) alloc_bootmem_node(pgdat, bitmap_size);
}
@@ -1901,7 +1995,7 @@ static void setup_per_zone_protection(vo
* that the pages_{min,low,high} values for each zone are set correctly
* with respect to min_free_kbytes.
*/
-static void setup_per_zone_pages_min(void)
+void setup_per_zone_pages_min(void)
{
unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
unsigned long lowmem_pages = 0;
--- linux-2.6.7.ORG/mm/rmap.c Sun Jul 11 10:05:04 2032
+++ linux-2.6.7/mm/rmap.c Sat Jul 10 19:37:22 2032
@@ -30,6 +30,7 @@
#include <linux/pagemap.h>
#include <linux/swap.h>
#include <linux/swapops.h>
+#include <linux/memhotplug.h>
#include <linux/slab.h>
#include <linux/init.h>
#include <linux/rmap.h>
@@ -421,7 +422,8 @@ void page_remove_rmap(struct page *page)
* Subfunctions of try_to_unmap: try_to_unmap_one called
* repeatedly from either try_to_unmap_anon or try_to_unmap_file.
*/
-static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma)
+static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
+ struct list_head *force)
{
struct mm_struct *mm = vma->vm_mm;
unsigned long address;
@@ -429,6 +431,9 @@ static int try_to_unmap_one(struct page
pmd_t *pmd;
pte_t *pte;
pte_t pteval;
+#ifdef CONFIG_MEMHOTPLUG
+ struct page_va_list *vlist;
+#endif
int ret = SWAP_AGAIN;
if (!mm->rss)
@@ -466,8 +471,22 @@ static int try_to_unmap_one(struct page
*/
if ((vma->vm_flags & (VM_LOCKED|VM_RESERVED)) ||
ptep_test_and_clear_young(pte)) {
- ret = SWAP_FAIL;
- goto out_unmap;
+ if (force == NULL || vma->vm_flags & VM_RESERVED) {
+ ret = SWAP_FAIL;
+ goto out_unmap;
+ }
+#ifdef CONFIG_MEMHOTPLUG
+ vlist = kmalloc(sizeof(struct page_va_list), GFP_KERNEL);
+ atomic_inc(&mm->mm_count);
+ vlist->mm = mmgrab(mm);
+ if (vlist->mm == NULL) {
+ mmdrop(mm);
+ kfree(vlist);
+ } else {
+ vlist->addr = address;
+ list_add(&vlist->list, force);
+ }
+#endif
}
/*
@@ -620,7 +639,7 @@ out_unlock:
return SWAP_AGAIN;
}
-static inline int try_to_unmap_anon(struct page *page)
+static inline int try_to_unmap_anon(struct page *page, struct list_head *force)
{
struct anon_vma *anon_vma = (struct anon_vma *) page->mapping;
struct vm_area_struct *vma;
@@ -629,7 +648,7 @@ static inline int try_to_unmap_anon(stru
spin_lock(&anon_vma->lock);
BUG_ON(list_empty(&anon_vma->head));
list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
- ret = try_to_unmap_one(page, vma);
+ ret = try_to_unmap_one(page, vma, force);
if (ret == SWAP_FAIL || !page->mapcount)
break;
}
@@ -649,7 +668,7 @@ static inline int try_to_unmap_anon(stru
* The spinlock address_space->i_mmap_lock is tried. If it can't be gotten,
* return a temporary error.
*/
-static inline int try_to_unmap_file(struct page *page)
+static inline int try_to_unmap_file(struct page *page, struct list_head *force)
{
struct address_space *mapping = page->mapping;
pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
@@ -666,7 +685,7 @@ static inline int try_to_unmap_file(stru
while ((vma = vma_prio_tree_next(vma, &mapping->i_mmap,
&iter, pgoff, pgoff)) != NULL) {
- ret = try_to_unmap_one(page, vma);
+ ret = try_to_unmap_one(page, vma, force);
if (ret == SWAP_FAIL || !page->mapcount)
goto out;
}
@@ -760,7 +779,7 @@ out:
* SWAP_AGAIN - we missed a trylock, try again later
* SWAP_FAIL - the page is unswappable
*/
-int try_to_unmap(struct page *page)
+int try_to_unmap(struct page *page, struct list_head *force)
{
int ret;
@@ -769,9 +788,9 @@ int try_to_unmap(struct page *page)
BUG_ON(!page->mapcount);
if (PageAnon(page))
- ret = try_to_unmap_anon(page);
+ ret = try_to_unmap_anon(page, force);
else
- ret = try_to_unmap_file(page);
+ ret = try_to_unmap_file(page, force);
if (!page->mapcount) {
if (page_test_and_clear_dirty(page))
--- linux-2.6.7.ORG/mm/swapfile.c Sun Jul 11 10:05:04 2032
+++ linux-2.6.7/mm/swapfile.c Sat Jul 10 19:37:22 2032
@@ -662,6 +662,7 @@ static int try_to_unuse(unsigned int typ
*/
swap_map = &si->swap_map[i];
entry = swp_entry(type, i);
+ again:
page = read_swap_cache_async(entry, NULL, 0);
if (!page) {
/*
@@ -696,6 +697,11 @@ static int try_to_unuse(unsigned int typ
wait_on_page_locked(page);
wait_on_page_writeback(page);
lock_page(page);
+ if (PageAgain(page)) {
+ unlock_page(page);
+ page_cache_release(page);
+ goto again;
+ }
wait_on_page_writeback(page);
/*
@@ -804,6 +810,7 @@ static int try_to_unuse(unsigned int typ
swap_writepage(page, &wbc);
lock_page(page);
+ BUG_ON(PageAgain(page));
wait_on_page_writeback(page);
}
if (PageSwapCache(page)) {
--- linux-2.6.7.ORG/mm/truncate.c Sun Jul 11 10:05:04 2032
+++ linux-2.6.7/mm/truncate.c Sat Jul 10 19:37:22 2032
@@ -132,6 +132,8 @@ void truncate_inode_pages(struct address
next++;
if (TestSetPageLocked(page))
continue;
+ /* no PageAgain(page) check; page->mapping check
+ * is done in truncate_complete_page */
if (PageWriteback(page)) {
unlock_page(page);
continue;
@@ -165,6 +167,24 @@ void truncate_inode_pages(struct address
struct page *page = pvec.pages[i];
lock_page(page);
+ if (page->mapping == NULL) {
+ /* XXX Is page->index still valid? */
+ unsigned long index = page->index;
+ int again = PageAgain(page);
+
+ unlock_page(page);
+ put_page(page);
+ page = find_lock_page(mapping, index);
+ if (page == NULL) {
+ BUG_ON(again);
+ /* XXX */
+ if (page->index > next)
+ next = page->index;
+ next++;
+ }
+ BUG_ON(!again);
+ pvec.pages[i] = page;
+ }
wait_on_page_writeback(page);
if (page->index > next)
next = page->index;
@@ -257,14 +277,29 @@ void invalidate_inode_pages2(struct addr
struct page *page = pvec.pages[i];
lock_page(page);
- if (page->mapping == mapping) { /* truncate race? */
- wait_on_page_writeback(page);
- next = page->index + 1;
- if (page_mapped(page))
- clear_page_dirty(page);
- else
- invalidate_complete_page(mapping, page);
+ while (page->mapping != mapping) {
+ struct page *newpage;
+ unsigned long index = page->index;
+
+ BUG_ON(page->mapping != NULL);
+
+ unlock_page(page);
+ newpage = find_lock_page(mapping, index);
+ if (page == newpage) {
+ put_page(page);
+ break;
+ }
+ BUG_ON(!PageAgain(page));
+ pvec.pages[i] = newpage;
+ put_page(page);
+ page = newpage;
}
+ wait_on_page_writeback(page);
+ next = page->index + 1;
+ if (page_mapped(page))
+ clear_page_dirty(page);
+ else
+ invalidate_complete_page(mapping, page);
unlock_page(page);
}
pagevec_release(&pvec);
--- linux-2.6.7.ORG/mm/vmscan.c Sun Jul 11 10:05:04 2032
+++ linux-2.6.7/mm/vmscan.c Sat Jul 10 19:37:22 2032
@@ -32,6 +32,7 @@
#include <linux/topology.h>
#include <linux/cpu.h>
#include <linux/notifier.h>
+#include <linux/kthread.h>
#include <asm/pgalloc.h>
#include <asm/tlbflush.h>
@@ -387,7 +388,7 @@ static int shrink_list(struct list_head
* processes. Try to unmap it here.
*/
if (page_mapped(page) && mapping) {
- switch (try_to_unmap(page)) {
+ switch (try_to_unmap(page, NULL)) {
case SWAP_FAIL:
page_map_unlock(page);
goto activate_locked;
@@ -1091,6 +1092,8 @@ int kswapd(void *p)
if (current->flags & PF_FREEZE)
refrigerator(PF_FREEZE);
prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
+ if (kthread_should_stop())
+ return 0;
schedule();
finish_wait(&pgdat->kswapd_wait, &wait);
@@ -1173,5 +1176,15 @@ static int __init kswapd_init(void)
hotcpu_notifier(cpu_callback, 0);
return 0;
}
+
+#ifdef CONFIG_MEMHOTPLUG
+void
+kswapd_start_one(pg_data_t *pgdat)
+{
+ pgdat->kswapd = kthread_create(kswapd, pgdat, "kswapd%d",
+ pgdat->node_id);
+ total_memory = nr_free_pagecache_pages();
+}
+#endif
module_init(kswapd_init)
--- linux-2.6.7.ORG/include/linux/memhotplug.h Sun Jul 11 10:05:04 2032
+++ linux-2.6.7/include/linux/memhotplug.h Sun Jul 11 10:11:51 2032
@@ -0,0 +1,32 @@
+#ifndef _LINUX_MEMHOTPLUG_H
+#define _LINUX_MEMHOTPLUG_H
+
+#include <linux/config.h>
+#include <linux/mm.h>
+
+#ifdef __KERNEL__
+
+struct page_va_list {
+ struct mm_struct *mm;
+ unsigned long addr;
+ struct list_head list;
+};
+
+struct mmigrate_operations {
+ struct page * (*mmigrate_alloc_page)(int);
+ int (*mmigrate_free_page)(struct page *);
+ int (*mmigrate_copy_page)(struct page *, struct page *);
+ int (*mmigrate_lru_add_page)(struct page *, int);
+ int (*mmigrate_release_buffers)(struct page *);
+ int (*mmigrate_prepare)(struct page *page, int fastmode);
+ int (*mmigrate_stick_page)(struct list_head *vlist);
+};
+
+extern int mmigrated(void *p);
+extern int mmigrate_onepage(struct page *, int, int, struct mmigrate_operations *);
+extern int try_to_migrate_pages(struct zone *, int, struct page * (*)(struct zone *, void *), void *);
+
+#define MIGRATE_ANYNODE (-1)
+
+#endif /* __KERNEL__ */
+#endif /* _LINUX_MEMHOTPLUG_H */
--- linux-2.6.7.ORG/mm/memhotplug.c Sun Jul 11 10:05:04 2032
+++ linux-2.6.7/mm/memhotplug.c Sun Jul 11 10:12:48 2032
@@ -0,0 +1,817 @@
+/*
+ * linux/mm/memhotplug.c
+ *
+ * Support of memory hotplug
+ *
+ * Authors: Toshihiro Iwamoto, <iwamoto@valinux.co.jp>
+ * Hirokazu Takahashi, <taka@valinux.co.jp>
+ */
+
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/swap.h>
+#include <linux/pagemap.h>
+#include <linux/init.h>
+#include <linux/highmem.h>
+#include <linux/writeback.h>
+#include <linux/buffer_head.h>
+#include <linux/mm_inline.h>
+#include <linux/rmap.h>
+#include <linux/memhotplug.h>
+
+#ifdef CONFIG_KDB
+#include <linux/kdb.h>
+#endif
+
+/*
+ * The following flow is how an oldpage is migrated.
+ * 1. allocate a newpage.
+ * 2. lock the newpage and don't set PG_uptodate flag on it.
+ * 3. modify the oldpage entry in the corresponding radix tree with the
+ * newpage.
+ * 4. clear all PTEs that refer to the oldpage.
+ * 5. wait until all references on the oldpage have gone.
+ * 6. copy from the oldpage to the newpage.
+ * 7. set PG_uptodate flag of the newpage.
+ * 8. release the oldpage.
+ * 9. unlock the newpage and wakeup all waiters.
+ *
+ *
+ *     address_space                        oldpage
+ *   +-------------+                     +-------------+
+ *   |             |                     |             |          +-----+
+ *   |  page_tree------+      -- X -->   |             |<-- X --  | PTE |.....
+ *   |             |   |                 | PG_uptodate |          +-----+
+ *   |             |   |                 +-------------+              :
+ *   +-------------+   |                                              :
+ *                     |                   newpage                pagefaults
+ *                     |                 +-------------+              :
+ *                     +---------------->|  PG_locked  |..............:
+ *                                       |             |  Blocked
+ *                                       |             | ...........system calls
+ *                                       +-------------+
+ *
+ *
+ * The key point is to block accesses to the page under operation by
+ * modifying the radix tree. Once the radix tree has been modified, no new
+ * access reaches the oldpage; new accesses are redirected to the newpage,
+ * which blocks them until the data is ready, because it is locked and not
+ * up to date. Remember that dropping PG_uptodate is important for blocking
+ * all read accesses, including system call accesses and page fault accesses.
+ *
+ * With this approach, pages in the swapcache are handled the same way as
+ * pages in the pagecache, since both kinds of pages live on radix trees.
+ * Any kind of page in the pagecache can be migrated even if it is not
+ * associated with backing store, such as pages in sysfs or on a ramdisk.
+ * We can migrate all pages on the LRU in the same way.
+ */
+
+
+static void
+print_buffer(struct page* page)
+{
+ struct address_space* mapping = page_mapping(page);
+ struct buffer_head *bh, *head;
+
+ spin_lock(&mapping->private_lock);
+ bh = head = page_buffers(page);
+ printk("buffers:");
+ do {
+ printk(" %lx %d", bh->b_state, atomic_read(&bh->b_count));
+
+ bh = bh->b_this_page;
+ } while (bh != head);
+ printk("\n");
+ spin_unlock(&mapping->private_lock);
+}
+
+/*
+ * Make pages on the "vlist" mapped or they may be freed
+ * though there are mlocked.
+ */
+static int
+stick_mlocked_page(struct list_head *vlist)
+{
+ struct page_va_list *v1, *v2;
+ struct vm_area_struct *vma;
+ int error;
+
+ list_for_each_entry_safe(v1, v2, vlist, list) {
+ list_del(&v1->list);
+ down_read(&v1->mm->mmap_sem);
+ vma = find_vma(v1->mm, v1->addr);
+ if (vma == NULL || !(vma->vm_flags & VM_LOCKED))
+ goto out;
+ error = get_user_pages(current, v1->mm, v1->addr, PAGE_SIZE,
+ (vma->vm_flags & VM_WRITE) != 0, 0, NULL, NULL);
+ out:
+ up_read(&v1->mm->mmap_sem);
+ mmput(v1->mm);
+ kfree(v1);
+ }
+ return 0;
+}
+
+/* helper function for mmigrate_onepage */
+#define REMAPPREP_WB 1
+#define REMAPPREP_BUFFER 2
+
+/*
+ * Try to free buffers if "page" has them.
+ *
+ * TODO:
+ * It would be possible to migrate a page without pageout
+ * if an address_space had a page migration method.
+ */
+static int
+mmigrate_preparepage(struct page *page, int fastmode)
+{
+ struct address_space *mapping;
+ int waitcnt = fastmode ? 0 : 100;
+ int res = -REMAPPREP_BUFFER;
+
+ BUG_ON(!PageLocked(page));
+
+ mapping = page_mapping(page);
+
+ if (!PagePrivate(page) && PageWriteback(page) &&
+ !PageSwapCache(page)) {
+ printk("mmigrate_preparepage: mapping %p page %p\n",
+ page->mapping, page);
+ return -REMAPPREP_WB;
+ }
+
+ /*
+ * TODO: wait_on_page_writeback() would be better if it supported
+ * a timeout.
+ */
+ while (PageWriteback(page)) {
+ if (!waitcnt)
+ return -REMAPPREP_WB;
+ __set_current_state(TASK_INTERRUPTIBLE);
+ schedule_timeout(10);
+ __set_current_state(TASK_RUNNING);
+ waitcnt--;
+ }
+ if (PagePrivate(page)) {
+ if (PageDirty(page)) {
+ switch(pageout(page, mapping)) {
+ case PAGE_ACTIVATE:
+ res = -REMAPPREP_WB;
+ waitcnt = 1;
+ case PAGE_KEEP:
+ case PAGE_CLEAN:
+ break;
+ case PAGE_SUCCESS:
+ lock_page(page);
+ mapping = page_mapping(page);
+ if (!PagePrivate(page))
+ return 0;
+ }
+ }
+
+ while (1) {
+ if (try_to_release_page(page, GFP_KERNEL))
+ break;
+ if (!waitcnt)
+ return res;
+ __set_current_state(TASK_INTERRUPTIBLE);
+ schedule_timeout(10);
+ __set_current_state(TASK_RUNNING);
+ waitcnt--;
+ if (!waitcnt)
+ print_buffer(page);
+ }
+ }
+ return 0;
+}
+
+/*
+ * Assign a swap entry to an anonymous page if it doesn't have one yet,
+ * so that it can be handled like one in the page cache.
+ */
+static struct address_space *
+make_page_mapped(struct page *page)
+{
+ if (!page_mapped(page)) {
+ if (page_count(page) > 1)
+ printk("page %p not mapped: count %d\n",
+ page, page_count(page));
+ return NULL;
+ }
+ /* The page is an anon page. Allocate its swap entry. */
+ page_map_unlock(page);
+ add_to_swap(page);
+ page_map_lock(page);
+ return page_mapping(page);
+}
+
+/*
+ * Replace "page" with "newpage" on the radix tree. After that, all
+ * new access to "page" will be redirected to "newpage" and it
+ * will be blocked until migrating has been done.
+ */
+static int
+radix_tree_replace_pages(struct page *page, struct page *newpage,
+ struct address_space *mapping)
+{
+ if (radix_tree_preload(GFP_KERNEL))
+ return -1;
+
+ if (PagePrivate(page)) /* XXX */
+ BUG();
+
+ /* should {__add_to,__remove_from}_page_cache be used instead? */
+ spin_lock_irq(&mapping->tree_lock);
+ if (mapping != page_mapping(page))
+ printk("mapping changed %p -> %p, page %p\n",
+ mapping, page_mapping(page), page);
+ if (radix_tree_delete(&mapping->page_tree, page_index(page)) == NULL) {
+ /* Page truncated. */
+ spin_unlock_irq(&mapping->tree_lock);
+ radix_tree_preload_end();
+ return -1;
+ }
+ /* Don't __put_page(page) here. Truncate may be in progress. */
+ newpage->flags |= page->flags & ~(1 << PG_uptodate) &
+ ~(1 << PG_highmem) & ~(1 << PG_anon) &
+ ~(1 << PG_maplock) &
+ ~(1 << PG_active) & ~(~0UL << NODEZONE_SHIFT);
+
+ radix_tree_insert(&mapping->page_tree, page_index(page), newpage);
+ page_cache_get(newpage);
+ newpage->index = page->index;
+ if (PageSwapCache(page))
+ newpage->private = page->private;
+ else
+ newpage->mapping = page->mapping;
+ spin_unlock_irq(&mapping->tree_lock);
+ radix_tree_preload_end();
+ return 0;
+}
+
+/*
+ * Remove all PTE mappings to "page".
+ */
+static int
+unmap_page(struct page *page, struct list_head *vlist)
+{
+ int error = SWAP_SUCCESS;
+
+ page_map_lock(page);
+ while (page_mapped(page) &&
+ (error = try_to_unmap(page, vlist)) == SWAP_AGAIN) {
+ /*
+ * There may be a race condition; just wait for a while.
+ */
+ page_map_unlock(page);
+ __set_current_state(TASK_INTERRUPTIBLE);
+ schedule_timeout(1);
+ __set_current_state(TASK_RUNNING);
+ page_map_lock(page);
+ }
+ page_map_unlock(page);
+ if (error == SWAP_FAIL) {
+ /* either during mremap or mlocked */
+ return -1;
+ }
+ return 0;
+}
+
+/*
+ * Wait for "page" to become free. Usually this function waits until
+ * the page count drops to 2. For a truncated page, it waits until
+ * the count drops to 1.
+ * Returns: 0 on success, 1 on page truncation, -1 on error.
+ */
+static int
+wait_on_page_freeable(struct page *page, struct address_space *mapping,
+ struct list_head *vlist, unsigned long index,
+ int nretry, struct mmigrate_operations *ops)
+{
+ struct address_space *mapping1;
+ void *p;
+ int truncated = 0;
+wait_again:
+ while ((truncated + page_count(page)) > 2) {
+ if (nretry <= 0)
+ return -1;
+ /*
+ * No lock is needed while waiting on the page count.
+ * Yield CPU to other accesses which may have to lock the
+ * page to proceed.
+ */
+ unlock_page(page);
+
+ /*
+ * Wait until all references have gone.
+ */
+ while ((truncated + page_count(page)) > 2) {
+ nretry--;
+ current->state = TASK_INTERRUPTIBLE;
+ schedule_timeout(1);
+ if ((nretry % 5000) == 0) {
+ printk("mmigrate_onepage: still waiting on %p %d\n", page, nretry);
+ break;
+ }
+ /*
+ * Another remaining access to the page may reassign
+ * buffers or make it mapped again.
+ */
+ if (PagePrivate(page) || page_mapped(page))
+ break; /* see below */
+ }
+
+ lock_page(page);
+ BUG_ON(page_count(page) == 0);
+ mapping1 = page_mapping(page);
+ if (mapping != mapping1 && mapping1 != NULL)
+ printk("mapping changed %p -> %p, page %p\n",
+ mapping, mapping1, page);
+
+ /*
+ * Free buffers of the page which may have been
+ * reassigned.
+ */
+ if (PagePrivate(page))
+ ops->mmigrate_release_buffers(page);
+
+ /*
+ * Clear all PTE mappings to the page as it may have
+ * been mapped again.
+ */
+ unmap_page(page, vlist);
+ }
+ if (PageReclaim(page) || PageWriteback(page) || PagePrivate(page))
+#ifdef CONFIG_KDB
+ KDB_ENTER();
+#else
+ BUG();
+#endif
+ if (page_count(page) == 1)
+ /* page has been truncated. */
+ return 1;
+ spin_lock_irq(&mapping->tree_lock);
+ p = radix_tree_lookup(&mapping->page_tree, index);
+ spin_unlock_irq(&mapping->tree_lock);
+ if (p == NULL) {
+ BUG_ON(page->mapping != NULL);
+ truncated = 1;
+ BUG_ON(page_mapping(page) != NULL);
+ goto wait_again;
+ }
+
+ return 0;
+}
+
+/*
+ * A file which "page" belongs to has been truncated. Free both pages.
+ */
+static void
+free_truncated_pages(struct page *page, struct page *newpage,
+ struct address_space *mapping)
+{
+ void *p;
+ BUG_ON(page_mapping(page) != NULL);
+ put_page(newpage);
+ if (page_count(newpage) != 1) {
+ printk("newpage count %d != 1, %p\n",
+ page_count(newpage), newpage);
+ BUG();
+ }
+ newpage->mapping = page->mapping = NULL;
+ ClearPageActive(page);
+ ClearPageActive(newpage);
+ ClearPageSwapCache(page);
+ ClearPageSwapCache(newpage);
+ unlock_page(page);
+ unlock_page(newpage);
+ put_page(newpage);
+}
+
+/*
+ * Roll back a page migration.
+ *
+ * In some cases, a page migration needs to be rolled back and to
+ * be retried later. This is a bit tricky because it is likely that some
+ * processes have already looked up the radix tree and waiting for its
+ * lock. Such processes need to discard a newpage and look up the radix
+ * tree again, as the newpage is now invalid.
+ * A new page flag (PG_again) is used for that purpose.
+ *
+ * 1. Roll back the radix tree change.
+ * 2. Set PG_again flag of the newpage and unlock it.
+ * 3. Woken-up processes see the PG_again bit and look up the radix
+ * tree again.
+ * 4. Wait until the page count of the newpage falls to 1 (for the
+ * migrated process).
+ * 5. Roll back is complete. the newpage can be freed.
+ */
+static int
+radix_tree_rewind_page(struct page *page, struct page *newpage,
+ struct address_space *mapping)
+{
+ int waitcnt;
+ pgoff_t index;
+
+ /*
+ * Try to unwind by notifying waiters. If someone misbehaves,
+ * we die.
+ */
+ if (radix_tree_preload(GFP_KERNEL))
+ BUG();
+ /* should {__add_to,__remove_from}_page_cache be used instead? */
+ spin_lock_irq(&mapping->tree_lock);
+ index = page_index(page);
+ if (radix_tree_delete(&mapping->page_tree, index) == NULL)
+ /* Hold extra count to handle truncate */
+ page_cache_get(newpage);
+ radix_tree_insert(&mapping->page_tree, index, page);
+ /* no page_cache_get(page); needed */
+ spin_unlock_irq(&mapping->tree_lock);
+ radix_tree_preload_end();
+
+ /*
+ * PG_again flag notifies waiters that this newpage isn't what
+ * the waiters expect.
+ */
+ SetPageAgain(newpage);
+ newpage->mapping = NULL;
+ /* XXX unmap needed? No, it shouldn't. Handled by fault handlers. */
+ unlock_page(newpage);
+
+ /*
+ * Some accesses may be blocked on the newpage. Wait until the
+ * accesses have gone.
+ */
+ waitcnt = HZ;
+ for(; page_count(newpage) > 2; waitcnt--) {
+ current->state = TASK_INTERRUPTIBLE;
+ schedule_timeout(10);
+ if (waitcnt == 0) {
+ printk("You are hosed.\n");
+ printk("newpage %p flags %lx %d %d, page %p flags %lx %d\n",
+ newpage, newpage->flags, page_count(newpage),
+ newpage->mapcount,
+ page, page->flags, page_count(page));
+ BUG();
+ }
+ }
+
+ BUG_ON(PageUptodate(newpage));
+ ClearPageDirty(newpage);
+ ClearPageActive(newpage);
+ spin_lock_irq(&mapping->tree_lock);
+ if (page_count(newpage) == 1) {
+ printk("newpage %p truncated. page %p\n", newpage, page);
+ BUG();
+ }
+ spin_unlock_irq(&mapping->tree_lock);
+ unlock_page(page);
+ BUG_ON(page_count(newpage) != 2);
+ ClearPageAgain(newpage);
+ __put_page(newpage);
+ return 1;
+}
+
+/*
+ * Allocate a new page from a specified node.
+ */
+static struct page *
+mmigrate_alloc_page(int nid)
+{
+ if (nid == MIGRATE_ANYNODE)
+ return alloc_page(GFP_HIGHUSER);
+ else
+ return alloc_pages_node(nid, GFP_HIGHUSER, 0);
+}
+
+/*
+ * Release "page" into the Buddy allocator.
+ */
+static int
+mmigrate_free_page(struct page *page)
+{
+ BUG_ON(page_count(page) != 1);
+ put_page(page);
+ return 0;
+}
+
+/*
+ * Copy data from "from" to "to".
+ */
+static int
+mmigrate_copy_page(struct page *to, struct page *from)
+{
+ copy_highpage(to, from);
+ return 0;
+}
+
+/*
+ * Insert "page" into the LRU.
+ */
+static int
+mmigrate_lru_add_page(struct page *page, int active)
+{
+ if (active)
+ lru_cache_add_active(page);
+ else
+ lru_cache_add(page);
+ return 0;
+}
+
+static int
+mmigrate_release_buffer(struct page *page)
+{
+ try_to_release_page(page, GFP_KERNEL);
+ return 0;
+}
+
+/*
+ * These are the migrate operations for regular pages, which include
+ * anonymous pages, pages in the pagecache and pages in the swapcache.
+ */
+static struct mmigrate_operations mmigrate_ops = {
+ .mmigrate_alloc_page = mmigrate_alloc_page,
+ .mmigrate_free_page = mmigrate_free_page,
+ .mmigrate_copy_page = mmigrate_copy_page,
+ .mmigrate_lru_add_page = mmigrate_lru_add_page,
+ .mmigrate_release_buffers = mmigrate_release_buffer,
+ .mmigrate_prepare = mmigrate_preparepage,
+ .mmigrate_stick_page = stick_mlocked_page
+};
+
+/*
+ * Try to migrate one page. Returns non-zero on failure.
+ */
+int mmigrate_onepage(struct page *page, int nodeid, int fastmode,
+ struct mmigrate_operations *ops)
+{
+ struct page *newpage;
+ struct address_space *mapping;
+ LIST_HEAD(vlist);
+ int nretry = fastmode ? HZ/50: HZ*10; /* XXXX */
+
+ if ((newpage = ops->mmigrate_alloc_page(nodeid)) == NULL)
+ return -ENOMEM;
+
+ /*
+ * Make sure that the newpage is locked and not up-to-date during
+ * the page migration, so that it is guaranteed that all accesses
+ * to the newpage, including read accesses, will be blocked until
+ * everything has become ready.
+ *
+ * Unlike with the swapout mechanism, all accesses to the page,
+ * both read accesses and write accesses, have to be blocked,
+ * since the oldpage and the newpage exist at the same time and
+ * the newpage contains invalid data while some references to
+ * the oldpage remain.
+ *
+ * FYI, the swap code allows read accesses during swapping, as the
+ * content of the page is valid and the page will never be freed
+ * while some references to it exist. Write access is also possible
+ * during swapping; it will pull the page back and modify it even
+ * if it is under I/O.
+ */
+ if (TestSetPageLocked(newpage))
+ BUG();
+
+ lock_page(page);
+
+ if (ops->mmigrate_prepare && ops->mmigrate_prepare(page, fastmode))
+ goto radixfail;
+
+ /*
+ * Put the page in a radix tree if it isn't in the tree yet,
+ * so that all pages can be handled on radix trees and move
+ * them in the same way.
+ */
+ page_map_lock(page);
+ if (PageAnon(page) && !PageSwapCache(page))
+ make_page_mapped(page);
+ mapping = page_mapping(page);
+ page_map_unlock(page);
+
+ if (mapping == NULL)
+ goto radixfail;
+
+ /*
+ * Replace the oldpage with the newpage in the radix tree,
+ * after that the newpage can catch all access requests to the
+ * oldpage instead.
+ *
+ * We cannot leave the oldpage locked in the radix tree because:
+ * - It cannot block read accesses while PG_uptodate is set, and the
+ * flag cannot be cleared, since that would mean the page's data is invalid.
+ * - Some accesses cannot be finished if someone is holding the
+ * lock as they may require the lock to handle the oldpage.
+ * - It's hard to determine when the page can be freed if there
+ * remain references to the oldpage.
+ */
+ if (radix_tree_replace_pages(page, newpage, mapping))
+ goto radixfail;
+
+ /*
+ * With cleared PTEs, any access via PTEs to the oldpages can
+ * be caught and blocked in a pagefault handler.
+ */
+ if (unmap_page(page, &vlist))
+ goto unmapfail;
+ if (PagePrivate(page))
+ printk("buffer reappeared\n");
+
+ /*
+ * We can't proceed if there remain some references on the oldpage.
+ *
+ * This code may sometimes fail because:
+ * A page may be grabbed twice in the same transaction. During
+ * the page migration, a transaction that already holds the
+ * oldpage tries to grab the newpage, which causes a deadlock.
+ *
+ * The transaction believes both pages are the same, but an access
+ * to the newpage is blocked until the oldpage is released.
+ *
+ * Renaming a file in the same directory is a good example.
+ * It grabs the same page for the directory twice.
+ *
+ * In this case, try to migrate the page later.
+ */
+ switch (wait_on_page_freeable(page, mapping, &vlist, page_index(newpage), nretry, ops)) {
+ case 1:
+ /* truncated */
+ free_truncated_pages(page, newpage, mapping);
+ ops->mmigrate_free_page(page);
+ return 0;
+ case -1:
+ /* failed */
+ goto unmapfail;
+ }
+
+ BUG_ON(mapping != page_mapping(page));
+
+ ops->mmigrate_copy_page(newpage, page);
+
+ if (PageDirty(page))
+ set_page_dirty(newpage);
+ page->mapping = NULL;
+ unlock_page(page);
+ __put_page(page);
+
+ /*
+ * Finally, the newpage has become ready!
+ */
+ SetPageUptodate(newpage);
+
+ if (ops->mmigrate_lru_add_page)
+ ops->mmigrate_lru_add_page(newpage, PageActive(page));
+ ClearPageActive(page);
+ ClearPageSwapCache(page);
+
+ ops->mmigrate_free_page(page);
+
+ /*
+ * Wake up all waiters which have been waiting for completion
+ * of the page migration.
+ */
+ unlock_page(newpage);
+
+ /*
+ * Mlock the newpage if the oldpage had been mlocked.
+ */
+ if (ops->mmigrate_stick_page)
+ ops->mmigrate_stick_page(&vlist);
+ page_cache_release(newpage);
+ return 0;
+
+unmapfail:
+ /*
+ * Roll back all operations.
+ */
+ radix_tree_rewind_page(page, newpage, mapping);
+ if (ops->mmigrate_stick_page)
+ ops->mmigrate_stick_page(&vlist);
+ ClearPageActive(newpage);
+ ClearPageSwapCache(newpage);
+ ops->mmigrate_free_page(newpage);
+ return 1;
+
+radixfail:
+ unlock_page(page);
+ unlock_page(newpage);
+ if (ops->mmigrate_stick_page)
+ ops->mmigrate_stick_page(&vlist);
+ ops->mmigrate_free_page(newpage);
+ return 1;
+}
+
+static struct work_struct lru_drain_wq[NR_CPUS];
+static void
+lru_drain_schedule(void *p)
+{
+ int cpu = get_cpu();
+
+ schedule_work(&lru_drain_wq[cpu]);
+ put_cpu();
+}
+
+/*
+ * Find an appropriate page to be migrated on the LRU lists.
+ */
+
+static struct page *
+get_target_page(struct zone *zone, void *arg)
+{
+ struct page *page, *page2;
+ list_for_each_entry_safe(page, page2, &zone->inactive_list, lru) {
+ if (steal_page_from_lru(zone, page) == NULL)
+ continue;
+ return page;
+ }
+ list_for_each_entry_safe(page, page2, &zone->active_list, lru) {
+ if (steal_page_from_lru(zone, page) == NULL)
+ continue;
+ return page;
+ }
+ return NULL;
+}
+
+int try_to_migrate_pages(struct zone *zone, int destnode,
+ struct page * (*func)(struct zone *, void *), void *arg)
+{
+ struct page *page, *page2;
+ int nr_failed = 0;
+ LIST_HEAD(failedp);
+
+ while(nr_failed < 100) {
+ spin_lock_irq(&zone->lru_lock);
+ page = (*func)(zone, arg);
+ spin_unlock_irq(&zone->lru_lock);
+ if (page == NULL)
+ break;
+ if (PageLocked(page) ||
+ mmigrate_onepage(page, destnode, 1, &mmigrate_ops)) {
+ nr_failed++;
+ list_add(&page->lru, &failedp);
+ }
+ }
+
+ nr_failed = 0;
+ list_for_each_entry_safe(page, page2, &failedp, lru) {
+ list_del(&page->lru);
+ if ( /* !PageLocked(page) && */
+ !mmigrate_onepage(page, destnode, 0, &mmigrate_ops)) {
+ continue;
+ }
+ nr_failed++;
+ spin_lock_irq(&zone->lru_lock);
+ putback_page_to_lru(zone, page);
+ spin_unlock_irq(&zone->lru_lock);
+ page_cache_release(page);
+ }
+ return nr_failed;
+}
+
+/*
+ * The migrate-daemon, started as a kernel thread on demand.
+ *
+ * This migrates all pages in a specified zone one by one. It traverses
+ * the LRU lists of the zone and tries to migrate each page. It doesn't
+ * matter if the page is in the pagecache or in the swapcache or anonymous.
+ *
+ * TODO:
+ * Memory section support. The following code assumes that a whole zone
+ * is going to be removed. You can replace get_target_page() with
+ * a proper function if you want to remove part of memory in a zone.
+ */
+static DECLARE_MUTEX(mmigrated_sem);
+int mmigrated(void *p)
+{
+ struct zone *zone = p;
+ int nr_failed = 0;
+ LIST_HEAD(failedp);
+
+ daemonize("migrate%d", zone->zone_start_pfn);
+ current->flags |= PF_KSWAPD; /* It's fake */
+ if (down_trylock(&mmigrated_sem)) {
+ printk("mmigrated already running\n");
+ return 0;
+ }
+ on_each_cpu(lru_drain_schedule, NULL, 1, 1);
+ nr_failed = try_to_migrate_pages(zone, MIGRATE_ANYNODE, get_target_page, NULL);
+/* if (nr_failed) */
+/* goto retry; */
+ on_each_cpu(lru_drain_schedule, NULL, 1, 1);
+ up(&mmigrated_sem);
+ return 0;
+}
+
+static int __init mmigrated_init(void)
+{
+ int i;
+
+ for(i = 0; i < NR_CPUS; i++)
+ INIT_WORK(&lru_drain_wq[i], (void (*)(void *))lru_add_drain, NULL);
+ return 0;
+}
+
+module_init(mmigrated_init);
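One consumer-side detail of the design above is worth spelling out:
when a migration is rolled back, waiters that were blocked on the
locked newpage wake up, find PG_again set, and must retry their
radix-tree lookup. The sketch below (not part of the patch) shows that
retry pattern for a plain page-cache lookup, mirroring the
do_swap_page() and try_to_unuse() hunks earlier in this patch;
lookup_and_lock() is a hypothetical helper.

#include <linux/mm.h>
#include <linux/pagemap.h>

static struct page *
lookup_and_lock(struct address_space *mapping, pgoff_t index)
{
	struct page *page;

again:
	page = find_lock_page(mapping, index);
	if (page != NULL && PageAgain(page)) {
		/*
		 * The migration this newpage was allocated for has been
		 * rolled back, so the page is stale: drop it and look up
		 * the radix tree again to find the original page.
		 */
		unlock_page(page);
		page_cache_release(page);
		goto again;
	}
	return page;	/* locked, or NULL if nothing is at that index */
}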
* [PATCH] memory hotremoval for linux-2.6.7 [4/16]
2004-07-14 13:41 [PATCH] memory hotremoval for linux-2.6.7 [0/16] Hirokazu Takahashi
` (2 preceding siblings ...)
2004-07-14 14:03 ` [PATCH] memory hotremoval for linux-2.6.7 [3/16] Hirokazu Takahashi
@ 2004-07-14 14:03 ` Hirokazu Takahashi
2004-07-14 14:03 ` [PATCH] memory hotremoval for linux-2.6.7 [5/16] Hirokazu Takahashi
` (11 subsequent siblings)
15 siblings, 0 replies; 17+ messages in thread
From: Hirokazu Takahashi @ 2004-07-14 14:03 UTC (permalink / raw)
To: linux-kernel, lhms-devel; +Cc: linux-mm
$Id: va-emulation_memhotplug.patch,v 1.23 2004/06/17 08:19:45 iwamoto Exp $
diff -dpur linux-2.6.7/arch/i386/Kconfig linux-2.6.7-mh/arch/i386/Kconfig
--- linux-2.6.7/arch/i386/Kconfig Thu Mar 11 11:55:22 2004
+++ linux-2.6.7-mh/arch/i386/Kconfig Thu Apr 1 14:46:19 2004
@@ -736,7 +736,7 @@ config DISCONTIGMEM
config HAVE_ARCH_BOOTMEM_NODE
bool
- depends on NUMA
+ depends on NUMA || MEMHOTPLUG
default y
config HIGHPTE
diff -dpur linux-2.6.7/arch/i386/mm/discontig.c linux-2.6.7-mh/arch/i386/mm/discontig.c
--- linux-2.6.7/arch/i386/mm/discontig.c Sun Apr 4 12:37:23 2004
+++ linux-2.6.7-mh/arch/i386/mm/discontig.c Tue Apr 27 17:41:22 2004
@@ -64,6 +64,7 @@ unsigned long node_end_pfn[MAX_NUMNODES]
extern unsigned long find_max_low_pfn(void);
extern void find_max_pfn(void);
extern void one_highpage_init(struct page *, int, int);
+static unsigned long calculate_blk_remap_pages(void);
extern struct e820map e820;
extern unsigned long init_pg_tables_end;
@@ -111,6 +112,51 @@ int __init get_memcfg_numa_flat(void)
return 1;
}
+int __init get_memcfg_numa_blks(void)
+{
+ int i, pfn;
+
+ printk("NUMA - single node, flat memory mode, but broken in several blocks\n");
+
+ /* Run the memory configuration and find the top of memory. */
+ find_max_pfn();
+ if (max_pfn & (PTRS_PER_PTE - 1)) {
+ pfn = max_pfn & ~(PTRS_PER_PTE - 1);
+ printk("Rounding down maxpfn %ld -> %d\n", max_pfn, pfn);
+ max_pfn = pfn;
+ }
+ for(i = 0; i < MAX_NUMNODES; i++) {
+ pfn = PFN_DOWN(CONFIG_MEMHOTPLUG_BLKSIZE << 20) * i;
+ node_start_pfn[i] = pfn;
+ printk("node %d start %d\n", i, pfn);
+ pfn += PFN_DOWN(CONFIG_MEMHOTPLUG_BLKSIZE << 20);
+ if (pfn < max_pfn)
+ node_end_pfn[i] = pfn;
+ else {
+ node_end_pfn[i] = max_pfn;
+ i++;
+ printk("total %d blocks, max %ld\n", i, max_pfn);
+ break;
+ }
+ }
+
+ printk("physnode_map");
+ /* Needed for pfn_to_nid */
+ for (pfn = node_start_pfn[0]; pfn <= max_pfn;
+ pfn += PAGES_PER_ELEMENT)
+ {
+ physnode_map[pfn / PAGES_PER_ELEMENT] =
+ pfn / PFN_DOWN(CONFIG_MEMHOTPLUG_BLKSIZE << 20);
+ printk(" %d", physnode_map[pfn / PAGES_PER_ELEMENT]);
+ }
+ printk("\n");
+
+ node_set_online(0);
+ numnodes = i;
+
+ return 1;
+}
+
/*
* Find the highest page frame number we have available for the node
*/
@@ -132,11 +178,21 @@ static void __init find_max_pfn_node(int
* Allocate memory for the pg_data_t via a crude pre-bootmem method
* We ought to relocate these onto their own node later on during boot.
*/
-static void __init allocate_pgdat(int nid)
+static void allocate_pgdat(int nid)
{
- if (nid)
+ if (nid) {
+#ifndef CONFIG_MEMHOTPLUG
NODE_DATA(nid) = (pg_data_t *)node_remap_start_vaddr[nid];
- else {
+#else
+ int remapsize;
+ unsigned long addr;
+
+ remapsize = calculate_blk_remap_pages();
+ addr = (unsigned long)(pfn_to_kaddr(max_low_pfn +
+ (nid - 1) * remapsize));
+ NODE_DATA(nid) = (void *)addr;
+#endif
+ } else {
NODE_DATA(nid) = (pg_data_t *)(__va(min_low_pfn << PAGE_SHIFT));
min_low_pfn += PFN_UP(sizeof(pg_data_t));
memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
@@ -185,6 +241,7 @@ static void __init register_bootmem_low_
void __init remap_numa_kva(void)
{
+#ifndef CONFIG_MEMHOTPLUG
void *vaddr;
unsigned long pfn;
int node;
@@ -197,6 +254,7 @@ void __init remap_numa_kva(void)
PAGE_KERNEL_LARGE);
}
}
+#endif
}
static unsigned long calculate_numa_remap_pages(void)
@@ -227,6 +285,21 @@ static unsigned long calculate_numa_rema
return reserve_pages;
}
+static unsigned long calculate_blk_remap_pages(void)
+{
+ unsigned long size;
+
+ /* calculate the size of the mem_map needed in bytes */
+ size = (PFN_DOWN(CONFIG_MEMHOTPLUG_BLKSIZE << 20) + 1)
+ * sizeof(struct page) + sizeof(pg_data_t);
+ /* convert size to large (pmd size) pages, rounding up */
+ size = (size + LARGE_PAGE_BYTES - 1) / LARGE_PAGE_BYTES;
+ /* now the roundup is correct, convert to PAGE_SIZE pages */
+ size = size * PTRS_PER_PTE;
+
+ return size;
+}
+
unsigned long __init setup_memory(void)
{
int nid;
@@ -234,13 +307,14 @@ unsigned long __init setup_memory(void)
unsigned long reserve_pages;
get_memcfg_numa();
- reserve_pages = calculate_numa_remap_pages();
+ reserve_pages = calculate_blk_remap_pages() * (numnodes - 1);
/* partially used pages are not usable - thus round upwards */
system_start_pfn = min_low_pfn = PFN_UP(init_pg_tables_end);
find_max_pfn();
- system_max_low_pfn = max_low_pfn = find_max_low_pfn();
+ system_max_low_pfn = max_low_pfn = (find_max_low_pfn() &
+ ~(PTRS_PER_PTE - 1));
#ifdef CONFIG_HIGHMEM
highstart_pfn = highend_pfn = max_pfn;
if (max_pfn > system_max_low_pfn)
@@ -256,14 +330,19 @@ unsigned long __init setup_memory(void)
printk("Low memory ends at vaddr %08lx\n",
(ulong) pfn_to_kaddr(max_low_pfn));
+#ifdef CONFIG_MEMHOTPLUG
+ for (nid = 1; nid < numnodes; nid++)
+ NODE_DATA(nid) = NULL;
+ nid = 0;
+ {
+#else
for (nid = 0; nid < numnodes; nid++) {
+#endif
node_remap_start_vaddr[nid] = pfn_to_kaddr(
- highstart_pfn - node_remap_offset[nid]);
+ max_low_pfn + calculate_blk_remap_pages() * nid);
allocate_pgdat(nid);
- printk ("node %d will remap to vaddr %08lx - %08lx\n", nid,
- (ulong) node_remap_start_vaddr[nid],
- (ulong) pfn_to_kaddr(highstart_pfn
- - node_remap_offset[nid] + node_remap_size[nid]));
+ printk ("node %d will remap to vaddr %08lx - \n", nid,
+ (ulong) node_remap_start_vaddr[nid]);
}
printk("High memory starts at vaddr %08lx\n",
(ulong) pfn_to_kaddr(highstart_pfn));
@@ -275,9 +354,12 @@ unsigned long __init setup_memory(void)
/*
* Initialize the boot-time allocator (with low memory only):
*/
- bootmap_size = init_bootmem_node(NODE_DATA(0), min_low_pfn, 0, system_max_low_pfn);
+ bootmap_size = init_bootmem_node(NODE_DATA(0), min_low_pfn, 0,
+ (system_max_low_pfn > node_end_pfn[0]) ?
+ node_end_pfn[0] : system_max_low_pfn);
- register_bootmem_low_pages(system_max_low_pfn);
+ register_bootmem_low_pages((system_max_low_pfn > node_end_pfn[0]) ?
+ node_end_pfn[0] : system_max_low_pfn);
/*
* Reserve the bootmem bitmap itself as well. We do this in two
@@ -342,14 +424,26 @@ void __init zone_sizes_init(void)
* Clobber node 0's links and NULL out pgdat_list before starting.
*/
pgdat_list = NULL;
- for (nid = numnodes - 1; nid >= 0; nid--) {
+#ifndef CONFIG_MEMHOTPLUG
+ for (nid = numnodes - 1; nid >= 0; nid--) {
+#else
+ nid = 0;
+ {
+#endif
if (nid)
memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
+ if (nid == 0)
+ NODE_DATA(nid)->enabled = 1;
NODE_DATA(nid)->pgdat_next = pgdat_list;
pgdat_list = NODE_DATA(nid);
}
+#ifdef CONFIG_MEMHOTPLUG
+ nid = 0;
+ {
+#else
for (nid = 0; nid < numnodes; nid++) {
+#endif
unsigned long zones_size[MAX_NR_ZONES] = {0, 0, 0};
unsigned long *zholes_size;
unsigned int max_dma;
@@ -368,14 +462,17 @@ void __init zone_sizes_init(void)
} else {
if (low < max_dma)
zones_size[ZONE_DMA] = low;
- else {
+ else if (low <= high) {
BUG_ON(max_dma > low);
- BUG_ON(low > high);
zones_size[ZONE_DMA] = max_dma;
zones_size[ZONE_NORMAL] = low - max_dma;
#ifdef CONFIG_HIGHMEM
zones_size[ZONE_HIGHMEM] = high - low;
#endif
+ } else {
+ BUG_ON(max_dma > low);
+ zones_size[ZONE_DMA] = max_dma;
+ zones_size[ZONE_NORMAL] = high - max_dma;
}
}
zholes_size = get_zholes_size(nid);
@@ -405,7 +502,11 @@ void __init set_highmem_pages_init(int b
#ifdef CONFIG_HIGHMEM
int nid;
+#ifdef CONFIG_MEMHOTPLUG
+ for (nid = 0; nid < 1; nid++) {
+#else
for (nid = 0; nid < numnodes; nid++) {
+#endif
unsigned long node_pfn, node_high_size, zone_start_pfn;
struct page * zone_mem_map;
@@ -423,12 +524,234 @@ void __init set_highmem_pages_init(int b
#endif
}
-void __init set_max_mapnr_init(void)
+void set_max_mapnr_init(void)
{
#ifdef CONFIG_HIGHMEM
+#ifndef CONFIG_MEMHOTPLUG
highmem_start_page = NODE_DATA(0)->node_zones[ZONE_HIGHMEM].zone_mem_map;
+#else
+ struct pglist_data *z = NULL;
+ int i;
+
+ for (i = 0; i < numnodes; i++) {
+ if (NODE_DATA(i) == NULL)
+ continue;
+ z = NODE_DATA(i);
+ highmem_start_page = z->node_zones[ZONE_HIGHMEM].zone_mem_map;
+ if (highmem_start_page != NULL)
+ break;
+ }
+ if (highmem_start_page == NULL)
+ highmem_start_page =
+ z->node_zones[ZONE_NORMAL].zone_mem_map +
+ z->node_zones[ZONE_NORMAL].spanned_pages;
+#endif
num_physpages = highend_pfn;
#else
num_physpages = max_low_pfn;
#endif
}
+
+void
+plug_node(int nid)
+{
+ unsigned long zones_size[MAX_NR_ZONES] = {0, 0, 0};
+ unsigned long *zholes_size, addr, pfn;
+ unsigned long remapsize;
+ unsigned long flags;
+ int i, j;
+ struct page *node_mem_map, *page;
+ pg_data_t **pgdat;
+ struct mm_struct *mm;
+
+ unsigned long start = node_start_pfn[nid];
+ unsigned long high = node_end_pfn[nid];
+
+ BUG_ON(nid == 0);
+
+ allocate_pgdat(nid);
+
+ remapsize = calculate_blk_remap_pages();
+ addr = (unsigned long)(pfn_to_kaddr(max_low_pfn +
+ (nid - 1) * remapsize));
+
+	/* shrink the size; in the normal NUMA case this is done in
+	   calculate_numa_remap_pages() */
+ high -= remapsize;
+ BUG_ON(start > high);
+
+ for(pfn = 0; pfn < remapsize; pfn += PTRS_PER_PTE)
+ set_pmd_pfn(addr + (pfn << PAGE_SHIFT), high + pfn,
+ PAGE_KERNEL_LARGE);
+ spin_lock_irqsave(&pgd_lock, flags);
+ for (page = pgd_list; page; page = (struct page *)page->index) {
+ for(pfn = 0; pfn < remapsize; pfn += PTRS_PER_PTE) {
+ pgd_t *pgd;
+ pmd_t *pmd;
+
+ pgd = (pgd_t *)page_address(page) +
+ pgd_index(addr + (pfn << PAGE_SHIFT));
+ pmd = pmd_offset(pgd, addr + (pfn << PAGE_SHIFT));
+ set_pmd(pmd, pfn_pmd(high + pfn, PAGE_KERNEL_LARGE));
+ }
+ }
+ spin_unlock_irqrestore(&pgd_lock, flags);
+ flush_tlb_all();
+
+ node_mem_map = (struct page *)((addr + sizeof(pg_data_t) +
+ PAGE_SIZE - 1) & PAGE_MASK);
+ memset(node_mem_map, 0, (remapsize << PAGE_SHIFT) -
+ ((char *)node_mem_map - (char *)addr));
+
+ printk("plug_node: %p %p\n", NODE_DATA(nid), node_mem_map);
+ memset(NODE_DATA(nid), 0, sizeof(*NODE_DATA(nid)));
+ printk("zeroed nodedata\n");
+
+ /* XXX defaults to hotremovable */
+ NODE_DATA(nid)->removable = 1;
+
+ BUG_ON(virt_to_phys((char *)MAX_DMA_ADDRESS) >> PAGE_SHIFT > start);
+ if (start <= max_low_pfn)
+ zones_size[ZONE_NORMAL] =
+ (max_low_pfn > high ? high : max_low_pfn) - start;
+#ifdef CONFIG_HIGHMEM
+ if (high > max_low_pfn)
+ zones_size[ZONE_HIGHMEM] = high -
+ ((start > max_low_pfn) ? start : max_low_pfn);
+#endif
+ zholes_size = get_zholes_size(nid);
+ free_area_init_node(nid, NODE_DATA(nid), node_mem_map, zones_size,
+ start, zholes_size);
+
+ /* lock? */
+ for(pgdat = &pgdat_list; *pgdat; pgdat = &(*pgdat)->pgdat_next)
+ if ((*pgdat)->node_id > nid) {
+ NODE_DATA(nid)->pgdat_next = *pgdat;
+ *pgdat = NODE_DATA(nid);
+ break;
+ }
+ if (*pgdat == NULL)
+ *pgdat = NODE_DATA(nid);
+ {
+ struct zone *z;
+ for_each_zone (z)
+ printk("%p ", z);
+ printk("\n");
+ }
+ set_max_mapnr_init();
+
+ for(i = 0; i < MAX_NR_ZONES; i++) {
+ struct zone *z;
+ struct page *p;
+
+ z = &NODE_DATA(nid)->node_zones[i];
+
+ for(j = 0; j < z->spanned_pages; j++) {
+ p = &z->zone_mem_map[j];
+ ClearPageReserved(p);
+ if (i == ZONE_HIGHMEM)
+ set_bit(PG_highmem, &p->flags);
+ set_page_count(p, 1);
+ __free_page(p);
+ }
+ }
+ kswapd_start_one(NODE_DATA(nid));
+ setup_per_zone_pages_min();
+}
+
+void
+enable_node(int node)
+{
+ int i;
+ struct zone *z;
+
+ NODE_DATA(node)->enabled = 1;
+ build_all_zonelists();
+
+ for(i = 0; i < MAX_NR_ZONES; i++) {
+ z = zone_table[NODEZONE(node, i)];
+ totalram_pages += z->present_pages;
+ if (i == ZONE_HIGHMEM)
+ totalhigh_pages += z->present_pages;
+ }
+}
+
+void
+makepermanent_node(int node)
+{
+
+ NODE_DATA(node)->removable = 0;
+ build_all_zonelists();
+}
+
+void
+disable_node(int node)
+{
+ int i;
+ struct zone *z;
+
+ NODE_DATA(node)->enabled = 0;
+ build_all_zonelists();
+
+ for(i = 0; i < MAX_NR_ZONES; i++) {
+ z = zone_table[NODEZONE(node, i)];
+ totalram_pages -= z->present_pages;
+ if (i == ZONE_HIGHMEM)
+ totalhigh_pages -= z->present_pages;
+ }
+}
+
+int
+unplug_node(int nid)
+{
+ int i;
+ struct zone *z;
+ pg_data_t *pgdat;
+ struct page *page;
+ unsigned long addr, pfn, remapsize;
+ unsigned long flags;
+
+ if (NODE_DATA(nid)->enabled)
+ return -1;
+ for(i = 0; i < MAX_NR_ZONES; i++) {
+ z = zone_table[NODEZONE(nid, i)];
+ if (z->present_pages != z->free_pages)
+ return -1;
+ }
+ kthread_stop(NODE_DATA(nid)->kswapd);
+
+ /* lock? */
+ for(pgdat = pgdat_list; pgdat; pgdat = pgdat->pgdat_next)
+ if (pgdat->pgdat_next == NODE_DATA(nid)) {
+ pgdat->pgdat_next = pgdat->pgdat_next->pgdat_next;
+ break;
+ }
+ BUG_ON(pgdat == NULL);
+
+ for(i = 0; i < MAX_NR_ZONES; i++)
+ zone_table[NODEZONE(nid, i)] = NULL;
+ NODE_DATA(nid) = NULL;
+
+ /* unmap node_mem_map */
+ remapsize = calculate_blk_remap_pages();
+ addr = (unsigned long)(pfn_to_kaddr(max_low_pfn +
+ (nid - 1) * remapsize));
+ for(pfn = 0; pfn < remapsize; pfn += PTRS_PER_PTE)
+ set_pmd_pfn(addr + (pfn << PAGE_SHIFT), 0, __pgprot(0));
+ spin_lock_irqsave(&pgd_lock, flags);
+ for (page = pgd_list; page; page = (struct page *)page->index) {
+ for(pfn = 0; pfn < remapsize; pfn += PTRS_PER_PTE) {
+ pgd_t *pgd;
+ pmd_t *pmd;
+
+ pgd = (pgd_t *)page_address(page) +
+ pgd_index(addr + (pfn << PAGE_SHIFT));
+ pmd = pmd_offset(pgd, addr + (pfn << PAGE_SHIFT));
+ pmd_clear(pmd);
+ }
+ }
+ spin_unlock_irqrestore(&pgd_lock, flags);
+ flush_tlb_all();
+
+ return 0;
+}
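
For illustration only (not part of the patch): get_memcfg_numa_blks() treats a flat
memory layout as a series of equally sized pseudo-nodes of CONFIG_MEMHOTPLUG_BLKSIZE
megabytes, so that each block can later be plugged or unplugged on its own. A
stand-alone sketch of the same block arithmetic; the block size and max_pfn below are
made-up example values, not taken from the patch.

/* Sketch of the block layout computed by get_memcfg_numa_blks().
 * Assumes 4 KB pages; BLKSIZE_MB and max_pfn are example values only.
 */
#include <stdio.h>

#define PAGE_SHIFT 12
#define BLKSIZE_MB 1024                    /* stand-in for CONFIG_MEMHOTPLUG_BLKSIZE */
#define BLK_PFNS   ((unsigned long)(BLKSIZE_MB << 20) >> PAGE_SHIFT)
#define MAX_NODES  8

int main(void)
{
    unsigned long max_pfn = 0x140000;      /* example: 5 GB of RAM in 4 KB pages */
    int nid;

    for (nid = 0; nid < MAX_NODES; nid++) {
        unsigned long start = BLK_PFNS * nid;
        unsigned long end = start + BLK_PFNS;

        if (end >= max_pfn)
            end = max_pfn;
        printf("node %d: pfn %#lx - %#lx\n", nid, start, end);
        if (end == max_pfn)
            break;
    }
    return 0;
}
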
diff -dpur linux-2.6.7/arch/i386/mm/init.c linux-2.6.7-mh/arch/i386/mm/init.c
--- linux-2.6.7/arch/i386/mm/init.c Thu Mar 11 11:55:37 2004
+++ linux-2.6.7-mh/arch/i386/mm/init.c Wed Mar 31 19:38:26 2004
@@ -43,6 +43,7 @@
DEFINE_PER_CPU(struct mmu_gather, mmu_gathers);
unsigned long highstart_pfn, highend_pfn;
+extern unsigned long node_end_pfn[MAX_NUMNODES];
static int do_test_wp_bit(void);
@@ -481,7 +482,11 @@ void __init mem_init(void)
totalram_pages += __free_all_bootmem();
reservedpages = 0;
+#ifdef CONFIG_MEMHOTPLUG
+ for (tmp = 0; tmp < node_end_pfn[0]; tmp++)
+#else
for (tmp = 0; tmp < max_low_pfn; tmp++)
+#endif
/*
* Only count reserved RAM pages
*/
diff -dpur linux-2.6.7/include/asm-i386/mmzone.h linux-2.6.7-mh/include/asm-i386/mmzone.h
--- linux-2.6.7/include/asm-i386/mmzone.h Thu Mar 11 11:55:27 2004
+++ linux-2.6.7-mh/include/asm-i386/mmzone.h Wed Mar 31 19:38:26 2004
@@ -17,7 +17,9 @@
#include <asm/srat.h>
#endif
#else /* !CONFIG_NUMA */
+#ifndef CONFIG_MEMHOTPLUG
#define get_memcfg_numa get_memcfg_numa_flat
+#endif
#define get_zholes_size(n) (0)
#endif /* CONFIG_NUMA */
@@ -41,7 +43,7 @@ extern u8 physnode_map[];
static inline int pfn_to_nid(unsigned long pfn)
{
-#ifdef CONFIG_NUMA
+#if defined(CONFIG_NUMA) || defined (CONFIG_MEMHOTPLUG)
return(physnode_map[(pfn) / PAGES_PER_ELEMENT]);
#else
return 0;
@@ -132,6 +134,10 @@ static inline int pfn_valid(int pfn)
#endif
extern int get_memcfg_numa_flat(void );
+#ifdef CONFIG_MEMHOTPLUG
+extern int get_memcfg_numa_blks(void);
+#endif
+
/*
* This allows any one NUMA architecture to be compiled
* for, and still fall back to the flat function if it
@@ -144,6 +150,9 @@ static inline void get_memcfg_numa(void)
return;
#elif CONFIG_ACPI_SRAT
if (get_memcfg_from_srat())
+ return;
+#elif CONFIG_MEMHOTPLUG
+ if (get_memcfg_numa_blks())
return;
#endif
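
A note on the pfn_to_nid() change above: with CONFIG_MEMHOTPLUG it goes through
physnode_map[] even on otherwise non-NUMA boxes, which is how a page frame gets
resolved to the block (pseudo-node) that owns it. A stand-alone sketch of that table
lookup, filled the way get_memcfg_numa_blks() fills it; PAGES_PER_ELEMENT and the
sizes below are illustrative values only.

#include <stdio.h>

#define PAGES_PER_ELEMENT 0x4000UL     /* assumed granularity: 64 MB in 4 KB pages */

static unsigned char physnode_map[128];

static int pfn_to_nid(unsigned long pfn)
{
    return physnode_map[pfn / PAGES_PER_ELEMENT];
}

int main(void)
{
    unsigned long block_pfns = 0x40000;    /* 1 GB blocks, as in the earlier sketch */
    unsigned long max_pfn = 0x140000;
    unsigned long pfn;

    /* fill the map the way get_memcfg_numa_blks() does */
    for (pfn = 0; pfn < max_pfn; pfn += PAGES_PER_ELEMENT)
        physnode_map[pfn / PAGES_PER_ELEMENT] = pfn / block_pfns;

    printf("pfn 0x50000 lives in block/node %d\n", pfn_to_nid(0x50000));
    return 0;
}
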
diff -dpur linux-2.6.7/include/asm-i386/numnodes.h linux-2.6.7-mh/include/asm-i386/numnodes.h
--- linux-2.6.7/include/asm-i386/numnodes.h Thu Mar 11 11:55:23 2004
+++ linux-2.6.7-mh/include/asm-i386/numnodes.h Wed Mar 31 19:38:26 2004
@@ -13,6 +13,8 @@
/* Max 8 Nodes */
#define NODES_SHIFT 3
-#endif /* CONFIG_X86_NUMAQ */
+#elif defined(CONFIG_MEMHOTPLUG)
+#define NODES_SHIFT 3
+#endif
#endif /* _ASM_MAX_NUMNODES_H */
diff -dpur linux-2.6.7/mm/page_alloc.c linux-2.6.7-mh/mm/page_alloc.c
--- linux-2.6.7/mm/page_alloc.c Thu Mar 11 11:55:22 2004
+++ linux-2.6.7-mh/mm/page_alloc.c Thu Apr 1 16:54:26 2004
@@ -1177,7 +1177,12 @@ static inline unsigned long wait_table_b
#define LONG_ALIGN(x) (((x)+(sizeof(long))-1)&~((sizeof(long))-1))
-static void __init calculate_zone_totalpages(struct pglist_data *pgdat,
+#ifdef CONFIG_MEMHOTPLUG
+static void
+#else
+static void __init
+#endif
+calculate_zone_totalpages(struct pglist_data *pgdat,
unsigned long *zones_size, unsigned long *zholes_size)
{
unsigned long realtotalpages, totalpages = 0;
@@ -1231,8 +1236,13 @@ void __init memmap_init_zone(struct page
* - mark all memory queues empty
* - clear the memory bitmaps
*/
-static void __init free_area_init_core(struct pglist_data *pgdat,
- unsigned long *zones_size, unsigned long *zholes_size)
+#ifdef CONFIG_MEMHOTPLUG
+static void
+#else
+static void __init
+#endif
+free_area_init_core(struct pglist_data *pgdat,
+ unsigned long *zones_size, unsigned long *zholes_size)
{
unsigned long i, j;
const unsigned long zone_required_alignment = 1UL << (MAX_ORDER-1);
@@ -1371,7 +1381,12 @@ static void __init free_area_init_core(s
}
}
-void __init free_area_init_node(int nid, struct pglist_data *pgdat,
+#ifdef CONFIG_MEMHOTPLUG
+void
+#else
+void __init
+#endif
+free_area_init_node(int nid, struct pglist_data *pgdat,
struct page *node_mem_map, unsigned long *zones_size,
unsigned long node_start_pfn, unsigned long *zholes_size)
{
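
The only purpose of the three page_alloc.c hunks above is to drop __init when
CONFIG_MEMHOTPLUG is set, so that calculate_zone_totalpages(), free_area_init_core()
and free_area_init_node() stay resident after boot and can be called from plug_node().
A more compact way to express the same intent would be a single annotation macro,
sketched below; the macro name is hypothetical and not part of this patch (later
mainline kernels adopted a similar idea with __meminit).

/* Stand-alone sketch; __init is stubbed out so this compiles on its own. */
#define __init                         /* stub for the demo */

#ifdef CONFIG_MEMHOTPLUG
#define __hotplug_init                 /* keep the function resident after boot */
#else
#define __hotplug_init __init          /* discard it with the other init code */
#endif

static void __hotplug_init free_area_setup_demo(void) { }

int main(void) { free_area_setup_demo(); return 0; }
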
--
* [PATCH] memory hotremoval for linux-2.6.7 [5/16]
2004-07-14 13:41 [PATCH] memory hotremoval for linux-2.6.7 [0/16] Hirokazu Takahashi
` (3 preceding siblings ...)
2004-07-14 14:03 ` [PATCH] memory hotremoval for linux-2.6.7 [4/16] Hirokazu Takahashi
@ 2004-07-14 14:03 ` Hirokazu Takahashi
2004-07-14 14:04 ` [PATCH] memory hotremoval for linux-2.6.7 [6/16] Hirokazu Takahashi
` (10 subsequent siblings)
15 siblings, 0 replies; 17+ messages in thread
From: Hirokazu Takahashi @ 2004-07-14 14:03 UTC (permalink / raw)
To: linux-kernel, lhms-devel; +Cc: linux-mm
$Id: va-proc_memhotplug.patch,v 1.14 2004/07/06 09:05:54 taka Exp $
diff -dpur linux-2.6.7/mm/page_alloc.c linux-2.6.7-mh/mm/page_alloc.c
--- linux-2.6.7/mm/page_alloc.c.orig 2004-06-17 16:28:03.000000000 +0900
+++ linux-2.6.7-mh/mm/page_alloc.c 2004-06-17 16:28:34.000000000 +0900
@@ -32,6 +32,7 @@
#include <linux/topology.h>
#include <linux/sysctl.h>
#include <linux/cpu.h>
+#include <linux/proc_fs.h>
#include <asm/tlbflush.h>
@@ -2120,3 +2121,244 @@ int lower_zone_protection_sysctl_handler
setup_per_zone_protection();
return 0;
}
+
+#ifdef CONFIG_MEMHOTPLUG
+static int mhtest_read(char *page, char **start, off_t off, int count,
+ int *eof, void *data)
+{
+ char *p;
+ int i, j, len;
+ const struct pglist_data *pgdat;
+ const struct zone *z;
+
+ p = page;
+ for(i = 0; i < numnodes; i++) {
+ pgdat = NODE_DATA(i);
+ if (pgdat == NULL)
+ continue;
+ len = sprintf(p, "Node %d %sabled %shotremovable\n", i,
+ pgdat->enabled ? "en" : "dis",
+ pgdat->removable ? "" : "non");
+ p += len;
+ for (j = 0; j < MAX_NR_ZONES; j++) {
+ z = &pgdat->node_zones[j];
+ if (! z->present_pages)
+ /* skip empty zone */
+ continue;
+ len = sprintf(p,
+ "\t%s[%d]: free %ld, active %ld, present %ld\n",
+ z->name, NODEZONE(i, j),
+ z->free_pages, z->nr_active, z->present_pages);
+ p += len;
+ }
+ *p++ = '\n';
+ }
+ len = p - page;
+
+ if (len <= off + count)
+ *eof = 1;
+ *start = page + off;
+ len -= off;
+ if (len < 0)
+ len = 0;
+ if (len > count)
+ len = count;
+
+ return len;
+}
+
+static void mhtest_enable(int);
+static void mhtest_disable(int);
+static void mhtest_plug(int);
+static void mhtest_unplug(int);
+static void mhtest_purge(int);
+static void mhtest_remap(int);
+static void mhtest_active(int);
+static void mhtest_inuse(int);
+
+const static struct {
+ char *cmd;
+ void (*func)(int);
+ char zone_check;
+} mhtest_cmds[] = {
+ { "disable", mhtest_disable, 0 },
+ { "enable", mhtest_enable, 0 },
+ { "plug", mhtest_plug, 0 },
+ { "unplug", mhtest_unplug, 0 },
+ { "purge", mhtest_purge, 1 },
+ { "remap", mhtest_remap, 1 },
+ { "active", mhtest_active, 1 },
+ { "inuse", mhtest_inuse, 1 },
+ { NULL, NULL }};
+
+static void
+mhtest_disable(int idx) {
+ int i, z;
+
+ printk("disable %d\n", idx);
+ /* XXX */
+ for (z = 0; z < MAX_NR_ZONES; z++) {
+ for (i = 0; i < NR_CPUS; i++) {
+ struct per_cpu_pages *pcp;
+
+ pcp = &zone_table[NODEZONE(idx, z)]->pageset[i].pcp[0]; /* hot */
+ pcp->low = pcp->high = 0;
+
+ pcp = &zone_table[NODEZONE(idx, z)]->pageset[i].pcp[1]; /* cold */
+ pcp->low = pcp->high = 0;
+ }
+ zone_table[NODEZONE(idx, z)]->pages_high =
+ zone_table[NODEZONE(idx, z)]->present_pages;
+ }
+ disable_node(idx);
+}
+static void
+mhtest_enable(int idx) {
+ int i, z;
+
+ printk("enable %d\n", idx);
+ for (z = 0; z < MAX_NR_ZONES; z++) {
+ zone_table[NODEZONE(idx, z)]->pages_high =
+ zone_table[NODEZONE(idx, z)]->pages_min * 3;
+ /* XXX */
+ for (i = 0; i < NR_CPUS; i++) {
+ struct per_cpu_pages *pcp;
+
+ pcp = &zone_table[NODEZONE(idx, z)]->pageset[i].pcp[0]; /* hot */
+ pcp->low = 2 * pcp->batch;
+ pcp->high = 6 * pcp->batch;
+
+ pcp = &zone_table[NODEZONE(idx, z)]->pageset[i].pcp[1]; /* cold */
+ pcp->high = 2 * pcp->batch;
+ }
+ }
+ enable_node(idx);
+}
+
+static void
+mhtest_plug(int idx) {
+
+ if (NODE_DATA(idx) != NULL) {
+ printk("Already plugged\n");
+ return;
+ }
+ plug_node(idx);
+}
+
+static void
+mhtest_unplug(int idx) {
+
+ unplug_node(idx);
+}
+
+static void
+mhtest_purge(int idx)
+{
+ printk("purge %d\n", idx);
+ wake_up_interruptible(&zone_table[idx]->zone_pgdat->kswapd_wait);
+ /* XXX overkill, but who cares? */
+ on_each_cpu(drain_local_pages, NULL, 1, 1);
+}
+
+static void
+mhtest_remap(int idx) {
+
+ on_each_cpu(drain_local_pages, NULL, 1, 1);
+ kernel_thread(mmigrated, zone_table[idx], CLONE_KERNEL);
+}
+
+static void
+mhtest_active(int idx)
+{
+ struct list_head *l;
+ int i;
+
+ if (zone_table[idx] == NULL)
+ return;
+ spin_lock_irq(&zone_table[idx]->lru_lock);
+ i = 0;
+ list_for_each(l, &zone_table[idx]->active_list) {
+ printk(" %lx", (unsigned long)list_entry(l, struct page, lru));
+ i++;
+ if (i == 10)
+ break;
+ }
+ spin_unlock_irq(&zone_table[idx]->lru_lock);
+ printk("\n");
+}
+
+static void
+mhtest_inuse(int idx)
+{
+ int i;
+
+ if (zone_table[idx] == NULL)
+ return;
+ for(i = 0; i < zone_table[idx]->spanned_pages; i++)
+ if (page_count(&zone_table[idx]->zone_mem_map[i]))
+ printk(" %p", &zone_table[idx]->zone_mem_map[i]);
+ printk("\n");
+}
+
+static int mhtest_write(struct file *file, const char *buffer,
+ unsigned long count, void *data)
+{
+ int idx;
+ char buf[64], *p;
+ int i;
+
+ if (count > sizeof(buf) - 1)
+ count = sizeof(buf) - 1;
+ if (copy_from_user(buf, buffer, count))
+ return -EFAULT;
+
+ buf[count] = 0;
+
+ p = strchr(buf, ' ');
+ if (p == NULL)
+ goto out;
+
+ *p++ = '\0';
+ idx = (int)simple_strtoul(p, NULL, 0);
+
+ if (idx > MAX_NR_ZONES*MAX_NUMNODES) {
+ printk("Argument out of range\n");
+ goto out;
+ }
+
+ for(i = 0; ; i++) {
+ if (mhtest_cmds[i].cmd == NULL)
+ break;
+ if (strcmp(buf, mhtest_cmds[i].cmd) == 0) {
+ if (mhtest_cmds[i].zone_check) {
+ if (zone_table[idx] == NULL) {
+ printk("Zone %d not plugged\n", idx);
+ return count;
+ }
+ } else if (strcmp(buf, "plug") != 0 &&
+ NODE_DATA(idx) == NULL) {
+ printk("Node %d not plugged\n", idx);
+ return count;
+ }
+ (mhtest_cmds[i].func)(idx);
+ break;
+ }
+ }
+out:
+ return count;
+}
+
+static int __init procmhtest_init(void)
+{
+ struct proc_dir_entry *entry;
+
+ entry = create_proc_entry("memhotplug", 0, NULL);
+ if (entry == NULL)
+ return -1;
+
+ entry->read_proc = &mhtest_read;
+ entry->write_proc = &mhtest_write;
+ return 0;
+}
+__initcall(procmhtest_init);
+#endif
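
The write handler above expects lines of the form "<command> <index>", where the
commands are the entries of mhtest_cmds[] and the index is a node number (for
plug/unplug/enable/disable) or a zone index as printed when reading the file (for
purge/remap/active/inuse). A minimal user-space sketch of driving the interface; the
index values below are just examples.

#include <stdio.h>

static int mh_cmd(const char *cmd)
{
    FILE *f = fopen("/proc/memhotplug", "w");

    if (!f) {
        perror("/proc/memhotplug");
        return -1;
    }
    fputs(cmd, f);
    return fclose(f);
}

int main(void)
{
    mh_cmd("plug 1");       /* instantiate node 1 via plug_node()              */
    mh_cmd("enable 1");     /* let the allocator use it (enable_node())        */
    mh_cmd("disable 1");    /* stop feeding it new allocations                 */
    mh_cmd("purge 4");      /* wake kswapd / drain pcp lists for zone index 4  */
    mh_cmd("remap 4");      /* start the mmigrated thread on that zone         */
    mh_cmd("unplug 1");     /* tear the node down once its zones are empty     */
    return 0;
}
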
--
* [PATCH] memory hotremoval for linux-2.6.7 [6/16]
2004-07-14 13:41 [PATCH] memory hotremoval for linux-2.6.7 [0/16] Hirokazu Takahashi
` (4 preceding siblings ...)
2004-07-14 14:03 ` [PATCH] memory hotremoval for linux-2.6.7 [5/16] Hirokazu Takahashi
@ 2004-07-14 14:04 ` Hirokazu Takahashi
2004-07-14 14:04 ` [PATCH] memory hotremoval for linux-2.6.7 [7/16] Hirokazu Takahashi
` (9 subsequent siblings)
15 siblings, 0 replies; 17+ messages in thread
From: Hirokazu Takahashi @ 2004-07-14 14:04 UTC (permalink / raw)
To: linux-kernel, lhms-devel; +Cc: linux-mm
$Id: va-aio.patch,v 1.11 2004/06/17 08:19:45 iwamoto Exp $
--- linux-2.6.7.ORG/arch/i386/kernel/sys_i386.c.orig 2004-06-17 16:20:02.000000000 +0900
+++ linux-2.6.7/arch/i386/kernel/sys_i386.c 2004-06-17 16:20:12.000000000 +0900
@@ -70,7 +70,7 @@ asmlinkage long sys_mmap2(unsigned long
unsigned long prot, unsigned long flags,
unsigned long fd, unsigned long pgoff)
{
- return do_mmap2(addr, len, prot, flags, fd, pgoff);
+ return do_mmap2(addr, len, prot, flags & ~MAP_IMMOVABLE, fd, pgoff);
}
/*
@@ -101,7 +101,8 @@ asmlinkage int old_mmap(struct mmap_arg_
if (a.offset & ~PAGE_MASK)
goto out;
- err = do_mmap2(a.addr, a.len, a.prot, a.flags, a.fd, a.offset >> PAGE_SHIFT);
+ err = do_mmap2(a.addr, a.len, a.prot, a.flags & ~MAP_IMMOVABLE,
+ a.fd, a.offset >> PAGE_SHIFT);
out:
return err;
}
--- linux-2.6.7.ORG/fs/aio.c 2004-06-17 16:20:02.000000000 +0900
+++ linux-2.6.7/fs/aio.c 2004-06-17 16:20:12.000000000 +0900
@@ -130,7 +130,8 @@ static int aio_setup_ring(struct kioctx
dprintk("attempting mmap of %lu bytes\n", info->mmap_size);
down_write(&ctx->mm->mmap_sem);
info->mmap_base = do_mmap(NULL, 0, info->mmap_size,
- PROT_READ|PROT_WRITE, MAP_ANON|MAP_PRIVATE,
+ PROT_READ|PROT_WRITE,
+ MAP_ANON|MAP_PRIVATE|MAP_IMMOVABLE,
0);
if (IS_ERR((void *)info->mmap_base)) {
up_write(&ctx->mm->mmap_sem);
--- linux-2.6.7.ORG/include/asm-i386/mman.h 2004-06-17 16:20:02.000000000 +0900
+++ linux-2.6.7/include/asm-i386/mman.h 2004-06-17 16:20:12.000000000 +0900
@@ -22,6 +22,7 @@
#define MAP_NORESERVE 0x4000 /* don't check for reservations */
#define MAP_POPULATE 0x8000 /* populate (prefault) pagetables */
#define MAP_NONBLOCK 0x10000 /* do not block on IO */
+#define MAP_IMMOVABLE 0x20000
#define MS_ASYNC 1 /* sync memory asynchronously */
#define MS_INVALIDATE 2 /* invalidate the caches */
--- linux-2.6.7.ORG/include/asm-ia64/mman.h 2004-06-17 16:20:02.000000000 +0900
+++ linux-2.6.7/include/asm-ia64/mman.h 2004-06-17 16:20:12.000000000 +0900
@@ -30,6 +30,7 @@
#define MAP_NORESERVE 0x04000 /* don't check for reservations */
#define MAP_POPULATE 0x08000 /* populate (prefault) pagetables */
#define MAP_NONBLOCK 0x10000 /* do not block on IO */
+#define MAP_IMMOVABLE 0x20000
#define MS_ASYNC 1 /* sync memory asynchronously */
#define MS_INVALIDATE 2 /* invalidate the caches */
--- linux-2.6.7.ORG/include/linux/mm.h 2004-06-17 16:20:02.000000000 +0900
+++ linux-2.6.7/include/linux/mm.h 2004-06-17 16:20:12.000000000 +0900
@@ -134,6 +134,7 @@ struct vm_area_struct {
#define VM_ACCOUNT 0x00100000 /* Is a VM accounted object */
#define VM_HUGETLB 0x00400000 /* Huge TLB Page VM */
#define VM_NONLINEAR 0x00800000 /* Is non-linear (remap_file_pages) */
+#define VM_IMMOVABLE 0x01000000 /* Don't place in hot removable area */
#ifndef VM_STACK_DEFAULT_FLAGS /* arch can override this */
#define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS
--- linux-2.6.7.ORG/include/linux/mman.h 2004-06-17 16:20:02.000000000 +0900
+++ linux-2.6.7/include/linux/mman.h 2004-06-17 16:20:12.000000000 +0900
@@ -58,7 +58,11 @@ calc_vm_flag_bits(unsigned long flags)
return _calc_vm_trans(flags, MAP_GROWSDOWN, VM_GROWSDOWN ) |
_calc_vm_trans(flags, MAP_DENYWRITE, VM_DENYWRITE ) |
_calc_vm_trans(flags, MAP_EXECUTABLE, VM_EXECUTABLE) |
- _calc_vm_trans(flags, MAP_LOCKED, VM_LOCKED );
+ _calc_vm_trans(flags, MAP_LOCKED, VM_LOCKED )
+#ifdef CONFIG_MEMHOTPLUG
+ | _calc_vm_trans(flags, MAP_IMMOVABLE, VM_IMMOVABLE )
+#endif
+ ;
}
#endif /* _LINUX_MMAN_H */
--- linux-2.6.7.ORG/kernel/fork.c 2004-06-17 16:20:02.000000000 +0900
+++ linux-2.6.7/kernel/fork.c 2004-06-17 16:20:12.000000000 +0900
@@ -321,6 +321,9 @@ static inline int dup_mmap(struct mm_str
goto fail_nomem_policy;
vma_set_policy(tmp, pol);
tmp->vm_flags &= ~VM_LOCKED;
+#ifdef CONFIG_MEMHOTPLUG
+ tmp->vm_flags &= ~VM_IMMOVABLE;
+#endif
tmp->vm_mm = mm;
tmp->vm_next = NULL;
anon_vma_link(tmp);
--- linux-2.6.7.ORG/mm/memory.c 2004-06-17 16:20:02.000000000 +0900
+++ linux-2.6.7/mm/memory.c 2004-06-17 16:20:12.000000000 +0900
@@ -1069,7 +1069,13 @@ static int do_wp_page(struct mm_struct *
if (unlikely(anon_vma_prepare(vma)))
goto no_new_page;
- new_page = alloc_page_vma(GFP_HIGHUSER, vma, address);
+#ifdef CONFIG_MEMHOTPLUG
+ if (vma->vm_flags & VM_IMMOVABLE)
+ new_page = alloc_page_vma(GFP_USER | __GFP_HIGHMEM,
+ vma, address);
+ else
+#endif
+ new_page = alloc_page_vma(GFP_HIGHUSER, vma, address);
if (!new_page)
goto no_new_page;
copy_cow_page(old_page,new_page,address);
@@ -1412,6 +1418,12 @@ do_anonymous_page(struct mm_struct *mm,
if (unlikely(anon_vma_prepare(vma)))
goto no_mem;
+#ifdef CONFIG_MEMHOTPLUG
+ if (vma->vm_flags & VM_IMMOVABLE)
+ page = alloc_page_vma(GFP_USER | __GFP_HIGHMEM,
+ vma, addr);
+ else
+#endif
page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
if (!page)
goto no_mem;
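
Two details are easy to miss above: sys_mmap2() and old_mmap() strip MAP_IMMOVABLE
from user-supplied flags, so only in-kernel do_mmap() callers (here the AIO ring
buffer, whose pages the kernel keeps pinned) can set it, and dup_mmap() clears
VM_IMMOVABLE in the child on fork. The flag is turned into VM_IMMOVABLE by the
calc_vm_flag_bits() hunk; a stand-alone sketch of that bit translation, with the
_calc_vm_trans() helper reproduced here in the style of linux/mman.h so the demo
compiles on its own.

#include <stdio.h>

#define MAP_IMMOVABLE 0x20000        /* from the asm-i386/mman.h hunk above */
#define VM_IMMOVABLE  0x01000000     /* from the linux/mm.h hunk above      */

/* move a flag from one bit position to another, linux/mman.h style */
#define _calc_vm_trans(x, bit1, bit2) \
    ((bit1) <= (bit2) ? ((x) & (bit1)) * ((bit2) / (bit1)) \
                      : ((x) & (bit1)) / ((bit1) / (bit2)))

int main(void)
{
    unsigned long mmap_flags = MAP_IMMOVABLE;   /* plus whatever else the caller sets */
    unsigned long vm_flags = _calc_vm_trans(mmap_flags, MAP_IMMOVABLE, VM_IMMOVABLE);

    printf("vm_flags contribution: %#lx\n", vm_flags);   /* prints 0x1000000 */
    return 0;
}
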
--
* [PATCH] memory hotremoval for linux-2.6.7 [7/16]
2004-07-14 13:41 [PATCH] memory hotremoval for linux-2.6.7 [0/16] Hirokazu Takahashi
` (5 preceding siblings ...)
2004-07-14 14:04 ` [PATCH] memory hotremoval for linux-2.6.7 [6/16] Hirokazu Takahashi
@ 2004-07-14 14:04 ` Hirokazu Takahashi
2004-07-14 14:04 ` [PATCH] memory hotremoval for linux-2.6.7 [8/16] Hirokazu Takahashi
` (8 subsequent siblings)
15 siblings, 0 replies; 17+ messages in thread
From: Hirokazu Takahashi @ 2004-07-14 14:04 UTC (permalink / raw)
To: linux-kernel, lhms-devel; +Cc: linux-mm
$Id: va-shmem.patch,v 1.5 2004/04/14 06:36:05 iwamoto Exp $
--- linux-2.6.5.ORG/mm/shmem.c Fri Apr 2 14:05:11 2032
+++ linux-2.6.5/mm/shmem.c Fri Apr 2 14:43:37 2032
@@ -80,7 +80,13 @@ static inline struct page *shmem_dir_all
* BLOCKS_PER_PAGE on indirect pages, assume PAGE_CACHE_SIZE:
* might be reconsidered if it ever diverges from PAGE_SIZE.
*/
+#ifdef CONFIG_MEMHOTPLUG
+ return alloc_pages((gfp_mask & GFP_ZONEMASK) == __GFP_HOTREMOVABLE ?
+ (gfp_mask & ~GFP_ZONEMASK) | __GFP_HIGHMEM : gfp_mask,
+ PAGE_CACHE_SHIFT-PAGE_SHIFT);
+#else
return alloc_pages(gfp_mask, PAGE_CACHE_SHIFT-PAGE_SHIFT);
+#endif
}
static inline void shmem_dir_free(struct page *page)
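
The idea here is that shmem directory pages are long-lived kernel metadata the
migration code cannot move, so when a tmpfs mapping's gfp mask asks for the
hot-removable zone the directory pages are redirected to plain highmem instead. A
stand-alone sketch of that mask rewrite; __GFP_HOTREMOVABLE comes from an earlier
patch in this series, and all bit values below are placeholders, not the real ones.

#include <stdio.h>

/* Placeholder zone-modifier bits for the demo only; the real values live in
 * include/linux/gfp.h (with __GFP_HOTREMOVABLE added earlier in this series). */
#define GFP_ZONEMASK          0x0fu
#define __GFP_HIGHMEM         0x02u
#define __GFP_HOTREMOVABLE    0x04u

static unsigned int dir_alloc_mask(unsigned int gfp_mask)
{
    /* redirect hot-removable requests to plain highmem, as shmem_dir_alloc() does */
    if ((gfp_mask & GFP_ZONEMASK) == __GFP_HOTREMOVABLE)
        return (gfp_mask & ~GFP_ZONEMASK) | __GFP_HIGHMEM;
    return gfp_mask;
}

int main(void)
{
    printf("%#x -> %#x\n", 0x14u, dir_alloc_mask(0x14u));
    return 0;
}
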
--
* [PATCH] memory hotremoval for linux-2.6.7 [8/16]
2004-07-14 13:41 [PATCH] memory hotremoval for linux-2.6.7 [0/16] Hirokazu Takahashi
` (6 preceding siblings ...)
2004-07-14 14:04 ` [PATCH] memory hotremoval for linux-2.6.7 [7/16] Hirokazu Takahashi
@ 2004-07-14 14:04 ` Hirokazu Takahashi
2004-07-14 14:05 ` [PATCH] memory hotremoval for linux-2.6.7 [9/16] Hirokazu Takahashi
` (7 subsequent siblings)
15 siblings, 0 replies; 17+ messages in thread
From: Hirokazu Takahashi @ 2004-07-14 14:04 UTC (permalink / raw)
To: linux-kernel, lhms-devel; +Cc: linux-mm
--- linux-2.6.7.ORG/include/linux/page-flags.h Sun Jul 11 10:45:27 2032
+++ linux-2.6.7/include/linux/page-flags.h Sun Jul 11 10:51:49 2032
@@ -79,6 +79,7 @@
#define PG_anon 20 /* Anonymous: anon_vma in mapping */
#define PG_again 21
+#define PG_booked 22
/*
@@ -303,6 +304,10 @@ extern unsigned long __read_page_state(u
#define PageAgain(page) test_bit(PG_again, &(page)->flags)
#define SetPageAgain(page) set_bit(PG_again, &(page)->flags)
#define ClearPageAgain(page) clear_bit(PG_again, &(page)->flags)
+
+#define PageBooked(page) test_bit(PG_booked, &(page)->flags)
+#define SetPageBooked(page) set_bit(PG_booked, &(page)->flags)
+#define ClearPageBooked(page) clear_bit(PG_booked, &(page)->flags)
#define PageAnon(page) test_bit(PG_anon, &(page)->flags)
#define SetPageAnon(page) set_bit(PG_anon, &(page)->flags)
--- linux-2.6.7.ORG/include/linux/mmzone.h Sun Jul 11 10:45:27 2032
+++ linux-2.6.7/include/linux/mmzone.h Sun Jul 11 10:51:49 2032
@@ -187,6 +187,9 @@ struct zone {
char *name;
unsigned long spanned_pages; /* total size, including holes */
unsigned long present_pages; /* amount of memory (excluding holes) */
+ unsigned long contig_pages_alloc_hint;
+ unsigned long booked_pages;
+ long scan_pages;
} ____cacheline_maxaligned_in_smp;
--- linux-2.6.7.ORG/mm/page_alloc.c Sun Jul 11 10:49:53 2032
+++ linux-2.6.7/mm/page_alloc.c Sun Jul 11 10:53:04 2032
@@ -12,6 +12,7 @@
* Zone balancing, Kanoj Sarcar, SGI, Jan 2000
* Per cpu hot/cold page lists, bulk allocation, Martin J. Bligh, Sept 2002
* (lots of bits borrowed from Ingo Molnar & Andrew Morton)
+ * Dynamic compound page allocation, Hirokazu Takahashi, Jul 2004
*/
#include <linux/config.h>
@@ -25,6 +26,7 @@
#include <linux/module.h>
#include <linux/suspend.h>
#include <linux/pagevec.h>
+#include <linux/mm_inline.h>
#include <linux/memhotplug.h>
#include <linux/blkdev.h>
#include <linux/slab.h>
@@ -190,7 +192,11 @@ static inline void __free_pages_bulk (st
BUG();
index = page_idx >> (1 + order);
- zone->free_pages -= mask;
+ if (!PageBooked(page))
+ zone->free_pages -= mask;
+ else {
+ zone->booked_pages -= mask;
+ }
while (mask + (1 << (MAX_ORDER-1))) {
struct page *buddy1, *buddy2;
@@ -209,6 +215,9 @@ static inline void __free_pages_bulk (st
buddy2 = base + page_idx;
BUG_ON(bad_range(zone, buddy1));
BUG_ON(bad_range(zone, buddy2));
+ if (PageBooked(buddy1) != PageBooked(buddy2)) {
+ break;
+ }
list_del(&buddy1->lru);
mask <<= 1;
area++;
@@ -371,7 +380,12 @@ static struct page *__rmqueue(struct zon
if (list_empty(&area->free_list))
continue;
- page = list_entry(area->free_list.next, struct page, lru);
+ list_for_each_entry(page, &area->free_list, lru) {
+ if (!PageBooked(page))
+ goto gotit;
+ }
+ continue;
+gotit:
list_del(&page->lru);
index = page - zone->zone_mem_map;
if (current_order != MAX_ORDER-1)
@@ -503,6 +517,11 @@ static void fastcall free_hot_cold_page(
struct per_cpu_pages *pcp;
unsigned long flags;
+ if (PageBooked(page)) {
+ __free_pages_ok(page, 0);
+ return;
+ }
+
kernel_map_pages(page, 1, 0);
inc_page_state(pgfree);
free_pages_check(__FUNCTION__, page);
@@ -572,6 +591,225 @@ buffered_rmqueue(struct zone *zone, int
return page;
}
+#if defined(CONFIG_HUGETLB_PAGE) && defined(CONFIG_MEMHOTPLUG)
+/*
+ * Check whether the page is freeable or not.
+ * It might not be freeable even if this function says OK,
+ * e.g. when it is just being allocated.
+ * This check is almost sufficient but not perfect.
+ */
+static inline int is_page_freeable(struct page *page)
+{
+ return (page->mapping || page_mapped(page) || !page_count(page)) &&
+ !(page->flags & (1<<PG_reserved|1<<PG_compound|1<<PG_booked|1<<PG_slab));
+}
+
+static inline int is_free_page(struct page *page)
+{
+ return !(page_mapped(page) ||
+ page->mapping != NULL ||
+ page_count(page) != 0 ||
+ (page->flags & (
+ 1 << PG_reserved|
+ 1 << PG_compound|
+ 1 << PG_booked |
+ 1 << PG_lru |
+ 1 << PG_private |
+ 1 << PG_locked |
+ 1 << PG_active |
+ 1 << PG_reclaim |
+ 1 << PG_dirty |
+ 1 << PG_slab |
+ 1 << PG_writeback )));
+}
+
+static int
+try_to_book_pages(struct zone *zone, struct page *page, unsigned int order)
+{
+ struct page *p;
+ int booked_count = 0;
+ unsigned long flags;
+
+ spin_lock_irqsave(&zone->lock, flags);
+
+ for (p = page; p < &page[1<<order]; p++) {
+ if (!is_page_freeable(p))
+ goto out;
+ if (is_free_page(p))
+ booked_count++;
+ SetPageBooked(p);
+ }
+
+ zone->booked_pages = booked_count;
+ zone->free_pages -= booked_count;
+
+ spin_unlock_irqrestore(&zone->lock, flags);
+ return 1;
+out:
+ for (p--; p >= page; p--) {
+ ClearPageBooked(p);
+ }
+ spin_unlock_irqrestore(&zone->lock, flags);
+ return 0;
+}
+
+/*
+ * Mark PG_booked on all pages in a specified section to reserve
+ * for future use. These won't be reused until PG_booked is cleared.
+ */
+static struct page *
+book_pages(struct zone *zone, unsigned int gfp_mask, unsigned int order)
+{
+ unsigned long num = 1<<order;
+ unsigned long slot = zone->contig_pages_alloc_hint;
+ struct page *page;
+
+ slot = (slot + num - 1) & ~(num - 1); /* align */
+
+ for ( ; zone->scan_pages > 0; slot += num) {
+ zone->scan_pages -= num;
+ if (slot + num > zone->present_pages)
+ slot = 0;
+ page = &zone->zone_mem_map[slot];
+ if (try_to_book_pages(zone, page, order)) {
+ zone->contig_pages_alloc_hint = slot + num;
+ return page;
+ }
+ }
+ return NULL;
+}
+
+static void
+unbook_pages(struct zone *zone, struct page *page, unsigned int order)
+{
+ struct page *p;
+ for (p = page; p < &page[1<<order]; p++) {
+ ClearPageBooked(p);
+ }
+}
+
+struct sweepctl {
+ struct page *start;
+ struct page *end;
+ int rest;
+};
+
+/*
+ * Choose a page among the booked pages.
+ *
+ */
+static struct page*
+get_booked_page(struct zone *zone, void *arg)
+{
+ struct sweepctl *ctl = (struct sweepctl *)arg;
+ struct page *page = ctl->start;
+ struct page *end = ctl->end;
+
+ for (; page <= end; page++) {
+ if (!page_count(page) && !PageLRU(page))
+ continue;
+ if (!PageBooked(page)) {
+ printk(KERN_ERR "ERROR sweepout_pages: page:%p isn't booked.\n", page);
+ }
+ if (!PageLRU(page) || steal_page_from_lru(zone, page) == NULL) {
+ ctl->rest++;
+ continue;
+ }
+ ctl->start = page + 1;
+ return page;
+ }
+ ctl->start = end + 1;
+ return NULL;
+}
+
+/*
+ * sweepout_pages() might not work well as the booked pages
+ * might include some unfreeable pages.
+ */
+static int
+sweepout_pages(struct zone *zone, struct page *page, int num)
+{
+ struct sweepctl ctl;
+ int failed = 0;
+ int retry = 0;
+again:
+ on_each_cpu((void (*)(void*))drain_local_pages, NULL, 1, 1);
+ ctl.start = page;
+ ctl.end = &page[num - 1];
+ ctl.rest = 0;
+ failed = try_to_migrate_pages(zone, MIGRATE_ANYNODE, get_booked_page, &ctl);
+
+ if (retry != failed || ctl.rest) {
+ retry = failed;
+ schedule_timeout(HZ/4);
+ /* Actually we should wait on the pages */
+ goto again;
+ }
+
+ on_each_cpu((void (*)(void*))drain_local_pages, NULL, 1, 1);
+ return failed;
+}
+
+/*
+ * Allocate contiguous pages even if pages are fragmented in zones.
+ * Page Migration mechanism helps to make enough space in them.
+ */
+static struct page *
+force_alloc_pages(unsigned int gfp_mask, unsigned int order,
+ struct zonelist *zonelist)
+{
+ struct zone **zones = zonelist->zones;
+ struct zone *zone;
+ struct page *page = NULL;
+ unsigned long flags;
+ int i;
+ int ret;
+
+ static DECLARE_MUTEX(bookedpage_sem);
+
+ down(&bookedpage_sem);
+
+ for (i = 0; zones[i] != NULL; i++) {
+ zone = zones[i];
+ zone->scan_pages = zone->present_pages;
+ while (zone->scan_pages > 0) {
+ page = book_pages(zone, gfp_mask, order);
+ if (!page)
+ break;
+ ret = sweepout_pages(zone, page, 1<<order);
+ if (ret) {
+ spin_lock_irqsave(&zone->lock, flags);
+ unbook_pages(zone, page, order);
+ page = NULL;
+
+ zone->free_pages += zone->booked_pages;
+ spin_unlock_irqrestore(&zone->lock, flags);
+ continue;
+ }
+ spin_lock_irqsave(&zone->lock, flags);
+ unbook_pages(zone, page, order);
+ zone->free_pages += zone->booked_pages;
+ page = __rmqueue(zone, order);
+ spin_unlock_irqrestore(&zone->lock, flags);
+ if (page) {
+ prep_compound_page(page, order);
+ up(&bookedpage_sem);
+ return page;
+ }
+ }
+ }
+ up(&bookedpage_sem);
+ return NULL;
+}
+#endif /* CONFIG_HUGETLB_PAGE */
+
+static inline int
+enough_pages(struct zone *zone, unsigned long min, const int wait)
+{
+ return (long)zone->free_pages - (long)min >= 0 ||
+ (!wait && (long)zone->free_pages - (long)zone->pages_high >= 0);
+}
+
/*
* This is the 'heart' of the zoned buddy allocator.
*
@@ -623,8 +861,7 @@ __alloc_pages(unsigned int gfp_mask, uns
if (rt_task(p))
min -= z->pages_low >> 1;
- if (z->free_pages >= min ||
- (!wait && z->free_pages >= z->pages_high)) {
+ if (enough_pages(z, min, wait)) {
page = buffered_rmqueue(z, order, gfp_mask);
if (page) {
zone_statistics(zonelist, z);
@@ -648,8 +885,7 @@ __alloc_pages(unsigned int gfp_mask, uns
if (rt_task(p))
min -= z->pages_low >> 1;
- if (z->free_pages >= min ||
- (!wait && z->free_pages >= z->pages_high)) {
+ if (enough_pages(z, min, wait)) {
page = buffered_rmqueue(z, order, gfp_mask);
if (page) {
zone_statistics(zonelist, z);
@@ -694,8 +930,7 @@ rebalance:
min = (1UL << order) + z->protection[alloc_type];
- if (z->free_pages >= min ||
- (!wait && z->free_pages >= z->pages_high)) {
+ if (enough_pages(z, min, wait)) {
page = buffered_rmqueue(z, order, gfp_mask);
if (page) {
zone_statistics(zonelist, z);
@@ -703,6 +938,20 @@ rebalance:
}
}
}
+
+#if defined(CONFIG_HUGETLB_PAGE) && defined(CONFIG_MEMHOTPLUG)
+ /*
+ * Defrag pages to allocate large contiguous pages
+ *
+ * FIXME: The following code will work only if CONFIG_HUGETLB_PAGE
+ * flag is on.
+ */
+ if (order) {
+ page = force_alloc_pages(gfp_mask, order, zonelist);
+ if (page)
+ goto got_pg;
+ }
+#endif /* CONFIG_HUGETLB_PAGE */
/*
* Don't let big-order allocations loop unless the caller explicitly
--- linux-2.6.7.ORG/mm/memhotplug.c Sun Jul 11 10:45:27 2032
+++ linux-2.6.7/mm/memhotplug.c Sun Jul 11 10:51:49 2032
@@ -240,7 +240,7 @@ radix_tree_replace_pages(struct page *pa
}
/* Don't __put_page(page) here. Truncate may be in progress. */
newpage->flags |= page->flags & ~(1 << PG_uptodate) &
- ~(1 << PG_highmem) & ~(1 << PG_anon) &
+ ~(1 << PG_highmem) & ~(1 << PG_anon) & ~(1 << PG_booked) &
~(1 << PG_maplock) &
~(1 << PG_active) & ~(~0UL << NODEZONE_SHIFT);
--
* [PATCH] memory hotremoval for linux-2.6.7 [9/16]
2004-07-14 13:41 [PATCH] memory hotremoval for linux-2.6.7 [0/16] Hirokazu Takahashi
` (7 preceding siblings ...)
2004-07-14 14:04 ` [PATCH] memory hotremoval for linux-2.6.7 [8/16] Hirokazu Takahashi
@ 2004-07-14 14:05 ` Hirokazu Takahashi
2004-07-14 14:05 ` [PATCH] memory hotremoval for linux-2.6.7 [10/16] Hirokazu Takahashi
` (6 subsequent siblings)
15 siblings, 0 replies; 17+ messages in thread
From: Hirokazu Takahashi @ 2004-07-14 14:05 UTC (permalink / raw)
To: linux-kernel, lhms-devel; +Cc: linux-mm
--- linux-2.6.7.ORG/include/linux/hugetlb.h Mon Jul 5 14:01:34 2032
+++ linux-2.6.7/include/linux/hugetlb.h Mon Jul 5 14:00:53 2032
@@ -25,6 +25,8 @@ struct page *follow_huge_addr(struct mm_
int write);
struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address,
pmd_t *pmd, int write);
+extern int hugetlb_fault(struct mm_struct *, struct vm_area_struct *,
+ int, unsigned long);
int is_aligned_hugepage_range(unsigned long addr, unsigned long len);
int pmd_huge(pmd_t pmd);
struct page *alloc_huge_page(void);
@@ -81,6 +83,7 @@ static inline unsigned long hugetlb_tota
#define hugetlb_free_pgtables(tlb, prev, start, end) do { } while (0)
#define alloc_huge_page() ({ NULL; })
#define free_huge_page(p) ({ (void)(p); BUG(); })
+#define hugetlb_fault(mm, vma, write, addr) 0
#ifndef HPAGE_MASK
#define HPAGE_MASK 0 /* Keep the compiler happy */
--- linux-2.6.7.ORG/mm/memory.c Mon Jul 5 14:01:34 2032
+++ linux-2.6.7/mm/memory.c Mon Jul 5 13:55:53 2032
@@ -1683,7 +1683,7 @@ int handle_mm_fault(struct mm_struct *mm
inc_page_state(pgfault);
if (is_vm_hugetlb_page(vma))
- return VM_FAULT_SIGBUS; /* mapping truncation does this. */
+ return hugetlb_fault(mm, vma, write_access, address);
/*
* We need the page table lock to synchronize with kswapd
--- linux-2.6.7.ORG/arch/i386/mm/hugetlbpage.c Mon Jul 5 14:01:34 2032
+++ linux-2.6.7/arch/i386/mm/hugetlbpage.c Mon Jul 5 14:02:37 2032
@@ -80,10 +80,12 @@ int copy_hugetlb_page_range(struct mm_st
goto nomem;
src_pte = huge_pte_offset(src, addr);
entry = *src_pte;
- ptepage = pte_page(entry);
- get_page(ptepage);
+ if (!pte_none(entry)) {
+ ptepage = pte_page(entry);
+ get_page(ptepage);
+ dst->rss += (HPAGE_SIZE / PAGE_SIZE);
+ }
set_pte(dst_pte, entry);
- dst->rss += (HPAGE_SIZE / PAGE_SIZE);
addr += HPAGE_SIZE;
}
return 0;
@@ -111,6 +113,11 @@ follow_hugetlb_page(struct mm_struct *mm
pte = huge_pte_offset(mm, vaddr);
+ if (!pte || pte_none(*pte)) {
+ hugetlb_fault(mm, vma, 0, vaddr);
+ pte = huge_pte_offset(mm, vaddr);
+ }
+
/* hugetlb should be locked, and hence, prefaulted */
WARN_ON(!pte || pte_none(*pte));
@@ -198,6 +205,13 @@ follow_huge_pmd(struct mm_struct *mm, un
struct page *page;
page = pte_page(*(pte_t *)pmd);
+ if (!page) {
+ struct vm_area_struct *vma = find_vma(mm, address);
+ if (!vma)
+ return NULL;
+ hugetlb_fault(mm, vma, write, address);
+ page = pte_page(*(pte_t *)pmd);
+ }
if (page)
page += ((address & ~HPAGE_MASK) >> PAGE_SHIFT);
return page;
@@ -221,11 +235,71 @@ void unmap_hugepage_range(struct vm_area
continue;
page = pte_page(pte);
put_page(page);
+ mm->rss -= (HPAGE_SIZE / PAGE_SIZE);
}
- mm->rss -= (end - start) >> PAGE_SHIFT;
flush_tlb_range(vma, start, end);
}
+int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma, int write_access, unsigned long address)
+{
+ struct file *file = vma->vm_file;
+ struct address_space *mapping = file->f_dentry->d_inode->i_mapping;
+ struct page *page;
+ unsigned long idx;
+ pte_t *pte = huge_pte_alloc(mm, address);
+ int ret;
+
+ BUG_ON(vma->vm_start & ~HPAGE_MASK);
+ BUG_ON(vma->vm_end & ~HPAGE_MASK);
+
+ if (!pte) {
+ ret = VM_FAULT_SIGBUS;
+ goto out;
+ }
+
+ if (!pte_none(*pte)) {
+ ret = VM_FAULT_MINOR;
+ goto out;
+ }
+
+ idx = ((address - vma->vm_start) >> HPAGE_SHIFT)
+ + (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
+again:
+ page = find_lock_page(mapping, idx);
+
+ if (!page) {
+ if (hugetlb_get_quota(mapping)) {
+ ret = VM_FAULT_SIGBUS;
+ goto out;
+ }
+ page = alloc_huge_page();
+ if (!page) {
+ hugetlb_put_quota(mapping);
+ ret = VM_FAULT_SIGBUS;
+ goto out;
+ }
+ ret = add_to_page_cache(page, mapping, idx, GFP_ATOMIC);
+ if (ret) {
+ hugetlb_put_quota(mapping);
+ put_page(page);
+ goto again;
+ }
+ }
+ spin_lock(&mm->page_table_lock);
+ if (pte_none(*pte)) {
+ set_huge_pte(mm, vma, page, pte, vma->vm_flags & VM_WRITE);
+ flush_tlb_page(vma, address);
+ update_mmu_cache(vma, address, *pte);
+ } else {
+ put_page(page);
+ }
+ spin_unlock(&mm->page_table_lock);
+ unlock_page(page);
+ ret = VM_FAULT_MINOR;
+out:
+ return ret;
+}
+
int hugetlb_prefault(struct address_space *mapping, struct vm_area_struct *vma)
{
struct mm_struct *mm = current->mm;
@@ -235,46 +309,26 @@ int hugetlb_prefault(struct address_spac
BUG_ON(vma->vm_start & ~HPAGE_MASK);
BUG_ON(vma->vm_end & ~HPAGE_MASK);
+#if 0
spin_lock(&mm->page_table_lock);
for (addr = vma->vm_start; addr < vma->vm_end; addr += HPAGE_SIZE) {
- unsigned long idx;
- pte_t *pte = huge_pte_alloc(mm, addr);
- struct page *page;
-
- if (!pte) {
- ret = -ENOMEM;
- goto out;
+ if (addr < vma->vm_start)
+ addr = vma->vm_start;
+ if (addr >= vma->vm_end) {
+ ret = 0;
+ break;
}
- if (!pte_none(*pte))
- continue;
-
- idx = ((addr - vma->vm_start) >> HPAGE_SHIFT)
- + (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
- page = find_get_page(mapping, idx);
- if (!page) {
- /* charge the fs quota first */
- if (hugetlb_get_quota(mapping)) {
- ret = -ENOMEM;
- goto out;
- }
- page = alloc_huge_page();
- if (!page) {
- hugetlb_put_quota(mapping);
- ret = -ENOMEM;
- goto out;
- }
- ret = add_to_page_cache(page, mapping, idx, GFP_ATOMIC);
- if (! ret) {
- unlock_page(page);
- } else {
- hugetlb_put_quota(mapping);
- free_huge_page(page);
- goto out;
- }
+ spin_unlock(&mm->page_table_lock);
+ ret = hugetlb_fault(mm, vma, 1, addr);
+ schedule();
+ spin_lock(&mm->page_table_lock);
+ if (ret == VM_FAULT_SIGBUS) {
+ ret = -ENOMEM;
+ break;
}
- set_huge_pte(mm, vma, page, pte, vma->vm_flags & VM_WRITE);
+ ret = 0;
}
-out:
spin_unlock(&mm->page_table_lock);
+#endif
return ret;
}
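
hugetlb_fault() replaces the old prefault-everything model: on the first touch of a
huge page it computes the mapping index from the faulting address and the vma's file
offset, looks the page up in the page cache, and allocates and inserts one if it is
missing. A stand-alone sketch of just the index computation, assuming 4 KB base pages
and 4 MB huge pages (non-PAE i386); the addresses below are arbitrary example values.

#include <stdio.h>

#define PAGE_SHIFT  12
#define HPAGE_SHIFT 22

int main(void)
{
    unsigned long vm_start = 0x40000000;   /* hugetlb mapping start            */
    unsigned long vm_pgoff = 0x2000;       /* file offset in 4 KB pages: 32 MB */
    unsigned long address  = 0x40800000;   /* faulting address                 */

    unsigned long idx = ((address - vm_start) >> HPAGE_SHIFT)
                      + (vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));

    /* 8 MB into the vma (2 huge pages) + 32 MB file offset (8 huge pages) */
    printf("huge page index in the mapping's page cache: %lu\n", idx);  /* 10 */
    return 0;
}
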
--
* [PATCH] memory hotremoval for linux-2.6.7 [10/16]
2004-07-14 13:41 [PATCH] memory hotremoval for linux-2.6.7 [0/16] Hirokazu Takahashi
` (8 preceding siblings ...)
2004-07-14 14:05 ` [PATCH] memory hotremoval for linux-2.6.7 [9/16] Hirokazu Takahashi
@ 2004-07-14 14:05 ` Hirokazu Takahashi
2004-07-14 14:05 ` [PATCH] memory hotremoval for linux-2.6.7 [11/16] Hirokazu Takahashi
` (5 subsequent siblings)
15 siblings, 0 replies; 17+ messages in thread
From: Hirokazu Takahashi @ 2004-07-14 14:05 UTC (permalink / raw)
To: linux-kernel, lhms-devel; +Cc: linux-mm
--- linux-2.6.7.ORG/mm/hugetlb.c Thu Jun 17 15:17:51 2032
+++ linux-2.6.7/mm/hugetlb.c Thu Jun 17 15:21:18 2032
@@ -15,8 +15,20 @@ const unsigned long hugetlb_zero = 0, hu
static unsigned long nr_huge_pages, free_huge_pages;
unsigned long max_huge_pages;
static struct list_head hugepage_freelists[MAX_NUMNODES];
+static struct list_head hugepage_alllists[MAX_NUMNODES];
static spinlock_t hugetlb_lock = SPIN_LOCK_UNLOCKED;
+static void register_huge_page(struct page *page)
+{
+ list_add(&page[1].lru,
+ &hugepage_alllists[page_zone(page)->zone_pgdat->node_id]);
+}
+
+static void unregister_huge_page(struct page *page)
+{
+ list_del(&page[1].lru);
+}
+
static void enqueue_huge_page(struct page *page)
{
list_add(&page->lru,
@@ -90,14 +102,17 @@ static int __init hugetlb_init(void)
unsigned long i;
struct page *page;
- for (i = 0; i < MAX_NUMNODES; ++i)
+ for (i = 0; i < MAX_NUMNODES; ++i) {
INIT_LIST_HEAD(&hugepage_freelists[i]);
+ INIT_LIST_HEAD(&hugepage_alllists[i]);
+ }
for (i = 0; i < max_huge_pages; ++i) {
page = alloc_fresh_huge_page();
if (!page)
break;
spin_lock(&hugetlb_lock);
+ register_huge_page(page);
enqueue_huge_page(page);
spin_unlock(&hugetlb_lock);
}
@@ -139,6 +154,7 @@ static int try_to_free_low(unsigned long
if (PageHighMem(page))
continue;
list_del(&page->lru);
+ unregister_huge_page(page);
update_and_free_page(page);
--free_huge_pages;
if (!--count)
@@ -161,6 +177,7 @@ static unsigned long set_max_huge_pages(
if (!page)
return nr_huge_pages;
spin_lock(&hugetlb_lock);
+ register_huge_page(page);
enqueue_huge_page(page);
free_huge_pages++;
nr_huge_pages++;
@@ -174,6 +191,7 @@ static unsigned long set_max_huge_pages(
struct page *page = dequeue_huge_page();
if (!page)
break;
+ unregister_huge_page(page);
update_and_free_page(page);
}
spin_unlock(&hugetlb_lock);
--
* [PATCH] memory hotremoval for linux-2.6.7 [11/16]
2004-07-14 13:41 [PATCH] memory hotremoval for linux-2.6.7 [0/16] Hirokazu Takahashi
` (9 preceding siblings ...)
2004-07-14 14:05 ` [PATCH] memory hotremoval for linux-2.6.7 [10/16] Hirokazu Takahashi
@ 2004-07-14 14:05 ` Hirokazu Takahashi
2004-07-14 14:05 ` [BUG][PATCH] memory hotremoval for linux-2.6.7 [12/16] Hirokazu Takahashi
` (4 subsequent siblings)
15 siblings, 0 replies; 17+ messages in thread
From: Hirokazu Takahashi @ 2004-07-14 14:05 UTC (permalink / raw)
To: linux-kernel, lhms-devel; +Cc: linux-mm
--- linux-2.6.7.ORG/include/linux/hugetlb.h Mon Jul 5 14:05:39 2032
+++ linux-2.6.7/include/linux/hugetlb.h Mon Jul 5 14:06:19 2032
@@ -27,6 +27,7 @@ struct page *follow_huge_pmd(struct mm_s
pmd_t *pmd, int write);
extern int hugetlb_fault(struct mm_struct *, struct vm_area_struct *,
int, unsigned long);
+int try_to_unmap_hugepage(struct page *page, struct vm_area_struct *vma, struct list_head *force);
int is_aligned_hugepage_range(unsigned long addr, unsigned long len);
int pmd_huge(pmd_t pmd);
struct page *alloc_huge_page(void);
@@ -84,6 +85,7 @@ static inline unsigned long hugetlb_tota
#define alloc_huge_page() ({ NULL; })
#define free_huge_page(p) ({ (void)(p); BUG(); })
#define hugetlb_fault(mm, vma, write, addr) 0
+#define try_to_unmap_hugepage(page, vma, force) 0
#ifndef HPAGE_MASK
#define HPAGE_MASK 0 /* Keep the compiler happy */
--- linux-2.6.7.ORG/mm/rmap.c Mon Jul 5 14:01:22 2032
+++ linux-2.6.7/mm/rmap.c Mon Jul 5 14:06:19 2032
@@ -27,6 +27,7 @@
* on the mm->page_table_lock
*/
#include <linux/mm.h>
+#include <linux/hugetlb.h>
#include <linux/pagemap.h>
#include <linux/swap.h>
#include <linux/swapops.h>
@@ -441,6 +442,13 @@ static int try_to_unmap_one(struct page
address = vma_address(page, vma);
if (address == -EFAULT)
goto out;
+
+ /*
+ * Is there any better way to check whether the page is
+ * HugePage or not?
+ */
+ if (vma && is_vm_hugetlb_page(vma))
+ return try_to_unmap_hugepage(page, vma, force);
/*
* We need the page_table_lock to protect us from page faults,
--- linux-2.6.7.ORG/arch/i386/mm/hugetlbpage.c Mon Jul 5 14:05:39 2032
+++ linux-2.6.7/arch/i386/mm/hugetlbpage.c Mon Jul 5 14:06:19 2032
@@ -10,6 +10,7 @@
#include <linux/mm.h>
#include <linux/hugetlb.h>
#include <linux/pagemap.h>
+#include <linux/rmap.h>
#include <linux/smp_lock.h>
#include <linux/slab.h>
#include <linux/err.h>
@@ -83,6 +84,7 @@ int copy_hugetlb_page_range(struct mm_st
if (!pte_none(entry)) {
ptepage = pte_page(entry);
get_page(ptepage);
+ page_dup_rmap(ptepage);
dst->rss += (HPAGE_SIZE / PAGE_SIZE);
}
set_pte(dst_pte, entry);
@@ -234,6 +236,7 @@ void unmap_hugepage_range(struct vm_area
if (pte_none(pte))
continue;
page = pte_page(pte);
+ page_remove_rmap(page);
put_page(page);
mm->rss -= (HPAGE_SIZE / PAGE_SIZE);
}
@@ -288,6 +291,7 @@ again:
spin_lock(&mm->page_table_lock);
if (pte_none(*pte)) {
set_huge_pte(mm, vma, page, pte, vma->vm_flags & VM_WRITE);
+ page_add_file_rmap(page);
flush_tlb_page(vma, address);
update_mmu_cache(vma, address, *pte);
} else {
@@ -332,3 +336,87 @@ int hugetlb_prefault(struct address_spac
#endif
return ret;
}
+
+/*
+ * At what user virtual address is page expected in vma?
+ */
+static inline unsigned long
+huge_vma_address(struct page *page, struct vm_area_struct *vma)
+{
+ pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+ unsigned long address;
+
+ address = vma->vm_start + ((pgoff - vma->vm_pgoff) << HPAGE_SHIFT);
+ if (unlikely(address < vma->vm_start || address >= vma->vm_end)) {
+ /* page should be within any vma from prio_tree_next */
+ BUG_ON(!PageAnon(page));
+ return -EFAULT;
+ }
+ return address;
+}
+
+/*
+ * Try to clear the PTE which map the hugepage.
+ */
+int try_to_unmap_hugepage(struct page *page, struct vm_area_struct *vma,
+ struct list_head *force)
+{
+ pte_t *pte;
+ pte_t pteval;
+ int ret = SWAP_AGAIN;
+ struct mm_struct *mm = vma->vm_mm;
+ unsigned long address;
+
+ address = huge_vma_address(page, vma);
+ if (address == -EFAULT)
+ goto out;
+
+ /*
+ * We need the page_table_lock to protect us from page faults,
+ * munmap, fork, etc...
+ */
+ if (!spin_trylock(&mm->page_table_lock))
+ goto out;
+
+ pte = huge_pte_offset(mm, address);
+ if (!pte || pte_none(*pte))
+ goto out_unlock;
+ if (!pte_present(*pte))
+ goto out_unlock;
+
+ if (page_to_pfn(page) != pte_pfn(*pte))
+ goto out_unlock;
+
+ BUG_ON(!vma);
+
+#if 0
+ if ((vma->vm_flags & (VM_LOCKED|VM_RESERVED)) ||
+ ptep_test_and_clear_young(pte)) {
+ ret = SWAP_FAIL;
+ goto out_unlock;
+ }
+#endif
+
+ /* Nuke the page table entry. */
+ flush_cache_page(vma, address);
+ pteval = ptep_get_and_clear(pte);
+ flush_tlb_range(vma, address, address + HPAGE_SIZE);
+
+ /* Move the dirty bit to the physical page now the pte is gone. */
+ if (pte_dirty(pteval))
+ set_page_dirty(page);
+
+ BUG_ON(PageAnon(page));
+
+ mm->rss -= (HPAGE_SIZE / PAGE_SIZE);
+ BUG_ON(!page->mapcount);
+ page->mapcount--;
+ page_cache_release(page);
+
+out_unlock:
+ spin_unlock(&mm->page_table_lock);
+
+out:
+ return ret;
+}
+
--
* [BUG][PATCH] memory hotremoval for linux-2.6.7 [12/16]
2004-07-14 13:41 [PATCH] memory hotremoval for linux-2.6.7 [0/16] Hirokazu Takahashi
` (10 preceding siblings ...)
2004-07-14 14:05 ` [PATCH] memory hotremoval for linux-2.6.7 [11/16] Hirokazu Takahashi
@ 2004-07-14 14:05 ` Hirokazu Takahashi
2004-07-14 14:06 ` [PATCH] memory hotremoval for linux-2.6.7 [13/16] Hirokazu Takahashi
` (3 subsequent siblings)
15 siblings, 0 replies; 17+ messages in thread
From: Hirokazu Takahashi @ 2004-07-14 14:05 UTC (permalink / raw)
To: linux-kernel, lhms-devel; +Cc: linux-mm
--- linux-2.6.7/mm/hugetlb.c.save Wed Jul 7 18:34:06 2032
+++ linux-2.6.7/mm/hugetlb.c Wed Jul 7 18:35:10 2032
@@ -149,8 +149,8 @@ static int try_to_free_low(unsigned long
{
int i;
for (i = 0; i < MAX_NUMNODES; ++i) {
- struct page *page;
- list_for_each_entry(page, &hugepage_freelists[i], lru) {
+ struct page *page, *page1;
+ list_for_each_entry_safe(page, page1, &hugepage_freelists[i], lru) {
if (PageHighMem(page))
continue;
list_del(&page->lru);
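
This one-line fix matters because the loop body deletes the entry the iterator is
standing on; list_for_each_entry() would then read page->lru.next from a page that
update_and_free_page() has already handed back to the buddy allocator, while the
_safe variant caches the next entry first. A stand-alone illustration of the same
pattern with a plain linked list rather than the kernel's list_head.

#include <stdio.h>
#include <stdlib.h>

struct node { int v; struct node *next; };

int main(void)
{
    struct node *head = NULL, *n, *next;
    int i;

    for (i = 0; i < 5; i++) {               /* build a small list: 4,3,2,1,0 */
        n = malloc(sizeof(*n));
        n->v = i;
        n->next = head;
        head = n;
    }

    /* like list_for_each_entry_safe(): grab ->next before the node is freed */
    for (n = head; n; n = next) {
        next = n->next;
        printf("freeing %d\n", n->v);
        free(n);                            /* reading n->next after this would be a use-after-free */
    }
    head = NULL;
    return 0;
}
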
--
* [PATCH] memory hotremoval for linux-2.6.7 [13/16]
2004-07-14 13:41 [PATCH] memory hotremoval for linux-2.6.7 [0/16] Hirokazu Takahashi
` (11 preceding siblings ...)
2004-07-14 14:05 ` [BUG][PATCH] memory hotremoval for linux-2.6.7 [12/16] Hirokazu Takahashi
@ 2004-07-14 14:06 ` Hirokazu Takahashi
2004-07-14 14:06 ` [PATCH] memory hotremoval for linux-2.6.7 [14/16] Hirokazu Takahashi
` (2 subsequent siblings)
15 siblings, 0 replies; 17+ messages in thread
From: Hirokazu Takahashi @ 2004-07-14 14:06 UTC (permalink / raw)
To: linux-kernel, lhms-devel; +Cc: linux-mm
--- linux-2.6.7.ORG/include/linux/hugetlb.h Sun Jul 11 11:33:45 2032
+++ linux-2.6.7/include/linux/hugetlb.h Sun Jul 11 11:34:11 2032
@@ -28,6 +28,7 @@ struct page *follow_huge_pmd(struct mm_s
extern int hugetlb_fault(struct mm_struct *, struct vm_area_struct *,
int, unsigned long);
int try_to_unmap_hugepage(struct page *page, struct vm_area_struct *vma, struct list_head *force);
+int mmigrate_hugetlb_pages(struct zone *);
int is_aligned_hugepage_range(unsigned long addr, unsigned long len);
int pmd_huge(pmd_t pmd);
struct page *alloc_huge_page(void);
@@ -86,6 +87,7 @@ static inline unsigned long hugetlb_tota
#define free_huge_page(p) ({ (void)(p); BUG(); })
#define hugetlb_fault(mm, vma, write, addr) 0
#define try_to_unmap_hugepage(page, vma, force) 0
+#define mmigrate_hugetlb_pages(zone) 0
#ifndef HPAGE_MASK
#define HPAGE_MASK 0 /* Keep the compiler happy */
--- linux-2.6.7.ORG/arch/i386/mm/hugetlbpage.c Sun Jul 11 11:33:45 2032
+++ linux-2.6.7/arch/i386/mm/hugetlbpage.c Sun Jul 11 11:34:11 2032
@@ -288,6 +288,15 @@ again:
goto again;
}
}
+
+ if (page->mapping == NULL) {
+ BUG_ON(! PageAgain(page));
+ /* This page will go back to freelists[] */
+ put_page(page); /* XXX */
+ unlock_page(page);
+ goto again;
+ }
+
spin_lock(&mm->page_table_lock);
if (pte_none(*pte)) {
set_huge_pte(mm, vma, page, pte, vma->vm_flags & VM_WRITE);
--- linux-2.6.7.ORG/mm/memhotplug.c Sun Jul 11 12:56:53 2032
+++ linux-2.6.7/mm/memhotplug.c Sun Jul 11 12:56:17 2032
@@ -17,6 +17,7 @@
#include <linux/buffer_head.h>
#include <linux/mm_inline.h>
#include <linux/rmap.h>
+#include <linux/hugetlb.h>
#include <linux/memhotplug.h>
#ifdef CONFIG_KDB
@@ -841,6 +842,10 @@ int mmigrated(void *p)
current->flags |= PF_KSWAPD; /* It's fake */
if (down_trylock(&mmigrated_sem)) {
printk("mmigrated already running\n");
+ return 0;
+ }
+ if (mmigrate_hugetlb_pages(zone)) {
+ up(&mmigrated_sem);
return 0;
}
on_each_cpu(lru_drain_schedule, NULL, 1, 1);
--- linux-2.6.7.ORG/mm/hugetlb.c Sun Jul 11 11:30:50 2032
+++ linux-2.6.7/mm/hugetlb.c Sun Jul 11 13:14:25 2032
@@ -1,6 +1,7 @@
/*
* Generic hugetlb support.
* (C) William Irwin, April 2004
+ * Support of memory hotplug for hugetlbpages, Hirokazu Takahashi, Jul 2004
*/
#include <linux/gfp.h>
#include <linux/list.h>
@@ -8,6 +9,8 @@
#include <linux/module.h>
#include <linux/mm.h>
#include <linux/hugetlb.h>
+#include <linux/pagemap.h>
+#include <linux/memhotplug.h>
#include <linux/sysctl.h>
#include <linux/highmem.h>
@@ -58,6 +61,9 @@ static struct page *alloc_fresh_huge_pag
{
static int nid = 0;
struct page *page;
+ struct pglist_data *pgdat;
+ while ((pgdat = NODE_DATA(nid)) == NULL || !pgdat->enabled)
+ nid = (nid + 1) % numnodes;
page = alloc_pages_node(nid, GFP_HIGHUSER|__GFP_COMP,
HUGETLB_PAGE_ORDER);
nid = (nid + 1) % numnodes;
@@ -91,6 +97,8 @@ struct page *alloc_huge_page(void)
free_huge_pages--;
spin_unlock(&hugetlb_lock);
set_page_count(page, 1);
+ page->flags &= ~(1 << PG_uptodate | 1 << PG_error |
+ 1 << PG_referenced | 1 << PG_again);
page[1].mapping = (void *)free_huge_page;
for (i = 0; i < (HPAGE_SIZE/PAGE_SIZE); ++i)
clear_highpage(&page[i]);
@@ -144,25 +152,36 @@ static void update_and_free_page(struct
__free_pages(page, HUGETLB_PAGE_ORDER);
}
-#ifdef CONFIG_HIGHMEM
-static int try_to_free_low(unsigned long count)
+static int
+try_to_free_hugepages(int idx, unsigned long count, struct zone *zone)
{
- int i;
- for (i = 0; i < MAX_NUMNODES; ++i) {
- struct page *page, *page1;
- list_for_each_entry_safe(page, page1, &hugepage_freelists[i], lru) {
+ struct page *page, *page1;
+ list_for_each_entry_safe(page, page1, &hugepage_freelists[idx], lru) {
+ if (zone) {
+ if (page_zone(page) != zone)
+ continue;
+ } else {
if (PageHighMem(page))
continue;
- list_del(&page->lru);
- unregister_huge_page(page);
- update_and_free_page(page);
- --free_huge_pages;
- if (!--count)
- return 0;
}
+ list_del(&page->lru);
+ unregister_huge_page(page);
+ update_and_free_page(page);
+ --free_huge_pages;
+ if (!--count)
+ return 0;
}
return count;
}
+
+#ifdef CONFIG_HIGHMEM
+static int try_to_free_low(unsigned long count)
+{
+ int i;
+ for (i = 0; i < MAX_NUMNODES; ++i)
+ count = try_to_free_hugepages(i, count, NULL);
+ return count;
+}
#else
static inline int try_to_free_low(unsigned long count)
{
@@ -250,10 +269,8 @@ unsigned long hugetlb_total_pages(void)
EXPORT_SYMBOL(hugetlb_total_pages);
/*
- * We cannot handle pagefaults against hugetlb pages at all. They cause
- * handle_mm_fault() to try to instantiate regular-sized pages in the
- * hugegpage VMA. do_page_fault() is supposed to trap this, so BUG is we get
- * this far.
+ * hugetlb_nopage() is never called now that hugetlb_fault() has
+ * been implemented.
*/
static struct page *hugetlb_nopage(struct vm_area_struct *vma,
unsigned long address, int *unused)
@@ -275,3 +292,200 @@ void zap_hugepage_range(struct vm_area_s
unmap_hugepage_range(vma, start, start + length);
spin_unlock(&mm->page_table_lock);
}
+
+#ifdef CONFIG_MEMHOTPLUG
+static int copy_hugepage(struct page *to, struct page *from)
+{
+ int size;
+ for (size = 0; size < HPAGE_SIZE; size += PAGE_SIZE) {
+ copy_highpage(to, from);
+ to++;
+ from++;
+ }
+ return 0;
+}
+
+/*
+ * Allocate a hugepage from the Buddy allocator directly.
+ */
+static struct page *
+hugepage_mmigrate_alloc(int nid)
+{
+ struct page *page;
+ /*
+ * TODO:
+ * - NUMA-aware page allocation is required. We should allocate
+ * a hugepage from the node that the process depends on.
+ * - New hugepages should be preallocated prior to migrating pages
+ * so that a lack of memory is detected before migration starts.
+ * - New hugepages should be allocated from the node specified by nid.
+ */
+ page = alloc_fresh_huge_page();
+
+ if (page == NULL) {
+ printk(KERN_WARNING "remap: Failed to allocate new hugepage\n");
+ } else {
+ spin_lock(&hugetlb_lock);
+ register_huge_page(page);
+ enqueue_huge_page(page);
+ free_huge_pages++;
+ nr_huge_pages++;
+ spin_unlock(&hugetlb_lock);
+ }
+ page = alloc_huge_page();
+ unregister_huge_page(page); /* XXXX */
+ return page;
+}
+
+/*
+ * Free a hugepage into the Buddy allocator directly.
+ */
+static int
+hugepage_delete(struct page *page)
+{
+ BUG_ON(page_count(page) != 1);
+ BUG_ON(page->mapping);
+
+ spin_lock(&hugetlb_lock);
+ page[1].mapping = NULL;
+ update_and_free_page(page);
+ spin_unlock(&hugetlb_lock);
+ return 0;
+}
+
+static int
+hugepage_register(struct page *page, int active)
+{
+ spin_lock(&hugetlb_lock);
+ register_huge_page(page);
+ spin_unlock(&hugetlb_lock);
+ return 0;
+}
+
+static int
+hugepage_release_buffer(struct page *page)
+{
+ BUG();
+ return -1;
+}
+
+/*
+ * Hugetlbpage migration is harder than regular page migration
+ * because hugetlbpages lack the swap-related machinery. To handle
+ * this, new features have been introduced:
+ * - an rmap mechanism to unmap a hugetlbpage.
+ * - a pagefault handler for hugetlbpages.
+ * - a list on which all hugetlbpages are put, instead of the LRU
+ * lists used for regular pages.
+ * With these features, hugetlbpages can be handled in the same way
+ * as regular pages.
+ *
+ * The following is the flow for migrating hugetlbpages:
+ * 1. allocate a new hugetlbpage.
+ * a. look for an appropriate section for a hugetlbpage.
+ * b. migrate all pages in that section to another zone.
+ * c. allocate the section as a new hugetlbpage.
+ * 2. lock the new hugetlbpage and don't set PG_uptodate flag on it.
+ * 3. modify the oldpage entry in the corresponding radix tree on
+ * hugetlbfs with the new hugetlbpage.
+ * 4. clear all PTEs that refer to the old hugetlbpage.
+ * 5. wait until all references on the old hugetlbpage have gone.
+ * 6. copy from the old hugetlbpage to the new hugetlbpage.
+ * 7. set PG_uptodate flag of the new hugetlbpage.
+ * 8. release the old hugetlbpage into the Buddy allocator directly.
+ * 9. unlock the new hugetlbpage and wakeup all waiters.
+ *
+ * If a new access to a hugetlbpage being migrated occurs, it will be
+ * blocked in the pagefault handler until the migration has completed.
+ *
+ *
+ * disabled+------+---------------------------+------+------+---
+ * zone | | old hugepage | | |
+ * +------+-------------|-------------+------+------+---
+ * +--migrate
+ * |
+ * V
+ * <-- reserve new hugepage -->
+ * page page page page page page page
+ * +------+------+------+------+------+------+------+---
+ * zone | | |Booked|Booked|Booked|Booked| |
+ * +------+------+--|---+--|---+------+------+------+---
+ * | |
+ * migrate--+ +--------------+
+ * | |
+ * other +------+---V--+------+------+--- migrate
+ * zones | | | | | |
+ * +------+------+------+------+-- |
+ * +------+------+------+------+------+---V--+------+---
+ * | | | | | | | |
+ * +------+------+------+------+------+------+------+---
+ */
+
+static struct mmigrate_operations hugepage_mmigrate_ops = {
+ .mmigrate_alloc_page = hugepage_mmigrate_alloc,
+ .mmigrate_free_page = hugepage_delete,
+ .mmigrate_copy_page = copy_hugepage,
+ .mmigrate_lru_add_page = hugepage_register,
+ .mmigrate_release_buffers = hugepage_release_buffer,
+ .mmigrate_prepare = NULL,
+ .mmigrate_stick_page = NULL
+};
+
+int mmigrate_hugetlb_pages(struct zone *zone)
+{
+ struct page *page, *page1, *map;
+ int idx = zone->zone_pgdat->node_id;
+ LIST_HEAD(templist);
+ int rest = 0;
+
+ /*
+ * Release unused hugetlbpages corresponding to the specified zone.
+ */
+ spin_lock(&hugetlb_lock);
+ try_to_free_hugepages(idx, free_huge_pages, zone);
+ spin_unlock(&hugetlb_lock);
+/* max_huge_pages = set_max_huge_pages(max_huge_pages); */
+
+ /*
+ * Look for all hugetlbpages corresponding to the specified zone.
+ */
+ spin_lock(&hugetlb_lock);
+ list_for_each_entry_safe(page, map, &hugepage_alllists[idx], lru) {
+ /*
+ * Skip hugetlbpages that do not belong to the
+ * specified zone.
+ */
+ if (page_zone(page) != zone)
+ continue;
+ page_cache_get(page-1);
+ unregister_huge_page(page-1);
+ list_add(&page->lru, &templist);
+ }
+ spin_unlock(&hugetlb_lock);
+
+ /*
+ * Try to migrate the pages one by one.
+ */
+ list_for_each_entry_safe(page1, map, &templist, lru) {
+ list_del(&page1->lru);
+ INIT_LIST_HEAD(&page1->lru);
+ page = page1 - 1;
+
+ if (page_count(page) <= 1 || page->mapping == NULL ||
+ mmigrate_onepage(page, MIGRATE_ANYNODE, 0, &hugepage_mmigrate_ops)) {
+ /* free the page later */
+ spin_lock(&hugetlb_lock);
+ register_huge_page(page);
+ spin_unlock(&hugetlb_lock);
+ page_cache_release(page);
+ rest++;
+ }
+ }
+
+ /*
+ * Reallocate unused hugetlbpages.
+ */
+ max_huge_pages = set_max_huge_pages(max_huge_pages);
+ return rest;
+}
+#endif /* CONFIG_MEMHOTPLUG */
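
For readers unfamiliar with the callback-table approach taken by
hugepage_mmigrate_ops above, here is a minimal, self-contained userspace
sketch of the same pattern: a generic migrate routine that drives
allocation, copy and release through an operations structure. All names,
types and constants below are illustrative stand-ins, not the kernel's
mmigrate API; the kernel version additionally handles locking, PTE
teardown and waiting for references to drop (steps 4-5 of the flow).

	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>

	/* Illustrative stand-ins for the mmigrate_operations hooks. */
	struct migrate_ops {
		void *(*alloc_page)(void);        /* get a fresh destination page */
		int (*copy_page)(void *dst, const void *src);
		int (*free_page)(void *page);     /* release the source page */
	};

	enum { TOY_PAGE_SIZE = 4096 };

	static void *toy_alloc(void) { return malloc(TOY_PAGE_SIZE); }
	static int toy_copy(void *dst, const void *src)
	{
		memcpy(dst, src, TOY_PAGE_SIZE);
		return 0;
	}
	static int toy_free(void *page) { free(page); return 0; }

	static const struct migrate_ops toy_ops = {
		.alloc_page = toy_alloc,
		.copy_page  = toy_copy,
		.free_page  = toy_free,
	};

	/* Generic driver: allocate new, copy old to new, free old. */
	static void *migrate_one(void *old, const struct migrate_ops *ops)
	{
		void *new = ops->alloc_page();
		if (new == NULL)
			return NULL;    /* caller keeps the old page */
		ops->copy_page(new, old);
		ops->free_page(old);
		return new;
	}

	int main(void)
	{
		void *old = toy_alloc();
		memset(old, 0xaa, TOY_PAGE_SIZE);
		void *moved = migrate_one(old, &toy_ops);
		printf("migrated page at %p\n", moved);
		free(moved);
		return 0;
	}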
* [PATCH] memory hotremoval for linux-2.6.7 [14/16]
2004-07-14 13:41 [PATCH] memory hotremoval for linux-2.6.7 [0/16] Hirokazu Takahashi
` (12 preceding siblings ...)
2004-07-14 14:06 ` [PATCH] memory hotremoval for linux-2.6.7 [13/16] Hirokazu Takahashi
@ 2004-07-14 14:06 ` Hirokazu Takahashi
2004-07-14 14:06 ` [BUG] [PATCH] memory hotremoval for linux-2.6.7 [15/16] Hirokazu Takahashi
2004-07-14 14:06 ` [PATCH] memory hotremoval for linux-2.6.7 [16/16] Hirokazu Takahashi
15 siblings, 0 replies; 17+ messages in thread
From: Hirokazu Takahashi @ 2004-07-14 14:06 UTC (permalink / raw)
To: linux-kernel, lhms-devel; +Cc: linux-mm
--- linux-2.6.7.ORG/mm/page_alloc.c Thu Jun 17 15:19:28 2032
+++ linux-2.6.7/mm/page_alloc.c Thu Jun 17 15:26:37 2032
@@ -2386,6 +2386,8 @@ int lower_zone_protection_sysctl_handler
}
#ifdef CONFIG_MEMHOTPLUG
+extern int mhtest_hpage_read(char *p, int, int);
+
static int mhtest_read(char *page, char **start, off_t off, int count,
int *eof, void *data)
{
@@ -2409,9 +2411,15 @@ static int mhtest_read(char *page, char
/* skip empty zone */
continue;
len = sprintf(p,
- "\t%s[%d]: free %ld, active %ld, present %ld\n",
+ "\t%s[%d]: free %ld, active %ld, present %ld",
z->name, NODEZONE(i, j),
z->free_pages, z->nr_active, z->present_pages);
+ p += len;
+#if defined(CONFIG_HUGETLB_PAGE) && defined(CONFIG_MEMHOTPLUG)
+ len = mhtest_hpage_read(p, i, j);
+ p += len;
+#endif
+ len = sprintf(p, "\n");
p += len;
}
*p++ = '\n';
--- linux-2.6.7.ORG/mm/hugetlb.c Thu Jun 17 15:26:09 2032
+++ linux-2.6.7/mm/hugetlb.c Thu Jun 17 15:26:37 2032
@@ -260,6 +260,24 @@ static unsigned long set_max_huge_pages(
return nr_huge_pages;
}
+#ifdef CONFIG_MEMHOTPLUG
+int mhtest_hpage_read(char *p, int nodenum, int zonenum)
+{
+ struct page *page;
+ int total = 0;
+ int free = 0;
+ spin_lock(&hugetlb_lock);
+ list_for_each_entry(page, &hugepage_alllists[nodenum], lru) {
+ if (page_zonenum(page) == zonenum) total++;
+ }
+ list_for_each_entry(page, &hugepage_freelists[nodenum], lru) {
+ if (page_zonenum(page) == zonenum) free++;
+ }
+ spin_unlock(&hugetlb_lock);
+ return sprintf(p, " / HugePage free %d, total %d", free, total);
+}
+#endif
+
#ifdef CONFIG_SYSCTL
int hugetlb_sysctl_handler(struct ctl_table *table, int write,
struct file *file, void __user *buffer,
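
Putting the two sprintf() formats together, each zone line in the mhtest
proc output would look roughly like the following; the zone name and all
numbers are invented for illustration:

	Normal[3]: free 12345, active 6789, present 131072 / HugePage free 2, total 4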
* [BUG] [PATCH] memory hotremoval for linux-2.6.7 [15/16]
2004-07-14 13:41 [PATCH] memory hotremoval for linux-2.6.7 [0/16] Hirokazu Takahashi
` (13 preceding siblings ...)
2004-07-14 14:06 ` [PATCH] memory hotremoval for linux-2.6.7 [14/16] Hirokazu Takahashi
@ 2004-07-14 14:06 ` Hirokazu Takahashi
2004-07-14 14:06 ` [PATCH] memory hotremoval for linux-2.6.7 [16/16] Hirokazu Takahashi
15 siblings, 0 replies; 17+ messages in thread
From: Hirokazu Takahashi @ 2004-07-14 14:06 UTC (permalink / raw)
To: linux-kernel, lhms-devel; +Cc: linux-mm
--- linux-2.6.7/fs/direct-io.c.ORG Fri Jun 18 13:52:47 2032
+++ linux-2.6.7/fs/direct-io.c Fri Jun 18 13:53:49 2032
@@ -411,7 +411,7 @@ static int dio_bio_complete(struct dio *
for (page_no = 0; page_no < bio->bi_vcnt; page_no++) {
struct page *page = bvec[page_no].bv_page;
- if (dio->rw == READ)
+ if (dio->rw == READ && !PageCompound(page))
set_page_dirty_lock(page);
page_cache_release(page);
}
* [PATCH] memory hotremoval for linux-2.6.7 [16/16]
2004-07-14 13:41 [PATCH] memory hotremoval for linux-2.6.7 [0/16] Hirokazu Takahashi
` (14 preceding siblings ...)
2004-07-14 14:06 ` [BUG] [PATCH] memory hotremoval for linux-2.6.7 [15/16] Hirokazu Takahashi
@ 2004-07-14 14:06 ` Hirokazu Takahashi
15 siblings, 0 replies; 17+ messages in thread
From: Hirokazu Takahashi @ 2004-07-14 14:06 UTC (permalink / raw)
To: linux-kernel, lhms-devel; +Cc: linux-mm
--- linux-2.6.7.ORG/fs/direct-io.c Thu Jun 17 15:17:13 2032
+++ linux-2.6.7/fs/direct-io.c Thu Jun 17 15:28:44 2032
@@ -27,6 +27,7 @@
#include <linux/slab.h>
#include <linux/highmem.h>
#include <linux/pagemap.h>
+#include <linux/hugetlb.h>
#include <linux/bio.h>
#include <linux/wait.h>
#include <linux/err.h>
@@ -110,7 +111,11 @@ struct dio {
* Page queue. These variables belong to dio_refill_pages() and
* dio_get_page().
*/
+#ifndef CONFIG_HUGETLB_PAGE
struct page *pages[DIO_PAGES]; /* page buffer */
+#else
+ struct page *pages[HPAGE_SIZE/PAGE_SIZE]; /* page buffer */
+#endif
unsigned head; /* next page to process */
unsigned tail; /* last valid page + 1 */
int page_errors; /* errno from get_user_pages() */
@@ -143,9 +148,20 @@ static int dio_refill_pages(struct dio *
{
int ret;
int nr_pages;
+ struct vm_area_struct * vma;
- nr_pages = min(dio->total_pages - dio->curr_page, DIO_PAGES);
down_read(¤t->mm->mmap_sem);
+#ifdef CONFIG_HUGETLB_PAGE
+ vma = find_vma(current->mm, dio->curr_user_address);
+ if (vma && is_vm_hugetlb_page(vma)) {
+ unsigned long n = dio->curr_user_address & PAGE_MASK;
+ n = (n & ~HPAGE_MASK) >> PAGE_SHIFT;
+ n = HPAGE_SIZE/PAGE_SIZE - n;
+ nr_pages = min(dio->total_pages - dio->curr_page, (int)n);
+ } else
+#endif
+ nr_pages = min(dio->total_pages - dio->curr_page, DIO_PAGES);
+
ret = get_user_pages(
current, /* Task for fault acounting */
current->mm, /* whose pages? */
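
The nr_pages clamp above limits a single get_user_pages() call to the
base pages remaining in the current hugepage. Below is a standalone
sketch of that arithmetic; the page-size constants are illustrative
only (the kernel takes them from asm/page.h, and i386 hugepage size
depends on PAE):

	#include <stdio.h>

	/* Illustrative constants: 4 KB base pages, 2 MB hugepages. */
	#define PAGE_SHIFT	12
	#define PAGE_SIZE	(1UL << PAGE_SHIFT)
	#define PAGE_MASK	(~(PAGE_SIZE - 1))
	#define HPAGE_SHIFT	21
	#define HPAGE_SIZE	(1UL << HPAGE_SHIFT)
	#define HPAGE_MASK	(~(HPAGE_SIZE - 1))

	/* Base pages from addr up to the end of the hugepage containing it. */
	static unsigned long pages_left_in_hugepage(unsigned long addr)
	{
		unsigned long n = addr & PAGE_MASK;     /* round down to a base page */
		n = (n & ~HPAGE_MASK) >> PAGE_SHIFT;    /* base-page index inside the hugepage */
		return HPAGE_SIZE / PAGE_SIZE - n;      /* base pages left to its end */
	}

	int main(void)
	{
		printf("%lu\n", pages_left_in_hugepage(0x200000)); /* 512: at a hugepage boundary */
		printf("%lu\n", pages_left_in_hugepage(0x3ff000)); /* 1: last base page of the hugepage */
		return 0;
	}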
Thread overview: 17+ messages
2004-07-14 13:41 [PATCH] memory hotremoval for linux-2.6.7 [0/16] Hirokazu Takahashi
2004-07-14 14:02 ` [PATCH] memory hotremoval for linux-2.6.7 [1/16] Hirokazu Takahashi
2004-07-14 14:02 ` [PATCH] memory hotremoval for linux-2.6.7 [2/16] Hirokazu Takahashi
2004-07-14 14:03 ` [PATCH] memory hotremoval for linux-2.6.7 [3/16] Hirokazu Takahashi
2004-07-14 14:03 ` [PATCH] memory hotremoval for linux-2.6.7 [4/16] Hirokazu Takahashi
2004-07-14 14:03 ` [PATCH] memory hotremoval for linux-2.6.7 [5/16] Hirokazu Takahashi
2004-07-14 14:04 ` [PATCH] memory hotremoval for linux-2.6.7 [6/16] Hirokazu Takahashi
2004-07-14 14:04 ` [PATCH] memory hotremoval for linux-2.6.7 [7/16] Hirokazu Takahashi
2004-07-14 14:04 ` [PATCH] memory hotremoval for linux-2.6.7 [8/16] Hirokazu Takahashi
2004-07-14 14:05 ` [PATCH] memory hotremoval for linux-2.6.7 [9/16] Hirokazu Takahashi
2004-07-14 14:05 ` [PATCH] memory hotremoval for linux-2.6.7 [10/16] Hirokazu Takahashi
2004-07-14 14:05 ` [PATCH] memory hotremoval for linux-2.6.7 [11/16] Hirokazu Takahashi
2004-07-14 14:05 ` [BUG][PATCH] memory hotremoval for linux-2.6.7 [12/16] Hirokazu Takahashi
2004-07-14 14:06 ` [PATCH] memory hotremoval for linux-2.6.7 [13/16] Hirokazu Takahashi
2004-07-14 14:06 ` [PATCH] memory hotremoval for linux-2.6.7 [14/16] Hirokazu Takahashi
2004-07-14 14:06 ` [BUG] [PATCH] memory hotremoval for linux-2.6.7 [15/16] Hirokazu Takahashi
2004-07-14 14:06 ` [PATCH] memory hotremoval for linux-2.6.7 [16/16] Hirokazu Takahashi