* [PATCH 1/4] VM: add may_swap flag to scan_control
From: Martin Hicks @ 2005-06-01 14:22 UTC
To: Linux MM, Andrew Morton; +Cc: Ray Bryant
This adds an extra switch to the scan_control struct. It simply lets
the reclaim code know whether it is allowed to swap pages out.

This was required for a simple per-zone reclaimer: without it, pages
would be swapped out as soon as a zone ran out of memory and the early
reclaim kicked in.
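For illustration, here is how a caller that must not touch swap would
fill in its scan_control once this flag exists -- a minimal sketch,
not part of the patch, mirroring the setup the per-zone reclaimer in
patch 2/4 uses:

	struct scan_control sc;

	sc.gfp_mask = gfp_mask;
	sc.may_writepage = 0;
	sc.may_swap = 0;	/* never allocate swap space */
	sc.nr_mapped = read_page_state(nr_mapped);
	sc.nr_scanned = 0;
	sc.nr_reclaimed = 0;

	shrink_zone(zone, &sc);

With may_swap clear, the PageAnon() branch in shrink_list() never
allocates swap space, so anonymous pages that are not already in the
swap cache are skipped and only backed pages get reclaimed.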
Signed-off-by: Martin Hicks <mort@sgi.com>
mm/vmscan.c | 7 ++++++-
1 files changed, 6 insertions(+), 1 deletion(-)
Index: linux-2.6.12-rc5-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.12-rc5-mm1.orig/mm/vmscan.c 2005-05-26 12:27:01.000000000 -0700
+++ linux-2.6.12-rc5-mm1/mm/vmscan.c 2005-05-26 12:27:05.000000000 -0700
@@ -74,6 +74,9 @@ struct scan_control {
int may_writepage;
+ /* Can pages be swapped as part of reclaim? */
+ int may_swap;
+
/* This context's SWAP_CLUSTER_MAX. If freeing memory for
* suspend, we effectively ignore SWAP_CLUSTER_MAX.
* In this context, it doesn't matter that we scan the
@@ -414,7 +417,7 @@ static int shrink_list(struct list_head
* Anonymous process memory has backing store?
* Try to allocate it some swap space here.
*/
- if (PageAnon(page) && !PageSwapCache(page)) {
+ if (PageAnon(page) && !PageSwapCache(page) && sc->may_swap) {
void *cookie = page->mapping;
pgoff_t index = page->index;
@@ -930,6 +933,7 @@ int try_to_free_pages(struct zone **zone
sc.gfp_mask = gfp_mask;
sc.may_writepage = 0;
+ sc.may_swap = 1;
inc_page_state(allocstall);
@@ -1030,6 +1034,7 @@ loop_again:
total_reclaimed = 0;
sc.gfp_mask = GFP_KERNEL;
sc.may_writepage = 0;
+ sc.may_swap = 1;
sc.nr_mapped = read_page_state(nr_mapped);
inc_page_state(pageoutrun);
--
* [PATCH 2/4] VM: early zone reclaim
From: Martin Hicks @ 2005-06-01 14:23 UTC
To: Linux MM, Andrew Morton; +Cc: Ray Bryant
This is the core of the (much simplified) early reclaim. The goal of
this patch is to reclaim some easily-freed pages from a zone before
falling back to another zone.

One of the major uses for this is NUMA machines: with the default
allocator behavior, the allocator looks for memory in another zone,
which might be off-node, before trying to reclaim from the current
zone.

This adds a per-zone tunable that enables early zone reclaim; it is
turned on/off via a new syscall.
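As an example, a userspace tool could turn early reclaim on for
ZONE_NORMAL of node 0 roughly as below. This is a hedged sketch: the
syscall number is 251 on i386 / 1276 on ia64 per this patch, the zone
argument is a mask of 1<<ZONE_* bits, and state is 1 to enable, 0 to
disable.

	#include <stdio.h>
	#include <unistd.h>
	#include <sys/syscall.h>

	#define __NR_set_zone_reclaim	251	/* i386; 1276 on ia64 */
	#define ZONE_NORMAL		1	/* zone index from mmzone.h */

	int main(void)
	{
		/* node 0, zone mask 1 << ZONE_NORMAL, state 1 = enable */
		long ret = syscall(__NR_set_zone_reclaim, 0,
				   1 << ZONE_NORMAL, 1);
		if (ret < 0)
			perror("set_zone_reclaim");
		return ret < 0 ? 1 : 0;
	}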
Signed-off-by: Martin Hicks <mort@sgi.com>
arch/i386/kernel/syscall_table.S | 2 -
arch/ia64/kernel/entry.S | 2 -
include/asm-i386/unistd.h | 2 -
include/asm-ia64/unistd.h | 1
include/linux/mmzone.h | 6 +++
include/linux/swap.h | 1
kernel/sys_ni.c | 1
mm/page_alloc.c | 31 ++++++++++++++++--
mm/vmscan.c | 64 +++++++++++++++++++++++++++++++++++++++
9 files changed, 103 insertions(+), 7 deletions(-)
Index: linux-2.6.12-rc5-mm1/arch/ia64/kernel/entry.S
===================================================================
--- linux-2.6.12-rc5-mm1.orig/arch/ia64/kernel/entry.S 2005-05-26 12:26:59.000000000 -0700
+++ linux-2.6.12-rc5-mm1/arch/ia64/kernel/entry.S 2005-05-26 12:27:11.000000000 -0700
@@ -1573,7 +1573,7 @@ sys_call_table:
data8 sys_keyctl
data8 sys_ni_syscall
data8 sys_ni_syscall // 1275
- data8 sys_ni_syscall
+ data8 sys_set_zone_reclaim
data8 sys_ni_syscall
data8 sys_ni_syscall
data8 sys_ni_syscall
Index: linux-2.6.12-rc5-mm1/include/linux/mmzone.h
===================================================================
--- linux-2.6.12-rc5-mm1.orig/include/linux/mmzone.h 2005-05-26 12:26:59.000000000 -0700
+++ linux-2.6.12-rc5-mm1/include/linux/mmzone.h 2005-05-26 12:27:11.000000000 -0700
@@ -145,6 +145,12 @@ struct zone {
int all_unreclaimable; /* All pages pinned */
/*
+ * Does the allocator try to reclaim pages from the zone as soon
+ * as it fails a watermark_ok() in __alloc_pages?
+ */
+ int reclaim_pages;
+
+ /*
* prev_priority holds the scanning priority for this zone. It is
* defined as the scanning priority at which we achieved our reclaim
* target at the previous try_to_free_pages() or balance_pgdat()
Index: linux-2.6.12-rc5-mm1/include/linux/swap.h
===================================================================
--- linux-2.6.12-rc5-mm1.orig/include/linux/swap.h 2005-05-26 12:26:59.000000000 -0700
+++ linux-2.6.12-rc5-mm1/include/linux/swap.h 2005-05-26 12:27:11.000000000 -0700
@@ -173,6 +173,7 @@ extern void swap_setup(void);
/* linux/mm/vmscan.c */
extern int try_to_free_pages(struct zone **, unsigned int, unsigned int);
+extern int zone_reclaim(struct zone *, unsigned int, unsigned int);
extern int shrink_all_memory(int);
extern int vm_swappiness;
Index: linux-2.6.12-rc5-mm1/mm/page_alloc.c
===================================================================
--- linux-2.6.12-rc5-mm1.orig/mm/page_alloc.c 2005-05-26 12:26:59.000000000 -0700
+++ linux-2.6.12-rc5-mm1/mm/page_alloc.c 2005-05-26 12:27:11.000000000 -0700
@@ -724,6 +724,14 @@ int zone_watermark_ok(struct zone *z, in
return 1;
}
+static inline int
+check_zone_reclaim(struct zone *z, unsigned int gfp_mask)
+{
+ if (!z->reclaim_pages)
+ return 0;
+ return 1;
+}
+
/*
* This is the 'heart' of the zoned buddy allocator.
*/
@@ -763,14 +771,29 @@ __alloc_pages(unsigned int __nocast gfp_
restart:
/* Go through the zonelist once, looking for a zone with enough free */
for (i = 0; (z = zones[i]) != NULL; i++) {
-
- if (!zone_watermark_ok(z, order, z->pages_low,
- classzone_idx, 0, 0))
- continue;
+ int do_reclaim = check_zone_reclaim(z, gfp_mask);
if (!cpuset_zone_allowed(z))
continue;
+ /*
+ * If the zone is to attempt early page reclaim then this loop
+ * will try to reclaim pages and check the watermark a second
+ * time before giving up and falling back to the next zone.
+ */
+ zone_reclaim_retry:
+ if (!zone_watermark_ok(z, order, z->pages_low,
+ classzone_idx, 0, 0)) {
+ if (!do_reclaim)
+ continue;
+ else {
+ zone_reclaim(z, gfp_mask, order);
+ /* Only try reclaim once */
+ do_reclaim = 0;
+ goto zone_reclaim_retry;
+ }
+ }
+
page = buffered_rmqueue(z, order, gfp_mask);
if (page)
goto got_pg;
Index: linux-2.6.12-rc5-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.12-rc5-mm1.orig/mm/vmscan.c 2005-05-26 12:27:05.000000000 -0700
+++ linux-2.6.12-rc5-mm1/mm/vmscan.c 2005-05-26 12:27:11.000000000 -0700
@@ -1326,3 +1326,67 @@ static int __init kswapd_init(void)
}
module_init(kswapd_init)
+
+
+/*
+ * Try to free up some pages from this zone through reclaim.
+ */
+int zone_reclaim(struct zone *zone, unsigned int gfp_mask, unsigned int order)
+{
+ struct scan_control sc;
+ int nr_pages = 1 << order;
+ int total_reclaimed = 0;
+
+ /* The reclaim may sleep, so don't do it if sleep isn't allowed */
+ if (!(gfp_mask & __GFP_WAIT))
+ return 0;
+ if (zone->all_unreclaimable)
+ return 0;
+
+ sc.gfp_mask = gfp_mask;
+ sc.may_writepage = 0;
+ sc.may_swap = 0;
+ sc.nr_mapped = read_page_state(nr_mapped);
+ sc.nr_scanned = 0;
+ sc.nr_reclaimed = 0;
+ /* scan at the highest priority */
+ sc.priority = 0;
+
+ if (nr_pages > SWAP_CLUSTER_MAX)
+ sc.swap_cluster_max = nr_pages;
+ else
+ sc.swap_cluster_max = SWAP_CLUSTER_MAX;
+
+ shrink_zone(zone, &sc);
+ total_reclaimed = sc.nr_reclaimed;
+
+ return total_reclaimed;
+}
+
+asmlinkage long sys_set_zone_reclaim(unsigned int node, unsigned int zone,
+ unsigned int state)
+{
+ struct zone *z;
+ int i;
+
+ if (node >= MAX_NUMNODES || !node_online(node))
+ return -EINVAL;
+
+ /* This will break if we ever add more zones */
+ if (!(zone & (1<<ZONE_DMA|1<<ZONE_NORMAL|1<<ZONE_HIGHMEM)))
+ return -EINVAL;
+
+ for (i = 0; i < MAX_NR_ZONES; i++) {
+ if (!(zone & 1<<i))
+ continue;
+
+ z = &NODE_DATA(node)->node_zones[i];
+
+ if (state)
+ z->reclaim_pages = 1;
+ else
+ z->reclaim_pages = 0;
+ }
+
+ return 0;
+}
Index: linux-2.6.12-rc5-mm1/kernel/sys_ni.c
===================================================================
--- linux-2.6.12-rc5-mm1.orig/kernel/sys_ni.c 2005-05-26 12:26:59.000000000 -0700
+++ linux-2.6.12-rc5-mm1/kernel/sys_ni.c 2005-05-26 12:27:11.000000000 -0700
@@ -77,6 +77,7 @@ cond_syscall(sys_request_key);
cond_syscall(sys_keyctl);
cond_syscall(compat_sys_keyctl);
cond_syscall(compat_sys_socketcall);
+cond_syscall(sys_set_zone_reclaim);
/* arch-specific weak syscall entries */
cond_syscall(sys_pciconfig_read);
Index: linux-2.6.12-rc5-mm1/include/asm-i386/unistd.h
===================================================================
--- linux-2.6.12-rc5-mm1.orig/include/asm-i386/unistd.h 2005-05-26 12:26:59.000000000 -0700
+++ linux-2.6.12-rc5-mm1/include/asm-i386/unistd.h 2005-05-26 12:27:11.000000000 -0700
@@ -256,7 +256,7 @@
#define __NR_io_submit 248
#define __NR_io_cancel 249
#define __NR_fadvise64 250
-
+#define __NR_set_zone_reclaim 251
#define __NR_exit_group 252
#define __NR_lookup_dcookie 253
#define __NR_epoll_create 254
Index: linux-2.6.12-rc5-mm1/include/asm-ia64/unistd.h
===================================================================
--- linux-2.6.12-rc5-mm1.orig/include/asm-ia64/unistd.h 2005-05-26 12:26:59.000000000 -0700
+++ linux-2.6.12-rc5-mm1/include/asm-ia64/unistd.h 2005-05-26 12:27:11.000000000 -0700
@@ -263,6 +263,7 @@
#define __NR_add_key 1271
#define __NR_request_key 1272
#define __NR_keyctl 1273
+#define __NR_set_zone_reclaim 1276
#ifdef __KERNEL__
Index: linux-2.6.12-rc5-mm1/arch/i386/kernel/syscall_table.S
===================================================================
--- linux-2.6.12-rc5-mm1.orig/arch/i386/kernel/syscall_table.S 2005-05-26 12:26:59.000000000 -0700
+++ linux-2.6.12-rc5-mm1/arch/i386/kernel/syscall_table.S 2005-05-26 12:27:11.000000000 -0700
@@ -251,7 +251,7 @@ ENTRY(sys_call_table)
.long sys_io_submit
.long sys_io_cancel
.long sys_fadvise64 /* 250 */
- .long sys_ni_syscall
+ .long sys_set_zone_reclaim
.long sys_exit_group
.long sys_lookup_dcookie
.long sys_epoll_create
--
* [PATCH 3/4] VM: add __GFP_NORECLAIM
From: Martin Hicks @ 2005-06-01 14:23 UTC
To: Linux MM, Andrew Morton; +Cc: Ray Bryant
When using early zone reclaim, it was noticed that allocating new
pages that should be spread across the whole system caused eviction of
local pages.

This adds a new GFP flag that prevents early reclaim during certain
allocation attempts. The example implemented here is page cache pages:
we want page cache pages to be spread across the whole system, and we
don't want them to evict other pages just to get local memory.
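For illustration, any other allocation site with the same
spread-don't-evict requirement can opt out the same way that
page_cache_alloc() does below -- a minimal sketch, not part of the
patch:

	/*
	 * Spread this page across the system; falling back to another
	 * (possibly off-node) zone is preferable to evicting local
	 * pages.
	 */
	struct page *page = alloc_pages(GFP_HIGHUSER | __GFP_NORECLAIM, 0);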
Signed-off-by: Martin Hicks <mort@sgi.com>
include/linux/gfp.h | 3 ++-
include/linux/pagemap.h | 4 ++--
mm/page_alloc.c | 2 ++
3 files changed, 6 insertions(+), 3 deletions(-)
Index: linux-2.6.12-rc5-mm1/include/linux/gfp.h
===================================================================
--- linux-2.6.12-rc5-mm1.orig/include/linux/gfp.h 2005-05-26 12:26:57.000000000 -0700
+++ linux-2.6.12-rc5-mm1/include/linux/gfp.h 2005-05-26 12:27:15.000000000 -0700
@@ -39,6 +39,7 @@ struct vm_area_struct;
#define __GFP_COMP 0x4000u /* Add compound page metadata */
#define __GFP_ZERO 0x8000u /* Return zeroed page on success */
#define __GFP_NOMEMALLOC 0x10000u /* Don't use emergency reserves */
+#define __GFP_NORECLAIM 0x20000u /* No early zone reclaim during allocation */
#define __GFP_BITS_SHIFT 20 /* Room for 20 __GFP_FOO bits */
#define __GFP_BITS_MASK ((1 << __GFP_BITS_SHIFT) - 1)
@@ -47,7 +48,7 @@ struct vm_area_struct;
#define GFP_LEVEL_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS| \
__GFP_COLD|__GFP_NOWARN|__GFP_REPEAT| \
__GFP_NOFAIL|__GFP_NORETRY|__GFP_NO_GROW|__GFP_COMP| \
- __GFP_NOMEMALLOC)
+ __GFP_NOMEMALLOC|__GFP_NORECLAIM)
#define GFP_ATOMIC (__GFP_HIGH)
#define GFP_NOIO (__GFP_WAIT)
Index: linux-2.6.12-rc5-mm1/include/linux/pagemap.h
===================================================================
--- linux-2.6.12-rc5-mm1.orig/include/linux/pagemap.h 2005-05-26 12:26:57.000000000 -0700
+++ linux-2.6.12-rc5-mm1/include/linux/pagemap.h 2005-05-26 12:27:15.000000000 -0700
@@ -52,12 +52,12 @@ void release_pages(struct page **pages,
static inline struct page *page_cache_alloc(struct address_space *x)
{
- return alloc_pages(mapping_gfp_mask(x), 0);
+ return alloc_pages(mapping_gfp_mask(x)|__GFP_NORECLAIM, 0);
}
static inline struct page *page_cache_alloc_cold(struct address_space *x)
{
- return alloc_pages(mapping_gfp_mask(x)|__GFP_COLD, 0);
+ return alloc_pages(mapping_gfp_mask(x)|__GFP_COLD|__GFP_NORECLAIM, 0);
}
typedef int filler_t(void *, struct page *);
Index: linux-2.6.12-rc5-mm1/mm/page_alloc.c
===================================================================
--- linux-2.6.12-rc5-mm1.orig/mm/page_alloc.c 2005-05-26 12:27:11.000000000 -0700
+++ linux-2.6.12-rc5-mm1/mm/page_alloc.c 2005-05-26 12:27:15.000000000 -0700
@@ -729,6 +729,8 @@ check_zone_reclaim(struct zone *z, unsig
{
if (!z->reclaim_pages)
return 0;
+ if (gfp_mask & __GFP_NORECLAIM)
+ return 0;
return 1;
}
--
* [PATCH 4/4] VM: rate limit early reclaim
From: Martin Hicks @ 2005-06-01 14:23 UTC
To: Linux MM, Andrew Morton; +Cc: Ray Bryant
When early zone reclaim is turned on, the LRU is scanned more
frequently when a zone is low on memory. This patch rate-limits early
reclaim by skipping the scan if another thread (either kswapd or
synchronous reclaim) is already reclaiming from the zone.
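The gate relies on the counter being initialized to -1 (see the
page_alloc.c hunk below): atomic_inc_and_test() returns true only when
the post-increment value is zero, so only the first thread to arrive
takes the counter from -1 to 0 and proceeds. A minimal sketch of the
pattern, not part of the patch:

	atomic_t reclaim_in_progress = ATOMIC_INIT(-1);

	/* -1 -> 0: we are the first (and only) early reclaimer */
	if (!atomic_inc_and_test(&reclaim_in_progress))
		goto out;	/* someone else is already scanning */
	shrink_zone(zone, &sc);
out:
	atomic_dec(&reclaim_in_progress);	/* drop back toward -1 */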
Signed-off-by: Martin Hicks <mort@sgi.com>
include/linux/mmzone.h | 2 ++
mm/page_alloc.c | 1 +
mm/vmscan.c | 10 ++++++++++
3 files changed, 13 insertions(+)
Index: linux-2.6.12-rc5-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.12-rc5-mm1.orig/mm/vmscan.c 2005-05-26 12:27:11.000000000 -0700
+++ linux-2.6.12-rc5-mm1/mm/vmscan.c 2005-05-26 12:27:17.000000000 -0700
@@ -903,7 +903,9 @@ shrink_caches(struct zone **zones, struc
if (zone->all_unreclaimable && sc->priority != DEF_PRIORITY)
continue; /* Let kswapd poll it */
+ atomic_inc(&zone->reclaim_in_progress);
shrink_zone(zone, sc);
+ atomic_dec(&zone->reclaim_in_progress);
}
}
@@ -1114,7 +1116,9 @@ scan:
sc.nr_reclaimed = 0;
sc.priority = priority;
sc.swap_cluster_max = nr_pages? nr_pages : SWAP_CLUSTER_MAX;
+ atomic_inc(&zone->reclaim_in_progress);
shrink_zone(zone, &sc);
+ atomic_dec(&zone->reclaim_in_progress);
reclaim_state->reclaimed_slab = 0;
nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL,
lru_pages);
@@ -1357,9 +1361,15 @@ int zone_reclaim(struct zone *zone, unsi
else
sc.swap_cluster_max = SWAP_CLUSTER_MAX;
+ /* Don't reclaim the zone if there are other reclaimers active */
+ if (!atomic_inc_and_test(&zone->reclaim_in_progress))
+ goto out;
+
shrink_zone(zone, &sc);
total_reclaimed = sc.nr_reclaimed;
+ out:
+ atomic_dec(&zone->reclaim_in_progress);
return total_reclaimed;
}
Index: linux-2.6.12-rc5-mm1/include/linux/mmzone.h
===================================================================
--- linux-2.6.12-rc5-mm1.orig/include/linux/mmzone.h 2005-05-26 12:27:11.000000000 -0700
+++ linux-2.6.12-rc5-mm1/include/linux/mmzone.h 2005-05-26 12:27:17.000000000 -0700
@@ -149,6 +149,8 @@ struct zone {
* as it fails a watermark_ok() in __alloc_pages?
*/
int reclaim_pages;
+ /* A count of how many reclaimers are scanning this zone */
+ atomic_t reclaim_in_progress;
/*
* prev_priority holds the scanning priority for this zone. It is
Index: linux-2.6.12-rc5-mm1/mm/page_alloc.c
===================================================================
--- linux-2.6.12-rc5-mm1.orig/mm/page_alloc.c 2005-05-26 12:27:15.000000000 -0700
+++ linux-2.6.12-rc5-mm1/mm/page_alloc.c 2005-05-26 12:27:17.000000000 -0700
@@ -1757,6 +1757,7 @@ static void __init free_area_init_core(s
zone->nr_scan_inactive = 0;
zone->nr_active = 0;
zone->nr_inactive = 0;
+ atomic_set(&zone->reclaim_in_progress, -1);
if (!size)
continue;
--