From: David Rientjes <rientjes@google.com>
To: Paul Jackson <pj@sgi.com>
Cc: clameter@sgi.com, akpm@osdl.org, linux-mm@kvack.org
Subject: Re: [PATCH] GFP_THISNODE for the slab allocator
Date: Thu, 21 Sep 2006 15:11:17 -0700 (PDT)
Message-ID: <Pine.LNX.4.63.0609211510130.17417@chino.corp.google.com>
In-Reply-To: <Pine.LNX.4.63.0609191222310.7790@chino.corp.google.com>
On Tue, 19 Sep 2006, David Rientjes wrote:
> As Paul and Andrew suggested, there are three additions to task_struct:
>
> 1. cached copy of struct zonelist *zonelist that was passed into
> get_page_from_freelist,
>
> 2. index of zone where free memory was located last, and
>
> 3. index of next zone to try when (2) is full.
>
> get_page_from_freelist, in the case where the passed in zonelist* differs
> from (1) or in the ~GFP_HARDWALL & ~ALLOC_CPUSET case, uses the current
> implementation going through the zonelist and finding one with enough free
> pages. Otherwise, if we are in the NUMA emulation case, the node where
> the memory was found most recently can be cached since all memory is
> equal. There is no consideration given to the distance between the last
> used node and the node at the front of the zonelist because the distance
> between all nodes is 10. (If the passed in zonelist* differs from (1),
> then the three additions to task_struct are reset per the new
> configuration in the same sense as cpuset_update_task_memory_state since
> the memory placement has changed relative to current->cpuset which
> cpusets allows by outside manipulation.)
>
As suggested by Paul Jackson and friends, this patch exposes the numa=fake
state to generic kernel code. A macro, 'numa_emu_enabled', is defined
that can be tested to determine whether NUMA emulation was set up
successfully at boot.
In the NUMA emulation case, the zone most recently allocated from is now
cached in task_struct and used as the starting point whenever the same
zonelist is passed into get_page_from_freelist with GFP_HARDWALL and
ALLOC_CPUSET. The distance between the cached zone's node and the first
zone's node is not checked because x86_64 NUMA emulation is not supported
on real NUMA machines anyway (later work).
This patch applies on top of my numa=fake patches, which are not currently
in -mm (this one is posted for comments). It also includes Christoph
Lameter's z->zone_pgdat->node_id speedup, used directly as zone->node
rather than through zone_to_nid since that, too, does not appear in my
tree.
These trials were the same as before: a 3G machine, numa=fake=64,
'usemem -m 1500 -s 100000 &' in a 2G cpuset, and a kernel build in the
remaining memory.
	unpatched	patched		no cpusets, numa=fake=off
real	5m16.223s	5m9.711s	4m58.118s
user	9m13.323s	9m16.803s	9m16.583s
sys	1m7.756s	0m53.947s	0m30.994s
Unpatched top 13:
  8292 __cpuset_zone_allowed      39.4857  <-- ~210.0
  1813 mwait_idle                 23.2436
  1042 clear_page                 18.2807
    24 clear_page_end              3.4286
   207 find_get_page               2.9155
   123 pfn_to_page                 2.6739
   347 zone_watermark_ok           2.2244
   128 __down_read_trylock         1.9394
    84 page_remove_rmap            1.9091
   155 find_vma                    1.7816
    80 page_to_pfn                 1.5686
    60 __strnlen_user              1.5385
  1250 get_page_from_freelist      1.3426  <-- ~931.0
Patched top:
  5068 __cpuset_zone_allowed      25.3400  <-- 200.0
  1348 mwait_idle                 17.2821
   928 clear_page                 16.2807
   195 find_get_page               2.7465
    17 clear_page_end              2.4286
   106 pfn_to_page                 2.3043
   344 zone_watermark_ok           2.2051
    44 nr_free_pages               1.5172
    66 page_remove_rmap            1.5000
    54 __strnlen_user              1.3846
   119 find_vma                    1.3678
    62 page_to_pfn                 1.2157
    73 __down_read_trylock         1.1061
  1133 get_page_from_freelist      1.0648  <-- ~1064.0
Tradeoff:
Unpatched: 8292*39.4857 + 1250*1.3426 = 329093.6744
Patched: 5068*25.3400 + 1133*1.0648 = 129629.5384
Not-signed-off-by: David Rientjes <rientjes@google.com>
---
arch/x86_64/mm/numa.c | 9 +++++++--
arch/x86_64/mm/srat.c | 2 ++
include/linux/mmzone.h | 1 +
include/linux/numa.h | 7 +++++++
include/linux/sched.h | 4 ++++
kernel/cpuset.c | 9 +++++++--
mm/page_alloc.c | 17 +++++++++++++++--
7 files changed, 43 insertions(+), 6 deletions(-)
diff --git a/arch/x86_64/mm/numa.c b/arch/x86_64/mm/numa.c
index 9a9e452..46ede0b 100644
--- a/arch/x86_64/mm/numa.c
+++ b/arch/x86_64/mm/numa.c
@@ -11,6 +11,7 @@ #include <linux/mmzone.h>
#include <linux/ctype.h>
#include <linux/module.h>
#include <linux/nodemask.h>
+#include <linux/numa.h>
#include <asm/e820.h>
#include <asm/proto.h>
@@ -187,6 +188,7 @@ #define E820_ADDR_HOLE_SIZE(start, end)
(e820_hole_size((start) >> PAGE_SHIFT, (end) >> PAGE_SHIFT) << \
PAGE_SHIFT)
char *cmdline __initdata;
+int numa_emu;
/*
* Sets up nodeid to range from addr to addr + sz. If the end boundary is
@@ -381,8 +383,11 @@ void __init numa_initmem_init(unsigned l
int i;
#ifdef CONFIG_NUMA_EMU
- if (cmdline && !numa_emulation(start_pfn, end_pfn))
- return;
+ if (cmdline) {
+ numa_emu = !numa_emulation(start_pfn, end_pfn);
+ if (numa_emu)
+ return;
+ }
#endif
#ifdef CONFIG_ACPI_NUMA
diff --git a/arch/x86_64/mm/srat.c b/arch/x86_64/mm/srat.c
index 66f375f..eed080c 100644
--- a/arch/x86_64/mm/srat.c
+++ b/arch/x86_64/mm/srat.c
@@ -436,6 +436,8 @@ int __node_distance(int a, int b)
{
int index;
+ if (numa_emu_enabled)
+ return 10;
if (!acpi_slit)
return a == b ? 10 : 20;
index = acpi_slit->localities * node_to_pxm(a);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index f45163c..81e047d 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -151,6 +151,7 @@ struct zone {
unsigned long lowmem_reserve[MAX_NR_ZONES];
#ifdef CONFIG_NUMA
+ int node;
/*
* zone reclaim becomes active if more unmapped pages exist.
*/
diff --git a/include/linux/numa.h b/include/linux/numa.h
index a31a730..ff2720d 100644
--- a/include/linux/numa.h
+++ b/include/linux/numa.h
@@ -10,4 +10,11 @@ #endif
#define MAX_NUMNODES (1 << NODES_SHIFT)
+#ifdef CONFIG_NUMA_EMU
+extern int numa_emu;
+#define numa_emu_enabled numa_emu
+#else
+#define numa_emu_enabled 0
+#endif
+
#endif /* _LINUX_NUMA_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 34ed0d9..5a2a7f7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -973,6 +973,10 @@ #ifdef CONFIG_NUMA
struct mempolicy *mempolicy;
short il_next;
#endif
+#ifdef CONFIG_NUMA_EMU
+ struct zonelist *last_zonelist;
+ u32 last_zone_used;
+#endif
#ifdef CONFIG_CPUSETS
struct cpuset *cpuset;
nodemask_t mems_allowed;
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 4ea6f0d..df19ecf 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -35,6 +35,7 @@ #include <linux/mm.h>
#include <linux/module.h>
#include <linux/mount.h>
#include <linux/namei.h>
+#include <linux/numa.h>
#include <linux/pagemap.h>
#include <linux/proc_fs.h>
#include <linux/rcupdate.h>
@@ -677,6 +678,10 @@ void cpuset_update_task_memory_state(voi
tsk->flags |= PF_SPREAD_SLAB;
else
tsk->flags &= ~PF_SPREAD_SLAB;
+ if (numa_emu_enabled) {
+ tsk->last_zonelist = NULL;
+ tsk->last_zone_used = 0;
+ }
task_unlock(tsk);
mutex_unlock(&callback_mutex);
mpol_rebind_task(tsk, &tsk->mems_allowed);
@@ -2245,7 +2250,7 @@ int cpuset_zonelist_valid_mems_allowed(s
int i;
for (i = 0; zl->zones[i]; i++) {
- int nid = zl->zones[i]->zone_pgdat->node_id;
+ int nid = zl->zones[i]->node;
if (node_isset(nid, current->mems_allowed))
return 1;
@@ -2318,7 +2323,7 @@ int __cpuset_zone_allowed(struct zone *z
if (in_interrupt())
return 1;
- node = z->zone_pgdat->node_id;
+ node = z->node;
might_sleep_if(!(gfp_mask & __GFP_HARDWALL));
if (node_isset(node, current->mems_allowed))
return 1;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 54a4f53..c80d6a6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -34,6 +34,7 @@ #include <linux/cpu.h>
#include <linux/cpuset.h>
#include <linux/memory_hotplug.h>
#include <linux/nodemask.h>
+#include <linux/numa.h>
#include <linux/vmalloc.h>
#include <linux/mempolicy.h>
#include <linux/stop_machine.h>
@@ -870,6 +871,14 @@ get_page_from_freelist(gfp_t gfp_mask, u
struct zone **z = zonelist->zones;
struct page *page = NULL;
int classzone_idx = zone_idx(*z);
+ unsigned index = 0;
+
+ if (numa_emu_enabled) {
+ if (zonelist == current->last_zonelist &&
+ (gfp_mask & __GFP_HARDWALL) && (alloc_flags & ALLOC_CPUSET))
+ z += current->last_zone_used;
+ current->last_zonelist = zonelist;
+ }
/*
* Go through the zonelist once, looking for a zone with enough free.
@@ -897,8 +906,11 @@ get_page_from_freelist(gfp_t gfp_mask, u
page = buffered_rmqueue(zonelist, *z, order, gfp_mask);
if (page) {
+ if (numa_emu_enabled)
+ current->last_zone_used = index;
break;
}
+ index++;
} while (*(++z) != NULL);
return page;
}
@@ -1203,7 +1215,7 @@ #endif
#ifdef CONFIG_NUMA
static void show_node(struct zone *zone)
{
- printk("Node %d ", zone->zone_pgdat->node_id);
+ printk("Node %d ", zone->node);
}
#else
#define show_node(zone) do { } while (0)
@@ -1965,7 +1977,7 @@ __meminit int init_currently_empty_zone(
zone->zone_start_pfn = zone_start_pfn;
- memmap_init(size, pgdat->node_id, zone_idx(zone), zone_start_pfn);
+ memmap_init(size, zone->node, zone_idx(zone), zone_start_pfn);
zone_init_free_lists(pgdat, zone, zone->spanned_pages);
@@ -2006,6 +2018,7 @@ static void __meminit free_area_init_cor
zone->spanned_pages = size;
zone->present_pages = realsize;
#ifdef CONFIG_NUMA
+ zone->node = nid;
zone->min_unmapped_ratio = (realsize*sysctl_min_unmapped_ratio)
/ 100;
#endif