linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: David Rientjes <rientjes@google.com>
To: Paul Jackson <pj@sgi.com>
Cc: clameter@sgi.com, akpm@osdl.org, linux-mm@kvack.org
Subject: Re: [PATCH] GFP_THISNODE for the slab allocator
Date: Thu, 21 Sep 2006 15:11:17 -0700 (PDT)	[thread overview]
Message-ID: <Pine.LNX.4.63.0609211510130.17417@chino.corp.google.com> (raw)
In-Reply-To: <Pine.LNX.4.63.0609191222310.7790@chino.corp.google.com>

On Tue, 19 Sep 2006, David Rientjes wrote:

> As Paul and Andrew suggested, there are three additions to task_struct:
> 
>    1.	cached copy of struct zonelist *zonelist that was passed into
> 	get_page_from_freelist,
> 
>    2.	index of zone where free memory was located last, and
> 
>    3.	index of next zone to try when (2) is full.
> 
> get_page_from_freelist, in the case where the passed in zonelist* differs 
> from (1) or in the ~GFP_HARDWALL & ~ALLOC_CPUSET case, uses the current 
> implementation going through the zonelist and finding one with enough free 
> pages.  Otherwise, if we are in the NUMA emulation case, the node where 
> the memory was found most recently can be cached since all memory is 
> equal.  There is no consideration given to the distance between the last 
> used node and the node at the front of the zonelist because the distance 
> between all nodes is 10.  (If the passed in zonelist* differs from (1), 
> then the three additions to task_struct are reset per the new 
> configuration in the same sense as cpuset_update_task_memory_state since 
> the memory placement has changed relative to current->cpuset which 
> cpusets allows by outside manipulation.)  
> 

As suggested by Paul Jackson and friends, this patch abstracts a numa=fake 
macro to the global kernel code.  A macro, 'numa_emu_enabled', is defined 
that can be tested against to determine whether NUMA emulation was 
successful at boot.

In the NUMA emulation case, the most recently allocated from zone is now 
cached in task_struct and used whenever the same zonelist is passed into 
get_page_from_freelist with GFP_HARDWALL and ALLOC_CPUSET.  The node 
distance compared to the first zone's node_id is not compared because 
x86_64 NUMA emulation is not supported for real NUMA machines anyway 
(later work).

This patch is on top of my numa=fake patches that are not currently in -mm 
(this one appears for comments).  Also includes Christoph Lameter's 
z->zone_pgdat->node_id speedup moved away from zone_to_nid since it, too, 
does not appear in my tree.

These trials were the same as before: 3G machine, numa=fake=64, 'usemem -m 
1500 -s 100000 &' in 2G cpuset, and a kernel build in the remaining.

		unpatched	patched		no cpusets, numa=fake=off
	real	5m16.223s	5m9.711s	4m58.118s
	user	9m13.323s	9m16.803s	9m16.583s
	sys	1m7.756s	0m53.947s	0m30.994s

Unpatched top 13:
	8292 __cpuset_zone_allowed		39.4857 <-- ~210.0
	1813 mwait_idle				23.2436
	1042 clear_page				18.2807
	  24 clear_page_end			 3.4286
	 207 find_get_page			 2.9155
	 123 pfn_to_page			 2.6739
	 347 zone_watermark_ok			 2.2244
	 128 __down_read_trylock		 1.9394
	  84 page_remove_rmap			 1.9091
	 155 find_vma				 1.7816
	  80 page_to_pfn			 1.5686
	  60 __strnlen_user			 1.5385
	1250 get_page_from_freelist		 1.3426 <-- ~931.0
329093.6744

Patched top:
	5068 __cpuset_zone_allowed		25.3400 <-- 200.0
	1348 mwait_idle				17.2821
	 928 clear_page				16.2807
	 195 find_get_page			 2.7465
	  17 clear_page_end			 2.4286
	 106 pfn_to_page			 2.3043
	 344 zone_watermark_ok			 2.2051
	  44 nr_free_pages			 1.5172
	  66 page_remove_rmap			 1.5000
	  54 __strnlen_user			 1.3846
	 119 find_vma				 1.3678
	  62 page_to_pfn			 1.2157
	  73 __down_read_trylock		 1.1061
	1133 get_page_from_freelist		 1.0648 <-- ~1064.0

Tradeoff:
	Unpatched:	8292*39.4857 + 1250*1.3426 = 329093.6744
	Patched:	5068*25.3400 + 1133*1.0648 = 129629.5384

Not-signed-off-by: David Rientjes <rientjes@google.com>
---
 arch/x86_64/mm/numa.c  |    9 +++++++--
 arch/x86_64/mm/srat.c  |    2 ++
 include/linux/mmzone.h |    1 +
 include/linux/numa.h   |    7 +++++++
 include/linux/sched.h  |    4 ++++
 kernel/cpuset.c        |    9 +++++++--
 mm/page_alloc.c        |   17 +++++++++++++++--
 7 files changed, 43 insertions(+), 6 deletions(-)

diff --git a/arch/x86_64/mm/numa.c b/arch/x86_64/mm/numa.c
index 9a9e452..46ede0b 100644
--- a/arch/x86_64/mm/numa.c
+++ b/arch/x86_64/mm/numa.c
@@ -11,6 +11,7 @@ #include <linux/mmzone.h>
 #include <linux/ctype.h>
 #include <linux/module.h>
 #include <linux/nodemask.h>
+#include <linux/numa.h>
 
 #include <asm/e820.h>
 #include <asm/proto.h>
@@ -187,6 +188,7 @@ #define E820_ADDR_HOLE_SIZE(start, end)	
 	(e820_hole_size((start) >> PAGE_SHIFT, (end) >> PAGE_SHIFT) <<	\
 		PAGE_SHIFT)
 char *cmdline __initdata;
+int numa_emu;
 
 /*
  * Sets up nodeid to range from addr to addr + sz.  If the end boundary is
@@ -381,8 +383,11 @@ void __init numa_initmem_init(unsigned l
 	int i;
 
 #ifdef CONFIG_NUMA_EMU
-	if (cmdline && !numa_emulation(start_pfn, end_pfn))
- 		return;
+	if (cmdline) {
+		numa_emu = !numa_emulation(start_pfn, end_pfn);
+		if (numa_emu)
+			return;
+	}
 #endif
 
 #ifdef CONFIG_ACPI_NUMA
diff --git a/arch/x86_64/mm/srat.c b/arch/x86_64/mm/srat.c
index 66f375f..eed080c 100644
--- a/arch/x86_64/mm/srat.c
+++ b/arch/x86_64/mm/srat.c
@@ -436,6 +436,8 @@ int __node_distance(int a, int b)
 {
 	int index;
 
+	if (numa_emu_enabled)
+		return 10;
 	if (!acpi_slit)
 		return a == b ? 10 : 20;
 	index = acpi_slit->localities * node_to_pxm(a);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index f45163c..81e047d 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -151,6 +151,7 @@ struct zone {
 	unsigned long		lowmem_reserve[MAX_NR_ZONES];
 
 #ifdef CONFIG_NUMA
+	int node;
 	/*
 	 * zone reclaim becomes active if more unmapped pages exist.
 	 */
diff --git a/include/linux/numa.h b/include/linux/numa.h
index a31a730..ff2720d 100644
--- a/include/linux/numa.h
+++ b/include/linux/numa.h
@@ -10,4 +10,11 @@ #endif
 
 #define MAX_NUMNODES    (1 << NODES_SHIFT)
 
+#ifdef CONFIG_NUMA_EMU
+extern int numa_emu;
+#define numa_emu_enabled	numa_emu
+#else
+#define numa_emu_enabled	0
+#endif
+
 #endif /* _LINUX_NUMA_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 34ed0d9..5a2a7f7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -973,6 +973,10 @@ #ifdef CONFIG_NUMA
   	struct mempolicy *mempolicy;
 	short il_next;
 #endif
+#ifdef CONFIG_NUMA_EMU
+	struct zonelist *last_zonelist;
+	u32 last_zone_used;
+#endif
 #ifdef CONFIG_CPUSETS
 	struct cpuset *cpuset;
 	nodemask_t mems_allowed;
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 4ea6f0d..df19ecf 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -35,6 +35,7 @@ #include <linux/mm.h>
 #include <linux/module.h>
 #include <linux/mount.h>
 #include <linux/namei.h>
+#include <linux/numa.h>
 #include <linux/pagemap.h>
 #include <linux/proc_fs.h>
 #include <linux/rcupdate.h>
@@ -677,6 +678,10 @@ void cpuset_update_task_memory_state(voi
 			tsk->flags |= PF_SPREAD_SLAB;
 		else
 			tsk->flags &= ~PF_SPREAD_SLAB;
+		if (numa_emu_enabled) {
+			tsk->last_zonelist = NULL;
+			tsk->last_zone_used = 0;
+		}
 		task_unlock(tsk);
 		mutex_unlock(&callback_mutex);
 		mpol_rebind_task(tsk, &tsk->mems_allowed);
@@ -2245,7 +2250,7 @@ int cpuset_zonelist_valid_mems_allowed(s
 	int i;
 
 	for (i = 0; zl->zones[i]; i++) {
-		int nid = zl->zones[i]->zone_pgdat->node_id;
+		int nid = zl->zones[i]->node;
 
 		if (node_isset(nid, current->mems_allowed))
 			return 1;
@@ -2318,7 +2323,7 @@ int __cpuset_zone_allowed(struct zone *z
 
 	if (in_interrupt())
 		return 1;
-	node = z->zone_pgdat->node_id;
+	node = z->node;
 	might_sleep_if(!(gfp_mask & __GFP_HARDWALL));
 	if (node_isset(node, current->mems_allowed))
 		return 1;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 54a4f53..c80d6a6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -34,6 +34,7 @@ #include <linux/cpu.h>
 #include <linux/cpuset.h>
 #include <linux/memory_hotplug.h>
 #include <linux/nodemask.h>
+#include <linux/numa.h>
 #include <linux/vmalloc.h>
 #include <linux/mempolicy.h>
 #include <linux/stop_machine.h>
@@ -870,6 +871,14 @@ get_page_from_freelist(gfp_t gfp_mask, u
 	struct zone **z = zonelist->zones;
 	struct page *page = NULL;
 	int classzone_idx = zone_idx(*z);
+	unsigned index = 0;
+
+	if (numa_emu_enabled) {
+		if (zonelist == current->last_zonelist &&
+		    (alloc_flags & __GFP_HARDWALL) && (alloc_flags & ALLOC_CPUSET))
+			z += current->last_zone_used;
+		current->last_zonelist = zonelist;
+	}
 
 	/*
 	 * Go through the zonelist once, looking for a zone with enough free.
@@ -897,8 +906,11 @@ get_page_from_freelist(gfp_t gfp_mask, u
 
 		page = buffered_rmqueue(zonelist, *z, order, gfp_mask);
 		if (page) {
+			if (numa_emu_enabled)
+				current->last_zone_used = index;
 			break;
 		}
+		index++;
 	} while (*(++z) != NULL);
 	return page;
 }
@@ -1203,7 +1215,7 @@ #endif
 #ifdef CONFIG_NUMA
 static void show_node(struct zone *zone)
 {
-	printk("Node %d ", zone->zone_pgdat->node_id);
+	printk("Node %d ", zone->node);
 }
 #else
 #define show_node(zone)	do { } while (0)
@@ -1965,7 +1977,7 @@ __meminit int init_currently_empty_zone(
 
 	zone->zone_start_pfn = zone_start_pfn;
 
-	memmap_init(size, pgdat->node_id, zone_idx(zone), zone_start_pfn);
+	memmap_init(size, zone->node, zone_idx(zone), zone_start_pfn);
 
 	zone_init_free_lists(pgdat, zone, zone->spanned_pages);
 
@@ -2006,6 +2018,7 @@ static void __meminit free_area_init_cor
 		zone->spanned_pages = size;
 		zone->present_pages = realsize;
 #ifdef CONFIG_NUMA
+		zone->node = nid;
 		zone->min_unmapped_ratio = (realsize*sysctl_min_unmapped_ratio)
 						/ 100;
 #endif

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  parent reply	other threads:[~2006-09-21 22:11 UTC|newest]

Thread overview: 82+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2006-09-13 23:50 Christoph Lameter
2006-09-15  5:00 ` Andrew Morton
2006-09-15  6:49   ` Paul Jackson
2006-09-15  7:23     ` Andrew Morton
2006-09-15  7:44       ` Paul Jackson
2006-09-15  8:06         ` Andrew Morton
2006-09-15 15:53           ` David Rientjes
2006-09-15 23:03           ` David Rientjes
2006-09-16  0:04             ` Paul Jackson
2006-09-16  1:36               ` Andrew Morton
2006-09-16  2:23                 ` Christoph Lameter
2006-09-16  4:34                   ` Andrew Morton
2006-09-16  3:28                 ` [PATCH] Add node to zone for the NUMA case Christoph Lameter
2006-09-16  3:40                   ` Paul Jackson
2006-09-16  3:45                 ` [PATCH] GFP_THISNODE for the slab allocator Paul Jackson
2006-09-16  2:47             ` Christoph Lameter
2006-09-17  3:45             ` David Rientjes
2006-09-17 11:17               ` Paul Jackson
2006-09-17 12:41                 ` Christoph Lameter
2006-09-17 13:03                   ` Paul Jackson
2006-09-17 20:36                     ` David Rientjes
2006-09-17 21:20                       ` Paul Jackson
2006-09-17 22:27                       ` Paul Jackson
2006-09-17 23:49                         ` David Rientjes
2006-09-18  2:20                           ` Paul Jackson
2006-09-18 16:34                             ` Paul Jackson
2006-09-18 17:49                               ` David Rientjes
2006-09-18 20:46                                 ` Paul Jackson
2006-09-19 20:52                               ` David Rientjes
2006-09-19 21:26                                 ` Christoph Lameter
2006-09-19 21:50                                   ` David Rientjes
2006-09-21 22:11                                 ` David Rientjes [this message]
2006-09-22 10:10                                   ` Nick Piggin
2006-09-22 16:26                                   ` Paul Jackson
2006-09-22 16:36                                     ` Christoph Lameter
2006-09-15  8:28       ` Andrew Morton
2006-09-16  3:38         ` Paul Jackson
2006-09-16  4:42           ` Andi Kleen
2006-09-16 11:38             ` Paul Jackson
2006-09-16  4:48           ` Andrew Morton
2006-09-16 11:30             ` Paul Jackson
2006-09-16 15:18               ` Andrew Morton
2006-09-17  9:28                 ` Paul Jackson
2006-09-17  9:51                   ` Nick Piggin
2006-09-17 11:15                     ` Paul Jackson
2006-09-17 12:44                       ` Nick Piggin
2006-09-17 13:19                         ` Paul Jackson
2006-09-17 13:52                           ` Nick Piggin
2006-09-17 21:19                             ` Paul Jackson
2006-09-18 12:44                             ` [PATCH] mm: exempt pcp alloc from watermarks Peter Zijlstra
2006-09-18 20:20                               ` Christoph Lameter
2006-09-18 20:43                                 ` Peter Zijlstra
2006-09-19 14:35                               ` Nick Piggin
2006-09-19 14:44                                 ` Christoph Lameter
2006-09-19 15:02                                   ` Nick Piggin
2006-09-19 14:51                                 ` Peter Zijlstra
2006-09-19 15:10                                   ` Nick Piggin
2006-09-19 15:05                                     ` Peter Zijlstra
2006-09-19 15:39                                       ` Christoph Lameter
2006-09-17 16:29                   ` [PATCH] GFP_THISNODE for the slab allocator Andrew Morton
2006-09-18  2:11                     ` Paul Jackson
2006-09-18  5:09                       ` Andrew Morton
2006-09-18  7:49                         ` Paul Jackson
2006-09-16 11:48       ` Paul Jackson
2006-09-16 15:38         ` Andrew Morton
2006-09-16 21:51           ` Paul Jackson
2006-09-16 23:10             ` Andrew Morton
2006-09-17  4:37               ` Christoph Lameter
2006-09-17  4:55                 ` Andrew Morton
2006-09-17 12:09                   ` Paul Jackson
2006-09-17 12:36                   ` Christoph Lameter
2006-09-17 13:06                     ` Paul Jackson
2006-09-19 19:17                 ` David Rientjes
2006-09-19 19:19                   ` David Rientjes
2006-09-19 19:31                   ` Christoph Lameter
2006-09-19 21:12                     ` David Rientjes
2006-09-19 21:28                       ` Christoph Lameter
2006-09-19 21:53                         ` Paul Jackson
2006-09-15 17:08   ` Christoph Lameter
2006-09-15 17:37   ` [PATCH] Add NUMA_BUILD definition in kernel.h to avoid #ifdef CONFIG_NUMA Christoph Lameter
2006-09-15 17:38   ` [PATCH] Disable GFP_THISNODE in the non-NUMA case Christoph Lameter
2006-09-15 17:42   ` [PATCH] GFP_THISNODE for the slab allocator V2 Christoph Lameter

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Pine.LNX.4.63.0609211510130.17417@chino.corp.google.com \
    --to=rientjes@google.com \
    --cc=akpm@osdl.org \
    --cc=clameter@sgi.com \
    --cc=linux-mm@kvack.org \
    --cc=pj@sgi.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox