From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Thu, 21 Sep 2006 15:11:17 -0700 (PDT)
From: David Rientjes
Subject: Re: [PATCH] GFP_THISNODE for the slab allocator
In-Reply-To:
Message-ID:
References: <20060914220011.2be9100a.akpm@osdl.org>
 <20060914234926.9b58fd77.pj@sgi.com>
 <20060915002325.bffe27d1.akpm@osdl.org>
 <20060915004402.88d462ff.pj@sgi.com>
 <20060915010622.0e3539d2.akpm@osdl.org>
 <20060917041707.28171868.pj@sgi.com>
 <20060917060358.ac16babf.pj@sgi.com>
 <20060917152723.5bb69b82.pj@sgi.com>
 <20060917192010.cc360ece.pj@sgi.com>
 <20060918093434.e66b8887.pj@sgi.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-linux-mm@kvack.org
Return-Path:
To: Paul Jackson
Cc: clameter@sgi.com, akpm@osdl.org, linux-mm@kvack.org
List-ID:

On Tue, 19 Sep 2006, David Rientjes wrote:

> As Paul and Andrew suggested, there are three additions to task_struct:
>
>    1. cached copy of struct zonelist *zonelist that was passed into
>       get_page_from_freelist,
>
>    2. index of zone where free memory was located last, and
>
>    3. index of next zone to try when (2) is full.
>
> get_page_from_freelist, in the case where the passed in zonelist* differs
> from (1) or in the ~GFP_HARDWALL & ~ALLOC_CPUSET case, uses the current
> implementation going through the zonelist and finding one with enough free
> pages.  Otherwise, if we are in the NUMA emulation case, the node where
> the memory was found most recently can be cached since all memory is
> equal.  There is no consideration given to the distance between the last
> used node and the node at the front of the zonelist because the distance
> between all nodes is 10.  (If the passed in zonelist* differs from (1),
> then the three additions to task_struct are reset per the new
> configuration in the same sense as cpuset_update_task_memory_state since
> the memory placement has changed relative to current->cpuset which
> cpusets allows by outside manipulation.)
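
For illustration only, the caching idea quoted above can be modelled in
userspace C roughly as follows. This is a sketch under stated assumptions,
not kernel code: struct task, alloc_from_zonelist(), MAX_ZONES and the
free_pages counter are hypothetical stand-ins, and the sketch stores the
absolute index into the zonelist. Item (3) from the list is implicit here,
since the next zone to try is simply the one after the cached index.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ZONES 8	/* hypothetical bound, only for this sketch */

/* Simplified stand-ins for the kernel structures, not the real ones. */
struct zone {
	int node;		/* node id of this zone */
	long free_pages;	/* pages still available in this zone */
};

struct zonelist {
	struct zone *zones[MAX_ZONES + 1];	/* NULL-terminated */
};

struct task {
	struct zonelist *last_zonelist;	/* (1) zonelist seen on the last call */
	unsigned last_zone_used;	/* (2) index of zone that last had memory */
};

/*
 * Return the index of the zone that satisfied the request, or -1 if no
 * zone at or after the starting index has enough free pages.  When the
 * same zonelist is passed again, the scan resumes at the cached index
 * instead of re-walking the full zones at the front of the list.
 */
static int alloc_from_zonelist(struct task *tsk, struct zonelist *zl,
			       long pages)
{
	unsigned index = 0;

	assert(tsk != NULL && zl != NULL);

	if (zl == tsk->last_zonelist)
		index = tsk->last_zone_used;	/* resume at the cached zone */
	else
		tsk->last_zone_used = 0;	/* placement changed: reset */
	tsk->last_zonelist = zl;

	for (; zl->zones[index] != NULL; index++) {
		struct zone *z = zl->zones[index];

		if (z->free_pages >= pages) {
			z->free_pages -= pages;
			tsk->last_zone_used = index;	/* absolute index */
			return (int)index;
		}
	}
	return -1;
}
```

With a two-zone list where the first zone is nearly full, the first call
walks past it and caches index 1; a second call with the same zonelist
starts directly at index 1. Resetting the cache when a different zonelist
is passed mirrors the reset described in the quoted text.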

As suggested by Paul Jackson and friends, this patch exposes the numa=fake
state to generic kernel code.  A macro, 'numa_emu_enabled', is defined that
can be tested to determine whether NUMA emulation was set up successfully
at boot.

In the NUMA emulation case, the zone most recently allocated from is now
cached in task_struct and used whenever the same zonelist is passed into
get_page_from_freelist with GFP_HARDWALL and ALLOC_CPUSET.  The distance
from the cached zone's node to the first zone's node is not considered
because x86_64 NUMA emulation is not supported on real NUMA machines
anyway (later work).

This patch is on top of my numa=fake patches that are not currently in -mm
(this one is posted for comments).  It also includes Christoph Lameter's
z->zone_pgdat->node_id speedup, done without zone_to_nid since it, too,
does not appear in my tree.

These trials were the same as before: 3G machine, numa=fake=64,
'usemem -m 1500 -s 100000 &' in a 2G cpuset, and a kernel build in the
remaining memory.

		unpatched	patched		no cpusets, numa=fake=off
	real	5m16.223s	5m9.711s	4m58.118s
	user	9m13.323s	9m16.803s	9m16.583s
	sys	1m7.756s	0m53.947s	0m30.994s

Unpatched top 13:
  8292 __cpuset_zone_allowed                     39.4857	<-- ~210.0
  1813 mwait_idle                                23.2436
  1042 clear_page                                18.2807
    24 clear_page_end                             3.4286
   207 find_get_page                              2.9155
   123 pfn_to_page                                2.6739
   347 zone_watermark_ok                          2.2244
   128 __down_read_trylock                        1.9394
    84 page_remove_rmap                           1.9091
   155 find_vma                                   1.7816
    80 page_to_pfn                                1.5686
    60 __strnlen_user                             1.5385
  1250 get_page_from_freelist                     1.3426	<-- ~931.0

Patched top:
  5068 __cpuset_zone_allowed                     25.3400	<-- 200.0
  1348 mwait_idle                                17.2821
   928 clear_page                                16.2807
   195 find_get_page                              2.7465
    17 clear_page_end                             2.4286
   106 pfn_to_page                                2.3043
   344 zone_watermark_ok                          2.2051
    44 nr_free_pages                              1.5172
    66 page_remove_rmap                           1.5000
    54 __strnlen_user                             1.3846
   119 find_vma                                   1.3678
    62 page_to_pfn                                1.2157
    73 __down_read_trylock                        1.1061
  1133 get_page_from_freelist                     1.0648	<-- ~1064.0

Tradeoff:
Unpatched: 8292*39.4857 + 1250*1.3426 = 329093.6744
Patched:   5068*25.3400 + 1133*1.0648 = 129629.5384

Not-signed-off-by: David Rientjes
---
 arch/x86_64/mm/numa.c  |    9 +++++++--
 arch/x86_64/mm/srat.c  |    2 ++
 include/linux/mmzone.h |    1 +
 include/linux/numa.h   |    7 +++++++
 include/linux/sched.h  |    4 ++++
 kernel/cpuset.c        |    9 +++++++--
 mm/page_alloc.c        |   17 +++++++++++++++--
 7 files changed, 43 insertions(+), 6 deletions(-)

diff --git a/arch/x86_64/mm/numa.c b/arch/x86_64/mm/numa.c
index 9a9e452..46ede0b 100644
--- a/arch/x86_64/mm/numa.c
+++ b/arch/x86_64/mm/numa.c
@@ -11,6 +11,7 @@ #include
 #include
 #include
 #include
+#include
 
 #include
 #include
@@ -187,6 +188,7 @@ #define E820_ADDR_HOLE_SIZE(start, end)
 	(e820_hole_size((start) >> PAGE_SHIFT, (end) >> PAGE_SHIFT) <<	\
 	PAGE_SHIFT)
 char *cmdline __initdata;
+int numa_emu;
 
 /*
  * Sets up nodeid to range from addr to addr + sz.  If the end boundary is
@@ -381,8 +383,11 @@ void __init numa_initmem_init(unsigned l
 	int i;
 
 #ifdef CONFIG_NUMA_EMU
-	if (cmdline && !numa_emulation(start_pfn, end_pfn))
-		return;
+	if (cmdline) {
+		numa_emu = !numa_emulation(start_pfn, end_pfn);
+		if (numa_emu)
+			return;
+	}
 #endif
 
 #ifdef CONFIG_ACPI_NUMA
diff --git a/arch/x86_64/mm/srat.c b/arch/x86_64/mm/srat.c
index 66f375f..eed080c 100644
--- a/arch/x86_64/mm/srat.c
+++ b/arch/x86_64/mm/srat.c
@@ -436,6 +436,8 @@ int __node_distance(int a, int b)
 {
 	int index;
 
+	if (numa_emu_enabled)
+		return 10;
 	if (!acpi_slit)
 		return a == b ? 10 : 20;
 	index = acpi_slit->localities * node_to_pxm(a);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index f45163c..81e047d 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -151,6 +151,7 @@ struct zone {
 	unsigned long		lowmem_reserve[MAX_NR_ZONES];
 
 #ifdef CONFIG_NUMA
+	int node;
 	/*
 	 * zone reclaim becomes active if more unmapped pages exist.
 	 */
diff --git a/include/linux/numa.h b/include/linux/numa.h
index a31a730..ff2720d 100644
--- a/include/linux/numa.h
+++ b/include/linux/numa.h
@@ -10,4 +10,11 @@ #endif
 
 #define MAX_NUMNODES    (1 << NODES_SHIFT)
 
+#ifdef CONFIG_NUMA_EMU
+extern int numa_emu;
+#define numa_emu_enabled	numa_emu
+#else
+#define numa_emu_enabled	0
+#endif
+
 #endif /* _LINUX_NUMA_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 34ed0d9..5a2a7f7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -973,6 +973,10 @@ #ifdef CONFIG_NUMA
   	struct mempolicy *mempolicy;
 	short il_next;
 #endif
+#ifdef CONFIG_NUMA_EMU
+	struct zonelist *last_zonelist;
+	u32 last_zone_used;
+#endif
 #ifdef CONFIG_CPUSETS
 	struct cpuset *cpuset;
 	nodemask_t mems_allowed;
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 4ea6f0d..df19ecf 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -35,6 +35,7 @@ #include
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -677,6 +678,10 @@ void cpuset_update_task_memory_state(voi
 			tsk->flags |= PF_SPREAD_SLAB;
 		else
 			tsk->flags &= ~PF_SPREAD_SLAB;
+		if (numa_emu_enabled) {
+			tsk->last_zonelist = NULL;
+			tsk->last_zone_used = 0;
+		}
 		task_unlock(tsk);
 		mutex_unlock(&callback_mutex);
 		mpol_rebind_task(tsk, &tsk->mems_allowed);
@@ -2245,7 +2250,7 @@ int cpuset_zonelist_valid_mems_allowed(s
 	int i;
 
 	for (i = 0; zl->zones[i]; i++) {
-		int nid = zl->zones[i]->zone_pgdat->node_id;
+		int nid = zl->zones[i]->node;
 
 		if (node_isset(nid, current->mems_allowed))
 			return 1;
@@ -2318,7 +2323,7 @@ int __cpuset_zone_allowed(struct zone *z
 
 	if (in_interrupt())
 		return 1;
-	node = z->zone_pgdat->node_id;
+	node = z->node;
 	might_sleep_if(!(gfp_mask & __GFP_HARDWALL));
 	if (node_isset(node, current->mems_allowed))
 		return 1;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 54a4f53..c80d6a6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -34,6 +34,7 @@ #include
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -870,6 +871,14 @@ get_page_from_freelist(gfp_t gfp_mask, u
 	struct zone **z = zonelist->zones;
 	struct page *page = NULL;
 	int classzone_idx = zone_idx(*z);
+	unsigned index = 0;
+
+	if (numa_emu_enabled) {
+		if (zonelist == current->last_zonelist &&
+		    (gfp_mask & __GFP_HARDWALL) && (alloc_flags & ALLOC_CPUSET))
+			z += current->last_zone_used;
+		current->last_zonelist = zonelist;
+	}
 
 	/*
 	 * Go through the zonelist once, looking for a zone with enough free.
@@ -897,8 +906,11 @@ get_page_from_freelist(gfp_t gfp_mask, u
 
 		page = buffered_rmqueue(zonelist, *z, order, gfp_mask);
 		if (page) {
+			if (numa_emu_enabled)
+				current->last_zone_used = index;
 			break;
 		}
+		index++;
 	} while (*(++z) != NULL);
 	return page;
 }
@@ -1203,7 +1215,7 @@ #endif
 #ifdef CONFIG_NUMA
 static void show_node(struct zone *zone)
 {
-	printk("Node %d ", zone->zone_pgdat->node_id);
+	printk("Node %d ", zone->node);
 }
 #else
 #define show_node(zone)	do { } while (0)
@@ -1965,7 +1977,7 @@ __meminit int init_currently_empty_zone(
 
 	zone->zone_start_pfn = zone_start_pfn;
 
-	memmap_init(size, pgdat->node_id, zone_idx(zone), zone_start_pfn);
+	memmap_init(size, zone->node, zone_idx(zone), zone_start_pfn);
 
 	zone_init_free_lists(pgdat, zone, zone->spanned_pages);
 
@@ -2006,6 +2018,7 @@ static void __meminit free_area_init_cor
 		zone->spanned_pages = size;
 		zone->present_pages = realsize;
 #ifdef CONFIG_NUMA
+		zone->node = nid;
 		zone->min_unmapped_ratio = (realsize*sysctl_min_unmapped_ratio)
 						/ 100;
 #endif

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org