From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Thu, 26 Jul 2007 23:59:20 +0100
Subject: Re: NUMA policy issues with ZONE_MOVABLE
Message-ID: <20070726225920.GA10225@skynet.ie>
References: <20070725111646.GA9098@skynet.ie>
 <20070726132336.GA18825@skynet.ie>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-15
Content-Disposition: inline
In-Reply-To: 
From: mel@skynet.ie (Mel Gorman)
Sender: owner-linux-mm@kvack.org
Return-Path: 
To: Christoph Lameter 
Cc: linux-mm@kvack.org, Lee Schermerhorn , ak@suse.de,
	KAMEZAWA Hiroyuki , akpm@linux-foundation.org, pj@sgi.com
List-ID: 

On (26/07/07 11:07), Christoph Lameter didst pronounce:
> On Thu, 26 Jul 2007, Mel Gorman wrote:
> 
> > > How about changing __alloc_pages to lookup the zonelist on its own based
> > > on a node parameter and a set of allowed nodes? That may significantly
> > > clean up the memory policy layer and the cpuset layer. But it will
> > > increase the effort to scan zonelists on each allocation. A large system
> > > with 1024 nodes may have more than 1024 zones on each nodelist!
> > > 
> > 
> > That sounds like it would require the creation of a zonelist for each
> > allocation attempt. That is not ideal as there is no place to allocate
> > the zonelist during __alloc_pages(). It's not like it can call
> > kmalloc().
> 
> Nope it would just require scanning the full zonelists on every alloc as
> you already propose.
> 

Right. For this current problem, I would rather not do that. I would
rather fix the bug at hand for 2.6.23 and aim to reduce the number of
zonelists in the next timeframe, after a spell in -mm and wider testing.
This is to reduce the risk of a bugfix introducing performance
regressions.

> > > Nope it would not fail. NUMAQ has policy_zone == HIGHMEM and slab
> > > allocations do not use highmem.
> > 
> > It would fail if policy_zone didn't exist, that was my point. Without
> > policy_zone, we apply policy to all allocations and that causes
> > problems.
> 
> policy_zone can not exist due to ZONE_DMA32 ZONE_NORMAL issues. See my
> other email.
> 
> > I ran the patch on a wide variety of machines, NUMA and non-NUMA. The
> > non-NUMA machines showed no differences as you would expect for
> > kernbench and aim9. On NUMA machines, I saw both small gains and small
> > regressions. By and large, the performance was the same or within 0.08%
> > for kernbench which is within noise basically.
> 
> Sound okay.
> 
> > It might be more pronounced on larger NUMA machines though, I cannot
> > generate those figures.
> 
> I say lets go with the filtering. That would allow us to also catch other
> issues that are now developing on x86_64 with ZONE_NORMAL and ZONE_DMA32.
> 
> > I'll try adding a should_filter to zonelist that is only set for
> > MPOL_BIND and see what it looks like.
> 
> Maybe that is not worth it.

This patch filters only when MPOL_BIND is in use. On non-NUMA, the
checks do not exist at all and in the NUMA case, the filtering usually
does not take place. I'd like this to be the bug fix for policy +
ZONE_MOVABLE, and then deal with reducing the number of zonelists later
to see if there is any performance gain as well as a simplification of
how policies and cpusets are implemented.

Testing shows no difference on non-NUMA, as you'd expect. On NUMA
machines, there are only very small differences (kernbench figures range
from -0.02% to 0.15% across machines).

Lee, can you test this patch in relation to MPOL_BIND? I'll look at the
numactl tests tomorrow as well.

Comments?
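For the MPOL_BIND side, a minimal userspace program along these lines
might do as a first smoke test. This is an untested sketch; the node
mask and mapping size are illustrative, and it assumes node 0 exists
and libnuma's headers are installed (build with gcc mbind-test.c -lnuma):

/*
 * Untested sketch: fault anonymous memory under MPOL_BIND so the
 * allocations walk the custom bind zonelist that the patch filters
 * in get_page_from_freelist(). Node 0 and the size are illustrative.
 */
#include <numaif.h>
#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAPPING_SIZE (64UL * 1024 * 1024)

int main(void)
{
	unsigned long nodemask = 1UL << 0;	/* bind to node 0 */
	char *buf;

	buf = mmap(NULL, MAPPING_SIZE, PROT_READ|PROT_WRITE,
			MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		exit(EXIT_FAILURE);
	}

	/* Apply MPOL_BIND so faults use the bind_zonelist() zonelist */
	if (mbind(buf, MAPPING_SIZE, MPOL_BIND, &nodemask,
			sizeof(nodemask) * 8, 0) != 0) {
		perror("mbind");
		exit(EXIT_FAILURE);
	}

	/* Touch every page to force the allocations to happen */
	memset(buf, 0xa5, MAPPING_SIZE);

	munmap(buf, MAPPING_SIZE);
	return 0;
}

Checking /proc/<pid>/numa_maps afterwards should confirm the pages came
only from the bound node; the filtering itself matters most for
allocations whose gfp_mask excludes the movable zone but which walk the
same bind zonelist.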
diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index e147cf5..5bdd656 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -166,7 +166,7 @@ extern enum zone_type policy_zone;
 
 static inline void check_highest_zone(enum zone_type k)
 {
-	if (k > policy_zone)
+	if (k > policy_zone && k != ZONE_MOVABLE)
 		policy_zone = k;
 }
 
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index da8eb8a..eb7cb56 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -411,6 +411,24 @@ struct zonelist {
 #endif
 };
 
+#ifdef CONFIG_NUMA
+/*
+ * Only custom zonelists like MPOL_BIND need to be filtered as part of
+ * policies. As described in the comment for struct zonelist_cache, these
+ * zonelists will not have a zlcache so zlcache_ptr will not be set. Use
+ * that to determine if the zonelists needs to be filtered or not.
+ */
+static inline int alloc_should_filter_zonelist(struct zonelist *zonelist)
+{
+	return !zonelist->zlcache_ptr;
+}
+#else
+static inline int alloc_should_filter_zonelist(struct zonelist *zonelist)
+{
+	return 0;
+}
+#endif /* CONFIG_NUMA */
+
 #ifdef CONFIG_ARCH_POPULATES_NODE_MAP
 struct node_active_region {
 	unsigned long start_pfn;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 71b84b4..172abff 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -149,7 +149,7 @@ static struct zonelist *bind_zonelist(nodemask_t *nodes)
 	   lower zones etc. Avoid empty zones because the memory allocator
 	   doesn't like them. If you implement node hot removal you
 	   have to fix that. */
-	k = policy_zone;
+	k = MAX_NR_ZONES - 1;
 	while (1) {
 		for_each_node_mask(nd, *nodes) {
 			struct zone *z = &NODE_DATA(nd)->node_zones[k];
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 40954fb..99c5a53 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1157,6 +1157,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order,
 	nodemask_t *allowednodes = NULL;/* zonelist_cache approximation */
 	int zlc_active = 0;		/* set if using zonelist_cache */
 	int did_zlc_setup = 0;		/* just call zlc_setup() one time */
+	enum zone_type highest_zoneidx = -1; /* Gets set for policy zonelists */
 
 zonelist_scan:
 	/*
@@ -1166,6 +1167,18 @@ zonelist_scan:
 	z = zonelist->zones;
 
 	do {
+		/*
+		 * In NUMA, this could be a policy zonelist which contains
+		 * zones that may not be allowed by the current gfp_mask.
+		 * Check the zone is allowed by the current flags
+		 */
+		if (unlikely(alloc_should_filter_zonelist(zonelist))) {
+			if (highest_zoneidx == -1)
+				highest_zoneidx = gfp_zone(gfp_mask);
+			if (zone_idx(*z) > highest_zoneidx)
+				continue;
+		}
+
 		if (NUMA_BUILD && zlc_active &&
 			!zlc_zone_worth_trying(zonelist, z, allowednodes))
 				continue;
-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org