From: mel@skynet.ie (Mel Gorman)
Date: Fri, 1 Dec 2006 12:32:05 +0000
To: Christoph Lameter
Cc: akpm@osdl.org, linux-mm@kvack.org
Subject: Re: Slab: Better fallback allocation behavior
Message-ID: <20061201123205.GA3528@skynet.ie>

Hi Christoph,

On (29/11/06 17:01), Christoph Lameter didst pronounce:
> Currently we simply attempt to allocate from all allowed nodes using
> GFP_THISNODE.

I thought GFP_THISNODE meant we never fall back to other nodes and no
policies are ever applied.

> However, GFP_THISNODE does not do reclaim (it wont do any at
> all if the recent GFP_THISNODE patch is accepted). If we truly run out of
> memory in the whole system then fallback_alloc may return NULL although
> memory may still be available if we would perform more thorough reclaim.
>
> This patch changes fallback_alloc() so that we first only inspect all the
> per node queues for available slabs. If we find any then we allocate from
> those. This avoids slab fragmentation by first getting rid of all partial
> allocated slabs on every node before allocating new memory.
>
> If we cannot satisfy the allocation from any per node queue then we extend
> a slab. We now call into the page allocator without specifying
> GFP_THISNODE. The page allocator will then implement its own fallback (in
> the given cpuset context), perform necessary reclaim (again considering
> not a single node but the whole set of allowed nodes) and then return
> pages for a new slab.
>

So, why have __GFP_THISNODE in slabs at all?

> We identify from which node the pages were allocated and then insert the
> pages into the corresponding per node structure. In order to do so we need
> to modify cache_grow() to take a parameter that specifies the new slab.
> kmem_getpages() can no longer set the GFP_THISNODE flag since we need to
> be able to use kmem_getpage to allocate from an arbitrary node.
> GFP_THISNODE needs to be specified when calling cache_grow().
>

ok.

> One key advantage is that the decision from which node to allocate new
> memory is removed from slab fallback processing. The patch allows to go
> back to use of the page allocators fallback/reclaim logic.
>
> Signed-off-by: Christoph Lameter
>
> Index: linux-2.6.19-rc6-mm2/mm/slab.c
> ===================================================================
> --- linux-2.6.19-rc6-mm2.orig/mm/slab.c	2006-11-29 18:40:15.650296908 -0600
> +++ linux-2.6.19-rc6-mm2/mm/slab.c	2006-11-29 18:40:37.767454017 -0600
> @@ -1610,12 +1610,7 @@ static void *kmem_getpages(struct kmem_c
>  	flags |= __GFP_COMP;
>  #endif
>
> -	/*
> -	 * Under NUMA we want memory on the indicated node. We will handle
> -	 * the needed fallback ourselves since we want to serve from our
> -	 * per node object lists first for other nodes.
> -	 */
> -	flags |= cachep->gfpflags | GFP_THISNODE;
> +	flags |= cachep->gfpflags;
>
>  	page = alloc_pages_node(nodeid, flags, cachep->gfporder);
>  	if (!page)
> @@ -2569,7 +2564,7 @@ static struct slab *alloc_slabmgmt(struc
>  	if (OFF_SLAB(cachep)) {
>  		/* Slab management obj is off-slab. */
>  		slabp = kmem_cache_alloc_node(cachep->slabp_cache,
> -					      local_flags, nodeid);
> +					      local_flags & ~GFP_THISNODE, nodeid);

This also removes the __GFP_NOWARN and __GFP_NORETRY flags. Is that
intended or did you mean ~__GFP_THISNODE?
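To be concrete about what I mean: as I read 2.6.19-rc6-mm2 with your
GFP_THISNODE patch applied, GFP_THISNODE is a composite mask rather than
a single bit, so the two maskings are not equivalent. A sketch, assuming
the flag layout below is what actually landed in -mm (please correct me
if your tree differs):

	#include <linux/gfp.h>

	/*
	 * Assumed from the GFP_THISNODE patch:
	 *   #define GFP_THISNODE \
	 *       (__GFP_THISNODE | __GFP_NOWARN | __GFP_NORETRY)
	 */
	static void gfp_mask_demo(void)	/* illustrative only */
	{
		gfp_t flags, a, b;

		flags = GFP_KERNEL | __GFP_THISNODE |
			__GFP_NOWARN | __GFP_NORETRY;

		/* Composite mask clears the node restriction *and* the
		 * NOWARN/NORETRY bits: a == GFP_KERNEL. */
		a = flags & ~GFP_THISNODE;

		/* Single bit clears only the node restriction:
		 * b == GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY. */
		b = flags & ~__GFP_THISNODE;
	}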
>  		if (!slabp)
>  			return NULL;
>  	} else {
> @@ -2712,10 +2707,10 @@ static void slab_map_pages(struct kmem_c
>   * Grow (by 1) the number of slabs within a cache. This is called by
>   * kmem_cache_alloc() when there are no active objs left in a cache.
>   */
> -static int cache_grow(struct kmem_cache *cachep, gfp_t flags, int nodeid)
> +static int cache_grow(struct kmem_cache *cachep,
> +		gfp_t flags, int nodeid, void *objp)
>  {
>  	struct slab *slabp;
> -	void *objp;
>  	size_t offset;
>  	gfp_t local_flags;
>  	unsigned long ctor_flags;
> @@ -2767,12 +2762,14 @@ static int cache_grow(struct kmem_cache
>  	 * Get mem for the objs. Attempt to allocate a physical page from
>  	 * 'nodeid'.
>  	 */
> -	objp = kmem_getpages(cachep, flags, nodeid);
> +	if (!objp)
> +		objp = kmem_getpages(cachep, flags, nodeid);
>  	if (!objp)
>  		goto failed;
>
>  	/* Get slab management. */
> -	slabp = alloc_slabmgmt(cachep, objp, offset, local_flags, nodeid);
> +	slabp = alloc_slabmgmt(cachep, objp, offset,
> +			local_flags & ~GFP_THISNODE, nodeid);

Same comment about GFP_THISNODE vs __GFP_THISNODE.

>  	if (!slabp)
>  		goto opps1;
>
> @@ -3010,7 +3007,7 @@ alloc_done:
>
>  	if (unlikely(!ac->avail)) {
>  		int x;
> -		x = cache_grow(cachep, flags, node);
> +		x = cache_grow(cachep, flags | GFP_THISNODE, node, NULL);
>

Ok, so we first try to stick to the current node and there is no
fallback, reclaim or policy enforcement. As a side-effect (I think,
slab.c boggles the mind and I'm not as familiar with it as I should be),
callers of kmem_cache_alloc() now imply
__GFP_THISNODE | __GFP_NOWARN | __GFP_NORETRY. Again, just checking,
is this intentional?

>  		/* cache_grow can reenable interrupts, then ac could change. */
>  		ac = cpu_cache_get(cachep);
> @@ -3246,9 +3243,11 @@ static void *alternate_node_alloc(struct
>
>  /*
>   * Fallback function if there was no memory available and no objects on a
> - * certain node and we are allowed to fall back. We mimick the behavior of
> - * the page allocator. We fall back according to a zonelist determined by
> - * the policy layer while obeying cpuset constraints.
> + * certain node and fall back is permitted. First we scan all the
> + * available nodelists for available objects. If that fails then we
> + * perform an allocation without specifying a node. This allows the page
> + * allocator to do its reclaim / fallback magic. We then insert the
> + * slab into the proper nodelist and then allocate from it.
>   */
>  void *fallback_alloc(struct kmem_cache *cache, gfp_t flags)
>  {
> @@ -3256,15 +3255,51 @@ void *fallback_alloc(struct kmem_cache *
>  			->node_zonelists[gfp_zone(flags)];
>  	struct zone **z;
>  	void *obj = NULL;
> +	int nid;
>
> +retry:
> +	/*
> +	 * Look through allowed nodes for objects available
> +	 * from existing per node queues.
> +	 */
>  	for (z = zonelist->zones; *z && !obj; z++) {
> -		int nid = zone_to_nid(*z);
> +		nid = zone_to_nid(*z);
> +
> +		if (cpuset_zone_allowed(*z, flags) &&
> +			cache->nodelists[nid] &&
> +			cache->nodelists[nid]->free_objects)
> +				obj = ____cache_alloc_node(cache,
> +					flags | GFP_THISNODE, nid);
> +	}

Would we not get similar behavior if you just didn't specify
GFP_THISNODE? Again, GFP_THISNODE vs __GFP_THISNODE, intentional?
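As an aside, the part of the hunk below that makes the whole scheme work
is asking the pages themselves which node they landed on, once the page
allocator has been allowed to roam. A minimal sketch of that idiom (the
helper name is mine, not the patch's):

	#include <linux/mm.h>

	/*
	 * Map a lowmem virtual address back to the node its backing
	 * pages were allocated on: virt_to_page() finds the struct
	 * page, page_to_nid() reads the node id stored in page->flags.
	 */
	static int slab_home_node(void *objp)
	{
		return page_to_nid(virt_to_page(objp));
	}

That way the new slab can be threaded onto the right per-node list no
matter which node actually satisfied the allocation.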
>
> -		if (zone_idx(*z) <= ZONE_NORMAL &&
> -			cpuset_zone_allowed(*z, flags) &&
> -			cache->nodelists[nid])
> -			obj = ____cache_alloc_node(cache,
> -				flags | __GFP_THISNODE, nid);
> +	if (!obj) {
> +		/*
> +		 * This allocation will be performed within the constraints
> +		 * of the current cpuset / memory policy requirements.
> +		 * We may trigger various forms of reclaim on the allowed
> +		 * set and go into memory reserves if necessary.
> +		 */
> +		obj = kmem_getpages(cache, flags, -1);
> +		if (obj) {
> +			/*
> +			 * Insert into the appropriate per node queues
> +			 */
> +			nid = page_to_nid(virt_to_page(obj));
> +			if (cache_grow(cache, flags, nid, obj)) {
> +				obj = ____cache_alloc_node(cache,
> +					flags | GFP_THISNODE, nid);
> +				if (!obj)
> +					/*
> +					 * Another processor may allocate the
> +					 * objects in the slab since we are
> +					 * not holding any locks.
> +					 */
> +					goto retry;
> +			} else {
> +				kmem_freepages(cache, obj);
> +				obj = NULL;
> +			}
> +		}
> +	}
>  	return obj;
>  }
> @@ -3321,7 +3356,7 @@ retry:
>
>  must_grow:
>  	spin_unlock(&l3->list_lock);
> -	x = cache_grow(cachep, flags, nodeid);
> +	x = cache_grow(cachep, flags | GFP_THISNODE, nodeid, NULL);
>  	if (x)

The obvious callers of ____cache_alloc_node() seem to specify
GFP_THISNODE now, so when does fallback_alloc() ever get called, given
that the call site looks like this?

	if (!(flags & __GFP_THISNODE))
		/*
		 * Unable to grow the cache. Fall back to other
		 * nodes.
		 */
		return fallback_alloc(cachep, flags);

>  		goto retry;

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab