From: Andrew Morton <akpm@osdl.org>
To: Christoph Lameter <clameter@sgi.com>
Cc: linux-mm@kvack.org, Paul Jackson <pj@sgi.com>
Subject: Re: [PATCH] GFP_THISNODE for the slab allocator
Date: Thu, 14 Sep 2006 22:00:11 -0700
Message-ID: <20060914220011.2be9100a.akpm@osdl.org>
In-Reply-To: <Pine.LNX.4.64.0609131649110.20799@schroedinger.engr.sgi.com>
On Wed, 13 Sep 2006 16:50:41 -0700 (PDT)
Christoph Lameter <clameter@sgi.com> wrote:
> This patch ensures that the slab node lists in the NUMA case only contain
> slabs that belong to that specific node. All slab allocations use
> GFP_THISNODE when calling into the page allocator. If an allocation fails
> then we fall back in the slab allocator according to the zonelists
> appropriate for a certain context.
>
> This allows the behavior of alloc_pages and alloc_pages_node to be
> replicated in the slab layer.
>
> Currently allocations requested from the page allocator may be redirected
> via cpusets to other nodes. This results in remote pages on nodelists and
> that in turn results in interrupt latency issues during cache draining.
> Plus the slab is handing out memory as local when it is really remote.
>
> Fallback for slab memory allocations will occur within the slab
> allocator and not in the page allocator. This is necessary in order
> to be able to use the existing pools of objects on the nodes that
> we fall back to before adding more pages to a slab.
>
> The fallback function ensures that the nodes we fall back to obey
> cpuset restrictions of the current context. We do not allocate
> objects from outside of the current cpuset context like before.
>
> Note that the implementation of locality constraints within the slab
> allocator requires importing logic from the page allocator. This is a
> mishmash that is not that great. Other allocators (uncached allocator,
> vmalloc, huge pages) face similar problems and have similar minimal
> reimplementations of the basic fallback logic of the page allocator.
> There is another way of implementing a slab by avoiding per node lists
> (see modular slab) but this won't work within the existing slab.
>
> Signed-off-by: Christoph Lameter <clameter@sgi.com>
>
> Index: linux-2.6.18-rc6-mm2/mm/slab.c
> ===================================================================
> --- linux-2.6.18-rc6-mm2.orig/mm/slab.c 2006-09-13 18:04:57.000000000 -0500
> +++ linux-2.6.18-rc6-mm2/mm/slab.c 2006-09-13 18:20:41.356901622 -0500
> @@ -1566,6 +1566,14 @@ static void *kmem_getpages(struct kmem_c
> */
> flags |= __GFP_COMP;
> #endif
> +#ifdef CONFIG_NUMA
> + /*
> + * Under NUMA we want memory on the indicated node. We will handle
> + * the needed fallback ourselves since we want to serve from our
> + * per node object lists first for other nodes.
> + */
> + flags |= GFP_THISNODE;
> +#endif

hm.  GFP_THISNODE is dangerous: it is a composite which drags in
__GFP_NOWARN | __GFP_NORETRY along with __GFP_THISNODE.  For example, its
use in kernel/profile.c:create_hash_tables() has gone and caused non-NUMA
machines to use __GFP_NOWARN | __GFP_NORETRY in this situation.

OK, that's relatively harmless here, but why on earth did non-NUMA
machines want to make this change?

Would it not be saner to do away with the dangerous GFP_THISNODE and then
open-code __GFP_THISNODE in those places which want that behaviour?

And to then make non-NUMA __GFP_THISNODE equal literal zero, so we can
remove the above ifdefs?

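A rough user-space sketch of that suggestion follows.  Every name carrying
an SK_ or sk_ prefix is made up for illustration, the bit values are
arbitrary, and CONFIG_NUMA_SKETCH merely stands in for CONFIG_NUMA; none of
this is the kernel's actual gfp.h.

```c
/* Illustrative stand-ins only; these are NOT the kernel's definitions. */
typedef unsigned int gfp_t;

#ifndef CONFIG_NUMA_SKETCH
#define CONFIG_NUMA_SKETCH 0	/* build with -DCONFIG_NUMA_SKETCH=1 to model NUMA */
#endif

#define SK_GFP_WAIT	0x1u	/* arbitrary bit, stands for "some caller flag" */

#if CONFIG_NUMA_SKETCH
#define SK_GFP_THISNODE	0x4u	/* a real bit on NUMA builds */
#else
#define SK_GFP_THISNODE	0u	/* literal zero: OR-ing it in is a no-op */
#endif

/*
 * Models the flag setup in kmem_getpages().  The node-local bit is
 * open-coded with no #ifdef around it, and no NOWARN/NORETRY bits are
 * pulled in behind the caller's back.
 */
static gfp_t sk_getpages_flags(gfp_t flags)
{
	return flags | SK_GFP_THISNODE;
}
```

On the non-NUMA setting the function hands back its argument unchanged, so
the compiler can drop the OR entirely.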
> flags |= cachep->gfpflags;
>
> page = alloc_pages_node(nodeid, flags, cachep->gfporder);
> @@ -3085,6 +3093,15 @@ static __always_inline void *__cache_all
>
> objp = ____cache_alloc(cachep, flags);
> out:
> +
> +#ifdef CONFIG_NUMA
> + /*
> + * We may just have run out of memory on the local know.
> + * __cache_alloc_node knows how to locate memory on other nodes
> + */
> + if (!objp)
> + objp = __cache_alloc_node(cachep, flags, numa_node_id());
> +#endif

What happened to my `#define NUMA_BUILD 0 or 1' proposal?  If we had that,
the above could be

	if (NUMA_BUILD && !objp)
		objp = ...

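Spelled out as a stand-alone sketch (user-space C; sk_node_fallback() and
sk_cache_alloc_tail() are hypothetical stand-ins for __cache_alloc_node()
and the tail of __cache_alloc(), and the NUMA_BUILD definition shown is only
one plausible way of writing it):

```c
#include <stddef.h>

/* A compile-time constant instead of scattering #ifdef CONFIG_NUMA around. */
#ifdef CONFIG_NUMA
#define NUMA_BUILD 1
#else
#define NUMA_BUILD 0
#endif

static int sk_remote_object;	/* pretend object living on another node */

/* Hypothetical stand-in for __cache_alloc_node(): always finds remote memory. */
static void *sk_node_fallback(void)
{
	return &sk_remote_object;
}

/*
 * Tail of the allocation path.  With NUMA_BUILD == 0 the condition is
 * constant-false, so the compiler can discard the branch, yet the code
 * inside it is still parsed and type-checked on every build.
 */
static void *sk_cache_alloc_tail(void *objp)
{
	if (NUMA_BUILD && !objp)
		objp = sk_node_fallback();
	return objp;
}
```

The gain over the #ifdef form is that a non-NUMA build still compiles the
fallback call, so it cannot silently rot.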
> /*
> + * Fallback function if there was no memory available and no objects on a
> + * certain node and we are allowed to fall back. We mimick the behavior of
> + * the page allocator. We fall back according to a zonelist determined by
> + * the policy layer while obeying cpuset constraints.
> + */
> +void *fallback_alloc(struct kmem_cache *cache, gfp_t flags)
> +{
> + struct zonelist *zonelist = &NODE_DATA(slab_node(current->mempolicy))
> + ->node_zonelists[gfp_zone(flags)];
> + struct zone **z;
> + void *obj = NULL;
> +
> + for (z = zonelist->zones; *z && !obj; z++)
> + if (zone_idx(*z) <= ZONE_NORMAL &&
> + cpuset_zone_allowed(*z, flags))
> + obj = __cache_alloc_node(cache,
> + flags | __GFP_THISNODE,
> + zone_to_nid(*z));
> + return obj;
> +}

hm, there's cpuset_zone_allowed() again.

I have a feeling that we need to nuke that thing: take a 128-node machine,
create a cpuset which has 64 memnodes, consume all the memory in 60 of
them, do some heavy page allocation, then stick a thermometer into
get_page_from_freelist()?
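
To put a number on that scenario, here is a toy model of the zonelist walk.
It assumes one zone per node, a zonelist ordered node 0..127, the cpuset
covering nodes 0..63 and the 60 exhausted nodes being the first 60 of them.
All of those are assumptions made for the sake of the example, and the
struct and function names are invented; this is not
get_page_from_freelist() itself.

```c
#include <stdbool.h>
#include <stddef.h>

#define SK_NR_NODES 128

/* One entry per node: is it in the cpuset, and does it have free pages? */
struct sk_zone {
	bool allowed;
	bool has_free;
};

/* Count how many zonelist entries one allocation looks at before it succeeds. */
static size_t sk_zones_examined(const struct sk_zone *zl, size_t n)
{
	size_t examined = 0;

	for (size_t i = 0; i < n; i++) {
		examined++;	/* one cpuset check and/or free-memory check */
		if (zl[i].allowed && zl[i].has_free)
			break;
	}
	return examined;
}

/* The scenario from the mail under the ordering assumptions stated above. */
static size_t sk_thermometer_scenario(void)
{
	struct sk_zone zl[SK_NR_NODES];

	for (size_t i = 0; i < SK_NR_NODES; i++) {
		zl[i].allowed = i < 64;		/* cpuset has 64 memnodes */
		zl[i].has_free = i >= 60;	/* first 60 are exhausted */
	}
	return sk_zones_examined(zl, SK_NR_NODES);
}
```

Under those assumptions every single allocation examines 61 zones before it
finds a page, and it repeats the same fruitless walk on the next call.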