* [PATCH] GFP_THISNODE for the slab allocator
@ 2006-09-13 23:50 Christoph Lameter
2006-09-15 5:00 ` Andrew Morton
0 siblings, 1 reply; 82+ messages in thread
From: Christoph Lameter @ 2006-09-13 23:50 UTC (permalink / raw)
To: akpm; +Cc: linux-mm
This patch ensures that the slab node lists in the NUMA case only contain
slabs that belong to that specific node. All slab allocations use
GFP_THISNODE when calling into the page allocator. If an allocation fails
then we fall back in the slab allocator according to the zonelists
appropriate for a certain context.
This replicates the behavior of alloc_pages and alloc_pages_node in the
slab layer.
Currently allocations requested from the page allocator may be redirected
via cpusets to other nodes. This results in remote pages on nodelists and
that in turn results in interrupt latency issues during cache draining.
Plus the slab is handing out memory as local when it is really remote.
Fallback for slab memory allocations will occur within the slab
allocator and not in the page allocator. This is necessary in order
to be able to use the existing pools of objects on the nodes that
we fall back to before adding more pages to a slab.
The fallback function ensures that the nodes we fall back to obey
cpuset restrictions of the current context. We do not allocate
objects from outside of the current cpuset context like before.
Note that the implementation of locality constraints within the slab
allocator requires importing logic from the page allocator. This is a
mishmash that is not that great. Other allocators (uncached allocator,
vmalloc, huge pages) face similar problems and have similar minimal
reimplementations of the basic fallback logic of the page allocator.
There is another way of implementing a slab by avoiding per node lists
(see modular slab) but this won't work within the existing slab.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Index: linux-2.6.18-rc6-mm2/mm/slab.c
===================================================================
--- linux-2.6.18-rc6-mm2.orig/mm/slab.c 2006-09-13 18:04:57.000000000 -0500
+++ linux-2.6.18-rc6-mm2/mm/slab.c 2006-09-13 18:20:41.356901622 -0500
@@ -1566,6 +1566,14 @@ static void *kmem_getpages(struct kmem_c
*/
flags |= __GFP_COMP;
#endif
+#ifdef CONFIG_NUMA
+ /*
+ * Under NUMA we want memory on the indicated node. We will handle
+ * the needed fallback ourselves since we want to serve from our
+ * per node object lists first for other nodes.
+ */
+ flags |= GFP_THISNODE;
+#endif
flags |= cachep->gfpflags;
page = alloc_pages_node(nodeid, flags, cachep->gfporder);
@@ -3085,6 +3093,15 @@ static __always_inline void *__cache_all
objp = ____cache_alloc(cachep, flags);
out:
+
+#ifdef CONFIG_NUMA
+ /*
+ * We may just have run out of memory on the local node.
+ * __cache_alloc_node knows how to locate memory on other nodes.
+ */
+ if (!objp)
+ objp = __cache_alloc_node(cachep, flags, numa_node_id());
+#endif
local_irq_restore(save_flags);
objp = cache_alloc_debugcheck_after(cachep, flags, objp,
caller);
@@ -3103,7 +3120,7 @@ static void *alternate_node_alloc(struct
{
int nid_alloc, nid_here;
- if (in_interrupt())
+ if (in_interrupt() || (flags & __GFP_THISNODE))
return NULL;
nid_alloc = nid_here = numa_node_id();
if (cpuset_do_slab_mem_spread() && (cachep->flags & SLAB_MEM_SPREAD))
@@ -3116,6 +3133,28 @@ static void *alternate_node_alloc(struct
}
/*
+ * Fallback function if there was no memory available and no objects on a
+ * certain node and we are allowed to fall back. We mimic the behavior of
+ * the page allocator. We fall back according to a zonelist determined by
+ * the policy layer while obeying cpuset constraints.
+ */
+void *fallback_alloc(struct kmem_cache *cache, gfp_t flags)
+{
+ struct zonelist *zonelist = &NODE_DATA(slab_node(current->mempolicy))
+ ->node_zonelists[gfp_zone(flags)];
+ struct zone **z;
+ void *obj = NULL;
+
+ for (z = zonelist->zones; *z && !obj; z++)
+ if (zone_idx(*z) <= ZONE_NORMAL &&
+ cpuset_zone_allowed(*z, flags))
+ obj = __cache_alloc_node(cache,
+ flags | __GFP_THISNODE,
+ zone_to_nid(*z));
+ return obj;
+}
+
+/*
* An interface to enable slab creation on nodeid
*/
static void *__cache_alloc_node(struct kmem_cache *cachep, gfp_t flags,
@@ -3168,11 +3207,15 @@ retry:
must_grow:
spin_unlock(&l3->list_lock);
x = cache_grow(cachep, flags, nodeid);
+ if (x)
+ goto retry;
- if (!x)
- return NULL;
+ if (!(flags & __GFP_THISNODE))
+ /* Unable to grow the cache. Fall back to other nodes. */
+ return fallback_alloc(cachep, flags);
+
+ return NULL;
- goto retry;
done:
return obj;
}
* Re: [PATCH] GFP_THISNODE for the slab allocator
2006-09-13 23:50 [PATCH] GFP_THISNODE for the slab allocator Christoph Lameter
@ 2006-09-15 5:00 ` Andrew Morton
2006-09-15 6:49 ` Paul Jackson
` (4 more replies)
0 siblings, 5 replies; 82+ messages in thread
From: Andrew Morton @ 2006-09-15 5:00 UTC (permalink / raw)
To: Christoph Lameter; +Cc: linux-mm, Paul Jackson
On Wed, 13 Sep 2006 16:50:41 -0700 (PDT)
Christoph Lameter <clameter@sgi.com> wrote:
> This patch ensures that the slab node lists in the NUMA case only contain
> slabs that belong to that specific node. All slab allocations use
> GFP_THISNODE when calling into the page allocator. If an allocation fails
> then we fall back in the slab allocator according to the zonelists
> appropriate for a certain context.
>
> This replicates the behavior of alloc_pages and alloc_pages_node in the
> slab layer.
>
> Currently allocations requested from the page allocator may be redirected
> via cpusets to other nodes. This results in remote pages on nodelists and
> that in turn results in interrupt latency issues during cache draining.
> Plus the slab is handing out memory as local when it is really remote.
>
> Fallback for slab memory allocations will occur within the slab
> allocator and not in the page allocator. This is necessary in order
> to be able to use the existing pools of objects on the nodes that
> we fall back to before adding more pages to a slab.
>
> The fallback function ensures that the nodes we fall back to obey
> cpuset restrictions of the current context. We do not allocate
> objects from outside of the current cpuset context like before.
>
> Note that the implementation of locality constraints within the slab
> allocator requires importing logic from the page allocator. This is a
> mishmash that is not that great. Other allocators (uncached allocator,
> vmalloc, huge pages) face similar problems and have similar minimal
> reimplementations of the basic fallback logic of the page allocator.
> There is another way of implementing a slab by avoiding per node lists
> (see modular slab) but this won't work within the existing slab.
>
> Signed-off-by: Christoph Lameter <clameter@sgi.com>
>
> Index: linux-2.6.18-rc6-mm2/mm/slab.c
> ===================================================================
> --- linux-2.6.18-rc6-mm2.orig/mm/slab.c 2006-09-13 18:04:57.000000000 -0500
> +++ linux-2.6.18-rc6-mm2/mm/slab.c 2006-09-13 18:20:41.356901622 -0500
> @@ -1566,6 +1566,14 @@ static void *kmem_getpages(struct kmem_c
> */
> flags |= __GFP_COMP;
> #endif
> +#ifdef CONFIG_NUMA
> + /*
> + * Under NUMA we want memory on the indicated node. We will handle
> + * the needed fallback ourselves since we want to serve from our
> + * per node object lists first for other nodes.
> + */
> + flags |= GFP_THISNODE;
> +#endif
hm. GFP_THISNODE is dangerous. For example, its use in
kernel/profile.c:create_hash_tables() has gone and caused non-NUMA machines
to use __GFP_NOWARN | __GFP_NORETRY in this situation.
OK, that's relatively harmless here, but why on earth did non-NUMA
machines want to make this change?
Would it not be saner to do away with the dangerous GFP_THISNODE and then
open-code __GFP_THISNODE in those places which want that behaviour?
And to then make non-NUMA __GFP_THISNODE equal literal zero, so we can
remove the above ifdefs?
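For reference, the definition being objected to is a composite flag; it
appears in full in the gfp.h patch later in this thread:

	#define GFP_THISNODE	(__GFP_THISNODE | __GFP_NOWARN | __GFP_NORETRY)

__GFP_THISNODE itself is a single page allocator flag bit; it is the
composite GFP_THISNODE that drags __GFP_NOWARN and __GFP_NORETRY along on
non-NUMA builds.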
> flags |= cachep->gfpflags;
>
> page = alloc_pages_node(nodeid, flags, cachep->gfporder);
> @@ -3085,6 +3093,15 @@ static __always_inline void *__cache_all
>
> objp = ____cache_alloc(cachep, flags);
> out:
> +
> +#ifdef CONFIG_NUMA
> + /*
> + * We may just have run out of memory on the local node.
> + * __cache_alloc_node knows how to locate memory on other nodes.
> + */
> + if (!objp)
> + objp = __cache_alloc_node(cachep, flags, numa_node_id());
> +#endif
What happened to my `#define NUMA_BUILD 0 or 1' proposal? If we had that,
the above could be
if (NUMA_BUILD && !objp)
objp = ...
> /*
> + * Fallback function if there was no memory available and no objects on a
> + * certain node and we are allowed to fall back. We mimic the behavior of
> + * the page allocator. We fall back according to a zonelist determined by
> + * the policy layer while obeying cpuset constraints.
> + */
> +void *fallback_alloc(struct kmem_cache *cache, gfp_t flags)
> +{
> + struct zonelist *zonelist = &NODE_DATA(slab_node(current->mempolicy))
> + ->node_zonelists[gfp_zone(flags)];
> + struct zone **z;
> + void *obj = NULL;
> +
> + for (z = zonelist->zones; *z && !obj; z++)
> + if (zone_idx(*z) <= ZONE_NORMAL &&
> + cpuset_zone_allowed(*z, flags))
> + obj = __cache_alloc_node(cache,
> + flags | __GFP_THISNODE,
> + zone_to_nid(*z));
> + return obj;
> +}
hm, there's cpuset_zone_allowed() again.
I have a feeling that we need to nuke that thing: take a 128-node machine,
create a cpuset which has 64 memnodes, consume all the memory in 60 of
them, do some heavy page allocation, then stick a thermometer into
get_page_from_freelist()?
* Re: [PATCH] GFP_THISNODE for the slab allocator
2006-09-15 5:00 ` Andrew Morton
@ 2006-09-15 6:49 ` Paul Jackson
2006-09-15 7:23 ` Andrew Morton
2006-09-15 17:08 ` Christoph Lameter
` (3 subsequent siblings)
4 siblings, 1 reply; 82+ messages in thread
From: Paul Jackson @ 2006-09-15 6:49 UTC (permalink / raw)
To: Andrew Morton; +Cc: clameter, linux-mm
Andrew wrote:
> hm, there's cpuset_zone_allowed() again.
>
> I have a feeling that we need to nuke that thing: take a 128-node machine,
> create a cpuset which has 64 memnodes, consume all the memory in 60 of
> them, do some heavy page allocation, then stick a thermometer into
> get_page_from_freelist()?
Hmmm ... are you worried that if get_page_from_freelist() has to scan
many nodes before it finds memory, that it will end up spending more
CPU cycles than we'd like calling cpuset_zone_allowed()?
The essential thing that cpuset_zone_allowed() does, in the most common
case, is to determine if a zone is on one of the nodes the task is allowed
to use.
The get_page_from_freelist() and cpuset_zone_allowed() code is optimized
for the case that memory is usually found in the first few zones in the
zonelist.
Here's the relevant portion of the get_page_from_freelist() code, as it
stands now:
============================================================
static struct page *
get_page_from_freelist(gfp_t gfp_mask, unsigned int order,
struct zonelist *zonelist, int alloc_flags)
{
struct zone **z = zonelist->zones;
...
do {
if ((alloc_flags & ALLOC_CPUSET) &&
!cpuset_zone_allowed(*z, gfp_mask))
continue;
... if zone z has free pages, use them ...
} while (*(++z) != NULL);
============================================================
For the purposes of discussion here, let me open code the hot code
path down into cpuset_zone_allowed(), so we can see more what's
happening here. Here's the open coded rewrite:
============================================================
static struct page *
get_page_from_freelist(gfp_t gfp_mask, unsigned int order,
struct zonelist *zonelist, int alloc_flags)
{
struct zone **z = zonelist->zones;
...
int do_cpuset_check = !in_interrupt() && alloc_flags & ALLOC_CPUSET;
do {
int node = (*z)->zone_pgdat->node_id;
if (do_cpuset_check &&
!node_isset(node, current->mems_allowed) &&
!cpuset_zone_allowed_slow_path_check())
continue;
... if zone z has free pages, use them ...
} while (*(++z) != NULL);
============================================================
With this open coding, we can see what cpuset_zone_allowed() is doing
here. The key thing it must do each loop (each zone z) is to ask if
that zone's node is set in current->mems_allowed.
My hypothetical routine 'cpuset_zone_allowed_slow_path_check()'
contains the infrequently executed code path. Usually, either we are
not doing the cpuset check (because we are in interrupt), or we are
checking and the check passes because the 'node' is allowed in
current->mems_allowed.
This code is optimized for the case that we find memory in a node
fairly near the front of the zonelist. If we have to go scavenging
down a long list of zones before we find a node with free memory, then
yes, we are sucking wind calling cpuset_zone_allowed(), or my
hypothetical cpuset_zone_allowed_slow_path_check(), many times.
I guess that was your concern.
I don't think we should be tuning especially hard for that case.
On a big honking NUMA box, if we have to go scavenging for memory
dozens or hundreds of nodes removed from the scene of the memory fault,
then **even if we found that precious free page of memory instantly**
(in zero cost CPU cycles in the above code) we're -still- screwed.
Well, the user of that machine is still screwed. They have overloaded
its memory, forcing poor NUMA placement. It's obviously not as bad as
swap hell, but it's not good either. There is nothing that the above
code can do to make the "Non-Uniform" part of "NUMA" magically
disappear. Recall that these zonelists are sorted by distance from
the starting node; so the further down the list we go, the slower the
memory we get, relative to the task's current CPU.
We shouldn't be heavily tuning for this case, and I am not aware of any
real-world situation where real users, fully realizing what was going on,
would reasonably have determined otherwise.
By 'not heavily tuning', I mean we should be more interested in minimizing
kernel text size and cache footprint here than in optimizing CPU cycles
for the case of having to frequently scan a long way down a long zonelist.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
* Re: [PATCH] GFP_THISNODE for the slab allocator
2006-09-15 6:49 ` Paul Jackson
@ 2006-09-15 7:23 ` Andrew Morton
2006-09-15 7:44 ` Paul Jackson
` (2 more replies)
0 siblings, 3 replies; 82+ messages in thread
From: Andrew Morton @ 2006-09-15 7:23 UTC (permalink / raw)
To: Paul Jackson; +Cc: clameter, linux-mm, David Rientjes
On Thu, 14 Sep 2006 23:49:26 -0700
Paul Jackson <pj@sgi.com> wrote:
> Andrew wrote:
> > hm, there's cpuset_zone_allowed() again.
> >
> > I have a feeling that we need to nuke that thing: take a 128-node machine,
> > create a cpuset which has 64 memnodes, consume all the memory in 60 of
> > them, do some heavy page allocation, then stick a thermometer into
> > get_page_from_freelist()?
>
> Hmmm ... are you worried that if get_page_from_freelist() has to scan
> many nodes before it finds memory, that it will end up spending more
> CPU cycles than we'd like calling cpuset_zone_allowed()?
>
> The essential thing that cpuset_zone_allowed() does, in the most common
> case, is to determine if a zone is on one of the nodes the task is allowed
> to use.
>
> The get_page_from_freelist() and cpuset_zone_allowed() code is optimized
> for the case that memory is usually found in the first few zones in the
> zonelist.
>
> Here's the relevant portion of the get_page_from_freelist() code, as it
> stands now:
>
>
> ============================================================
> static struct page *
> get_page_from_freelist(gfp_t gfp_mask, unsigned int order,
> struct zonelist *zonelist, int alloc_flags)
> {
> struct zone **z = zonelist->zones;
> ...
> do {
> if ((alloc_flags & ALLOC_CPUSET) &&
> !cpuset_zone_allowed(*z, gfp_mask))
> continue;
> ... if zone z has free pages, use them ...
> } while (*(++z) != NULL);
> ============================================================
>
>
> For the purposes of discussion here, let me open code the hot code
> path down into cpuset_zone_allowed(), so we can see more what's
> happening here. Here's the open coded rewrite:
>
>
> ============================================================
> static struct page *
> get_page_from_freelist(gfp_t gfp_mask, unsigned int order,
> struct zonelist *zonelist, int alloc_flags)
> {
> struct zone **z = zonelist->zones;
> ...
> int do_cpuset_check = !in_interrupt() && alloc_flags & ALLOC_CPUSET;
>
> do {
> int node = (*z)->zone_pgdat->node_id;
> if (do_cpuset_check &&
> !node_isset(node, current->mems_allowed) &&
> !cpuset_zone_allowed_slow_path_check())
> continue;
> ... if zone z has free pages, use them ...
> } while (*(++z) != NULL);
> ============================================================
>
>
> With this open coding, we can see what cpuset_zone_allowed() is doing
> here. The key thing it must do each loop (each zone z) is to ask if
> that zone's node is set in current->mems_allowed.
>
> My hypothetical routine 'cpuset_zone_allowed_slow_path_check()'
> contains the infrequently executed code path. Usually, either we are
> not doing the cpuset check (because we are in interrupt), or we are
> checking and the check passes because the 'node' is allowed in
> current->mems_allowed.
>
> This code is optimized for the case that we find memory in a node
> fairly near the front of the zonelist. If we have to go scavenging
> down a long list of zones before we find a node with free memory, then
> yes, we are sucking wind calling cpuset_zone_allowed(), or my
> hypothetical cpuset_zone_allowed_slow_path_check(), many times.
>
> I guess that was your concern.
You got it.
> I don't think we should be tuning especially hard for that case.
Well some bright spark went and had the idea of using cpusets and fake numa
nodes as a means of memory partitioning, didn't he?
David (cc'ed here) did some testing for me. A fake-64-node machine with
60/64ths of its memory in a 60-node "container". We filled up 40-odd of
those nodes with malloc+memset+sleep and then ran a kernel build in the
remainder.
System time went through the roof. We still need to get the profile, but
I'll eat my hat if the cause isn't get_page_from_freelist() waddling across
60-odd zones for each page allocation.
> On a big honking NUMA box, if we have to go scavenging for memory
> dozens or hundreds of nodes removed from the scene of the memory fault,
> then **even if we found that precious free page of memory instantly**
> (in zero cost CPU cycles in the above code) we're -still- screwed.
>
> Well, the user of that machine is still screwed. They have overloaded
> its memory, forcing poor NUMA placement. It's obviously not as bad as
> swap hell, but it's not good either. There is nothing that the above
> code can do to make the "Non-Uniform" part of "NUMA" magically
> disappear. Recall that these zonelists are sorted by distance from
> the starting node; so the further down the list we go, the slower the
> memory we get, relative to the task's current CPU.
>
> We shouldn't be heavily tuning for this case, and I am not aware of any
> real-world situation where real users, fully realizing what was going on,
> would reasonably have determined otherwise.
gotcha ;)
> By 'not heavily tuning', I mean we should be more interested in minimizing
> kernel text size and cache footprint here than in optimizing CPU cycles
> for the case of having to frequently scan a long way down a long zonelist.
>
There are two problems:
a) the linear search across nodes which are not in the cpuset
b) the linear search across nodes which _are_ in the cpuset, but which
are used up.
I'm thinking a) is easily solved by adding an array of the zones inside the
`struct cpuset', and changing get_page_from_freelist() to look only at those
zones.
And b) can, I think, be solved by caching the most-recently-allocated-from
zone* inside the cpuset as well. This might alter page allocation
behaviour a bit. And we'd need to do an exhaustive search at some point in
there.
The nasty part is locking that array of zones, and its length, and the
cached zone*. I guess it'd need to be RCUed.
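A minimal sketch of how (a) and (b) might hang together, with invented
names - nothing like this exists in the kernel at this point, and the
locking question is only noted here, not solved:

============================================================
/*
 * Hypothetical sketch only -- invented names, not existing kernel code.
 * (a) a pre-filtered array of the zones this cpuset may allocate from,
 *     rebuilt whenever mems_allowed changes;
 * (b) the zone we most recently allocated from, to be tried first.
 * The whole object would be published via RCU so the page allocator
 * never takes a cpuset mutex on the hot path.
 */
struct cpuset_zone_cache {
	struct zone *last_zone;		/* (b) most recently allocated from */
	int nr_zones;			/* number of entries in zones[] */
	struct zone *zones[];		/* (a) only zones allowed by mems_allowed */
};

/*
 * struct cpuset (kernel/cpuset.c) would then gain something like:
 *
 *	struct cpuset_zone_cache *zone_cache;	(replaced via RCU on update)
 */
============================================================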
* Re: [PATCH] GFP_THISNODE for the slab allocator
2006-09-15 7:23 ` Andrew Morton
@ 2006-09-15 7:44 ` Paul Jackson
2006-09-15 8:06 ` Andrew Morton
2006-09-15 8:28 ` Andrew Morton
2006-09-16 11:48 ` Paul Jackson
2 siblings, 1 reply; 82+ messages in thread
From: Paul Jackson @ 2006-09-15 7:44 UTC (permalink / raw)
To: Andrew Morton; +Cc: clameter, linux-mm, rientjes
Andrew wrote:
> Well some bright spark went and had the idea of using cpusets and fake numa
> nodes as a means of memory partitioning, didn't he?
If that bright spark is lurking here, perhaps he could educate
me a little. I mostly ignored the fake numa node stuff when it
went by, because I figured it was just an amusing novelty.
Perhaps it's time I learned why it is valuable. Can someone
explain it to me, and describe a bit the situations in which
it is useful. Seems like NUMA mechanisms are being (ab)used
for micro-partitioning memory.
As Andrew speculates, this could lead to reconsidering and
fancifying up some of the mechanisms, to cover a wider range
of situations efficiently.
Thanks.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
* Re: [PATCH] GFP_THISNODE for the slab allocator
2006-09-15 7:44 ` Paul Jackson
@ 2006-09-15 8:06 ` Andrew Morton
2006-09-15 15:53 ` David Rientjes
2006-09-15 23:03 ` David Rientjes
0 siblings, 2 replies; 82+ messages in thread
From: Andrew Morton @ 2006-09-15 8:06 UTC (permalink / raw)
To: Paul Jackson; +Cc: clameter, linux-mm, rientjes
On Fri, 15 Sep 2006 00:44:02 -0700
Paul Jackson <pj@sgi.com> wrote:
> Andrew wrote:
> > Well some bright spark went and had the idea of using cpusets and fake numa
> > nodes as a means of memory partitioning, didn't he?
>
> If that bright spark is lurking here, perhaps he could educate
> me a little. I mostly ignored the fake numa node stuff when it
> went by, because I figured it was just an amusing novelty.
numa=fake=N is an x86_64-only hack which Andi stuck in there in the early
days just for developer NUMA testing. Once Opterons became commodity it
bitrotted because there was no need for it.
Then it occurred to me (although apparently another
brightspark@somewhere.jp had the same idea earlier on) that if you can
slice a UMA machine into 128 or 256 little pieces and manage them using
cpusets, you have *all* the infrastructure you need to do crude but
effective machine partitioning.
David has fixed numa=fake (it was badly busted) and has been experimenting
with a 3GB machine sliced into 64 "nodes". So he can build containers
whose memory allocation is variable in 40-odd-megabyte hunks.
Testing looks promising: a group of processes in container A remains
constrained to its allocation - if it gets too fat it starts getting
reclaimed or swapped.
I _think_ it goes all the way up to getting oom-killed (David?). The
oom-killer appears to be doing the right thing - we don't want it to be
killing processes which aren't inside the offending container.
> Perhaps it's time I learned why it is valuable. Can someone
> explain it to me, and describe a bit the situations in which
> it is useful. Seems like NUMA mechanisms are being (ab)used
> for micro-partitioning memory.
yup. afaict the only problem which has been encountered with this is the
search complexity in the page allocator.
> As Andrew speculates, this could lead to reconsidering and
> fancifying up some of the mechanisms, to cover a wider range
> of situations efficiently.
Yes. Speeding up get_page_from_freelist() is less than totally trivial.
I've been putting off thinking about it until we're pretty sure that there
aren't any other showstoppers.
I'm (very) impressed at how well the infrastructure which you and Christoph
have put together has held up under this.
* Re: [PATCH] GFP_THISNODE for the slab allocator
2006-09-15 7:23 ` Andrew Morton
2006-09-15 7:44 ` Paul Jackson
@ 2006-09-15 8:28 ` Andrew Morton
2006-09-16 3:38 ` Paul Jackson
2006-09-16 11:48 ` Paul Jackson
2 siblings, 1 reply; 82+ messages in thread
From: Andrew Morton @ 2006-09-15 8:28 UTC (permalink / raw)
To: Paul Jackson, clameter, linux-mm, David Rientjes
On Fri, 15 Sep 2006 00:23:25 -0700
Andrew Morton <akpm@osdl.org> wrote:
>
> There are two problems:
>
> a) the linear search across nodes which are not in the cpuset
>
> b) the linear search across nodes which _are_ in the cpuset, but which
> are used up.
>
> I'm thinking a) is easily solved by adding an array of the zones inside the
> `struct cpuset', and change get_page_from_freelist() to only look at those
> zones.
>
> And b) can, I think, be solved by caching the most-recently-allocated-from
> zone* inside the cpuset as well. This might alter page allocation
> behaviour a bit. And we'd need to do an exhaustive search at some point in
> there.
err, if we cache the most-recently-allocated-from zone in the cpuset then
we don't need the array-of-zones, do we? We'll only need to do a zone
waddle when switching from one zone to the next, which is super-rare.
That's much simpler.
> The nasty part is locking that array of zones, and its length, and the
> cached zone*. I guess it'd need to be RCUed.
And locking becomes simpler too. It's just a check of
cpuset_zone_allowed(current->cpuset->current_allocation_zone), in
get_page_from_freelist(), isn't it?
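In the same pseudocode style as the open-coded example earlier in the
thread, the fast path might then look roughly like this (sketch only;
'current_allocation_zone' is an invented field, and none of this is code
from any patch in this thread):

============================================================
static struct page *
get_page_from_freelist(gfp_t gfp_mask, unsigned int order,
			struct zonelist *zonelist, int alloc_flags)
{
	struct zone *cached = current->cpuset ?
			current->cpuset->current_allocation_zone : NULL;

	if (cached &&
	    (!(alloc_flags & ALLOC_CPUSET) ||
	     cpuset_zone_allowed(cached, gfp_mask))) {
		/* ... if 'cached' still has free pages, allocate and return ... */
	}

	/*
	 * ... otherwise fall back to the existing zonelist walk, and when a
	 * different zone satisfies the allocation, store it in
	 * current_allocation_zone (under whatever locking/RCU scheme is
	 * chosen) so the next allocation starts there ...
	 */
	return NULL;	/* placeholder for the sketch */
}
============================================================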
* Re: [PATCH] GFP_THISNODE for the slab allocator
2006-09-15 8:06 ` Andrew Morton
@ 2006-09-15 15:53 ` David Rientjes
2006-09-15 23:03 ` David Rientjes
1 sibling, 0 replies; 82+ messages in thread
From: David Rientjes @ 2006-09-15 15:53 UTC (permalink / raw)
To: Andrew Morton; +Cc: Paul Jackson, clameter, linux-mm
On Fri, 15 Sep 2006, Andrew Morton wrote:
> David has fixed numa=fake (it was badly busted) and has been experimenting
> with a 3GB machine sliced into 64 "nodes". So he can build containers
> whose memory allocation is variable in 40-odd-megabyte hunks.
>
The 40-odd-megabyte hunks are for numa=fake=64 (63*48M + 1*47M on my
machine). I've gone as high as numa=fake=128 (127*24M + 1*23M).
> I _think_ it goes all the way up to getting oom-killed (David?). The
> oom-killer appears to be doing the right thing - we don't want it to be
> killing processes which aren't inside the offending container.
>
Yes, the oom-killer will kill processes from within the offending cpuset. I
originally noticed it when I tried the kernel build after giving it only 64M.
It's trivial to trigger with anything that mlocks the memory; my favorite:
'usemem -m 2048 -n 2 -M' inside a cpuset configured for only 1G of memory,
with any number of batch jobs outside the cpuset that should not be, and
never are, killed.
David
* Re: [PATCH] GFP_THISNODE for the slab allocator
2006-09-15 5:00 ` Andrew Morton
2006-09-15 6:49 ` Paul Jackson
@ 2006-09-15 17:08 ` Christoph Lameter
2006-09-15 17:37 ` [PATCH] Add NUMA_BUILD definition in kernel.h to avoid #ifdef CONFIG_NUMA Christoph Lameter
` (2 subsequent siblings)
4 siblings, 0 replies; 82+ messages in thread
From: Christoph Lameter @ 2006-09-15 17:08 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-mm, Paul Jackson
On Thu, 14 Sep 2006, Andrew Morton wrote:
> hm. GFP_THISNODE is dangerous. For example, its use in
> kernel/profile.c:create_hash_tables() has gone and caused non-NUMA machines
> to use __GFP_NOWARN | __GFP_NORETRY in this situation.
>
> OK, that's relatively harmless here, but why on earth did non-NUMA
> machines want to make this change?
Right. We could define GFP_THISNODE to be 0 in the non-NUMA case. Note the
missing "__": the __GFP_xx flags cannot be redefined to 0, otherwise we get
into trouble with bit checks.
> Would it not be saner to do away with the dangerous GFP_THISNODE and then
> open-code __GFP_THISNODE in those places which want that behaviour?
That would bypass various processing in the page allocator. We are already
copying the fallback-list processing into other allocators, but this would
mean replicating even more of the page allocator elsewhere.
> And to then make non-NUMA __GFP_THISNODE equal literal zero, so we can
> remove the above ifdefs?
We can easily make GFP_THISNODE 0, which will make it easy to use.
> > + if (!objp)
> > + objp = __cache_alloc_node(cachep, flags, numa_node_id());
> > +#endif
>
> What happened to my `#define NUMA_BUILD 0 or 1' proposal? If we had that,
> the above could be
>
> if (NUMA_BUILD && !objp)
> objp = ...
Ok. Let's do that then.
* [PATCH] Add NUMA_BUILD definition in kernel.h to avoid #ifdef CONFIG_NUMA
2006-09-15 5:00 ` Andrew Morton
2006-09-15 6:49 ` Paul Jackson
2006-09-15 17:08 ` Christoph Lameter
@ 2006-09-15 17:37 ` Christoph Lameter
2006-09-15 17:38 ` [PATCH] Disable GFP_THISNODE in the non-NUMA case Christoph Lameter
2006-09-15 17:42 ` [PATCH] GFP_THISNODE for the slab allocator V2 Christoph Lameter
4 siblings, 0 replies; 82+ messages in thread
From: Christoph Lameter @ 2006-09-15 17:37 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-mm
The NUMA_BUILD constant is always available and will be set to 1
in NUMA builds. That way, checks that are valid only under CONFIG_NUMA
can easily be done without #ifdef CONFIG_NUMA.
For example:
if (NUMA_BUILD && <numa_condition>) {
...
}
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Index: linux-2.6.18-rc6-mm2/include/linux/kernel.h
===================================================================
--- linux-2.6.18-rc6-mm2.orig/include/linux/kernel.h 2006-09-13 20:00:38.000000000 -0500
+++ linux-2.6.18-rc6-mm2/include/linux/kernel.h 2006-09-15 12:19:55.293331280 -0500
@@ -352,4 +352,11 @@ struct sysinfo {
/* Trap pasters of __FUNCTION__ at compile-time */
#define __FUNCTION__ (__func__)
+/* This helps us to avoid #ifdef CONFIG_NUMA */
+#ifdef CONFIG_NUMA
+#define NUMA_BUILD 1
+#else
+#define NUMA_BUILD 0
+#endif
+
#endif
Index: linux-2.6.18-rc6-mm2/mm/page_alloc.c
===================================================================
--- linux-2.6.18-rc6-mm2.orig/mm/page_alloc.c 2006-09-15 12:17:47.000000000 -0500
+++ linux-2.6.18-rc6-mm2/mm/page_alloc.c 2006-09-15 12:27:01.079243677 -0500
@@ -957,7 +957,7 @@ get_page_from_freelist(gfp_t gfp_mask, u
*/
do {
zone = *z;
- if (unlikely((gfp_mask & __GFP_THISNODE) &&
+ if (unlikely(NUMA_BUILD && (gfp_mask & __GFP_THISNODE) &&
zone->zone_pgdat != zonelist->zones[0]->zone_pgdat))
break;
if ((alloc_flags & ALLOC_CPUSET) &&
@@ -1330,14 +1330,12 @@ unsigned int nr_free_pagecache_pages(voi
{
return nr_free_zone_pages(gfp_zone(GFP_HIGHUSER));
}
-#ifdef CONFIG_NUMA
-static void show_node(struct zone *zone)
+
+static inline void show_node(struct zone *zone)
{
- printk("Node %ld ", zone_to_nid(zone));
+ if (NUMA_BUILD)
+ printk("Node %ld ", zone_to_nid(zone));
}
-#else
-#define show_node(zone) do { } while (0)
-#endif
/*
* The node's effective length of inactive_list(s).
* [PATCH] Disable GFP_THISNODE in the non-NUMA case
2006-09-15 5:00 ` Andrew Morton
` (2 preceding siblings ...)
2006-09-15 17:37 ` [PATCH] Add NUMA_BUILD definition in kernel.h to avoid #ifdef CONFIG_NUMA Christoph Lameter
@ 2006-09-15 17:38 ` Christoph Lameter
2006-09-15 17:42 ` [PATCH] GFP_THISNODE for the slab allocator V2 Christoph Lameter
4 siblings, 0 replies; 82+ messages in thread
From: Christoph Lameter @ 2006-09-15 17:38 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-mm
GFP_THISNODE must be set to 0 in the non-NUMA case, otherwise we disable
retries and warnings for failing allocations in the SMP and UP cases.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Index: linux-2.6.18-rc6-mm2/include/linux/gfp.h
===================================================================
--- linux-2.6.18-rc6-mm2.orig/include/linux/gfp.h 2006-09-15 12:17:39.000000000 -0500
+++ linux-2.6.18-rc6-mm2/include/linux/gfp.h 2006-09-15 12:29:06.607417253 -0500
@@ -67,7 +67,12 @@ struct vm_area_struct;
#define GFP_HIGHUSER (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HARDWALL | \
__GFP_HIGHMEM)
+#ifdef CONFIG_NUMA
#define GFP_THISNODE (__GFP_THISNODE | __GFP_NOWARN | __GFP_NORETRY)
+#else
+#define GFP_THISNODE 0
+#endif
+
/* Flag - indicates that the buffer will be suitable for DMA. Ignored on some
platforms, used as appropriate on others */
* [PATCH] GFP_THISNODE for the slab allocator V2
2006-09-15 5:00 ` Andrew Morton
` (3 preceding siblings ...)
2006-09-15 17:38 ` [PATCH] Disable GFP_THISNODE in the non-NUMA case Christoph Lameter
@ 2006-09-15 17:42 ` Christoph Lameter
4 siblings, 0 replies; 82+ messages in thread
From: Christoph Lameter @ 2006-09-15 17:42 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-mm
This patch ensures that the slab node lists in the NUMA case only contain
slabs that belong to that specific node. All slab allocations use
GFP_THISNODE when calling into the page allocator. If an allocation fails
then we fall back in the slab allocator according to the zonelists
appropriate for a certain context.
This replicates the behavior of alloc_pages and alloc_pages_node in the
slab layer.
Currently allocations requested from the page allocator may be redirected
via cpusets to other nodes. This results in remote pages on nodelists and
that in turn results in interrupt latency issues during cache draining.
Plus the slab is handing out memory as local when it is really remote.
Fallback for slab memory allocations will occur within the slab
allocator and not in the page allocator. This is necessary in order
to be able to use the existing pools of objects on the nodes that
we fall back to before adding more pages to a slab.
The fallback function ensures that the nodes we fall back to obey
cpuset restrictions of the current context. We do not allocate
objects from outside of the current cpuset context like before.
Note that the implementation of locality constraints within the slab
allocator requires importing logic from the page allocator. This is a
mishmash that is not that great. Other allocators (uncached allocator,
vmalloc, huge pages) face similar problems and have similar minimal
reimplementations of the basic fallback logic of the page allocator.
There is another way of implementing a slab by avoiding per node lists
(see modular slab) but this won't work within the existing slab.
V1->V2:
- Use NUMA_BUILD to avoid #ifdef CONFIG_NUMA
- Exploit GFP_THISNODE being 0 in the non-NUMA case to avoid another
#ifdef
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Index: linux-2.6.18-rc6-mm2/mm/slab.c
===================================================================
--- linux-2.6.18-rc6-mm2.orig/mm/slab.c 2006-09-15 12:27:19.675539677 -0500
+++ linux-2.6.18-rc6-mm2/mm/slab.c 2006-09-15 12:40:51.533035438 -0500
@@ -1566,7 +1566,13 @@ static void *kmem_getpages(struct kmem_c
*/
flags |= __GFP_COMP;
#endif
- flags |= cachep->gfpflags;
+
+ /*
+ * Under NUMA we want memory on the indicated node. We will handle
+ * the needed fallback ourselves since we want to serve from our
+ * per node object lists first for other nodes.
+ */
+ flags |= cachep->gfpflags | GFP_THISNODE;
page = alloc_pages_node(nodeid, flags, cachep->gfporder);
if (!page)
@@ -3075,16 +3081,23 @@ static __always_inline void *__cache_all
local_irq_save(save_flags);
-#ifdef CONFIG_NUMA
- if (unlikely(current->flags & (PF_SPREAD_SLAB | PF_MEMPOLICY))) {
+ if (unlikely(NUMA_BUILD &&
+ (current->flags & (PF_SPREAD_SLAB | PF_MEMPOLICY)))) {
objp = alternate_node_alloc(cachep, flags);
if (objp != NULL)
goto out;
}
-#endif
objp = ____cache_alloc(cachep, flags);
out:
+
+ /*
+ * We may just have run out of memory on the local node.
+ * __cache_alloc_node knows how to locate memory on other nodes.
+ */
+ if (NUMA_BUILD && !objp)
+ objp = __cache_alloc_node(cachep, flags, numa_node_id());
+
local_irq_restore(save_flags);
objp = cache_alloc_debugcheck_after(cachep, flags, objp,
caller);
@@ -3103,7 +3116,7 @@ static void *alternate_node_alloc(struct
{
int nid_alloc, nid_here;
- if (in_interrupt())
+ if (in_interrupt() || (flags & __GFP_THISNODE))
return NULL;
nid_alloc = nid_here = numa_node_id();
if (cpuset_do_slab_mem_spread() && (cachep->flags & SLAB_MEM_SPREAD))
@@ -3116,6 +3129,28 @@ static void *alternate_node_alloc(struct
}
/*
+ * Fallback function if there was no memory available and no objects on a
+ * certain node and we are allowed to fall back. We mimic the behavior of
+ * the page allocator. We fall back according to a zonelist determined by
+ * the policy layer while obeying cpuset constraints.
+ */
+void *fallback_alloc(struct kmem_cache *cache, gfp_t flags)
+{
+ struct zonelist *zonelist = &NODE_DATA(slab_node(current->mempolicy))
+ ->node_zonelists[gfp_zone(flags)];
+ struct zone **z;
+ void *obj = NULL;
+
+ for (z = zonelist->zones; *z && !obj; z++)
+ if (zone_idx(*z) <= ZONE_NORMAL &&
+ cpuset_zone_allowed(*z, flags))
+ obj = __cache_alloc_node(cache,
+ flags | __GFP_THISNODE,
+ zone_to_nid(*z));
+ return obj;
+}
+
+/*
* An interface to enable slab creation on nodeid
*/
static void *__cache_alloc_node(struct kmem_cache *cachep, gfp_t flags,
@@ -3168,14 +3203,30 @@ retry:
must_grow:
spin_unlock(&l3->list_lock);
x = cache_grow(cachep, flags, nodeid);
+ if (x)
+ goto retry;
- if (!x)
- return NULL;
+ if (!(flags & __GFP_THISNODE))
+ /* Unable to grow the cache. Fall back to other nodes. */
+ return fallback_alloc(cachep, flags);
+
+ return NULL;
- goto retry;
done:
return obj;
}
+#else
+static inline void *__cache_alloc_node(struct kmem_cache *cachep,
+ gfp_t flags, int nodeid)
+{
+ return NULL;
+}
+
+static inline void *alternate_node_alloc(struct kmem_cache *cachep,
+ gfp_t flags)
+{
+ return NULL;
+}
#endif
/*
* Re: [PATCH] GFP_THISNODE for the slab allocator
2006-09-15 8:06 ` Andrew Morton
2006-09-15 15:53 ` David Rientjes
@ 2006-09-15 23:03 ` David Rientjes
2006-09-16 0:04 ` Paul Jackson
` (2 more replies)
1 sibling, 3 replies; 82+ messages in thread
From: David Rientjes @ 2006-09-15 23:03 UTC (permalink / raw)
To: Andrew Morton; +Cc: Paul Jackson, clameter, linux-mm, rientjes
On Fri, 15 Sep 2006, Andrew Morton wrote:
> Yes. Speeding up get_page_from_freelist() is less than totally trivial.
> I've been putting off thinking about it until we're pretty sure that there
> aren't any other showstoppers.
>
> I'm (very) impressed at how well the infrastructre which you and Christoph
> have put together has held up under this.
>
I used numa=fake=64 for 64 nodes of 48M each (with my numa=fake fix). I
created a 2G cpuset with 43 nodes (43*48M = ~2G) and attached 'usemem -m
1500 -s 10000000 &' to it for 1.5G of anonymous memory. I then used
readprofile to time and profile a kernel build of 2.6.18-rc5 with x86_64
defconfig in the remaining 21 nodes.
Kernel build within the 2G container:
real 5m23.057s
user 9m13.395s
sys 1m15.417s
Unrestricted kernel build (no NUMA emulation or usemem):
real 5m3.213s
user 9m19.483s
sys 0m32.014s <-- more than twice as fast
In 2G container:
10599 __cpuset_zone_allowed 50.4714
3521 mwait_idle 45.1410
1149 clear_page 20.1579
24 clear_page_end 3.4286
215 find_get_page 3.0282
110 pfn_to_page 2.3913
130 __down_read_trylock 1.9697
86 page_remove_rmap 1.9545
150 find_vma 1.7241
46 __strnlen_user 1.1795
32 nr_free_pages 1.1034
55 page_to_pfn 1.0784
22 page_add_file_rmap 1.0000
829 get_page_from_freelist 0.8904
1548 do_page_fault 0.8586
17 file_ra_state_init 0.8500
63 _atomic_dec_and_lock 0.7500
85 ia32_sysenter_target 0.7083
47 cond_resched 0.6912
198 copy_user_generic 0.6644
Unrestricted:
3719 mwait_idle 47.6795
1083 clear_page 19.0000
20 clear_page_end 2.8571
175 find_get_page 2.4648
77 page_remove_rmap 1.7500
114 __down_read_trylock 1.7273
77 pfn_to_page 1.6739
144 find_vma 1.6552
60 __strnlen_user 1.5385
71 page_to_pfn 1.3922
24 page_add_file_rmap 1.0909
17 fput 1.0000
80 _atomic_dec_and_lock 0.9524
4 up_write 0.8000
1439 do_page_fault 0.7981
13 compat_sys_open 0.7647
227 copy_user_generic 0.7617
89 ia32_sysenter_target 0.7417
21 memcmp 0.6562
13 file_ra_state_init 0.6500
...
389 get_page_from_freelist 0.4178
David
* Re: [PATCH] GFP_THISNODE for the slab allocator
2006-09-15 23:03 ` David Rientjes
@ 2006-09-16 0:04 ` Paul Jackson
2006-09-16 1:36 ` Andrew Morton
2006-09-16 2:47 ` Christoph Lameter
2006-09-17 3:45 ` David Rientjes
2 siblings, 1 reply; 82+ messages in thread
From: Paul Jackson @ 2006-09-16 0:04 UTC (permalink / raw)
To: David Rientjes; +Cc: akpm, clameter, linux-mm
Thanks for doing this, David.
> I used numa=fake=64 for 64 nodes of 48M each (with my numa=fake fix). I
> created a 2G cpuset with 43 nodes (43*48M = ~2G) and attached 'usemem -m
> 1500 -s 10000000 &' to it for 1.5G of anonymous memory. I then used
> readprofile to time and profile a kernel build of 2.6.18-rc5 with x86_64
> defconfig in the remaining 21 nodes.
I got confused here. Was the kernel build running in the
2G cpuset (which only had 0.5G remaining free), or was it
running on the remaining 21 nodes, outside the 2G cpuset?
Separate question - would it be easy to run this again, with
a little patch from me that open coded cpuset_zone_allowed()
in get_page_from_freelist()? The patch I have in mind would
not be acceptable for the real kernel, but it would give us
an idea of whether just a local code change might be sufficient
here.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
* Re: [PATCH] GFP_THISNODE for the slab allocator
2006-09-16 0:04 ` Paul Jackson
@ 2006-09-16 1:36 ` Andrew Morton
2006-09-16 2:23 ` Christoph Lameter
` (2 more replies)
0 siblings, 3 replies; 82+ messages in thread
From: Andrew Morton @ 2006-09-16 1:36 UTC (permalink / raw)
To: Paul Jackson; +Cc: David Rientjes, clameter, linux-mm
On Fri, 15 Sep 2006 17:04:55 -0700
Paul Jackson <pj@sgi.com> wrote:
> Thanks for doing this, David.
>
> > I used numa=fake=64 for 64 nodes of 48M each (with my numa=fake fix). I
> > created a 2G cpuset with 43 nodes (43*48M = ~2G) and attached 'usemem -m
> > 1500 -s 10000000 &' to it for 1.5G of anonymous memory. I then used
> > readprofile to time and profile a kernel build of 2.6.18-rc5 with x86_64
> > defconfig in the remaining 21 nodes.
>
> I got confused here. Was the kernel build running in the
> 2G cpuset (which only had 0.5G remaining free), or was it
> running on the remaining 21 nodes, outside the 2G cpuset?
The former.
> Separate question - would it be easy to run this again, with
> a little patch from me that open coded cpuset_zone_allowed()
> in get_page_from_freelist()?
I guess it would, but that'll be a next-week thing.
I need to get David's stuff working on my machine, but you know how it is...
> The patch I have in mind would
> not be acceptable for the real kernel, but it would give us
> an idea of whether just a local code change might be sufficient
> here.
I doubt it - if we still hit z->zone_pgdat->node_id for all 40-odd zones,
I expect the cost will be comparable.
* Re: [PATCH] GFP_THISNODE for the slab allocator
2006-09-16 1:36 ` Andrew Morton
@ 2006-09-16 2:23 ` Christoph Lameter
2006-09-16 4:34 ` Andrew Morton
2006-09-16 3:28 ` [PATCH] Add node to zone for the NUMA case Christoph Lameter
2006-09-16 3:45 ` [PATCH] GFP_THISNODE for the slab allocator Paul Jackson
2 siblings, 1 reply; 82+ messages in thread
From: Christoph Lameter @ 2006-09-16 2:23 UTC (permalink / raw)
To: Andrew Morton; +Cc: Paul Jackson, David Rientjes, linux-mm
On Fri, 15 Sep 2006, Andrew Morton wrote:
> I doubt it - if we still hit z->zone_pgdat->node_id for all 40-odd zones,
> I expect the cost will be comparable.
This is the zone_to_nid() macro.
Could we add a node id field to struct zone, so that zone_to_nid() becomes a
single lookup?
Note that zone_pgdat sits outside the hot area of the zone structure; it's
therefore slow to reach.
* Re: [PATCH] GFP_THISNODE for the slab allocator
2006-09-15 23:03 ` David Rientjes
2006-09-16 0:04 ` Paul Jackson
@ 2006-09-16 2:47 ` Christoph Lameter
2006-09-17 3:45 ` David Rientjes
2 siblings, 0 replies; 82+ messages in thread
From: Christoph Lameter @ 2006-09-16 2:47 UTC (permalink / raw)
To: David Rientjes; +Cc: Andrew Morton, Paul Jackson, linux-mm
On Fri, 15 Sep 2006, David Rientjes wrote:
> I used numa=fake=64 for 64 nodes of 48M each (with my numa=fake fix). I
> created a 2G cpuset with 43 nodes (43*48M = ~2G) and attached 'usemem -m
> 1500 -s 10000000 &' to it for 1.5G of anonymous memory. I then used
> readprofile to time and profile a kernel build of 2.6.18-rc5 with x86_64
> defconfig in the remaining 21 nodes.
Hmmm... The patch in -mm for zone reduction will only get rid of
ZONE_HIGHMEM, which is not in these zonelists at all.
If you have a clean x86 machine whose DMA engines can do I/O to all of
memory, then you could run with a single ZONE_NORMAL per node, which may cut
the number of tests down to a third. For that you would need the optional
ZONE_DMA patch that was posted to linux-mm this week, and configure the
kernel without ZONE_DMA and ZONE_DMA32.
* [PATCH] Add node to zone for the NUMA case.
2006-09-16 1:36 ` Andrew Morton
2006-09-16 2:23 ` Christoph Lameter
@ 2006-09-16 3:28 ` Christoph Lameter
2006-09-16 3:40 ` Paul Jackson
2006-09-16 3:45 ` [PATCH] GFP_THISNODE for the slab allocator Paul Jackson
2 siblings, 1 reply; 82+ messages in thread
From: Christoph Lameter @ 2006-09-16 3:28 UTC (permalink / raw)
To: Andrew Morton; +Cc: Paul Jackson, David Rientjes, linux-mm
Add the node id to struct zone in order to optimize zone_to_nid().
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Index: linux-2.6.18-rc6-mm2/include/linux/mmzone.h
===================================================================
--- linux-2.6.18-rc6-mm2.orig/include/linux/mmzone.h 2006-09-15 12:26:08.000000000 -0500
+++ linux-2.6.18-rc6-mm2/include/linux/mmzone.h 2006-09-15 22:11:53.103125905 -0500
@@ -168,6 +168,7 @@ struct zone {
unsigned long lowmem_reserve[MAX_NR_ZONES];
#ifdef CONFIG_NUMA
+ int node;
/*
* zone reclaim becomes active if more unmapped pages exist.
*/
Index: linux-2.6.18-rc6-mm2/mm/page_alloc.c
===================================================================
--- linux-2.6.18-rc6-mm2.orig/mm/page_alloc.c 2006-09-15 12:43:12.000000000 -0500
+++ linux-2.6.18-rc6-mm2/mm/page_alloc.c 2006-09-15 22:13:02.412755301 -0500
@@ -2477,6 +2477,7 @@ static void __meminit free_area_init_cor
zone->spanned_pages = size;
zone->present_pages = realsize;
#ifdef CONFIG_NUMA
+ zone->node = nid;
zone->min_unmapped_pages = (realsize*sysctl_min_unmapped_ratio)
/ 100;
zone->min_slab_pages = (realsize * sysctl_min_slab_ratio) / 100;
Index: linux-2.6.18-rc6-mm2/include/linux/mm.h
===================================================================
--- linux-2.6.18-rc6-mm2.orig/include/linux/mm.h 2006-09-15 12:43:12.000000000 -0500
+++ linux-2.6.18-rc6-mm2/include/linux/mm.h 2006-09-15 22:13:23.418455663 -0500
@@ -449,7 +449,7 @@ static inline int page_zone_id(struct pa
static inline unsigned long zone_to_nid(struct zone *zone)
{
- return zone->zone_pgdat->node_id;
+ return zone->node;
}
#ifdef NODE_NOT_IN_PAGE_FLAGS
* Re: [PATCH] GFP_THISNODE for the slab allocator
2006-09-15 8:28 ` Andrew Morton
@ 2006-09-16 3:38 ` Paul Jackson
2006-09-16 4:42 ` Andi Kleen
2006-09-16 4:48 ` Andrew Morton
0 siblings, 2 replies; 82+ messages in thread
From: Paul Jackson @ 2006-09-16 3:38 UTC (permalink / raw)
To: Andrew Morton; +Cc: clameter, linux-mm, rientjes, ak
[Adding Andi to cc list, since I mention him below. -pj]
Andrew wrote:
> I'm thinking a) is easily solved by adding an array of the zones inside the
> `struct cpuset', and change get_page_from_freelist() to only look at those
> zones.
> ...
> err, if we cache the most-recently-allocated-from zone in the cpuset then
> we don't need the array-of-zones, do we? We'll only need to do a zone
> waddle when switching from one zone to the next, which is super-rare.
>
> That's much simpler.
> ...
> And locking becomes simpler too. It's just a check of
> cpuset_zone_allowed(current->cpuset->current_allocation_zone)
This will blow chunks performance-wise, with the current cpuset locking
scheme.
Just one current_allocation_zone would not be enough. Each node that
the cpuset allowed would require its own current_allocation_zone. For
example, on a big honkin NUMA box with 2 CPUs per Node, tasks running
on CPU 32, Node 16, might be able to find free memory right on that
Node 16. But another task in the same cpuset running on CPU 112, Node
56 might have to scan past a dozen Nodes to Node 68 to find memory.
Accessing anything from a cpuset that depends on what nodes it allows
requires taking the global mutex callback_mutex (in kernel/cpuset.c).
We don't want to put a global mutex on the page alloc hot code path.
Anything we need to access frequently from a task's cpuset has to be
cached in its task struct.
Three alternative possibilities:
1) Perhaps these most-recently-allocated-from zones shouldn't be
properties of the cpuset, nor even of the task, but of the zone structs.
If each zone struct on the zonelist had an additional flag bit marking
the zones that had no free memory, then we could navigate the zonelist
pretty quickly. One more bit per zone struct would be enough to track
a simple rescan mechanism, so that we could detect when a node that
had formerly run out of memory once again had free memory.
One or two bits per zone struct would be way cheaper, so far as
data space requirements.
Downside - it still hits each zone struct - suboptimal cache thrashing.
One less pointer chase than z->zone_pgdat->node_id, but still not
great.
2) It may be sufficient to locally optimize get_page_from_freelist()'s
calls to cpuset_zone_allowed() - basically open-code cpuset_zone_allowed(),
or at least refine its invocation (a rough sketch appears after this list).
This might require a second nodemask in the task struct, for the typically
larger set of nodes that GFP_KERNEL allocations can use, more than just
the nodes that GFP_USER can use. Such a second nodemask in the task struct
would enable me to avoid taking the global callback_mutex for some GFP_KERNEL
allocations on tight memory systems.
Downside #1 - still requires z->zone_pgdat->node_id. Andrew suspects
that this is enough of a problem in itself. From the profile, which
showed cpuset_zone_allowed(), not get_page_from_freelist(), at the
top of the list, and given that the node id is evaluated in the
get_page_from_freelist() routine, I was figuring that the real
problem was in the cpuset_zone_allowed() code. Perhaps some testing
of a simple hack approximation to this patch will tell us - next week.
Downside #2 - may require the above mentioned additional nodemask_t
in the task struct.
3) The custom zonelist option - which was part of my original cpuset
proposal, and which Andi K and I have gone back and forth on, with
each of us liking and disliking it, at different times. See further
my latest writeup on this option:
http://lkml.org/lkml/2005/11/5/252
Date Sat, 5 Nov 2005 20:18:41 -0800
From Paul Jackson <pj@sgi.com>
Subject Re: [PATCH]: Clean up of __alloc_pages
My current plan - see if somehow I can code up and get tested (2),
since a rough approximation to it would be trivial to code. If that
works, go with it, unless someone convinces me otherwise. If (2) can't
do the job, try (1), since that seems easier to code. If that fails,
or someone shoots that down, or Andi makes a good enough case for (3),
give (3) a go - that's the hardest path, and risks the most collateral
damage to the behaviour of the memory paging subsystem.
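A rough sketch of what option (2) could look like - invented names, not
existing code; the only real interfaces used (current->mems_allowed,
__GFP_HARDWALL, node_isset(), in_interrupt(), zone_to_nid()) already appear
elsewhere in this thread:

============================================================
/*
 * Hypothetical sketch of option (2).  task_struct would carry a second
 * pre-computed nodemask (here called mems_allowed_kernel) with the larger
 * set of nodes that GFP_KERNEL allocations may use, updated whenever the
 * cpuset's allowed nodes change.  The hot path is then a plain bit test,
 * with no global callback_mutex and no call out to kernel/cpuset.c.
 */
static inline int cpuset_node_allowed_fast(int node, gfp_t gfp_mask)
{
	if (in_interrupt())
		return 1;
	if (gfp_mask & __GFP_HARDWALL)		/* GFP_USER-style allocation */
		return node_isset(node, current->mems_allowed);
	/* mems_allowed_kernel is the invented second nodemask */
	return node_isset(node, current->mems_allowed_kernel);
}

/*
 * get_page_from_freelist()'s loop would then do:
 *
 *	if ((alloc_flags & ALLOC_CPUSET) &&
 *	    !cpuset_node_allowed_fast(zone_to_nid(*z), gfp_mask))
 *		continue;
 */
============================================================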
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
* Re: [PATCH] Add node to zone for the NUMA case.
2006-09-16 3:28 ` [PATCH] Add node to zone for the NUMA case Christoph Lameter
@ 2006-09-16 3:40 ` Paul Jackson
0 siblings, 0 replies; 82+ messages in thread
From: Paul Jackson @ 2006-09-16 3:40 UTC (permalink / raw)
To: Christoph Lameter; +Cc: akpm, rientjes, linux-mm
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
* Re: [PATCH] GFP_THISNODE for the slab allocator
2006-09-16 1:36 ` Andrew Morton
2006-09-16 2:23 ` Christoph Lameter
2006-09-16 3:28 ` [PATCH] Add node to zone for the NUMA case Christoph Lameter
@ 2006-09-16 3:45 ` Paul Jackson
2 siblings, 0 replies; 82+ messages in thread
From: Paul Jackson @ 2006-09-16 3:45 UTC (permalink / raw)
To: Andrew Morton; +Cc: rientjes, clameter, linux-mm
Andrew, replying to pj:
> > Separate question - would it be easy to run this again, with
> > a little patch from me that open coded cpuset_zone_allowed()
> > in get_page_from_freelist()?
>
> I guess it would,
Ah - I was asking (in my mind) David, not Andrew, if this test
could be rerun.
> but that'll be a next-week thing.
Good idea.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
* Re: [PATCH] GFP_THISNODE for the slab allocator
2006-09-16 2:23 ` Christoph Lameter
@ 2006-09-16 4:34 ` Andrew Morton
0 siblings, 0 replies; 82+ messages in thread
From: Andrew Morton @ 2006-09-16 4:34 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Paul Jackson, David Rientjes, linux-mm
On Fri, 15 Sep 2006 19:23:43 -0700 (PDT)
Christoph Lameter <clameter@sgi.com> wrote:
> On Fri, 15 Sep 2006, Andrew Morton wrote:
>
> > I doubt it - if we still hit z->zone_pgdat->node_id for all 40-odd zones,
> > I expect the cost will be comparable.
>
> This is the zone_to_nid() macro.
>
> Could we add a node_id field to the zone? So zone_to_nid() becomes one
> lookup?
>
> Note that zone_pgdat is out of the hot zone in the zone structure. It's
> therefore slow.
Guys, it's still 50 cachelines per page. This one needs more than tweaking
- algorithmic changes are needed.
Is there something wrong with caching the most-recent-zone?
* Re: [PATCH] GFP_THISNODE for the slab allocator
2006-09-16 3:38 ` Paul Jackson
@ 2006-09-16 4:42 ` Andi Kleen
2006-09-16 11:38 ` Paul Jackson
2006-09-16 4:48 ` Andrew Morton
1 sibling, 1 reply; 82+ messages in thread
From: Andi Kleen @ 2006-09-16 4:42 UTC (permalink / raw)
To: Paul Jackson; +Cc: Andrew Morton, clameter, linux-mm, rientjes
On Saturday 16 September 2006 05:38, Paul Jackson wrote:
> 3) The custom zonelist option - which was part of my original cpuset
> proposal, and which Andi K and I have gone back and forth on, with
> each of us liking and disliking it, at different times. See further
> my latest writeup on this option:
I'm currently back in the camp of liking it. It should be the fastest
in the fast path as far as I know and the slow path code
is probably not as bad as I originally thought (didn't you already
have it coded up at some point?)
-Andi
* Re: [PATCH] GFP_THISNODE for the slab allocator
2006-09-16 3:38 ` Paul Jackson
2006-09-16 4:42 ` Andi Kleen
@ 2006-09-16 4:48 ` Andrew Morton
2006-09-16 11:30 ` Paul Jackson
1 sibling, 1 reply; 82+ messages in thread
From: Andrew Morton @ 2006-09-16 4:48 UTC (permalink / raw)
To: Paul Jackson; +Cc: clameter, linux-mm, rientjes, ak
On Fri, 15 Sep 2006 20:38:16 -0700
Paul Jackson <pj@sgi.com> wrote:
> [Adding Andi to cc list, since I mention him below. -pj]
>
> Andrew wrote:
> > I'm thinking a) is easily solved by adding an array of the zones inside the
> > `struct cpuset', and change get_page_from_freelist() to only look at those
> > zones.
> > ...
> > err, if we cache the most-recently-allocated-from zone in the cpuset then
> > we don't need the array-of-zones, do we? We'll only need to do a zone
> > waddle when switching from one zone to the next, which is super-rare.
> >
> > That's much simpler.
> > ...
> > And locking becomes simpler too. It's just a check of
> > cpuset_zone_allowed(current->cpuset->current_allocation_zone)
>
> This will blow chunks performance wise, with the current cpuset locking
> scheme.
>
> Just one current_allocation_zone would not be enough. Each node that
> the cpuset allowed would require its own current_allocation_zone. For
> example, on a big honkin NUMA box with 2 CPUs per Node, tasks running
> on CPU 32, Node 16, might be able to find free memory right on that
> Node 16. But another task in the same cpuset running on CPU 112, Node
> 56 might have to scan past a dozen Nodes to Node 68 to find memory.
>
> Accessing anything from a cpuset that depends on what nodes it allows
> requires taking the global mutex callback_mutex (in kernel/cpuset.c).
> We don't want to put a global mutex on the page alloc hot code path.
>
> Anything we need to access frequently from a tasks cpuset has to be
> cached in its task struct.
>
> Three alternative possibilities:
>
> 1) Perhaps these most-recently-allocated-from-zone's shouldn't be
> properties of the cpuset, nor even of the task, but of the zone structs.
>
> If each zone struct on the zonelist had an additional flag bit marking
> the zones that had no free memory, then we could navigate the zonelist
> pretty quickly. One more bit per zone struct would be enough to track
> a simple rescan mechanism, so that we could detect when a node that
> had formerly run out of memory once again had free memory.
>
> One or two bits per zone struct would be way cheaper, so far as
> data space requirements.
>
> Downside - it still hits each zone struct - suboptimal cache trashing.
> One less pointer chase than z->zone_pgdat->node_id, but still not
> great.
>
> 2) It may be sufficient to locally optimize get_page_from_freelist()'s
> calls to cpuset_zone_allowed() - basically open code cpuset_zone_allowed,
> or at least refine its invocation.
>
> This might require a second nodemask in the task struct, for the typically
> larger set of nodes that GFP_KERNEL allocations can use, more than just
> the nodes that GFP_USER can use. Such a second nodemask in the task struct
> would enable me to avoid taking the global callback_mutex for some GFP_KERNEL
> allocations on tight memory systems.
>
> Downside #1 - still requires z->zone_pgdat->node_id. Andrew suspects
> that this is enough of a problem in itself. From the profile, which
> showed cpuset_zone_allowed(), not get_page_from_freelist(), at the
> top of the list, given that the node id is evaluated in the
> get_page_from_freelist() routine, I was figuring that the real
> problem was in the cpuset_zone_allowed() code. Perhaps some testing
> of a simple hack approximation to this patch will tell us - next week.
>
> Downside #2 - may require the above mentioned additional nodemask_t
> in the task struct.
>
> 3) The custom zonelist option - which was part of my original cpuset
> proposal, and which Andi K and I have gone back and forth on, with
> each of us liking and disliking it, at different times. See further
> my latest writeup on this option:
>
> http://lkml.org/lkml/2005/11/5/252
> Date Sat, 5 Nov 2005 20:18:41 -0800
> From Paul Jackson <pj@sgi.com>
> Subject Re: [PATCH]: Clean up of __alloc_pages
>
> My current plan - see if somehow I can code up and get tested (2),
> since a rough approximation to it would be trivial to code. If that
> works, go with it, unless someone convinces me otherwise. If (2) can't
> do the job, try (1), since that seems easier to code. If that fails,
> or someone shoots that down, or Andi makes a good enough case for (3),
> give (3) a go - that's the hardest path, and risks the most collateral
> damage to the behaviour of the memory paging subsystem.
>
hm.
Why is it not sufficient to cache the most-recent zone* in task_struct?
* Re: [PATCH] GFP_THISNODE for the slab allocator
2006-09-16 4:48 ` Andrew Morton
@ 2006-09-16 11:30 ` Paul Jackson
2006-09-16 15:18 ` Andrew Morton
0 siblings, 1 reply; 82+ messages in thread
From: Paul Jackson @ 2006-09-16 11:30 UTC (permalink / raw)
To: Andrew Morton; +Cc: clameter, linux-mm, rientjes, ak
Andrew wrote:
> Why is it not sufficient to cache the most-recent zone* in task_struct?
Because ...
pj - quoting himself:
> Just one current_allocation_zone would not be enough. Each node that
> the cpuset allowed would require its own current_allocation_zone. For
> example, on a big honkin NUMA box with 2 CPUs per Node, tasks running
> on CPU 32, Node 16, might be able to find free memory right on that
> Node 16. But another task in the same cpuset running on CPU 112, Node
> 56 might have to scan past a dozen Nodes to Node 68 to find memory.
Extending the above example, the task on CPU 32 and the one on CPU
112 could be the same task, running in the same cpuset the whole time,
after being rescheduled from one CPU to another. The task would need
not one cached most-recent zone*, but one for each node it might find
itself on.
I'm pretty sure you don't want to put MAX_NUMNODES 'struct zone'
pointers in each task struct.
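For scale, a standalone back-of-the-envelope illustration of what that
would cost. The node count and field name are made up for the example
(1024 is just a large possible-nodes config, not a claim about any
particular machine):

#include <stdio.h>

#define EXAMPLE_MAX_NUMNODES 1024       /* illustrative config, not a real value */

/* Hypothetical: one cached zone pointer per possible node, per task. */
struct per_task_zone_cache {
        void *recent_zone[EXAMPLE_MAX_NUMNODES];
};

int main(void)
{
        /* 1024 pointers * 8 bytes = 8 KB added to every task struct (64 bit). */
        printf("%zu bytes per task\n", sizeof(struct per_task_zone_cache));
        return 0;
}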
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
* Re: [PATCH] GFP_THISNODE for the slab allocator
2006-09-16 4:42 ` Andi Kleen
@ 2006-09-16 11:38 ` Paul Jackson
0 siblings, 0 replies; 82+ messages in thread
From: Paul Jackson @ 2006-09-16 11:38 UTC (permalink / raw)
To: Andi Kleen; +Cc: akpm, clameter, linux-mm, rientjes
Andi wrote:
> I'm currently back in the camp of liking it. It should be the fastest
> in the fast path as far as I know and the slow path code
> is probably not as bad as I originally thought
Unfortunately, I don't think that this proposal, alternative (3) "The
custom zonelist option", handles the fake numa node case that Andrew is
raising with the desired performance. For Andrew's particular load, it
would still have long zonelists that had to be scanned before finding a
node with free memory.
> (didn't you already have it coded up at some point?)
Yup - in the link I provided describing this:
http://lkml.org/lkml/2005/11/5/252
there is a link to my original patch:
http://lkml.org/lkml/2004/8/2/256
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
* Re: [PATCH] GFP_THISNODE for the slab allocator
2006-09-15 7:23 ` Andrew Morton
2006-09-15 7:44 ` Paul Jackson
2006-09-15 8:28 ` Andrew Morton
@ 2006-09-16 11:48 ` Paul Jackson
2006-09-16 15:38 ` Andrew Morton
2 siblings, 1 reply; 82+ messages in thread
From: Paul Jackson @ 2006-09-16 11:48 UTC (permalink / raw)
To: Andrew Morton; +Cc: clameter, linux-mm, rientjes
Andrew, replying to pj:
> > We shouldn't be heavily tuning for this case, and I am not aware of any
> > real world situations where real users would have reasonably determined
> > otherwise, had they had full realization of what was going on.
>
> gotcha ;)
In the thrill of the hunt, I overlooked one itsy bitsy detail.
This load still seems a tad artificial to me. What real world load
would run with 2/3 of the nodes having maxed out memory?
I suspect that it's worth some effort to improve it, but not worth
major effort to get ideal performance.
I'm still open to more persuasion however.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
* Re: [PATCH] GFP_THISNODE for the slab allocator
2006-09-16 11:30 ` Paul Jackson
@ 2006-09-16 15:18 ` Andrew Morton
2006-09-17 9:28 ` Paul Jackson
0 siblings, 1 reply; 82+ messages in thread
From: Andrew Morton @ 2006-09-16 15:18 UTC (permalink / raw)
To: Paul Jackson; +Cc: clameter, linux-mm, rientjes, ak
On Sat, 16 Sep 2006 04:30:36 -0700
Paul Jackson <pj@sgi.com> wrote:
> Andrew wrote:
> > Why is it not sufficient to cache the most-recent zone* in task_struct?
>
> Because ...
>
> pj - quoting himself:
> > Just one current_allocation_zone would not be enough. Each node that
> > the cpuset allowed would require its own current_allocation_zone. For
> > example, on a big honkin NUMA box with 2 CPUs per Node, tasks running
> > on CPU 32, Node 16, might be able to find free memory right on that
> > Node 16. But another task in the same cpuset running on CPU 112, Node
> > 56 might have to scan past a dozen Nodes to Node 68 to find memory.
>
> Extending the above example, the task on CPU 32 and the one on CPU
> 112 could be the same task, running in the same cpuset the whole time,
> after being rescheduled from one CPU to another. The task would need
> not one cached most-recent zone*, but one for each node it might find
> itself on.
>
> I'm pretty sure you don't want to put MAX_NUMNODES 'struct zone'
> pointers in each task struct.
>
Could cache a zone* and the cpu number. If the cpu number has changed
since last time, do another lookup.
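A standalone sketch of that rule, with hypothetical names (zone_cache,
lookup_zone_for_cpu); the slow-path lookup is left as a stub, and this is
only an illustration of the suggested caching, not proposed kernel code:

struct fake_zone;                       /* opaque stand-in for struct zone */

/* Hypothetical per-task cache of the last zone we allocated from. */
struct zone_cache {
        struct fake_zone *last_zone;
        int last_cpu;
};

/* Stub: the existing full zonelist walk (the slow path). */
struct fake_zone *lookup_zone_for_cpu(int cpu);

struct fake_zone *cached_zone(struct zone_cache *c, int this_cpu)
{
        /* Reuse the cached zone only while we stay on the same cpu. */
        if (c->last_zone && c->last_cpu == this_cpu)
                return c->last_zone;

        /* Cpu changed (or nothing cached yet): redo the lookup once. */
        c->last_zone = lookup_zone_for_cpu(this_cpu);
        c->last_cpu = this_cpu;
        return c->last_zone;
}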
* Re: [PATCH] GFP_THISNODE for the slab allocator
2006-09-16 11:48 ` Paul Jackson
@ 2006-09-16 15:38 ` Andrew Morton
2006-09-16 21:51 ` Paul Jackson
0 siblings, 1 reply; 82+ messages in thread
From: Andrew Morton @ 2006-09-16 15:38 UTC (permalink / raw)
To: Paul Jackson; +Cc: clameter, linux-mm, rientjes
On Sat, 16 Sep 2006 04:48:47 -0700
Paul Jackson <pj@sgi.com> wrote:
> Andrew, replying to pj:
> > > We shouldn't be heavily tuning for this case, and I am not aware of any
> > > real world situations where real users would have reasonably determined
> > > otherwise, had they had full realization of what was going on.
> >
> > gotcha ;)
>
> In the thrill of the hunt, I overlooked one itsy bitsy detail.
>
> This load still seems a tad artificial to me. What real world load
> would run with 2/3's of the nodes having max'd out memory?
Pretty much all loads? If you haven't consumed most of the "container"'s
memory then you have overprovisioned its size.
It could just be pagecache.
* Re: [PATCH] GFP_THISNODE for the slab allocator
2006-09-16 15:38 ` Andrew Morton
@ 2006-09-16 21:51 ` Paul Jackson
2006-09-16 23:10 ` Andrew Morton
0 siblings, 1 reply; 82+ messages in thread
From: Paul Jackson @ 2006-09-16 21:51 UTC (permalink / raw)
To: Andrew Morton; +Cc: clameter, linux-mm, rientjes
Andrew wrote:
> Pretty much all loads? If you haven't consumed most of the "container"'s
> memory then you have overprovisioned its size.
Not so on real NUMA boxes. If you configure your system so that
you are having to go a long way off-node for much of your memory,
then your performance is screwed.
No one in their right mind would run a memory hog that eats 40 nodes
of memory and a kernel build both in the same 60 node, small CPU
count cpuset on a real NUMA box.
The primary motivation for cpusets is to improve memory locality on
NUMA boxes. You're using fake numa and cpusets to simulate destroying
memory locality.
On a real 64 node NUMA box, there would be 64 differently sorted
zonelists, each one centered on a different node. The kernel build
would be running on different CPUs, associated with different nodes
than the memory hog, and it would be using zonelists that had the
unloaded (still has free memory) nodes at the front of the list.
Aha - maybe this is the problem - the fake numa stuff is missing the
properly sorted zone lists.
We normally pick which zone list to use based on which CPU the
thread is running on, but that assumes there are at least as many
CPUs as Nodes, so that we have a many-to-one map from CPUs to Nodes.
In this fake numa setup, we've got way more Nodes than CPUs.
The current macro wrapper that calls __alloc_pages() in gfp.h relies
on per-node data to select a zonelist, and the node is selected based
on the task's current CPU. You can see this in the following code
from mmzone.h and gfp.h:
#define numa_node_id()          (cpu_to_node(raw_smp_processor_id()))

        nid = numa_node_id();
        return __alloc_pages(gfp_mask, order,
                        NODE_DATA(nid)->node_zonelists + gfp_zone(gfp_mask));
For the fake numa mechanism to work well, it would need to assign a
different zone list to the kernel build tasks in this test case than
it assigned to the memory hogs. These different zone lists would
have to have a different sort, with unloaded nodes at the front.
The definition of numa_node_id() would have to depend on more than
just the processor id for this to work.
Perhaps the memory hog and the kernel build should be in separate
cpusets, and the zonelists they were assigned should depend on their
cpuset (not just their current CPU, as it is now), so that they
got zonelists with a different sort to them. Then the kernel build
wouldn't have to waddle past all the used up memory hog nodes to find
some free memory
...
Well, I intentionally overstated the case a bit. I doubt that we
should put a big effort -at this time- into elaborating the zonelists
and numa_node_id() mechanisms for x86_64 fake numa configurations.
Rather, we should recognize that it is only in such configurations that
the existing code in get_page_from_freelist() and cpuset_zone_allowed()
has the suboptimal performance observed here, due to the different
zonelist configuration.
In an earlier message, Andrew wrote:
> Guys, it's still 50 cachelines per page. This one needs more than tweaking
> - algorithmic changes are needed.
I disagree (respectfully disagree -- Andrew has *way* more experience
hacking memory code than I ;).
I still don't know if this x86_64 fake numa mechanism has any real
life outside of entertaining kernel memory hackers.
We should first see how far the localized tweaks get us.
I suspect the tweaks will help quite a bit and provide an adequate
basis for effective machine partitioning, as proposed by the various
bright sparks. If this partitioning proves useful, then I agree
that we should elaborate the algorithms and data structures to also
handle this usage well.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
* Re: [PATCH] GFP_THISNODE for the slab allocator
2006-09-16 21:51 ` Paul Jackson
@ 2006-09-16 23:10 ` Andrew Morton
2006-09-17 4:37 ` Christoph Lameter
0 siblings, 1 reply; 82+ messages in thread
From: Andrew Morton @ 2006-09-16 23:10 UTC (permalink / raw)
To: Paul Jackson; +Cc: clameter, linux-mm, rientjes
On Sat, 16 Sep 2006 14:51:17 -0700
Paul Jackson <pj@sgi.com> wrote:
> Andrew wrote:
> > Pretty much all loads? If you haven't consumed most of the "container"'s
> > memory then you have overprovisioned its size.
>
> Not so on real NUMA boxes.
I meant pretty much all loads when employing this trick of reusing the NUMA
code for containerisation.
> If you configure your system so that
> you are having to go a long way off-node for much of your memory,
> then your performance is screwed.
>
> No one in their right mind would run a memory hog that eats 40 nodes
> of memory and a kernel build both in the same 60 node, small CPU
> count cpuset on a real NUMA box.
>
> The primary motivation for cpusets is to improve memory locality on
> NUMA boxes. You're using fake numa and cpusets to simulate destroying
> memory locality.
>
> On a real 64 node NUMA box, there would be 64 differently sorted
> zonelists, each one centered on a different node. The kernel build
> would be running on different CPUs, associated with different nodes
> than the memory hog, and it would be using zonelists that had the
> unloaded (still has free memory) nodes at the front the list.
>
> Aha - maybe this is the problem - the fake numa stuff is missing the
> properly sorted zone lists.
I don't see how any of this could help. If one has a memory container
which is constructed from 50 zones, that linear search is just going to do
a lot of linear searching when the container approaches anything like
fullness.
It could well be a single CPU machine...
* Re: [PATCH] GFP_THISNODE for the slab allocator
2006-09-15 23:03 ` David Rientjes
2006-09-16 0:04 ` Paul Jackson
2006-09-16 2:47 ` Christoph Lameter
@ 2006-09-17 3:45 ` David Rientjes
2006-09-17 11:17 ` Paul Jackson
2 siblings, 1 reply; 82+ messages in thread
From: David Rientjes @ 2006-09-17 3:45 UTC (permalink / raw)
To: Andrew Morton; +Cc: Paul Jackson, clameter, linux-mm, rientjes
On Fri, 15 Sep 2006, David Rientjes wrote:
> On Fri, 15 Sep 2006, Andrew Morton wrote:
>
> > Yes. Speeding up get_page_from_freelist() is less than totally trivial.
> > I've been putting off thinking about it until we're pretty sure that there
> > aren't any other showstoppers.
> >
> > I'm (very) impressed at how well the infrastructre which you and Christoph
> > have put together has held up under this.
> >
>
> I used numa=fake=64 for 64 nodes of 48M each (with my numa=fake fix). I
> created a 2G cpuset with 43 nodes (43*48M = ~2G) and attached 'usemem -m
> 1500 -s 10000000 &' to it for 1.5G of anonymous memory. I then used
> readprofile to time and profile a kernel build of 2.6.18-rc5 with x86_64
> defconfig in the remaining 21 nodes.
>
> Kernel build within the 2G container:
> real 5m23.057s
> user 9m13.395s
> sys 1m15.417s
>
Kernel build within 2G cpuset with Christoph's patch:
real 5m15.758s
user 9m10.398s
sys 1m9.072s
> Unrestricted kernel build (no NUMA emulation or usemem):
> real 5m3.213s
> user 9m19.483s
> sys 0m32.014s <-- over twice faster
>
> In 2G container:
> 10599 __cpuset_zone_allowed 50.4714
> 3521 mwait_idle 45.1410
> 1149 clear_page 20.1579
> 24 clear_page_end 3.4286
> 215 find_get_page 3.0282
> 110 pfn_to_page 2.3913
> 130 __down_read_trylock 1.9697
> 86 page_remove_rmap 1.9545
> 150 find_vma 1.7241
> 46 __strnlen_user 1.1795
> 32 nr_free_pages 1.1034
> 55 page_to_pfn 1.0784
> 22 page_add_file_rmap 1.0000
> 829 get_page_from_freelist 0.8904
> 1548 do_page_fault 0.8586
> 17 file_ra_state_init 0.8500
> 63 _atomic_dec_and_lock 0.7500
> 85 ia32_sysenter_target 0.7083
> 47 cond_resched 0.6912
> 198 copy_user_generic 0.6644
>
In 2G cpuset with Christoph's patch:
9232 __cpuset_zone_allowed 43.9619
2083 mwait_idle 26.7051
973 clear_page 17.0702
131 pfn_to_page 2.8478
201 find_get_page 2.8310
18 clear_page_end 2.5714
142 __down_read_trylock 2.1515
87 page_remove_rmap 1.9773
149 find_vma 1.7126
85 page_to_pfn 1.6667
51 __strnlen_user 1.3077
34 nr_free_pages 1.1724
25 page_add_file_rmap 1.1364
18 fput 1.0588
21 file_ra_state_init 1.0500
108 ia32_sysenter_target 0.9000
74 _atomic_dec_and_lock 0.8810
818 get_page_from_freelist 0.8786
133 zone_watermark_ok 0.8526
1524 do_page_fault 0.8453
...
> Unrestricted:
> 3719 mwait_idle 47.6795
> 1083 clear_page 19.0000
> 20 clear_page_end 2.8571
> 175 find_get_page 2.4648
> 77 page_remove_rmap 1.7500
> 114 __down_read_trylock 1.7273
> 77 pfn_to_page 1.6739
> 144 find_vma 1.6552
> 60 __strnlen_user 1.5385
> 71 page_to_pfn 1.3922
> 24 page_add_file_rmap 1.0909
> 17 fput 1.0000
> 80 _atomic_dec_and_lock 0.9524
> 4 up_write 0.8000
> 1439 do_page_fault 0.7981
> 13 compat_sys_open 0.7647
> 227 copy_user_generic 0.7617
> 89 ia32_sysenter_target 0.7417
> 21 memcmp 0.6562
> 13 file_ra_state_init 0.6500
> ...
> 389 get_page_from_freelist 0.4178
>
David
* Re: [PATCH] GFP_THISNODE for the slab allocator
2006-09-16 23:10 ` Andrew Morton
@ 2006-09-17 4:37 ` Christoph Lameter
2006-09-17 4:55 ` Andrew Morton
2006-09-19 19:17 ` David Rientjes
0 siblings, 2 replies; 82+ messages in thread
From: Christoph Lameter @ 2006-09-17 4:37 UTC (permalink / raw)
To: Andrew Morton; +Cc: Paul Jackson, linux-mm, rientjes
On Sat, 16 Sep 2006, Andrew Morton wrote:
> I don't see how any of this could help. If one has a memory container
> which is constructed from 50 zones, that linear search is just going to do
> a lot of linear searching when the container approaches anything like
> fullness.
One would not construct a memory container from 50 zones but build a
single zone as a memory container of that size.
This could work by creating a new fake node and allocating a certain
amount of memory from the old zone for the fake node. Then one would have
a zone that is the container and not a container that consists of
gazillions of fake nodes.
* Re: [PATCH] GFP_THISNODE for the slab allocator
2006-09-17 4:37 ` Christoph Lameter
@ 2006-09-17 4:55 ` Andrew Morton
2006-09-17 12:09 ` Paul Jackson
2006-09-17 12:36 ` Christoph Lameter
2006-09-19 19:17 ` David Rientjes
1 sibling, 2 replies; 82+ messages in thread
From: Andrew Morton @ 2006-09-17 4:55 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Paul Jackson, linux-mm, rientjes
On Sat, 16 Sep 2006 21:37:04 -0700 (PDT)
Christoph Lameter <clameter@sgi.com> wrote:
> On Sat, 16 Sep 2006, Andrew Morton wrote:
>
> > I don't see how any of this could help. If one has a memory container
> > which is constructed from 50 zones, that linear search is just going to do
> > a lot of linear searching when the container approaches anything like
> > fullness.
>
> One would not construct a memory container from 50 zones but build a
> single zone as a memory container of that size.
>
> This could work by creating a new fake node and allocating a certain
> amount of memory from the old zone for the fake node. Then one would have
> a zone that is the container and not a container that consists of
> gazillions of fake nodes.
Well yes, there are various things one could do if one wanted to make lots
of kernel changes. I believe Magnus posted some patches along these lines
a while back.
But it's not clear that we _need_ to make such changes.
nodes-as-containers works OK out-of-the-box. Apart from the fact that
get_page_from_freelist() sucks. And speeding that up will speed up other
workloads.
Would prefer to make the kernel faster, rather than more complex...
* Re: [PATCH] GFP_THISNODE for the slab allocator
2006-09-16 15:18 ` Andrew Morton
@ 2006-09-17 9:28 ` Paul Jackson
2006-09-17 9:51 ` Nick Piggin
2006-09-17 16:29 ` [PATCH] GFP_THISNODE for the slab allocator Andrew Morton
0 siblings, 2 replies; 82+ messages in thread
From: Paul Jackson @ 2006-09-17 9:28 UTC (permalink / raw)
To: Andrew Morton; +Cc: clameter, linux-mm, rientjes, ak
Andrew wrote:
> Could cache a zone* and the cpu number. If the cpu number has changed
> since last time, do another lookup.
Hmmm ... getting closer. This doesn't work as stated, because
consecutive requests to allocate a page could use different zonelists,
perhaps from MPOL_BIND, while still coming from the same cpu number.
The cached zone* would be in the wrong zonelist in that case.
How about two struct zone pointers in the task struct?
One caching the zonelist pointer passed into get_page_from_freelist(),
and the other caching the pointer you've been suggesting all along,
to the zone where we found free memory last time we looked.
If that same task tries to allocate a page with a different zonelist
then we fall back to a brute force lookup and reset the cached state.
(Note to self) The cpuset_update_task_memory_state() routine will
have to zap these two cached zone pointers. That's easy.
Also, as you noticed earlier, we need a way to notice if a once full
zone that we've been skipping over gets some more free memory.
One way to do that would be to add one more (a third) zone* to the
task struct. This third zone* would point to the next zone to retry
for free memory.
Once each time we call get_page_from_freelist(), we'd retry one
zone, to see if it had gained some free memory.
If it still had no free memory, increment the retry pointer,
wrapping when it got up to the zone* we were currently getting
memory from.
If we discovered some new free memory on the retried node, then
start using that zone* instead of the one we were using.
Now we're up to three zone* pointers in the task struct:
base -- the base zonelist pointer passed to get_page_from_freelist()
cur -- the current zone we're getting memory from
retry -- the next zone to recheck for free memory
If we make the cur and retry pointers be 32 bit indices, instead of
pointers, this saves 64 bits in the task struct on 64 bit arch's.
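Spelled out as a standalone sketch (all names hypothetical - pgalloc_cache,
zone_has_memory - and this is only an illustration of the base/cur/retry
bookkeeping, not proposed kernel code):

struct fake_zonelist;                   /* opaque stand-in for a zonelist */

/* Hypothetical per-task cached state: one pointer plus two 32 bit indices. */
struct pgalloc_cache {
        struct fake_zonelist *base;     /* zonelist the indices refer to */
        unsigned int cur;               /* zone we currently allocate from */
        unsigned int retry;             /* next zone to recheck for free memory */
};

/* Reset when the zonelist changes, or from cpuset_update_task_memory_state(). */
void pgalloc_cache_reset(struct pgalloc_cache *c, struct fake_zonelist *zl)
{
        c->base = zl;
        c->cur = 0;
        c->retry = 0;
}

/* Stand-in for whatever "does this zone have free memory" test is used. */
int zone_has_memory(struct fake_zonelist *zl, unsigned int idx);

/*
 * Called once per get_page_from_freelist(): recheck one formerly full zone
 * ahead of 'cur'; if it has memory again, start using it, otherwise advance
 * the retry cursor, wrapping before we reach the zone we allocate from now.
 */
void pgalloc_cache_retry_one(struct pgalloc_cache *c)
{
        if (c->cur == 0)                /* already using the first zone */
                return;
        if (zone_has_memory(c->base, c->retry)) {
                c->cur = c->retry;      /* a closer zone freed up; use it */
                c->retry = 0;
        } else if (++c->retry >= c->cur) {
                c->retry = 0;           /* wrap before the current zone */
        }
}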
Calls to get_page_from_freelist() with GFP_HARDWALL -not- set, and
those with ALLOC_CPUSET -not- set, must bypass this cached state.
The micro-optimizations I had in mind to the cpuset_zone_allowed()
call from get_page_from_freelist() are probably still worth doing,
as that code path, from a linear search of the zonelist, is still
necessary in various situations.
How does this sound?
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
* Re: [PATCH] GFP_THISNODE for the slab allocator
2006-09-17 9:28 ` Paul Jackson
@ 2006-09-17 9:51 ` Nick Piggin
2006-09-17 11:15 ` Paul Jackson
2006-09-17 16:29 ` [PATCH] GFP_THISNODE for the slab allocator Andrew Morton
1 sibling, 1 reply; 82+ messages in thread
From: Nick Piggin @ 2006-09-17 9:51 UTC (permalink / raw)
To: Paul Jackson; +Cc: Andrew Morton, clameter, linux-mm, rientjes, ak
Paul Jackson wrote:
> Andrew wrote:
>
>>Could cache a zone* and the cpu number. If the cpu number has changed
>>since last time, do another lookup.
>
>
> Hmmm ... getting closer. This doesn't work as stated, because
> consecutive requests to allocate a page could use different zonelists,
> perhaps from MPOL_BIND, while still coming from the same cpu number.
> The cached zone* would be in the wrong zonelist in that case.
>
> How about two struct zone pointers in the task struct?
>
> One caching the zonelist pointer passed into get_page_from_freelist(),
> and the other caching the pointer you've been suggesting all along,
> to the zone where we found free memory last time we looked.
>
> If that same task tries to allocate a page with a different zonelist
> then we fallback to a brute force lookup and reset the cached state.
>
> (Note to self) The cpuset_update_task_memory_state() routine will
> have to zap these two cached zone pointers. That's easy.
>
>
> Also, as you noticed earlier, we need a way to notice if a once full
> zone that we've been skipping over gets some more free memory.
>
> One way to do that would be to add one more (a third) zone* to the
> task struct. This third zone* would point to the next zone to retry
> for free memory.
>
> Once each time we call get_page_from_freelist(), we'd retry one
> zone, to see if it had gained some free memory.
>
> If it still had no free memory, increment the retry pointer,
> wrapping when it got up to the zone* we were currently getting
> memory from.
>
> If we discovered some new free memory on the retried node, then
> start using that zone* instead of the one we were using.
>
>
> Now we're up to three zone* pointers in the task struct:
> base -- the base zonelist pointer passed to get_page_from_freelist()
> cur -- the current zone we're getting memory from
> retry -- the next zone to recheck for free memory
>
> If we make the cur and retry pointers be 32 bit indices, instead of
> pointers, this saves 64 bits in the task struct on 64 bit arch's.
>
> Calls to get_page_from_freelist() with GFP_HARDWALL -not- set, and
> those with ALLOC_CPUSET -not- set, must bypass this cached state.
>
> The micro-optimizations I had in mind to the cpuset_zone_allowed()
> call from get_page_from_freelist() are probably still worth doing,
> as that code path, from a linear search of the zonelist, is still
> necessary in various situations.
>
> How does this sound?
>
Too complex? ;) Why not just start with caching the first allowed
zone and see how far that gets you?
With respect to a new design, there have been various noises about
using nodemask bits to specify the node to allocate from; I wonder
what happened with that? Your cpuset code would end up being
something like a bitwise AND over a fairly small bit of memory
(even for hundreds of nodes/containers).
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
* Re: [PATCH] GFP_THISNODE for the slab allocator
2006-09-17 9:51 ` Nick Piggin
@ 2006-09-17 11:15 ` Paul Jackson
2006-09-17 12:44 ` Nick Piggin
0 siblings, 1 reply; 82+ messages in thread
From: Paul Jackson @ 2006-09-17 11:15 UTC (permalink / raw)
To: Nick Piggin; +Cc: akpm, clameter, linux-mm, rientjes, ak
Nick wrote:
> Too complex? ;)
I quite agree it looks more complex than we wanted.
> Why not just start with caching the first allowed
> zone and see how far that gets you?
I thought I had explained clearly why that doesn't work.
I'll try again.
I am presuming here that by 'first allowed zone' you are
referring by yet another phrase to what Andrew has called
'most-recently-allocated-from zone', and what I described with:
cur -- the current zone we're getting memory from
If that presumption is wrong, then my reply following is bogus,
and you'll have to explain what you meant.
I can't just cache this zone, because I at least have to also cache
something else, such as the zonelist I found that zone within, so
I know not to use that cached zone if I am later passed a different
zonelist.
So I need to cache at least two zone pointers, the base zonelist and
the first allowed zone.
Then I do need to do something to avoid using that cached zone
long after some closer zone gets some free memory again. Caching a
revolving retry zone pointer is one way to do that. Perhaps there
are simpler ways ... I'm open to suggestions.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
* Re: [PATCH] GFP_THISNODE for the slab allocator
2006-09-17 3:45 ` David Rientjes
@ 2006-09-17 11:17 ` Paul Jackson
2006-09-17 12:41 ` Christoph Lameter
0 siblings, 1 reply; 82+ messages in thread
From: Paul Jackson @ 2006-09-17 11:17 UTC (permalink / raw)
To: David Rientjes; +Cc: akpm, clameter, linux-mm
David wrote:
> In 2G container:
> 10599 __cpuset_zone_allowed 50.4714
> 3521 mwait_idle 45.1410
> 1149 clear_page 20.1579
> ....
>
> In 2G cpuset with Christoph's patch:
> 9232 __cpuset_zone_allowed 43.9619
> 2083 mwait_idle 26.7051
> 973 clear_page 17.0702
There happened to be fewer calls to __cpuset_zone_allowed in the
second test (thanks for doing this!). If I divide that out, to
get the cost per call, it's
original test: 10599/50.4714 == 210.00011
christoph patch: 9232/43.9619 == 210.00002
That's -extremely- close.
Aha - notice the following code in kernel/cpuset.c:
int __cpuset_zone_allowed(struct zone *z, gfp_t gfp_mask)
{
        int node;                       /* node that zone z is on */
        ...
        node = z->zone_pgdat->node_id;
Looks like an open coded zone_to_nid() invocation that wasn't
addressed by Christoph's patch.
Tsk tsk ... shame on whoever open coded that one ;).
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
* Re: [PATCH] GFP_THISNODE for the slab allocator
2006-09-17 4:55 ` Andrew Morton
@ 2006-09-17 12:09 ` Paul Jackson
2006-09-17 12:36 ` Christoph Lameter
1 sibling, 0 replies; 82+ messages in thread
From: Paul Jackson @ 2006-09-17 12:09 UTC (permalink / raw)
To: Andrew Morton; +Cc: clameter, linux-mm, rientjes
Andrew wrote:
> Apart from the fact that
> get_page_from_freelist() sucks. And speeding that up will speed up other
> workloads.
With the exception of this fake NUMA trick, I still haven't seen any
evidence that there is a performance issue here. Sure, there is the
potential for painful linear searches here, but in actual use, that's
not what happens.
A linear search is not linear in performance if the search is heavily
biased toward succeeding on the first few probes.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
* Re: [PATCH] GFP_THISNODE for the slab allocator
2006-09-17 4:55 ` Andrew Morton
2006-09-17 12:09 ` Paul Jackson
@ 2006-09-17 12:36 ` Christoph Lameter
2006-09-17 13:06 ` Paul Jackson
1 sibling, 1 reply; 82+ messages in thread
From: Christoph Lameter @ 2006-09-17 12:36 UTC (permalink / raw)
To: Andrew Morton; +Cc: Paul Jackson, linux-mm, rientjes
On Sat, 16 Sep 2006, Andrew Morton wrote:
> Well yes, there are various things one could do if one wanted to make lots
> of kernel changes. I believe Magnus posted some patches along these lines
> a while back.
I doubt that there would be many kernel changes. This follows straight
from the ability to do node hot plug.
> But it's not clear that we _need_ to make such changes.
> nodes-as-containers works OK out-of-the-box. Apart from the fact that
> get_page_from_freelist() sucks. And speeding that up will speed up other
> workloads.
What you are doing is using nodes to partition memory into small chunks
that are then collected in a cpuset. That is not how nodes
or cpusets were designed to work.
> Would prefer to make the kernel faster, rather than more complex...
Fewer nodes and fewer zones mean shorter zonelists, and therefore a
faster kernel, since there is less list to traverse.
* Re: [PATCH] GFP_THISNODE for the slab allocator
2006-09-17 11:17 ` Paul Jackson
@ 2006-09-17 12:41 ` Christoph Lameter
2006-09-17 13:03 ` Paul Jackson
0 siblings, 1 reply; 82+ messages in thread
From: Christoph Lameter @ 2006-09-17 12:41 UTC (permalink / raw)
To: Paul Jackson; +Cc: David Rientjes, akpm, linux-mm
On Sun, 17 Sep 2006, Paul Jackson wrote:
> Aha - notice the following code in kernel/cpuset.c:
>
> int __cpuset_zone_allowed(struct zone *z, gfp_t gfp_mask)
> {
> int node; /* node that zone z is on */
> ...
> node = z->zone_pgdat->node_id;
>
> Looks like an open coded zone_to_nid() invocation that wasn't
> addressed by Christoph's patch.
>
> Tsk tsk ... shame on whomever open coded that one ;).
Are you sure that you are looking at a current tree? This is zone_to_nid
here.
* Re: [PATCH] GFP_THISNODE for the slab allocator
2006-09-17 11:15 ` Paul Jackson
@ 2006-09-17 12:44 ` Nick Piggin
2006-09-17 13:19 ` Paul Jackson
0 siblings, 1 reply; 82+ messages in thread
From: Nick Piggin @ 2006-09-17 12:44 UTC (permalink / raw)
To: Paul Jackson; +Cc: akpm, clameter, linux-mm, rientjes, ak
Paul Jackson wrote:
> Nick wrote:
>
>>Too complex? ;)
>
>
> I quite agree it looks more complex than we wanted.
>
>
>
>>Why not just start with caching the first allowed
>>zone and see how far that gets you?
>
>
> I thought I had explained clearly why that doesn't work.
>
> I'll try again.
>
> I am presuming here that by 'first allowed zone' you are
> referring by yet another phrase to what Andrew has called
> 'most-recently-allocated-from zone', and what I described with:
>
> cur -- the current zone we're getting memory from
>
> If that presumption is wrong, then my reply following is bogus,
> and you'll have to explain what you meant.
>
> I can't just cache this zone, because I at least have to also cache
> something else, such as the zonelist I found that zone within, so
> I know not to use that cached zone if I am later passed a different
> zonelist.
>
> So I need to cache at least two zone pointers, the base zonelist and
> the first allowed zone.
>
> Then I do need to do something to avoid using that cached zone
> long after some closer zone gets some free memory again. Caching a
> revolving retry zone pointer is one way to do that. Perhaps there
> are simpler ways ... I'm open to suggestions.
Oh no, I'm quite aware (and agree) that you'll _also_ need to cache
your zonelist. So I agree with you up to there.
The part of your suggestion that I think is too complex to worry about
initially, is worrying about full/low/high watermarks and skipping over
full zones in your cache.
The reason is that it will no longer be an identically functioning
cache, but would include heuristics where you fall back to checking
previously skipped zones at given intervals... I really hate having to
add a heuristic "magic" type of thing if we can avoid it.
So: just cache the *first* zone that the cpuset allows. If that is
full and we have to search subsequent zones, so be it. I hope it would
work reasonably well in the common case, though.
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
* Re: [PATCH] GFP_THISNODE for the slab allocator
2006-09-17 12:41 ` Christoph Lameter
@ 2006-09-17 13:03 ` Paul Jackson
2006-09-17 20:36 ` David Rientjes
0 siblings, 1 reply; 82+ messages in thread
From: Paul Jackson @ 2006-09-17 13:03 UTC (permalink / raw)
To: Christoph Lameter; +Cc: rientjes, akpm, linux-mm
Christoph wrote:
> Are you sure that you are looking at a current tree? This is zone_to_nid
> here.
You're two steps ahead of me. Yes, it's zone_to_nid() in the
current tree.
So ... any idea why your patch made only 0.000042%
difference in the cost per call of __cpuset_zone_allowed()?
That is bizarrely close to zero.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
* Re: [PATCH] GFP_THISNODE for the slab allocator
2006-09-17 12:36 ` Christoph Lameter
@ 2006-09-17 13:06 ` Paul Jackson
0 siblings, 0 replies; 82+ messages in thread
From: Paul Jackson @ 2006-09-17 13:06 UTC (permalink / raw)
To: Christoph Lameter; +Cc: akpm, linux-mm, rientjes
Christoph wrote:
> What you are doing is using nodes to partition memory into small chunks
> that are then collected in a cpuset. That is not the way how nodes
> or cpusets were designed to work.
Agreed, and a tad frustrating.
However, in Andrew's favor, he has discovered that despite our best
design efforts, this node/cpuset/... stuff actually does work "out
of the box" when (ab)used in this strange fashion.
Except for one performance glitch, where a loop in the routine
get_page_from_freelist() loops too many times, he's got the makings
of memory containers 'for free.'
If he could just get the cost of that loop in this fake NUMA setup
down from linear in the number of filled-up fake nodes to a small
constant, he'd be good to go with this new mechanism.
The only useful part of this debate is how many words of cached data
per something (cpu, cpuset, node, zonelist, task, ...) it will take
to get this loop cost back to a small constant, even when presented
with such a sorry excuse for a zonelist.
Andrew started at one word; I started at MAX_NUMNODES words. I've got
him up to two words; he has me down to three words.
We're converging fast.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
* Re: [PATCH] GFP_THISNODE for the slab allocator
2006-09-17 12:44 ` Nick Piggin
@ 2006-09-17 13:19 ` Paul Jackson
2006-09-17 13:52 ` Nick Piggin
0 siblings, 1 reply; 82+ messages in thread
From: Paul Jackson @ 2006-09-17 13:19 UTC (permalink / raw)
To: Nick Piggin; +Cc: akpm, clameter, linux-mm, rientjes, ak
Nick wrote:
> Oh no, I'm quite aware (and agree) that you'll _also_ need to cache
> your zonelist. So I agree with you up to there.
Ah - good. Sorry for my misreading.
> The part of your suggestion that I think is too complex to worry about
> initially, is worrying about full/low/high watermarks and skipping over
> full zones in your cache.
Now I'm confused again. I wasn't aware of giving the slightest
consideration to full/low/high watermarks in this design.
Could you quote the portion of my design in which you found this
consideration of watermarks?
I apparently did a lousy job of explaining something, and I'm not
even sure what part of my design I so messed up.
> So: just cache the *first* zone that the cpuset allows. If that is
> full and we have to search subsequent zones, so be it. I hope it would
> work reasonably well in the common case, though.
Well, hoping that I'm not misreading again, this seems like it won't
help. In the case that Andrew and David present, the cpuset was
allowing pretty much every node (60 of 64, IIRC). The performance
problem came in skipping over the nodes that David's test filled up
with a memory hog, to get to a node that still had memory it could
provide to the task running the kernel build.
So I don't think that it's finding the first node allowed by the
cpuset that is the painful part here. I think it is finding the
first node that still has any free memory pages.
So I'm pretty sure that I have to cache the first node that isn't
full. And I'm pretty sure that's what Andrew has been asking for
consistently.
Either I'm misreading your suggestion to "Just cache the *first* zone
that the cpuset allows", or else the two of us have come away with
a very different understanding of this thread so far.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
* Re: [PATCH] GFP_THISNODE for the slab allocator
2006-09-17 13:19 ` Paul Jackson
@ 2006-09-17 13:52 ` Nick Piggin
2006-09-17 21:19 ` Paul Jackson
2006-09-18 12:44 ` [PATCH] mm: exempt pcp alloc from watermarks Peter Zijlstra
0 siblings, 2 replies; 82+ messages in thread
From: Nick Piggin @ 2006-09-17 13:52 UTC (permalink / raw)
To: Paul Jackson; +Cc: akpm, clameter, linux-mm, rientjes, ak
Paul Jackson wrote:
> Nick wrote:
>>The part of your suggestion that I think is too complex to worry about
>>initially, is worrying about full/low/high watermarks and skipping over
>>full zones in your cache.
>
>
> Now I'm confused again. I wasn't aware of giving the slightest
> consideration to full/low/high watermarks in this design.
>
> Could you quote the portion of my design in which you found this
> consideration of watermarks?
So that's the part where you wanted to see if a zone has any free
memory pages. What you are doing is not actually seeing if a zone
has _any_ free memory pages, but testing whether a given allocation
type is within its corresponding watermarks.
Among other things, these watermarks depend on whether the
allocation is GFP_WAIT or not, and GFP_HIGH or not... so either
you'll be invalidating your cache all the time or you won't obey
the watermarks very well.
Not only that, but you'll sometimes not allocate from more
preferable zones that now have enough pages but previously didn't.
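To make that concrete, a toy model of the flag-dependent threshold Nick is
describing. The flag names and divisors below are illustrative only (the
real logic lives in zone_watermark_ok()); the point is just that the same
zone can pass for one allocation type and fail for another:

#include <stdbool.h>

#define MODEL_ALLOC_HIGH        0x1     /* higher priority allocation */
#define MODEL_ALLOC_HARDER      0x2     /* allowed to dig deeper into reserves */

/*
 * Toy watermark test: the effective minimum shrinks for allocations that
 * may dip into reserves, so "this zone is full" is not a single cacheable
 * fact independent of the gfp flags.
 */
bool model_watermark_ok(unsigned long free_pages, unsigned long mark, int flags)
{
        long min = mark;

        if (flags & MODEL_ALLOC_HIGH)
                min -= min / 2;
        if (flags & MODEL_ALLOC_HARDER)
                min -= min / 4;

        return (long)free_pages > min;
}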
>>So: just cache the *first* zone that the cpuset allows. If that is
>>full and we have to search subsequent zones, so be it. I hope it would
>>work reasonably well in the common case, though.
>
>
> Well, hoping that I'm not misreading again, this seems like it won't
> help. In the case that Andrew and David present, the cpuset was
> allowing pretty much every node (60 of 64, IIRC). The performance
> problem came in skipping over the nodes that David's test filled up
> with a memory hog, to get to a node that still had memory it could
> provide to the task running the kernel build.
>
> So I don't think that it's finding the first node allowed by the
> cpuset that is the painful part here. I think it is finding the
> first node that still has any free memory pages.
>
> So I'm pretty sure that I have to cache the first node that isn't
> full. And I'm pretty sure that's what Andrew has been asking for
> consistently.
Yes I misunderstood the requirements. I thought it was when a small
number of nodes were allowed by the cpuset.
Hmm, if a large number of nodes are allowed by the cpuset, and
you're operating in low memory conditions, you're going to want to
do a reasonable amount of iterating over the zones anyway so you
can do things like kick each one's kswapd.
What we could do then, is allocate pages in batches (we already do),
but only check watermarks if we have to go to the buddy allocator
(we don't currently do this, but really should anyway, considering
that the watermark checks are based on pages in the buddy allocator
rather than pages in buddy + pcp).
At that point, having a cache of the last pcp to have free pages
might be an option: any pcp pages are fair game because they are
already allocated from the POV of the watermark checking / kswapd
kicking.
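A rough standalone sketch of that ordering. The names are hypothetical
(toy_pcp, refill_from_buddy), and refill_from_buddy() is a stub standing
in for the real refill path where the watermark checks and kswapd kicking
would happen:

#include <stddef.h>

#define PCP_BATCH 16

/* Toy per-cpu page stash, illustration only. */
struct toy_pcp {
        void *pages[PCP_BATCH];
        int count;
};

/* Stub: batched refill from the buddy allocator, watermarks checked here. */
int refill_from_buddy(struct toy_pcp *pcp);

/*
 * Hand out pages already sitting in the pcp without any watermark test
 * (they are already allocated as far as the watermarks are concerned);
 * only when the stash is empty do the checking and refill, in one batch.
 */
void *alloc_one_page(struct toy_pcp *pcp)
{
        if (pcp->count == 0 && !refill_from_buddy(pcp))
                return NULL;            /* genuinely out of memory here */

        return pcp->pages[--pcp->count];
}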
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
* Re: [PATCH] GFP_THISNODE for the slab allocator
2006-09-17 9:28 ` Paul Jackson
2006-09-17 9:51 ` Nick Piggin
@ 2006-09-17 16:29 ` Andrew Morton
2006-09-18 2:11 ` Paul Jackson
1 sibling, 1 reply; 82+ messages in thread
From: Andrew Morton @ 2006-09-17 16:29 UTC (permalink / raw)
To: Paul Jackson; +Cc: clameter, linux-mm, rientjes, ak
On Sun, 17 Sep 2006 02:28:34 -0700
Paul Jackson <pj@sgi.com> wrote:
> Andrew wrote:
> > Could cache a zone* and the cpu number. If the cpu number has changed
> > since last time, do another lookup.
>
> Hmmm ... getting closer. This doesn't work as stated, because
> consecutive requests to allocate a page could use different zonelists,
> perhaps from MPOL_BIND, while still coming from the same cpu number.
> The cached zone* would be in the wrong zonelist in that case.
>
> How about two struct zone pointers in the task struct?
>
> One caching the zonelist pointer passed into get_page_from_freelist(),
> and the other caching the pointer you've been suggesting all along,
> to the zone where we found free memory last time we looked.
>
> If that same task tries to allocate a page with a different zonelist
> then we fallback to a brute force lookup and reset the cached state.
>
> (Note to self) The cpuset_update_task_memory_state() routine will
> have to zap these two cached zone pointers. That's easy.
>
>
> Also, as you noticed earlier, we need a way to notice if a once full
> zone that we've been skipping over gets some more free memory.
Do we?
I don't believe that we need to do this in the nodes-for-containers
application. If the machine is UMA then all nodes are equal. If the
machine is physically NUMA then one hopes that the administrator will
construct the container cpusets in a way which minimises off-node traffic,
but I haven't thought about that aspect of it much.
Perhaps we do need to do this in the legacy (;)) cpuset application?
Perhaps what happens is that the "first" memory nodes in a cpuset are those
which are "closest" to that cpuset's CPUs??
IOW: in which operational scenarios and configurations would you view this
go-back-to-the-earlier-zone-if-some-memory-came-free-in-it approach
to be needed?
> One way to do that would be to add one more (a third) zone* to the
> task struct. This third zone* would point to the next zone to retry
> for free memory.
>
> Once each time we call get_page_from_freelist(), we'd retry one
> zone, to see if it had gained some free memory.
>
> If it still had no free memory, increment the retry pointer,
> wrapping when it got up to the zone* we were currently getting
> memory from.
>
> If we discovered some new free memory on the retried node, then
> start using that zone* instead of the one we were using.
>
>
> Now we're up to three zone* pointers in the task struct:
> base -- the base zonelist pointer passed to get_page_from_freelist()
> cur -- the current zone we're getting memory from
> retry -- the next zone to recheck for free memory
>
> If we make the cur and retry pointers be 32 bit indices, instead of
> pointers, this saves 64 bits in the task struct on 64 bit arch's.
>
> Calls to get_page_from_freelist() with GFP_HARDWALL -not- set, and
> those with ALLOC_CPUSET -not- set, must bypass this cached state.
>
> The micro-optimizations I had in mind to the cpuset_zone_allowed()
> call from get_page_from_freelist() are probably still worth doing,
> as that code path, from a linear search of the zonelist, is still
> necessary in various situations.
>
> How does this sound?
Getting there. I don't think we're particularly worried about
CONFIG_NUMA-only space in the task_struct, btw..
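A minimal sketch of the caching scheme being discussed, in kernel-style C.
Everything except struct task_struct, struct zonelist and
get_page_from_freelist() is invented here for illustration; the real
patches may end up looking quite different.

#include <linux/mmzone.h>
#include <linux/sched.h>

/* Hypothetical cache carried in struct task_struct (field names invented). */
struct alloc_zone_cache {
	struct zonelist *base;	/* zonelist these indices are valid for */
	int cur;		/* index of the zone we last got memory from */
	int retry;		/* next previously-full zone to recheck */
};

/*
 * Sketch of a fast path at the top of get_page_from_freelist(): reuse the
 * cached zone when the caller passes the same zonelist as last time,
 * otherwise reset the cache and fall back to the normal linear scan.
 * The retry index would be advanced once per call elsewhere, as described
 * in the quoted proposal.
 */
static struct zone *cached_alloc_zone(struct zonelist *zonelist)
{
	struct alloc_zone_cache *c = &current->alloc_zone_cache; /* hypothetical field */

	if (c->base != zonelist) {
		/* e.g. an MPOL_BIND zonelist: invalidate, do the full scan */
		c->base = zonelist;
		c->cur = 0;
		c->retry = 0;
		return NULL;
	}
	return zonelist->zones[c->cur];
}

Callers without GFP_HARDWALL or without ALLOC_CPUSET would bypass the
cache entirely, as noted in the quoted proposal.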
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH] GFP_THISNODE for the slab allocator
2006-09-17 13:03 ` Paul Jackson
@ 2006-09-17 20:36 ` David Rientjes
2006-09-17 21:20 ` Paul Jackson
2006-09-17 22:27 ` Paul Jackson
0 siblings, 2 replies; 82+ messages in thread
From: David Rientjes @ 2006-09-17 20:36 UTC (permalink / raw)
To: Paul Jackson; +Cc: Christoph Lameter, akpm, linux-mm, rientjes
On Sun, 17 Sep 2006, Paul Jackson wrote:
> You're two steps ahead of me. Yes, it's zone_to_nid() in the
> current tree.
>
> So ... any idea why your patch made only 0.000042%
> difference in the cost per call of __cpuset_zone_allowed()?
>
> That is bizarrely close to zero.
>
The reason why is because I'm an idiot. I, too, was looking at an old
tree where z->zone_pgdat->node_id still existed in __cpuset_zone_allowed.
I changed it, along with another reference in
cpuset_zonelist_valid_mems_allowed, to use z->node.
unpatched patched
real 5m23.057s 5m9.831s
user 9m13.395s 9m14.855s
sys 1m15.417s 0m55.007s
Unpatched:
10599 __cpuset_zone_allowed 50.4714
3521 mwait_idle 45.1410
1149 clear_page 20.1579
24 clear_page_end 3.4286
215 find_get_page 3.0282
110 pfn_to_page 2.3913
130 __down_read_trylock 1.9697
86 page_remove_rmap 1.9545
150 find_vma 1.7241
46 __strnlen_user 1.1795
32 nr_free_pages 1.1034
55 page_to_pfn 1.0784
22 page_add_file_rmap 1.0000
829 get_page_from_freelist 0.8904
So __cpuset_zone_allowed is 10599/50.4714 = 210.000118879.
Patched:
5822 __cpuset_zone_allowed 29.1100
1680 mwait_idle 21.5385
1046 clear_page 18.3509
215 find_get_page 3.0282
15 clear_page_end 2.1429
86 page_remove_rmap 1.9545
81 pfn_to_page 1.7609
116 __down_read_trylock 1.7576
132 find_vma 1.5172
20 fput 1.1765
43 __strnlen_user 1.1026
53 page_to_pfn 1.0392
22 page_add_file_rmap 1.0000
804 get_page_from_freelist 0.8636
So __cpuset_zone_allowed is 5822/29.1100 = 200.000000000 which is 4.8%
faster.
Note: both versions also include my numa=fake fixes that are not yet in
mm (which are necessary for me to even boot my machine with numa=fake=64).
David
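For readers following along, the change being timed here is of roughly
this shape; the helper names are invented purely to show the before and
after, the actual diff being a one-line substitution in each of the two
cpuset functions mentioned above.

#include <linux/mmzone.h>

#ifdef CONFIG_NUMA
/* before: chase the pgdat pointer to find the zone's node id */
static inline int zone_nid_old(struct zone *z)
{
	return z->zone_pgdat->node_id;
}

/*
 * after: read the node id cached in struct zone itself (zone_to_nid()
 * in the current tree), avoiding the extra pointer chase per zone checked
 */
static inline int zone_nid_new(struct zone *z)
{
	return z->node;
}
#endif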
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH] GFP_THISNODE for the slab allocator
2006-09-17 13:52 ` Nick Piggin
@ 2006-09-17 21:19 ` Paul Jackson
2006-09-18 12:44 ` [PATCH] mm: exempt pcp alloc from watermarks Peter Zijlstra
1 sibling, 0 replies; 82+ messages in thread
From: Paul Jackson @ 2006-09-17 21:19 UTC (permalink / raw)
To: Nick Piggin; +Cc: akpm, clameter, linux-mm, rientjes, ak
Nick wrote:
> So that's the part where you wanted to see if a zone has any free
> memory pages. What you are doing is not actually seeing if a zone
> has _any_ free memory pages, but testing whether a given allocation
> type is within its corresponding watermarks.
Ah - now your point makes sense. Yes, I should not have been looking
for _any_ free memory, but rather for memory satisfying the watermark
and other conditions of the current request.
And the question of whether the cached 'base' and 'cur' pointers should
be invalidated every time a request has differing watermarks ... well I
can think of several answers to that question ... all sucky.
However ... Andrew has prodded me into some more simplification, which
will toss this 'retry' pointer in the ash heap of history. See my
upcoming reply to his latest post.
I trust you will not mind that this 'retry' pointer gets thrown out ;).
Nick wrote:
> What we could do then, is allocate pages in batches (we already do),
> but only check watermarks if we have to go to the buddy allocator
> (we don't currently do this, but really should anyway, considering
> that the watermark checks are based on pages in the buddy allocator
> rather than pages in buddy + pcp).
I'll have to leave this matter to you. It's not something I understand
well enough to be useful.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH] GFP_THISNODE for the slab allocator
2006-09-17 20:36 ` David Rientjes
@ 2006-09-17 21:20 ` Paul Jackson
2006-09-17 22:27 ` Paul Jackson
1 sibling, 0 replies; 82+ messages in thread
From: Paul Jackson @ 2006-09-17 21:20 UTC (permalink / raw)
To: David Rientjes; +Cc: clameter, akpm, linux-mm
David wrote:
> So __cpuset_zone_allowed is 5822/29.1100 = 200.000000000 which is 4.8%
> faster.
Excellent - that sounds reasonable. Thanks.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH] GFP_THISNODE for the slab allocator
2006-09-17 20:36 ` David Rientjes
2006-09-17 21:20 ` Paul Jackson
@ 2006-09-17 22:27 ` Paul Jackson
2006-09-17 23:49 ` David Rientjes
1 sibling, 1 reply; 82+ messages in thread
From: Paul Jackson @ 2006-09-17 22:27 UTC (permalink / raw)
To: David Rientjes; +Cc: clameter, akpm, linux-mm
David,
Could you run the following on your fake numa booted box, and
report the results:
find /sys/devices -name distance | xargs head
Following Andrew's suggestion, I'm toying with the idea that since
one fake numa node is as good as another, there is no reason to worry
about retrying skipped over nodes or re-validating the cached zones
on such systems.
Roughly, my plan is:
If the node on which we most recently found memory is 'just as
good as' the first node in the zonelist, then go ahead and cache
that node and continue to use it as long as we can. We're in
the fake NUMA case, and one node is as good as another.
If that node is 'further away' than the first node in the zonelist,
don't cache it. We're in the real NUMA case, and we're happy to
carry on just as we have in the past, always scanning from the
beginning of the zonelist.
However this requires some way to determine whether two fake nodes
are really on the same hardware node.
Hmmm ... there's a good chance that the kernel 'node_distance()'
routine, as shown in the above /sys/devices distance table, is not
the way to determine this. Perhaps that table must reflect the
fake reality, not the underlying hardware reality.
Though, if node_distance() doesn't tell us this, there's a chance
this will cause problems elsewhere and we will end up wanting to
fix node_distance() in the fake NUMA case to note that all nodes
are actually local, which is value 10, I believe. The code in
arch/x86_64/mm/srat.c:slit_valid() may conflict with this, and the
concerns its comment raises about a SLIT table with all 10's may also
point to conflicts with this.
You've been looking at this fake NUMA code recently, David.
Perhaps you can recommend some other way from within the
mm/page_alloc.c code to efficiently (just a couple cache lines)
answer the question:
Given two node numbers, are they really just two fake nodes
on the same hardware node, or are they really on two distinct
hardware nodes?
Granted, I'm not -entirely- following Andrew's lead here. He's been
hoping that this most-recently-used-node cache would benefit both
fake and real NUMA systems, while I've thinking we don't really have
a problem on the real NUMA systems, and it is better not to mess with
the memory allocation pattern there (if it ain't broke, don't fix ...)
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH] GFP_THISNODE for the slab allocator
2006-09-17 22:27 ` Paul Jackson
@ 2006-09-17 23:49 ` David Rientjes
2006-09-18 2:20 ` Paul Jackson
0 siblings, 1 reply; 82+ messages in thread
From: David Rientjes @ 2006-09-17 23:49 UTC (permalink / raw)
To: Paul Jackson; +Cc: clameter, akpm, linux-mm, rientjes
On Sun, 17 Sep 2006, Paul Jackson wrote:
> David,
>
> Could you run the following on your fake numa booted box, and
> report the results:
>
> find /sys/devices -name distance | xargs head
>
With NUMA emulation, the distance from a node to itself is 10 and the
distance to all other fake nodes is 20.
So for numa=fake=4,
root@numa:/$ cat /sys/devices/system/node/node*/distance
10 20 20 20
20 10 20 20
20 20 10 20
20 20 20 10
> You've been looking at this fake NUMA code recently, David.
>
> Perhaps you can recommend some other way from within the
> mm/page_alloc.c code to efficiently (just a couple cache lines)
> answer the question:
>
> Given two node numbers, are they really just two fake nodes
> on the same hardware node, or are they really on two distinct
> hardware nodes?
>
The cpumap for every fake node is always 00000000, except for node0, which
reports the true hardware configuration.
Using the previous example,
root@numa:/$ cat /sys/devices/system/node/node*/cpumap
00000003
00000000
00000000
00000000
(Note: 00000003 because numa.* is a dual-core machine).
David
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH] GFP_THISNODE for the slab allocator
2006-09-17 16:29 ` [PATCH] GFP_THISNODE for the slab allocator Andrew Morton
@ 2006-09-18 2:11 ` Paul Jackson
2006-09-18 5:09 ` Andrew Morton
0 siblings, 1 reply; 82+ messages in thread
From: Paul Jackson @ 2006-09-18 2:11 UTC (permalink / raw)
To: Andrew Morton; +Cc: clameter, linux-mm, rientjes, ak
Andrew wrote:
> IOW: in which operational scenarios and configurations would you view this
> go-back-to-the-earlier-zone-if-some-memory-came-free-in-it approach
> to be needed?
On fake numa systems, I agree that going back to earlier zones is
not needed. As you have stated, all nodes are equally good on such
a system.
And besides, right now, I could not give you -any- operational scenario
in which the fake numa approach would be needed. Perhaps you have
some in mind ...? I'd be interested to learn how you view these fake
numa based memory containers being used.
On real numa systems, if we don't go back to earlier zones fairly
soon after it is possible to do so, then we are significantly changing
the memory placement behaviour of the system. That can be risky and is
better not done without good motivation.
If some app running for a while on one cpu, allowed to use memory
on several nodes, had its allocations temporarily pushed off its
local node, further down its zonelist, it might expect to have its
allocations go back to its local node, just by freeing up memory there.
Many of our most important HPC (High Performance Computing) apps rely
on what they call 'first touch' placement. That means to them that
memory will be allocated on the node associated with the allocating
thread, or on the closest node thereto. They will run massive jobs,
with sometimes just a few of the many threads in the job allocating
massive amounts of memory, by the simple expedient of controlling
on which cpu the allocator thread is running as it allocates by
touching the memory pages for the first time.
Their performance can depend critically on getting that memory
placement correct, so that the computational threads are, on average,
as close as can be to their data.
This is the sort of memory placement change that has a decent chance
of coming back around and biting me in the backside, a year or two
down the road, when some app that happened, perhaps unwittingly,
to be sensitive to this change, tripped over it.
I am certainly not saying for sure such a problem would arise.
Good programming practices would suggest not relying on such node
overflow to get memory placed. But good programming practices are
not always perfectly followed.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH] GFP_THISNODE for the slab allocator
2006-09-17 23:49 ` David Rientjes
@ 2006-09-18 2:20 ` Paul Jackson
2006-09-18 16:34 ` Paul Jackson
0 siblings, 1 reply; 82+ messages in thread
From: Paul Jackson @ 2006-09-18 2:20 UTC (permalink / raw)
To: David Rientjes; +Cc: clameter, akpm, linux-mm
David wrote:
> > Given two node numbers, are they really just two fake nodes
> > on the same hardware node, or are they really on two distinct
> > hardware nodes?
> >
>
> The cpumap for every fake node is always 00000000, except for node0, which
> reports the true hardware configuration.
Thanks.
I doubt that this is a unique signature identifying fake numa systems.
I've seen other systems that had memory-only nodes, which I suspect
would show up with all 000000 cpumaps on those nodes.
Perhaps we should add a hook to allow testing if we are running on
fake numa system:
For example, we could add a macro to a header that, in the case
CONFIG_NUMA_EMU was enabled, evaluated to 1 if numa emulation
was enabled. Currently, the true state of numa emulation only
seems to be known to code within:
arch/x86_64/mm/numa.c
and currently only available for x86_64 arch's and only available
if CONFIG_NUMA_EMU is enabled.
With the usual conditional macro header magic, we could make a
test for NUMA emulation available in generic kernel code.
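A sketch of what such a hook could look like, assuming the numa_fake
variable in arch/x86_64/mm/numa.c (mentioned later in this thread) were
made visible outside that file; the macro name is invented:

/* x86_64 side (sketch) */
#ifdef CONFIG_NUMA_EMU
extern int numa_fake;			/* set while parsing "numa=fake=N" */
#define numa_emulation_active()	(numa_fake != 0)
#endif

/* generic header (sketch): default to "no emulation" everywhere else */
#ifndef numa_emulation_active
#define numa_emulation_active()	0
#endif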
Andrew,
Do you have any plans to build a hybrid system with both real and
emulated NUMA present? That could complicate things.
My current notion is to have a simple modal switch:
- Fake numa systems would never try to go-back-to-the-earlier-zone.
- Real NUMA systems would not use this zone caching at all, always
scanning the zonelist from the front.
Such trivial modal behaviour wouldn't work on a hybrid system with
both real and emulated NUMA.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH] GFP_THISNODE for the slab allocator
2006-09-18 2:11 ` Paul Jackson
@ 2006-09-18 5:09 ` Andrew Morton
2006-09-18 7:49 ` Paul Jackson
0 siblings, 1 reply; 82+ messages in thread
From: Andrew Morton @ 2006-09-18 5:09 UTC (permalink / raw)
To: Paul Jackson; +Cc: clameter, linux-mm, rientjes, ak
On Sun, 17 Sep 2006 19:11:01 -0700
Paul Jackson <pj@sgi.com> wrote:
> Andrew wrote:
> > IOW: in which operational scenarios and configurations would you view this
> > go-back-to-the-earlier-zone-if-some-memory-came-free-in-it approach
> > to be needed?
>
> On fake numa systems, I agree that going back to earlier zones is
> not needed. As you have stated, all nodes are equally good on such
> a system.
>
> And besides, right now, I could not give you -any- operational scenario
> in which the fake numa approach would be needed. Perhaps you have
> some in mind ...? I'd be interested to learn how you view these fake
> numa based memory containers being used.
>
Confused. It's pretty obvious isn't it? To partition the machine with
jobA taking 10% of memory, jobB using 50% and jobC the rest. If any job
exceeds its allocation it gets subjected to page reclaim and disruption of
other jobs is minimised.
Crude, simple and, I hope, effective.
The beauty of this is that it 100% leverages the existing page reclaim
code. All the other containerisation approaches I've seen thus far are a
house of horrors in that area.
>
> On real numa systems, if we don't go back to earlier zones fairly
> soon after it is possible to do so, then we are significantly changing
> the memory placement behaviour of the system. That can be risky and is
> better not done without good motivation.
>
> If some app running for a while on one cpu, allowed to use memory
> on several nodes, had its allocations temporarily pushed off its
> local node, further down its zonelist, it might expect to have its
> allocations go back to its local node, just by freeing up memory there.
>
> Many of our most important HPC (High Performance Computing) apps rely
> on what they call 'first touch' placement. That means to them that
> memory will be allocated on the node associated with the allocating
> thread, or on the closest node thereto. They will run massive jobs,
> with sometimes just a few of the many threads in the job allocating
> massive amounts of memory, by the simple expedient of controlling
> on which cpu the allocator thread is running as it allocates by
> touching the memory pages for the first time.
>
> Their performance can depend critically on getting that memory
> placement correct, so that the computational threads are, on average,
> as close as can be to their data.
>
> This is the sort of memory placement change that has a decent chance
> of coming back around and biting me in the backside, a year or two
> down the road, when some app that happened, perhaps unwittingly,
> to be sensitive to this change, tripped over it.
>
> I am certainly not saying for sure such a problem would arise.
> Good programming practices would suggest not relying on such node
> overflow to get memory placed. But good programming practices are
> not always perfectly followed.
>
Yeah, I can certainly sympathise with that concern.
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH] GFP_THISNODE for the slab allocator
2006-09-18 5:09 ` Andrew Morton
@ 2006-09-18 7:49 ` Paul Jackson
0 siblings, 0 replies; 82+ messages in thread
From: Paul Jackson @ 2006-09-18 7:49 UTC (permalink / raw)
To: Andrew Morton; +Cc: clameter, linux-mm, rientjes, ak
Andrew wrote:
> Confused. It's pretty obvious isn't it?
Yes, it is obvious.
I've spent too many years disciplining myself to develop technology in
response to customer needs and situations, suppressing my instincts to
build neat stuff and trust that they will come.
The part of me that asked that question was expecting every techie
proposal to come with a "Customer Environment Impact Statement."
Meanwhile, in parallel, another part of my mind knew damn well what
you were up to, and was working overtime to make it so.
Yeah, I guess that could be confusing. Personally I find it amusing.
Brains are weird.
> Crude, simple and, I hope, effective.
Yup - sure looks that way. Good job.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
^ permalink raw reply [flat|nested] 82+ messages in thread
* [PATCH] mm: exempt pcp alloc from watermarks
2006-09-17 13:52 ` Nick Piggin
2006-09-17 21:19 ` Paul Jackson
@ 2006-09-18 12:44 ` Peter Zijlstra
2006-09-18 20:20 ` Christoph Lameter
2006-09-19 14:35 ` Nick Piggin
1 sibling, 2 replies; 82+ messages in thread
From: Peter Zijlstra @ 2006-09-18 12:44 UTC (permalink / raw)
To: Nick Piggin; +Cc: Paul Jackson, akpm, clameter, linux-mm, rientjes, ak
On Sun, 2006-09-17 at 23:52 +1000, Nick Piggin wrote:
> What we could do then, is allocate pages in batches (we already do),
> but only check watermarks if we have to go to the buddy allocator
> (we don't currently do this, but really should anyway, considering
> that the watermark checks are based on pages in the buddy allocator
> rather than pages in buddy + pcp).
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
mm/page_alloc.c | 57 ++++++++++++++++++++++++++++++++++++++++++++++++++++----
1 file changed, 53 insertions(+), 4 deletions(-)
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -767,6 +767,42 @@ void split_page(struct page *page, unsig
set_page_refcounted(page + i);
}
+static struct page *get_pcp_page(struct zonelist *zonelist, struct zone *zone,
+ gfp_t gfp_mask, int order)
+{
+ unsigned long flags;
+ int cold, cpu;
+ struct per_cpu_pages *pcp;
+ struct page *page;
+
+ if (unlikely(order != 0))
+ return NULL;
+
+ cold = !!(gfp_mask & __GFP_COLD);
+again:
+ page = NULL;
+ cpu = get_cpu();
+ pcp = &zone_pcp(zone, cpu)->pcp[cold];
+ local_irq_save(flags);
+ if (pcp->count) {
+ page = list_entry(pcp->list.next, struct page, lru);
+ list_del(&page->lru);
+ pcp->count--;
+ __count_zone_vm_events(PGALLOC, zone, 1);
+ zone_statistics(zonelist, zone);
+ }
+ local_irq_restore(flags);
+ put_cpu();
+
+ if (page) {
+ BUG_ON(bad_range(zone, page));
+ if (prep_new_page(page, 0, gfp_mask))
+ goto again;
+ }
+
+ return page;
+}
+
/*
* Really, prep_compound_page() should be called from __rmqueue_bulk(). But
* we cheat by calling it from here, in the order > 0 path. Saves a branch
@@ -781,13 +817,17 @@ static struct page *buffered_rmqueue(str
int cpu;
again:
- cpu = get_cpu();
+ cpu = get_cpu();
if (likely(order == 0)) {
struct per_cpu_pages *pcp;
pcp = &zone_pcp(zone, cpu)->pcp[cold];
local_irq_save(flags);
- if (!pcp->count) {
+ /*
+ * Even though we checked the pcps earlier we could have
+ * been preempted and scheduled to another cpu.
+ */
+ if (likely(!pcp->count)) {
pcp->count += rmqueue_bulk(zone, 0,
pcp->batch, &pcp->list);
if (unlikely(!pcp->count))
@@ -882,6 +922,16 @@ get_page_from_freelist(gfp_t gfp_mask, u
if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
unsigned long mark;
+
+ /*
+ * If there are pages in the pcp lists take those
+ * without checking the watermarks, since the zone
+ * free_pages count is without the pcp count.
+ */
+ page = get_pcp_page(zonelist, *z, gfp_mask, order);
+ if (page)
+ break;
+
if (alloc_flags & ALLOC_WMARK_MIN)
mark = (*z)->pages_min;
else if (alloc_flags & ALLOC_WMARK_LOW)
@@ -896,9 +946,8 @@ get_page_from_freelist(gfp_t gfp_mask, u
}
page = buffered_rmqueue(zonelist, *z, order, gfp_mask);
- if (page) {
+ if (page)
break;
- }
} while (*(++z) != NULL);
return page;
}
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH] GFP_THISNODE for the slab allocator
2006-09-18 2:20 ` Paul Jackson
@ 2006-09-18 16:34 ` Paul Jackson
2006-09-18 17:49 ` David Rientjes
2006-09-19 20:52 ` David Rientjes
0 siblings, 2 replies; 82+ messages in thread
From: Paul Jackson @ 2006-09-18 16:34 UTC (permalink / raw)
To: Paul Jackson; +Cc: rientjes, clameter, akpm, linux-mm
pj wrote:
> Do you have any plans to build a hybrid system with both real and
> emulated NUMA present? That could complicate things.
This might be the crux of the matter.
We currently have a large SMP (aka multi-core) wave washing over the
upper end of the large volume markets, as we deal with the fact
that a single core's compute power (and electric power ;) can't
continue to grow as fast as we need.
Inevitably, in a few years, a NUMA wave will follow, as we deal with
an overloaded shared memory bus, and begin to distribute the memory
bandwidth across multiple buses.
We should architect consistently with this anticipated evolution.
Eventually, a memory container mechanism that doesn't work on real
NUMA boxes would be useless.
I'm inclined to think that this means node_distance between two fake
nodes on the same real node should be 10, the value always used to
indicate that two node numbers refer to one and the same physical
hardware.
For now, it could be that we can't handle hybrid systems, and that fake
numa systems simply have a distance table of all 10's, driven by the
kernel boot command "numa=fake=N". But that apparatus will have to be
extended at some point, to support hybrid fake and real NUMA combined.
And this will have to mature from being an arch=x86_64 only thing to
being generically available. And it will have to become a mechanism
that can be applied on a running system, creating (and removing) fake
nodes on the fly, without a reboot, so long as the required physical
memory is free and available.
A comment above arch/x86_64/mm/srat.c slit_valid() raises concerns
about a SLIT table with all 10's. I suspect we will just have to find
out the hard way what that problem is. Change the table to all 10's
on these fake numa systems and see what hurts.
The generic kernel code should deal with this, and in particular, the
get_page_from_freelist() loop that provoked this discussion should be
coded so that it caches the last used node iff that node is distance
10 from the node at the front of the zonelist.
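A minimal sketch of that condition as it might sit in the
get_page_from_freelist() loop; the helper name is invented, and it
assumes node_distance() really does return 10 between fake nodes as
argued above:

#include <linux/mmzone.h>
#include <linux/topology.h>

/*
 * Sketch (name invented): only cache and reuse the zone we last allocated
 * from if it is at "local" distance (10) from the node at the head of the
 * current zonelist, i.e. if it is just another fake node on the same
 * hardware node.  On a real NUMA machine the distance is larger and we
 * keep scanning the zonelist from the front, as today.
 */
static int zone_worth_caching(struct zonelist *zonelist, struct zone *last)
{
	int head_nid = zone_to_nid(zonelist->zones[0]);

	return node_distance(head_nid, zone_to_nid(last)) == 10;
}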
The only way to make this kind of stuff hold up over the long term
is to get a good conceptual model, and stick with it. This fake
numa provides for multiple logical nodes on a single physical node.
The modal approach I recommended yesterday, where a system either
supported fake NUMA or real NUMA, but not both, had the stench of
an intermediate solution that would not hold over the long run.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH] GFP_THISNODE for the slab allocator
2006-09-18 16:34 ` Paul Jackson
@ 2006-09-18 17:49 ` David Rientjes
2006-09-18 20:46 ` Paul Jackson
2006-09-19 20:52 ` David Rientjes
1 sibling, 1 reply; 82+ messages in thread
From: David Rientjes @ 2006-09-18 17:49 UTC (permalink / raw)
To: Paul Jackson; +Cc: clameter, akpm, linux-mm, rientjes
On Mon, 18 Sep 2006, Paul Jackson wrote:
> For now, it could be that we can't handle hybrid systems, and that fake
> numa systems simply have a distance table of all 10's, driven by the
> kernel boot command "numa=fake=N". But that apparatus will have to be
> extended at some point, to support hybrid fake and real NUMA combined.
> And this will have to mature from being an arch=x86_64 only thing to
> being generically available. And it will have to become a mechanism
> that can be applied on a running system, creating (and removing) fake
> nodes on the fly, without a reboot, so long as the required physical
> memory is free and available.
>
Magnus Damm wrote a series of patches that divided real NUMA nodes into
several smaller emulated nodes (real nodes - 1) for the x86_64. They are
from 2.6.14-mm1:
http://marc.theaimsgroup.com/?l=linux-mm&m=113161386520342&w=2
As already said, the only flag that exists to determine whether
CONFIG_NUMA_EMU is enabled and numa=fake is being used (and used
correctly) is the numa_fake int in arch/x86_64/mm/numa.c. Any abstraction
of this to generic kernel code should probably follow in the footsteps of
Magnus' other patch series which moved most of NUMA emulation to generic
architectures. He used it primarily for implementing numa=fake on i386:
http://marc.theaimsgroup.com/?l=linux-mm&m=112806587501884&w=2
At the time it was suggested to emulate an SMP NUMA system where each
node doesn't have all of its CPU's online.
David
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH] mm: exempt pcp alloc from watermarks
2006-09-18 12:44 ` [PATCH] mm: exempt pcp alloc from watermarks Peter Zijlstra
@ 2006-09-18 20:20 ` Christoph Lameter
2006-09-18 20:43 ` Peter Zijlstra
2006-09-19 14:35 ` Nick Piggin
1 sibling, 1 reply; 82+ messages in thread
From: Christoph Lameter @ 2006-09-18 20:20 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Nick Piggin, Paul Jackson, akpm, linux-mm, rientjes, ak
On Mon, 18 Sep 2006, Peter Zijlstra wrote:
> On Sun, 2006-09-17 at 23:52 +1000, Nick Piggin wrote:
>
> > What we could do then, is allocate pages in batches (we already do),
> > but only check watermarks if we have to go to the buddy allocator
> > (we don't currently do this, but really should anyway, considering
> > that the watermark checks are based on pages in the buddy allocator
> > rather than pages in buddy + pcp).
buffered_rmqueue has never checked watermarks. Seems that this is a
fragment of a larger discussion and someone added those checks?
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH] mm: exempt pcp alloc from watermarks
2006-09-18 20:20 ` Christoph Lameter
@ 2006-09-18 20:43 ` Peter Zijlstra
0 siblings, 0 replies; 82+ messages in thread
From: Peter Zijlstra @ 2006-09-18 20:43 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Nick Piggin, Paul Jackson, akpm, linux-mm, rientjes, ak
On Mon, 2006-09-18 at 13:20 -0700, Christoph Lameter wrote:
> On Mon, 18 Sep 2006, Peter Zijlstra wrote:
>
> > On Sun, 2006-09-17 at 23:52 +1000, Nick Piggin wrote:
> >
> > > What we could do then, is allocate pages in batches (we already do),
> > > but only check watermarks if we have to go to the buddy allocator
> > > (we don't currently do this, but really should anyway, considering
> > > that the watermark checks are based on pages in the buddy allocator
> > > rather than pages in buddy + pcp).
>
> buffered_rmqueue has never checked watermarks. Seems that this is a
> fragment of a larger discussion and someone added those checks?
get_page_from_freelist() seems to check the watermarks before calling
buffered_rmqueue(), so if the watermarks fail, we never get to
buffered_rmqueue().
This patch adds a path to the per cpu pagelists before checking the
watermarks; however, it will not refill the pcps when they are empty.
So now we can deplete the pcps even though we fail the watermark, which
is correct since the free_pages count excludes the pcp pages.
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH] GFP_THISNODE for the slab allocator
2006-09-18 17:49 ` David Rientjes
@ 2006-09-18 20:46 ` Paul Jackson
0 siblings, 0 replies; 82+ messages in thread
From: Paul Jackson @ 2006-09-18 20:46 UTC (permalink / raw)
To: Magnus Damm; +Cc: clameter, akpm, linux-mm, rientjes
Magnus,
You were a year ahead of your time ;). This linux-mm thread just
mentioned your Nov 10, 2005 patch set to extend the x86_64 NUMA
emulation to run on NUMA systems:
http://marc.theaimsgroup.com/?l=linux-mm&m=113161386520342&w=2
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH] mm: exempt pcp alloc from watermarks
2006-09-18 12:44 ` [PATCH] mm: exempt pcp alloc from watermarks Peter Zijlstra
2006-09-18 20:20 ` Christoph Lameter
@ 2006-09-19 14:35 ` Nick Piggin
2006-09-19 14:44 ` Christoph Lameter
2006-09-19 14:51 ` Peter Zijlstra
1 sibling, 2 replies; 82+ messages in thread
From: Nick Piggin @ 2006-09-19 14:35 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Paul Jackson, akpm, clameter, linux-mm, rientjes, ak
[-- Attachment #1: Type: text/plain, Size: 806 bytes --]
Peter Zijlstra wrote:
> On Sun, 2006-09-17 at 23:52 +1000, Nick Piggin wrote:
>
>
>>What we could do then, is allocate pages in batches (we already do),
>>but only check watermarks if we have to go to the buddy allocator
>>(we don't currently do this, but really should anyway, considering
>>that the watermark checks are based on pages in the buddy allocator
>>rather than pages in buddy + pcp).
>
>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Hi Peter,
Thanks for the patch! I have a slight preference for the following
version, which speculatively tests pcp->count without disabling
interrupts (the chance of being preempted or scheduled in this
window is basically the same as the chance of being preempted after
checking watermarks). What do you think?
--
SUSE Labs, Novell Inc.
[-- Attachment #2: mm-pcp-bypass-wmark.patch --]
[-- Type: text/plain, Size: 1209 bytes --]
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c 2006-09-20 00:06:46.000000000 +1000
+++ linux-2.6/mm/page_alloc.c 2006-09-20 00:20:28.000000000 +1000
@@ -880,6 +880,16 @@ get_page_from_freelist(gfp_t gfp_mask, u
!cpuset_zone_allowed(*z, gfp_mask))
continue;
+ if (likely(order == 0)) {
+ int cold = !!(gfp_mask & __GFP_COLD);
+ int cpu = raw_smp_processor_id();
+ struct per_cpu_pages *pcp;
+
+ pcp = &zone_pcp(*z, cpu)->pcp[cold];
+ if (likely(pcp->count))
+ goto skip_watermarks;
+ }
+
if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
unsigned long mark;
if (alloc_flags & ALLOC_WMARK_MIN)
@@ -889,16 +899,17 @@ get_page_from_freelist(gfp_t gfp_mask, u
else
mark = (*z)->pages_high;
if (!zone_watermark_ok(*z, order, mark,
- classzone_idx, alloc_flags))
+ classzone_idx, alloc_flags)) {
if (!zone_reclaim_mode ||
!zone_reclaim(*z, gfp_mask, order))
continue;
+ }
}
+skip_watermarks:
page = buffered_rmqueue(zonelist, *z, order, gfp_mask);
- if (page) {
+ if (page)
break;
- }
} while (*(++z) != NULL);
return page;
}
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH] mm: exempt pcp alloc from watermarks
2006-09-19 14:35 ` Nick Piggin
@ 2006-09-19 14:44 ` Christoph Lameter
2006-09-19 15:02 ` Nick Piggin
2006-09-19 14:51 ` Peter Zijlstra
1 sibling, 1 reply; 82+ messages in thread
From: Christoph Lameter @ 2006-09-19 14:44 UTC (permalink / raw)
To: Nick Piggin; +Cc: Peter Zijlstra, Paul Jackson, akpm, linux-mm, rientjes, ak
Could we simply try to get a page from the pcp of the best zone before
doing any other processing? That way we may actually improve the performance
of alloc pages.
Could we inline the attempt to get a page from the pcp?
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH] mm: exempt pcp alloc from watermarks
2006-09-19 14:35 ` Nick Piggin
2006-09-19 14:44 ` Christoph Lameter
@ 2006-09-19 14:51 ` Peter Zijlstra
2006-09-19 15:10 ` Nick Piggin
1 sibling, 1 reply; 82+ messages in thread
From: Peter Zijlstra @ 2006-09-19 14:51 UTC (permalink / raw)
To: Nick Piggin; +Cc: Paul Jackson, akpm, clameter, linux-mm, rientjes, ak
On Wed, 2006-09-20 at 00:35 +1000, Nick Piggin wrote:
> Peter Zijlstra wrote:
> > On Sun, 2006-09-17 at 23:52 +1000, Nick Piggin wrote:
> >
> >
> >>What we could do then, is allocate pages in batches (we already do),
> >>but only check watermarks if we have to go to the buddy allocator
> >>(we don't currently do this, but really should anyway, considering
> >>that the watermark checks are based on pages in the buddy allocator
> >>rather than pages in buddy + pcp).
> >
> >
> > Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
>
> Hi Peter,
>
> Thanks for the patch! I have a slight preference for the following
> version, which speculatively tests pcp->count without disabling
> interrupts (the chance of being preempted or scheduled in this
> window is basically the same as the chance of being preempted after
> checking watermarks). What do you think?
The race here allows the watermark check to be wrongly bypassed. My version
raced the other way around, where you could find a non-empty pcp where an
empty one was otherwise expected.
OTOH it is much shorter, I'll see if I can shorten mine and keep the
race safe (and perhaps do what Christoph suggests).
> plain text document attachment (mm-pcp-bypass-wmark.patch)
> Index: linux-2.6/mm/page_alloc.c
> ===================================================================
> --- linux-2.6.orig/mm/page_alloc.c 2006-09-20 00:06:46.000000000 +1000
> +++ linux-2.6/mm/page_alloc.c 2006-09-20 00:20:28.000000000 +1000
> @@ -880,6 +880,16 @@ get_page_from_freelist(gfp_t gfp_mask, u
> !cpuset_zone_allowed(*z, gfp_mask))
> continue;
>
> + if (likely(order == 0)) {
> + int cold = !!(gfp_mask & __GFP_COLD);
> + int cpu = raw_smp_processor_id();
> + struct per_cpu_pages *pcp;
> +
> + pcp = &zone_pcp(*z, cpu)->pcp[cold];
> + if (likely(pcp->count))
> + goto skip_watermarks;
> + }
> +
> if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
> unsigned long mark;
> if (alloc_flags & ALLOC_WMARK_MIN)
> @@ -889,16 +899,17 @@ get_page_from_freelist(gfp_t gfp_mask, u
> else
> mark = (*z)->pages_high;
> if (!zone_watermark_ok(*z, order, mark,
> - classzone_idx, alloc_flags))
> + classzone_idx, alloc_flags)) {
> if (!zone_reclaim_mode ||
> !zone_reclaim(*z, gfp_mask, order))
> continue;
> + }
> }
>
> +skip_watermarks:
> page = buffered_rmqueue(zonelist, *z, order, gfp_mask);
> - if (page) {
> + if (page)
> break;
> - }
> } while (*(++z) != NULL);
> return page;
> }
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH] mm: exempt pcp alloc from watermarks
2006-09-19 14:44 ` Christoph Lameter
@ 2006-09-19 15:02 ` Nick Piggin
0 siblings, 0 replies; 82+ messages in thread
From: Nick Piggin @ 2006-09-19 15:02 UTC (permalink / raw)
To: Christoph Lameter
Cc: Peter Zijlstra, Paul Jackson, akpm, linux-mm, rientjes, ak
Christoph Lameter wrote:
> Could we simply try to get a page from the pcp of the best zone before
> doing any other processing? That way we may actually improve the performance
> of alloc pages.
There is very little other processing in my patch. The only thing is
really the cpuset check, which really seems (IMO) to be required.
I don't expect performance to change much right now... scalability will be
improved for CPUs allocating from the same zones (due to batched access of
zone data), but that shouldn't be too common on machines these days.
I expect the bigger win will come from not having to cycle through a huge
number of zones *every* allocation when memory is low, in Andrew's multi
node container situation[*]. This won't work if you only check the best
zone.
[*] If it is still a problem, they should be able to increase the pcp batch
size until it isn't.
> Could we inline the attempt to get a page from the pcp?
Not sure. My instinct is not to inline unless it makes the code smaller,
provides obvious constant folding benefits, or runs faster. The former
two don't seem to be the case, and the latter is probably not measurable.
--
SUSE Labs, Novell Inc.
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH] mm: exempt pcp alloc from watermarks
2006-09-19 15:10 ` Nick Piggin
@ 2006-09-19 15:05 ` Peter Zijlstra
2006-09-19 15:39 ` Christoph Lameter
0 siblings, 1 reply; 82+ messages in thread
From: Peter Zijlstra @ 2006-09-19 15:05 UTC (permalink / raw)
To: Nick Piggin; +Cc: Paul Jackson, akpm, clameter, linux-mm, rientjes, ak
On Wed, 2006-09-20 at 01:10 +1000, Nick Piggin wrote:
> Peter Zijlstra wrote:
> > On Wed, 2006-09-20 at 00:35 +1000, Nick Piggin wrote:
> >
>
> >>Thanks for the patch! I have a slight preference for the following
> >>version, which speculatively tests pcp->count without disabling
> >>interrupts (the chance of being preempted or scheduled in this
> >>window is basically the same as the chance of being preempted after
> >>checking watermarks). What do you think?
> >
> >
> > The race here allows the watermark check to be wrongly bypassed. My version
> > raced the other way around, where you could find a non-empty pcp where an
> > empty one was otherwise expected.
>
> I really doubt it matters. You could be preempted after that check
> anyway, and by the time you return the previous watermark check is
> meaningless.
>
> If we really want to be strict about watermark checks, it has to be
> done with the zone lock held, no other option. I doubt anybody
> bothered, because the watermarks (even PF_MEMALLOC pool) are all
> heuristics anyway and it is a better idea to keep fastpath code fast.
Yes, you're absolutely right. I forgot to look at the bigger picture.
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH] mm: exempt pcp alloc from watermarks
2006-09-19 14:51 ` Peter Zijlstra
@ 2006-09-19 15:10 ` Nick Piggin
2006-09-19 15:05 ` Peter Zijlstra
0 siblings, 1 reply; 82+ messages in thread
From: Nick Piggin @ 2006-09-19 15:10 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Paul Jackson, akpm, clameter, linux-mm, rientjes, ak
Peter Zijlstra wrote:
> On Wed, 2006-09-20 at 00:35 +1000, Nick Piggin wrote:
>
>>Thanks for the patch! I have a slight preference for the following
>>version, which speculatively tests pcp->count without disabling
>>interrupts (the chance of being preempted or scheduled in this
>>window is basically the same as the chance of being preempted after
>>checking watermarks). What do you think?
>
>
> The race here allows the watermark check to be wrongly bypassed. My version
> raced the other way around, where you could find a non-empty pcp where an
> empty one was otherwise expected.
I really doubt it matters. You could be preempted after that check
anyway, and by the time you return the previous watermark check is
meaningless.
If we really want to be strict about watermark checks, it has to be
done with the zone lock held, no other option. I doubt anybody
bothered, because the watermarks (even PF_MEMALLOC pool) are all
heuristics anyway and it is a better idea to keep fastpath code fast.
> OTOH it is much shorter, I'll see if I can shorten mine and keep the
> race safe (and perhaps do what Christoph suggests).
--
SUSE Labs, Novell Inc.
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH] mm: exempt pcp alloc from watermarks
2006-09-19 15:05 ` Peter Zijlstra
@ 2006-09-19 15:39 ` Christoph Lameter
0 siblings, 0 replies; 82+ messages in thread
From: Christoph Lameter @ 2006-09-19 15:39 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Nick Piggin, Paul Jackson, akpm, linux-mm, rientjes, ak
Nick, note that there is a PF_xx-flag that shows you if a process needs to
consider policies and cpusets. Maybe checking that would allow a fast
path.
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH] GFP_THISNODE for the slab allocator
2006-09-17 4:37 ` Christoph Lameter
2006-09-17 4:55 ` Andrew Morton
@ 2006-09-19 19:17 ` David Rientjes
2006-09-19 19:19 ` David Rientjes
2006-09-19 19:31 ` Christoph Lameter
1 sibling, 2 replies; 82+ messages in thread
From: David Rientjes @ 2006-09-19 19:17 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Andrew Morton, Paul Jackson, linux-mm, rientjes
On Sat, 16 Sep 2006, Christoph Lameter wrote:
> One would not construct a memory container from 50 zones but build a
> single zone as a memory container of that size.
>
Now I used numa=fake=64 for 64 nodes of 48M each and created a 2G cpuset
(43 nodes) and attached 'usemem -m 1500 -s 100000000 &' for 1.5G of
anonymous memory. Then I timed and profiled the kernel build in the
remaining 1G on my 3G box (21 nodes). This was the old case.
I made a modification in my own tree that allowed numa=fake=N to break the
memory into N nodes that are not powers of 2 (by writing a new hash
function for pfn_to_nid). I booted with numa=fake=3 which gives me one
node of 2G and another of 1G. I then placed each in their own cpusets and
repeated the experiment.
64 nodes 3 nodes
real 5m19.722s 5m2.171s
user 9m11.790s 9m15.999s
sys 1m9.164s 0m31.030s
64 nodes:
3786 mwait_idle 48.5385
8590 __cpuset_zone_allowed 40.9048
978 clear_page 17.1579
22 clear_page_end 3.1429
202 find_get_page 2.8451
115 pfn_to_page 2.5000
355 zone_watermark_ok 2.2756
131 __down_read_trylock 1.9848
86 page_remove_rmap 1.9545
146 find_vma 1.6782
...
1129 get_page_from_freelist 1.2127
3 nodes:
3940 mwait_idle 50.5128
1114 clear_page 19.5439
19 clear_page_end 2.7143
184 find_get_page 2.5915
98 page_remove_rmap 2.2273
122 __down_read_trylock 1.8485
140 find_vma 1.6092
67 pfn_to_page 1.4565
47 __strnlen_user 1.2051
24 page_add_file_rmap 1.0909
...
457 get_page_from_freelist 0.4909
33 lru_cache_add_active 0.4853
...
90 __cpuset_zone_allowed 0.4286
David
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH] GFP_THISNODE for the slab allocator
2006-09-19 19:17 ` David Rientjes
@ 2006-09-19 19:19 ` David Rientjes
2006-09-19 19:31 ` Christoph Lameter
1 sibling, 0 replies; 82+ messages in thread
From: David Rientjes @ 2006-09-19 19:19 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Andrew Morton, Paul Jackson, linux-mm
On Tue, 19 Sep 2006, David Rientjes wrote:
> I made a modification in my own tree that allowed numa=fake=N to break the
> memory into N nodes that are not powers of 2 (by writing a new hash
> function for pfn_to_nid). I booted with numa=fake=3 which gives me one
> node of 2G and another of 1G. I then placed each in their own cpusets and
> repeated the experiment.
>
Correction: numa=fake=3 gives me three nodes, each of 1024M. In my
experiment I used 0-1 > mems for the usemem cpuset and 2 > mems for the
kernel build.
David
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH] GFP_THISNODE for the slab allocator
2006-09-19 19:17 ` David Rientjes
2006-09-19 19:19 ` David Rientjes
@ 2006-09-19 19:31 ` Christoph Lameter
2006-09-19 21:12 ` David Rientjes
1 sibling, 1 reply; 82+ messages in thread
From: Christoph Lameter @ 2006-09-19 19:31 UTC (permalink / raw)
To: David Rientjes; +Cc: Andrew Morton, Paul Jackson, linux-mm
I guess this is the expected result. The __cpuset_zone_allowed bottleneck
is gone.
What would be workable is to dynamically create new nodes.
The system would consist of nodes 0 ... MAX_PHYSNODES - 1, which would
be the physical nodes.
Additional nodes beyond that, MAX_PHYSNODES ... MAX_NUMNODES - 1, would be
containers.
A container could be created through a node hotplug API. When a node is
created one specifies how much memory from which nodes should be assigned
to that container / node.
Then the system will take elements off the freelists of the source nodes
and use these as building blocks for the new nodes. All the page
flags must be updated with the new container node number or the
section_to_node table must be updated.
Then we should have a fully functioning node with proper statistics for
operations. Swap / zone reclaim should work as usual on NUMA systems. The
slab will generate its node specific structures for the new node. One
can then cage applications using cpusets in that node.
If the container node is brought down then all pages in the node must be
freed and will coalesce back into large higher order pages that were taken
off the source nodes freelists. The page->flags need to be updated to
point to the source node and then the higher order pages can be freed
back to the origin node.
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH] GFP_THISNODE for the slab allocator
2006-09-18 16:34 ` Paul Jackson
2006-09-18 17:49 ` David Rientjes
@ 2006-09-19 20:52 ` David Rientjes
2006-09-19 21:26 ` Christoph Lameter
2006-09-21 22:11 ` David Rientjes
1 sibling, 2 replies; 82+ messages in thread
From: David Rientjes @ 2006-09-19 20:52 UTC (permalink / raw)
To: Paul Jackson; +Cc: clameter, akpm, linux-mm, rientjes
On Mon, 18 Sep 2006, Paul Jackson wrote:
> For now, it could be that we can't handle hybrid systems, and that fake
> numa systems simply have a distance table of all 10's, driven by the
> kernel boot command "numa=fake=N". But that apparatus will have to be
> extended at some point, to support hybrid fake and real NUMA combined.
> And this will have to mature from being an arch=x86_64 only thing to
> being generically available. And it will have to become a mechanism
> that can be applied on a running system, creating (and removing) fake
> nodes on the fly, without a reboot, so long as the required physical
> memory is free and available.
>
> A comment above arch/x86_64/mm/srat.c slit_valid() raises concerns
> about a SLIT table with all 10's. I suspect we will just have to find
> out the hard way what that problem is. Change the table to all 10's
> on these fake numa systems and see what hurts.
>
> The generic kernel code should deal with this, and in particular, the
> get_page_from_freelist() loop that provoked this discussion should be
> coded so that it caches the last used node iff that node is distance
> 10 from the node at the front of the zonelist.
>
> The only way to make this kind of stuff hold up over the long term
> is to get a good conceptual model, and stick with it. This fake
> numa provides for multiple logical nodes on a single physical node.
>
> The modal approach I recommended yesterday, where a system either
> supported fake NUMA or real NUMA, but not both, had the stench of
> an intermediate solution that would not hold over the long run.
>
Any solution for how to use numa=fake=N as a means of resource management
will come with two prerequisites:
1. An increase in N does not lead to degraded performance due to
get_page_from_freelist in any more than a negligible manner.
2. The current infrastructure of cpusets is not changed in a risky
(or major) way to affect real NUMA machines.
We can assume for the matter of this implementation that the system
administrator or whomever is responsible for using this means of
memory management configures their machine appropriately. So there is no
worry about allocating 256 nodes when the intent is to split the machine
down the middle and having two "containers" for their tasks.
My current fixes to numa=fake:
1. It allows N nodes to be split evenly over the memory map with no
restriction on N being a power of 2.
2. Memory can be allocated in 4M contiguous chunks on a machine up
to 64G. This means my 3G box can be booted with numa=fake=256
or even higher if NODE_SHIFT is increased.
3. Nodes can be asymmetric so that you can configure as many nodes
as you want with the sizes you specify.
This provides the system administrator with all the functionality that he
needs so that the machine can be configured appropriately for its foreseen
workload. It is not, however, the dynamic solution that Christoph suggests,
where nodes themselves can be partitioned at runtime.
The problem being faced is prerequisite #1, in that
get_page_from_freelist needs to be more efficient when considering fake
nodes as opposed to real nodes. In favor of prerequisite #2,
__node_distance in arch/x86_64/ can be modified so that the distance
between fake nodes is always 10 (i.e. the local distance). This is intuitive: we have
a UMA machine that is acting in a NUMA environment and the notion of
memory locality is no longer a consideration. Fake nodes currently are
not emulated among real nodes through Andi Kleen's implementation so the
change to __node_distance is trivial once we have abstracted whether NUMA
emulation is, in fact, being used.
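A sketch of that __node_distance() change, reusing a hypothetical
numa_emulation_active() test of the kind discussed earlier in the thread;
the fallback body is a stand-in, not the real SLIT lookup:

int __node_distance(int a, int b)
{
	/* All fake nodes sit on the same hardware, so report "local" (10). */
	if (numa_emulation_active())	/* hypothetical hook */
		return 10;

	/* stand-in for the existing SLIT-based lookup */
	return a == b ? 10 : 20;
}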
The concern about that approach is the comment in slit_valid, which
suggests that the local node should always have a smaller distance than
the others as far as the NUMA heuristics are concerned. One of those
heuristics is in find_next_best_node, where the next best node is chosen
by preferring those with the smallest distance (in addition to those with
CPUs, etc.). This is not an issue since all nodes in a fake NUMA system
have the same distance, so that heuristic becomes a non-factor. The other
heuristic is in build_zonelists, where zone reclaim on the local node is
preferred over allocating from other nodes that are far away. Since the
distance is, again, always the same between nodes in a fake NUMA system,
it never exceeds RECLAIM_DISTANCE, so it is acceptable at all times to
fall back across nodes (this is, after all, a UMA machine) and no
preference is given to reclaiming the local zones instead.
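For reference, the check in question looks roughly like this in
build_zonelists (paraphrased from mm/page_alloc.c of this era, shown only
to illustrate why it becomes a non-factor under emulation):

	while ((node = find_next_best_node(local_node, &used_mask)) >= 0) {
		int distance = node_distance(local_node, node);

		/*
		 * If another node is sufficiently far away then it is
		 * better to reclaim pages in a zone before going off
		 * node.  With all fake-node distances forced to 10 this
		 * never fires, as it should not on a UMA machine.
		 */
		if (distance > RECLAIM_DISTANCE)
			zone_reclaim_mode = 1;

		/* ... node_load accounting and zonelist building ... */
	}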
There is _no_ problem with the penalty on the first node in the same
distance group in the build_zonelists iteration. The load on each
individual node will be the same since the distance between nodes never
changes, so the ordering of nodes is appropriate even for our emulation.
To be convinced of this, note that node_load[N] ends up the same for
every node N in our UMA machine: find_next_best_node treats each node
equally.
The return value of a numa_emulation call in arch/x86_64/mm/numa.c is 0
when CONFIG_NUMA_EMU is enabled and the command line was parsed and set
up correctly. It becomes trivial to expose this to generic kernel code
by creating a macro that can be tested on any architecture (which may
become useful later when NUMA emulation is abstracted generically, as
suggested earlier).
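Concretely, such a macro could look like the following (this mirrors the
include/linux/numa.h hunk in the patch posted later in this thread):

	#ifdef CONFIG_NUMA_EMU
	extern int numa_emu;		/* set at boot if numa=fake was parsed */
	#define numa_emu_enabled numa_emu
	#else
	#define numa_emu_enabled 0	/* compiles away when not emulating */
	#endif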
So, with these changes, we are now assured of:
1. A macro in generic kernel code that can be tested to determine
whether the kernel was booted with numa=fake.
2. That the ordering produced by find_next_best_node treats each node
equally, keeping in mind that this is really a UMA machine.
This no longer requires us to determine whether two fake nodes are really
on the same hardware node, because we recognize the NUMA emulation case
where appropriate and the rest of the cpuset infrastructure remains
unchanged (prerequisite #2).
Now, stepping back to prerequisite #1, we are required to modify
__cpuset_zone_allowed for the case where we have NUMA emulation. The
idea is that, since we have N nodes where N may be large,
get_page_from_freelist may scan all N nodes before finding free memory.
This isn't a problem on a typical NUMA machine because free pages can
normally be found quickly in a scan of the first few zonelist entries,
thanks to the N differently sorted zonelists. It _is_ a problem under
NUMA emulation because we have many more nodes than CPUs, but it can be
remedied with the proposed caching and the changes to __node_distance.
As Paul and Andrew suggested, there are three additions to task_struct:
1. a cached copy of the struct zonelist * that was passed into
get_page_from_freelist,
2. the index of the zone where free memory was last found, and
3. the index of the next zone to try when (2) is full.
get_page_from_freelist, in the case where the passed-in zonelist differs
from (1), or when __GFP_HARDWALL or ALLOC_CPUSET is not set, uses the
current implementation: walk the zonelist and find a zone with enough
free pages. Otherwise, in the NUMA emulation case, the zone where memory
was most recently found can be used directly since all memory is equal.
No consideration is given to the distance between the last used node and
the node at the front of the zonelist because the distance between all
nodes is 10. (If the passed-in zonelist differs from (1), the three
additions to task_struct are reset for the new configuration, in the same
sense as cpuset_update_task_memory_state, since the memory placement has
changed relative to current->cpuset, which cpusets allow to be
manipulated from outside.)
Now, when get_page_from_freelist is called, the zone indicated by (3) is
tested for free memory and used if some is found; otherwise (3) is
incremented so that we don't spin on one full zone (especially with the
memory hogs from my experiment, where we would never get out of that
spin). This avoids having to take callback_mutex and makes the calls to
__cpuset_zone_allowed more efficient.
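For concreteness, the three additions could sit in task_struct roughly as
below. This is only a sketch of the description above; the patch posted
later in this thread implements a subset of it (last_zonelist and
last_zone_used):

	#ifdef CONFIG_NUMA_EMU
		/* (1) zonelist that the cached indices below refer to */
		struct zonelist *last_zonelist;
		/* (2) index of the zone that last satisfied an allocation */
		unsigned int last_zone_used;
		/* (3) index of the next zone to try once (2) is full;
		 *     advanced instead of rescanning a zone known to be
		 *     full */
		unsigned int next_zone_to_try;
	#endif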
Thus this solution satisfies prerequisite #1 and keeps the modifications
to the current cpuset infrastructure small (prerequisite #2) by keeping
the NUMA emulation case out of the existing code path.
David
* Re: [PATCH] GFP_THISNODE for the slab allocator
2006-09-19 19:31 ` Christoph Lameter
@ 2006-09-19 21:12 ` David Rientjes
2006-09-19 21:28 ` Christoph Lameter
0 siblings, 1 reply; 82+ messages in thread
From: David Rientjes @ 2006-09-19 21:12 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Andrew Morton, Paul Jackson, linux-mm, rientjes
On Tue, 19 Sep 2006, Christoph Lameter wrote:
> What is workable would be to dynamically create new nodes.
>
> The system would consist of nodes 0 ... MAX_PHYSNODES - 1, which would
> be the physical nodes.
>
> Additional nodes beyond that, MAX_PHYSNODES ... MAX_NUMNODES - 1, would
> be containers.
>
I had something similar working when I abstracted some of the x86_64
numa=fake capabilities to work on real NUMA machines.
> A container could be created through a node hotplug API. When a node is
> created one specifies how much memory from which nodes should be assigned
> to that container / node.
>
If the memory from existing nodes is used to create the new node, then
any tasks assigned to that parent node through cpusets will be degraded.
Not a problem, since the user would be aware of this effect on node
creation, but you'd need callback_mutex and task_lock for each task
within the parent node, and possibly rcu_read_lock for the
mems_generation.
David
* Re: [PATCH] GFP_THISNODE for the slab allocator
2006-09-19 20:52 ` David Rientjes
@ 2006-09-19 21:26 ` Christoph Lameter
2006-09-19 21:50 ` David Rientjes
2006-09-21 22:11 ` David Rientjes
1 sibling, 1 reply; 82+ messages in thread
From: Christoph Lameter @ 2006-09-19 21:26 UTC (permalink / raw)
To: David Rientjes; +Cc: Paul Jackson, akpm, linux-mm
On Tue, 19 Sep 2006, David Rientjes wrote:
> 1. An increase in N must not degrade get_page_from_freelist
> performance by more than a negligible amount.
I think that is true if you do not do weird things like creating 64 of
those containers on UMA. Or do you anticipate having hundreds of
containers?
* Re: [PATCH] GFP_THISNODE for the slab allocator
2006-09-19 21:12 ` David Rientjes
@ 2006-09-19 21:28 ` Christoph Lameter
2006-09-19 21:53 ` Paul Jackson
0 siblings, 1 reply; 82+ messages in thread
From: Christoph Lameter @ 2006-09-19 21:28 UTC (permalink / raw)
To: David Rientjes; +Cc: Andrew Morton, Paul Jackson, linux-mm
On Tue, 19 Sep 2006, David Rientjes wrote:
> If the memory from existing nodes is used to create the new node, then
> any tasks assigned to that parent node through cpusets will be degraded.
> Not a problem, since the user would be aware of this effect on node
> creation, but you'd need callback_mutex and task_lock for each task
> within the parent node, and possibly rcu_read_lock for the
> mems_generation.
Paul already has cpuset code in -mm that supports exactly this
situation. He can probably explain the locking, which, as far as I can
tell, is much simpler than you anticipate.
* Re: [PATCH] GFP_THISNODE for the slab allocator
2006-09-19 21:26 ` Christoph Lameter
@ 2006-09-19 21:50 ` David Rientjes
0 siblings, 0 replies; 82+ messages in thread
From: David Rientjes @ 2006-09-19 21:50 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Paul Jackson, akpm, linux-mm, rientjes
On Tue, 19 Sep 2006, Christoph Lameter wrote:
> I think that is true if you do not do weird things like creating 64 of
> those containers on UMA. Or do you anticipate having hundreds of
> containers?
>
What I currently have running is a watered-down version of your
suggestion about dynamic node allocation. It does it in user space by
simply allocating N fixed-size nodes; then, when a particular cpuset
feels memory pressure, it grabs another node and uses it until it is no
longer needed. It's a way to get simple resource management and to
throttle up the processes that are more important. This is how I've used
NUMA emulation and cpusets to match a business goal of achieving certain
objectives with a system-level goal in the form of limits.
Obviously it's not the most efficient way of handling such a policy; an
implementation such as the one you've proposed, supported by the kernel,
would be much more desirable.
David
* Re: [PATCH] GFP_THISNODE for the slab allocator
2006-09-19 21:28 ` Christoph Lameter
@ 2006-09-19 21:53 ` Paul Jackson
0 siblings, 0 replies; 82+ messages in thread
From: Paul Jackson @ 2006-09-19 21:53 UTC (permalink / raw)
To: Christoph Lameter; +Cc: rientjes, akpm, linux-mm
Christoph wrote:
> Paul already has cpuset code in -mm that supports exactly this situation.
See my *-mm patches:
cpuset-top_cpuset-tracks-hotplug-changes-to-node_online_map.patch
cpuset-hotunplug-cpus-and-mems-in-all-cpusets.patch
and patient Andrew's fixes thereto.
In particular, any time you add or remove nodes (whether fake or real),
be sure to update node_online_map and then call
cpuset_track_online_nodes(), so that the cpuset code can resync with
node_online_map. You must make this call in a context where it is OK for
the called code to sleep on various cpuset mutexes.
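A minimal sketch of that ordering, with a hypothetical add_fake_node()
helper (not an existing kernel function) standing in for whatever ends up
creating the node:

	#include <linux/nodemask.h>
	#include <linux/cpuset.h>	/* cpuset_track_online_nodes(), per
					 * the -mm patches above */

	/*
	 * Hypothetical helper: bring a freshly created (fake) node online
	 * and let the cpuset code resync.  Must be called from a context
	 * that may sleep, since cpuset_track_online_nodes() takes cpuset
	 * mutexes.
	 */
	static void add_fake_node(int nid)
	{
		node_set_online(nid);		/* update node_online_map */
		cpuset_track_online_nodes();
	}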
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
* Re: [PATCH] GFP_THISNODE for the slab allocator
2006-09-19 20:52 ` David Rientjes
2006-09-19 21:26 ` Christoph Lameter
@ 2006-09-21 22:11 ` David Rientjes
2006-09-22 10:10 ` Nick Piggin
2006-09-22 16:26 ` Paul Jackson
1 sibling, 2 replies; 82+ messages in thread
From: David Rientjes @ 2006-09-21 22:11 UTC (permalink / raw)
To: Paul Jackson; +Cc: clameter, akpm, linux-mm
On Tue, 19 Sep 2006, David Rientjes wrote:
> As Paul and Andrew suggested, there are three additions to task_struct:
>
> 1. a cached copy of the struct zonelist * that was passed into
> get_page_from_freelist,
>
> 2. the index of the zone where free memory was last found, and
>
> 3. the index of the next zone to try when (2) is full.
>
> get_page_from_freelist, in the case where the passed-in zonelist differs
> from (1), or when __GFP_HARDWALL or ALLOC_CPUSET is not set, uses the
> current implementation: walk the zonelist and find a zone with enough
> free pages. Otherwise, in the NUMA emulation case, the zone where memory
> was most recently found can be used directly since all memory is equal.
> No consideration is given to the distance between the last used node and
> the node at the front of the zonelist because the distance between all
> nodes is 10. (If the passed-in zonelist differs from (1), the three
> additions to task_struct are reset for the new configuration, in the
> same sense as cpuset_update_task_memory_state, since the memory
> placement has changed relative to current->cpuset, which cpusets allow
> to be manipulated from outside.)
>
As suggested by Paul Jackson and friends, this patch abstracts a
numa=fake macro into the generic kernel code. A macro,
'numa_emu_enabled', is defined that can be tested to determine whether
NUMA emulation was set up successfully at boot.
In the NUMA emulation case, the zone most recently allocated from is now
cached in task_struct and used whenever the same zonelist is passed into
get_page_from_freelist with __GFP_HARDWALL and ALLOC_CPUSET. The distance
between the last used node and the first zone's node is not compared
because x86_64 NUMA emulation is not supported on real NUMA machines
anyway (later work).
This patch is on top of my numa=fake patches that are not currently in
-mm (it appears here for comments). Also included is Christoph Lameter's
z->zone_pgdat->node_id speedup, expressed here without zone_to_nid since
that, too, does not appear in my tree.
These trials were the same as before: 3G machine, numa=fake=64, 'usemem -m
1500 -s 100000 &' in a 2G cpuset, and a kernel build in the remainder.
          unpatched      patched        no cpusets, numa=fake=off
real      5m16.223s      5m9.711s       4m58.118s
user      9m13.323s      9m16.803s      9m16.583s
sys       1m7.756s       0m53.947s      0m30.994s
Unpatched top 13:
8292 __cpuset_zone_allowed 39.4857 <-- ~210.0
1813 mwait_idle 23.2436
1042 clear_page 18.2807
24 clear_page_end 3.4286
207 find_get_page 2.9155
123 pfn_to_page 2.6739
347 zone_watermark_ok 2.2244
128 __down_read_trylock 1.9394
84 page_remove_rmap 1.9091
155 find_vma 1.7816
80 page_to_pfn 1.5686
60 __strnlen_user 1.5385
1250 get_page_from_freelist 1.3426 <-- ~931.0
Patched top:
5068 __cpuset_zone_allowed 25.3400 <-- 200.0
1348 mwait_idle 17.2821
928 clear_page 16.2807
195 find_get_page 2.7465
17 clear_page_end 2.4286
106 pfn_to_page 2.3043
344 zone_watermark_ok 2.2051
44 nr_free_pages 1.5172
66 page_remove_rmap 1.5000
54 __strnlen_user 1.3846
119 find_vma 1.3678
62 page_to_pfn 1.2157
73 __down_read_trylock 1.1061
1133 get_page_from_freelist 1.0648 <-- ~1064.0
Tradeoff:
Unpatched: 8292*39.4857 + 1250*1.3426 = 329093.6744
Patched: 5068*25.3400 + 1133*1.0648 = 129629.5384
Not-signed-off-by: David Rientjes <rientjes@google.com>
---
arch/x86_64/mm/numa.c | 9 +++++++--
arch/x86_64/mm/srat.c | 2 ++
include/linux/mmzone.h | 1 +
include/linux/numa.h | 7 +++++++
include/linux/sched.h | 4 ++++
kernel/cpuset.c | 9 +++++++--
mm/page_alloc.c | 17 +++++++++++++++--
7 files changed, 43 insertions(+), 6 deletions(-)
diff --git a/arch/x86_64/mm/numa.c b/arch/x86_64/mm/numa.c
index 9a9e452..46ede0b 100644
--- a/arch/x86_64/mm/numa.c
+++ b/arch/x86_64/mm/numa.c
@@ -11,6 +11,7 @@ #include <linux/mmzone.h>
#include <linux/ctype.h>
#include <linux/module.h>
#include <linux/nodemask.h>
+#include <linux/numa.h>
#include <asm/e820.h>
#include <asm/proto.h>
@@ -187,6 +188,7 @@ #define E820_ADDR_HOLE_SIZE(start, end)
(e820_hole_size((start) >> PAGE_SHIFT, (end) >> PAGE_SHIFT) << \
PAGE_SHIFT)
char *cmdline __initdata;
+int numa_emu;
/*
* Sets up nodeid to range from addr to addr + sz. If the end boundary is
@@ -381,8 +383,11 @@ void __init numa_initmem_init(unsigned l
int i;
#ifdef CONFIG_NUMA_EMU
- if (cmdline && !numa_emulation(start_pfn, end_pfn))
- return;
+ if (cmdline) {
+ numa_emu = !numa_emulation(start_pfn, end_pfn);
+ if (numa_emu)
+ return;
+ }
#endif
#ifdef CONFIG_ACPI_NUMA
diff --git a/arch/x86_64/mm/srat.c b/arch/x86_64/mm/srat.c
index 66f375f..eed080c 100644
--- a/arch/x86_64/mm/srat.c
+++ b/arch/x86_64/mm/srat.c
@@ -436,6 +436,8 @@ int __node_distance(int a, int b)
{
int index;
+ if (numa_emu_enabled)
+ return 10;
if (!acpi_slit)
return a == b ? 10 : 20;
index = acpi_slit->localities * node_to_pxm(a);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index f45163c..81e047d 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -151,6 +151,7 @@ struct zone {
unsigned long lowmem_reserve[MAX_NR_ZONES];
#ifdef CONFIG_NUMA
+ int node;
/*
* zone reclaim becomes active if more unmapped pages exist.
*/
diff --git a/include/linux/numa.h b/include/linux/numa.h
index a31a730..ff2720d 100644
--- a/include/linux/numa.h
+++ b/include/linux/numa.h
@@ -10,4 +10,11 @@ #endif
#define MAX_NUMNODES (1 << NODES_SHIFT)
+#ifdef CONFIG_NUMA_EMU
+extern int numa_emu;
+#define numa_emu_enabled numa_emu
+#else
+#define numa_emu_enabled 0
+#endif
+
#endif /* _LINUX_NUMA_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 34ed0d9..5a2a7f7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -973,6 +973,10 @@ #ifdef CONFIG_NUMA
struct mempolicy *mempolicy;
short il_next;
#endif
+#ifdef CONFIG_NUMA_EMU
+ struct zonelist *last_zonelist;
+ u32 last_zone_used;
+#endif
#ifdef CONFIG_CPUSETS
struct cpuset *cpuset;
nodemask_t mems_allowed;
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 4ea6f0d..df19ecf 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -35,6 +35,7 @@ #include <linux/mm.h>
#include <linux/module.h>
#include <linux/mount.h>
#include <linux/namei.h>
+#include <linux/numa.h>
#include <linux/pagemap.h>
#include <linux/proc_fs.h>
#include <linux/rcupdate.h>
@@ -677,6 +678,10 @@ void cpuset_update_task_memory_state(voi
tsk->flags |= PF_SPREAD_SLAB;
else
tsk->flags &= ~PF_SPREAD_SLAB;
+ if (numa_emu_enabled) {
+ tsk->last_zonelist = NULL;
+ tsk->last_zone_used = 0;
+ }
task_unlock(tsk);
mutex_unlock(&callback_mutex);
mpol_rebind_task(tsk, &tsk->mems_allowed);
@@ -2245,7 +2250,7 @@ int cpuset_zonelist_valid_mems_allowed(s
int i;
for (i = 0; zl->zones[i]; i++) {
- int nid = zl->zones[i]->zone_pgdat->node_id;
+ int nid = zl->zones[i]->node;
if (node_isset(nid, current->mems_allowed))
return 1;
@@ -2318,7 +2323,7 @@ int __cpuset_zone_allowed(struct zone *z
if (in_interrupt())
return 1;
- node = z->zone_pgdat->node_id;
+ node = z->node;
might_sleep_if(!(gfp_mask & __GFP_HARDWALL));
if (node_isset(node, current->mems_allowed))
return 1;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 54a4f53..c80d6a6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -34,6 +34,7 @@ #include <linux/cpu.h>
#include <linux/cpuset.h>
#include <linux/memory_hotplug.h>
#include <linux/nodemask.h>
+#include <linux/numa.h>
#include <linux/vmalloc.h>
#include <linux/mempolicy.h>
#include <linux/stop_machine.h>
@@ -870,6 +871,14 @@ get_page_from_freelist(gfp_t gfp_mask, u
struct zone **z = zonelist->zones;
struct page *page = NULL;
int classzone_idx = zone_idx(*z);
+ unsigned index = 0;
+
+ if (numa_emu_enabled) {
+ if (zonelist == current->last_zonelist &&
+ (alloc_flags & __GFP_HARDWALL) && (alloc_flags & ALLOC_CPUSET))
+ z += current->last_zone_used;
+ current->last_zonelist = zonelist;
+ }
/*
* Go through the zonelist once, looking for a zone with enough free.
@@ -897,8 +906,11 @@ get_page_from_freelist(gfp_t gfp_mask, u
page = buffered_rmqueue(zonelist, *z, order, gfp_mask);
if (page) {
+ if (numa_emu_enabled)
+ current->last_zone_used = index;
break;
}
+ index++;
} while (*(++z) != NULL);
return page;
}
@@ -1203,7 +1215,7 @@ #endif
#ifdef CONFIG_NUMA
static void show_node(struct zone *zone)
{
- printk("Node %d ", zone->zone_pgdat->node_id);
+ printk("Node %d ", zone->node);
}
#else
#define show_node(zone) do { } while (0)
@@ -1965,7 +1977,7 @@ __meminit int init_currently_empty_zone(
zone->zone_start_pfn = zone_start_pfn;
- memmap_init(size, pgdat->node_id, zone_idx(zone), zone_start_pfn);
+ memmap_init(size, zone->node, zone_idx(zone), zone_start_pfn);
zone_init_free_lists(pgdat, zone, zone->spanned_pages);
@@ -2006,6 +2018,7 @@ static void __meminit free_area_init_cor
zone->spanned_pages = size;
zone->present_pages = realsize;
#ifdef CONFIG_NUMA
+ zone->node = nid;
zone->min_unmapped_ratio = (realsize*sysctl_min_unmapped_ratio)
/ 100;
#endif
* Re: [PATCH] GFP_THISNODE for the slab allocator
2006-09-21 22:11 ` David Rientjes
@ 2006-09-22 10:10 ` Nick Piggin
2006-09-22 16:26 ` Paul Jackson
1 sibling, 0 replies; 82+ messages in thread
From: Nick Piggin @ 2006-09-22 10:10 UTC (permalink / raw)
To: David Rientjes; +Cc: Paul Jackson, clameter, akpm, linux-mm
David Rientjes wrote:
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
You have a couple of problems with the page_alloc side. First of all,
__alloc_pages can be called from interrupt context and you don't protect
current->last_zonelist from that.
Secondly, you aren't wrapping around the zonelist to retry the zones you
skipped, so you may return NULL by mistake.
Also, if you are going into page reclaim, get_page_from_freelist is
designed to return NULL on the first call from __alloc_pages, so you end
up having to go through all the zones and check all their watermarks
anyway. If you use my pcp patch, this can be made to happen much less
frequently.
> @@ -34,6 +34,7 @@ #include <linux/cpu.h>
> #include <linux/cpuset.h>
> #include <linux/memory_hotplug.h>
> #include <linux/nodemask.h>
> +#include <linux/numa.h>
> #include <linux/vmalloc.h>
> #include <linux/mempolicy.h>
> #include <linux/stop_machine.h>
> @@ -870,6 +871,14 @@ get_page_from_freelist(gfp_t gfp_mask, u
> struct zone **z = zonelist->zones;
> struct page *page = NULL;
> int classzone_idx = zone_idx(*z);
> + unsigned index = 0;
> +
> + if (numa_emu_enabled) {
> + if (zonelist == current->last_zonelist &&
> + (alloc_flags & __GFP_HARDWALL) && (alloc_flags & ALLOC_CPUSET))
> + z += current->last_zone_used;
> + current->last_zonelist = zonelist;
> + }
>
> /*
> * Go through the zonelist once, looking for a zone with enough free.
> @@ -897,8 +906,11 @@ get_page_from_freelist(gfp_t gfp_mask, u
>
> page = buffered_rmqueue(zonelist, *z, order, gfp_mask);
> if (page) {
> + if (numa_emu_enabled)
> + current->last_zone_used = index;
> break;
> }
> + index++;
> } while (*(++z) != NULL);
> return page;
> }
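For what it's worth, a sketch of one way to address the first two points
(not a tested patch; it reuses the names from the hunk above, and note
that __GFP_HARDWALL is a gfp flag, so it is tested against gfp_mask
rather than alloc_flags here):

	unsigned int nr_zones, start = 0, i;
	struct page *page = NULL;

	/* count the zones so the scan can wrap around */
	for (nr_zones = 0; zonelist->zones[nr_zones]; nr_zones++)
		;
	/* the cache in current cannot be trusted in interrupt context */
	if (numa_emu_enabled && !in_interrupt()) {
		if (zonelist == current->last_zonelist &&
		    (gfp_mask & __GFP_HARDWALL) &&
		    (alloc_flags & ALLOC_CPUSET))
			start = current->last_zone_used;
		current->last_zonelist = zonelist;
	}

	for (i = 0; i < nr_zones; i++) {
		unsigned int idx = (start + i) % nr_zones;
		struct zone *z = zonelist->zones[idx];

		/* ... the existing cpuset and watermark checks on z ... */

		page = buffered_rmqueue(zonelist, z, order, gfp_mask);
		if (page) {
			if (numa_emu_enabled && !in_interrupt())
				current->last_zone_used = idx;
			break;
		}
	}
	return page;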
--
SUSE Labs, Novell Inc.
* Re: [PATCH] GFP_THISNODE for the slab allocator
2006-09-21 22:11 ` David Rientjes
2006-09-22 10:10 ` Nick Piggin
@ 2006-09-22 16:26 ` Paul Jackson
2006-09-22 16:36 ` Christoph Lameter
1 sibling, 1 reply; 82+ messages in thread
From: Paul Jackson @ 2006-09-22 16:26 UTC (permalink / raw)
To: David Rientjes; +Cc: clameter, akpm, linux-mm
Thanks for taking a shot at this.
David wrote:
> + if (numa_emu_enabled)
> + return 10;
The topology.h header has:
> #define LOCAL_DISTANCE 10
though -no-one- uses it, why I don't know ...
This simple forcing of distances to 10 is probably good enough for your
setup, but if this gets serious, we'll need to handle multiple arch's,
and hybrid systems with both fake and real numa. That will take a bit
of work to get the SLIT table, node_distance and zonelist sorting
correct.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
* Re: [PATCH] GFP_THISNODE for the slab allocator
2006-09-22 16:26 ` Paul Jackson
@ 2006-09-22 16:36 ` Christoph Lameter
0 siblings, 0 replies; 82+ messages in thread
From: Christoph Lameter @ 2006-09-22 16:36 UTC (permalink / raw)
To: Paul Jackson; +Cc: David Rientjes, akpm, linux-mm
On Fri, 22 Sep 2006, Paul Jackson wrote:
> The topology.h header has:
> > #define LOCAL_DISTANCE 10
>
> though -no-one- uses it, why I don't know ...
It is a SLIT table reference value. This is the distance to memory that is
local to the processor and it is the lowest possible value.
> This simple forcing of distances to 10 is probably good enough for your
> setup, but if this gets serious, we'll need to handle multiple arch's,
> and hybrid systems with both fake and real numa. That will take a bit
> of work to get the SLIT table, node_distance and zonelist sorting
> correct.
Distance 10 is okay if the memory is on the node where the processor sits.
Thread overview: 82+ messages (newest: 2006-09-22 16:36 UTC)
2006-09-13 23:50 [PATCH] GFP_THISNODE for the slab allocator Christoph Lameter
2006-09-15 5:00 ` Andrew Morton
2006-09-15 6:49 ` Paul Jackson
2006-09-15 7:23 ` Andrew Morton
2006-09-15 7:44 ` Paul Jackson
2006-09-15 8:06 ` Andrew Morton
2006-09-15 15:53 ` David Rientjes
2006-09-15 23:03 ` David Rientjes
2006-09-16 0:04 ` Paul Jackson
2006-09-16 1:36 ` Andrew Morton
2006-09-16 2:23 ` Christoph Lameter
2006-09-16 4:34 ` Andrew Morton
2006-09-16 3:28 ` [PATCH] Add node to zone for the NUMA case Christoph Lameter
2006-09-16 3:40 ` Paul Jackson
2006-09-16 3:45 ` [PATCH] GFP_THISNODE for the slab allocator Paul Jackson
2006-09-16 2:47 ` Christoph Lameter
2006-09-17 3:45 ` David Rientjes
2006-09-17 11:17 ` Paul Jackson
2006-09-17 12:41 ` Christoph Lameter
2006-09-17 13:03 ` Paul Jackson
2006-09-17 20:36 ` David Rientjes
2006-09-17 21:20 ` Paul Jackson
2006-09-17 22:27 ` Paul Jackson
2006-09-17 23:49 ` David Rientjes
2006-09-18 2:20 ` Paul Jackson
2006-09-18 16:34 ` Paul Jackson
2006-09-18 17:49 ` David Rientjes
2006-09-18 20:46 ` Paul Jackson
2006-09-19 20:52 ` David Rientjes
2006-09-19 21:26 ` Christoph Lameter
2006-09-19 21:50 ` David Rientjes
2006-09-21 22:11 ` David Rientjes
2006-09-22 10:10 ` Nick Piggin
2006-09-22 16:26 ` Paul Jackson
2006-09-22 16:36 ` Christoph Lameter
2006-09-15 8:28 ` Andrew Morton
2006-09-16 3:38 ` Paul Jackson
2006-09-16 4:42 ` Andi Kleen
2006-09-16 11:38 ` Paul Jackson
2006-09-16 4:48 ` Andrew Morton
2006-09-16 11:30 ` Paul Jackson
2006-09-16 15:18 ` Andrew Morton
2006-09-17 9:28 ` Paul Jackson
2006-09-17 9:51 ` Nick Piggin
2006-09-17 11:15 ` Paul Jackson
2006-09-17 12:44 ` Nick Piggin
2006-09-17 13:19 ` Paul Jackson
2006-09-17 13:52 ` Nick Piggin
2006-09-17 21:19 ` Paul Jackson
2006-09-18 12:44 ` [PATCH] mm: exempt pcp alloc from watermarks Peter Zijlstra
2006-09-18 20:20 ` Christoph Lameter
2006-09-18 20:43 ` Peter Zijlstra
2006-09-19 14:35 ` Nick Piggin
2006-09-19 14:44 ` Christoph Lameter
2006-09-19 15:02 ` Nick Piggin
2006-09-19 14:51 ` Peter Zijlstra
2006-09-19 15:10 ` Nick Piggin
2006-09-19 15:05 ` Peter Zijlstra
2006-09-19 15:39 ` Christoph Lameter
2006-09-17 16:29 ` [PATCH] GFP_THISNODE for the slab allocator Andrew Morton
2006-09-18 2:11 ` Paul Jackson
2006-09-18 5:09 ` Andrew Morton
2006-09-18 7:49 ` Paul Jackson
2006-09-16 11:48 ` Paul Jackson
2006-09-16 15:38 ` Andrew Morton
2006-09-16 21:51 ` Paul Jackson
2006-09-16 23:10 ` Andrew Morton
2006-09-17 4:37 ` Christoph Lameter
2006-09-17 4:55 ` Andrew Morton
2006-09-17 12:09 ` Paul Jackson
2006-09-17 12:36 ` Christoph Lameter
2006-09-17 13:06 ` Paul Jackson
2006-09-19 19:17 ` David Rientjes
2006-09-19 19:19 ` David Rientjes
2006-09-19 19:31 ` Christoph Lameter
2006-09-19 21:12 ` David Rientjes
2006-09-19 21:28 ` Christoph Lameter
2006-09-19 21:53 ` Paul Jackson
2006-09-15 17:08 ` Christoph Lameter
2006-09-15 17:37 ` [PATCH] Add NUMA_BUILD definition in kernel.h to avoid #ifdef CONFIG_NUMA Christoph Lameter
2006-09-15 17:38 ` [PATCH] Disable GFP_THISNODE in the non-NUMA case Christoph Lameter
2006-09-15 17:42 ` [PATCH] GFP_THISNODE for the slab allocator V2 Christoph Lameter