linux-mm.kvack.org archive mirror
* [RFC] another way to speed up fake numa node page_alloc
@ 2006-09-25  9:14 Paul Jackson
  2006-09-26  6:08 ` David Rientjes
  2006-10-02  6:18 ` Paul Jackson
  0 siblings, 2 replies; 28+ messages in thread
From: Paul Jackson @ 2006-09-25  9:14 UTC (permalink / raw)
  To: linux-mm
  Cc: akpm, Nick Piggin, David Rientjes, Andi Kleen, mbligh, rohitseth,
	menage, Paul Jackson, clameter

Here's an entirely different approach to speeding up
get_page_from_freelist() on large fake numa configurations.

Instead of trying to cache the last node that worked, it remembers
the nodes that didn't work recently.  Namely, it remembers the nodes
that were short of free memory in the last second.  And it stashes a
zone-to-node table in the zonelist struct, to optimize that conversion
(minimize its cache footprint).

Beware.  This code has not been tested.  It has built and booted, once.

It almost certainly has bugs, and I have no idea if it actually speeds
up, or slows down, any load of interest.  I have not yet verified
that it does anything like what I intend it to do.

It applies to 2.6.18-rc7-mm1.

There are two reasons I pursued this alternative:

 1) Contrary to what I said before, we (SGI, on large ia64 sn2 systems)
    have seen real customer loads where the cost to scan the zonelist
    was a problem, due to many nodes being full of memory before
    we got to a node we could use.  Or at least, I think we have.
    This was related to me by another engineer, based on experiences
    from some time past.  So this is not guaranteed.  Most likely, though.

    The following approach should help such real numa systems just as
    much as it helps fake numa systems, or any combination thereof.
    
 2) The effort to distinguish fake from real numa, using node_distance,
    so that we could cache a fake numa node and optimize choosing
    it over equivalent distance fake nodes, while continuing to
    properly scan all real nodes in distance order, was going to
    require a nasty blob of zonelist and node distance munging.

    The following approach has no new dependency on node distances or
    zone sorting.

See comment in the patch below for a description of what it actually does.

Technical details of note (or controversy):

 - See the use of "zlf_scan" below, to delay adding any work for this
   new mechanism until we've looked at the first zone in zonelist.
   I figured the odds of the first zone having the memory we needed
   were high enough that we should just look there, first, then get
   fancy only if we need to keep looking.
   
 - Some odd hackery was needed to add items to struct zonelist, while
   not tripping up the custom zonelists built by the mm/mempolicy.c
   code for MPOL_BIND.  My usual wordy comments below explain this.
   Search for "MPOL_BIND".

 - Some per-node data in the struct zonelist is now modified frequently,
   with no locking.  Multiple CPU cores on a node could hit and mangle
   this data.  The theory is that this is just performance hint data,
   and the memory allocator will work just fine despite any such mangling.
   The fields at risk are the struct 'zonelist_faster' fields 'fullnodes'
   (a nodemask_t) and 'last_full_zap' (unsigned long jiffies).  It should
   all be self correcting after at most a one second delay.
 
 - This still does a linear scan of the same lengths as before.  All
   I've optimized is making the scan faster, not algorithmically
   shorter.  It is now able to scan a compact array of 'unsigned
   short' in the case of many full nodes, so one cache line should
   cover quite a few nodes, rather than each node hitting another
   one or two new and distinct cache lines.
 
 - If both Andi and Nick don't find this too complicated, I will be
   (pleasantly) flabbergasted.
 
 - In what really should be a separate patch, I removed the six lines
   of source code following the 'restart' label in __alloc_pages, and
   changed the wakeup_kswapd loop from a do-while loop to a for-loop.
   Eh ... I didn't think four of the six lines were needed, and I
   thought the remaining made more sense written as a for-loop.
   
 - I removed the comment claiming we only use one cacheline's worth of
   zonelist.  We seem, at least in the fake numa case, to have put the
   lie to that claim.
   
 - This needs some test builds for variations of NUMA config, not to
   mention various other tests for function and performance.
   
 - I pay no attention to the various watermarks and such in this performance
   hint.  A node could be marked full for one watermark, and then skipped
   over when searching for a page using a different watermark.  I think
   that's actually quite ok, as it will tend to slightly increase the
   spreading of memory over other nodes, away from a memory stressed node.


Signed-off-by: Paul Jackson <pj@sgi.com>

---
 include/linux/mmzone.h |   72 +++++++++++++++++++++-
 mm/mempolicy.c         |    2 
 mm/page_alloc.c        |  158 +++++++++++++++++++++++++++++++++++++++++++++----
 3 files changed, 216 insertions(+), 16 deletions(-)

--- 2.6.18-rc7-mm1.orig/include/linux/mmzone.h	2006-09-22 14:13:18.000000000 -0700
+++ 2.6.18-rc7-mm1/include/linux/mmzone.h	2006-09-24 22:33:58.000000000 -0700
@@ -303,19 +303,83 @@ struct zone {
  */
 #define DEF_PRIORITY 12
 
+#ifdef CONFIG_NUMA
+/*
+ * The node id's of the zone structs are extracted into a parallel
+ * array, for faster (smaller cache footprint) scanning for allowed
+ * nodes in get_page_from_freelist().
+ *
+ * To optimize get_page_from_freelist(), 'fullnodes' tracks which nodes
+ * have come up short of free memory, in searches using this zonelist,
+ * since the last time (last_full_zap) we zero'd fullnodes.
+ *
+ * The get_page_from_freelist() routine does two scans.  During the
+ * first scan, we skip zones whose corresponding node number (in
+ * the node_id[] array) is either set in fullnodes or not set in
+ * current->mems_allowed (which comes from cpusets).
+ *
+ * Once per second, we zero out (zap) fullnodes, forcing us to
+ * reconsider nodes that might have regained more free memory.
+ * The field last_full_zap is the time we last zapped fullnodes.
+ *
+ * This mechanism reduces the amount of time we waste repeatedly
+ * reexamining zones for free memory when they came up short on
+ * memory only moments ago.
+ *
+ * These struct members logically belong in struct zonelist.  However,
+ * the mempolicy zonelists constructed for MPOL_BIND are intentionally
+ * variable length (and usually much shorter).  A general purpose
+ * mechanism for handling structs with multiple variable length
+ * members is more mechanism than we want here.  We resort to some
+ * special case hackery instead.
+ *
+ * The MPOL_BIND zonelists don't need this zonelist_faster stuff
+ * (in good part because they are shorter), so we put the fixed
+ * length stuff at the front of the zonelist struct, ending in a
+ * variable length zones[], as is needed by MPOL_BIND.
+ *
+ * Then we put the optional faster stuff on the end of the zonelist
+ * struct.  This optional stuff is found by a 'zlfast_ptr' pointer in
+ * the fixed length portion at the front of the struct.  This pointer
+ * both enables us to find the faster stuff, and in the case of
+ * MPOL_BIND zonelists, (which will just set the zlfast_ptr to NULL)
+ * to know that the faster stuff is not there.
+ *
+ * The end result is that struct zonelists come in two flavors:
+ *  1) The full, fixed length version, shown below, and
+ *  2) The custom zonelists for MPOL_BIND.
+ * These custom zonelists have a NULL zlfast_ptr and no zlfast.
+ *
+ * Even though there may be multiple CPU cores on a node modifying
+ * fullnodes or last_full_zap in the same zonelist_faster at the same
+ * time, we don't lock it.  This is just hint data - if it is wrong now
+ * and then, the allocator must still function, perhaps slower.
+ */
+struct zonelist_faster {
+	nodemask_t fullnodes;		/* nodes recently lacking free memory */
+	unsigned long last_full_zap;	/* jiffies when fullnodes last zero'd */
+	unsigned short node_id[MAX_NUMNODES * MAX_NR_ZONES]; /* zone -> nid */
+};
+#else
+struct zonelist_faster;
+#endif
+
 /*
  * One allocation request operates on a zonelist. A zonelist
  * is a list of zones, the first one is the 'goal' of the
  * allocation, the other zones are fallback zones, in decreasing
  * priority.
  *
- * Right now a zonelist takes up less than a cacheline. We never
- * modify it apart from boot-up, and only a few indices are used,
- * so despite the zonelist table being relatively big, the cache
- * footprint of this construct is very small.
+ * If zlfast_ptr is not NULL, then it is just the address of zlfast,
+ * as explained above.  If zlfast_ptr is NULL, there is no zlfast.
  */
+
 struct zonelist {
+	struct zonelist_faster *zlfast_ptr;		     // NULL or &zlfast
 	struct zone *zones[MAX_NUMNODES * MAX_NR_ZONES + 1]; // NULL delimited
+#ifdef CONFIG_NUMA
+	struct zonelist_faster zlfast;			     // optional ...
+#endif
 };
 
 #ifdef CONFIG_ARCH_POPULATES_NODE_MAP
--- 2.6.18-rc7-mm1.orig/mm/page_alloc.c	2006-09-22 14:13:37.000000000 -0700
+++ 2.6.18-rc7-mm1/mm/page_alloc.c	2006-09-25 01:13:30.000000000 -0700
@@ -935,8 +935,90 @@ int zone_watermark_ok(struct zone *z, in
 	return 1;
 }
 
+#ifdef CONFIG_NUMA
+/*
+ * zlf_setup - Setup for "zonelist faster".  Uses cached zone data
+ * to skip over zones that are not allowed by the cpuset, or that
+ * have been recently (in last second) found to be nearly full.
+ * See further comments in mmzone.h.  Reduces cache footprint of
+ * zonelist scans that have to skip over a lot of full or unallowed
+ * nodes.  Returns true if should activate zlf_zone_worth_trying()
+ * this scan.
+ */
+static int zlf_setup(struct zonelist *zonelist, int alloc_flags,
+				nodemask_t *zlf_good)
+{
+	nodemask_t *allowednodes;	/* mems_allowed or all online nodes */
+	struct zonelist_faster *zlf;	/* cached zonelist speedup info */
+
+	zlf = zonelist->zlfast_ptr;
+	if (!zlf)
+		return 0;
+
+	allowednodes = !in_interrupt() && (alloc_flags & ALLOC_CPUSET) ?
+				&current->mems_allowed : &node_online_map;
+
+	if (jiffies - zlf->last_full_zap > 1 * HZ) {
+		nodes_clear(zlf->fullnodes);
+		zlf->last_full_zap = jiffies;
+	}
+	/* Good nodes: allowed but not full nodes */
+	nodes_andnot(*zlf_good, *allowednodes, zlf->fullnodes);
+	return 1;
+}
+
+/*
+ * Given 'z' scanning a zonelist, index into the corresponding node_id
+ * in zlf->node_id[], and determine if that node_id is set in zlf_good.
+ * If it's set, that's a "good" node - allowed by the current cpuset and
+ * so far as we know, not full.  Good nodes are worth examining further
+ * for free memory meeting our requirements.
+ */
+static int zlf_zone_worth_trying(struct zonelist *zonelist, struct zone **z,
+				nodemask_t *zlf_good)
+{
+	struct zonelist_faster *zlf;	/* cached zonelist speedup info */
+
+	zlf = zonelist->zlfast_ptr;
+	return node_isset(zlf->node_id[z - zonelist->zones], *zlf_good);
+}
+
 /*
- * get_page_from_freeliest goes through the zonelist trying to allocate
+ * Given 'z' scanning a zonelist, index into the corresponding node_id
+ * in zlf->node_id[], and mark that node_id set in zlf->fullnodes, so
+ * that subsequent attempts to allocate a page on the current node don't
+ * waste time looking at that node.
+ */
+static void zlf_zone_full(struct zonelist *zonelist, struct zone **z)
+{
+	struct zonelist_faster *zlf;	/* cached zonelist speedup info */
+
+	zlf = zonelist->zlfast_ptr;
+	node_set(zlf->node_id[z - zonelist->zones], zlf->fullnodes);
+}
+
+
+#else	/* CONFIG_NUMA */
+
+static int zlf_setup(struct zonelist *zonelist, int alloc_flags,
+				nodemask_t *zlf_good)
+{
+	return 0;
+}
+
+static int zlf_zone_worth_trying(struct zonelist *zonelist, struct zone **z,
+				nodemask_t *zlf_good)
+{
+	return 1;
+}
+
+static void zlf_zone_full(struct zonelist *zonelist, struct zone **z)
+{
+}
+#endif	/* CONFIG_NUMA */
+
+/*
+ * get_page_from_freelist goes through the zonelist trying to allocate
  * a page.
  */
 static struct page *
@@ -947,12 +1029,19 @@ get_page_from_freelist(gfp_t gfp_mask, u
 	struct page *page = NULL;
 	int classzone_idx = zone_idx(*z);
 	struct zone *zone;
+	nodemask_t zlf_good;	/* good means allowed but not full */
+	int zlf_active = 0;	/* if set, then just try good nodes */
+	int zlf_scan = 1;	/* 1: zlf_setup not yet run; 2: already run */
 
+retry:
 	/*
 	 * Go through the zonelist once, looking for a zone with enough free.
 	 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
 	 */
 	do {
+		if (NUMA_BUILD && zlf_active &&
+			!zlf_zone_worth_trying(zonelist, z, &zlf_good))
+				continue;
 		zone = *z;
 		if (unlikely(NUMA_BUILD && (gfp_mask & __GFP_THISNODE) &&
 			zone->zone_pgdat != zonelist->zones[0]->zone_pgdat))
@@ -972,15 +1061,33 @@ get_page_from_freelist(gfp_t gfp_mask, u
 			if (!zone_watermark_ok(zone , order, mark,
 				    classzone_idx, alloc_flags))
 				if (!zone_reclaim_mode ||
-				    !zone_reclaim(zone, gfp_mask, order))
+				    !zone_reclaim(zone, gfp_mask, order)) {
+				    	if (NUMA_BUILD && zlf_active)
+						zlf_zone_full(zonelist, z);
 					continue;
+				}
 		}
 
 		page = buffered_rmqueue(zonelist, zone, order, gfp_mask);
 		if (page) {
 			break;
 		}
+		if (unlikely(NUMA_BUILD && zlf_scan == 1 &&
+						z == zonelist->zones)) {
+			/* delay zlf_setup until 1st zone tried */
+			zlf_active = zlf_setup(zonelist, alloc_flags, &zlf_good);
+			zlf_scan = 2;
+		}
+		if (NUMA_BUILD && zlf_active)
+			zlf_zone_full(zonelist, z);
 	} while (*(++z) != NULL);
+
+	if (unlikely(NUMA_BUILD && page == NULL && zlf_active)) {
+		/* Let's try this again, this time more thoroughly. */
+		zlf_active = 0;
+		z = zonelist->zones;
+		goto retry;
+	}
 	return page;
 }
 
@@ -1055,21 +1162,13 @@ __alloc_pages(gfp_t gfp_mask, unsigned i
 	might_sleep_if(wait);
 
 restart:
-	z = zonelist->zones;  /* the list of zones suitable for gfp_mask */
-
-	if (unlikely(*z == NULL)) {
-		/* Should this ever happen?? */
-		return NULL;
-	}
-
 	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, order,
 				zonelist, ALLOC_WMARK_LOW|ALLOC_CPUSET);
 	if (page)
 		goto got_pg;
 
-	do {
+	for (z = zonelist->zones; *z; z++)
 		wakeup_kswapd(*z, order);
-	} while (*(++z));
 
 	/*
 	 * OK, we're below the kswapd watermark and have kicked background
@@ -1627,6 +1726,29 @@ static void __meminit build_zonelists(pg
 	}
 }
 
+/* Construct the zonelist performance cache - see further mmzone.h */
+static void __meminit build_zonelist_faster(pg_data_t *pgdat)
+{
+	int i;
+
+	for (i = 0; i < MAX_NR_ZONES; i++) {
+		struct zonelist *zonelist;
+		struct zonelist_faster *zlf;
+		int j;
+
+		zonelist = pgdat->node_zonelists + i;
+		zonelist->zlfast_ptr = zlf = &zonelist->zlfast;
+		nodes_clear(zlf->fullnodes);
+		for (j = 0; j < ARRAY_SIZE(zlf->node_id); j++) {
+			struct zone *z = zonelist->zones[j];
+
+			if (!z)
+				break;
+			zlf->node_id[j] = zone_to_nid(z);
+		}
+	}
+}
+
 #else	/* CONFIG_NUMA */
 
 static void __meminit build_zonelists(pg_data_t *pgdat)
@@ -1664,14 +1786,26 @@ static void __meminit build_zonelists(pg
 	}
 }
 
+/* non-NUMA variant of zonelist performance cache - just NULL zlfast_ptr */
+static void __meminit build_zonelist_faster(pg_data_t *pgdat)
+{
+	int i;
+
+	for (i = 0; i < MAX_NR_ZONES; i++)
+		pgdat->node_zonelists[i].zlfast_ptr = NULL;
+}
+
 #endif	/* CONFIG_NUMA */
 
 /* return values int ....just for stop_machine_run() */
 static int __meminit __build_all_zonelists(void *dummy)
 {
 	int nid;
-	for_each_online_node(nid)
+
+	for_each_online_node(nid) {
 		build_zonelists(NODE_DATA(nid));
+		build_zonelist_faster(NODE_DATA(nid));
+	}
 	return 0;
 }
 
--- 2.6.18-rc7-mm1.orig/mm/mempolicy.c	2006-09-22 14:13:00.000000000 -0700
+++ 2.6.18-rc7-mm1/mm/mempolicy.c	2006-09-23 19:46:15.000000000 -0700
@@ -141,9 +141,11 @@ static struct zonelist *bind_zonelist(no
 	enum zone_type k;
 
 	max = 1 + MAX_NR_ZONES * nodes_weight(*nodes);
+	max++;			/* space for zlfast_ptr (see mmzone.h) */
 	zl = kmalloc(sizeof(struct zone *) * max, GFP_KERNEL);
 	if (!zl)
 		return NULL;
+	zl->zlfast_ptr = NULL;
 	num = 0;
 	/* First put in the highest zones from all nodes, then all the next 
 	   lower zones etc. Avoid empty zones because the memory allocator

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401


* Re: [RFC] another way to speed up fake numa node page_alloc
  2006-09-25  9:14 [RFC] another way to speed up fake numa node page_alloc Paul Jackson
@ 2006-09-26  6:08 ` David Rientjes
  2006-09-26  7:06   ` Paul Jackson
  2006-10-02  6:18 ` Paul Jackson
  1 sibling, 1 reply; 28+ messages in thread
From: David Rientjes @ 2006-09-26  6:08 UTC (permalink / raw)
  To: Paul Jackson
  Cc: linux-mm, akpm, Nick Piggin, Andi Kleen, mbligh, rohitseth,
	menage, clameter

On Mon, 25 Sep 2006, Paul Jackson wrote:

>  - Some per-node data in the struct zonelist is now modified frequently,
>    with no locking.  Multiple CPU cores on a node could hit and mangle
>    this data.  The theory is that this is just performance hint data,
>    and the memory allocator will work just fine despite any such mangling.
>    The fields at risk are the struct 'zonelist_faster' fields 'fullnodes'
>    (a nodemask_t) and 'last_full_zap' (unsigned long jiffies).  It should
>    all be self correcting after at most a one second delay.
>  

If there's mangling of 'last_full_zap' in the scenario with multiple CPUs 
on one node, that means that we might be clearing 'fullnodes' more often 
than every 1*HZ, and that clear is always done by one CPU.  Since the only 
purpose of the delay is to allow a certain period of time to go by where 
these hints will actually serve a purpose, this entire speed-up will 
then be degraded.  I agree that adding locking for 'zonelist_faster' is 
probably going too far in terms of performance hint data, but it seems 
necessary for 'last_full_zap' if the goal is to preserve this 1*HZ 
delay.

>  - I pay no attention to the various watermarks and such in this performance
>    hint.  A node could be marked full for one watermark, and then skipped
>    over when searching for a page using a different watermark.  I think
>    that's actually quite ok, as it will tend to slightly increase the
>    spreading of memory over other nodes, away from a memory stressed node.
> 

Since we currently lack support for dynamically allocating nodes with a 
node hotplug API, it actually seems advantageous to have a memory stressed 
node in a pool or cpuset of 'mems'.  Now when another cpuset is facing 
memory pressure I can cherry-pick an untouched node from a less bogged 
down cpuset for my own use.

It seems like an immutable time interval embedded in the page alloc code 
may not be the best way to measure when a full zap should occur.  A more 
appropriate metric might be to do a full zap after a certain threshold of 
pages have been freed.  If it's done that way, the zap would occur in a 
more appropriate place (when pages are freed) as opposed to when pages are 
allocated.  The overhead we incur by zapping the nodemask every 
second and then being forced to recheck all the nodes again would then be 
eliminated in the case where there's been no change.  Based on the 
benchmarks I ran earlier, that's a popular case.  It's more appropriate 
when we're freeing pages and we know for sure that we're getting memory 
somewhere.
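
Roughly what I have in mind, as an untested sketch (the field and
helper names here are hypothetical):

	/* replace the jiffies stamp with a count of pages freed */
	struct zonelist_faster {
		nodemask_t fullnodes;
		unsigned long frees_since_zap;	/* frees since last zap */
		unsigned short node_id[MAX_NUMNODES * MAX_NR_ZONES];
	};

	#define ZLF_FREE_THRESHOLD	1024	/* arbitrary; needs tuning */

	/* hooked into the page free path instead of testing jiffies */
	static void zlf_note_frees(struct zonelist_faster *zlf)
	{
		if (++zlf->frees_since_zap >= ZLF_FREE_THRESHOLD) {
			nodes_clear(zlf->fullnodes);
			zlf->frees_since_zap = 0;
		}
	}

The unlocked increment is racy in the same way 'last_full_zap' is, but
as with the original, a lost update only perturbs the threshold, not
correctness.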

Note to self: in 2.6.18-rc7-mm1, NUMA_BUILD is just a synonym for 
CONFIG_NUMA.  And since this and CONFIG_NUMA_EMU are defined by default on 
x86_64, we're going to have overhead on a single processor system.  In my 
earlier patch I started extracting a macro that could be tested against 
in generic kernel code to determine at least whether NUMA emulation was 
being _used_.  This might need to make a comeback if this type of 
implementation is considered later.

This is a creative solution, especially considering the use of a 
statically-sized zlfast_ptr to find zlfast hidden away in struct zonelist.  
This definitely seems to be headed in the right direction because it works 
in both the real NUMA case and the fake NUMA case.  I would really like to 
run benchmarks on this implementation as I have done for the others but I 
no longer have access to a 64-bit machine.  I don't see how it could cause 
a performance degradation in the non-NUMA case.

		David


* Re: [RFC] another way to speed up fake numa node page_alloc
  2006-09-26  6:08 ` David Rientjes
@ 2006-09-26  7:06   ` Paul Jackson
  2006-09-26 18:17     ` David Rientjes
  0 siblings, 1 reply; 28+ messages in thread
From: Paul Jackson @ 2006-09-26  7:06 UTC (permalink / raw)
  To: David Rientjes
  Cc: linux-mm, akpm, nickpiggin, ak, mbligh, rohitseth, menage, clameter

Thanks for reviewing this, David.

David wrote:
> If there's mangling of 'last_full_zap' in the scenario with multiple CPUs 
> on one node, that means that we might be clearing 'fullnodes' more often 
> than every 1*HZ, and that clear is always done by one CPU.  Since the only 
> purpose of the delay is to allow a certain period of time to go by where 
> these hints will actually serve a purpose, this entire speed-up will 
> then be degraded.  I agree that adding locking for 'zonelist_faster' is 
> probably going too far in terms of performance hint data, but it seems 
> necessary for 'last_full_zap' if the goal is to preserve this 1*HZ 
> delay.

I doubt it.  An occasional extra clearing of fullnodes seems quite
harmless to me.  I doubt it matters whether we zap fullnodes once per
second, or once per two seconds, or twice a second.  We're just dealing
with a single 64 bit word (a jiffies value), and it's a word that just
the few CPUs local to a single node are contending over.  On real 64 bit
systems, it may not even be possible to mangle it.

The goal is not to preserve a 1*HZ delay.  I just pulled that delay out
of some unspeakable place.

Roughly I wanted to throttle the rate of wasteful scans of already full
zones to some rate that was infrequent enough to solve our performance
problem, while still fast enough that no one would ever seriously
notice the subtle transient changes in memory placement behaviour.

> It seems like an immutable time interval embedded in the page alloc code 
> may not be the best way to measure when a full zap should occur.

Eh ... why not?  Sure, it's dirt simple.  But in this case, fancier
control of this interval seems like it risks spending more effort than
it would save, with almost no discernable advantage to the user.

If we already had the exact metric handy that we needed, so no more
code needed to be added to a hot path to maintain the metric (including
likely real locks, since most metrics don't like to be mangled by
code that takes a cavalier attitude to locking), then I might reconsider.

But I doubt that this use would justify adding a metric.

> This is a creative solution, 

thanks ..

> This definitely seems to be headed in the right direction because it works 
> in both the real NUMA case and the fake NUMA case.

I hope so.

> I would really like to 
> run benchmarks on this implementation as I have done for the others but I 
> no longer have access to a 64-bit machine. 

Odd ...  Do you expect that situation to be remedied anytime soon?

I'd like to see the results of your rerunning your benchmark.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401


* Re: [RFC] another way to speed up fake numa node page_alloc
  2006-09-26  7:06   ` Paul Jackson
@ 2006-09-26 18:17     ` David Rientjes
  2006-09-26 19:24       ` Paul Jackson
  0 siblings, 1 reply; 28+ messages in thread
From: David Rientjes @ 2006-09-26 18:17 UTC (permalink / raw)
  To: Paul Jackson
  Cc: linux-mm, akpm, nickpiggin, ak, mbligh, rohitseth, menage, clameter

On Tue, 26 Sep 2006, Paul Jackson wrote:

> The goal is not to preserve a 1*HZ delay.  I just pulled that delay out
> of some unspeakable place.
> 
> Roughly I wanted to throttle the rate of wasteful scans of already full
> zones to some rate that was infrequent enough to solve our performance
> problem, while still fast enough that no one would ever seriously
> notice the subtle transient changes in memory placement behaviour.
> 

Absolutely, I'm sure we'll see a performance enhancement with the 
get_page_from_freelist speedup even though I cannot run benchmarks myself.
Since one second was chosen as the time interval between zaps, however, 
that will not always be the case if there's mangling and one CPU on the 
node will be zapping it prematurely when the system is being stressed for 
page allocation.  This happens to be the case where the smaller time 
interval would be the most unfortunate.  Obviously a second is a long time 
to constantly be allocating more and more pages, so I guess what bothers 
me is that we're zapping information that we have no reason to believe 
is no longer accurate.

> Eh ... why not?  Sure, it's dirt simple.  But in this case, fancier
> control of this interval seems like it risks spending more effort than
> it would save, with almost no discernable advantage to the user.
> 

Because when we're stressing the system for more and more memory for a 
particular task regardless of whether it's starting or not, we're 
constantly allocating pages and zapping the nodemask about every second 
even though the status of each node could not have changed.  Those hints 
should be preserved rather than zapped, because we have not freed any 
pages over the same time interval; they are discarded only because an 
arbitrary clock tick came around.

When we free memory from a specific zone, why is it not better to use 
zone_to_nid and then zap only that _node_ in the nodemask, because we are 
guaranteed that its status has changed?
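
Something like the following in the free path is what I mean (untested
sketch, hypothetical helper name):

	/* pages came back to this zone, so its node may have room again;
	 * let the next allocation recheck it instead of waiting for the
	 * once-per-second zap */
	static void zlf_node_freed(struct zonelist *zonelist, struct zone *zone)
	{
		struct zonelist_faster *zlf = zonelist->zlfast_ptr;

		if (zlf)
			node_clear(zone_to_nid(zone), zlf->fullnodes);
	}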

> > I would really like to 
> > run benchmarks on this implementation as I have done for the others but I 
> > no longer have access to a 64-bit machine. 
> 
> Odd ...  Do you expect that situation to be remedied anytime soon?
> 
> I'd like to see the results of your rerunning your benchmark.
> 

I no longer have access to a 64-bit machine or my benchmarking script so 
unless they have relaxed the kernel hacking policies for undergrads back 
at my school, I doubt I can contribute in performing benchmarks.  Four 
people on the Cc list to this email, however, still have access to my 
script.

		David


* Re: [RFC] another way to speed up fake numa node page_alloc
  2006-09-26 18:17     ` David Rientjes
@ 2006-09-26 19:24       ` Paul Jackson
  2006-09-26 19:58         ` David Rientjes
  0 siblings, 1 reply; 28+ messages in thread
From: Paul Jackson @ 2006-09-26 19:24 UTC (permalink / raw)
  To: David Rientjes
  Cc: linux-mm, akpm, nickpiggin, ak, mbligh, rohitseth, menage, clameter

David wrote:
> This happens to be the case where the smaller time 
> interval would be the most unfortunate.

"most unfortunate" -- that phrase sounds overly dramatic to me.

So what if the average time between zaps is 0.9 seconds instead of 1.0
seconds?  More realistically, we are talking something like 0.99999
versus 1.00000 seconds, given that writing a 64 bit word on a 32 bit
arch offers only a tiny window for lost races.

Lost races that break things are unacceptable, even in tiny windows.

But lost races that just slightly nudge an already arbitrary and not
particularly fussy performance heuristic are not worth a single line
of code to avoid.

> When we free memory from a specific zone, why is it not better to use 
> zone_to_nid and then zap only that _node_ in the nodemask, because we are 
> guaranteed that its status has changed?

It might be better.  And it might not.  More likely, it would be an
immeasurable difference except on custom microbenchmarks designed to
highlight this difference one way or the other.

Less code is better, unless there is better reason than this for it.

And unless I locked the bit clear, I'd still have to occasionally zap
the entire nodemask.  Setting or clearing individual bits in a mask opens
a bigger critical section to races.  Eventually, after losing enough
such races, that nodemask would be suitable for donating a little bit of
entropy to the random number subsystem -- mush.

> Four people on the Cc list to this email, however, still have access to
> my script.

Perhaps you could ping them off-list, and see if they are in a position
to participate.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401


* Re: [RFC] another way to speed up fake numa node page_alloc
  2006-09-26 19:24       ` Paul Jackson
@ 2006-09-26 19:58         ` David Rientjes
  2006-09-26 21:48           ` Paul Jackson
  0 siblings, 1 reply; 28+ messages in thread
From: David Rientjes @ 2006-09-26 19:58 UTC (permalink / raw)
  To: Paul Jackson
  Cc: linux-mm, akpm, nickpiggin, ak, mbligh, rohitseth, menage, clameter

On Tue, 26 Sep 2006, Paul Jackson wrote:

> So what if the average time between zaps is 0.9 seconds instead of 1.0
> seconds?  More realistically, we are talking something like 0.99999
> versus 1.00000 seconds, given that writing a 64 bit word on a 32 bit
> arch offers only a tiny window for lost races.
> 
> Lost races that break things are unacceptable, even in tiny windows.
> 
> But lost races that just slightly nudge an already arbitrary and not
> particularly fussy performance heuristic are not worth a single line
> of code to avoid.
> 

Why is it arbitrary, though?  This is hard-coded into the page allocation 
code as the performance enhancement window upon which your code relies.  
If time is the metric to be used to determine when we should go 
back and see if nodes have gained more memory, and I disagree that it is, 
then surely this one second window cannot possibly achieve the most 
efficient results you can squeeze out of your implementation for all 
possible workloads.  In my opinion a more appropriate metric would be when 
we _know_ the amount of free memory in a zone has changed.  And if you're 
seeking a distributed amount of memory among mems as your original post 
specified, then you could even get away with a simple counter where the 
nodemask is zapped after X number of page allocations.  This would _not_ 
be susceptible to race conditions among multiple CPUs on one node.

> > When we free memory from a specific zone, why is it not better to use 
> > zone_to_nid and then zap only that _node_ in the nodemask, because we are 
> > guaranteed that its status has changed?
> 
> It might be better.  And it might not.  More likely, it would be an
> immeasurable difference except on custom microbenchmarks designed to
> highlight this difference one way or the other.
> 

If that's the case, then the entire speed-up is broken.  As it stands 
right now you're zapping the _entire_ nodemask every second and going back 
to rechecking all those that you failed to find free memory on in the 
past.  In my suggestion, you're only zapping a node when it is known that 
the free memory has changed (increased) based on a free.  So for my 
process that wants to mlock and allocate tons and tons of pages, you're 
zapping unnecessarily because the _exact_ same nodemask is going to 
reproduce itself but only after unnecessary delay.

> And unless I locked the bit clear, I'd still have to occasionally zap
> the entire nodemask.  Setting or clearing individual bits in a mask opens
> a bigger critical section to races.  Eventually, after losing enough
> such races, that nodemask would be suitable for donating a little bit of
> entropy to the random number subsystem -- mush.
> 

The only such race conditions that exist are among the CPUs on that 
particular node in this case and the node bit is only zapped when pages 
are freed from a zone on that node.  And since the node bit is only turned 
on when it has been passed by and deemed too full to allocate on, I don't 
see where the race exists.  It's what we want since we aren't sure whether 
the free has allowed us to allocate there again, all we are doing is 
saying that it should be rechecked on the next alloc.

> > Four people on the Cc list to this email, however, still have access to
> > my script.
> 
> Perhaps you could ping them off-list, and see if they are in a position
> to participate.
> 

Done.

		David


* Re: [RFC] another way to speed up fake numa node page_alloc
  2006-09-26 19:58         ` David Rientjes
@ 2006-09-26 21:48           ` Paul Jackson
  0 siblings, 0 replies; 28+ messages in thread
From: Paul Jackson @ 2006-09-26 21:48 UTC (permalink / raw)
  To: David Rientjes
  Cc: linux-mm, akpm, nickpiggin, ak, mbligh, rohitseth, menage, clameter

> Why is it arbitrary, though?

I was just trying to throttle the rate of futile zonelist scans.

In my implementation, the choice of 1*HZ for the zap time is obviously
an arbitrarily chosen time, within some acceptable range - right?

If you are asking why I didn't pick the non-arbitrary variant
implementation you suggested, wherein we clear individual node bits in
the nodemask of full nodes, anytime we free memory on that node, then I
did not do this because it was more code, and because it required a
lock to safely clear the bit, and because I had no particular reason to
think it would provide measurable improvement anyway.

I am quite happy coding stupid, simple, short and racey code, if it
looks to me like it will perform just as well, and be just as robust,
if not more so, than the more exact, longer, lock protected code.

> If that's the case, then the entire speed-up is broken. 

Are we looking at the same patch ;)?  My patch enables us to only have
to look closely at each full node once per second, instead of once per
page allocation.  That's the speedup.  That and the more rapid
application of the cpuset constraint in most cases.  The unallowed and
recently full nodes are skipped over on the first scan at the per-zone
cost of loading just a single unsigned short, from a compact array, plus
modest constant overhead per __alloc_pages call.

(My unit of cost here is 'cache line misses'.)

> And since the node bit is only turned 
> on when it has been passed by and deemed too full to allocate on, I don't 
> see where the race exists.

If two cpus on the same node each go to clear a (different) bit in the
nodemask at the same time, you could have each cpu load the mask, each
cpu compute a new mask, with its bit cleared, and each cpu store the
mask, all in that order.  Notice that the second cpu to store just
clobbered the bit clear done by the first cpu.
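
Spelled out as a timeline, assuming the mask update is a whole-word
read-modify-write rather than an atomic per-bit operation:

	fullnodes = 0b0110		/* nodes 1 and 2 marked full */

	CPU A (clears bit 1)		CPU B (clears bit 2)
	--------------------		--------------------
	load mask  -> 0b0110
					load mask  -> 0b0110
	clear bit  -> 0b0100
					clear bit  -> 0b0010
	store 0b0100
					store 0b0010

	fullnodes = 0b0010		/* A's clear of bit 1 is lost */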

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401


* Re: [RFC] another way to speed up fake numa node page_alloc
  2006-09-25  9:14 [RFC] another way to speed up fake numa node page_alloc Paul Jackson
  2006-09-26  6:08 ` David Rientjes
@ 2006-10-02  6:18 ` Paul Jackson
  2006-10-02  6:31   ` David Rientjes
  1 sibling, 1 reply; 28+ messages in thread
From: Paul Jackson @ 2006-10-02  6:18 UTC (permalink / raw)
  To: Paul Jackson
  Cc: linux-mm, akpm, nickpiggin, rientjes, ak, mbligh, rohitseth,
	menage, clameter

pj wrote:
+struct zonelist_faster {
+	nodemask_t fullnodes;		/* nodes recently lacking free memory */
+	unsigned long last_full_zap;	/* jiffies when fullnodes last zero'd */
+	unsigned short node_id[MAX_NUMNODES * MAX_NR_ZONES]; /* zone -> nid */
+};

This seems broken on systems with more than one zone per node.

If whichever zone comes first of the several zones on a node (the
several consecutive zones in the zonelist that evaluate to the same
node) ever gets full, then the other zones on that node will be
skipped over, because they would end up on a full node.  Once per
second, we will retry the first zone from that node, but if it is still
full, we would -still- skip over the remaining zones without looking at
them.  That is, these other zones wouldn't even get the courtesy of a
once per second consideration.

Only if every allowed node in the system is full will we actually
rescan the zonelist with this faster mechanism disabled and seriously
examine these other zones on such a node.

Perhaps instead of a single 'nodemask_t fullnodes', I need a small
array of these nodemasks, one per MAX_NR_ZONES.  Then I could select
which fullnodes nodemask to check by taking my index into the node_id[]
array, modulo MAX_NR_ZONES.
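
Roughly, as an untested sketch (whether the zonelist index modulo
MAX_NR_ZONES really lines up with the zone type depends on the order in
which the zonelist was built, so that part is an assumption to check):

	struct zonelist_faster {
		/* one mask per zone type, so a full Normal zone can't
		 * hide a still-usable DMA zone on the same node */
		nodemask_t fullnodes[MAX_NR_ZONES];
		unsigned long last_full_zap;
		unsigned short node_id[MAX_NUMNODES * MAX_NR_ZONES];
	};

	/* i is the zonelist index, z - zonelist->zones, as in the patch */
	static int zlf_node_full(struct zonelist_faster *zlf, int i)
	{
		return node_isset(zlf->node_id[i],
				  zlf->fullnodes[i % MAX_NR_ZONES]);
	}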

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401


* Re: [RFC] another way to speed up fake numa node page_alloc
  2006-10-02  6:18 ` Paul Jackson
@ 2006-10-02  6:31   ` David Rientjes
  2006-10-02  6:48     ` Paul Jackson
  0 siblings, 1 reply; 28+ messages in thread
From: David Rientjes @ 2006-10-02  6:31 UTC (permalink / raw)
  To: Paul Jackson
  Cc: linux-mm, akpm, nickpiggin, ak, mbligh, rohitseth, menage, clameter

On Sun, 1 Oct 2006, Paul Jackson wrote:
> Perhaps instead of a single 'nodemask_t fullnodes', I need a small
> array of these nodemasks, one per MAX_NR_ZONES.  Then I could select
> which fullnodes nodemask to check by taking my index into the node_id[]
> array, modulo MAX_NR_ZONES.
> 

It would be nice to be able to scale this so that the speed-up works 
efficiently for numa=fake=256 (after NODES_SHIFT is increased from 6 to 8 
on x86_64).

		David


* Re: [RFC] another way to speed up fake numa node page_alloc
  2006-10-02  6:31   ` David Rientjes
@ 2006-10-02  6:48     ` Paul Jackson
  2006-10-02  7:05       ` David Rientjes
  0 siblings, 1 reply; 28+ messages in thread
From: Paul Jackson @ 2006-10-02  6:48 UTC (permalink / raw)
  To: David Rientjes
  Cc: linux-mm, akpm, nickpiggin, ak, mbligh, rohitseth, menage, clameter

David wrote:
> It would be nice to be able to scale this so that the speed-up works 
> efficiently for numa=fake=256 (after NODES_SHIFT is increased from 6 to 8 
> on x86_64).

I'm not sure what you have in mind by "scale this."

We have a linear search of zones ... my speedup just changes the
constant multiplier, by converting that search from one that takes
one or two cache lines per node, to one that takes an unsigned
short, from compact array, per node.

This speedup should apply regardless of how many nodes (fake or
real or mixed) are present.

The fake node case is more interesting, because the usage pattern
it anticipates, in which many or even most of a long string of nodes
are full during ordinary operation, stresses this linear scan more.

But whatever benefit this proposal has should be independent of the
value of NODES_SHIFT.

The systems I care most about, ia64 sn2, are already running with a
default NODES_SHIFT of 10.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401


* Re: [RFC] another way to speed up fake numa node page_alloc
  2006-10-02  6:48     ` Paul Jackson
@ 2006-10-02  7:05       ` David Rientjes
  2006-10-02  8:41         ` Paul Jackson
  0 siblings, 1 reply; 28+ messages in thread
From: David Rientjes @ 2006-10-02  7:05 UTC (permalink / raw)
  To: Paul Jackson
  Cc: linux-mm, akpm, nickpiggin, ak, mbligh, rohitseth, menage, clameter

On Sun, 1 Oct 2006, Paul Jackson wrote:

> I'm not sure what you have in mind by "scale this."
> 

I'm talking about this:

+struct zonelist_faster {
+	nodemask_t fullnodes;		/* nodes recently lacking free memory */
+	unsigned long last_full_zap;	/* jiffies when fullnodes last zero'd */
+	unsigned short node_id[MAX_NUMNODES * MAX_NR_ZONES]; /* zone -> nid */
+};

With NODES_SHIFT equal to 10 as you recommend, you can't get away with an 
unsigned short there.  Likewise, your nodemask_t would need to be 128 
bytes.  So this doesn't scale appropriately when you simply change 
NODES_SHIFT.

> This speedup should apply regardless of how many nodes (fake or
> real or mixed) are present.
> 

It doesn't (see above).

> But whatever benefit this proposal has should be independent of the
> value of NODES_SHIFT.
> 

It's not (see above).

		David


* Re: [RFC] another way to speed up fake numa node page_alloc
  2006-10-02  7:05       ` David Rientjes
@ 2006-10-02  8:41         ` Paul Jackson
  2006-10-03 18:15           ` Paul Jackson
  0 siblings, 1 reply; 28+ messages in thread
From: Paul Jackson @ 2006-10-02  8:41 UTC (permalink / raw)
  To: David Rientjes
  Cc: linux-mm, akpm, nickpiggin, ak, mbligh, rohitseth, menage, clameter

David wrote:
> I'm talking about this:
> 
> +struct zonelist_faster {
> +	nodemask_t fullnodes;		/* nodes recently lacking free memory */
> +	unsigned long last_full_zap;	/* jiffies when fullnodes last zero'd */
> +	unsigned short node_id[MAX_NUMNODES * MAX_NR_ZONES]; /* zone -> nid */
> +};
> 
> With NODES_SHIFT equal to 10 as you recommend, you can't get away with an 
> unsigned short there. 


Apparently it's time for me to be a stupid git again.  That's ok; I'm
getting quite accustomed to it.

Could you spell out exactly why I can't get away with an unsigned short
node_id if NODES_SHIFT is 10?

I was thinking that limiting node_id to an unsigned short just meant
that we couldn't have more than 65536 nodes on the system.  That should
be enough, for a while anyway.

Indeed, given this line in include/linux/mempolicy.h:

	short            preferred_node;

I didn't even think I was being very original in this.


> Likewise, your nodemask_t would need to be 128 bytes.

Yes - big honkin NUMA iron calls for big nodemasks.  That's part of
why I spent the better part of a year driving Andrew to drink with
my cpumask/nodemask patches from hell.

Is there a problem with a 128 byte nodemask_t that I'm missing?

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401


* Re: [RFC] another way to speed up fake numa node page_alloc
  2006-10-02  8:41         ` Paul Jackson
@ 2006-10-03 18:15           ` Paul Jackson
  2006-10-03 19:37             ` David Rientjes
  0 siblings, 1 reply; 28+ messages in thread
From: Paul Jackson @ 2006-10-03 18:15 UTC (permalink / raw)
  To: Paul Jackson
  Cc: rientjes, linux-mm, akpm, nickpiggin, ak, mbligh, rohitseth,
	menage, clameter

pj, responding to David:
> > With NODES_SHIFT equal to 10 as you recommend, you can't get away with an 
> > unsigned short there. 
> 
> Apparently it's time for me to be a stupid git again.  That's ok; I'm
> getting quite accustomed to it.
> 
> Could you spell out exactly why I can't get away with an unsigned short
> node_id if NODES_SHIFT is 10?


Is this still in your queue to respond to, David?

I'm still curious as to why I can't get away with an unsigned short there.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401


* Re: [RFC] another way to speed up fake numa node page_alloc
  2006-10-03 18:15           ` Paul Jackson
@ 2006-10-03 19:37             ` David Rientjes
  2006-10-04 15:45               ` Paul Jackson
  0 siblings, 1 reply; 28+ messages in thread
From: David Rientjes @ 2006-10-03 19:37 UTC (permalink / raw)
  To: Paul Jackson
  Cc: linux-mm, akpm, nickpiggin, ak, mbligh, rohitseth, menage, clameter

On Tue, 3 Oct 2006, Paul Jackson wrote:

> pj, responding to David:
> > > With NODES_SHIFT equal to 10 as you recommend, you can't get away with an 
> > > unsigned short there. 
> > 
> > Apparently it's time for me to be a stupid git again.  That's ok; I'm
> > getting quite accustomed to it.
> > 
> > Could you spell out exactly why I can't get away with an unsigned short
> > node_id if NODES_SHIFT is 10?
> 
> 
> Is this still in your queue to respond to, David?
> 
> I'm still curious as to why I can't get away with an unsigned short there.
> 

Because it's unnecessary.  On my 4G machine with numa=fake=256, each of 
these node_id arrays is going to be 1.5K.  You could get away with the 
exact same behavior by using a u8 or unsigned char.  There's no reason 
to support anything greater than a shift of 8 since NUMA emulation is 
_only_ available on x86_64 and doesn't even work right as it stands in 
the current mainline: you couldn't boot my machine with anything more 
than numa=fake=8.

If you are going to abstract this functionality to other architectures or 
even generically, I would suggest following Magnus Damm's example and 
creating a NODES_SHIFT_HW instead that would limit the number of numa=fake 
nodes.  There is simply no reason for this to be greater than 8 (even a 
128G machine with numa=fake=256 would have nodes of only 512MB each).

Secondly, the entire node_id lookup is redundant on x86_64 in the first 
place (see arch/x86_64/mm/numa.c and include/asm-x86_64/mmzone.h for 
memnodemap).  The only thing that is being sped-up with your node_id array 
in each zonelist_faster is moving this calculation from two steps to one 
step; since the mainline implementation today are both inline functions I 
think the improvement is minimal.

		David


* Re: [RFC] another way to speed up fake numa node page_alloc
  2006-10-03 19:37             ` David Rientjes
@ 2006-10-04 15:45               ` Paul Jackson
  2006-10-04 16:11                 ` Christoph Lameter
  2006-10-04 22:10                 ` David Rientjes
  0 siblings, 2 replies; 28+ messages in thread
From: Paul Jackson @ 2006-10-04 15:45 UTC (permalink / raw)
  To: David Rientjes
  Cc: linux-mm, akpm, nickpiggin, ak, mbligh, rohitseth, menage, clameter

David responding to pj:
> > I'm still curious as to why I can't get away with an unsigned short there.
> > 
> 
> Because it's unnecessary.  On my 4G machine with numa=fake=256, each of 
> these node_id arrays is going to be 1.5K.  You could get away with the 
> exact same behavior with using a u8 or unsigned char.

Are you trying to tell me that the reason I can NOT get away with
u16 is because I CAN get away with u16, but u8 would be better?

This makes no bleeping sense ...

Not to mention that I obviously can NOT get away with u8, as I already
have 1024 real nodes on some systems.

> If you are going to abstract this functionality to other architectures or 
> even generically

Yes - I am trying to generalize whatever code changes we make to
get_page_from_freelist() to be at least neutral for all arch's, and
to benefit at least systems with large counts of nodes, real or fake.

> I would suggest following Magnus Damm's example ...

I don't know what example you mean - please provide a pointer.

> The only thing that is being sped-up with your node_id array 
> in each zonelist_faster is moving this calculation from two steps to one 
> step; since the mainline implementation today are both inline functions I 
> think the improvement is minimal.

No - I'm optimizing cache line misses, not classic algorithmic
complexity or number of function calls.  Scanning say 256 zones
with the existing kernel code uses 256 or 512 cache lines, at
one or two per zone.  Scanning 256 zones with my zonelist_faster
patch uses however many cache lines it takes to hold 512 consecutive
bytes of memory, which is much fewer.
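
In round numbers, assuming 64-byte cache lines and the two-byte
node_id entries:

	existing scan:	256 zones * 1-2 lines each = 256-512 cache lines
	node_id scan:	256 zones * 2 bytes = 512 bytes = 8 cache lines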

Hopefully Martin can get us some real numbers.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401


* Re: [RFC] another way to speed up fake numa node page_alloc
  2006-10-04 15:45               ` Paul Jackson
@ 2006-10-04 16:11                 ` Christoph Lameter
  2006-10-04 22:10                 ` David Rientjes
  1 sibling, 0 replies; 28+ messages in thread
From: Christoph Lameter @ 2006-10-04 16:11 UTC (permalink / raw)
  To: Paul Jackson
  Cc: David Rientjes, linux-mm, akpm, nickpiggin, ak, mbligh,
	rohitseth, menage

On Wed, 4 Oct 2006, Paul Jackson wrote:

> Not to mention that I obviously can NOT get away with u8, as I already
> have 1024 real nodes on some systems.

Well lets make clear that we are talking about general NUMA enhancements 
here despite the subject line.


* Re: [RFC] another way to speed up fake numa node page_alloc
  2006-10-04 15:45               ` Paul Jackson
  2006-10-04 16:11                 ` Christoph Lameter
@ 2006-10-04 22:10                 ` David Rientjes
  2006-10-05  2:27                   ` Paul Jackson
  1 sibling, 1 reply; 28+ messages in thread
From: David Rientjes @ 2006-10-04 22:10 UTC (permalink / raw)
  To: Paul Jackson
  Cc: linux-mm, akpm, nickpiggin, ak, mbligh, rohitseth, menage, clameter

On Wed, 4 Oct 2006, Paul Jackson wrote:

> Are you trying to tell me that the reason I can NOT get away with
> u16 is because I CAN get away with u16, but u8 would be better?
> 
> This makes no bleeping sense ...
> 
> Not to mention that I obviously can NOT get away with u8, as I already
> have 1024 real nodes on some systems.
> 

Isn't this the exact behavior that ordered zonelists are supposed to solve 
for real NUMA systems?  Has there been an _observed_ case where the cost 
to scan the zonelists was considered excessive on real NUMA systems?  If 
not, then this implementation is simply adding more (and unnecessary) 
complexity because now there are two strategies for determining the zones to 
check on every get_page_from_freelist and one of the major reasons we 
order zonelists in the first place is to deal with NUMA.

> Yes - I am trying to generalize whatever code changes we make to
> get_page_from_freelist() to be at least neutral for all arch's, and
> to benefit at least systems with large counts of nodes, real or fake.
> 

I was under the impression that there was nothing wrong with the way 
current real NUMA systems allocate pages.  If not, please point me to the 
thread that _specifically_ discusses this with _data_ that shows it's 
inefficient.  In fact, when this thread started you recommended as little 
changes as possible to the code to not interfere with what already works.  
I suggest that if changes are going to be made to page allocation on fake 
AND real NUMA setups, you provide convincing data that they do indeed 
improve the efficiency of such an algorithm; thus far the only test I 
have seen you solicit is that of the fake case.

> > I would suggest following Magnus Damm's example ...
> 
> I don't know what example you mean - please provide a pointer.
> 

It was the same example that I posted in the other thread which caused you 
to add Magnus to the Cc.

http://marc.theaimsgroup.com/?l=linux-mm&m=113161386520342

If you read the thread this time, you'll notice that Andi Kleen's original 
objection to abstracting this generically was because he felt it was a 
debugger hack and didn't deserve the attention.  But as more and more 
discussion has taken place on the viability of using NUMA emulation in 
conjunction with cpusets for the purpose of resource management, perhaps 
he has relaxed that objection.

		David


* Re: [RFC] another way to speed up fake numa node page_alloc
  2006-10-04 22:10                 ` David Rientjes
@ 2006-10-05  2:27                   ` Paul Jackson
  2006-10-05  2:37                     ` David Rientjes
  2006-10-11  3:42                     ` Paul Jackson
  0 siblings, 2 replies; 28+ messages in thread
From: Paul Jackson @ 2006-10-05  2:27 UTC (permalink / raw)
  To: David Rientjes
  Cc: linux-mm, akpm, nickpiggin, ak, mbligh, rohitseth, menage, clameter

> Isn't this the exact behavior that ordered zonelists are supposed to solve 
> for real NUMA systems?  Has there been an _observed_ case where the cost 
> to scan the zonelists was considered excessive on real NUMA systems?

Well ... the good news is I understood your comments this time.

I guess I should be happy it only took about 3 iterations.

Historically the ordered zonelists addressed the situation where one
almost always found free memory near the front of the ordered zonelist.

Yes, you are correct that I originally didn't think we had a problem
with real numa zonelist scans.


Three days ago, when I introduced this alternative patch that started
this current thread, I changed my position, stating at that time:
>
> There are two reasons I persued this alternative:
> 
>  1) Contrary to what I said before, we (SGI, on large ia64 sn2 systems)
>     have seen real customer loads where the cost to scan the zonelist
>     was a problem, due to many nodes being full of memory before
>     we got to a node we could use.  Or at least, I think we have.
>     This was related to me by another engineer, based on experiences
>     from some time past.  So this is not guaranteed.  Most likely, though.
> 
>     The following approach should help such real numa systems just as
>     much as it helps fake numa systems, or any combination thereof.
>     
>  2) The effort to distinguish fake from real numa, using node_distance,
>     so that we could cache a fake numa node and optimize choosing
>     it over equivalent distance fake nodes, while continuing to
>     properly scan all real nodes in distance order, was going to
>     require a nasty blob of zonelist and node distance munging.
> 
>     The following approach has no new dependency on node distances or
>     zone sorting.


David wrote:
> I was under the impression that there was nothing wrong with the way 
> current real NUMA systems allocate pages.  If not, please point me to the 
> thread that _specifically_ discusses this with _data_ that shows it's 
> inefficient.

See above.  I don't have data, so cannot justify going far out of our
way.

If someone has a better way to skin this fake numa cat, that does not
benefit (or harm) real numa, that would still be worth careful
consideration.


> In fact, when this thread started you recommended as few 
> changes as possible to the code so as not to interfere with what 
> already works.

Yes, I did start with that recommendation.  See above.

And see above for my current reasons for pursuing this patch.

Some more things I like about this patch:
 * Conceptually, it is very localized, making no changes to the
   larger code or data structure, just adding a cache of some
   hot data.
 * Further, it makes few assumptions about the larger scheme of
   things.
 * It has no dependencies on zonelist sorting, node distances,
   fake vs real numa nodes or any of that.
 * It makes no discernible difference in the memory placement
   behaviour of a system.

Downside - it's still a linear zonelist scan, and it's a cache bolted on
the side of things, rather than an inherently fast algorithm and data
structure.
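
To make "a cache of some hot data" concrete, its shape is roughly
this (illustrative names only, not the actual identifiers in the
patch):

	/*
	 * Sketch: a small cache bolted onto the zonelist.  The
	 * bitmap records nodes recently found short of free pages;
	 * the timestamp lets it be cleared about once a second, so
	 * stale "full" hints age out quickly.
	 */
	struct zonelist_cache_sketch {
		nodemask_t	full_nodes;	/* nodes recently found full */
		unsigned long	last_full_zap;	/* jiffies of last clear */
	};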

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401


* Re: [RFC] another way to speed up fake numa node page_alloc
  2006-10-05  2:27                   ` Paul Jackson
@ 2006-10-05  2:37                     ` David Rientjes
  2006-10-05  2:53                       ` Paul Jackson
  2006-10-11  3:42                     ` Paul Jackson
  1 sibling, 1 reply; 28+ messages in thread
From: David Rientjes @ 2006-10-05  2:37 UTC (permalink / raw)
  To: Paul Jackson
  Cc: linux-mm, akpm, nickpiggin, ak, mbligh, rohitseth, menage, clameter

On Wed, 4 Oct 2006, Paul Jackson wrote:

> > There are two reasons I persued this alternative:
> > 
> >  1) Contrary to what I said before, we (SGI, on large ia64 sn2 systems)
> >     have seen real customer loads where the cost to scan the zonelist
> >     was a problem, due to many nodes being full of memory before
> >     we got to a node we could use.  Or at least, I think we have.
> >     This was related to me by another engineer, based on experiences
> >     from some time past.  So this is not guaranteed.  Most likely, though.
> > 
> >     The following approach should help such real numa systems just as
> >     much as it helps fake numa systems, or any combination thereof.
> >     
> >  2) The effort to distinguish fake from real numa, using node_distance,
> >     so that we could cache a fake numa node and optimize choosing
> >     it over equivalent distance fake nodes, while continuing to
> >     properly scan all real nodes in distance order, was going to
> >     require a nasty blob of zonelist and node distance munging.
> > 
> >     The following approach has no new dependency on node distances or
> >     zone sorting.
> 
> 
> David wrote:
> > I was under the impression that there was nothing wrong with the way 
> > current real NUMA systems allocate pages.  If not, please point me to the 
> > thread that _specifically_ discusses this with _data_ that shows it's 
> > inefficient.
> 
> See above.  I don't have data, so cannot justify going far out of our
> way.
> 

I've never seen the zonelist ordering pose a problem on real NUMA systems, 
especially to the degree where any non-trivial speedup could be suggested.  
So I was curious as to whether this has ever been seen in practice with a 
sufficiently large workload and a considerable number of nodes.

> If someone has a better way to skin this fake numa cat, that does not
> benefit (or harm) real numa, that would still be worth careful
> consideration.
> 

Well, if it turns out that there is really no trouble with the real NUMA 
case (and I suspect that there isn't), then your speed-up could definitely 
be used only for the fake case.  The only change that would be required is 
to abstract a macro that tests whether NUMA emulation was actually 
configured at boot time, instead of just NUMA_BUILD.  That's a trivial 
change, so once data is presented showing that this speeds up page 
allocation, it would be very nice to see this implemented even if it 
doesn't do anything for real NUMA.  I'm a big fan of it for the fake case.
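
To show the shape of that macro (numa_emulation_active is a made-up
flag here, not an existing symbol; the emulation setup code would set
it once at boot):

	/* Hypothetical: set during boot by the NUMA emulation setup. */
	extern int numa_emulation_active;

	/*
	 * Gate the zonelist cache on emulation actually being active,
	 * not merely on this being a NUMA-capable build.
	 */
	#define zonelist_cache_active() \
		(NUMA_BUILD && numa_emulation_active)

The cache-consulting code would then test zonelist_cache_active() 
where it now tests NUMA_BUILD.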

>  * It has no dependencies on zonelist sorting, node distances,
>    fake vs real numa nodes or any of that.

Yes, it is nice that no change to __node_distance needs to be made so 
there's no chance of the srat warning coming around later and causing 
trouble.

		David


* Re: [RFC] another way to speed up fake numa node page_alloc
  2006-10-05  2:37                     ` David Rientjes
@ 2006-10-05  2:53                       ` Paul Jackson
  2006-10-05  3:00                         ` David Rientjes
  0 siblings, 1 reply; 28+ messages in thread
From: Paul Jackson @ 2006-10-05  2:53 UTC (permalink / raw)
  To: David Rientjes
  Cc: linux-mm, akpm, nickpiggin, ak, mbligh, rohitseth, menage, clameter

David wrote:
> The only change that would be required is 
> to abstract a macro that tests whether NUMA emulation was actually 
> configured at boot time, instead of just NUMA_BUILD.

Why add any logic to avoid this zonelist caching on systems not using
numa emulation?

Leaving this zonelist caching enabled all the time:
 1) improves test coverage of it, and
 2) benefits those real numa systems that might have
    long zonelist scans in the future.

My experience with my current customer base using cpusets is almost
entirely with HPC (High Performance Computing) apps, which usually
manage their memory layout very closely.  These workloads would tend to
have very short zonelist scans and benefit little from this speed up.

As cpusets gets wider use on more varied workloads, I would expect
that some of these varied workloads would stress the zonelist scanning
more.

And there's still a pretty good chance, though I can't document it,
that we've already seen performance problems, even on existing HPC
workloads, with this zonelist scan.

So ... I ask again ... why avoid this speed up on systems not emulating
nodes?

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401


* Re: [RFC] another way to speed up fake numa node page_alloc
  2006-10-05  2:53                       ` Paul Jackson
@ 2006-10-05  3:00                         ` David Rientjes
  2006-10-05  3:26                           ` Paul Jackson
  0 siblings, 1 reply; 28+ messages in thread
From: David Rientjes @ 2006-10-05  3:00 UTC (permalink / raw)
  To: Paul Jackson
  Cc: linux-mm, akpm, nickpiggin, ak, mbligh, rohitseth, menage, clameter

On Wed, 4 Oct 2006, Paul Jackson wrote:

> So ... I ask again ... why avoid this speed up on systems not emulating
> nodes?
> 

Aren't we back in the case where zonelist ordering should be good enough 
that there's no performance enhancement from the speed up on systems 
not emulating nodes?  I'm just curious why there's so much naysaying 
about ordering the zonelists, which has worked well in the past, and why 
that mentality has suddenly changed with no data to support it.

[ And going about proving that it's beneficial even for something like a
  dual-core 64-bit setup with UMA is easy and can be done at any time
  (as long as you have a 64-bit machine, which I don't anymore).  So
  let's see the data. ]

		David


* Re: [RFC] another way to speed up fake numa node page_alloc
  2006-10-05  3:00                         ` David Rientjes
@ 2006-10-05  3:26                           ` Paul Jackson
  2006-10-05  3:49                             ` David Rientjes
  0 siblings, 1 reply; 28+ messages in thread
From: Paul Jackson @ 2006-10-05  3:26 UTC (permalink / raw)
  To: David Rientjes
  Cc: linux-mm, akpm, nickpiggin, ak, mbligh, rohitseth, menage, clameter

I don't think you answered my question.

I am suggesting we leave it enabled, and I said why.

You are suggesting we disable it unless numa nodes are being emulated.

  Why?  What benefit is there to disabling it at runtime?

And, no, I can't provide data.  It depends on how the system is set up 
and used.

If someone has a system with many nodes (say 64, such as in your fake
numa tests) and a cpuset configuration and workload that loads many of
those nodes, forcing long zonelist scans, they will hit it just like
your tests did.

The real question is how common such systems, configurations and
workloads really are.

No amount of micro-benchmarking can answer that question.

Micro-benchmarks are of limited use in making design choices, except
when they are validated against real world workloads.

And as to why my position changed as to whether the zonelist scans
were ever a performance issue on real numa, I've already answered that
question ... a couple of times.  Let me know if you need me to repeat
this answer a third time.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401


* Re: [RFC] another way to speed up fake numa node page_alloc
  2006-10-05  3:26                           ` Paul Jackson
@ 2006-10-05  3:49                             ` David Rientjes
  2006-10-05  4:07                               ` Andrew Morton
  0 siblings, 1 reply; 28+ messages in thread
From: David Rientjes @ 2006-10-05  3:49 UTC (permalink / raw)
  To: Paul Jackson
  Cc: linux-mm, akpm, nickpiggin, ak, mbligh, rohitseth, menage, clameter

On Wed, 4 Oct 2006, Paul Jackson wrote:

> And as to why my position changed as to whether the zonelist scans
> were ever a performance issue on real numa, I've already answered that
> question ... a couple of times.  Let me know if you need me to repeat
> this answer a third time.
> 

No, what I need repeated a third time is why changes are being made 
without data to support it, especially to something like 
get_page_from_freelist that has never been complained about on real NUMA 
setups.  Second, what I need repeated a third time is why changes are 
being made to the real NUMA case without data to show it's a problem in 
the first place.  This is a scientific process where we can experiment and 
then collect data and analyze it to see what went right and what went 
wrong.  I'm a big supporter of making changes when you have a feeling that 
it will make a difference because oftentimes the experiments will prove 
that it did.  But I'm not a big supporter of saying "the real NUMA case 
being slow was mentioned to me in passing once, I've never witnessed it, 
I can't describe how to test it, and I have nothing to compare it to, so 
let's add more code because it can't make it worse."

So I really don't see what the point of debating the issue is when any 
number of tests could either prove or disprove this and those tests don't 
need to be run by Rohit on a fake NUMA setup.  You have a NUMA setup with 
1024 nodes, so let's see ANY workload IN ANY CIRCUMSTANCE where the HARD 
DATA shows that it improves the case.  Theory is great for discussion, but 
real numbers actually make the case.

[ And when I return to Seattle from east LA and try to squeeze a 64-bit
  machine out of my school, even as a lowly undergrad, I'm looking forward
  to patching your patch so that it zaps the nodemask _only_ on frees and
  showing that it works better in every scenario that I can think of. ]

		David


* Re: [RFC] another way to speed up fake numa node page_alloc
  2006-10-05  3:49                             ` David Rientjes
@ 2006-10-05  4:07                               ` Andrew Morton
  2006-10-05  4:14                                 ` Paul Jackson
  2006-10-05  4:50                                 ` David Rientjes
  0 siblings, 2 replies; 28+ messages in thread
From: Andrew Morton @ 2006-10-05  4:07 UTC (permalink / raw)
  To: David Rientjes
  Cc: Paul Jackson, linux-mm, nickpiggin, ak, mbligh, rohitseth,
	menage, clameter

On Wed, 4 Oct 2006 20:49:20 -0700 (PDT)
David Rientjes <rientjes@cs.washington.edu> wrote:

> On Wed, 4 Oct 2006, Paul Jackson wrote:
> 
> > And as to why my position changed as to whether the zonelist scans
> > were ever a performance issue on real numa, I've already answered that
> > question ... a couple of times.  Let me know if you need me to repeat
> > this answer a third time.
> > 
> 
> No, what I need repeated a third time is why changes are being made 
> without data to support it, especially to something like 
> get_page_from_freelist that has never been complained about on real NUMA 
> setups.  Second, what I need repeated a third time is why changes are 
> being made to the real NUMA case without data to show it's a problem in 
> the first place.  This is a scientific process where we can experiment and 
> then collect data and analyze it to see what went right and what went 
> wrong.  I'm a big supporter of making changes when you have a feeling that 
> it will make a difference because oftentimes the experiments will prove 
> that it did.  But I'm not a big supporter of saying "the real NUMA case 
> being slow was mentioned to me in passing once, I've never witnessed it, 
> I can't describe how to test it, and I have nothing to compare it to, so 
> let's add more code because it can't make it worse."

We do that sort of thing all the time ;)

It's sometimes OK to rely on common sense and not require benchmark results
or in-field observations for everything.

Or one can concoct artificial microbenchmarks, measure the impact and then
use plain old brainpower to decide whether anyone is ever likely to want to
do anything in real life which is approximately modelled by that benchmark.

The latter is the case here and I'd say the answer is "yes".  People might
be impacted by this in real life.

For example: my HPC job needs lots of memory.  An amount of memory which
requires 100 nodes to be allocated to it.  Some other user is in a similar
situation and needs 50 nodes.  (I think.  This sounds _too_ likely, so
there's perhaps some well-established way to prevent it?)

> So I really don't see what the point of debating the issue is when any 
> number of tests could either prove or disprove this and those tests don't 
> need to be run by Rohit on a fake NUMA setup.  You have a NUMA setup with 
> 1024 nodes, so let's see ANY workload IN ANY CIRCUMSTANCE where the HARD 
> DATA shows that it improves the case.  Theory is great for discussion, but 
> real numbers actually make the case.

But we know without even testing it that an Altix can be made to run like
crap by forcing a workload to walk 100 nodes to allocate each page. 
That'll be 99 off-node accesses to the zone structures per page too, I
think.
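
Roughly, because the fallback scan looks something like this (a
simplified sketch, not the actual get_page_from_freelist();
enough_free_pages() and alloc_from() stand in for the real watermark
check and allocation):

	int i;

	for (i = 0; zonelist->zones[i] != NULL; i++) {
		struct zone *z = zonelist->zones[i];

		/*
		 * The watermark check reads fields of a struct zone
		 * that lives on that zone's own node, so every full
		 * node walked past costs a cross-fabric read.
		 */
		if (enough_free_pages(z))
			return alloc_from(z);
	}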


* Re: [RFC] another way to speed up fake numa node page_alloc
  2006-10-05  4:07                               ` Andrew Morton
@ 2006-10-05  4:14                                 ` Paul Jackson
  2006-10-05  4:50                                 ` David Rientjes
  1 sibling, 0 replies; 28+ messages in thread
From: Paul Jackson @ 2006-10-05  4:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: rientjes, linux-mm, nickpiggin, ak, mbligh, rohitseth, menage, clameter

Andrew wrote:
> We do that sort of thing all the time ;)

Pretty much every line of code in kernel/cpuset.c
was done this way ;).

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401


* Re: [RFC] another way to speed up fake numa node page_alloc
  2006-10-05  4:07                               ` Andrew Morton
  2006-10-05  4:14                                 ` Paul Jackson
@ 2006-10-05  4:50                                 ` David Rientjes
  2006-10-05  4:53                                   ` Paul Jackson
  1 sibling, 1 reply; 28+ messages in thread
From: David Rientjes @ 2006-10-05  4:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Paul Jackson, linux-mm, nickpiggin, ak, mbligh, rohitseth,
	menage, clameter

On Wed, 4 Oct 2006, Andrew Morton wrote:

> We do that sort of thing all the time ;)
> 
> It's sometimes OK to rely on common sense and not require benchmark results
> or in-field observations for everything.
> 
> Or one can concoct artificial microbenchmarks, measure the impact and then
> use plain old brainpower to decide whether anyone is ever likely to want to
> do anything in real life which is approximately modelled by that benchmark.
> 
> The latter is the case here and I'd say the answer is "yes".  People might
> be impacted by this in real life.
> 

Ah, so it's ok to ask for benchmarks in the fake case, which _nobody_ 
uses, but benchmarks in the real case, which a lot of people use, are 
unnecessary.

The funny thing is that it's not going to make the real case more 
efficient at all if you follow real-world examples.  Usually memory is 
going to be found in the first zone anyway and when it's not it's going to 
be found next.  This is, after all, why the zone ordering has worked and 
nobody has had a problem with it.  (Not to mention you're clearing the 
nodemask every second anyway.)  I was hoping this would be evident in the 
real case if you'd just run the code on your 1024-node setup.  I guess 
that will be realized later.

		David


* Re: [RFC] another way to speed up fake numa node page_alloc
  2006-10-05  4:50                                 ` David Rientjes
@ 2006-10-05  4:53                                   ` Paul Jackson
  0 siblings, 0 replies; 28+ messages in thread
From: Paul Jackson @ 2006-10-05  4:53 UTC (permalink / raw)
  To: David Rientjes
  Cc: akpm, linux-mm, nickpiggin, ak, mbligh, rohitseth, menage, clameter

> Usually memory is 
> going to be found in the first zone anyway and when it's not it's going to 
> be found next.

Agreed.  Usually.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401


* Re: [RFC] another way to speed up fake numa node page_alloc
  2006-10-05  2:27                   ` Paul Jackson
  2006-10-05  2:37                     ` David Rientjes
@ 2006-10-11  3:42                     ` Paul Jackson
  1 sibling, 0 replies; 28+ messages in thread
From: Paul Jackson @ 2006-10-11  3:42 UTC (permalink / raw)
  To: Paul Jackson
  Cc: rientjes, linux-mm, akpm, nickpiggin, ak, mbligh, rohitseth,
	menage, clameter

A week ago, I wrote, of my zonelist caching patch:
>
> Downside - it's still a linear zonelist scan

Actually, not quite so, in the terms that matter on real NUMA hardware.

On real NUMA hardware, there are two memory costs of interest:

 1) the usual cost to hit main (node local) memory, also known as a
    cache line miss, and

 2) the higher cost to hit some other node's memory, for something the
    other node just updated, so you really have to go across the NUMA
    fabric to get it.

My zonelist caching shrinks (1) to just a few cache lines, but more
importantly (for real NUMA hardware) reduces (2) to essentially a
constant that no longer grows linearly with the number of nodes.

When one node is looking for free memory on a list of other nodes, the
page allocator no longer relies on -any- live information from the
nodes it skips over.  It is usually able to get a page from the very
first node that it tries.  It is able to skip over likely full nodes
using only locally stored and available information from the node local
zonelist cache.

So in the unit of measure that matters most to NUMA systems, (2) above,
this zonelist caching -is- very close to constant time, for workloads
presenting sufficiently high page allocation request rates.
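
In code terms, the skip step is roughly this (illustrative names
again; cache_zone_to_node() stands in for a node-local lookup of which
node each zone in the list belongs to):

	/* Clear stale "full" hints about once a second. */
	if (time_after(jiffies, cache->last_full_zap + HZ)) {
		nodes_clear(cache->full_nodes);
		cache->last_full_zap = jiffies;
	}

	for (i = 0; zonelist->zones[i] != NULL; i++) {
		int nid = cache_zone_to_node(cache, i);

		/*
		 * Both lookups here touch only node-local memory, so
		 * skipping a likely-full node costs no access to that
		 * node's memory at all.
		 */
		if (node_isset(nid, cache->full_nodes))
			continue;

		/* ... try zonelist->zones[i]; if it turns out to be
		 * full, node_set(nid, cache->full_nodes) and keep
		 * scanning ... */
	}

That is why, measured in units of (2), the cost stays essentially 
flat as the node count grows.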

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401


Thread overview: 28+ messages
2006-09-25  9:14 [RFC] another way to speed up fake numa node page_alloc Paul Jackson
2006-09-26  6:08 ` David Rientjes
2006-09-26  7:06   ` Paul Jackson
2006-09-26 18:17     ` David Rientjes
2006-09-26 19:24       ` Paul Jackson
2006-09-26 19:58         ` David Rientjes
2006-09-26 21:48           ` Paul Jackson
2006-10-02  6:18 ` Paul Jackson
2006-10-02  6:31   ` David Rientjes
2006-10-02  6:48     ` Paul Jackson
2006-10-02  7:05       ` David Rientjes
2006-10-02  8:41         ` Paul Jackson
2006-10-03 18:15           ` Paul Jackson
2006-10-03 19:37             ` David Rientjes
2006-10-04 15:45               ` Paul Jackson
2006-10-04 16:11                 ` Christoph Lameter
2006-10-04 22:10                 ` David Rientjes
2006-10-05  2:27                   ` Paul Jackson
2006-10-05  2:37                     ` David Rientjes
2006-10-05  2:53                       ` Paul Jackson
2006-10-05  3:00                         ` David Rientjes
2006-10-05  3:26                           ` Paul Jackson
2006-10-05  3:49                             ` David Rientjes
2006-10-05  4:07                               ` Andrew Morton
2006-10-05  4:14                                 ` Paul Jackson
2006-10-05  4:50                                 ` David Rientjes
2006-10-05  4:53                                   ` Paul Jackson
2006-10-11  3:42                     ` Paul Jackson
