* [PATCH/RFC] AutoPage Migration - V0.1 - 0/8 Overview
@ 2006-03-10 19:33 Lee Schermerhorn
  2006-03-11  6:41 ` KAMEZAWA Hiroyuki
  2006-03-14 22:27 ` Lee Schermerhorn
  0 siblings, 2 replies; 14+ messages in thread
From: Lee Schermerhorn @ 2006-03-10 19:33 UTC (permalink / raw)
  To: linux-mm

AutoPage Migration - V0.1 - 0/8 Overview

We have seen some workloads suffer decreases in performance on NUMA
platforms when the Linux scheduler moves the tasks away from their initial
memory footprint.  Some users--e.g., HPC--are motivated by this to go to
great lengths to ensure that tasks start up and stay on specific nodes.
2.6.16 includes memory migration mechanisms that will allow these users
to move memory along with their tasks--either manually or under control
of a load scheduling program--in response to changing demands on the
resources.

Other users--e.g., "Enterprise" applications--would prefer that the system
just "do the right thing" in this respect.  One possible approach would
be to have the system automatically migrate a task's pages when it decides
to move the task to a node other than the one where it has executed in the
past.  One can debate [and we DO, at length] whether this would improve
performance or not.  But, why not provide a patch and measure the
effects for various policies?  I.e., "show me the code."

So, ....

This series of patches hooks up linux 2.6.16 direct page migration to the
task scheduler.  The effect is that, when load balancing moves a task to a
cpu on a different node from the one where the task last executed, the task
is notified of the change using the same mechanism used to notify a task of
pending signals.  When the task returns to user state, it attempts to
migrate to the new node any pages, in its vm areas under default policy,
that are not already on that node.

This behavior is disabled by default, but can be enabled by writing non-
zero to /sys/kernel/migration/sched_migrate_memory.  [Could call this
"auto_migrate_memory"?]  Furthermore, to prevent thrashing, a second
sysctl, sched_migrate_interval, has been implemented.  The load balancer
will not move a task to a different node if it has moved to a new node
in the last sched_migrate_interval seconds.  [The user interface is in
seconds; internally it's in HZ.]  The idea is to give the task time to
amortize the cost of the migration by giving it time to benefit from
local references to its pages.

The controls, enable/disable and interval, will enable performance testing
of this mechanism to help decide whether it is worth inclusion.

The Patches:

Patches 01-05 apply directly to 2.6.16-rc5-git11.  However, they should
also apply on top of the previously posted "migrate-on-fault" patches
with some fuzz/offsets.  Patch 06 requires that the migrate-on-fault
patches be applied first.

automigrate-01-add-migrate_task_memory.patch

	This patch adds the function migrate_task_memory() to mempolicy.c
	to migrate vmas with default policy to the new node.  A second
	helper function, migrate_vma_to_node(), does the actual work of
	scanning the vma's address range [check_range] and invoking the
	existing [in 2.6.16-rc*] migrate_pages_to() function for a non-
	empty pagelist.  A rough sketch follows the note below.

	Note that this mechanism uses non-aggressive migration--i.e.,
	MPOL_MF_MOVE rather than MPOL_MF_MOVE_ALL.  Therefore, it gives
	up rather easily.  E.g., anon pages still shared, copy-on-write,
	between ancestors and descendants will not be migrated.
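
	Roughly--this is a sketch only, not the actual patch; the exact
	2.6.16-rc* signatures of check_range() and migrate_pages_to() may
	differ slightly, and error handling is omitted:

		static void migrate_vma_to_node(struct vm_area_struct *vma,
						int dest_node)
		{
			nodemask_t nmask;
			LIST_HEAD(pagelist);

			nodes_clear(nmask);
			node_set(dest_node, nmask);

			/* gather eligible pages not already on dest_node */
			check_range(vma->vm_mm, vma->vm_start, vma->vm_end,
					&nmask, MPOL_MF_MOVE | MPOL_MF_INVERT,
					&pagelist);

			if (!list_empty(&pagelist))
				migrate_pages_to(&pagelist, vma, dest_node);
		}

		void migrate_task_memory(struct task_struct *tsk, int dest_node)
		{
			struct mm_struct *mm = get_task_mm(tsk);
			struct vm_area_struct *vma;

			if (!mm)
				return;
			down_read(&mm->mmap_sem);
			for (vma = mm->mmap; vma; vma = vma->vm_next)
				if (!vma->vm_policy)	/* default policy only */
					migrate_vma_to_node(vma, dest_node);
			up_read(&mm->mmap_sem);
			mmput(mm);
		}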

automigrate-02-add-sched_migrate_memory-sysctl.patch

	This patch adds the infrastructure for the /sys/kernel/migration
	group as well as the sched_migrate_memory control.  Because we
	have no separate migration source file, I added this to
	mempolicy.c.

automigrate-03.0-check-notify-migrate-pending.patch

	This patch adds a minimal <linux/auto-migrate.h> header to interface
	the scheduler to the auto-migration.  The header includes a static
	inline function for the scheduler to check for internode migration
	and notify the task [by setting the TIF_NOTIFY_RESUME thread info
	flag] if the task is migrating to a new node and sched_migrate_memory
	is enabled.  The header also includes the function
	check_migrate_pending(), which the task calls when returning to
	user state after noticing TIF_NOTIFY_RESUME set.  Both of these
	functions become null macros when MIGRATION is not configured;
	a rough sketch follows the note below.

	However, note that in 2.6.16-rc*, one cannot deselect MIGRATION when
	building with NUMA configured.
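
	To illustrate, the two helpers might look roughly like this [sketch
	only; the in-kernel name of the enable flag is assumed, and locking
	details are omitted]:

		#ifdef CONFIG_MIGRATION
		/* called by the scheduler just before set_task_cpu() */
		static inline void check_internode_migration(task_t *p, int dest_cpu)
		{
			if (sched_migrate_memory &&	/* assumed enable flag */
			    cpu_to_node(dest_cpu) != cpu_to_node(task_cpu(p)))
				set_tsk_thread_flag(p, TIF_NOTIFY_RESUME);
		}

		/* called from the arch do_notify_resume*() path on return
		 * to user state */
		static inline void check_migrate_pending(void)
		{
			if (sched_migrate_memory)
				migrate_task_memory(current, numa_node_id());
		}
		#else
		#define check_internode_migration(p, cpu)	do { } while (0)
		#define check_migrate_pending()			do { } while (0)
		#endif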

automigrate-03.1-ia64-check-notify-migrate-pending.patch

	This patch adds the call to check_migrate_pending() to the
	ia64-specific do_notify_resume_user() function.  Note that this
	is the same mechanism used to deliver signals and perfmon events
	to a task.

automigrate-03.2-x86_64-check-notify-migrate-pending.patch

	This patch adds the call to check_migrate_pending() to the
	x86_64-specific do_notify_resume() function.  This is just an example
	for an arch other than ia64.  I haven't tested this yet.

automigrate-04-hook-sched-internode-migration.patch

	This patch hooks the calls to check_internode_migration() into
	the scheduler [kernel/sched.c] in places where the scheduler
	sets a new cpu for the task--i.e., just before calls to
	set_task_cpu().  Because these are in migration paths that are
	already relatively "heavy-weight", they don't add overhead to
	the scheduler fast paths.  And they become empty or constant
	macros when MIGRATION is not configured in.

automigrate-05-add-internode-migration-hysteresis.patch

	This patch adds the sched_migrate_interval control to the
	/sys/kernel/migration group, and adds a function to the auto-migrate.h
	header--too_soon_for_internode_migration()--to check whether it's too
	soon for another internode migration.  This function becomes a macro
	that evaluates to "false" [0], when MIGRATION is not configured.

	This check is added to try_to_wake_up() and can_migrate_task() to
	override internode migrations if the last one was less than
	sched_migrate_interval seconds [HZ] ago.
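
	The check itself would be something like the following [sketch only;
	the per-task timestamp field name is made up]:

		#ifdef CONFIG_MIGRATION
		static inline int too_soon_for_internode_migration(task_t *p)
		{
			/* sched_migrate_interval is kept internally in jiffies */
			return time_before(jiffies,
				p->last_internode_migration + sched_migrate_interval);
		}
		#else
		#define too_soon_for_internode_migration(p)	0
		#endif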

BONUS PATCH:
automigrate-06-hook-to-migrate-on-fault.patch

	This patch, which requires the migrate-on-fault capability,
	hooks automigration up to migrate-on-fault, with an additional
	control--/sys/kernel/migration/sched_migrate_lazy--to enable
	it.

TESTING:

I have tested this patch on a 16-cpu/4-node HP rx8620 [ia64] platform with
everyone's favorite benchmark.

Kernel builds [after make mrproper+make defconfig]
on 2.6.16-rc5-git11 on 16-cpu/4 node/32GB HP rx8620 [ia64].
Times taken after a warm-up run.
Entire kernel source likely held in page cache.
This amplifies the effect of the patches because I
can't hide behind disk IO time.

No auto-migrate patches:

   88.20s real  1042.56s user    97.26s system
   88.92s real  1042.27s user    98.08s system
   88.40s real  1043.58s user    96.51s system
   91.45s real  1042.46s user    97.07s system
   93.29s real  1040.90s user    96.88s system
   90.15s real  1042.06s user    97.02s system
   90.45s real  1042.75s user    96.98s system
   90.77s real  1041.87s user    98.61s system
   90.21s real  1042.00s user    96.91s system
   88.50s real  1042.23s user    97.30s system
   -------------------------------------------
   90.03s real  1042.26s user    97.26s system - mean
    1.59           0.68           0.62         - std dev'n

With auto-migration patches, sched_migrate_memory disabled:

   88.98s real  1042.28s user    96.88s system
   88.75s real  1042.71s user    97.51s system
   89.42s real  1042.32s user    97.42s system
   87.83s real  1042.92s user    96.06s system
   92.47s real  1041.12s user    95.96s system
   89.14s real  1043.77s user    97.10s system
   88.11s real  1044.04s user    95.16s system
   91.74s real  1042.21s user    96.43s system
   89.36s real  1042.31s user    96.56s system
   88.55s real  1042.50s user    96.25s system
   -------------------------------------------
   89.43s real  1042.61s user    96.53s system - mean
    1.51           0.83           0.72         - std dev'n

With auto-migration patches, sched_migrate_memory enabled:

   90.62s real  1041.64s user   106.80s system
   89.94s real  1042.82s user   105.00s system
   91.34s real  1041.89s user   107.74s system
   90.12s real  1041.77s user   108.01s system
   90.93s real  1042.00s user   106.50s system
   93.97s real  1040.12s user   106.16s system
   90.65s real  1041.87s user   106.81s system
   90.53s real  1041.46s user   106.74s system
   91.84s real  1041.59s user   105.57s system
   90.28s real  1041.69s user   106.64s system
   -------------------------------------------
   91.02s real  1041.68s user   106.60s system - mean
    1.18           0.67           0.90         - std dev'n

Not stellar!  An insignificant decrease in user time, but a
~1% increase in real time [from the unpatched case] and a
~10% increase in system time.  In short, page migration,
and/or the scanning of vm areas for eligible pages, is
expensive and, for this job, the programs don't see
enough benefit from the resulting locality to pay for the cost
of migration.  Compilers just don't run long enough!

On one instrumented sample auto-direct run:
migrate_task_memory	called  3628 times = #internode migrations
migrate_vma_to_node	called 17137 times = 7.68 vma/task
migrate_page		called  3628 times = 1.62 pages/task

Very few "eligible" pages found in eligible vmas!  Perhaps
we're not being aggressive enough in attempts to migrate.

------------

Now, with the last patch, hooking automigration to
migrate-on-fault:

With auto-migrate + migrate-on-fault patches;
sched_migrate_memory disabled:

   88.02s real  1042.77s user    95.62s system
   91.56s real  1041.05s user    97.50s system
   90.41s real  1040.88s user    98.07s system
   90.41s real  1041.64s user    97.00s system
   89.82s real  1042.45s user    96.35s system
   88.28s real  1042.25s user    96.91s system
   91.51s real  1042.74s user    95.90s system
   93.34s real  1041.72s user    96.07s system
   89.09s real  1041.00s user    97.35s system
   89.44s real  1041.57s user    96.55s system
   -------------------------------------------
   90.19s real  1041.81s user    96.73s system - mean
    1.63           0.71           0.78         - std dev'n

With auto-migrate + migrate-on-fault patches;
sched_migrate_memory and sched_migrate_lazy enabled:

   91.72s real  1039.17s user   108.92s system
   91.02s real  1041.62s user   107.38s system
   91.21s real  1041.84s user   106.63s system
   93.24s real  1039.50s user   107.54s system
   92.64s real  1040.79s user   107.10s system
   92.52s real  1040.79s user   107.14s system
   91.85s real  1039.90s user   108.26s system
   90.58s real  1043.34s user   106.06s system
   92.30s real  1040.88s user   106.64s system
   94.25s real  1039.96s user   106.85s system
   -------------------------------------------
   92.13s real  1040.78s user   107.25s system - mean
    1.10           1.25           0.84        - std dev'n

Also, no win for kernel builds.  Again, slightly less
user time, but even more system and real time [~1sec each]
than the auto+direct run.

On one instrumented sample auto-lazy run:
migrate_task_memory	called  3777 times = #internode migrations
migrate_vma_to_node	called 28586 times = 7.56 vma/task
migrate_page		called  3886 times = 1.02 pages/task

Similar pattern, but a lot more "eligible" vmas and fewer
eligible pages per internode migration.  More internode migrations.

TODO:

Next week, I'll try some longer-running workloads that we know
have suffered from the scheduler moving them away from their
memory--e.g., McCalpin STREAM.  Will report results when
available.

Maybe also test with more aggressive migration: '_MOVE_ALL.

I'll also move this to the -mm tree, once I port my trace
instrumentation from relayfs to sysfs.



* Re: [PATCH/RFC] AutoPage Migration - V0.1 - 0/8 Overview
  2006-03-10 19:33 [PATCH/RFC] AutoPage Migration - V0.1 - 0/8 Overview Lee Schermerhorn
@ 2006-03-11  6:41 ` KAMEZAWA Hiroyuki
  2006-03-13 17:27   ` Lee Schermerhorn
  2006-03-14 22:27 ` Lee Schermerhorn
  1 sibling, 1 reply; 14+ messages in thread
From: KAMEZAWA Hiroyuki @ 2006-03-11  6:41 UTC (permalink / raw)
  To: lee.schermerhorn; +Cc: linux-mm

Hi, a few comments.

On Fri, 10 Mar 2006 14:33:14 -0500
Lee Schermerhorn <lee.schermerhorn@hp.com> wrote:
> Furthermore, to prevent thrashing, a second
> sysctl, sched_migrate_interval, has been implemented.  The load balancer
> will not move a task to a different node if it has move to a new node
> in the last sched_migrate_interval seconds.  [User interface is in
> seconds; internally it's in HZ.]  The idea is to give the task time to
> ammortize the cost of the migration by giving it time to benefit from
> local references to the page.
I think this interval should be estimated automatically by the kernel, not set by the user.


> Kernel builds [after make mrproper+make defconfig]
> on 2.6.16-rc5-git11 on 16-cpu/4 node/32GB HP rx8620 [ia64].
> Times taken after a warm-up run.
> Entire kernel source likely held in page cache.
> This amplifies the effect of the patches because I
> can't hide behind disk IO time.

It looks like you added check_internode_migration() in migrate_task().
migrate_task() is called by sched_migrate_task().
And... sched_migrate_task() is called by sched_exec()
(a process can be migrated on exec()).
In this case, migrate_task_memory() just wastes time, I think.

BTW, what happens with shared pages?
-- Kame



* Re: [PATCH/RFC] AutoPage Migration - V0.1 - 0/8 Overview
  2006-03-11  6:41 ` KAMEZAWA Hiroyuki
@ 2006-03-13 17:27   ` Lee Schermerhorn
  2006-03-13 23:45     ` Christoph Lameter
  0 siblings, 1 reply; 14+ messages in thread
From: Lee Schermerhorn @ 2006-03-13 17:27 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm

On Sat, 2006-03-11 at 15:41 +0900, KAMEZAWA Hiroyuki wrote:
> Hi, a few comments.

Thanks!

> 
> On Fri, 10 Mar 2006 14:33:14 -0500
> Lee Schermerhorn <lee.schermerhorn@hp.com> wrote:
> > Furthermore, to prevent thrashing, a second
> > sysctl, sched_migrate_interval, has been implemented.  The load balancer
> > will not move a task to a different node if it has move to a new node
> > in the last sched_migrate_interval seconds.  [User interface is in
> > seconds; internally it's in HZ.]  The idea is to give the task time to
> > ammortize the cost of the migration by giving it time to benefit from
> > local references to the page.
> I think this HZ should be automatically estimated by the kernel. not by user.

Well, perhaps, eventually...  When we have a feel for what the algorithm
should be.  Perhaps a single value, which might be different for
different platforms, would suffice. I know that for a similar
implementation in Tru64 Unix on Alpha, we settled on a constant value of
30 seconds [my current default].  But that was a single-architecture OS.
And, this patch series is still "experimental", so I wanted to be able
to measure the effect of this interval w/o having to reboot a new kernel
to change the value.  On my test platform, rebooting takes about 6x as
long as rebuilding the kernel :-(.

> 
> 
> > Kernel builds [after make mrproper+make defconfig]
> > on 2.6.16-rc5-git11 on 16-cpu/4 node/32GB HP rx8620 [ia64].
> > Times taken after a warm-up run.
> > Entire kernel source likely held in page cache.
> > This amplifies the effect of the patches because I
> > can't hide behind disk IO time.
> 
> It looks you added check_internode_migration() in migrate_task().
> migrate_task() is called by sched_migrate_task().
> And....sched_migrate_task() is called by sched_exec().
> (a process can be migrated when exec().)
> In this case, migrate_task_memory() just wastes time..., I think.

You're probably right about wasting time in the exec() case.
migrate_task() is also called from set_cpus_allowed() when changing a
task's cpu affinity.  In this case, I think we want to migrate memory to
follow the task if it moves to a new node.  So, I've added the patch
below to bypass the "check_internode_migration()" when migrate_task() is
called from sched_migrate_task().  

When I first looked at this, I didn't think calling migrate_task_memory()
in the exec case would add too much overhead.  It won't get called
until the task returns to user state in the context of the newly exec'd
image.  At that point, there shouldn't be many private/anon pages
already faulted into the task's pte's.  I agree that any such pages
should be on the correct node and therefore unmapping them, only to
fault the ptes back in on touch, is a waste of time.  However, I did
want to give the task a shot at pulling any "eligible" shared pages [see
answer to your question regarding shared pages below].

So here are the results for kernel builds on 2.6.16-rc6-git1 with and
without the patch below.  All runs have both the auto-migration and
migrate-on-fault patches installed.  I reran a few of each of the tests
posted earlier to establish a new baseline.  Again, this is on a 16-
cpu/4-node/32GB ia64 platform.  I should also mention that I build with
-j32 [2 x nr_cpus].

sched_migrate_memory disabled:

   88.01s real  1041.58s user    95.67s system
   88.45s real  1041.86s user    94.71s system
   88.02s real  1043.03s user    94.18s system
   90.36s real  1041.62s user    95.00s system
   89.59s real  1040.90s user    95.62s system
   -------------------------------------------
   88.89        1041.80          95.04

sched_migrate_memory enabled, lazy [migrate on fault] disabled:

   91.14s real  1040.60s user   104.53s system
   94.01s real  1038.49s user   105.66s system
   90.40s real  1039.60s user   105.70s system
   93.22s real  1039.69s user   105.09s system
   94.11s real  1039.20s user   105.66s system
   -------------------------------------------
   92.58        1039.52         105.33

sched_migrate_memory + sched_migrate_lazy enabled:

   91.53s real  1040.46s user   106.04s system
   93.45s real  1040.49s user   105.67s system
   92.01s real  1041.31s user   104.86s system
   93.65s real  1039.96s user   105.20s system
   91.40s real  1041.92s user   104.96s system
   -------------------------------------------
   92.41        1040.83         105.35

w/ nix memory migration on exec patch:
sched_migrate_memory + sched_migrate_lazy enabled:

   89.30s real  1041.45s user   105.60s system
   89.44s real  1042.53s user   105.24s system
   89.03s real  1043.35s user   104.09s system
   92.37s real  1039.92s user   107.62s system <---?
   93.42s real  1040.00s user   105.86s system
   -------------------------------------------
   90.71        1041.45         105.68
Real time is a little [not significantly] faster than w/o this patch.
But both the user and system times are a little higher.  I think that
the system time would have been better except for the one run with
noticeably longer system time.

Same kernel as above with:
sched_migrate_memory + sched_migrate_lazy disabled:

   89.79s real  1041.97s user    96.12s system
   88.27s real  1042.74s user    95.26s system
   91.68s real  1042.17s user    95.94s system
   93.02s real  1040.41s user    96.48s system
   90.72s real  1042.51s user    95.32s system
   -------------------------------------------
   90.70        1041.96          95.82


I ran some instrumented runs, to see how many task/vma/page migrations
occur during the builds.  The numbers are "all over the map", even with
repeated runs on the same kernel.  However, bypassing the check for
internode migration that results in calling migrate_task_memory in the
exec path does seem to decrease the number of such calls:

        Test                     tasks     vmas   pages
16-rc5-git11+autodirect           2230    17137   3629
16-rc5-git11+autolazy             2973    22385   3109

16-rc6-git1+autolazy              2041    15981   7485

16-rc6-git1+autolazy/nixexec      1996    15587   8505
16-rc6-git1+autolazy/nixexec      1946    14927   3019
16-rc6-git1+autolazy/nixexec      2171    16758   8231

tasks = migrate_task_memory calls
vmas  = migrate_vma_to_node calls
pages = [buffer_]migrate_page calls

The first 2 lines are the numbers I reported in the automigration
overview post.  I only took a single measurement on rc6-git1 without the
patch below.  There happened to be a couple of hundred fewer calls to
migrate_task_memory than in the rc5-git11 cases.  When I added the
patch, 2 of the 3 runs I took [after rebuild/reboot] had fewer calls to
migrate_task_memory, and fewer calls to migrate_vma_to_node as well.

Note:  out of all the runs above, I only saw 3 buffer_migrate_page
calls.  I suspect these are shared text/library pages that just happened
to be mapped only into the caller's page table at the time of the scan.

> 
> BTW, what happens against shared pages ?

I have made no changes to the way that 2.6.16-rc* migration code handles
shared pages.  Note that migrate_task_memory()/migrate_vma_to_node()
calls check_range() with the flag MPOL_MF_MOVE.  This will select for
migration pages that are only mapped by the calling task--i.e., only in
the calling task's page tables.  This includes shared pages that are
only mapped by the calling task.  With the current migration code, we
have 2 flags:  '_MOVE and '_MOVE_ALL.  '_MOVE behaves as described
above; '_MOVE_ALL is more aggressive and migrates pages regardless of
the # of mappings.  Christoph says that's primarily for cpusets, but the
migrate_pages() sys call will also use 'MOVE_ALL when invoked as root.
I'm working on another patch to experiment with finer grain control over
this.  I'll add another [temporary ;-)] sysctl to specify the max # of
references to allow when selecting a page for migration.  Then, I'll
measure the effect on various workloads.  

In some of my testing, I've noticed that with the current '_MOVE
semantics, a lot of private, anon pages won't migrate because they're
shared "copy-on-write" between parents and [grand]children.  Perhaps a
threshold > 1 might be appropriate?  I'll post my findings when I have
them.
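
For illustration only--assuming the 2.6.16-rc* page-selection helper looks
roughly like the migrate_page_add() below, and making up the sysctl name--
the experiment might be as simple as:

	int migrate_max_mapcount = 1;		/* current '_MOVE behavior */

	static void migrate_page_add(struct page *page,
			struct list_head *pagelist, unsigned long flags)
	{
		/*
		 * Take the page if '_MOVE_ALL, or if it is mapped by no
		 * more than migrate_max_mapcount address spaces.
		 */
		if ((flags & MPOL_MF_MOVE_ALL) ||
		    page_mapcount(page) <= migrate_max_mapcount) {
			if (isolate_lru_page(page))
				list_add_tail(&page->lru, pagelist);
		}
	}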

So, here is an experimental patch to nix the check for internode migration
when migrating a task on exec:

---------------------------
Bypass check for internode task migration when migrate_task()
is being called in the exec() path.  

I may fold this into the automigrate "hook sched migrate to
memory migration" [6/8] patch if it proves beneficial.
It seems like calling migrate_task_memory() on a migration that
occurred because of an exec() is a waste of time.  However, it
does give the new task a chance to pull some nominally shared
pages [executable image or libraries] local to itself.

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>

Index: linux-2.6.16-rc6-git1/kernel/sched.c
===================================================================
--- linux-2.6.16-rc6-git1.orig/kernel/sched.c	2006-03-13 09:05:17.000000000 -0500
+++ linux-2.6.16-rc6-git1/kernel/sched.c	2006-03-13 09:52:48.000000000 -0500
@@ -865,7 +865,8 @@ typedef struct {
  * The task's runqueue lock must be held.
  * Returns true if you have to wait for migration thread.
  */
-static int migrate_task(task_t *p, int dest_cpu, migration_req_t *req)
+static int migrate_task(task_t *p, int dest_cpu, migration_req_t *req,
+			int execing)
 {
 	runqueue_t *rq = task_rq(p);
 
@@ -874,7 +875,8 @@ static int migrate_task(task_t *p, int d
 	 * it is sufficient to simply update the task's cpu field.
 	 */
 	if (!p->array && !task_running(rq, p)) {
-		check_internode_migration(p, dest_cpu);
+		if (!execing)
+			check_internode_migration(p, dest_cpu);
 		set_task_cpu(p, dest_cpu);
 		return 0;
 	}
@@ -1738,7 +1740,7 @@ static void sched_migrate_task(task_t *p
 		goto out;
 
 	/* force the process onto the specified CPU */
-	if (migrate_task(p, dest_cpu, &req)) {
+	if (migrate_task(p, dest_cpu, &req, 1)) {
 		/* Need to wait for migration thread (might exit: take ref). */
 		struct task_struct *mt = rq->migration_thread;
 		get_task_struct(mt);
@@ -4414,7 +4416,7 @@ int set_cpus_allowed(task_t *p, cpumask_
 	if (cpu_isset(task_cpu(p), new_mask))
 		goto out;
 
-	if (migrate_task(p, any_online_cpu(new_mask), &req)) {
+	if (migrate_task(p, any_online_cpu(new_mask), &req, 0)) {
 		/* Need help from migration thread: drop lock and wait. */
 		task_rq_unlock(rq, &flags);
 		wake_up_process(rq->migration_thread);



* Re: [PATCH/RFC] AutoPage Migration - V0.1 - 0/8 Overview
  2006-03-13 17:27   ` Lee Schermerhorn
@ 2006-03-13 23:45     ` Christoph Lameter
  2006-03-15 16:05       ` Avi Kivity
  0 siblings, 1 reply; 14+ messages in thread
From: Christoph Lameter @ 2006-03-13 23:45 UTC (permalink / raw)
  To: Lee Schermerhorn; +Cc: KAMEZAWA Hiroyuki, linux-mm, pj

On Mon, 13 Mar 2006, Lee Schermerhorn wrote:

> > BTW, what happens against shared pages ?
> 
> I have made no changes to the way that 2.6.16-rc* migration code handles
> shared pages.  Note that migrate_task_memory()/migrate_vma_to_node()
> calls check_range() with the flag MPOL_MF_MOVE.  This will select for
> migration pages that are only mapped by the calling task--i.e., only in
> the calling task's page tables.  This includes shared pages that are
> only mapped by the calling task.  With the current migration code, we
> have 2 flags:  '_MOVE and '_MOVE_ALL.  '_MOVE behaves as described
> above; '_MOVE_ALL is more aggressive and migrates pages regardless of
> the # of mappings.  Christoph says that's primarily for cpusets, but the
> migrate_pages() sys call will also use 'MOVE_ALL when invoked as root.

cpusets uses _MOVE_ALL because Paul wanted it that way. I still think it 
is a bad idea to move shared libraries etc. _MOVE only moves the pages used
by the currently executing process. If you do a MOVE_ALL then you may 
cause delays in other processes because they have to wait for their pages 
to become available again. Also they may have to generate additional 
faults to restore their PTEs. So you are negatively impacting other 
processes. Note that these wait times can be extensive if _MOVE_ALL is 
f.e. just migrating a critical glibc page that all processes use.


* Re: [PATCH/RFC] AutoPage Migration - V0.1 - 0/8 Overview
  2006-03-10 19:33 [PATCH/RFC] AutoPage Migration - V0.1 - 0/8 Overview Lee Schermerhorn
  2006-03-11  6:41 ` KAMEZAWA Hiroyuki
@ 2006-03-14 22:27 ` Lee Schermerhorn
  1 sibling, 0 replies; 14+ messages in thread
From: Lee Schermerhorn @ 2006-03-14 22:27 UTC (permalink / raw)
  To: linux-mm

Some results for auto migration + migrate on fault using the McCalpin
STREAM benchmark.

Kernel:  2.6.16-rc6-git1 with the migrate-on-fault and automigration
patches applied, in that order.

Platform:  HP rx8620 16-cpu/4-node/32GB NUMA platform with 1.6GHz ia64
cpus.

I built the stream benchmark to run for 1000 loops [NTIMES defined as
1000], and at the end of each loop I printed the run# and the times for
the copy, scale, add and triad loops.  I compiled with icc [Intel
compiler] 9.0 using OpenMP extensions to parallelize the loops.  

I ran the benchmark twice:  once with auto migration disabled
[/sys/kernel/migration/sched_migrate_memory = 0] and once with it
enabled [including migrate on fault--a.k.a lazy migration].  After I
started each run, I did a kernel build on the platform using:  make
mrproper; make oldconfig [using my regular .config]; make -j32 all.

I have posted graphs of the results and the raw data itself at:

http://free.linux.hp.com/~lts/Patches/PageMigration/

The nomig+kbuild files [automigration disabled] show the benchmark
starting with more or less "best case" times.  By best case, I mean
that if you run the benchmark over and over [with, say, NTIMES = 10],
you'll see a number of different results, depending on how the OMP
threads happen to fall relative to the data.  I have made no effort to
do any initial placement.  The copy and scale times are ~0.052 seconds;
the add and triad numbers are ~0.078 seconds.  Then the kernel build
disturbs the run for a while, during which time the scheduler migrates
the stream/OMP threads to the least busy groups/queues.  When the kernel
build is done, the threads happen to end up in a rather suboptimal
configuration.  This is just a roll of the dice.  They could have ended
up in better shape.

The automig+kbuild files show the same scenario.  However, the threads
happened to start in a suboptimal configuration, similar to the end
state of the nomig case.  Again, I made no effort to place the threads
at the start.  As soon as the kernel build starts, migration kicks in.
I believe that the first spike is from the "make mrproper".  This
disturbance is sufficient to cause the stream/OMP threads to migrate, at
which time autopage migration causes the pages to be migrated to the
node with the threads.  Then, the "make -j32 all" really messes up the
benchmark, but when the build finishes, the job runs in the more or less
best case configuration.

Some things to note:  the peaks for the automigration case are higher
[almost double] than those of the no-migration case because the page
migrations get included in the benchmark times, which are measured with
just gettimeofday().  Also, because I'm using lazy migration
[sched_migrate_lazy enabled], when a task migrates to a new node, I only
unmap the pages.  Then, they migrate or not, as threads touch them.  If
the page is already on the node where the faulting thread is running,
the fault just reinstates the pte for the page.  If not, it will migrate
the page to the thread's node and install that pte.

Finally, as noted above, the first disturbance in the automig case
caused some migrations that resulted in a better thread/memory
configuration than what the program started with.  To give an
application control over this, it might be useful to provide an
MPOL_NOOP policy to be used along with the MPOL_MF_MOVE flag, and the
MPOL_MF_LAZY flag that I implemented in the last of the migrate-on-fault
patches.  The 'NOOP would retain the existing policy for the specified
range, but the 'MOVE+'LAZY would unmap the ptes of the range, pushing
anon pages to the swap cache, if necessary.
This would allow the threads of an application to pull the pages local
to their node on next touch.  I will test this theory...
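
From user space--assuming the proposed MPOL_NOOP and MPOL_MF_LAZY were
available [neither exists yet; they are the extensions suggested above]--a
thread could mark a range for migrate-on-next-touch with something like:

	#include <stdio.h>
	#include <numaif.h>

	static void mark_for_next_touch(void *addr, size_t len)
	{
		/* keep the existing policy; just unmap the range so each
		 * page is pulled to the toucher's node on the next fault */
		if (mbind(addr, len, MPOL_NOOP, NULL, 0,
			  MPOL_MF_MOVE | MPOL_MF_LAZY) < 0)
			perror("mbind");
	}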

Lee




* Re: [PATCH/RFC] AutoPage Migration - V0.1 - 0/8 Overview
  2006-03-13 23:45     ` Christoph Lameter
@ 2006-03-15 16:05       ` Avi Kivity
  2006-03-15 17:54         ` Paul Jackson
  0 siblings, 1 reply; 14+ messages in thread
From: Avi Kivity @ 2006-03-15 16:05 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Lee Schermerhorn, KAMEZAWA Hiroyuki, linux-mm, pj

Christoph Lameter wrote:

>cpusets uses _MOVE_ALL because Paul wanted it that way. I still think it 
>is a bad idea to move shared libraries etc. _MOVE only moves the pages used
>by the currently executing process. If you do a MOVE_ALL then you may 
>cause delays in other processes because they have to wait for their pages 
>to become available again. Also they may have to generate additional 
>faults to restore their PTEs. So you are negatively impacting other 
>processes. Note that these wait times can be extensive if _MOVE_ALL is 
>f.e. just migrating a critical glibc page that all processes use.
>  
>
Doesn't it make sense to duplicate heavily accessed shared read-only pages?

Something like page migration, but keeping the original page intact. 
Unfortunately, for threaded applications, it means page table bases 
(cr3) can't be shared among threads.

-- 
error compiling committee.c: too many arguments to function


* Re: [PATCH/RFC] AutoPage Migration - V0.1 - 0/8 Overview
  2006-03-15 16:05       ` Avi Kivity
@ 2006-03-15 17:54         ` Paul Jackson
  2006-03-15 18:10           ` Christoph Lameter
  0 siblings, 1 reply; 14+ messages in thread
From: Paul Jackson @ 2006-03-15 17:54 UTC (permalink / raw)
  To: Avi Kivity; +Cc: clameter, lee.schermerhorn, kamezawa.hiroyu, linux-mm

> Doesn't it make sense to duplicate heavily accessed shared read-only pages?

It might .. that would be a major and difficult effort,
and it is not clear that it would be a win.  The additional
bookkeeping to figure out what pages were heavily accessed
would be very costly.  Probably prohibitive.

That's certainly a very different discussion than migration.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401


* Re: [PATCH/RFC] AutoPage Migration - V0.1 - 0/8 Overview
  2006-03-15 17:54         ` Paul Jackson
@ 2006-03-15 18:10           ` Christoph Lameter
  2006-03-15 18:14             ` Paul Jackson
  0 siblings, 1 reply; 14+ messages in thread
From: Christoph Lameter @ 2006-03-15 18:10 UTC (permalink / raw)
  To: Paul Jackson; +Cc: Avi Kivity, lee.schermerhorn, kamezawa.hiroyu, linux-mm

On Wed, 15 Mar 2006, Paul Jackson wrote:

> > Doesn't it make sense to duplicate heavily accessed shared read-only pages?
> 
> It might .. that would be a major and difficult effort,
> and it is not clear that it would be a win.  The additional
> bookkeeping to figure out what pages were heavily accessed
> would be very costly.  Probably prohibitive.
> 
> That's certainly a very different discussion than migration.

That is a different discussion but it is not complicated.  There are
trivial one- or two-line patches around that make the fault handlers copy
a page if a certain mapcount is reached.
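
Roughly--this is only a sketch of the idea, not one of those actual
patches; the threshold variable is made up and the placement assumes the
2.6.16 do_no_page() locals [new_page, anon]:

	if (!write_access && !(vma->vm_flags & VM_SHARED) &&
	    page_mapcount(new_page) >= replicate_mapcount_threshold) {
		struct page *copy = alloc_page_vma(GFP_HIGHUSER, vma, address);

		if (copy) {
			copy_user_highpage(copy, new_page, address);
			page_cache_release(new_page);
			new_page = copy;	/* map a local private copy */
			anon = 1;		/* treat it like an anon COW page */
		}
	}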


* Re: [PATCH/RFC] AutoPage Migration - V0.1 - 0/8 Overview
  2006-03-15 18:10           ` Christoph Lameter
@ 2006-03-15 18:14             ` Paul Jackson
  2006-03-15 18:20               ` Christoph Lameter
  2006-03-15 18:57               ` Avi Kivity
  0 siblings, 2 replies; 14+ messages in thread
From: Paul Jackson @ 2006-03-15 18:14 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: avi, lee.schermerhorn, kamezawa.hiroyu, linux-mm

> a page if a certain mapcount is reached.

He said "accessed", not "referenced".

The point was to copy pages that receive many
load and store instructions from far away nodes.

This has only minimal to do with the number of
memory address spaces mapping the region
holding that page.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401


* Re: [PATCH/RFC] AutoPage Migration - V0.1 - 0/8 Overview
  2006-03-15 18:14             ` Paul Jackson
@ 2006-03-15 18:20               ` Christoph Lameter
  2006-03-15 19:21                 ` Lee Schermerhorn
  2006-03-15 18:57               ` Avi Kivity
  1 sibling, 1 reply; 14+ messages in thread
From: Christoph Lameter @ 2006-03-15 18:20 UTC (permalink / raw)
  To: Paul Jackson
  Cc: Christoph Lameter, avi, lee.schermerhorn, kamezawa.hiroyu, linux-mm

On Wed, 15 Mar 2006, Paul Jackson wrote:

> The point was to copy pages that receive many
> load and store instructions from far away nodes.

Right. In order to do that we first need to have some memory traces or 
statistics that can establish that a page is accessed from far away nodes.


* Re: [PATCH/RFC] AutoPage Migration - V0.1 - 0/8 Overview
  2006-03-15 18:14             ` Paul Jackson
  2006-03-15 18:20               ` Christoph Lameter
@ 2006-03-15 18:57               ` Avi Kivity
  2006-03-15 19:27                 ` Lee Schermerhorn
  1 sibling, 1 reply; 14+ messages in thread
From: Avi Kivity @ 2006-03-15 18:57 UTC (permalink / raw)
  To: Paul Jackson
  Cc: Christoph Lameter, lee.schermerhorn, kamezawa.hiroyu, linux-mm

Paul Jackson wrote:

>>a page if a certain mapcount is reached.
>>    
>>
>
>He said "accessed", not "referenced".
>
>The point was to copy pages that receive many
>load and store instructions from far away nodes.
>
>  
>
Only loads, please. Writable pages should not be duplicated.

>This has only minimal to do with the number of
>memory address spaces mapping the region
>holding that page.
>
>  
>

For starters, you could indicate which files need duplication manually. 
You would duplicate your main binaries and associated shared objects. 
Presumably large NUMA machines have plenty of memory, so over-duplication
would not be a huge problem.

Is the kernel text duplicated?

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


* Re: [PATCH/RFC] AutoPage Migration - V0.1 - 0/8 Overview
  2006-03-15 18:20               ` Christoph Lameter
@ 2006-03-15 19:21                 ` Lee Schermerhorn
  0 siblings, 0 replies; 14+ messages in thread
From: Lee Schermerhorn @ 2006-03-15 19:21 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Paul Jackson, avi, kamezawa.hiroyu, linux-mm, Peter Chubb

On Wed, 2006-03-15 at 10:20 -0800, Christoph Lameter wrote:
> On Wed, 15 Mar 2006, Paul Jackson wrote:
> 
> > The point was to copy pages that receive many
> > load and store instructions from far away nodes.
> 
> Right. In order to do that we first need to have some memory traces or 
> statistics that can establish that a page is accessed from far away nodes.
> 

The guys down at UNSW have patches for ia64 that can show NUMA
accesses.  The patches are based on their long-format VHPT TLB miss
handler.  As such, it can only report when a page misses in the TLB,
but that's more than we have now.  I believe that they have a "NUMA
visualization" tool to display the results graphically, as well.






* Re: [PATCH/RFC] AutoPage Migration - V0.1 - 0/8 Overview
  2006-03-15 18:57               ` Avi Kivity
@ 2006-03-15 19:27                 ` Lee Schermerhorn
  2006-03-15 19:56                   ` Jack Steiner
  0 siblings, 1 reply; 14+ messages in thread
From: Lee Schermerhorn @ 2006-03-15 19:27 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Paul Jackson, Christoph Lameter, kamezawa.hiroyu, linux-mm,
	Steve Ofsthun

On Wed, 2006-03-15 at 20:57 +0200, Avi Kivity wrote:
> Paul Jackson wrote:
> 
> >>a page if a certain mapcount is reached.
> >>    
> >>
> >
> >He said "accessed", not "referenced".
> >
> >The point was to copy pages that receive many
> >load and store instructions from far away nodes.
> >
> >  
> >
> Only loads, please. Writable pages should not be duplicated.
> 
> >This has only minimal to do with the number of
> >memory address spaces mapping the region
> >holding that page.
> >
> >  
> >
> 
> For starters, you could indicate which files need duplication manually. 
> You would duplicate your main binaries and associated shared objects. 
> Presumably large numas have plenty of memory so over-duplication would 
> not be a huge problem.
> 
> Is the kernel text duplicated?

No.  Might have been patches to do this for ia64 at one time.  I'm not
sure, tho'.

However, the folks at Virtual Iron do have patches to replicate shared,
executable segments.  They mentioned this at OLS last year.  I believe
that Ray Bryant got 'hold of a copy of the patch and had it working at
one time.  Didn't address one of the issues he was interested in, which
was to also duplicate the page tables for shared segments [?].  I hope
to experiment with them sometime down the line to see if they provide
measurable benefit.




* Re: [PATCH/RFC] AutoPage Migration - V0.1 - 0/8 Overview
  2006-03-15 19:27                 ` Lee Schermerhorn
@ 2006-03-15 19:56                   ` Jack Steiner
  0 siblings, 0 replies; 14+ messages in thread
From: Jack Steiner @ 2006-03-15 19:56 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Avi Kivity, Paul Jackson, Christoph Lameter, kamezawa.hiroyu,
	linux-mm, Steve Ofsthun

> > Is the kernel text duplicated?
> 
> No.  Might have been patches to do this for ia64 at one time.  I'm not
> sure, tho'.
> 

Yes, there is a patch to duplicate kernel text. I still have a copy
although I'm sure it has gotten very stale.

Kernel text replication was part of the IA64 "trillian" patch at 
one time but was dropped because we never saw any significant benefit.
However, systems are larger now & I would not be surprised if
replication helped on very large systems.

I plan to retest kernel replication within the next couple of
months. Stay tuned...

---
Jack




