Re: [PATCH/RFC] AutoPage Migration - V0.1 - 0/8 Overview

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Lee Schermerhorn <lee.schermerhorn@hp.com>
To: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: linux-mm@kvack.org
Subject: Re: [PATCH/RFC] AutoPage Migration - V0.1 - 0/8 Overview
Date: Mon, 13 Mar 2006 12:27:36 -0500	[thread overview]
Message-ID: <1142270857.5210.50.camel@localhost.localdomain> (raw)
In-Reply-To: <20060311154113.c4358e40.kamezawa.hiroyu@jp.fujitsu.com>

On Sat, 2006-03-11 at 15:41 +0900, KAMEZAWA Hiroyuki wrote:
> Hi, a few comments.

Thanks!

> 
> On Fri, 10 Mar 2006 14:33:14 -0500
> Lee Schermerhorn <lee.schermerhorn@hp.com> wrote:
> > Furthermore, to prevent thrashing, a second
> > sysctl, sched_migrate_interval, has been implemented.  The load balancer
> > will not move a task to a different node if it has move to a new node
> > in the last sched_migrate_interval seconds.  [User interface is in
> > seconds; internally it's in HZ.]  The idea is to give the task time to
> > ammortize the cost of the migration by giving it time to benefit from
> > local references to the page.
> I think this HZ should be automatically estimated by the kernel. not by user.

Well, perhaps, eventually...  When we have a feel for what the algorithm
should be.  Perhaps a single value, which might be different for
different platforms, would suffice. I know that for a similar
implementation in Tru64 Unix on Alpha, we settled on a constant value of
30seconds [my current default].  But that was a single architecture OS.
And, this patch series is still "experimental", so I wanted to be able
to measure the effect of this interval w/o having to reboot a new kernel
to change the value.  On my test platform, rebooting takes about 6x as
long as rebuilding the kernel :-(.

> 
> 
> > Kernel builds [after make mrproper+make defconfig]
> > on 2.6.16-rc5-git11 on 16-cpu/4 node/32GB HP rx8620 [ia64].
> > Times taken after a warm-up run.
> > Entire kernel source likely held in page cache.
> > This amplifies the effect of the patches because I
> > can't hide behind disk IO time.
> 
> It looks you added check_internode_migration() in migrate_task().
> migrate_task() is called by sched_migrate_task().
> And....sched_migrate_task() is called by sched_exec().
> (a process can be migrated when exec().)
> In this case, migrate_task_memory() just wastes time..., I think.

You're probably right about wasting time in the exec() case.
migrate_task() is also called from set_cpus_allowed() when changing a
task's cpu affinity.  In this case, I think we want to migrate memory to
follow the task if it moves to a new node.  So, I've added the patch
below to bypass the "check_internode_migration()" when migrate_task() is
called from sched_migrate_task().  

When I first looked at this, I didn't think calling migrate_task_memory
() in the exec case would add too much overhead.  It won't get called
until the task returns to user state in the context of the newly exec'd
image.  At that point, there shouldn't be many private/anon pages
already faulted into the task's pte's.  I agree that any such pages
should be on the correct node and therefore unmapping them, only to
fault the ptes back in on touch, is a waste of time.  However, I did
want to give the task a shot at pulling any "eligible" shared pages [see
answer to your question regarding shared pages below].

So here are the results for kernel builds on 2.6.16-rc6-git1 with and
without the patch below.  All runs have both the auto-migration and
migrate-on-fault patches installed.  I reran a few of each of the tests
posted earlier to establish a new baseline.  Again, this is on a 16-
cpu/4-node/32GB ia64 platform.  I should also mention that I build with
-j32 [2 x nr_cpus].

sched_migrate_memory disabled:

   88.01s real  1041.58s user    95.67s system
   88.45s real  1041.86s user    94.71s system
   88.02s real  1043.03s user    94.18s system
   90.36s real  1041.62s user    95.00s system
   89.59s real  1040.90s user    95.62s system
   -------------------------------------------
   88.89        1041.80          95.04

sched_migrate_memory enabled, lazy [migrate on fault] disabled:

   91.14s real  1040.60s user   104.53s system
   94.01s real  1038.49s user   105.66s system
   90.40s real  1039.60s user   105.70s system
   93.22s real  1039.69s user   105.09s system
   94.11s real  1039.20s user   105.66s system
   -------------------------------------------
   92.58        1039.52         105.33

sched_migrate_memory + sched_migrate_lazy enabled:

   91.53s real  1040.46s user   106.04s system
   93.45s real  1040.49s user   105.67s system
   92.01s real  1041.31s user   104.86s system
   93.65s real  1039.96s user   105.20s system
   91.40s real  1041.92s user   104.96s system
   -------------------------------------------
   92.41        1040.83         105.35

w/ nix memory migration on exec patch:
sched_migrate_memory + sched_migrate_lazy enabled:

   89.30s real  1041.45s user   105.60s system
   89.44s real  1042.53s user   105.24s system
   89.03s real  1043.35s user   104.09s system
   92.37s real  1039.92s user   107.62s system <---?
   93.42s real  1040.00s user   105.86s system
   -------------------------------------------
   90.71        1041.45         105.68
Real time is a little [not significantly] faster than w/o this patch.
But both the user and system times are a little higher.  I think that
the system time would have been better except for the one run with
noticably longer system time.

Same kernel as above with:
sched_migrate_memory + sched_migrate_lazy disabled:

   89.79s real  1041.97s user    96.12s system
   88.27s real  1042.74s user    95.26s system
   91.68s real  1042.17s user    95.94s system
   93.02s real  1040.41s user    96.48s system
   90.72s real  1042.51s user    95.32s system
   -------------------------------------------
   90.70        1041.96          95.82

I ran some instrumented runs, to see how many task/vma/page migrations
occur during the builds.  The numbers are "all over the map", even with
repeated runs on the same kernel.  However, bypassing the check for
internode migration that results in calling migrate_task_memory in the
exec path does seem to decrease the number of such calls:

        Test                     tasks     vmas   pages
16-rc5-git11+autodirect           2230    17137   3629
16-rc5-git11+autolazy             2973    22385   3109

16-rc6-git1+autolazy              2041    15981   7485

16-rc6-git1+autolazy/nixexec      1996    15587   8505
16-rc6-git1+autolazy/nixexec      1946    14927   3019
16-rc6-git1+autolazy/nixexec      2171    16758   8231

tasks = migrate_task_memory calls
vmas  = migrate_vma_to_node calls
pages = [buffer_]migrate_page calls

The first 2 lines are the numbers I reported in the automigration
overview post.  I only took a single measurement on rc6-git1 without the
patch below.  There happened to be a couple of hundred less calls to
migration_task_memory that in the rc5-git-11 cases.  When I added the
patch, 2 of the 3 runs I took [after rebuild/reboot] had less calls to
migrate_task_memory, and fewer calls to migrate_vma_to_node, as well.

Note:  out of all the runs above, I only saw 3 buffer_migrate_page
calls.  I suspect these are shared text/library pages that just happened
to be only mapped into caller's page table at time of scan.

> 
> BTW, what happens against shared pages ?

I have made no changes to the way that 2.6.16-rc* migration code handles
shared pages.  Note that migrate_task_memory()/migrate_vma_to_node()
calls check_range() with the flag MPOL_MF_MOVE.  This will select for
migration pages that are only mapped by the calling task--i.e., only in
the calling task's page tables.  This includes shared pages that are
only mapped by the calling task.  With the current migration code, we
have 2 flags:  '_MOVE and '_MOVE_ALL.  '_MOVE behaves as described
above; '_MOVE_ALL is more aggressive and migrates pages regardless of
the # of mappings.  Christoph says that's primarily for cpusets, but the
migrate_pages() sys call will also use 'MOVE_ALL when invoked as root.
I'm working on another patch to experiment with finer grain control over
this.  I'll add another [temporary ;-)] sysctl to specify the max # of
references to allow when selecting a page for migration.  Then, I'll
measure the effect on various workloads.  

In some of my testing, I've noticed that with the current '_MOVE
semantics, a lot of private, anon pages won't migrate because they're
shared "copy-on-write" between parents and [grand]children.  Perhaps a
threshold > 1 might be appropriate?  I'll post my findings when I have
them.

So, here an experimental patch to nix the check for internode migration
when migrating task on exec:

---------------------------
Bypass check for internode task migration when migrate_task()
is being called in the exec() path.  

I may fold this into the automigrate "hook sched migrate to 
memory migration" [6/8] patch if if proves beneficial.  
It seems like calling migrate_task_memory() on a migration that
occured because of an exec() is a waste of time.  However, it
does give the new task a chance to pull some nominally shared
pages [executable image or libraries] local to itself.

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>

Index: linux-2.6.16-rc6-git1/kernel/sched.c
===================================================================
--- linux-2.6.16-rc6-git1.orig/kernel/sched.c	2006-03-13 09:05:17.000000000 -0500
+++ linux-2.6.16-rc6-git1/kernel/sched.c	2006-03-13 09:52:48.000000000 -0500
@@ -865,7 +865,8 @@ typedef struct {
  * The task's runqueue lock must be held.
  * Returns true if you have to wait for migration thread.
  */
-static int migrate_task(task_t *p, int dest_cpu, migration_req_t *req)
+static int migrate_task(task_t *p, int dest_cpu, migration_req_t *req,
+			int execing)
 {
 	runqueue_t *rq = task_rq(p);

@@ -874,7 +875,8 @@ static int migrate_task(task_t *p, int d
 	 * it is sufficient to simply update the task's cpu field.
 	 */
 	if (!p->array && !task_running(rq, p)) {
-		check_internode_migration(p, dest_cpu);
+		if (!execing)
+			check_internode_migration(p, dest_cpu);
 		set_task_cpu(p, dest_cpu);
 		return 0;
 	}
@@ -1738,7 +1740,7 @@ static void sched_migrate_task(task_t *p
 		goto out;

 	/* force the process onto the specified CPU */
-	if (migrate_task(p, dest_cpu, &req)) {
+	if (migrate_task(p, dest_cpu, &req, 1)) {
 		/* Need to wait for migration thread (might exit: take ref). */
 		struct task_struct *mt = rq->migration_thread;
 		get_task_struct(mt);
@@ -4414,7 +4416,7 @@ int set_cpus_allowed(task_t *p, cpumask_
 	if (cpu_isset(task_cpu(p), new_mask))
 		goto out;

-	if (migrate_task(p, any_online_cpu(new_mask), &req)) {
+	if (migrate_task(p, any_online_cpu(new_mask), &req, 0)) {
 		/* Need help from migration thread: drop lock and wait. */
 		task_rq_unlock(rq, &flags);
 		wake_up_process(rq->migration_thread);

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

next prev parent reply	other threads:[~2006-03-13 17:27 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2006-03-10 19:33 Lee Schermerhorn
2006-03-11  6:41 ` KAMEZAWA Hiroyuki
2006-03-13 17:27   ` Lee Schermerhorn [this message]
2006-03-13 23:45     ` Christoph Lameter
2006-03-15 16:05       ` Avi Kivity
2006-03-15 17:54         ` Paul Jackson
2006-03-15 18:10           ` Christoph Lameter
2006-03-15 18:14             ` Paul Jackson
2006-03-15 18:20               ` Christoph Lameter
2006-03-15 19:21                 ` Lee Schermerhorn
2006-03-15 18:57               ` Avi Kivity
2006-03-15 19:27                 ` Lee Schermerhorn
2006-03-15 19:56                   ` Jack Steiner
2006-03-14 22:27 ` Lee Schermerhorn

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1142270857.5210.50.camel@localhost.localdomain \
    --to=lee.schermerhorn@hp.com \
    --cc=kamezawa.hiroyu@jp.fujitsu.com \
    --cc=linux-mm@kvack.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox