* [PATCH/RFC] AutoPage Migration - V0.1 - 0/8 Overview
From: Lee Schermerhorn @ 2006-03-10 19:33 UTC (permalink / raw)
To: linux-mm
AutoPage Migration - V0.1 - 0/8 Overview
We have seen some workloads suffer decreases in performance on NUMA
platforms when the Linux scheduler moves the tasks away from their initial
memory footprint. Some users--e.g., HPC--are motivated by this to go to
great lengths to ensure that tasks start up and stay on specific nodes.
2.6.16 includes memory migration mechanisms that will allow these users
to move memory along with their tasks--either manually or under control
of a load scheduling program--in response to changing demands on the
resources.
Other users--e.g., "Enterprise" applications--would prefer that the system
just "do the right thing" in this respect. One possible approach would
be to have the system automatically migrate a task's pages when it decides
to move the task to a different node from where it has executed in the
past. One can debate [and we DO, at length] whether this would improve
the performance or not. But, why not provide a patch and measure the
effects for various policies? I.e., "show me the code."
So, ....
This series of patches hooks up linux 2.6.16 direct page migration to the
task scheduler. The effect is such that, when load balancing moves a task
to a cpu on a different node from where the task last executed, the task
is notified of this change using the same mechanism that is used to notify
a task of pending signals. When the task returns to user state, it attempts
to migrate to the new node any pages, in those of its vm areas under default
policy, that are not already there.
This behavior is disabled by default, but can be enabled by writing non-
zero to /sys/kernel/migration/sched_migrate_memory. [Could call this
"auto_migrate_memory"?] Furthermore, to prevent thrashing, a second
sysctl, sched_migrate_interval, has been implemented. The load balancer
will not move a task to a different node if it has moved to a new node
in the last sched_migrate_interval seconds. [User interface is in
seconds; internally it's in HZ.] The idea is to give the task time to
amortize the cost of the migration by letting it benefit from local
references to the pages.
The controls, enable/disable and interval, will enable performance testing
of this mechanism to help decide whether it is worth inclusion.
The Patches:
Patches 01-05 apply directly to 2.6.16-rc5-git11. However, they should
also apply on top of the previously posted "migrate-on-fault" patches
with some fuzz/offsets. Patch 06 requires that the migrate-on-fault
patches be applied first.
automigrate-01-add-migrate_task_memory.patch
This patch adds the function migrate_task_memory() to mempolicy.c
to migrate vmas with default policy to the new node. A second
helper function, migrate_vma_to_node(), does the actual work of
scanning the vma's address range [check_range] and invoking the
existing [in 2.6.16-rc*] migrate_pages_to() function for a non-
empty pagelist.
Note that this mechanism uses non-aggressive migration--i.e.,
MPOL_MF_MOVE rather than MPOL_MF_MOVE_ALL. Therefore, it gives
up rather easily. E.g., anon pages still shared, copy-on-write,
between ancestors and descendants will not be migrated.
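For reference, the shape of the two functions is roughly as follows [sketch
only -- the vma-policy test, locking, and the exact check_range()/
migrate_pages_to() calling conventions shown here are simplifications, not
the actual patch]:

/*
 * Minimal sketch of the shape of these two functions -- not the actual
 * patch.  The vma-policy test, locking, and the exact check_range()/
 * migrate_pages_to() calling conventions are assumptions here.
 */
static void migrate_vma_to_node(struct mm_struct *mm,
                                struct vm_area_struct *vma, int dest_node)
{
        nodemask_t src_nodes = node_online_map;
        LIST_HEAD(pagelist);

        /* collect pages in this range that are not already on dest_node;
         * MPOL_MF_MOVE only isolates pages mapped solely by this mm */
        node_clear(dest_node, src_nodes);
        check_range(mm, vma->vm_start, vma->vm_end, &src_nodes,
                    MPOL_MF_MOVE, &pagelist);

        if (!list_empty(&pagelist))
                migrate_pages_to(&pagelist, vma, dest_node);
}

void migrate_task_memory(struct task_struct *task, int dest_node)
{
        struct mm_struct *mm = task->mm;
        struct vm_area_struct *vma;

        if (!mm)
                return;

        down_read(&mm->mmap_sem);
        for (vma = mm->mmap; vma; vma = vma->vm_next)
                if (!vma->vm_policy)    /* default policy only */
                        migrate_vma_to_node(mm, vma, dest_node);
        up_read(&mm->mmap_sem);
}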
automigrate-02-add-sched_migrate_memory-sysctl.patch
This patch adds the infrastructure for the /sys/kernel/migration
group as well as the sched_migrate_memory control. Because we
have no separate migration source file, I added this to
mempolicy.c.
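Purely as an illustration, the control itself boils down to a flag with
show/store handlers; the sysfs registration into the /sys/kernel/migration
group is omitted here and the handler signatures are simplified, so this is
not the patch's actual plumbing:

/*
 * Illustration only: registration of the /sys/kernel/migration group is
 * omitted and these handler signatures are simplified.
 */
int sysctl_sched_migrate_memory;        /* 0 = disabled (the default) */

static ssize_t sched_migrate_memory_show(char *buf)
{
        return sprintf(buf, "%d\n", sysctl_sched_migrate_memory);
}

static ssize_t sched_migrate_memory_store(const char *buf, size_t count)
{
        sysctl_sched_migrate_memory = !!simple_strtoul(buf, NULL, 0);
        return count;
}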
automigrate-03.0-check-notify-migrate-pending.patch
This patch adds a minimal <linux/auto-migrate.h> header to interface
the scheduler to the auto-migration. The header includes a static
inline function for the scheduler to check for internode migration
and notify the task [by setting the TIF_NOTIFY_RESUME thread info
flag], if the task is migrating to a new node and sched_migrate_memory
is enabled. The header also includes the function
check_migrate_pending()
that the task calls when returning to user state, when it notices
TIF_NOTIFY_RESUME set. Both of these functions become null macros
when MIGRATION is not configured.
However, note that in 2.6.16-rc*, one cannot deselect MIGRATION when
building with NUMA configured.
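A sketch of what the two inlines look like conceptually [the per-task
"pending destination node" field used below is an assumption for
illustration, not necessarily how the patch stores it]:

/* Sketch of <linux/auto-migrate.h>; field names other than those
 * mentioned above are assumptions, not the actual patch. */
#ifdef CONFIG_MIGRATION

/* Called by the scheduler just before set_task_cpu(). */
static inline void check_internode_migration(struct task_struct *p,
                                             int dest_cpu)
{
        if (!sysctl_sched_migrate_memory)
                return;
        if (cpu_to_node(dest_cpu) == cpu_to_node(task_cpu(p)))
                return;         /* staying on the same node: nothing to do */

        p->migrate_dest_node = cpu_to_node(dest_cpu);   /* assumed field */
        set_tsk_thread_flag(p, TIF_NOTIFY_RESUME);
}

/* Called on the return-to-user path when TIF_NOTIFY_RESUME is seen. */
static inline void check_migrate_pending(struct task_struct *p)
{
        int node = p->migrate_dest_node;

        if (node >= 0) {
                p->migrate_dest_node = -1;
                migrate_task_memory(p, node);
        }
}

#else   /* !CONFIG_MIGRATION */
#define check_internode_migration(p, cpu)       do { } while (0)
#define check_migrate_pending(p)                do { } while (0)
#endif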
automigrate-03.1-ia64-check-notify-migrate-pending.patch
This patch adds the call to the check_migrate_pending() to the
ia64 specific do_notify_resume_user() function. Note that this
is the same mechanism used to deliver signals and perfmon events
to a task.
automigrate-03.2-x86_64-check-notify-migrate-pending.patch
This patch adds the call to check_migrate_pending() to the x86_64
specific do_notify_resume() function. This is just an example
for an arch other than ia64. I haven't tested this yet.
automigrate-04-hook-sched-internode-migration.patch
This patch hooks the calls to check_internode_migration() into
the scheduler [kernel/sched.c] in places where the scheduler
sets a new cpu for the task--i.e., just before calls to
set_task_cpu(). Because these are in migration paths that are
already relatively "heavy-weight", they don't add overhead to
scheduler fast paths. And, they become empty or constant
macros when MIGRATION is not configured in.
automigrate-05-add-internode-migration-hysteresis.patch
This patch adds the sched_migrate_interval control to the
/sys/kernel/migration group, and adds a function to the auto-migrate.h
header--too_soon_for_internode_migration()--to check whether it's too
soon for another internode migration. This function becomes a macro
that evaluates to "false" [0], when MIGRATION is not configured.
This check is added to try_to_wake_up() and can_migrate_task() to
suppress internode migrations when the last one occurred less than
sched_migrate_interval seconds [HZ] ago.
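Conceptually, the check is just a comparison against a per-task timestamp
of the last internode migration [the field name below is illustrative; 30
seconds is my current default]:

/* Sketch only: the timestamp field is an assumption for illustration. */
#ifdef CONFIG_MIGRATION
unsigned long sched_migrate_interval = 30 * HZ; /* kept in jiffies internally */

static inline int too_soon_for_internode_migration(struct task_struct *p)
{
        return time_before(jiffies,
                           p->last_internode_migration +
                           sched_migrate_interval);
}
#else
#define too_soon_for_internode_migration(p)     0       /* never too soon */
#endif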
BONUS PATCH:
automigrate-06-hook-to-migrate-on-fault.patch
This patch, which requires the migrate-on-fault capability,
hooks automigration up to migrate-on-fault, with an additional
control--/sys/kernel/migration/sched_migrate_lazy--to enable
it.
TESTING:
I have tested this patch on a 16-cpu/4-node HP rx8620 [ia64] platform with
everyone's favorite benchmark.
Kernel builds [after make mrproper+make defconfig]
on 2.6.16-rc5-git11 on 16-cpu/4 node/32GB HP rx8620 [ia64].
Times taken after a warm-up run.
Entire kernel source likely held in page cache.
This amplifies the effect of the patches because I
can't hide behind disk IO time.
No auto-migrate patches:
88.20s real 1042.56s user 97.26s system
88.92s real 1042.27s user 98.08s system
88.40s real 1043.58s user 96.51s system
91.45s real 1042.46s user 97.07s system
93.29s real 1040.90s user 96.88s system
90.15s real 1042.06s user 97.02s system
90.45s real 1042.75s user 96.98s system
90.77s real 1041.87s user 98.61s system
90.21s real 1042.00s user 96.91s system
88.50s real 1042.23s user 97.30s system
-------------------------------------------
90.03s real 1042.26s user 97.26s system - mean
1.59 0.68 0.62 - std dev'n
With auto-migration patches, sched_migrate_memory disabled:
88.98s real 1042.28s user 96.88s system
88.75s real 1042.71s user 97.51s system
89.42s real 1042.32s user 97.42s system
87.83s real 1042.92s user 96.06s system
92.47s real 1041.12s user 95.96s system
89.14s real 1043.77s user 97.10s system
88.11s real 1044.04s user 95.16s system
91.74s real 1042.21s user 96.43s system
89.36s real 1042.31s user 96.56s system
88.55s real 1042.50s user 96.25s system
-------------------------------------------
89.43s real 1042.61s user 96.53s system - mean
1.51 0.83 0.72 - std dev'n
With auto-migration patches, sched_migrate_memory enabled:
90.62s real 1041.64s user 106.80s system
89.94s real 1042.82s user 105.00s system
91.34s real 1041.89s user 107.74s system
90.12s real 1041.77s user 108.01s system
90.93s real 1042.00s user 106.50s system
93.97s real 1040.12s user 106.16s system
90.65s real 1041.87s user 106.81s system
90.53s real 1041.46s user 106.74s system
91.84s real 1041.59s user 105.57s system
90.28s real 1041.69s user 106.64s system
-------------------------------------------
91.02s real 1041.68s user 106.60s system - mean
1.18 0.67 0.90 - std dev'n
Not stellar! An insignificant decrease in user time, but
a ~1% increase in run time [from the unpatched case] and
a ~10% increase in system time. In short, page migration,
and/or the scanning of vm areas for eligible pages, is
expensive and, for this job, the programs don't see
enough benefit from the resulting locality to pay for the cost
of migration. Compilers just don't run long enough!
On one instrumented sample auto-direct run:
migrate_task_memory called 3628 times = #internode migrations
migrate_vma_to_node called 17137 times = 7.68 vma/task
migrate_page called 3628 times = 1.62 pages/task
Very few "eligible" pages found in eligible vmas! Perhaps
we're not being aggressive enough in attempts to migrate.
------------
Now, with the last patch, hooking automigration to
migrate-on-fault:
With auto-migrate + migrate-on-fault patches;
sched_migrate_memory disabled:
88.02s real 1042.77s user 95.62s system
91.56s real 1041.05s user 97.50s system
90.41s real 1040.88s user 98.07s system
90.41s real 1041.64s user 97.00s system
89.82s real 1042.45s user 96.35s system
88.28s real 1042.25s user 96.91s system
91.51s real 1042.74s user 95.90s system
93.34s real 1041.72s user 96.07s system
89.09s real 1041.00s user 97.35s system
89.44s real 1041.57s user 96.55s system
-------------------------------------------
90.19s real 1041.81s user 96.73s system - mean
1.63 0.71 0.78 - std dev'n
With auto-migrate + migrate-on-fault patches;
sched_migrate_memory and sched_migrate_lazy enabled:
91.72s real 1039.17s user 108.92s system
91.02s real 1041.62s user 107.38s system
91.21s real 1041.84s user 106.63s system
93.24s real 1039.50s user 107.54s system
92.64s real 1040.79s user 107.10s system
92.52s real 1040.79s user 107.14s system
91.85s real 1039.90s user 108.26s system
90.58s real 1043.34s user 106.06s system
92.30s real 1040.88s user 106.64s system
94.25s real 1039.96s user 106.85s system
-------------------------------------------
92.13s real 1040.78s user 107.25s system - mean
1.10 1.25 0.84 - std dev'n
Also, no win for kernel builds. Again, slightly less
user time, but even more system and real time [~1sec each]
than the auto+direct run.
On one instrumented sample auto-lazy run:
migrate_task_memory called 3777 times = #internode migrations
migrate_vma_to_node called 28586 times = 7.56 vma/task
migrate_page called 3886 times = 1.02 pages/task
Similar pattern, but a lot more "eligible" vmas; fewer
eligible pages. More internode migrations.
TODO:
Next week, I'll try some longer running workloads that we know
have suffered from the scheduler moving them away from their
memory--e.g., McCalpin STREAM. Will report results when
available.
Maybe also test with more aggressive migration: '_MOVE_ALL.
I'll also move this to the -mm tree, once I port my trace
instrumentation from relayfs to sysfs.
* Re: [PATCH/RFC] AutoPage Migration - V0.1 - 0/8 Overview
From: KAMEZAWA Hiroyuki @ 2006-03-11 6:41 UTC (permalink / raw)
To: lee.schermerhorn; +Cc: linux-mm
Hi, a few comments.
On Fri, 10 Mar 2006 14:33:14 -0500
Lee Schermerhorn <lee.schermerhorn@hp.com> wrote:
> Furthermore, to prevent thrashing, a second
> sysctl, sched_migrate_interval, has been implemented. The load balancer
> will not move a task to a different node if it has moved to a new node
> in the last sched_migrate_interval seconds. [User interface is in
> seconds; internally it's in HZ.] The idea is to give the task time to
> amortize the cost of the migration by letting it benefit from local
> references to the pages.
I think this HZ value should be estimated automatically by the kernel, not set by the user.
> Kernel builds [after make mrproper+make defconfig]
> on 2.6.16-rc5-git11 on 16-cpu/4 node/32GB HP rx8620 [ia64].
> Times taken after a warm-up run.
> Entire kernel source likely held in page cache.
> This amplifies the effect of the patches because I
> can't hide behind disk IO time.
It looks like you added check_internode_migration() in migrate_task().
migrate_task() is called by sched_migrate_task().
And... sched_migrate_task() is called by sched_exec().
(A process can be migrated on exec().)
In this case, migrate_task_memory() just wastes time, I think.
BTW, what happens with shared pages?
-- Kame
* Re: [PATCH/RFC] AutoPage Migration - V0.1 - 0/8 Overview
From: Lee Schermerhorn @ 2006-03-13 17:27 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm
On Sat, 2006-03-11 at 15:41 +0900, KAMEZAWA Hiroyuki wrote:
> Hi, a few comments.
Thanks!
>
> On Fri, 10 Mar 2006 14:33:14 -0500
> Lee Schermerhorn <lee.schermerhorn@hp.com> wrote:
> > Furthermore, to prevent thrashing, a second
> > sysctl, sched_migrate_interval, has been implemented. The load balancer
> > will not move a task to a different node if it has moved to a new node
> > in the last sched_migrate_interval seconds. [User interface is in
> > seconds; internally it's in HZ.] The idea is to give the task time to
> > amortize the cost of the migration by letting it benefit from local
> > references to the pages.
> I think this HZ value should be estimated automatically by the kernel, not set by the user.
Well, perhaps, eventually... When we have a feel for what the algorithm
should be. Perhaps a single value, which might be different for
different platforms, would suffice. I know that for a similar
implementation in Tru64 Unix on Alpha, we settled on a constant value of
30 seconds [my current default]. But that was a single architecture OS.
And, this patch series is still "experimental", so I wanted to be able
to measure the effect of this interval w/o having to reboot a new kernel
to change the value. On my test platform, rebooting takes about 6x as
long as rebuilding the kernel :-(.
>
>
> > Kernel builds [after make mrproper+make defconfig]
> > on 2.6.16-rc5-git11 on 16-cpu/4 node/32GB HP rx8620 [ia64].
> > Times taken after a warm-up run.
> > Entire kernel source likely held in page cache.
> > This amplifies the effect of the patches because I
> > can't hide behind disk IO time.
>
> It looks like you added check_internode_migration() in migrate_task().
> migrate_task() is called by sched_migrate_task().
> And... sched_migrate_task() is called by sched_exec().
> (A process can be migrated on exec().)
> In this case, migrate_task_memory() just wastes time, I think.
You're probably right about wasting time in the exec() case.
migrate_task() is also called from set_cpus_allowed() when changing a
task's cpu affinity. In this case, I think we want to migrate memory to
follow the task if it moves to a new node. So, I've added the patch
below to bypass the "check_internode_migration()" when migrate_task() is
called from sched_migrate_task().
When I first looked at this, I didn't think calling migrate_task_memory()
in the exec case would add too much overhead. It won't get called
until the task returns to user state in the context of the newly exec'd
image. At that point, there shouldn't be many private/anon pages
already faulted into the task's pte's. I agree that any such pages
should be on the correct node and therefore unmapping them, only to
fault the ptes back in on touch, is a waste of time. However, I did
want to give the task a shot at pulling any "eligible" shared pages [see
answer to your question regarding shared pages below].
So here are the results for kernel builds on 2.6.16-rc6-git1 with and
without the patch below. All runs have both the auto-migration and
migrate-on-fault patches installed. I reran a few of each of the tests
posted earlier to establish a new baseline. Again, this is on a 16-
cpu/4-node/32GB ia64 platform. I should also mention that I build with
-j32 [2 x nr_cpus].
sched_migrate_memory disabled:
88.01s real 1041.58s user 95.67s system
88.45s real 1041.86s user 94.71s system
88.02s real 1043.03s user 94.18s system
90.36s real 1041.62s user 95.00s system
89.59s real 1040.90s user 95.62s system
-------------------------------------------
88.89 1041.80 95.04
sched_migrate_memory enabled, lazy [migrate on fault] disabled:
91.14s real 1040.60s user 104.53s system
94.01s real 1038.49s user 105.66s system
90.40s real 1039.60s user 105.70s system
93.22s real 1039.69s user 105.09s system
94.11s real 1039.20s user 105.66s system
-------------------------------------------
92.58 1039.52 105.33
sched_migrate_memory + sched_migrate_lazy enabled:
91.53s real 1040.46s user 106.04s system
93.45s real 1040.49s user 105.67s system
92.01s real 1041.31s user 104.86s system
93.65s real 1039.96s user 105.20s system
91.40s real 1041.92s user 104.96s system
-------------------------------------------
92.41 1040.83 105.35
With the "nix memory migration on exec" patch:
sched_migrate_memory + sched_migrate_lazy enabled:
89.30s real 1041.45s user 105.60s system
89.44s real 1042.53s user 105.24s system
89.03s real 1043.35s user 104.09s system
92.37s real 1039.92s user 107.62s system <---?
93.42s real 1040.00s user 105.86s system
-------------------------------------------
90.71 1041.45 105.68
Real time is a little [not significantly] faster than w/o this patch.
But both the user and system times are a little higher. I think that
the system time would have been better except for the one run with
noticeably longer system time.
Same kernel as above with:
sched_migrate_memory + sched_migrate_lazy disabled:
89.79s real 1041.97s user 96.12s system
88.27s real 1042.74s user 95.26s system
91.68s real 1042.17s user 95.94s system
93.02s real 1040.41s user 96.48s system
90.72s real 1042.51s user 95.32s system
-------------------------------------------
90.70 1041.96 95.82
I ran some instrumented runs, to see how many task/vma/page migrations
occur during the builds. The numbers are "all over the map", even with
repeated runs on the same kernel. However, bypassing the check for
internode migration that results in calling migrate_task_memory in the
exec path does seem to decrease the number of such calls:
Test tasks vmas pages
16-rc5-git11+autodirect 2230 17137 3629
16-rc5-git11+autolazy 2973 22385 3109
16-rc6-git1+autolazy 2041 15981 7485
16-rc6-git1+autolazy/nixexec 1996 15587 8505
16-rc6-git1+autolazy/nixexec 1946 14927 3019
16-rc6-git1+autolazy/nixexec 2171 16758 8231
tasks = migrate_task_memory calls
vmas = migrate_vma_to_node calls
pages = [buffer_]migrate_page calls
The first 2 lines are the numbers I reported in the automigration
overview post. I only took a single measurement on rc6-git1 without the
patch below. There happened to be a couple of hundred fewer calls to
migrate_task_memory than in the rc5-git11 cases. When I added the
patch, 2 of the 3 runs I took [after rebuild/reboot] had fewer calls to
migrate_task_memory, and fewer calls to migrate_vma_to_node, as well.
Note: out of all the runs above, I only saw 3 buffer_migrate_page
calls. I suspect these are shared text/library pages that just happened
to be mapped only into the caller's page table at the time of the scan.
>
> BTW, what happens with shared pages?
I have made no changes to the way that 2.6.16-rc* migration code handles
shared pages. Note that migrate_task_memory()/migrate_vma_to_node()
calls check_range() with the flag MPOL_MF_MOVE. This will select for
migration pages that are only mapped by the calling task--i.e., only in
the calling task's page tables. This includes shared pages that are
only mapped by the calling task. With the current migration code, we
have 2 flags: '_MOVE and '_MOVE_ALL. '_MOVE behaves as described
above; '_MOVE_ALL is more aggressive and migrates pages regardless of
the # of mappings. Christoph says that's primarily for cpusets, but the
migrate_pages() sys call will also use 'MOVE_ALL when invoked as root.
I'm working on another patch to experiment with finer-grained control over
this. I'll add another [temporary ;-)] sysctl to specify the max # of
references to allow when selecting a page for migration. Then, I'll
measure the effect on various workloads.
In some of my testing, I've noticed that with the current '_MOVE
semantics, a lot of private, anon pages won't migrate because they're
shared "copy-on-write" between parents and [grand]children. Perhaps a
threshold > 1 might be appropriate? I'll post my findings when I have
them.
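Not actual code, but the selection test such a sysctl would gate looks
roughly like this; with the threshold at 1 it matches today's '_MOVE
behavior, and raising it would also pick up pages shared copy-on-write
among a few related tasks:

/* Sketch of the proposed selection test; migrate_max_mapcount is the
 * hypothetical sysctl described above. */
int migrate_max_mapcount = 1;

static inline int page_eligible_for_migration(struct page *page,
                                              unsigned long flags)
{
        if (flags & MPOL_MF_MOVE_ALL)
                return 1;               /* move regardless of sharing */

        return page_mapcount(page) <= migrate_max_mapcount;
}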
So, here is an experimental patch to nix the check for internode migration
when migrating a task on exec:
---------------------------
Bypass check for internode task migration when migrate_task()
is being called in the exec() path.
I may fold this into the automigrate "hook sched migrate to
memory migration" [6/8] patch if if proves beneficial.
It seems like calling migrate_task_memory() on a migration that
occurred because of an exec() is a waste of time. However, it
does give the new task a chance to pull some nominally shared
pages [executable image or libraries] local to itself.
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Index: linux-2.6.16-rc6-git1/kernel/sched.c
===================================================================
--- linux-2.6.16-rc6-git1.orig/kernel/sched.c 2006-03-13 09:05:17.000000000 -0500
+++ linux-2.6.16-rc6-git1/kernel/sched.c 2006-03-13 09:52:48.000000000 -0500
@@ -865,7 +865,8 @@ typedef struct {
* The task's runqueue lock must be held.
* Returns true if you have to wait for migration thread.
*/
-static int migrate_task(task_t *p, int dest_cpu, migration_req_t *req)
+static int migrate_task(task_t *p, int dest_cpu, migration_req_t *req,
+ int execing)
{
runqueue_t *rq = task_rq(p);
@@ -874,7 +875,8 @@ static int migrate_task(task_t *p, int d
* it is sufficient to simply update the task's cpu field.
*/
if (!p->array && !task_running(rq, p)) {
- check_internode_migration(p, dest_cpu);
+ if (!execing)
+ check_internode_migration(p, dest_cpu);
set_task_cpu(p, dest_cpu);
return 0;
}
@@ -1738,7 +1740,7 @@ static void sched_migrate_task(task_t *p
goto out;
/* force the process onto the specified CPU */
- if (migrate_task(p, dest_cpu, &req)) {
+ if (migrate_task(p, dest_cpu, &req, 1)) {
/* Need to wait for migration thread (might exit: take ref). */
struct task_struct *mt = rq->migration_thread;
get_task_struct(mt);
@@ -4414,7 +4416,7 @@ int set_cpus_allowed(task_t *p, cpumask_
if (cpu_isset(task_cpu(p), new_mask))
goto out;
- if (migrate_task(p, any_online_cpu(new_mask), &req)) {
+ if (migrate_task(p, any_online_cpu(new_mask), &req, 0)) {
/* Need help from migration thread: drop lock and wait. */
task_rq_unlock(rq, &flags);
wake_up_process(rq->migration_thread);
* Re: [PATCH/RFC] AutoPage Migration - V0.1 - 0/8 Overview
From: Christoph Lameter @ 2006-03-13 23:45 UTC (permalink / raw)
To: Lee Schermerhorn; +Cc: KAMEZAWA Hiroyuki, linux-mm, pj
On Mon, 13 Mar 2006, Lee Schermerhorn wrote:
> > BTW, what happens with shared pages?
>
> I have made no changes to the way that 2.6.16-rc* migration code handles
> shared pages. Note that migrate_task_memory()/migrate_vma_to_node()
> calls check_range() with the flag MPOL_MF_MOVE. This will select for
> migration pages that are only mapped by the calling task--i.e., only in
> the calling task's page tables. This includes shared pages that are
> only mapped by the calling task. With the current migration code, we
> have 2 flags: '_MOVE and '_MOVE_ALL. '_MOVE behaves as described
> above; '_MOVE_ALL is more aggressive and migrates pages regardless of
> the # of mappings. Christoph says that's primarily for cpusets, but the
> migrate_pages() sys call will also use 'MOVE_ALL when invoked as root.
cpusets uses _MOVE_ALL because Paul wanted it that way. I still think it
is a bad idea to move shared libraries etc. _MOVE only moves the pages used
by the currently executing process. If you do a MOVE_ALL then you may
cause delays in other processes because they have to wait for their pages
to become available again. Also they may have to generate additional
faults to restore their PTEs. So you are negatively impacting other
processes. Note that these wait times can be extensive if _MOVE_ALL is,
e.g., just migrating a critical glibc page that all processes use.
* Re: [PATCH/RFC] AutoPage Migration - V0.1 - 0/8 Overview
From: Lee Schermerhorn @ 2006-03-14 22:27 UTC (permalink / raw)
To: linux-mm
Some results for auto migration + migrate on fault using the McCalpin
STREAM benchmark.
Kernel: 2.6.16-rc6-git1 with the migrate-on-fault and automigration
patches applied, in that order.
Platform: HP rx8620 16cpu/4node/32GB NUMA platform with 1.6GHz ia64
cpus.
I built the stream benchmark to run for 1000 loops [NTIMES defined as
1000], and at the end of each loop I printed the run# and the times for
the copy, scale, add and triad loops. I compiled with icc [Intel
compiler] 9.0 using OpenMP extensions to parallelize the loops.
I ran the benchmark twice: once with auto migration disabled
[/sys/kernel/migration/sched_migrate_memory = 0] and once with it
enabled [including migrate on fault--a.k.a lazy migration]. After I
started each run, I did a kernel build on the platform using: make
mrproper; make oldconfig [using my regular .config]; make -j32 all.
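For reference, each timed iteration looks roughly like the sketch below
[not the actual benchmark source]; because the timing brackets the parallel
loop with gettimeofday(), any page migrations triggered while the loop runs
show up directly in the reported times:

/* Sketch of a STREAM-style timed loop, not the actual benchmark. */
#include <stdio.h>
#include <sys/time.h>

#define N       2000000                 /* array length (illustrative) */
#define NTIMES  1000

static double a[N], b[N], c[N];

static double now(void)
{
        struct timeval tv;

        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec * 1.0e-6;
}

int main(void)
{
        const double scalar = 3.0;
        int run, j;

        for (run = 0; run < NTIMES; run++) {
                double t = now();

#pragma omp parallel for                /* icc -openmp splits this loop */
                for (j = 0; j < N; j++)
                        a[j] = b[j] + scalar * c[j];    /* triad */

                printf("%d triad %.6f\n", run, now() - t);
        }
        return 0;
}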
I have posted graphs of the results and the raw data itself at:
http://free.linux.hp.com/~lts/Patches/PageMigration/
The nomig+kbuild files [automigration disabled] show the benchmark
starting with more or less "best case" times. By best case, I mean
that if you run the benchmark over and over [with, say, NTIMES = 10],
you'll see a number of different results, depending on how the OMP
threads happen to fall relative to the data. I have made no effort to
do any initial placement. The copy and scale times are ~0.052 seconds;
the add and triad numbers are ~0.078 seconds. Then the kernel build
disturbs the run for a while, during which time the scheduler migrates
the stream/OMP threads to the least busy groups/queues. When the kernel
build is done, the threads happen to end up in a rather suboptimal configuration.
This is just a roll of the dice. They could have ended up in better
shape.
The automig+kbuild files show the same scenario. However, the threads
happened to start in a suboptimal configuration, similar to the end
state of the nomig case. Again, I made no effort to place the threads
at the start. As soon as the kernel build starts, migration kicks in.
I believe that the first spike is from the "make mrproper". This
disturbance is sufficient to cause the stream/OMP threads to migrate, at
which time autopage migration causes the pages to be migrated to the
node with the threads. Then, the "make -j32 all" really messes up the
benchmark, but when the build finishes, the job runs in the more or less
best case configuration.
Some things to note: the peaks for the automigration case are higher
[almost double] those of the no-migration case because the page
migrations get included in the benchmark times, which are measured
with gettimeofday().
[sched_migrate_lazy enabled], when a task migrates to a new node, I only
unmap the pages. Then, they migrate or not, as threads touch them. If
the page is already on the node where the faulting thread is running,
the fault just reinstates the pte for the page. If not, it will migrate
the page to the thread's node and install that pte.
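In pseudo-code, the per-touch decision is roughly the following [the helper
name is a stand-in, not the real migrate-on-fault interface]:

/*
 * Illustrative pseudo-code for the per-touch decision described above;
 * migrate_page_to_node() is a stand-in name, not the real interface.
 */
static struct page *resolve_unmapped_page(struct page *page)
{
        int this_node = numa_node_id();         /* node of the faulting cpu */

        if (page_to_nid(page) == this_node)
                return page;    /* already local: just reinstate the pte */

        /* allocate on this node, copy the data, free the old page */
        return migrate_page_to_node(page, this_node);
}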
Finally, as noted above, the first disturbance in the automig case
caused some migrations that resulted in a better thread/memory
configuration than what the program started with. To give an
application control over this, it might be useful to provide an
MPOL_NOOP policy to be used along with the MPOL_MF_MOVE flag, and the
MPOL_MF_LAZY flag that I implemented in the last of the migrate-on-fault
patches. The 'NOOP would retain the existing policy for the specified
range, but the 'MOVE+'LAZY would unmap the ptes of the range, pushing
anon pages to the swap cache, if necessary.
This would allow the threads of an application to pull the pages local
to their node on next touch. I will test this theory...
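From userspace, the call might look something like this; MPOL_NOOP and
MPOL_MF_LAZY are the flags proposed here, not part of any released kernel,
so the values in this sketch are placeholders:

/* Sketch of the proposed interface; MPOL_NOOP and MPOL_MF_LAZY are
 * hypothetical and their values below are placeholders. */
#include <stddef.h>
#include <numaif.h>

#define MPOL_NOOP       4               /* hypothetical: keep existing policy */
#define MPOL_MF_LAZY    (1 << 3)        /* hypothetical: unmap now, migrate on touch */

static void migrate_on_next_touch(void *addr, size_t len)
{
        /* keep the current policy for the range, but unmap it so each
         * thread pulls pages to its own node when it next touches them */
        mbind(addr, len, MPOL_NOOP, NULL, 0, MPOL_MF_MOVE | MPOL_MF_LAZY);
}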
Lee
* Re: [PATCH/RFC] AutoPage Migration - V0.1 - 0/8 Overview
From: Avi Kivity @ 2006-03-15 16:05 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Lee Schermerhorn, KAMEZAWA Hiroyuki, linux-mm, pj
Christoph Lameter wrote:
>cpusets uses _MOVE_ALL because Paul wanted it that way. I still think it
>is a bad idea to move shared libraries etc. _MOVE only moves the pages used
>by the currently executing process. If you do a MOVE_ALL then you may
>cause delays in other processes because they have to wait for their pages
>to become available again. Also they may have to generate additional
>faults to restore their PTEs. So you are negatively impacting other
>processes. Note that these wait times can be extensive if _MOVE_ALL is
>f.e. just migrating a critical glibc page that all processes use.
>
>
Doesn't it make sense to duplicate heavily accessed shared read-only pages?
Something like page migration, but keeping the original page intact.
Unfortunately, for threaded applications, it means page table bases
(cr3) can't be shared among threads.
--
error compiling committee.c: too many arguments to function
* Re: [PATCH/RFC] AutoPage Migration - V0.1 - 0/8 Overview
From: Paul Jackson @ 2006-03-15 17:54 UTC (permalink / raw)
To: Avi Kivity; +Cc: clameter, lee.schermerhorn, kamezawa.hiroyu, linux-mm
> Doesn't it make sense to duplicate heavily accessed shared read-only pages?
It might... that would be a major and difficult effort,
and it is not clear that it would be a win. The additional
bookkeeping to figure out what pages were heavily accessed
would be very costly. Probably prohibitive.
That's certainly a very different discussion than migration.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
* Re: [PATCH/RFC] AutoPage Migration - V0.1 - 0/8 Overview
From: Christoph Lameter @ 2006-03-15 18:10 UTC (permalink / raw)
To: Paul Jackson; +Cc: Avi Kivity, lee.schermerhorn, kamezawa.hiroyu, linux-mm
On Wed, 15 Mar 2006, Paul Jackson wrote:
> > Doesn't it make sense to duplicate heavily accessed shared read-only pages?
>
> It might .. that would be a major and difficult effort,
> and it is not clear that it would be a win. The additional
> bookkeeping to figure out what pages were heavily accessed
> would be very costly. Probably prohibitive.
>
> That's certainly a very different discussion than migration.
That is a different discussion but it is not complicated. There are
trivial one- or two-line patches around that make the fault handlers copy
a page if a certain mapcount is reached.
* Re: [PATCH/RFC] AutoPage Migration - V0.1 - 0/8 Overview
From: Paul Jackson @ 2006-03-15 18:14 UTC (permalink / raw)
To: Christoph Lameter; +Cc: avi, lee.schermerhorn, kamezawa.hiroyu, linux-mm
> a page if a certain mapcount is reached.
He said "accessed", not "referenced".
The point was to copy pages that receive many
load and store instructions from far away nodes.
This has little to do with the number of
memory address spaces mapping the region
holding that page.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
* Re: [PATCH/RFC] AutoPage Migration - V0.1 - 0/8 Overview
From: Christoph Lameter @ 2006-03-15 18:20 UTC (permalink / raw)
To: Paul Jackson
Cc: Christoph Lameter, avi, lee.schermerhorn, kamezawa.hiroyu, linux-mm
On Wed, 15 Mar 2006, Paul Jackson wrote:
> The point was to copy pages that receive many
> load and store instructions from far away nodes.
Right. In order to do that we first need to have some memory traces or
statistics that can establish that a page is accessed from far away nodes.
* Re: [PATCH/RFC] AutoPage Migration - V0.1 - 0/8 Overview
From: Avi Kivity @ 2006-03-15 18:57 UTC (permalink / raw)
To: Paul Jackson
Cc: Christoph Lameter, lee.schermerhorn, kamezawa.hiroyu, linux-mm
Paul Jackson wrote:
>>a page if a certain mapcount is reached.
>>
>>
>
>He said "accessed", not "referenced".
>
>The point was to copy pages that receive many
>load and store instructions from far away nodes.
>
>
>
Only loads, please. Writable pages should not be duplicated.
>This has little to do with the number of
>memory address spaces mapping the region
>holding that page.
>
>
>
For starters, you could indicate which files need duplication manually.
You would duplicate your main binaries and associated shared objects.
Presumably large NUMA systems have plenty of memory, so over-duplication
would not be a huge problem.
Is the kernel text duplicated?
--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
* Re: [PATCH/RFC] AutoPage Migration - V0.1 - 0/8 Overview
From: Lee Schermerhorn @ 2006-03-15 19:21 UTC (permalink / raw)
To: Christoph Lameter
Cc: Paul Jackson, avi, kamezawa.hiroyu, linux-mm, Peter Chubb
On Wed, 2006-03-15 at 10:20 -0800, Christoph Lameter wrote:
> On Wed, 15 Mar 2006, Paul Jackson wrote:
>
> > The point was to copy pages that receive many
> > load and store instructions from far away nodes.
>
> Right. In order to do that we first need to have some memory traces or
> statistics that can establish that a page is accessed from far away nodes.
>
The guys down at UNSW have patches for ia64 that can show NUMA
accesses. The patches are based on their long-format VHPT TLB miss
handler. As such, it can only report when a page misses in the TLB,
but that's more than we have now. I believe that they have a "numa
visualization" tool to display the results graphically, as well.
* Re: [PATCH/RFC] AutoPage Migration - V0.1 - 0/8 Overview
From: Lee Schermerhorn @ 2006-03-15 19:27 UTC (permalink / raw)
To: Avi Kivity
Cc: Paul Jackson, Christoph Lameter, kamezawa.hiroyu, linux-mm,
Steve Ofsthun
On Wed, 2006-03-15 at 20:57 +0200, Avi Kivity wrote:
> Paul Jackson wrote:
>
> >>a page if a certain mapcount is reached.
> >>
> >>
> >
> >He said "accessed", not "referenced".
> >
> >The point was to copy pages that receive many
> >load and store instructions from far away nodes.
> >
> >
> >
> Only loads, please. Writable pages should not be duplicated.
>
> >This has little to do with the number of
> >memory address spaces mapping the region
> >holding that page.
> >
> >
> >
>
> For starters, you could indicate which files need duplication manually.
> You would duplicate your main binaries and associated shared objects.
> Presumably large numas have plenty of memory so over-duplication would
> not be a huge problem.
>
> Is the kernel text duplicated?
No. There might have been patches to do this for ia64 at one time; I'm
not sure, though.
However, the folks at Virtual Iron do have patches to replicate shared,
executable segments. They mentioned this at OLS last year. I believe
that Ray Bryant got 'hold of a copy of the patch and had it working at
one time. It didn't address one of the issues he was interested in, which
was to also duplicate the page tables for shared segments [?]. I hope
to experiment with them sometime down the line to see if they provide
measurable benefit.
* Re: [PATCH/RFC] AutoPage Migration - V0.1 - 0/8 Overview
From: Jack Steiner @ 2006-03-15 19:56 UTC (permalink / raw)
To: Lee Schermerhorn
Cc: Avi Kivity, Paul Jackson, Christoph Lameter, kamezawa.hiroyu,
linux-mm, Steve Ofsthun
> > Is the kernel text duplicated?
>
> No. Might have been patches to do this for ia64 at one time. I'm not
> sure, tho'.
>
Yes, there is a patch to duplicate kernel text. I still have a copy
although I'm sure it has gotten very stale.
Kernel text replication was part of the IA64 "trillian" patch at
one time but was dropped because we never saw any significant benefit.
However, systems are larger now & I would not be surprised if
replication helped on very large systems.
I plan to retest kernel replication within the next couple of
months. Stay tuned...
---
Jack