From: Lee Schermerhorn <lee.schermerhorn@hp.com>
To: linux-mm <linux-mm@kvack.org>
Subject: [PATCH/RFC] AutoPage Migration - V0.1 - 0/8 Overview
Date: Fri, 10 Mar 2006 14:33:14 -0500
Message-ID: <1142019195.5204.12.camel@localhost.localdomain>
AutoPage Migration - V0.1 - 0/8 Overview
We have seen some workloads suffer decreases in performance on NUMA
platforms when the Linux scheduler moves the tasks away from their initial
memory footprint. Some users--e.g., HPC--are motivated by this to go to
great lengths to ensure that tasks start up and stay on specific nodes.
2.6.16 includes memory migration mechanisms that will allow these users
to move memory along with their tasks--either manually or under control
of a load scheduling program--in response to changing demands on the
resources.
Other users--e.g., "Enterprise" applications--would prefer that the system
just "do the right thing" in this respect. One possible approach would
be to have the system automatically migrate a task's pages when it decides
to move the task to a different node from where it has executed in the
past. One can debate [and we DO, at length] whether this would improve
the performance or not. But, why not provide a patch and measure the
effects for various policies? I.e., "show me the code."
So, ....
This series of patches hooks up linux 2.6.16 direct page migration to the
task scheduler. The effect is such that, when load balancing moves a task
to a cpu on a different node from where the task last executed, the task
is notified of this change using the same mechanism used to notify a
task of pending signals. When the task returns to user state, it attempts
to migrate to the new node any pages in its vm areas under default policy
that are not already on that node.
This behavior is disabled by default, but can be enabled by writing non-
zero to /sys/kernel/migration/sched_migrate_memory. [Could call this
"auto_migrate_memory"?] Furthermore, to prevent thrashing, a second
control, sched_migrate_interval, has been implemented. The load balancer
will not move a task to a different node if it has moved to a new node
within the last sched_migrate_interval seconds. [The user interface is in
seconds; internally it's in HZ.] The idea is to let the task amortize the
cost of the migration by giving it time to benefit from local references
to the migrated pages.
The controls, enable/disable and interval, will enable performance testing
of this mechanism to help decide whether it is worth inclusion.
The Patches:
Patches 01-05 apply directly to 2.6.16-rc5-git11. However, they should
also apply on top of the previously posted "migrate-on-fault" patches
with some fuzz/offsets. Patch 06 requires that the migrate-on-fault
patches be applied first.
automigrate-01-add-migrate_task_memory.patch
This patch adds the function migrate_task_memory() to mempolicy.c
to migrate vmas with default policy to the new node. A second
helper function, migrate_vma_to_node(), does the actual work of
scanning the vma's address range [check_range] and invoking the
existing [in 2.6.16-rc*] migrate_pages_to() function for a non-
empty pagelist.
Note that this mechanism uses non-aggressive migration--i.e.,
MPOL_MF_MOVE rather than MPOL_MF_MOVE_ALL. Therefore, it gives
up rather easily. E.g., anon pages still shared copy-on-write
between ancestors and descendants will not be migrated.
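For illustration, a minimal sketch of what these two helpers might look
like follows. The function names come from the patch description; the
signatures, the locking, and the exact way check_range() and
migrate_pages_to() are invoked are my assumptions based on 2.6.16-rc
mempolicy.c, not the actual patch code.

/* Sketch: would live in mm/mempolicy.c, next to check_range() and
 * migrate_pages_to().  Signatures are assumed, not from the patch. */
static void migrate_vma_to_node(struct vm_area_struct *vma, int node)
{
        nodemask_t allowed = nodemask_of_node(node);
        LIST_HEAD(pagelist);

        /*
         * Queue pages in this range that are not already on 'node'.
         * MPOL_MF_MOVE (not MPOL_MF_MOVE_ALL) means shared/COW anon
         * pages are skipped -- the "non-aggressive" behavior above.
         */
        check_range(vma->vm_mm, vma->vm_start, vma->vm_end,
                    &allowed, MPOL_MF_MOVE, &pagelist);

        if (!list_empty(&pagelist))
                migrate_pages_to(&pagelist, vma, node);
}

void migrate_task_memory(struct task_struct *tsk, int node)
{
        struct mm_struct *mm = tsk->mm;
        struct vm_area_struct *vma;

        if (!mm)
                return;

        down_read(&mm->mmap_sem);
        for (vma = mm->mmap; vma; vma = vma->vm_next)
                if (!vma->vm_policy)    /* only default-policy vmas */
                        migrate_vma_to_node(vma, node);
        up_read(&mm->mmap_sem);
}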
automigrate-02-add-sched_migrate_memory-sysctl.patch
This patch adds the infrastructure for the /sys/kernel/migration
group as well as the sched_migrate_memory control. Because we
have no separate migration source file, I added this to
mempolicy.c.
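As a rough sketch of the control itself -- written here with the
current kobject helpers (kobject_create_and_add(), kobj_attribute)
rather than the 2.6.16-era sysfs interface, so it shows the intent,
not the actual patch:

/* Global enable flag, exported as
 * /sys/kernel/migration/sched_migrate_memory (sketch only). */
int sched_migrate_memory;       /* 0 = disabled (default) */

static ssize_t sched_migrate_memory_show(struct kobject *kobj,
                struct kobj_attribute *attr, char *buf)
{
        return sprintf(buf, "%d\n", sched_migrate_memory);
}

static ssize_t sched_migrate_memory_store(struct kobject *kobj,
                struct kobj_attribute *attr, const char *buf, size_t count)
{
        sched_migrate_memory = !!simple_strtoul(buf, NULL, 10);
        return count;
}

static struct kobj_attribute sched_migrate_memory_attr =
        __ATTR(sched_migrate_memory, 0644,
               sched_migrate_memory_show, sched_migrate_memory_store);

static int __init migration_sysfs_init(void)
{
        struct kobject *kobj;

        kobj = kobject_create_and_add("migration", kernel_kobj);
        if (!kobj)
                return -ENOMEM;
        return sysfs_create_file(kobj, &sched_migrate_memory_attr.attr);
}
late_initcall(migration_sysfs_init);

Writing a non-zero value to the file simply flips the global flag that
the scheduler hook tests.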
automigrate-03.0-check-notify-migrate-pending.patch
This patch adds a minimal <linux/auto-migrate.h> header to interface
the scheduler to the auto-migration code. The header includes a static
inline function for the scheduler to check for internode migration
and notify the task [by setting the TIF_NOTIFY_RESUME thread info
flag] if the task is migrating to a new node and sched_migrate_memory
is enabled. The header also includes the function
check_migrate_pending(), which the task calls when returning to user
state after noticing TIF_NOTIFY_RESUME set. Both of these functions
become null macros when MIGRATION is not configured.
However, note that in 2.6.16-rc*, one cannot deselect MIGRATION when
building with NUMA configured.
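A minimal sketch of what the header might contain follows. The two
function names are from the description above; the signatures and the
per-task field used to remember the target node are assumptions made
for illustration:

/* Sketch of <linux/auto-migrate.h>.  The pending_migration_node field
 * and the signatures are assumed, not taken from the patch. */
#ifdef CONFIG_MIGRATION

extern int sched_migrate_memory;
extern void migrate_task_memory(struct task_struct *tsk, int node);

/* Called by the scheduler just before it places 'p' on 'new_cpu'. */
static inline void check_internode_migration(struct task_struct *p,
                                             int new_cpu)
{
        int new_node = cpu_to_node(new_cpu);

        if (!sched_migrate_memory || new_node == cpu_to_node(task_cpu(p)))
                return;

        p->pending_migration_node = new_node;   /* assumed task_struct field */
        set_tsk_thread_flag(p, TIF_NOTIFY_RESUME);
}

/* Called in the task's own context on the way back to user state. */
static inline void check_migrate_pending(void)
{
        int node = current->pending_migration_node;

        if (node != -1) {
                current->pending_migration_node = -1;
                migrate_task_memory(current, node);
        }
}

#else   /* !CONFIG_MIGRATION */
#define check_internode_migration(p, cpu)       do { } while (0)
#define check_migrate_pending()                 do { } while (0)
#endif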
automigrate-03.1-ia64-check-notify-migrate-pending.patch
This patch adds the call to check_migrate_pending() to the
ia64-specific do_notify_resume_user() function. Note that this
is the same mechanism used to deliver signals and perfmon events
to a task.
automigrate-03.2-x86_64-check-notify-migrate-pending.patch
This patch adds the call to check_migrate_pending() to the
x86_64-specific do_notify_resume() function. This is just an example
for an arch other than ia64. I haven't tested this yet.
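In both cases the hook amounts to a fragment like the one below inside
the arch's notify-resume handler; the handler's exact signature and the
flag-test/clear convention differ per arch and kernel version, so treat
this strictly as a placement sketch:

/* Inside do_notify_resume_user() [ia64] or do_notify_resume() [x86_64],
 * alongside the existing signal/perfmon handling: */
        if (test_thread_flag(TIF_NOTIFY_RESUME)) {
                clear_thread_flag(TIF_NOTIFY_RESUME);
                check_migrate_pending();
        }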
automigrate-04-hook-sched-internode-migration.patch
This patch hooks the calls to check_internode_migration() into
the scheduler [kernel/sched.c] in places where the scheduler
sets a new cpu for the task--i.e., just before calls to
set_task_cpu(). Because these are in migration paths, which are
already relatively "heavy-weight", they don't add overhead to the
scheduler fast paths. And they become empty or constant
macros when MIGRATION is not configured in.
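The placement is essentially the fragment below; the surrounding code
is paraphrased from the 2.6.16 scheduler migration paths rather than
copied from the patch:

/* kernel/sched.c: wherever a load-balancing/migration path is about
 * to place task 'p' on 'dest_cpu' (placement sketch only): */
        check_internode_migration(p, dest_cpu);
        set_task_cpu(p, dest_cpu);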
automigrate-05-add-internode-migration-hysteresis.patch
This patch adds the sched_migrate_interval control to the
/sys/kernel/migration group, and adds a function to the auto-migrate.h
header--too_soon_for_internode_migration()--to check whether it's too
soon for another internode migration. This function becomes a macro
that evaluates to "false" [0], when MIGRATION is not configured.
This check is added to try_to_wake_up() and can_migrate_task() to
veto internode migrations if the last one was less than
sched_migrate_interval seconds [internally, jiffies] ago.
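A sketch of the check, assuming the task records a jiffies timestamp
of its last internode migration (the field name below is invented for
illustration) and that sched_migrate_interval is kept internally in
jiffies (the sysfs store handler would multiply the user's seconds by
HZ):

#ifdef CONFIG_MIGRATION
extern unsigned long sched_migrate_interval;    /* jiffies internally */

static inline int too_soon_for_internode_migration(struct task_struct *p)
{
        /* true if less than the interval has passed since the task's
         * last internode migration */
        return sched_migrate_interval &&
               time_before(jiffies,
                           p->last_internode_migration +  /* assumed field */
                           sched_migrate_interval);
}
#else
#define too_soon_for_internode_migration(p)     0
#endif

can_migrate_task() and try_to_wake_up() would then refuse an internode
placement while this returns true.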
BONUS PATCH:
automigrate-06-hook-to-migrate-on-fault.patch
This patch, which requires the migrate-on-fault capability,
hooks automigration up to migrate-on-fault, with an additional
control--/sys/kernel/migration/sched_migrate_lazy--to enable
it.
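Since the migrate-on-fault series is not included here, the fragment
below is purely illustrative of how the lazy path might be selected;
the helper name is hypothetical:

/* Hypothetical branch in migrate_task_memory(): with lazy migration
 * enabled, unmap the eligible pages instead of moving them directly,
 * so migrate-on-fault pulls each page to the new node on next touch. */
        if (sched_migrate_lazy)                 /* control added by patch 06 */
                lazy_migrate_vma(vma, node);    /* hypothetical helper */
        else
                migrate_vma_to_node(vma, node); /* direct migration, patch 01 */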
TESTING:
I have tested this patch on a 16-cpu/4-node HP rx8620 [ia64] platform with
everyone's favorite benchmark.
Kernel builds [after make mrproper+make defconfig]
on 2.6.16-rc5-git11 on 16-cpu/4 node/32GB HP rx8620 [ia64].
Times taken after a warm-up run.
Entire kernel source likely held in page cache.
This amplifies the effect of the patches because I
can't hide behind disk IO time.
No auto-migrate patches:
88.20s real 1042.56s user 97.26s system
88.92s real 1042.27s user 98.08s system
88.40s real 1043.58s user 96.51s system
91.45s real 1042.46s user 97.07s system
93.29s real 1040.90s user 96.88s system
90.15s real 1042.06s user 97.02s system
90.45s real 1042.75s user 96.98s system
90.77s real 1041.87s user 98.61s system
90.21s real 1042.00s user 96.91s system
88.50s real 1042.23s user 97.30s system
-------------------------------------------
90.03s real 1042.26s user 97.26s system - mean
1.59 0.68 0.62 - std dev'n
With auto-migration patches, sched_migrate_memory disabled:
88.98s real 1042.28s user 96.88s system
88.75s real 1042.71s user 97.51s system
89.42s real 1042.32s user 97.42s system
87.83s real 1042.92s user 96.06s system
92.47s real 1041.12s user 95.96s system
89.14s real 1043.77s user 97.10s system
88.11s real 1044.04s user 95.16s system
91.74s real 1042.21s user 96.43s system
89.36s real 1042.31s user 96.56s system
88.55s real 1042.50s user 96.25s system
-------------------------------------------
89.43s real 1042.61s user 96.53s system - mean
1.51 0.83 0.72 - std dev'n
With auto-migration patches, sched_migrate_memory enabled:
90.62s real 1041.64s user 106.80s system
89.94s real 1042.82s user 105.00s system
91.34s real 1041.89s user 107.74s system
90.12s real 1041.77s user 108.01s system
90.93s real 1042.00s user 106.50s system
93.97s real 1040.12s user 106.16s system
90.65s real 1041.87s user 106.81s system
90.53s real 1041.46s user 106.74s system
91.84s real 1041.59s user 105.57s system
90.28s real 1041.69s user 106.64s system
-------------------------------------------
91.02s real 1041.68s user 106.60s system - mean
1.18 0.67 0.90 - std dev'n
Not stellar! Insignificant decrease in user time, but
~1% increase in run time [from the unpatched case] and
~10% increase in system time. In short, page migration,
and/or the scanning of vm areas for eligible pages, is
expensive and, for this job, the programs don't see
enough benefit from the resulting locality to pay for the
cost of migration. Compilers just don't run long enough!
On one instrumented sample auto-direct run:
migrate_task_memory called 3628 times = #internode migrations
migrate_vma_to_node called 17137 times = 7.68 vma/task
migrate_page called 3628 times = 1.62 pages/task
Very few "eligible" pages found in eligible vmas! Perhaps
we're not being aggressive enough in attempts to migrate.
------------
Now, with the last patch, hooking automigration to
migrate-on-fault:
With auto-migrate + migrate-on-fault patches;
sched_migrate_memory disabled:
88.02s real 1042.77s user 95.62s system
91.56s real 1041.05s user 97.50s system
90.41s real 1040.88s user 98.07s system
90.41s real 1041.64s user 97.00s system
89.82s real 1042.45s user 96.35s system
88.28s real 1042.25s user 96.91s system
91.51s real 1042.74s user 95.90s system
93.34s real 1041.72s user 96.07s system
89.09s real 1041.00s user 97.35s system
89.44s real 1041.57s user 96.55s system
-------------------------------------------
90.19s real 1041.81s user 96.73s system - mean
1.63 0.71 0.78 - std dev'n
With auto-migrate + migrate-on-fault patches;
sched_migrate_memory and sched_migrate_lazy enabled:
91.72s real 1039.17s user 108.92s system
91.02s real 1041.62s user 107.38s system
91.21s real 1041.84s user 106.63s system
93.24s real 1039.50s user 107.54s system
92.64s real 1040.79s user 107.10s system
92.52s real 1040.79s user 107.14s system
91.85s real 1039.90s user 108.26s system
90.58s real 1043.34s user 106.06s system
92.30s real 1040.88s user 106.64s system
94.25s real 1039.96s user 106.85s system
-------------------------------------------
92.13s real 1040.78s user 107.25s system - mean
1.10 1.25 0.84 - std dev'n
Also, no win for kernel builds. Again, slightly less
user time, but even more system and real time [~1sec each]
than the auto+direct run.
On one instrumented sample auto-lazy run:
migrate_task_memory called 3777 times = #internode migrations
migrate_vma_to_node called 28586 times = 7.56 vma/task
migrate_page called 3886 times = 1.02 pages/task
Similar pattern, but a lot more "eligible" vmas; fewer
eligible pages. More internode migrations.
TODO:
Next week, I'll try some longer running workloads that we know
have suffered from the scheduler moving them away from their
memory--e.g., McCalpin's STREAM. Will report results when
available.
Maybe also test with more aggressive migration: MPOL_MF_MOVE_ALL.
I'll also move this to the -mm tree, once I port my trace
instrumentation from relayfs to sysfs.