From: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
To: linux-mm <linux-mm@kvack.org>
Subject: [PATCH 2.6.17-rc1-mm1 0/9] AutoPage Migration - V0.2 - Overview
Date: Fri, 07 Apr 2006 16:32:26 -0400
Message-ID: <1144441946.5198.52.camel@localhost.localdomain>
This is a repost of the auto-migration series against 2.6.17-rc1-mm1.
I will post the rest of the series as responses to this message.
Lee
--------------------------------------------------------------------
AutoPage Migration - V0.2 - 0/9 Overview
V0.2 reworks the patches on 2.6.17-rc1-mm1, including Christoph's
migration code reorg, moving much of the migration mechanism to
mm/migrate.c. Also, some of the individual patches address comments
from Christoph and others on the V0.1 series.
----------------
We have seen some workloads suffer decreases in performance on NUMA
platforms when the Linux scheduler moves the tasks away from their initial
memory footprint. Some users--e.g., HPC--are motivated by this to go to
great lengths to ensure that tasks start up and stay on specific nodes.
2.6.16+ includes memory migration mechanisms that will allow these users
to move memory along with their tasks--either manually or under control
of a load scheduling program--in response to changing demands on the
resources.
Other users--e.g., "Enterprise" applications--would prefer that the system
just "do the right thing" in this respect. One possible approach would
be to have the system automatically migrate a task's pages when it decides
to move the task to a different node from where it has executed in the
past. In order to determine whether this approach would provide any
benefit, we need working code to measure.
So, ....
This series of patches hooks up Linux 2.6.16+ direct page migration to the
task scheduler. The effect is that, when load balancing moves a task
to a cpu on a different node from where the task last executed, the task
is notified of this change using the same mechanism used to notify a task of
pending signals. When the task returns to user state, it attempts to
migrate to the new node any pages, in those of the task's vm areas under
control of default policy, that are not already on that node.
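
As a rough illustration of the flow just described, here is a minimal
user-space sketch in C. It is not the patch code: the struct layouts, the
CPUS_PER_NODE topology, the pending_node field and both function signatures
are assumptions made for the example. Only the names
check_internode_migration(), check_migrate_pending() and TIF_NOTIFY_RESUME
come from the series.

```c
#include <stdbool.h>
#include <stddef.h>

#define CPUS_PER_NODE 4   /* hypothetical topology: 4 cpus per node */
#define NO_NODE (-1)

/* Models the auto_migrate_enable control; disabled by default. */
static bool auto_migrate_enable;

struct page {
    int node;             /* node the page currently lives on */
    bool default_policy;  /* true if its vm area uses default policy */
};

struct task {
    int cpu;
    bool notify_resume;   /* stand-in for TIF_NOTIFY_RESUME */
    int pending_node;     /* hypothetical: node to migrate pages to */
    struct page *pages;
    size_t nr_pages;
};

/* User-space stand-in for the kernel's cpu-to-node mapping. */
static int cpu_to_node(int cpu)
{
    return cpu / CPUS_PER_NODE;
}

/* Scheduler side: called just before the task's cpu is changed. */
static void check_internode_migration(struct task *t, int dest_cpu)
{
    int dest = cpu_to_node(dest_cpu);

    if (auto_migrate_enable && dest != cpu_to_node(t->cpu)) {
        t->pending_node = dest;
        t->notify_resume = true;
    }
}

/* Task side: called on return to user state. Returns pages moved. */
static size_t check_migrate_pending(struct task *t)
{
    size_t moved = 0;

    if (!t->notify_resume)
        return 0;
    t->notify_resume = false;

    for (size_t i = 0; i < t->nr_pages; i++) {
        struct page *p = &t->pages[i];

        /* Only default-policy pages not already on the new node move. */
        if (p->default_policy && p->node != t->pending_node) {
            p->node = t->pending_node;
            moved++;
        }
    }
    t->pending_node = NO_NODE;
    return moved;
}
```

The point of the split is that the scheduler only sets a flag; the
potentially expensive page walk happens later, in the context of the task
that owns the pages.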
This behavior is disabled by default, but can be enabled by writing a
non-zero value to /sys/kernel/migration/auto_migrate_enable. Furthermore, to
prevent thrashing, a second sysctl, auto_migrate_interval, has been implemented.
The load balancer will not move a task to a different node if it has moved
to a new node in the last auto_migrate_interval seconds. [User interface
is in seconds; internally it's in HZ.] The idea is to give the task time
to amortize the cost of the migration by giving it time to benefit from
local references to the migrated pages.
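
The interval check can be sketched as below. This is a user-space
approximation under assumptions: the HZ value, the setter, and the
two-argument signature of too_soon_for_internode_migration() are invented
for the example; only the function name and the seconds-to-ticks conversion
come from the description above.

```c
#include <stdbool.h>

#define HZ 250UL  /* hypothetical tick rate, for illustration only */

/* Stored internally in ticks; no particular default is assumed here. */
static unsigned long auto_migrate_interval;

/* The user interface is in seconds, so convert on the way in. */
static void set_auto_migrate_interval(unsigned long seconds)
{
    auto_migrate_interval = seconds * HZ;
}

/*
 * True if the task's last internode migration is too recent.
 * Unsigned subtraction keeps the comparison correct when the
 * tick counter wraps around.
 */
static bool too_soon_for_internode_migration(unsigned long now,
                                             unsigned long last_migrate)
{
    return (now - last_migrate) < auto_migrate_interval;
}
```

A load balancer using this would skip the task as a candidate for a
cross-node move while the function returns true.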
The controls, enable/disable and interval, will enable performance testing
of this mechanism to help decide whether it is worth inclusion. Note: providing
these controls does not presuppose that these will be twiddled by human
administrators/users. They may be useful to user space workload management
daemons or such...
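
For a user space daemon, driving these controls amounts to writing a number
to a file. A minimal sketch follows; the helper name write_control() is
hypothetical, and the paths in the usage note are the ones named in this
posting.

```c
#include <stdio.h>

/*
 * Write an unsigned value, followed by a newline, to a control file.
 * Returns 0 on success, -1 if the file cannot be opened or written.
 */
static int write_control(const char *path, unsigned long value)
{
    FILE *f = fopen(path, "w");
    int ok;

    if (!f)
        return -1;
    ok = fprintf(f, "%lu\n", value) > 0;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

A daemon might call write_control("/sys/kernel/migration/auto_migrate_enable", 1)
and then write_control("/sys/kernel/migration/auto_migrate_interval", 30),
where 30 seconds is an arbitrary example value, not a recommendation.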
The Patches:
Patches 01-06 apply to 2.6.17-rc1-mm1 with or without the previously
posted "migrate-on-fault" patches. Most of my recent testing has
been done with this series layered on the "migrate-on-fault" patches.
So, some fixup may be necessary to apply the series directly to
2.6.17-rc1-mm1 or beyond.
Patch 07 requires that the migrate-on-fault patches be applied first,
including the mbind/MPOL_MF_LAZY patch.
automigrate-01-prepare-mempolicy-for-automigrate.patch
This patch adds the function auto_migrate_task_memory() to
mempolicy.c. In V0.2, this function sets up a call to
migrate_to_node() with the appropriate [mempolicy internal]
flags for auto-migration. This addresses Christoph's comment
about code duplication.
This patch also modifies the vma_migratable() function, called
from check_range(), to reject VMAs that don't have default
policy when auto-migrating.
Note that this mechanism uses non-aggressive migration--i.e.,
MPOL_MF_MOVE rather than MPOL_MF_MOVE_ALL. Therefore, it gives
up rather easily. E.g., anon pages still shared, copy-on-write,
between ancestors and descendants will not be migrated.
automigrate-02-add-auto_migrate_enable-sysctl.patch
This patch adds the infrastructure for the /sys/kernel/migration
group as well as the auto_migrate_enable control.
V0.2 of this series adds the control infrastructure to the new
mm/migrate.c source file.
TODO: extract the basic control infrastructure for use by the
migrate-on-fault series...
automigrate-03.0-check-notify-migrate-pending.patch
This patch adds a static inline function to
include/linux/auto-migrate.h for the scheduler to check for
internode migration and notify the task [by setting the
TIF_NOTIFY_RESUME thread info flag], if the task is migrating
to a new node and auto-migration is enabled.
The header also includes the function check_migrate_pending()
that the task will call when returning to user state when it notices
TIF_NOTIFY_RESUME set. Both of these functions become null macros
when MIGRATION is not configured.
automigrate-03.1-ia64-check-notify-migrate-pending.patch
This patch adds the call to check_migrate_pending() to the
ia64 specific do_notify_resume_user() function. Note that this
is the same mechanism used to deliver signals and perfmon events
to a task. I have tested this patch on a 4-node, 16-cpu ia64
platform.
automigrate-03.2-x86_64-check-notify-migrate-pending.patch
This patch adds the call to check_migrate_pending() to the x86_64
specific do_notify_resume() function. This is just an example
for an arch other than ia64. I have tested automigrate on a
4-socket/dual-core Opteron platform.
V0.2: fixed auto-migrate.h header include
automigrate-04-hook-sched-internode-migration.patch
This patch hooks the calls to check_internode_migration() into
the scheduler [kernel/sched.c] in places where the scheduler
sets a new cpu for the task--i.e., just before calls to
set_task_cpu(). Because these are in migration paths that are
already relatively "heavy-weight", they don't add overhead to
scheduler fast paths. And, they become empty or constant
macros when MIGRATION is not configured in.
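
The "empty or constant macro" pattern looks roughly like this. The sketch
is a guess at the shape, not the header from the patch: the argument lists,
the notifications counter and demo_can_migrate() are hypothetical, and the
configured branch is only a placeholder.

```c
#include <stdbool.h>

/* Define this to model a kernel built with migration support. */
/* #define CONFIG_MIGRATION 1 */

#ifdef CONFIG_MIGRATION

static int notifications;  /* hypothetical: counts notified tasks */

static inline void check_internode_migration(int old_node, int new_node)
{
    if (old_node != new_node)
        notifications++;
}

static inline bool too_soon_for_internode_migration(unsigned long last)
{
    (void)last;
    return false;  /* placeholder; the real check compares against the interval */
}

#else

/* Not configured: the hooks compile away to nothing or a constant 0. */
#define check_internode_migration(old_node, new_node) do { } while (0)
#define too_soon_for_internode_migration(last) ((void)(last), 0)

#endif

/* Hypothetical caller showing that call sites need no #ifdef of their own. */
static int demo_can_migrate(int task_node, int dest_node, unsigned long last)
{
    if (task_node != dest_node && too_soon_for_internode_migration(last))
        return 0;
    check_internode_migration(task_node, dest_node);
    return 1;
}
```

With the config symbol undefined, demo_can_migrate() reduces to an
unconditional "return 1", which is why the hooks cost nothing in that build.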
V0.2: don't check/notify task of internode migration in
migrate_task() when migrating in exec() path. Pointed out
by Kamezawa Hiroyuki.
automigrate-05-add-internode-migration-hysteresis.patch
This patch adds the auto_migrate_interval control to the
/sys/kernel/migration group, and adds a function to the
auto-migrate.h header--too_soon_for_internode_migration()--to
check whether it's too soon for another internode migration.
This function becomes a macro that evaluates to "false" [0],
when MIGRATION is not configured.
This check is added to try_to_wake_up() and can_migrate_task() to
override internode migrations if the last one was less than
auto_migrate_interval seconds [HZ] ago.
automigrate-06-max-mapcount-control.patch
This patch adds an additional control: migrate_max_mapcount.
mempolicy.c:migrate_page_add() has been modified to allow
pages with a mapcount <= this value to be migrated. The
default of 1 results in the same behavior as without this
patch. Use of this patch will allow experimentation and
measurement of the effect of different mapcount thresholds
on workload performance.
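
The threshold test itself is a single comparison. The sketch below models
it in user space; may_migrate_page() and count_migratable() are
hypothetical helpers written for this example, not the code in
migrate_page_add().

```c
#include <stdbool.h>
#include <stddef.h>

/* Default of 1: only pages mapped at most once are eligible. */
static int migrate_max_mapcount = 1;

/* A page qualifies when its mapcount is at or below the threshold. */
static bool may_migrate_page(int mapcount)
{
    return mapcount <= migrate_max_mapcount;
}

/* Count how many pages in a set would pass the threshold. */
static size_t count_migratable(const int *mapcounts, size_t n)
{
    size_t eligible = 0;

    for (size_t i = 0; i < n; i++)
        if (may_migrate_page(mapcounts[i]))
            eligible++;
    return eligible;
}
```

Raising migrate_max_mapcount admits pages shared by more mappings, which is
the knob the patch exposes for experimentation.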
automigrate-07-hook-to-migrate-on-fault.patch
This patch, which requires the migrate-on-fault capability,
hooks automigration up to migrate-on-fault, with an additional
control--/sys/kernel/migration/auto_migrate_lazy--to enable
it.
TESTING:
I have tested these patches on a 16-cpu/4-node/32GB HP rx8620 [ia64] platform
and a 4 socket/dual-core/8GB HP ProLiant dl585 Opteron platform with
everyone's favorite benchmark [kernel builds]. The patches seem stable.
Performance results for Opteron reported below.
I have also tested on ia64 with the McAlpin Streams benchmark. These
results were reported previously:
http://marc.theaimsgroup.com/?l=linux-mm&m=114237540231833&w=4
Kernel builds [after make mrproper+make defconfig]
on 2.6.16-mm2 on dl585. Times are avg of 10 runs.
Entire kernel source likely held in page cache.
No auto-migrate patches:
40.69 real 226.40 user 41.77 system
With auto-migration patches, auto_migrate disabled:
40.52 real 227.21 user 42.19 system
With auto-migration patches, auto_migrate enabled,
direct [!lazy]:
40.90 real 227.10 user 42.45 system
With patch; auto-migration + lazy enabled:
41.43 real 228.74 user 43.97 system
As mentioned in the previous posting of this series, the compiles
don't run long enough to amortize the cost of migrating the
pages. But see the McAlpin Streams results linked above.
Also, the defconfig runs on x86_64 don't run all that long,
anyway. So, I tried allmodconfig builds. The results are,
uh, interesting. These are representative results from half
a dozen runs each.
no auto-migration patches:
290 real 1740 user 344 system
one run @ 316 real: +26sec from typical
with patches; auto-migration disabled:
287 real 1738 user 346 system
basically the same as w/o patches.
real and user slightly lower, system slightly higher.
with patches; auto-migration+lazy enabled:
310 real 1800 user 386 system
user and system times fairly consistent.
did see 2 runs with real time +27sec from the typical runs,
as I did with no patches. System is running multiuser, so
some daemon may jump in occasionally.
In these runs, the cost of migrating pages really starts to
impact the runtime. Note that, on an Opteron, every
inter-[phys]cpu task migration is an inter-node migration.
I see LOTS more internode migrations and resulting triggering
of page migrations in a kernel build on the Opteron platform
than on the 16-cpu, 4-node ia64 platform--not that this is at
all surprising. E.g., from instrumented runs:
                                   ia64    Opteron
inter-node task migrations         2109       4058
pages unmapped for migration       9898     163627
anon migration faults              3208      62518
attempt migrate misplaced page     3007      44973
actually migrate misplaced pg      3007      44968
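
To put the counters on a per-migration footing, the short C sketch below
divides page counts by the number of inter-node task migrations. The struct
and function names are made up for the example; the numbers are copied from
the table above.

```c
#include <stdio.h>

struct counters {
    const char *platform;
    unsigned long task_migrations;  /* inter-node task migrations */
    unsigned long pages_unmapped;   /* pages unmapped for migration */
    unsigned long pages_migrated;   /* misplaced pages actually migrated */
};

static const struct counters runs[] = {
    { "ia64",    2109,   9898,  3007 },
    { "Opteron", 4058, 163627, 44968 },
};

static double unmapped_per_migration(const struct counters *c)
{
    return (double)c->pages_unmapped / (double)c->task_migrations;
}

static double migrated_per_migration(const struct counters *c)
{
    return (double)c->pages_migrated / (double)c->task_migrations;
}

/* Print both ratios for every platform in the table. */
static void print_ratios(void)
{
    for (size_t i = 0; i < sizeof(runs) / sizeof(runs[0]); i++)
        printf("%-8s unmapped/migration %.1f, migrated/migration %.1f\n",
               runs[i].platform,
               unmapped_per_migration(&runs[i]),
               migrated_per_migration(&runs[i]));
}
```

On these figures each Opteron task migration unmaps roughly 40 pages versus
under 5 on ia64, so the Opteron runs pay both for more migrations and for
more page work per migration.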