Subject: [PATCH 2.6.17-rc1-mm1 0/9] AutoPage Migration - V0.2 - Overview
From: Lee Schermerhorn
Date: Fri, 07 Apr 2006 16:32:26 -0400
To: linux-mm

This is a repost of the auto-migration series against 2.6.17-rc1-mm1.
I will post the rest of the series as responses to this message.

Lee

--------------------------------------------------------------------

AutoPage Migration - V0.2 - 0/9 Overview

V0.2 reworks the patches on 2.6.17-rc1-mm1, including Christoph's
migration code reorg, which moved much of the migration mechanism to
mm/migrate.c.  Also, some of the individual patches address comments
from Christoph and others on the V0.1 series.

----------------

We have seen some workloads suffer decreases in performance on NUMA
platforms when the Linux scheduler moves tasks away from their initial
memory footprint.  Some users--e.g., HPC--are motivated by this to go
to great lengths to ensure that tasks start up and stay on specific
nodes.  2.6.16+ includes memory migration mechanisms that allow these
users to move memory along with their tasks--either manually or under
control of a load scheduling program--in response to changing demands
on the resources.  Other users--e.g., "Enterprise" applications--would
prefer that the system just "do the right thing" in this respect.

One possible approach would be to have the system automatically migrate
a task's pages when it decides to move the task to a different node
from where it has executed in the past.  In order to determine whether
this approach would provide any benefit, we need working code to
measure.  So, ....

This series of patches hooks up linux 2.6.16+ direct page migration to
the task scheduler.  The effect is that, when load balancing moves a
task to a cpu on a different node from where the task last executed,
the task is notified of this change using the same mechanism used to
notify a task of pending signals.  When the task returns to user state,
it attempts to migrate, to the new node, any pages not already on that
node in those of the task's vm areas under control of default policy.

This behavior is disabled by default, but can be enabled by writing
non-zero to /sys/kernel/migration/auto_migrate_enable.  Furthermore, to
prevent thrashing, a second control, auto_migrate_interval, has been
implemented.  The load balancer will not move a task to a different
node if it has moved to a new node in the last auto_migrate_interval
seconds.  [User interface is in seconds; internally it's in HZ.]  The
idea is to give the task time to amortize the cost of the migration by
giving it time to benefit from local references to the migrated pages.
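To make the seconds-vs-HZ point concrete, here is a minimal sketch of
what such a hysteresis check could look like.  The series adds this as
too_soon_for_internode_migration() (see patch 05 below); the timestamp
field, the default value, and the body shown here are illustrative
guesses, not the actual patch code:

	/* Illustrative sketch only -- not the actual patch code. */
	static unsigned int auto_migrate_enable;	 /* 0 == disabled (default) */
	static unsigned int auto_migrate_interval = 10;	 /* seconds, as seen from sysfs;
							    default value here is made up */

	static inline int too_soon_for_internode_migration(struct task_struct *p)
	{
		/* 'last_node_migration' is a hypothetical per-task jiffies timestamp */
		return time_before(jiffies,
				   p->last_node_migration +
				   auto_migrate_interval * HZ);
	}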
The controls, enable/disable and interval, will enable performance
testing of this mechanism to help decide whether it is worth inclusion.
Note: providing these controls does not presuppose that they will be
twiddled by human administrators/users.  They may be useful to user
space workload management daemons or such...

The Patches:

Patches 01-06 apply to 2.6.17-rc1-mm1 with or without the previously
posted "migrate-on-fault" patches.  Most of my recent testing has been
done with this series layered on the "migrate-on-fault" patches, so
some fixup may be necessary to apply the series directly to
2.6.17-rc1-mm1 or beyond.  Patch 07 requires that the migrate-on-fault
patches be applied first, including the mbind/MPOL_MF_LAZY patch.

automigrate-01-prepare-mempolicy-for-automigrate.patch

This patch adds the function auto_migrate_task_memory() to mempolicy.c.
In V0.2, this function sets up a call to migrate_to_node() with the
appropriate [mempolicy internal] flags for auto-migration.  This
addresses Christoph's comment about code duplication.  This patch also
modifies the vma_migratable() function, called from check_range(), to
reject VMAs that don't have default policy when auto-migrating.

Note that this mechanism uses non-aggressive migration--i.e.,
MPOL_MF_MOVE rather than MPOL_MF_MOVE_ALL.  Therefore, it gives up
rather easily.  E.g., anon pages still shared, copy-on-write, between
ancestors and descendants will not be migrated.

automigrate-02-add-auto_migrate_enable-sysctl.patch

This patch adds the infrastructure for the /sys/kernel/migration group
as well as the auto_migrate_enable control.  V0.2 of this series adds
the control infrastructure to the new mm/migrate.c source file.

TODO: extract the basic control infrastructure for use by the
migrate-on-fault series...

automigrate-03.0-check-notify-migrate-pending.patch

This patch adds a static inline function to include/linux/auto-migrate.h
for the scheduler to check for internode migration and notify the task
[by setting the TIF_NOTIFY_RESUME thread info flag] if the task is
migrating to a new node and auto-migration is enabled.  The header also
includes the function check_migrate_pending() that the task will call
when returning to user state when it notices TIF_NOTIFY_RESUME set.
Both of these functions become null macros when MIGRATION is not
configured.

automigrate-03.1-ia64-check-notify-migrate-pending.patch

This patch adds the call to check_migrate_pending() to the
ia64-specific do_notify_resume_user() function.  Note that this is the
same mechanism used to deliver signals and perfmon events to a task.
I have tested this patch on a 4-node, 16-cpu ia64 platform.

automigrate-03.2-x86_64-check-notify-migrate-pending.patch

This patch adds the call to check_migrate_pending() to the
x86_64-specific do_notify_resume() function.  This is just an example
for an arch other than ia64.  I have tested automigrate on a
4-socket/dual-core Opteron platform.

V0.2: fixed auto-migrate.h header include

automigrate-04-hook-sched-internode-migration.patch

This patch hooks the calls to check_internode_migration() into the
scheduler [kernel/sched.c] in places where the scheduler sets a new cpu
for the task--i.e., just before calls to set_task_cpu().  Because these
are in migration paths that are already relatively "heavy-weight", they
don't add overhead to scheduler fast paths.  And they become empty or
constant macros when MIGRATION is not configured in.

V0.2: don't check/notify the task of internode migration in
migrate_task() when migrating in the exec() path.  Pointed out by
Kamezawa Hiroyuki.
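To show how the pieces in patches 03.x and 04 fit together, here is a
rough sketch of the check/notify path.  The function names and the
TIF_NOTIFY_RESUME flag come from the descriptions above, but the
signatures and bodies are my own illustration of the described
behavior--not the actual patch code:

	/* Illustrative sketch of the check/notify path -- not the actual patch code. */

	/*
	 * Scheduler side: called just before set_task_cpu() in the
	 * task-migration paths.
	 */
	static inline void check_internode_migration(struct task_struct *p, int new_cpu)
	{
		if (!auto_migrate_enable)	/* the sysfs enable control */
			return;
		if (cpu_to_node(task_cpu(p)) != cpu_to_node(new_cpu))
			set_tsk_thread_flag(p, TIF_NOTIFY_RESUME);
	}

	/*
	 * Task side: called from the arch do_notify_resume*() path when the
	 * task returns to user state and notices TIF_NOTIFY_RESUME set.
	 * How the "migration pending" condition is recorded and cleared is
	 * not shown here; the argument to auto_migrate_task_memory() is a
	 * guess at its signature.
	 */
	static inline void check_migrate_pending(void)
	{
		if (auto_migrate_enable)
			auto_migrate_task_memory(current);  /* pull default-policy pages to the new node */
	}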
automigrate-05-add-internode-migration-hysteresis.patch

This patch adds the auto_migrate_interval control to the
/sys/kernel/migration group, and adds a function to the auto-migrate.h
header--too_soon_for_internode_migration()--to check whether it's too
soon for another internode migration.  This function becomes a macro
that evaluates to "false" [0] when MIGRATION is not configured.  The
check is added to try_to_wake_up() and can_migrate_task() to suppress
internode migrations if the last one was less than
auto_migrate_interval seconds [HZ] ago.

automigrate-06-max-mapcount-control.patch

This patch adds an additional control: migrate_max_mapcount.
mempolicy.c:migrate_page_add() has been modified to allow pages with a
mapcount <= this value to be migrated; a sketch of this check follows
the patch list.  The default of 1 results in the same behavior as
without this patch.  Use of this patch will allow experimentation and
measurement of the effect of different mapcount thresholds on workload
performance.

automigrate-07-hook-to-migrate-on-fault.patch

This patch, which requires the migrate-on-fault capability, hooks
automigration up to migrate-on-fault, with an additional
control--/sys/kernel/migration/auto_migrate_lazy--to enable it.
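For reference, the migrate_max_mapcount gate described for patch 06
above would amount to something like the following change in
migrate_page_add().  This is a paraphrase of the described behavior,
not the patch itself; the surrounding function is shown only
approximately and the isolation step is elided:

	/* Approximation of the idea only -- not the actual patch. */
	static unsigned int migrate_max_mapcount = 1;	/* default keeps existing behavior */

	static void migrate_page_add(struct page *page, struct list_head *pagelist,
					unsigned long flags)
	{
		/*
		 * Non-aggressive migration only moves pages mapped "few
		 * enough" times; with migrate_max_mapcount == 1 this is the
		 * existing "skip shared pages" behavior.
		 */
		if ((flags & MPOL_MF_MOVE_ALL) ||
		    page_mapcount(page) <= migrate_max_mapcount) {
			/* ... isolate the page onto 'pagelist' as before ... */
		}
	}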
TESTING:

I have tested this patch on a 16-cpu/4-node/32GB HP rx8620 [ia64]
platform and a 4-socket/dual-core/8GB HP Proliant dl585 Opteron
platform with everyone's favorite benchmark [kernel builds].  The patch
seems stable.  Performance results for the Opteron are reported below.
I have also tested on ia64 with the McCalpin Streams benchmark.  Those
results were reported previously:

	http://marc.theaimsgroup.com/?l=linux-mm&m=114237540231833&w=4

Kernel builds [after make mrproper + make defconfig] on 2.6.16-mm2 on
the dl585.  Times are the avg of 10 runs; the entire kernel source is
likely held in page cache.

	No auto-migrate patches:
		40.69 real   226.40 user   41.77 system

	With auto-migration patches, auto_migrate disabled:
		40.52 real   227.21 user   42.19 system

	With auto-migration patches, auto_migrate enabled, direct [!lazy]:
		40.90 real   227.10 user   42.45 system

	With patches, auto-migration + lazy enabled:
		41.43 real   228.74 user   43.97 system

As mentioned in a previous posting of this series, the compiles don't
run long enough to amortize the cost of migrating the pages.  But see
the McCalpin Streams results linked above.

Also, the defconfig runs on x86_64 don't run all that long, anyway.
So, I tried allmodconfig builds.  The results are, uh, interesting.
These are representative results from half a dozen runs each:

	No auto-migration patches:
		290 real   1740 user   344 system
		[one run @ 316 real: +26 sec from typical]

	With patches, auto-migration disabled:
		287 real   1738 user   346 system
		[basically the same as without the patches: real and user
		 slightly lower, system slightly higher]

	With patches, auto-migration + lazy enabled:
		310 real   1800 user   386 system
		[user and system times fairly consistent; I did see 2 runs
		 with real time +27 sec from the typical runs, as I did with
		 no patches.  The system is running multiuser, so some daemon
		 may jump in occasionally.]

In these runs, the cost of migrating pages really starts to impact the
runtime.  Note that, on an Opteron, every inter-[phys]cpu task
migration is an inter-node migration.

I see LOTS more internode migrations and resulting triggering of page
migrations in a kernel build on the Opteron platform than on the
16-cpu, 4-node ia64 platform--not that this is at all surprising.
E.g., from instrumented runs:

	                                   ia64    Opteron
	inter-node task migrations         2109       4058
	pages unmapped for migration       9898     163627
	anon migration faults              3208      62518
	attempt migrate misplaced page     3007      44973
	actually migrate misplaced pg      3007      44968