From: Lee Schermerhorn
Reply-To: lee.schermerhorn@hp.com
To: linux-mm
Subject: [PATCH/RFC] AutoPage Migration - V0.1 - 0/8 Overview
Date: Fri, 10 Mar 2006 14:33:14 -0500
Message-Id: <1142019195.5204.12.camel@localhost.localdomain>

AutoPage Migration - V0.1 - 0/8 Overview

We have seen some workloads suffer decreases in performance on NUMA
platforms when the Linux scheduler moves tasks away from their initial
memory footprint.  Some users--e.g., HPC--are motivated by this to go to
great lengths to ensure that tasks start up and stay on specific nodes.
2.6.16 includes memory migration mechanisms that allow these users to move
memory along with their tasks--either manually or under control of a load
scheduling program--in response to changing demands on the resources.

Other users--e.g., "Enterprise" applications--would prefer that the system
just "do the right thing" in this respect.  One possible approach would be
to have the system automatically migrate a task's pages when it decides to
move the task to a different node from where it has executed in the past.
One can debate [and we DO, at length] whether this would improve
performance or not.  But why not provide a patch and measure the effects
for various policies?  I.e., "show me the code."  So, ....

This series of patches hooks up Linux 2.6.16 direct page migration to the
task scheduler.  The effect is that, when load balancing moves a task to a
cpu on a different node from where the task last executed, the task is
notified of this change using the same mechanism that notifies a task of
pending signals.  When the task returns to user state, it attempts to
migrate to the new node any pages, in those of its vm areas that are under
default policy, that are not already there.

This behavior is disabled by default, but can be enabled by writing
non-zero to /sys/kernel/migration/sched_migrate_memory.  [Could call this
"auto_migrate_memory"?]

Furthermore, to prevent thrashing, a second control, sched_migrate_interval,
has been implemented.  The load balancer will not move a task to a
different node if it has moved to a new node within the last
sched_migrate_interval seconds.  [The user interface is in seconds;
internally it's in HZ.]  The idea is to give the task time to amortize the
cost of the migration by letting it benefit from local references to the
migrated pages.

The controls--enable/disable and interval--will enable performance testing
of this mechanism to help decide whether it is worth inclusion.

The Patches:

Patches 01-05 apply directly to 2.6.16-rc5-git11.  However, they should
also apply on top of the previously posted "migrate-on-fault" patches with
some fuzz/offsets.  Patch 06 requires that the migrate-on-fault patches be
applied first.

automigrate-01-add-migrate_task_memory.patch

This patch adds the function migrate_task_memory() to mempolicy.c to
migrate vmas with default policy to the new node.
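Roughly, the top-level walk has the following shape.  This is a sketch,
not the patch itself: the signature, the locking and the default-policy
test are illustrative assumptions; only the names migrate_task_memory()
and migrate_vma_to_node() [described next] come from the patches.

/*
 * Sketch only -- would live in mm/mempolicy.c.  Walk the task's vmas and
 * ask migrate_vma_to_node() to move any of their pages that are not
 * already on dest_node.  Locking and the default-policy check are
 * simplified; details are assumed, not taken from the actual patch.
 */
static int migrate_vma_to_node(struct vm_area_struct *vma, int dest_node);

void migrate_task_memory(struct task_struct *tsk, int dest_node)
{
        struct mm_struct *mm = tsk->mm;
        struct vm_area_struct *vma;

        if (!mm)
                return;

        down_read(&mm->mmap_sem);
        for (vma = mm->mmap; vma; vma = vma->vm_next) {
                /* Only vmas under default policy [no vma policy] are touched. */
                if (vma->vm_policy)
                        continue;
                migrate_vma_to_node(vma, dest_node);
        }
        up_read(&mm->mmap_sem);
}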
A second helper function, migrate_vma_to_node(), does the actual work of
scanning the vma's address range [check_range] and invoking the existing
[in 2.6.16-rc*] migrate_pages_to() function for a non-empty pagelist.

Note that this mechanism uses non-aggressive migration--i.e., MPOL_MF_MOVE
rather than MPOL_MF_MOVE_ALL.  Therefore, it gives up rather easily:
e.g., anon pages still shared, copy-on-write, between ancestors and
descendants will not be migrated.

automigrate-02-add-sched_migrate_memory-sysctl.patch

This patch adds the infrastructure for the /sys/kernel/migration group as
well as the sched_migrate_memory control.  Because we have no separate
migration source file, I added this to mempolicy.c.

automigrate-03.0-check-notify-migrate-pending.patch

This patch adds a minimal header to interface the scheduler to the
auto-migration.  The header includes a static inline function for the
scheduler to check for internode migration and notify the task [by setting
the TIF_NOTIFY_RESUME thread info flag] if the task is migrating to a new
node and sched_migrate_memory is enabled.  The header also includes the
function check_migrate_pending() that the task will call when returning to
user state after it notices TIF_NOTIFY_RESUME set.  Both of these
functions become null macros when MIGRATION is not configured.  However,
note that in 2.6.16-rc*, one cannot deselect MIGRATION when building with
NUMA configured.

automigrate-03.1-ia64-check-notify-migrate-pending.patch

This patch adds the call to check_migrate_pending() to the ia64-specific
do_notify_resume_user() function.  Note that this is the same mechanism
used to deliver signals and perfmon events to a task.

automigrate-03.2-x86_64-check-notify-migrate-pending.patch

This patch adds the call to check_migrate_pending() to the x86_64-specific
do_notify_resume() function.  This is just an example for an arch other
than ia64.  I haven't tested this yet.

automigrate-04-hook-sched-internode-migration.patch

This patch hooks the calls to check_internode_migration() into the
scheduler [kernel/sched.c] in places where the scheduler sets a new cpu
for the task--i.e., just before calls to set_task_cpu().  Because these
are in migration paths that are already relatively "heavy-weight", they
don't add overhead to scheduler fast paths.  And they become empty or
constant macros when MIGRATION is not configured in.

automigrate-05-add-internode-migration-hysteresis.patch

This patch adds the sched_migrate_interval control to the
/sys/kernel/migration group, and adds a function to the auto-migrate.h
header--too_soon_for_internode_migration()--to check whether it's too soon
for another internode migration.  This function becomes a macro that
evaluates to "false" [0] when MIGRATION is not configured.  The check is
added to try_to_wake_up() and can_migrate_task() to suppress internode
migrations if the last one was less than sched_migrate_interval seconds
[HZ] ago.  [A rough sketch of how these scheduler-side pieces fit together
appears just before the test results below.]

BONUS PATCH:

automigrate-06-hook-to-migrate-on-fault.patch

This patch, which requires the migrate-on-fault capability, hooks
automigration up to migrate-on-fault, with an additional
control--/sys/kernel/migration/sched_migrate_lazy--to enable it.

TESTING:

I have tested these patches on a 16-cpu/4-node/32GB HP rx8620 [ia64]
platform with everyone's favorite benchmark: kernel builds of
2.6.16-rc5-git11 [after make mrproper + make defconfig].  Times were taken
after a warm-up run, so the entire kernel source is likely held in page
cache.
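As promised above, here is a rough sketch of how the scheduler-side pieces
from patches 03-05 might hang together.  Only the function and control
names come from the patch descriptions; the task_struct fields
[migrate_to_node, last_internode_migration], the signatures and the other
details are assumptions for illustration, not necessarily what the patches
actually do.

#include <linux/sched.h>
#include <linux/jiffies.h>
#include <linux/topology.h>

/* Controls exported via /sys/kernel/migration [patches 02 and 05]. */
extern int sched_migrate_memory;                /* 0 = disabled (default) */
extern unsigned long sched_migrate_interval;    /* kept internally in HZ  */

extern void migrate_task_memory(struct task_struct *tsk, int dest_node);

/* Called just before set_task_cpu() in the scheduler's migration paths. */
static inline void check_internode_migration(struct task_struct *p, int new_cpu)
{
        if (!sched_migrate_memory)
                return;
        if (cpu_to_node(new_cpu) == cpu_to_node(task_cpu(p)))
                return;                              /* same node: nothing to do */

        p->migrate_to_node = cpu_to_node(new_cpu);   /* assumed field */
        set_tsk_thread_flag(p, TIF_NOTIFY_RESUME);   /* signal-style notification */
}

/* Consulted by try_to_wake_up() and can_migrate_task() [patch 05]. */
static inline int too_soon_for_internode_migration(struct task_struct *p)
{
        return time_before(jiffies, p->last_internode_migration +
                                    sched_migrate_interval);
}

/* Called from the arch do_notify_resume*() path on return to user state. */
static inline void check_migrate_pending(void)
{
        int node = current->migrate_to_node;         /* assumed field */

        if (node >= 0) {
                current->migrate_to_node = -1;
                current->last_internode_migration = jiffies;
                migrate_task_memory(current, node);
        }
}

When MIGRATION is not configured, all three collapse to empty or constant
macros, as noted in the patch descriptions.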
Keeping the source in page cache amplifies the effect of the patches
because I can't hide behind disk IO time.

No auto-migrate patches:

  88.20s real   1042.56s user    97.26s system
  88.92s real   1042.27s user    98.08s system
  88.40s real   1043.58s user    96.51s system
  91.45s real   1042.46s user    97.07s system
  93.29s real   1040.90s user    96.88s system
  90.15s real   1042.06s user    97.02s system
  90.45s real   1042.75s user    96.98s system
  90.77s real   1041.87s user    98.61s system
  90.21s real   1042.00s user    96.91s system
  88.50s real   1042.23s user    97.30s system
  -------------------------------------------
  90.03s real   1042.26s user    97.26s system  - mean
   1.59            0.68           0.62          - std dev'n

With auto-migration patches, sched_migrate_memory disabled:

  88.98s real   1042.28s user    96.88s system
  88.75s real   1042.71s user    97.51s system
  89.42s real   1042.32s user    97.42s system
  87.83s real   1042.92s user    96.06s system
  92.47s real   1041.12s user    95.96s system
  89.14s real   1043.77s user    97.10s system
  88.11s real   1044.04s user    95.16s system
  91.74s real   1042.21s user    96.43s system
  89.36s real   1042.31s user    96.56s system
  88.55s real   1042.50s user    96.25s system
  -------------------------------------------
  89.43s real   1042.61s user    96.53s system  - mean
   1.51            0.83           0.72          - std dev'n

With auto-migration patches, sched_migrate_memory enabled:

  90.62s real   1041.64s user   106.80s system
  89.94s real   1042.82s user   105.00s system
  91.34s real   1041.89s user   107.74s system
  90.12s real   1041.77s user   108.01s system
  90.93s real   1042.00s user   106.50s system
  93.97s real   1040.12s user   106.16s system
  90.65s real   1041.87s user   106.81s system
  90.53s real   1041.46s user   106.74s system
  91.84s real   1041.59s user   105.57s system
  90.28s real   1041.69s user   106.64s system
  -------------------------------------------
  91.02s real   1041.68s user   106.60s system  - mean
   1.18            0.67           0.90          - std dev'n

Not stellar!  There is an insignificant decrease in user time, but a ~1%
increase in real time [over the unpatched case] and a ~10% increase in
system time.  In short, page migration, and/or the scanning of vm areas
for eligible pages, is expensive and, for this job, the programs don't see
enough benefit from the resulting locality to pay for the cost of
migration.  Compilers just don't run long enough!

On one instrumented sample auto-direct run:

  migrate_task_memory called  3628 times  = #internode migrations
  migrate_vma_to_node called 17137 times  = 7.68 vma/task
  migrate_page        called  3628 times  = 1.62 pages/task

Very few "eligible" pages found in eligible vmas!  Perhaps we're not being
aggressive enough in attempts to migrate.
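For context, the "aggressiveness" here is just the page-isolation flag
used when each vma is scanned.  Below is a rough sketch of the second
helper from patch 01, modeled on the sys_migrate_pages() path in
mempolicy.c.  The check_range()/migrate_pages_to() signatures and the
nodemask convention are reconstructed from memory of 2.6.16-rc and should
be treated as assumptions, not as the actual patch.

/*
 * Sketch only -- would live in mm/mempolicy.c next to check_range() and
 * migrate_pages_to().  Isolate this vma's pages that are not already on
 * dest_node and hand them to migrate_pages_to().
 */
static int migrate_vma_to_node(struct vm_area_struct *vma, int dest_node)
{
        LIST_HEAD(pagelist);
        struct vm_area_struct *first;
        nodemask_t from_nodes;
        int err = 0;

        from_nodes = node_online_map;           /* "source" nodes ...        */
        node_clear(dest_node, from_nodes);      /* ... = everything but dest */

        /*
         * MPOL_MF_MOVE gives up on pages with extra mappings [shared
         * anon/COW]; MPOL_MF_MOVE_ALL would be the aggressive alternative
         * mentioned in the TODO below.
         */
        first = check_range(vma->vm_mm, vma->vm_start, vma->vm_end,
                            &from_nodes, MPOL_MF_MOVE, &pagelist);

        if (!IS_ERR(first) && !list_empty(&pagelist))
                err = migrate_pages_to(&pagelist, vma, dest_node);

        return err;
}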
------------

Now, with the last patch, hooking automigration to migrate-on-fault:

With auto-migrate + migrate-on-fault patches; sched_migrate_memory disabled:

  88.02s real   1042.77s user    95.62s system
  91.56s real   1041.05s user    97.50s system
  90.41s real   1040.88s user    98.07s system
  90.41s real   1041.64s user    97.00s system
  89.82s real   1042.45s user    96.35s system
  88.28s real   1042.25s user    96.91s system
  91.51s real   1042.74s user    95.90s system
  93.34s real   1041.72s user    96.07s system
  89.09s real   1041.00s user    97.35s system
  89.44s real   1041.57s user    96.55s system
  -------------------------------------------
  90.19s real   1041.81s user    96.73s system  - mean
   1.63            0.71           0.78          - std dev'n

With auto-migrate + migrate-on-fault patches; sched_migrate_memory and
sched_migrate_lazy enabled:

  91.72s real   1039.17s user   108.92s system
  91.02s real   1041.62s user   107.38s system
  91.21s real   1041.84s user   106.63s system
  93.24s real   1039.50s user   107.54s system
  92.64s real   1040.79s user   107.10s system
  92.52s real   1040.79s user   107.14s system
  91.85s real   1039.90s user   108.26s system
  90.58s real   1043.34s user   106.06s system
  92.30s real   1040.88s user   106.64s system
  94.25s real   1039.96s user   106.85s system
  -------------------------------------------
  92.13s real   1040.78s user   107.25s system  - mean
   1.10            1.25           0.84          - std dev'n

Also no win for kernel builds.  Again, slightly less user time, but even
more system and real time [~1 second each] than the auto+direct run.

On one instrumented sample auto-lazy run:

  migrate_task_memory called  3777 times  = #internode migrations
  migrate_vma_to_node called 28586 times  = 7.56 vma/task
  migrate_page        called  3886 times  = 1.02 pages/task

Similar pattern, but a lot more "eligible" vmas and fewer eligible pages
per task.  More internode migrations.

TODO:

Next week, I'll try some longer-running workloads that we know have
suffered from the scheduler moving them away from their memory--e.g.,
McCalpin's STREAM.  I will report results when available.  Maybe also test
with more aggressive migration [MPOL_MF_MOVE_ALL].

I'll also move this to the -mm tree, once I port my trace instrumentation
from relayfs to sysfs.