From mboxrd@z Thu Jan 1 00:00:00 1970 Date: Sun, 10 Jul 2005 18:58:42 -0700 (PDT) From: Paul Jackson Message-Id: <20050711015842.23183.53450.sendpatchset@tomahawk.engr.sgi.com> In-Reply-To: <20050711015835.23183.40213.sendpatchset@tomahawk.engr.sgi.com> References: <20050711015835.23183.40213.sendpatchset@tomahawk.engr.sgi.com> Subject: [PATCH 1/4] cpusets oom_kill and page_alloc tweaks Sender: owner-linux-mm@kvack.org Return-Path: To: Dinakar Guniguntala , Simon Derr , Erich Focht Cc: linux-mm@kvack.org, Paul Jackson List-ID: This patch applies a few comment and code cleanups, to mm/oom_kill.c and mm/page_alloc.c, prior to applying a patch set to improve cpuset management of memory placement. The first comment changed in oom_kill.c, and the first one and only one changed in page_alloc.c, were seriously misleading. The code layout change in select_bad_process() makes room for adding another condition on which a process can be spared the oom killer (see the subsequent cpuset_nodes_overlap patch for this addition). Also a couple typos and spellos that bugged me, while I was here. And add documentation for cpuset flag notify_on_release. This patch should have no material affect. Signed-off-by: Paul Jackson Index: linux-2.6-mem_exclusive/mm/oom_kill.c =================================================================== --- linux-2.6-mem_exclusive.orig/mm/oom_kill.c 2005-06-29 23:11:41.000000000 -0700 +++ linux-2.6-mem_exclusive/mm/oom_kill.c 2005-07-02 17:40:03.000000000 -0700 @@ -6,8 +6,8 @@ * for goading me into coding this file... * * The routines in this file are used to kill a process when - * we're seriously out of memory. This gets called from kswapd() - * in linux/mm/vmscan.c when we really run out of memory. + * we're seriously out of memory. This gets called from __alloc_pages() + * in mm/page_alloc.c when we really run out of memory. * * Since we won't call these routines often (on a well-configured * machine) this file will double as a 'coding guide' and a signpost @@ -26,7 +26,7 @@ /** * oom_badness - calculate a numeric value for how bad this task has been * @p: task struct of which task we should calculate - * @p: current uptime in seconds + * @uptime: current uptime in seconds * * The formula used is relatively simple and documented inline in the * function. The main rationale is that we want to select a good task @@ -57,9 +57,9 @@ unsigned long badness(struct task_struct /* * Processes which fork a lot of child processes are likely - * a good choice. We add the vmsize of the childs if they + * a good choice. We add the vmsize of the children if they * have an own mm. This prevents forking servers to flood the - * machine with an endless amount of childs + * machine with an endless amount of children */ list_for_each(tsk, &p->children) { struct task_struct *chld; @@ -143,28 +143,32 @@ static struct task_struct * select_bad_p struct timespec uptime; do_posix_clock_monotonic_gettime(&uptime); - do_each_thread(g, p) - /* skip the init task with pid == 1 */ - if (p->pid > 1 && p->oomkilladj != OOM_DISABLE) { - unsigned long points; + do_each_thread(g, p) { + unsigned long points; + int releasing; - /* - * This is in the process of releasing memory so wait it - * to finish before killing some other task by mistake. - */ - if ((unlikely(test_tsk_thread_flag(p, TIF_MEMDIE)) || (p->flags & PF_EXITING)) && - !(p->flags & PF_DEAD)) - return ERR_PTR(-1UL); - if (p->flags & PF_SWAPOFF) - return p; - - points = badness(p, uptime.tv_sec); - if (points > maxpoints || !chosen) { - chosen = p; - maxpoints = points; - } + /* skip the init task with pid == 1 */ + if (p->pid == 1) + continue; + if (p->oomkilladj == OOM_DISABLE) + continue; + /* + * This is in the process of releasing memory so for wait it + * to finish before killing some other task by mistake. + */ + releasing = test_tsk_thread_flag(p, TIF_MEMDIE) || + p->flags & PF_EXITING; + if (releasing && !(p->flags & PF_DEAD)) + return ERR_PTR(-1UL); + if (p->flags & PF_SWAPOFF) + return p; + + points = badness(p, uptime.tv_sec); + if (points > maxpoints || !chosen) { + chosen = p; + maxpoints = points; } - while_each_thread(g, p); + } while_each_thread(g, p); return chosen; } @@ -189,7 +193,8 @@ static void __oom_kill_task(task_t *p) return; } task_unlock(p); - printk(KERN_ERR "Out of Memory: Killed process %d (%s).\n", p->pid, p->comm); + printk(KERN_ERR "Out of Memory: Killed process %d (%s).\n", + p->pid, p->comm); /* * We give our sacrificial lamb high priority and access to Index: linux-2.6-mem_exclusive/mm/page_alloc.c =================================================================== --- linux-2.6-mem_exclusive.orig/mm/page_alloc.c 2005-06-29 23:11:41.000000000 -0700 +++ linux-2.6-mem_exclusive/mm/page_alloc.c 2005-07-02 17:40:04.000000000 -0700 @@ -898,10 +898,9 @@ rebalance: if (likely(did_some_progress)) { /* - * Go through the zonelist yet one more time, keep - * very high watermark here, this is only to catch - * a parallel oom killing, we must fail if we're still - * under heavy pressure. + * Go through the zone list yet one more time, with + * min watermark and trying harder. Since try_to_free_pages + * made some progress, there might be something free. */ for (i = 0; (z = zones[i]) != NULL; i++) { if (!zone_watermark_ok(z, order, z->pages_min, Index: linux-2.6-mem_exclusive/Documentation/cpusets.txt =================================================================== --- linux-2.6-mem_exclusive.orig/Documentation/cpusets.txt 2005-07-02 17:40:04.000000000 -0700 +++ linux-2.6-mem_exclusive/Documentation/cpusets.txt 2005-07-02 17:40:54.000000000 -0700 @@ -143,7 +143,7 @@ Cpusets extends these two mechanisms as The implementation of cpusets requires a few, simple hooks into the rest of the kernel, none in performance critical paths: - - in main/init.c, to initialize the root cpuset at system boot. + - in init/main.c, to initialize the root cpuset at system boot. - in fork and exit, to attach and detach a task from its cpuset. - in sched_setaffinity, to mask the requested CPUs by what's allowed in that tasks cpuset. @@ -154,7 +154,7 @@ into the rest of the kernel, none in per and related changes in both sched.c and arch/ia64/kernel/domain.c - in the mbind and set_mempolicy system calls, to mask the requested Memory Nodes by what's allowed in that tasks cpuset. - - in page_alloc, to restrict memory to allowed nodes. + - in page_alloc.c, to restrict memory to allowed nodes. - in vmscan.c, to restrict page recovery to the current cpuset. In addition a new file system, of type "cpuset" may be mounted, @@ -182,6 +182,7 @@ containing the following files describin - mems: list of Memory Nodes in that cpuset - cpu_exclusive flag: is cpu placement exclusive? - mem_exclusive flag: is memory placement exclusive? + - notify_on_release: call /sbin/cpuset_release_agent on exit if set - tasks: list of tasks (by pid) attached to that cpuset New cpusets are created using the mkdir system call or shell @@ -356,7 +357,8 @@ Now you want to do something with this c In this directory you can find several files: # ls -cpus cpu_exclusive mems mem_exclusive tasks +cpu_exclusive mem_exclusive notify_on_release +cpus mems tasks Reading them will give you information about the state of this cpuset: the CPUs and Memory Nodes it can use, the processes that are using Index: linux-2.6-mem_exclusive/include/linux/gfp.h =================================================================== --- linux-2.6-mem_exclusive.orig/include/linux/gfp.h 2005-07-02 17:40:04.000000000 -0700 +++ linux-2.6-mem_exclusive/include/linux/gfp.h 2005-07-02 17:42:02.000000000 -0700 @@ -39,7 +39,7 @@ struct vm_area_struct; #define __GFP_COMP 0x4000u /* Add compound page metadata */ #define __GFP_ZERO 0x8000u /* Return zeroed page on success */ #define __GFP_NOMEMALLOC 0x10000u /* Don't use emergency reserves */ -#define __GFP_NORECLAIM 0x20000u /* No realy zone reclaim during allocation */ +#define __GFP_NORECLAIM 0x20000u /* No zone reclaim during page_cache_alloc */ #define __GFP_BITS_SHIFT 20 /* Room for 20 __GFP_FOO bits */ #define __GFP_BITS_MASK ((1 << __GFP_BITS_SHIFT) - 1) -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson 1.650.933.1373 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: aart@kvack.org