From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Thu, 13 Sep 2007 17:36:06 -0700 (PDT)
From: David Rientjes
Subject: Re: [PATCH 04 of 24] serialize oom killer
In-Reply-To: 
Message-ID: 
References: <871b7a4fd566de081120.1187786931@v2.random>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-linux-mm@kvack.org
Return-Path: 
To: Christoph Lameter
Cc: Paul Jackson , Andrea Arcangeli , linux-mm@kvack.org
List-ID: 

On Thu, 13 Sep 2007, Christoph Lameter wrote:

> > It's easier to serialize it outside of out_of_memory() instead, since it
> > only has a single caller and we don't need to serialize for sysrq.
> >
> > This seems like it would collapse down nicely to a global or per-cpuset
> > serialization with an added helper function implemented partially in
> > kernel/cpuset.c for the CONFIG_CPUSETS case.
> >
> > Then, in __alloc_pages(), we test for either a global or per-cpuset
> > spin_trylock() and, if we acquire it, call out_of_memory() and goto
> > restart as we currently do. If it's contended, we reschedule ourself and
> > goto restart when we awaken.
>
> Could you rephrase that in patch form? ;-)
>

Yeah, it turned out to be a little more invasive than I thought, but it
appears to be the cleanest solution for both the general CONSTRAINT_NONE
and the per-cpuset CONSTRAINT_CPUSET cases.

I've been trying to keep score at home, but I've lost track of what
patches from the series we're keeping, so this is against HEAD.


serialize oom killer

Serializes the OOM killer both globally and per-cpuset, depending on the
system configuration.

A new spinlock, oom_lock, is introduced for the global case.  It
serializes the OOM killer for systems that are not using cpusets.  Only
one system task may enter the OOM killer at a time to prevent
unnecessarily killing others.

A per-cpuset flag, CS_IS_OOM, is introduced in the flags field of struct
cpuset.  It serializes the OOM killer only for hardwall allocations
targeted at that cpuset.  Only one task per cpuset may enter the OOM
killer at a time to prevent unnecessarily killing others.  When a
per-cpuset OOM killing is taking place, the global spinlock is also
locked, since we'll be alleviating that condition at the same time.

Regardless of the synchronization primitive used, if a task cannot
acquire the OOM lock, it is put to sleep before retrying the triggering
allocation so that the OOM killer may finish and free some memory.

We acquire either lock before attempting one last try at
get_page_from_freelist() with a very high watermark; otherwise we could
invoke the OOM killer needlessly if another thread reschedules between
this allocation attempt and trying to take the OOM lock.

Also converts the CONSTRAINT_{NONE,CPUSET,MEMORY_POLICY} defines to an
enum and moves them to include/linux/swap.h.  We're going to need an
include/linux/oom_kill.h soon, probably.
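
Just to spell out the intended flow, here is a small userspace sketch of
the serialization pattern described above (plain pthreads, nothing
kernel-specific; the memory_freed flag and every name in it are made-up
stand-ins, with pthread_mutex_trylock() playing the role that
oom_test_and_set_lock() plays below, i.e. zero means the lock was taken):
the task that wins the trylock does the expensive killing exactly once,
and every other task sleeps briefly and retries its allocation instead of
piling into the OOM killer.

/*
 * Userspace sketch only -- not kernel code.  A mutex stands in for the
 * OOM lock and an atomic flag stands in for memory actually coming free.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t oom_lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_int memory_freed;

static void *alloc_path(void *arg)
{
	long id = (long)arg;

	for (;;) {
		/* Retry the "allocation"; succeeds once the killer freed memory. */
		if (atomic_load(&memory_freed)) {
			printf("task %ld: allocation succeeded\n", id);
			return NULL;
		}
		/* Zero return means we hold the lock, as in the patch. */
		if (pthread_mutex_trylock(&oom_lock) == 0) {
			if (!atomic_load(&memory_freed)) {
				printf("task %ld: invoking the OOM killer\n", id);
				usleep(100 * 1000);	/* pick and kill a task */
				atomic_store(&memory_freed, 1);
			}
			pthread_mutex_unlock(&oom_lock);
		} else {
			/* Contended: sleep so the holder can finish, then retry. */
			usleep(10 * 1000);
		}
	}
}

int main(void)
{
	pthread_t tasks[4];
	long i;

	for (i = 0; i < 4; i++)
		pthread_create(&tasks[i], NULL, alloc_path, (void *)(i + 1));
	for (i = 0; i < 4; i++)
		pthread_join(tasks[i], NULL);
	return 0;
}

Build with cc -pthread; exactly one of the four threads should report
invoking the killer, and all four eventually see their retried allocation
succeed.
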
Cc: Andrea Arcangeli
Cc: Christoph Lameter
Signed-off-by: David Rientjes
---
 drivers/char/sysrq.c   |    3 +-
 include/linux/cpuset.h |   13 ++++++++++-
 include/linux/swap.h   |   14 ++++++++++-
 kernel/cpuset.c        |   16 +++++++++++++
 mm/oom_kill.c          |   58 ++++++++++++++++++++++++++++++++++++-----------
 mm/page_alloc.c        |   42 +++++++++++++++++++++++-----------
 6 files changed, 114 insertions(+), 32 deletions(-)

diff --git a/drivers/char/sysrq.c b/drivers/char/sysrq.c
--- a/drivers/char/sysrq.c
+++ b/drivers/char/sysrq.c
@@ -270,8 +270,7 @@ static struct sysrq_key_op sysrq_term_op = {
 
 static void moom_callback(struct work_struct *ignored)
 {
-	out_of_memory(&NODE_DATA(0)->node_zonelists[ZONE_NORMAL],
-			GFP_KERNEL, 0);
+	out_of_memory(GFP_KERNEL, 0, CONSTRAINT_NONE);
 }
 
 static DECLARE_WORK(moom_work, moom_callback);
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -60,7 +60,8 @@ extern char *cpuset_task_status_allowed(struct task_struct *task, char *buffer);
 
 extern void cpuset_lock(void);
 extern void cpuset_unlock(void);
-
+extern int cpuset_oom_test_and_set_lock(void);
+extern int cpuset_oom_unlock(void);
 extern int cpuset_mem_spread_node(void);
 
 static inline int cpuset_do_page_mem_spread(void)
@@ -129,6 +130,16 @@ static inline char *cpuset_task_status_allowed(struct task_struct *task,
 static inline void cpuset_lock(void) {}
 static inline void cpuset_unlock(void) {}
 
+static inline int cpuset_oom_test_and_set_lock(void)
+{
+	return -1;
+}
+
+static inline int cpuset_oom_unlock(void)
+{
+	return 0;
+}
+
 static inline int cpuset_mem_spread_node(void)
 {
 	return 0;
diff --git a/include/linux/swap.h b/include/linux/swap.h
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -159,9 +159,21 @@ struct swap_list_t {
 #define vm_swap_full() (nr_swap_pages*2 < total_swap_pages)
 
 /* linux/mm/oom_kill.c */
-extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order);
+/*
+ * Types of limitations to the nodes from which allocations may occur
+ */
+enum oom_constraint {
+	CONSTRAINT_NONE,
+	CONSTRAINT_CPUSET,
+	CONSTRAINT_MEMORY_POLICY,
+};
+extern void out_of_memory(gfp_t gfp_mask, int order,
+				enum oom_constraint constraint);
 extern int register_oom_notifier(struct notifier_block *nb);
 extern int unregister_oom_notifier(struct notifier_block *nb);
+extern int oom_test_and_set_lock(struct zonelist *zonelist, gfp_t gfp_mask,
+				enum oom_constraint *constraint);
+extern void oom_unlock(enum oom_constraint constraint);
 
 /* linux/mm/memory.c */
 extern void swapin_readahead(swp_entry_t, unsigned long, struct vm_area_struct *);
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -109,6 +109,7 @@ typedef enum {
 	CS_NOTIFY_ON_RELEASE,
 	CS_SPREAD_PAGE,
 	CS_SPREAD_SLAB,
+	CS_IS_OOM,
 } cpuset_flagbits_t;
 
 /* convenient tests for these bits */
@@ -147,6 +148,11 @@ static inline int is_spread_slab(const struct cpuset *cs)
 	return test_bit(CS_SPREAD_SLAB, &cs->flags);
 }
 
+static inline int is_oom(const struct cpuset *cs)
+{
+	return test_bit(CS_IS_OOM, &cs->flags);
+}
+
 /*
  * Increment this integer everytime any cpuset changes its
  * mems_allowed value.  Users of cpusets can track this generation
@@ -2527,6 +2533,16 @@ void cpuset_unlock(void)
 	mutex_unlock(&callback_mutex);
 }
 
+int cpuset_oom_test_and_set_lock(void)
+{
+	return test_and_set_bit(CS_IS_OOM, &current->cpuset->flags);
+}
+
+int cpuset_oom_unlock(void)
+{
+	return test_and_clear_bit(CS_IS_OOM, &current->cpuset->flags);
+}
+
 /**
  * cpuset_mem_spread_node() - On which node to begin search for a page
  *
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -27,6 +27,7 @@
 #include 
 
 int sysctl_panic_on_oom;
+static DEFINE_SPINLOCK(oom_lock);
 /* #define DEBUG */
 
 /**
@@ -164,13 +165,6 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
 }
 
 /*
- * Types of limitations to the nodes from which allocations may occur
- */
-#define CONSTRAINT_NONE 1
-#define CONSTRAINT_MEMORY_POLICY 2
-#define CONSTRAINT_CPUSET 3
-
-/*
  * Determine the type of allocation constraint.
  */
 static inline int constrained_alloc(struct zonelist *zonelist, gfp_t gfp_mask)
@@ -387,6 +381,48 @@ int unregister_oom_notifier(struct notifier_block *nb)
 }
 EXPORT_SYMBOL_GPL(unregister_oom_notifier);
 
+/*
+ * If using cpusets, try to lock task's per-cpuset OOM lock; otherwise, try to
+ * lock the global OOM spinlock. Returns non-zero if the lock is contended or
+ * zero if acquired.
+ */
+int oom_test_and_set_lock(struct zonelist *zonelist, gfp_t gfp_mask,
+		enum oom_constraint *constraint)
+{
+	int ret;
+
+	*constraint = constrained_alloc(zonelist, gfp_mask);
+	switch (*constraint) {
+	case CONSTRAINT_CPUSET:
+		ret = cpuset_oom_test_and_set_lock();
+		if (!ret)
+			spin_trylock(&oom_lock);
+		break;
+	default:
+		ret = spin_trylock(&oom_lock);
+		break;
+	}
+	return ret;
+}
+
+/*
+ * If using cpusets, unlock task's per-cpuset OOM lock; otherwise, unlock the
+ * global OOM spinlock.
+ */
+void oom_unlock(enum oom_constraint constraint)
+{
+	switch (constraint) {
+	case CONSTRAINT_CPUSET:
+		if (likely(spin_is_locked(&oom_lock)))
+			spin_unlock(&oom_lock);
+		cpuset_oom_unlock();
+		break;
+	default:
+		spin_unlock(&oom_lock);
+		break;
+	}
+}
+
 /**
  * out_of_memory - kill the "best" process when we run out of memory
  *
@@ -395,12 +431,11 @@ EXPORT_SYMBOL_GPL(unregister_oom_notifier);
  * OR try to be smart about which process to kill. Note that we
  * don't have to be perfect here, we just have to be good.
  */
-void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order)
+void out_of_memory(gfp_t gfp_mask, int order, enum oom_constraint constraint)
 {
 	struct task_struct *p;
 	unsigned long points = 0;
 	unsigned long freed = 0;
-	int constraint;
 
 	blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
 	if (freed > 0)
@@ -418,11 +453,6 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order)
 	if (sysctl_panic_on_oom == 2)
 		panic("out of memory. Compulsory panic_on_oom is selected.\n");
 
-	/*
-	 * Check if there were limitations on the allocation (only relevant for
-	 * NUMA) that may require different handling.
-	 */
-	constraint = constrained_alloc(zonelist, gfp_mask);
 	cpuset_lock();
 	read_lock(&tasklist_lock);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1352,22 +1352,36 @@ nofail_alloc:
 		if (page)
 			goto got_pg;
 	} else if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
-		/*
-		 * Go through the zonelist yet one more time, keep
-		 * very high watermark here, this is only to catch
-		 * a parallel oom killing, we must fail if we're still
-		 * under heavy pressure.
-		 */
-		page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, order,
-				zonelist, ALLOC_WMARK_HIGH|ALLOC_CPUSET);
-		if (page)
-			goto got_pg;
+		enum oom_constraint constraint = CONSTRAINT_NONE;
 
-		/* The OOM killer will not help higher order allocs so fail */
-		if (order > PAGE_ALLOC_COSTLY_ORDER)
-			goto nopage;
+		if (!oom_test_and_set_lock(zonelist, gfp_mask, &constraint)) {
+			/*
+			 * Go through the zonelist yet one more time, keep
+			 * very high watermark here, this is only to catch
+			 * a previous oom killing, we must fail if we're still
+			 * under heavy pressure.
+			 */
+			page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL,
+					order, zonelist,
+					ALLOC_WMARK_HIGH|ALLOC_CPUSET);
+			if (page) {
+				oom_unlock(constraint);
+				goto got_pg;
+			}
+
+			/*
+			 * The OOM killer will not help higher order allocs so
+			 * fail
+			 */
+			if (order > PAGE_ALLOC_COSTLY_ORDER) {
+				oom_unlock(constraint);
+				goto nopage;
+			}
 
-		out_of_memory(zonelist, gfp_mask, order);
+			out_of_memory(gfp_mask, order, constraint);
+			oom_unlock(constraint);
+		} else
+			schedule_timeout_uninterruptible(1);
 
 		goto restart;
 	}

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org