From: Michal Hocko <mhocko@kernel.org>
To: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: akpm@linux-foundation.org, linux-mm@kvack.org,
rientjes@google.com, hannes@cmpxchg.org, guro@fb.com,
tj@kernel.org, vdavydov.dev@gmail.com,
torvalds@linux-foundation.org
Subject: Re: [PATCH v2] mm,page_alloc: wait for oom_lock than back off
Date: Mon, 26 Feb 2018 10:27:25 +0100
Message-ID: <20180226092725.GB16269@dhcp22.suse.cz>
In-Reply-To: <201802241700.JJB51016.FQOLFJHFOOSVMt@I-love.SAKURA.ne.jp>
On Sat 24-02-18 17:00:51, Tetsuo Handa wrote:
> From d922dd170c2bed01a775e8cca0871098aecc253d Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Date: Sat, 24 Feb 2018 16:49:21 +0900
> Subject: [PATCH v2] mm,page_alloc: wait for oom_lock than back off
>
> This patch fixes a bug that is essentially the same as the one fixed by
> commit 400e22499dd92613 ("mm: don't warn about allocations which stall for
> too long").
>
> Currently __alloc_pages_may_oom() uses mutex_trylock(&oom_lock) based on
> the assumption that the owner of oom_lock is making progress on our
> behalf. But an OOM lockup can be triggered when many threads call
> __alloc_pages_slowpath() concurrently, because all CPU resources are
> wasted on pointless direct reclaim efforts. That is, the
> schedule_timeout_uninterruptible(1) in __alloc_pages_may_oom() does not
> always give enough CPU time to the owner of the oom_lock.
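>
> As a minimal sketch, the behavioral change looks like this (simplified
> pseudocode, not the exact kernel code; the full change is in the
> page_alloc.c hunk below):
>
>         /* before: back off and hope the lock owner makes progress */
>         if (!mutex_trylock(&oom_lock)) {
>                 *did_some_progress = 1;
>                 /* a one-jiffy sleep that may still starve the owner */
>                 schedule_timeout_uninterruptible(1);
>                 return NULL;
>         }
>
>         /* after: sleep on the lock, yielding the CPU to the owner */
>         if (mutex_lock_killable(&oom_lock)) {
>                 /* interrupted by a fatal signal; bail out */
>                 *did_some_progress = 1;
>                 return NULL;
>         }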
>
> It is possible that the owner of oom_lock is preempted by other threads,
> and preemption makes the OOM situation much worse. The page allocator is
> not responsible for CPU time spent on things other than memory
> allocation requests, but burning CPU time on memory allocation requests
> without allowing the owner of oom_lock to make forward progress is a
> page allocator bug.
>
> Therefore, this patch changes the allocator to wait for oom_lock, in
> order to guarantee that threads waiting for the owner of oom_lock to
> make forward progress do not consume CPU resources on pointless direct
> reclaim efforts.
>
> We know that printk() in an OOM situation, where a lot of threads are
> almost busy-looping, is a nightmare. As a side effect of this patch,
> printk() with oom_lock held can start utilizing the CPU resources saved
> by this patch (and suffer less preemption during printk(), making
> printk() complete faster).
>
> By replacing !mutex_trylock(&oom_lock) with mutex_lock_killable(&oom_lock),
> it becomes possible for many threads to prevent the OOM reaper from
> making forward progress. Thus, this patch removes mutex_lock(&oom_lock)
> from the OOM reaper.
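>
> For reference, after this change __oom_reap_task_mm() serializes only
> on the victim's mmap_sem (a sketch; the full change is in the first
> oom_kill.c hunk below):
>
>         if (!down_read_trylock(&mm->mmap_sem)) {
>                 ret = false;    /* contended; the reaper will retry */
>                 goto unlock_oom;
>         }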
>
> Also, since nobody uses oom_lock serialization when setting MMF_OOM_SKIP,
> and we do not retry the last-second allocation attempt after confirming
> that there is no !MMF_OOM_SKIP OOM victim, continuing to use
> ALLOC_WMARK_HIGH would increase the possibility of needlessly selecting
> more OOM victims. Thus, this patch changes to use ALLOC_WMARK_MIN.
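>
> Concretely, the last-second attempt keeps masking out direct reclaim
> and only the watermark argument changes (see the page_alloc.c hunk
> below):
>
>         page = get_page_from_freelist((gfp_mask | __GFP_HARDWALL) &
>                                       ~__GFP_DIRECT_RECLAIM, order,
>                                       ALLOC_WMARK_MIN | ALLOC_CPUSET, ac);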
>
> Also, we do not want to sleep with oom_lock held, so that threads
> waiting at mutex_lock_killable(&oom_lock) can try the last-second
> allocation attempt (because the OOM reaper starts reclaiming memory
> without waiting for oom_lock) and start selecting the next OOM victim
> if necessary. This patch therefore moves the short sleep from inside
> oom_lock to outside of it.
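>
> The resulting pattern, sketched for pagefault_out_of_memory() (lock
> acquisition elided; see the corresponding hunk below):
>
>         out_of_memory(&oc);
>         mutex_unlock(&oom_lock);
>         /* the short sleep now happens after dropping oom_lock */
>         schedule_timeout_killable(1);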
This patch does three different things mangled into one. All that with a
patch description which talks a lot but doesn't really explain those
changes.

Moreover, you are effectively tuning for an artificial test case that
overloads the page allocator, and you add a central lock where many
tasks would block. I have already tried to explain that this is not a
universal win, and you should provide a real-life example where this is
really helpful.

While I do agree that removing the oom_lock from __oom_reap_task_mm is a
sensible thing, changing the last allocation attempt to ALLOC_WMARK_MIN
is not all that straightforward and would require a much more detailed
explanation.

So the patch in its current form is not mergeable IMHO.
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: David Rientjes <rientjes@google.com>
> ---
> mm/oom_kill.c | 35 +++--------------------------------
> mm/page_alloc.c | 30 ++++++++++++++++++------------
> 2 files changed, 21 insertions(+), 44 deletions(-)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 8219001..802214f 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -491,22 +491,6 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
> struct vm_area_struct *vma;
> bool ret = true;
>
> - /*
> - * We have to make sure to not race with the victim exit path
> - * and cause premature new oom victim selection:
> - * __oom_reap_task_mm exit_mm
> - * mmget_not_zero
> - * mmput
> - * atomic_dec_and_test
> - * exit_oom_victim
> - * [...]
> - * out_of_memory
> - * select_bad_process
> - * # no TIF_MEMDIE task selects new victim
> - * unmap_page_range # frees some memory
> - */
> - mutex_lock(&oom_lock);
> -
> if (!down_read_trylock(&mm->mmap_sem)) {
> ret = false;
> trace_skip_task_reaping(tsk->pid);
> @@ -581,7 +565,6 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
>
> trace_finish_task_reaping(tsk->pid);
> unlock_oom:
> - mutex_unlock(&oom_lock);
> return ret;
> }
>
> @@ -1078,7 +1061,6 @@ bool out_of_memory(struct oom_control *oc)
> {
> unsigned long freed = 0;
> enum oom_constraint constraint = CONSTRAINT_NONE;
> - bool delay = false; /* if set, delay next allocation attempt */
>
> if (oom_killer_disabled)
> return false;
> @@ -1128,10 +1110,8 @@ bool out_of_memory(struct oom_control *oc)
> return true;
> }
>
> - if (mem_cgroup_select_oom_victim(oc) && oom_kill_memcg_victim(oc)) {
> - delay = true;
> + if (mem_cgroup_select_oom_victim(oc) && oom_kill_memcg_victim(oc))
> goto out;
> - }
>
> select_bad_process(oc);
> /* Found nothing?!?! Either we hang forever, or we panic. */
> @@ -1139,20 +1119,10 @@ bool out_of_memory(struct oom_control *oc)
> dump_header(oc, NULL);
> panic("Out of memory and no killable processes...\n");
> }
> - if (oc->chosen_task && oc->chosen_task != INFLIGHT_VICTIM) {
> + if (oc->chosen_task && oc->chosen_task != INFLIGHT_VICTIM)
> oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" :
> "Memory cgroup out of memory");
> - delay = true;
> - }
> -
> out:
> - /*
> - * Give the killed process a good chance to exit before trying
> - * to allocate memory again.
> - */
> - if (delay)
> - schedule_timeout_killable(1);
> -
> return !!oc->chosen_task;
> }
>
> @@ -1178,4 +1148,5 @@ void pagefault_out_of_memory(void)
> return;
> out_of_memory(&oc);
> mutex_unlock(&oom_lock);
> + schedule_timeout_killable(1);
> }
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 2836bc9..23d1769 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3438,26 +3438,26 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
>
> *did_some_progress = 0;
>
> - /*
> - * Acquire the oom lock. If that fails, somebody else is
> - * making progress for us.
> - */
> - if (!mutex_trylock(&oom_lock)) {
> + if (mutex_lock_killable(&oom_lock)) {
> *did_some_progress = 1;
> - schedule_timeout_uninterruptible(1);
> return NULL;
> }
>
> /*
> - * Go through the zonelist yet one more time, keep very high watermark
> - * here, this is only to catch a parallel oom killing, we must fail if
> - * we're still under heavy pressure. But make sure that this reclaim
> - * attempt shall not depend on __GFP_DIRECT_RECLAIM && !__GFP_NORETRY
> - * allocation which will never fail due to oom_lock already held.
> + * This allocation attempt must not depend on __GFP_DIRECT_RECLAIM &&
> + * !__GFP_NORETRY allocation which will never fail due to oom_lock
> + * already held.
> + *
> + * Since neither the OOM reaper nor exit_mmap() waits for oom_lock when
> + * setting MMF_OOM_SKIP on the OOM victim's mm, we might needlessly
> + * select more OOM victims if we use ALLOC_WMARK_HIGH here. But since
> + * this allocation attempt does not sleep, we will not fail to invoke
> + * the OOM killer even if we choose ALLOC_WMARK_MIN here. Thus, we use
> + * ALLOC_WMARK_MIN here.
> */
> page = get_page_from_freelist((gfp_mask | __GFP_HARDWALL) &
> ~__GFP_DIRECT_RECLAIM, order,
> - ALLOC_WMARK_HIGH|ALLOC_CPUSET, ac);
> + ALLOC_WMARK_MIN | ALLOC_CPUSET, ac);
> if (page)
> goto out;
>
> @@ -4205,6 +4205,12 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
> /* Retry as long as the OOM killer is making progress */
> if (did_some_progress) {
> no_progress_loops = 0;
> + /*
> + * This schedule_timeout_*() serves as a guaranteed sleep for
> + * PF_WQ_WORKER threads when __zone_watermark_ok() == false.
> + */
> + if (!tsk_is_oom_victim(current))
> + schedule_timeout_uninterruptible(1);
> goto retry;
> }
>
> --
> 1.8.3.1
--
Michal Hocko
SUSE Labs