[PATCH 1/4] mm,oom: Don't call schedule_timeout_killable() with oom

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 1/4] mm,oom: Don't call schedule_timeout_killable() with oom_lock held.
@ 2018-06-07 11:00 Tetsuo Handa
  2018-06-07 11:00 ` [PATCH 2/4] mm,page_alloc: Move the short sleep to should_reclaim_retry() Tetsuo Handa
                   ` (3 more replies)
  0 siblings, 4 replies; 10+ messages in thread
From: Tetsuo Handa @ 2018-06-07 11:00 UTC (permalink / raw)
  To: akpm
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Tejun Heo, Vladimir Davydov

When I was examining a bug which occurs under CPU + memory pressure, I
observed that a thread which called out_of_memory() can sleep for minutes
at schedule_timeout_killable(1) with oom_lock held when many threads are
doing direct reclaim.

The whole point of the sleep is give the OOM victim some time to exit.
But since commit 27ae357fa82be5ab ("mm, oom: fix concurrent munlock and
oom reaper unmap, v3") changed the OOM victim to wait for oom_lock in order
to close race window at exit_mmap(), the whole point of this sleep is lost
now. We need to make sure that the thread which called out_of_memory() will
release oom_lock shortly. Therefore, this patch brings the sleep to outside
of the OOM path. Whether it is safe to remove the sleep will be tested by
future patch.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Tejun Heo <tj@kernel.org>
---
 mm/oom_kill.c   | 38 +++++++++++++++++---------------------
 mm/page_alloc.c |  7 ++++++-
 2 files changed, 23 insertions(+), 22 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 8ba6cb8..23ce67f 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -479,6 +479,21 @@ bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
 static struct task_struct *oom_reaper_list;
 static DEFINE_SPINLOCK(oom_reaper_lock);
 
+/*
+ * We have to make sure not to cause premature new oom victim selection.
+ *
+ * __alloc_pages_may_oom()     oom_reap_task_mm()/exit_mmap()
+ *   mutex_trylock(&oom_lock)
+ *   get_page_from_freelist(ALLOC_WMARK_HIGH) # fails
+ *                               unmap_page_range() # frees some memory
+ *                               set_bit(MMF_OOM_SKIP)
+ *   out_of_memory()
+ *     select_bad_process()
+ *       test_bit(MMF_OOM_SKIP) # selects new oom victim
+ *   mutex_unlock(&oom_lock)
+ *
+ * Therefore, the callers hold oom_lock when calling this function.
+ */
 void __oom_reap_task_mm(struct mm_struct *mm)
 {
 	struct vm_area_struct *vma;
@@ -523,20 +538,6 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 {
 	bool ret = true;
 
-	/*
-	 * We have to make sure to not race with the victim exit path
-	 * and cause premature new oom victim selection:
-	 * oom_reap_task_mm		exit_mm
-	 *   mmget_not_zero
-	 *				  mmput
-	 *				    atomic_dec_and_test
-	 *				  exit_oom_victim
-	 *				[...]
-	 *				out_of_memory
-	 *				  select_bad_process
-	 *				    # no TIF_MEMDIE task selects new victim
-	 *  unmap_page_range # frees some memory
-	 */
 	mutex_lock(&oom_lock);
 
 	if (!down_read_trylock(&mm->mmap_sem)) {
@@ -1077,15 +1078,9 @@ bool out_of_memory(struct oom_control *oc)
 		dump_header(oc, NULL);
 		panic("Out of memory and no killable processes...\n");
 	}
-	if (oc->chosen && oc->chosen != (void *)-1UL) {
+	if (oc->chosen && oc->chosen != (void *)-1UL)
 		oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" :
 				 "Memory cgroup out of memory");
-		/*
-		 * Give the killed process a good chance to exit before trying
-		 * to allocate memory again.
-		 */
-		schedule_timeout_killable(1);
-	}
 	return !!oc->chosen;
 }
 
@@ -1111,4 +1106,5 @@ void pagefault_out_of_memory(void)
 		return;
 	out_of_memory(&oc);
 	mutex_unlock(&oom_lock);
+	schedule_timeout_killable(1);
 }
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 22320ea27..e90f152 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3471,7 +3471,6 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
 	 */
 	if (!mutex_trylock(&oom_lock)) {
 		*did_some_progress = 1;
-		schedule_timeout_uninterruptible(1);
 		return NULL;
 	}
 
@@ -4238,6 +4237,12 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 	/* Retry as long as the OOM killer is making progress */
 	if (did_some_progress) {
 		no_progress_loops = 0;
+		/*
+		 * This schedule_timeout_*() serves as a guaranteed sleep for
+		 * PF_WQ_WORKER threads when __zone_watermark_ok() == false.
+		 */
+		if (!tsk_is_oom_victim(current))
+			schedule_timeout_uninterruptible(1);
 		goto retry;
 	}
 
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 2/4] mm,page_alloc: Move the short sleep to should_reclaim_retry()
  2018-06-07 11:00 [PATCH 1/4] mm,oom: Don't call schedule_timeout_killable() with oom_lock held Tetsuo Handa
@ 2018-06-07 11:00 ` Tetsuo Handa
  2018-06-07 11:13   ` Michal Hocko
  2018-06-07 11:00 ` [PATCH 3/4] mm,oom: Simplify exception case handling in out_of_memory() Tetsuo Handa
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 10+ messages in thread
From: Tetsuo Handa @ 2018-06-07 11:00 UTC (permalink / raw)
  To: akpm
  Cc: linux-mm, Tetsuo Handa, Michal Hocko, David Rientjes,
	Johannes Weiner, Roman Gushchin, Tejun Heo, Vladimir Davydov

should_reclaim_retry() should be a natural reschedule point. PF_WQ_WORKER
is a special case which needs a stronger rescheduling policy. Doing that
unconditionally seems more straightforward than depending on a zone being
a good candidate for a further reclaim.

Thus, move the short sleep when we are waiting for the owner of oom_lock
(which coincidentally also serves as a guaranteed sleep for PF_WQ_WORKER
threads) to should_reclaim_retry(). Note that it is not evaluated that
whether there is negative side effect with this change. We need to test
both real and artificial workloads for evaluation. You can compare with
and without this patch if you noticed something unexpected.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Tejun Heo <tj@kernel.org>
---
 mm/page_alloc.c | 40 ++++++++++++++++++----------------------
 1 file changed, 18 insertions(+), 22 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e90f152..210a476 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3914,6 +3914,7 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 {
 	struct zone *zone;
 	struct zoneref *z;
+	bool ret = false;
 
 	/*
 	 * Costly allocations might have made a progress but this doesn't mean
@@ -3977,25 +3978,26 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 				}
 			}
 
-			/*
-			 * Memory allocation/reclaim might be called from a WQ
-			 * context and the current implementation of the WQ
-			 * concurrency control doesn't recognize that
-			 * a particular WQ is congested if the worker thread is
-			 * looping without ever sleeping. Therefore we have to
-			 * do a short sleep here rather than calling
-			 * cond_resched().
-			 */
-			if (current->flags & PF_WQ_WORKER)
-				schedule_timeout_uninterruptible(1);
-			else
-				cond_resched();
-
-			return true;
+			ret = true;
+			goto out;
 		}
 	}
 
-	return false;
+out:
+	/*
+	 * Memory allocation/reclaim might be called from a WQ
+	 * context and the current implementation of the WQ
+	 * concurrency control doesn't recognize that
+	 * a particular WQ is congested if the worker thread is
+	 * looping without ever sleeping. Therefore we have to
+	 * do a short sleep here rather than calling
+	 * cond_resched().
+	 */
+	if (current->flags & PF_WQ_WORKER)
+		schedule_timeout_uninterruptible(1);
+	else
+		cond_resched();
+	return ret;
 }
 
 static inline bool
@@ -4237,12 +4239,6 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 	/* Retry as long as the OOM killer is making progress */
 	if (did_some_progress) {
 		no_progress_loops = 0;
-		/*
-		 * This schedule_timeout_*() serves as a guaranteed sleep for
-		 * PF_WQ_WORKER threads when __zone_watermark_ok() == false.
-		 */
-		if (!tsk_is_oom_victim(current))
-			schedule_timeout_uninterruptible(1);
 		goto retry;
 	}
 
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/4] mm,page_alloc: Move the short sleep to should_reclaim_retry()
  2018-06-07 11:00 ` [PATCH 2/4] mm,page_alloc: Move the short sleep to should_reclaim_retry() Tetsuo Handa
@ 2018-06-07 11:13   ` Michal Hocko
  0 siblings, 0 replies; 10+ messages in thread
From: Michal Hocko @ 2018-06-07 11:13 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, linux-mm, David Rientjes, Johannes Weiner, Roman Gushchin,
	Tejun Heo, Vladimir Davydov

On Thu 07-06-18 20:00:21, Tetsuo Handa wrote:
> should_reclaim_retry() should be a natural reschedule point. PF_WQ_WORKER
> is a special case which needs a stronger rescheduling policy. Doing that
> unconditionally seems more straightforward than depending on a zone being
> a good candidate for a further reclaim.
> 
> Thus, move the short sleep when we are waiting for the owner of oom_lock
> (which coincidentally also serves as a guaranteed sleep for PF_WQ_WORKER
> threads) to should_reclaim_retry(). Note that it is not evaluated that
> whether there is negative side effect with this change. We need to test
> both real and artificial workloads for evaluation. You can compare with
> and without this patch if you noticed something unexpected.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Cc: Roman Gushchin <guro@fb.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Tejun Heo <tj@kernel.org>

Your s-o-b is missing here. And I suspect this should be From: /me
but I do not care all that much.

> ---
>  mm/page_alloc.c | 40 ++++++++++++++++++----------------------
>  1 file changed, 18 insertions(+), 22 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index e90f152..210a476 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3914,6 +3914,7 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
>  {
>  	struct zone *zone;
>  	struct zoneref *z;
> +	bool ret = false;
>  
>  	/*
>  	 * Costly allocations might have made a progress but this doesn't mean
> @@ -3977,25 +3978,26 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
>  				}
>  			}
>  
> -			/*
> -			 * Memory allocation/reclaim might be called from a WQ
> -			 * context and the current implementation of the WQ
> -			 * concurrency control doesn't recognize that
> -			 * a particular WQ is congested if the worker thread is
> -			 * looping without ever sleeping. Therefore we have to
> -			 * do a short sleep here rather than calling
> -			 * cond_resched().
> -			 */
> -			if (current->flags & PF_WQ_WORKER)
> -				schedule_timeout_uninterruptible(1);
> -			else
> -				cond_resched();
> -
> -			return true;
> +			ret = true;
> +			goto out;
>  		}
>  	}
>  
> -	return false;
> +out:
> +	/*
> +	 * Memory allocation/reclaim might be called from a WQ
> +	 * context and the current implementation of the WQ
> +	 * concurrency control doesn't recognize that
> +	 * a particular WQ is congested if the worker thread is
> +	 * looping without ever sleeping. Therefore we have to
> +	 * do a short sleep here rather than calling
> +	 * cond_resched().
> +	 */
> +	if (current->flags & PF_WQ_WORKER)
> +		schedule_timeout_uninterruptible(1);
> +	else
> +		cond_resched();
> +	return ret;
>  }
>  
>  static inline bool
> @@ -4237,12 +4239,6 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
>  	/* Retry as long as the OOM killer is making progress */
>  	if (did_some_progress) {
>  		no_progress_loops = 0;
> -		/*
> -		 * This schedule_timeout_*() serves as a guaranteed sleep for
> -		 * PF_WQ_WORKER threads when __zone_watermark_ok() == false.
> -		 */
> -		if (!tsk_is_oom_victim(current))
> -			schedule_timeout_uninterruptible(1);
>  		goto retry;
>  	}
>  
> -- 
> 1.8.3.1
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 3/4] mm,oom: Simplify exception case handling in out_of_memory().
  2018-06-07 11:00 [PATCH 1/4] mm,oom: Don't call schedule_timeout_killable() with oom_lock held Tetsuo Handa
  2018-06-07 11:00 ` [PATCH 2/4] mm,page_alloc: Move the short sleep to should_reclaim_retry() Tetsuo Handa
@ 2018-06-07 11:00 ` Tetsuo Handa
  2018-06-07 11:16   ` Michal Hocko
  2018-06-22 18:59   ` David Rientjes
  2018-06-07 11:00 ` [PATCH 4/4] mm,oom: Check pending victims earlier " Tetsuo Handa
  2018-06-07 11:11 ` [PATCH 1/4] mm,oom: Don't call schedule_timeout_killable() with oom_lock held Michal Hocko
  3 siblings, 2 replies; 10+ messages in thread
From: Tetsuo Handa @ 2018-06-07 11:00 UTC (permalink / raw)
  To: akpm
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Tejun Heo, Vladimir Davydov

To avoid oversights when adding the "mm, oom: cgroup-aware OOM killer"
patchset, simplify the exception case handling in out_of_memory().
This patch makes no functional changes.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Tejun Heo <tj@kernel.org>
---
 mm/oom_kill.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 23ce67f..5a6f1b1 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1073,15 +1073,18 @@ bool out_of_memory(struct oom_control *oc)
 	}
 
 	select_bad_process(oc);
+	if (oc->chosen == (void *)-1UL)
+		return true;
 	/* Found nothing?!?! Either we hang forever, or we panic. */
-	if (!oc->chosen && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) {
+	if (!oc->chosen) {
+		if (is_sysrq_oom(oc) || is_memcg_oom(oc))
+			return false;
 		dump_header(oc, NULL);
 		panic("Out of memory and no killable processes...\n");
 	}
-	if (oc->chosen && oc->chosen != (void *)-1UL)
-		oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" :
-				 "Memory cgroup out of memory");
-	return !!oc->chosen;
+	oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" :
+			 "Memory cgroup out of memory");
+	return true;
 }
 
 /*
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 3/4] mm,oom: Simplify exception case handling in out_of_memory().
  2018-06-07 11:00 ` [PATCH 3/4] mm,oom: Simplify exception case handling in out_of_memory() Tetsuo Handa
@ 2018-06-07 11:16   ` Michal Hocko
  2018-06-22 18:59   ` David Rientjes
  1 sibling, 0 replies; 10+ messages in thread
From: Michal Hocko @ 2018-06-07 11:16 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, linux-mm, David Rientjes, Johannes Weiner, Roman Gushchin,
	Tejun Heo, Vladimir Davydov

On Thu 07-06-18 20:00:22, Tetsuo Handa wrote:
> To avoid oversights when adding the "mm, oom: cgroup-aware OOM killer"
> patchset, simplify the exception case handling in out_of_memory().
> This patch makes no functional changes.
> 
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Cc: Roman Gushchin <guro@fb.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Tejun Heo <tj@kernel.org>

Acked-by: Michal Hocko <mhocko@suse.com>

with a minor nit

> ---
>  mm/oom_kill.c | 13 ++++++++-----
>  1 file changed, 8 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 23ce67f..5a6f1b1 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -1073,15 +1073,18 @@ bool out_of_memory(struct oom_control *oc)
>  	}
>  
>  	select_bad_process(oc);
> +	if (oc->chosen == (void *)-1UL)

I think this one deserves a comment.
	/* There is an inflight oom victim *.

> +		return true;
>  	/* Found nothing?!?! Either we hang forever, or we panic. */
> -	if (!oc->chosen && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) {
> +	if (!oc->chosen) {
> +		if (is_sysrq_oom(oc) || is_memcg_oom(oc))
> +			return false;
>  		dump_header(oc, NULL);
>  		panic("Out of memory and no killable processes...\n");
>  	}
> -	if (oc->chosen && oc->chosen != (void *)-1UL)
> -		oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" :
> -				 "Memory cgroup out of memory");
> -	return !!oc->chosen;
> +	oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" :
> +			 "Memory cgroup out of memory");
> +	return true;
>  }
>  
>  /*
> -- 
> 1.8.3.1
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 3/4] mm,oom: Simplify exception case handling in out_of_memory().
  2018-06-07 11:00 ` [PATCH 3/4] mm,oom: Simplify exception case handling in out_of_memory() Tetsuo Handa
  2018-06-07 11:16   ` Michal Hocko
@ 2018-06-22 18:59   ` David Rientjes
  1 sibling, 0 replies; 10+ messages in thread
From: David Rientjes @ 2018-06-22 18:59 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, linux-mm, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Tejun Heo, Vladimir Davydov

On Thu, 7 Jun 2018, Tetsuo Handa wrote:

> To avoid oversights when adding the "mm, oom: cgroup-aware OOM killer"
> patchset, simplify the exception case handling in out_of_memory().
> This patch makes no functional changes.
> 
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Cc: Roman Gushchin <guro@fb.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Tejun Heo <tj@kernel.org>

Acked-by: David Rientjes <rientjes@google.com>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 4/4] mm,oom: Check pending victims earlier in out_of_memory().
  2018-06-07 11:00 [PATCH 1/4] mm,oom: Don't call schedule_timeout_killable() with oom_lock held Tetsuo Handa
  2018-06-07 11:00 ` [PATCH 2/4] mm,page_alloc: Move the short sleep to should_reclaim_retry() Tetsuo Handa
  2018-06-07 11:00 ` [PATCH 3/4] mm,oom: Simplify exception case handling in out_of_memory() Tetsuo Handa
@ 2018-06-07 11:00 ` Tetsuo Handa
  2018-06-07 11:28   ` Michal Hocko
  2018-06-07 11:11 ` [PATCH 1/4] mm,oom: Don't call schedule_timeout_killable() with oom_lock held Michal Hocko
  3 siblings, 1 reply; 10+ messages in thread
From: Tetsuo Handa @ 2018-06-07 11:00 UTC (permalink / raw)
  To: akpm
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Tejun Heo, Vladimir Davydov

The "mm, oom: cgroup-aware OOM killer" patchset is trying to introduce
INFLIGHT_VICTIM in order to replace open-coded ((void *)-1UL). But
(regarding CONFIG_MMU=y case) we have a list of inflight OOM victim
threads which are connected to oom_reaper_list. Thus we can check
whether there are inflight OOM victims before starting process/memcg
list traversal. Since it is likely that only few threads are linked to
oom_reaper_list, checking all victims' OOM domain will not matter.

Thus, check whether there are inflight OOM victims before starting
process/memcg list traversal and eliminate the "abort" path.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Tejun Heo <tj@kernel.org>
---
 include/linux/memcontrol.h |  9 ++---
 mm/memcontrol.c            | 18 +++------
 mm/oom_kill.c              | 99 +++++++++++++++++++++++++---------------------
 3 files changed, 64 insertions(+), 62 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index d99b71b..b9ee4f8 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -355,8 +355,8 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *,
 				   struct mem_cgroup *,
 				   struct mem_cgroup_reclaim_cookie *);
 void mem_cgroup_iter_break(struct mem_cgroup *, struct mem_cgroup *);
-int mem_cgroup_scan_tasks(struct mem_cgroup *,
-			  int (*)(struct task_struct *, void *), void *);
+void mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
+			   void (*fn)(struct task_struct *, void *), void *arg);
 
 static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg)
 {
@@ -806,10 +806,9 @@ static inline void mem_cgroup_iter_break(struct mem_cgroup *root,
 {
 }
 
-static inline int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
-		int (*fn)(struct task_struct *, void *), void *arg)
+static inline void mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
+		void (*fn)(struct task_struct *, void *), void *arg)
 {
-	return 0;
 }
 
 static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1695f38..ba52fce 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -884,17 +884,14 @@ static void invalidate_reclaim_iterators(struct mem_cgroup *dead_memcg)
  * @arg: argument passed to @fn
  *
  * This function iterates over tasks attached to @memcg or to any of its
- * descendants and calls @fn for each task. If @fn returns a non-zero
- * value, the function breaks the iteration loop and returns the value.
- * Otherwise, it will iterate over all tasks and return 0.
+ * descendants and calls @fn for each task.
  *
  * This function must not be called for the root memory cgroup.
  */
-int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
-			  int (*fn)(struct task_struct *, void *), void *arg)
+void mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
+			   void (*fn)(struct task_struct *, void *), void *arg)
 {
 	struct mem_cgroup *iter;
-	int ret = 0;
 
 	BUG_ON(memcg == root_mem_cgroup);
 
@@ -903,15 +900,10 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
 		struct task_struct *task;
 
 		css_task_iter_start(&iter->css, 0, &it);
-		while (!ret && (task = css_task_iter_next(&it)))
-			ret = fn(task, arg);
+		while ((task = css_task_iter_next(&it)))
+			fn(task, arg);
 		css_task_iter_end(&it);
-		if (ret) {
-			mem_cgroup_iter_break(memcg, iter);
-			break;
-		}
 	}
-	return ret;
 }
 
 /**
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 5a6f1b1..0452a70 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -304,25 +304,13 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc)
 	return CONSTRAINT_NONE;
 }
 
-static int oom_evaluate_task(struct task_struct *task, void *arg)
+static void oom_evaluate_task(struct task_struct *task, void *arg)
 {
 	struct oom_control *oc = arg;
 	unsigned long points;
 
 	if (oom_unkillable_task(task, NULL, oc->nodemask))
-		goto next;
-
-	/*
-	 * This task already has access to memory reserves and is being killed.
-	 * Don't allow any other task to have access to the reserves unless
-	 * the task has MMF_OOM_SKIP because chances that it would release
-	 * any memory is quite low.
-	 */
-	if (!is_sysrq_oom(oc) && tsk_is_oom_victim(task)) {
-		if (test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags))
-			goto next;
-		goto abort;
-	}
+		return;
 
 	/*
 	 * If task is allocating a lot of memory and has been marked to be
@@ -335,29 +323,22 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
 
 	points = oom_badness(task, NULL, oc->nodemask, oc->totalpages);
 	if (!points || points < oc->chosen_points)
-		goto next;
+		return;
 
 	/* Prefer thread group leaders for display purposes */
 	if (points == oc->chosen_points && thread_group_leader(oc->chosen))
-		goto next;
+		return;
 select:
 	if (oc->chosen)
 		put_task_struct(oc->chosen);
 	get_task_struct(task);
 	oc->chosen = task;
 	oc->chosen_points = points;
-next:
-	return 0;
-abort:
-	if (oc->chosen)
-		put_task_struct(oc->chosen);
-	oc->chosen = (void *)-1UL;
-	return 1;
 }
 
 /*
  * Simple selection loop. We choose the process with the highest number of
- * 'points'. In case scan was aborted, oc->chosen is set to -1.
+ * 'points'.
  */
 static void select_bad_process(struct oom_control *oc)
 {
@@ -368,8 +349,7 @@ static void select_bad_process(struct oom_control *oc)
 
 		rcu_read_lock();
 		for_each_process(p)
-			if (oom_evaluate_task(p, oc))
-				break;
+			oom_evaluate_task(p, oc);
 		rcu_read_unlock();
 	}
 
@@ -476,7 +456,6 @@ bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
  */
 static struct task_struct *oom_reaper_th;
 static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
-static struct task_struct *oom_reaper_list;
 static DEFINE_SPINLOCK(oom_reaper_lock);
 
 /*
@@ -606,14 +585,16 @@ static void oom_reap_task(struct task_struct *tsk)
 	debug_show_all_locks();
 
 done:
-	tsk->oom_reaper_list = NULL;
-
 	/*
 	 * Hide this mm from OOM killer because it has been either reaped or
 	 * somebody can't call up_write(mmap_sem).
 	 */
 	set_bit(MMF_OOM_SKIP, &mm->flags);
 
+	spin_lock(&oom_reaper_lock);
+	oom_reaper_th->oom_reaper_list = tsk->oom_reaper_list;
+	spin_unlock(&oom_reaper_lock);
+
 	/* Drop a reference taken by wake_oom_reaper */
 	put_task_struct(tsk);
 }
@@ -621,14 +602,12 @@ static void oom_reap_task(struct task_struct *tsk)
 static int oom_reaper(void *unused)
 {
 	while (true) {
-		struct task_struct *tsk = NULL;
+		struct task_struct *tsk;
 
-		wait_event_freezable(oom_reaper_wait, oom_reaper_list != NULL);
+		wait_event_freezable(oom_reaper_wait,
+				     oom_reaper_th->oom_reaper_list != NULL);
 		spin_lock(&oom_reaper_lock);
-		if (oom_reaper_list != NULL) {
-			tsk = oom_reaper_list;
-			oom_reaper_list = tsk->oom_reaper_list;
-		}
+		tsk = oom_reaper_th->oom_reaper_list;
 		spin_unlock(&oom_reaper_lock);
 
 		if (tsk)
@@ -640,15 +619,18 @@ static int oom_reaper(void *unused)
 
 static void wake_oom_reaper(struct task_struct *tsk)
 {
-	/* tsk is already queued? */
-	if (tsk == oom_reaper_list || tsk->oom_reaper_list)
-		return;
-
-	get_task_struct(tsk);
+	struct task_struct *p = oom_reaper_th;
 
 	spin_lock(&oom_reaper_lock);
-	tsk->oom_reaper_list = oom_reaper_list;
-	oom_reaper_list = tsk;
+	while (p != tsk && p->oom_reaper_list)
+		p = p->oom_reaper_list;
+	if (p == tsk) {
+		spin_unlock(&oom_reaper_lock);
+		return;
+	}
+	p->oom_reaper_list = tsk;
+	tsk->oom_reaper_list = NULL;
+	get_task_struct(tsk);
 	spin_unlock(&oom_reaper_lock);
 	trace_wake_reaper(tsk->pid);
 	wake_up(&oom_reaper_wait);
@@ -1010,6 +992,34 @@ int unregister_oom_notifier(struct notifier_block *nb)
 }
 EXPORT_SYMBOL_GPL(unregister_oom_notifier);
 
+static bool oom_has_pending_victims(struct oom_control *oc)
+{
+#ifdef CONFIG_MMU
+	struct task_struct *p;
+
+	if (is_sysrq_oom(oc))
+		return false;
+	/*
+	 * Since oom_reap_task_mm()/exit_mmap() will set MMF_OOM_SKIP, let's
+	 * wait for pending victims until MMF_OOM_SKIP is set.
+	 */
+	spin_lock(&oom_reaper_lock);
+	for (p = oom_reaper_th; p; p = p->oom_reaper_list)
+		if (!oom_unkillable_task(p, oc->memcg, oc->nodemask) &&
+		    !test_bit(MMF_OOM_SKIP, &p->signal->oom_mm->flags))
+			break;
+	spin_unlock(&oom_reaper_lock);
+	return p != NULL;
+#else
+	/*
+	 * Since nobody except oom_kill_process() sets MMF_OOM_SKIP, waiting
+	 * for pending victims until MMF_OOM_SKIP is set is useless. Therefore,
+	 * simply let the OOM killer select pending victims again.
+	 */
+	return false;
+#endif
+}
+
 /**
  * out_of_memory - kill the "best" process when we run out of memory
  * @oc: pointer to struct oom_control
@@ -1063,6 +1073,9 @@ bool out_of_memory(struct oom_control *oc)
 		oc->nodemask = NULL;
 	check_panic_on_oom(oc, constraint);
 
+	if (oom_has_pending_victims(oc))
+		return true;
+
 	if (!is_memcg_oom(oc) && sysctl_oom_kill_allocating_task &&
 	    current->mm && !oom_unkillable_task(current, NULL, oc->nodemask) &&
 	    current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
@@ -1073,8 +1086,6 @@ bool out_of_memory(struct oom_control *oc)
 	}
 
 	select_bad_process(oc);
-	if (oc->chosen == (void *)-1UL)
-		return true;
 	/* Found nothing?!?! Either we hang forever, or we panic. */
 	if (!oc->chosen) {
 		if (is_sysrq_oom(oc) || is_memcg_oom(oc))
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 4/4] mm,oom: Check pending victims earlier in out_of_memory().
  2018-06-07 11:00 ` [PATCH 4/4] mm,oom: Check pending victims earlier " Tetsuo Handa
@ 2018-06-07 11:28   ` Michal Hocko
  0 siblings, 0 replies; 10+ messages in thread
From: Michal Hocko @ 2018-06-07 11:28 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, linux-mm, David Rientjes, Johannes Weiner, Roman Gushchin,
	Tejun Heo, Vladimir Davydov

On Thu 07-06-18 20:00:23, Tetsuo Handa wrote:
> The "mm, oom: cgroup-aware OOM killer" patchset is trying to introduce
> INFLIGHT_VICTIM in order to replace open-coded ((void *)-1UL). But
> (regarding CONFIG_MMU=y case) we have a list of inflight OOM victim
> threads which are connected to oom_reaper_list. Thus we can check
> whether there are inflight OOM victims before starting process/memcg
> list traversal. Since it is likely that only few threads are linked to
> oom_reaper_list, checking all victims' OOM domain will not matter.
> 
> Thus, check whether there are inflight OOM victims before starting
> process/memcg list traversal and eliminate the "abort" path.

OK, this looks like a nice shortcut. I am quite surprise that all your
NOMMU concerns are gone now while you clearly regress that case because
inflight victims are not detected anymore AFAICS. Not that I care all
that much, just sayin'.

Anyway, I would suggest splitting this into two patches. One to add an
early check for inflight oom victims and one removing the detection from
oom_evaluate_task. Just to make it easier to revert if somebody on nommu
actually notices a regression.

> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Cc: Roman Gushchin <guro@fb.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Tejun Heo <tj@kernel.org>

[...]
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/4] mm,oom: Don't call schedule_timeout_killable() with oom_lock held.
  2018-06-07 11:00 [PATCH 1/4] mm,oom: Don't call schedule_timeout_killable() with oom_lock held Tetsuo Handa
                   ` (2 preceding siblings ...)
  2018-06-07 11:00 ` [PATCH 4/4] mm,oom: Check pending victims earlier " Tetsuo Handa
@ 2018-06-07 11:11 ` Michal Hocko
  2018-06-08 10:47   ` Tetsuo Handa
  3 siblings, 1 reply; 10+ messages in thread
From: Michal Hocko @ 2018-06-07 11:11 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, linux-mm, David Rientjes, Johannes Weiner, Roman Gushchin,
	Tejun Heo, Vladimir Davydov

On Thu 07-06-18 20:00:20, Tetsuo Handa wrote:
[...]
> @@ -4238,6 +4237,12 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
>  	/* Retry as long as the OOM killer is making progress */
>  	if (did_some_progress) {
>  		no_progress_loops = 0;
> +		/*
> +		 * This schedule_timeout_*() serves as a guaranteed sleep for
> +		 * PF_WQ_WORKER threads when __zone_watermark_ok() == false.
> +		 */
> +		if (!tsk_is_oom_victim(current))
> +			schedule_timeout_uninterruptible(1);
>  		goto retry;
>  	}

Nacked-by: Michal Hocko <mhocko@suse.com>

as explainaed several times already. This moving code just to preserve
the current logic without any arguments to back them must stop finally.
We have way too much of this "just in case" code that nobody really
understands and others just pile on top. Seriously this is not how the
development should work.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/4] mm,oom: Don't call schedule_timeout_killable() with oom_lock held.
  2018-06-07 11:11 ` [PATCH 1/4] mm,oom: Don't call schedule_timeout_killable() with oom_lock held Michal Hocko
@ 2018-06-08 10:47   ` Tetsuo Handa
  0 siblings, 0 replies; 10+ messages in thread
From: Tetsuo Handa @ 2018-06-08 10:47 UTC (permalink / raw)
  To: Michal Hocko
  Cc: akpm, linux-mm, David Rientjes, Johannes Weiner, Roman Gushchin,
	Tejun Heo, Vladimir Davydov

On 2018/06/07 20:28, Michal Hocko wrote:
> On Thu 07-06-18 20:00:23, Tetsuo Handa wrote:
> OK, this looks like a nice shortcut. I am quite surprise that all your
> NOMMU concerns are gone now while you clearly regress that case because
> inflight victims are not detected anymore AFAICS. Not that I care all
> that much, just sayin'.
> 
> Anyway, I would suggest splitting this into two patches. One to add an
> early check for inflight oom victims and one removing the detection from
> oom_evaluate_task. Just to make it easier to revert if somebody on nommu
> actually notices a regression.

Sure. Making it easier to revert is a good thing.
But this patch comes after PATCH 3/4. Need to solve previous problem.

On 2018/06/07 20:16, Michal Hocko wrote:
> Acked-by: Michal Hocko <mhocko@suse.com>
> 
> with a minor nit
> 
>> ---
>>  mm/oom_kill.c | 13 ++++++++-----
>>  1 file changed, 8 insertions(+), 5 deletions(-)
>>
>> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
>> index 23ce67f..5a6f1b1 100644
>> --- a/mm/oom_kill.c
>> +++ b/mm/oom_kill.c
>> @@ -1073,15 +1073,18 @@ bool out_of_memory(struct oom_control *oc)
>>  	}
>>  
>>  	select_bad_process(oc);
>> +	if (oc->chosen == (void *)-1UL)
> 
> I think this one deserves a comment.
> 	/* There is an inflight oom victim *.
> 
>> +		return true;

OK. Though, this change will be reverted by PATCH 4/4.
But this patch comes after PATCH 1/4. Need to solve previous problem.

On 2018/06/07 20:13, Michal Hocko wrote:
> On Thu 07-06-18 20:00:21, Tetsuo Handa wrote:
> 
> Your s-o-b is missing here. And I suspect this should be From: /me
> but I do not care all that much.

How can I do that? Forge the From: line (assuming that mail server does not
reject forged From: line)?

But I am quite surprised that you did not respond PATCH 2/4 with Nacked-by:
because

On 2018/06/07 20:11, Michal Hocko wrote:
> On Thu 07-06-18 20:00:20, Tetsuo Handa wrote:
> [...]
>> @@ -4238,6 +4237,12 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
>>  	/* Retry as long as the OOM killer is making progress */
>>  	if (did_some_progress) {
>>  		no_progress_loops = 0;
>> +		/*
>> +		 * This schedule_timeout_*() serves as a guaranteed sleep for
>> +		 * PF_WQ_WORKER threads when __zone_watermark_ok() == false.
>> +		 */
>> +		if (!tsk_is_oom_victim(current))
>> +			schedule_timeout_uninterruptible(1);
>>  		goto retry;
>>  	}
> 
> Nacked-by: Michal Hocko <mhocko@suse.com>
> 
> as explainaed several times already. This moving code just to preserve
> the current logic without any arguments to back them must stop finally.
> We have way too much of this "just in case" code that nobody really
> understands and others just pile on top. Seriously this is not how the
> development should work.
> 

I am purposely splitting into PATCH 1/4 and PATCH 2/4 in order to make it
easier to revert (like you suggested doing so for PATCH 4/4) in case
somebody actually notices an unexpected side effect.

PATCH 1/4 is doing logically correct thing, no matter how you hate
the short sleep which will be removed by PATCH 2/4.

PATCH 1/4 is proven to be safe. But PATCH 2/4 is not tested to be safe.
PATCH 1/4 is safe for stable. But 2/4 might not be safe for stable.
Therefore, I insist on two separate patches.

If you can accept what PATCH 1/4 + PATCH 2/4 are doing, you are free to post
one squashed patch.

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2018-06-22 19:00 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-06-07 11:00 [PATCH 1/4] mm,oom: Don't call schedule_timeout_killable() with oom_lock held Tetsuo Handa
2018-06-07 11:00 ` [PATCH 2/4] mm,page_alloc: Move the short sleep to should_reclaim_retry() Tetsuo Handa
2018-06-07 11:13   ` Michal Hocko
2018-06-07 11:00 ` [PATCH 3/4] mm,oom: Simplify exception case handling in out_of_memory() Tetsuo Handa
2018-06-07 11:16   ` Michal Hocko
2018-06-22 18:59   ` David Rientjes
2018-06-07 11:00 ` [PATCH 4/4] mm,oom: Check pending victims earlier " Tetsuo Handa
2018-06-07 11:28   ` Michal Hocko
2018-06-07 11:11 ` [PATCH 1/4] mm,oom: Don't call schedule_timeout_killable() with oom_lock held Michal Hocko
2018-06-08 10:47   ` Tetsuo Handa

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox