* [PATCH] mm: allow exiting processes to exceed the memory.max limit
@ 2024-12-09 17:42 Rik van Riel
  2024-12-09 18:08 ` Michal Hocko
  0 siblings, 1 reply; 5+ messages in thread
From: Rik van Riel @ 2024-12-09 17:42 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: kernel-team, linux-kernel, linux-mm, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	cgroups

It is possible for programs to get stuck in exit, when their
memcg is at or above the memory.max limit, and things like
the do_futex() call from mm_release() need to page memory in.

This can hang forever, but it really doesn't have to.

The amount of memory that the exit path needs to page in
should be relatively small, and letting exit proceed faster
will free up memory faster.

Allow PF_EXITING tasks to bypass the cgroup memory.max limit
the same way PF_MEMALLOC already does.

Signed-off-by: Rik van Riel <riel@surriel.com>
---
 mm/memcontrol.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7b3503d12aaf..d1abef1138ff 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2218,11 +2218,12 @@ int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 
 	/*
 	 * Prevent unbounded recursion when reclaim operations need to
-	 * allocate memory. This might exceed the limits temporarily,
-	 * but we prefer facilitating memory reclaim and getting back
-	 * under the limit over triggering OOM kills in these cases.
+	 * allocate memory, or the process is exiting. This might exceed
+	 * the limits temporarily, but we prefer facilitating memory reclaim
+	 * and getting back under the limit over triggering OOM kills in
+	 * these cases.
 	 */
-	if (unlikely(current->flags & PF_MEMALLOC))
+	if (unlikely(current->flags & (PF_MEMALLOC | PF_EXITING)))
 		goto force;
 
 	if (unlikely(task_in_memcg_oom(current)))
-- 
2.47.0




* Re: [PATCH] mm: allow exiting processes to exceed the memory.max limit
  2024-12-09 17:42 [PATCH] mm: allow exiting processes to exceed the memory.max limit Rik van Riel
@ 2024-12-09 18:08 ` Michal Hocko
  2024-12-09 20:00   ` Rik van Riel
                     ` (2 more replies)
  0 siblings, 3 replies; 5+ messages in thread
From: Michal Hocko @ 2024-12-09 18:08 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Johannes Weiner, kernel-team, linux-kernel, linux-mm,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	cgroups

On Mon 09-12-24 12:42:33, Rik van Riel wrote:
> It is possible for programs to get stuck in exit, when their
> memcg is at or above the memory.max limit, and things like
> the do_futex() call from mm_release() need to page memory in.
> 
> This can hang forever, but it really doesn't have to.

Are you sure this is really happening?

> 
> The amount of memory that the exit path needs to page in
> should be relatively small, and letting exit proceed faster
> will free up memory faster.
> 
> Allow PF_EXITING tasks to bypass the cgroup memory.max limit
> the same way PF_MEMALLOC already does.
> 
> Signed-off-by: Rik van Riel <riel@surriel.com>
> ---
>  mm/memcontrol.c | 9 +++++----
>  1 file changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 7b3503d12aaf..d1abef1138ff 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2218,11 +2218,12 @@ int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  
>  	/*
>  	 * Prevent unbounded recursion when reclaim operations need to
> -	 * allocate memory. This might exceed the limits temporarily,
> -	 * but we prefer facilitating memory reclaim and getting back
> -	 * under the limit over triggering OOM kills in these cases.
> +	 * allocate memory, or the process is exiting. This might exceed
> +	 * the limits temporarily, but we prefer facilitating memory reclaim
> +	 * and getting back under the limit over triggering OOM kills in
> +	 * these cases.
>  	 */
> -	if (unlikely(current->flags & PF_MEMALLOC))
> +	if (unlikely(current->flags & (PF_MEMALLOC | PF_EXITING)))
>  		goto force;

We already have the task_is_dying() bail out. Why is that insufficient?
It currently only triggers once an oom situation has been reached, while
your patch triggers this much earlier. We used to do that in the past,
but this got changed by a4ebf1b6ca1e ("memcg: prohibit unconditional
exceeding the limit of dying tasks"). I believe the situation in vmalloc
has changed since then, but I suspect the fundamental problem remains:
dying tasks can still allocate a lot of memory.
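
For reference, that bail out in try_charge_memcg() currently looks
roughly like this (quoted from memory, so treat it as a sketch rather
than the exact source):

	/* Avoid endless loop for tasks bypassed by the oom killer */
	if (passed_oom && task_is_dying())
		goto nomem;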

There is still this
:     It has been observed that it is not really hard to trigger these
:     bypasses and cause global OOM situation.
that really needs to be re-evaluated.
-- 
Michal Hocko
SUSE Labs



* Re: [PATCH] mm: allow exiting processes to exceed the memory.max limit
  2024-12-09 18:08 ` Michal Hocko
@ 2024-12-09 20:00   ` Rik van Riel
  2024-12-11 16:28   ` Rik van Riel
  2024-12-12 20:45   ` Johannes Weiner
  2 siblings, 0 replies; 5+ messages in thread
From: Rik van Riel @ 2024-12-09 20:00 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, kernel-team, linux-kernel, linux-mm,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	cgroups

On Mon, 2024-12-09 at 19:08 +0100, Michal Hocko wrote:
> On Mon 09-12-24 12:42:33, Rik van Riel wrote:
> > It is possible for programs to get stuck in exit, when their
> > memcg is at or above the memory.max limit, and things like
> > the do_futex() call from mm_release() need to page memory in.
> > 
> > This can hang forever, but it really doesn't have to.
> 
> Are you sure this is really happening?

It turns out it wasn't really forever.

After about a day, the zombie task I was bpftracing
to figure out exactly what was going wrong finally
succeeded in exiting.

I got as far as seeing try_to_free_mem_cgroup_pages() return 0
many times in a row while looping in try_charge_memcg(), which
occasionally returned -ENOMEM to the caller, who then retried
several times.

Each invocation of try_to_free_mem_cgroup_pages() also saw
a large number of unsuccessful calls to shrink_folio_list().

It looks like what might be happening instead is that
faultin_page() returns 0 after getting back VM_FAULT_OOM
from handle_mm_fault(), causing __get_user_pages() to loop.
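
The retry has roughly this shape (an illustrative sketch from my
reading of mm/gup.c, with signatures and error cases simplified,
not the literal source):

	/* Simplified sketch of the __get_user_pages() inner loop. */
retry:
	page = follow_page_mask(vma, start, foll_flags, &ctx);
	if (!page) {
		ret = faultin_page(vma, start, &foll_flags, ...);
		if (ret == 0)
			goto retry;	/* fault "handled", try again */
		goto out;		/* a real error ends the loop */
	}

If faultin_page() turns VM_FAULT_OOM into a 0 return, nothing ever
breaks out of that retry while the memcg stays at its limit.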

Let me send a patch to fix that, instead!

-- 
All Rights Reversed.



* Re: [PATCH] mm: allow exiting processes to exceed the memory.max limit
  2024-12-09 18:08 ` Michal Hocko
  2024-12-09 20:00   ` Rik van Riel
@ 2024-12-11 16:28   ` Rik van Riel
  2024-12-12 20:45   ` Johannes Weiner
  2 siblings, 0 replies; 5+ messages in thread
From: Rik van Riel @ 2024-12-11 16:28 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, kernel-team, linux-kernel, linux-mm,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	cgroups

On Mon, 2024-12-09 at 19:08 +0100, Michal Hocko wrote:
> On Mon 09-12-24 12:42:33, Rik van Riel wrote:
> > It is possible for programs to get stuck in exit, when their
> > memcg is at or above the memory.max limit, and things like
> > the do_futex() call from mm_release() need to page memory in.
> > 
> > This can hang forever, but it really doesn't have to.
> 
> Are you sure this is really happening?

The hang is happening, albeit not forever: exit was
taking hours before finally completing.

However, the fix may be to simply allow the exiting task
to bypass the "zswap no writeback" setting and write some
of its own cgroup's memory to swap to get out of the
livelock:

https://lkml.org/lkml/2024/12/11/10102
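
The direction is something like this (an illustrative sketch of the
idea rather than the patch itself; the PF_EXITING check is the
hypothetical addition to the existing helper):

	bool mem_cgroup_zswap_writeback_enabled(struct mem_cgroup *memcg)
	{
		/* Sketch: let exiting tasks write back to real swap. */
		if (unlikely(current->flags & PF_EXITING))
			return true;

		return !memcg || READ_ONCE(memcg->zswap_writeback);
	}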

-- 
All Rights Reversed.



* Re: [PATCH] mm: allow exiting processes to exceed the memory.max limit
  2024-12-09 18:08 ` Michal Hocko
  2024-12-09 20:00   ` Rik van Riel
  2024-12-11 16:28   ` Rik van Riel
@ 2024-12-12 20:45   ` Johannes Weiner
  2 siblings, 0 replies; 5+ messages in thread
From: Johannes Weiner @ 2024-12-12 20:45 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Rik van Riel, kernel-team, linux-kernel, linux-mm,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	cgroups

On Mon, Dec 09, 2024 at 07:08:19PM +0100, Michal Hocko wrote:
> On Mon 09-12-24 12:42:33, Rik van Riel wrote:
> > It is possible for programs to get stuck in exit, when their
> > memcg is at or above the memory.max limit, and things like
> > the do_futex() call from mm_release() need to page memory in.
> > 
> > This can hang forever, but it really doesn't have to.
> 
> Are you sure this is really happening?
> 
> > 
> > The amount of memory that the exit path needs to page in
> > should be relatively small, and letting exit proceed faster
> > will free up memory faster.
> > 
> > Allow PF_EXITING tasks to bypass the cgroup memory.max limit
> > the same way PF_MEMALLOC already does.
> > 
> > Signed-off-by: Rik van Riel <riel@surriel.com>
> > ---
> >  mm/memcontrol.c | 9 +++++----
> >  1 file changed, 5 insertions(+), 4 deletions(-)
> > 
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 7b3503d12aaf..d1abef1138ff 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -2218,11 +2218,12 @@ int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
> >  
> >  	/*
> >  	 * Prevent unbounded recursion when reclaim operations need to
> > -	 * allocate memory. This might exceed the limits temporarily,
> > -	 * but we prefer facilitating memory reclaim and getting back
> > -	 * under the limit over triggering OOM kills in these cases.
> > +	 * allocate memory, or the process is exiting. This might exceed
> > +	 * the limits temporarily, but we prefer facilitating memory reclaim
> > +	 * and getting back under the limit over triggering OOM kills in
> > +	 * these cases.
> >  	 */
> > -	if (unlikely(current->flags & PF_MEMALLOC))
> > +	if (unlikely(current->flags & (PF_MEMALLOC | PF_EXITING)))
> >  		goto force;
> 
> We already have the task_is_dying() bail out. Why is that insufficient?

Note that the current one goes to nomem, which causes the fault to
simply retry. It doesn't actually make forward progress.

> It currently only triggers once an oom situation has been reached, while
> your patch triggers this much earlier. We used to do that in the past,
> but this got changed by a4ebf1b6ca1e ("memcg: prohibit unconditional
> exceeding the limit of dying tasks"). I believe the situation in vmalloc
> has changed since then, but I suspect the fundamental problem remains:
> dying tasks can still allocate a lot of memory.

Before that patch, *every* exiting task was allowed to bypass. That
doesn't seem right, either. But IMO this patch then tossed the baby
out with the bathwater; at least the OOM victim needs to make progress.

> There is still this
> :     It has been observed that it is not really hard to trigger these
> :     bypasses and cause global OOM situation.
> that really needs to be re-evaluated.

This is quite vague, yeah. And it's not clear whether a single task was
doing this, or a large number of concurrently exiting tasks were all
being allowed to bypass without even trying. I'm guessing the latter, simply
because OOM victims *are* allowed to tap into the page_alloc reserves;
we'd have seen deadlocks if a single task's exit path vmallocing could
blow the lid on these.
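
(For context, the page allocator side looks roughly like this; a
paraphrase of mm/page_alloc.c rather than the exact source:

	static inline int __gfp_pfmemalloc_flags(gfp_t gfp_mask)
	{
		...
		if (!in_interrupt()) {
			if (current->flags & PF_MEMALLOC)
				return ALLOC_NO_WATERMARKS;
			else if (oom_reserves_allowed(current))
				return ALLOC_OOM;
		}
		return 0;
	}

where oom_reserves_allowed() checks tsk_is_oom_victim(). That ALLOC_OOM
access is what lets an OOM victim's exit path keep allocating.)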

I sent a patch in the other thread, we should discuss over there. I
just wanted to address those two points made here.



end of thread

Thread overview: 5+ messages
2024-12-09 17:42 [PATCH] mm: allow exiting processes to exceed the memory.max limit Rik van Riel
2024-12-09 18:08 ` Michal Hocko
2024-12-09 20:00   ` Rik van Riel
2024-12-11 16:28   ` Rik van Riel
2024-12-12 20:45   ` Johannes Weiner
