* Re: [PATCH v2] memcg: allow exiting tasks to write back data to swap
2024-12-12 18:30 ` Johannes Weiner
@ 2024-12-12 21:35 ` Shakeel Butt
2024-12-12 21:41 ` Yosry Ahmed
2024-12-13 0:32 ` Roman Gushchin
2024-12-16 15:39 ` Michal Hocko
2 siblings, 1 reply; 29+ messages in thread
From: Shakeel Butt @ 2024-12-12 21:35 UTC (permalink / raw)
To: Johannes Weiner
Cc: Yosry Ahmed, Rik van Riel, Balbir Singh, Michal Hocko,
Roman Gushchin, Muchun Song, Andrew Morton, cgroups, linux-mm,
linux-kernel, kernel-team, Nhat Pham
On Thu, Dec 12, 2024 at 01:30:12PM -0500, Johannes Weiner wrote:
> On Thu, Dec 12, 2024 at 09:06:25AM -0800, Yosry Ahmed wrote:
> > On Thu, Dec 12, 2024 at 8:58 AM Rik van Riel <riel@surriel.com> wrote:
> > >
> > > A task already in exit can get stuck trying to allocate pages, if its
> > > cgroup is at the memory.max limit, the cgroup is using zswap, but
> > > zswap writeback is disabled, and the remaining memory in the cgroup is
> > > not compressible.
> > >
> > > This seems like an unlikely confluence of events, but it can happen
> > > quite easily if a cgroup is OOM killed due to exceeding its memory.max
> > > limit, and all the tasks in the cgroup are trying to exit simultaneously.
> > >
> > > When this happens, it can sometimes take hours for tasks to exit,
> > > as they are all trying to squeeze things into zswap to bring the group's
> > > memory consumption below memory.max.
> > >
> > > Allowing these exiting programs to push some memory from their own
> > > cgroup into swap allows them to quickly bring the cgroup's memory
> > > consumption below memory.max, and exit in seconds rather than hours.
> > >
> > > Signed-off-by: Rik van Riel <riel@surriel.com>
> >
> > Thanks for sending a v2.
> >
> > I still think maybe this needs to be fixed on the memcg side, at least
> > by not making exiting tasks try really hard to reclaim memory to the
> > point where this becomes a problem. IIUC there could be other reasons
> > why reclaim may take too long, but maybe not as pathological as this
> > case to be fair. I will let the memcg maintainers chime in for this.
> >
> > If there's a fundamental reason why this cannot be fixed on the memcg
> > side, I don't object to this change.
> >
> > Nhat, any objections on your end? I think your fleet workloads were
> > the first users of this interface. Does this break their expectations?
>
> Yes, I don't think we can do this, unfortunately :( There can be a
> variety of reasons why a user might want to prohibit disk swap for
> a certain cgroup, and we can't assume it's okay to make exceptions.
>
> There might also not *be* any disk swap to overflow into after Nhat's
> virtual swap patches. Presumably zram would still have the issue too.
Very good points.
>
> So I'm also inclined to think this needs a reclaim/memcg-side fix. We
> have a somewhat tumultuous history of policy in that space:
>
> commit 7775face207922ea62a4e96b9cd45abfdc7b9840
> Author: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Date: Tue Mar 5 15:46:47 2019 -0800
>
> memcg: killed threads should not invoke memcg OOM killer
>
> allowed dying tasks to simply force all charges and move on. This
> turned out to be too aggressive; there were instances of exiting,
> uncontained memcg tasks causing global OOMs. This led to that:
>
> commit a4ebf1b6ca1e011289677239a2a361fde4a88076
> Author: Vasily Averin <vasily.averin@linux.dev>
> Date: Fri Nov 5 13:38:09 2021 -0700
>
> memcg: prohibit unconditional exceeding the limit of dying tasks
>
> which reverted the bypass rather thoroughly. Now NO dying tasks, *not
> even OOM victims*, can force charges. I am not sure this is correct,
> either:
>
> If we return -ENOMEM to an OOM victim in a fault, the fault handler
> will re-trigger OOM, which will find the existing OOM victim and do
> nothing, then restart the fault. This is a memory deadlock. The page
> allocator gives OOM victims access to reserves for that reason.
>
> Actually, it looks even worse. For some reason we're not triggering
> OOM from dying tasks:
>
> ret = task_is_dying() || out_of_memory(&oc);
>
> Even though dying tasks are in no way privileged or allowed to exit
> expeditiously. Why shouldn't they trigger the OOM killer like anybody
> else trying to allocate memory?
This is a very good point, and out_of_memory() will actually mark the
dying process as an oom victim and put it on the oom reaper's list, which
should help further in such a situation.
>
> As it stands, it seems we have dying tasks getting trapped in an
> endless fault->reclaim cycle; with no access to the OOM killer and no
> access to reserves. Presumably this is what's going on here?
>
> I think we want something like this:
The following patch looks good to me. Let's test this out (hopefully Rik
will be able to find a live impacted machine) and move forward with this
fix.
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 53db98d2c4a1..be6b6e72bde5 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1596,11 +1596,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> if (mem_cgroup_margin(memcg) >= (1 << order))
> goto unlock;
>
> - /*
> - * A few threads which were not waiting at mutex_lock_killable() can
> - * fail to bail out. Therefore, check again after holding oom_lock.
> - */
> - ret = task_is_dying() || out_of_memory(&oc);
> + ret = out_of_memory(&oc);
>
> unlock:
> mutex_unlock(&oom_lock);
> @@ -2198,6 +2194,9 @@ int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
> if (unlikely(current->flags & PF_MEMALLOC))
> goto force;
>
> + if (unlikely(tsk_is_oom_victim(current)))
> + goto force;
> +
> if (unlikely(task_in_memcg_oom(current)))
> goto nomem;
>
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v2] memcg: allow exiting tasks to write back data to swap
2024-12-12 21:35 ` Shakeel Butt
@ 2024-12-12 21:41 ` Yosry Ahmed
0 siblings, 0 replies; 29+ messages in thread
From: Yosry Ahmed @ 2024-12-12 21:41 UTC (permalink / raw)
To: Shakeel Butt
Cc: Johannes Weiner, Rik van Riel, Balbir Singh, Michal Hocko,
Roman Gushchin, Muchun Song, Andrew Morton, cgroups, linux-mm,
linux-kernel, kernel-team, Nhat Pham
On Thu, Dec 12, 2024 at 1:35 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Thu, Dec 12, 2024 at 01:30:12PM -0500, Johannes Weiner wrote:
> > On Thu, Dec 12, 2024 at 09:06:25AM -0800, Yosry Ahmed wrote:
> > > On Thu, Dec 12, 2024 at 8:58 AM Rik van Riel <riel@surriel.com> wrote:
> > > >
> > > > A task already in exit can get stuck trying to allocate pages, if its
> > > > cgroup is at the memory.max limit, the cgroup is using zswap, but
> > > > zswap writeback is disabled, and the remaining memory in the cgroup is
> > > > not compressible.
> > > >
> > > > This seems like an unlikely confluence of events, but it can happen
> > > > quite easily if a cgroup is OOM killed due to exceeding its memory.max
> > > > limit, and all the tasks in the cgroup are trying to exit simultaneously.
> > > >
> > > > When this happens, it can sometimes take hours for tasks to exit,
> > > > as they are all trying to squeeze things into zswap to bring the group's
> > > > memory consumption below memory.max.
> > > >
> > > > Allowing these exiting programs to push some memory from their own
> > > > cgroup into swap allows them to quickly bring the cgroup's memory
> > > > consumption below memory.max, and exit in seconds rather than hours.
> > > >
> > > > Signed-off-by: Rik van Riel <riel@surriel.com>
> > >
> > > Thanks for sending a v2.
> > >
> > > I still think maybe this needs to be fixed on the memcg side, at least
> > > by not making exiting tasks try really hard to reclaim memory to the
> > > point where this becomes a problem. IIUC there could be other reasons
> > > why reclaim may take too long, but maybe not as pathological as this
> > > case to be fair. I will let the memcg maintainers chime in for this.
> > >
> > > If there's a fundamental reason why this cannot be fixed on the memcg
> > > side, I don't object to this change.
> > >
> > > Nhat, any objections on your end? I think your fleet workloads were
> > > the first users of this interface. Does this break their expectations?
> >
> > Yes, I don't think we can do this, unfortunately :( There can be a
> > variety of reasons why a user might want to prohibit disk swap for
> > a certain cgroup, and we can't assume it's okay to make exceptions.
> >
> > There might also not *be* any disk swap to overflow into after Nhat's
> > virtual swap patches. Presumably zram would still have the issue too.
>
> Very good points.
>
> >
> > So I'm also inclined to think this needs a reclaim/memcg-side fix. We
> > have a somewhat tumultuous history of policy in that space:
> >
> > commit 7775face207922ea62a4e96b9cd45abfdc7b9840
> > Author: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > Date: Tue Mar 5 15:46:47 2019 -0800
> >
> > memcg: killed threads should not invoke memcg OOM killer
> >
> > allowed dying tasks to simply force all charges and move on. This
> > turned out to be too aggressive; there were instances of exiting,
> > uncontained memcg tasks causing global OOMs. This led to that:
> >
> > commit a4ebf1b6ca1e011289677239a2a361fde4a88076
> > Author: Vasily Averin <vasily.averin@linux.dev>
> > Date: Fri Nov 5 13:38:09 2021 -0700
> >
> > memcg: prohibit unconditional exceeding the limit of dying tasks
> >
> > which reverted the bypass rather thoroughly. Now NO dying tasks, *not
> > even OOM victims*, can force charges. I am not sure this is correct,
> > either:
> >
> > If we return -ENOMEM to an OOM victim in a fault, the fault handler
> > will re-trigger OOM, which will find the existing OOM victim and do
> > nothing, then restart the fault. This is a memory deadlock. The page
> > allocator gives OOM victims access to reserves for that reason.
> >
> > Actually, it looks even worse. For some reason we're not triggering
> > OOM from dying tasks:
> >
> > ret = task_is_dying() || out_of_memory(&oc);
> >
> > Even though dying tasks are in no way privileged or allowed to exit
> > expeditiously. Why shouldn't they trigger the OOM killer like anybody
> > else trying to allocate memory?
>
> This is a very good point, and out_of_memory() will actually mark the
> dying process as an oom victim and put it on the oom reaper's list, which
> should help further in such a situation.
>
> >
> > As it stands, it seems we have dying tasks getting trapped in an
> > endless fault->reclaim cycle; with no access to the OOM killer and no
> > access to reserves. Presumably this is what's going on here?
> >
> > I think we want something like this:
>
> The following patch looks good to me. Let's test this out (hopefully Rik
> will be able to find a live impacted machine) and move forward with this
> fix.
I agree with this too. As Shakeel mentioned, this seemed like a
stopgap and not an actual fix for the underlying problem. Johannes
further outlined how the stopgap can be problematic.
Let's try to fix this on the memcg/reclaim/OOM side, and properly
treat dying tasks instead of forcing them into potentially super slow
reclaim paths. Hopefully Johannes's patch fixes this.
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v2] memcg: allow exiting tasks to write back data to swap
2024-12-12 18:30 ` Johannes Weiner
2024-12-12 21:35 ` Shakeel Butt
@ 2024-12-13 0:32 ` Roman Gushchin
2024-12-13 4:42 ` Johannes Weiner
2024-12-16 15:39 ` Michal Hocko
2 siblings, 1 reply; 29+ messages in thread
From: Roman Gushchin @ 2024-12-13 0:32 UTC (permalink / raw)
To: Johannes Weiner
Cc: Yosry Ahmed, Rik van Riel, Balbir Singh, Michal Hocko,
Shakeel Butt, Muchun Song, Andrew Morton, cgroups, linux-mm,
linux-kernel, kernel-team, Nhat Pham
On Thu, Dec 12, 2024 at 01:30:12PM -0500, Johannes Weiner wrote:
> On Thu, Dec 12, 2024 at 09:06:25AM -0800, Yosry Ahmed wrote:
> > On Thu, Dec 12, 2024 at 8:58 AM Rik van Riel <riel@surriel.com> wrote:
> > >
> > > A task already in exit can get stuck trying to allocate pages, if its
> > > cgroup is at the memory.max limit, the cgroup is using zswap, but
> > > zswap writeback is disabled, and the remaining memory in the cgroup is
> > > not compressible.
> > >
> > > This seems like an unlikely confluence of events, but it can happen
> > > quite easily if a cgroup is OOM killed due to exceeding its memory.max
> > > limit, and all the tasks in the cgroup are trying to exit simultaneously.
> > >
> > > When this happens, it can sometimes take hours for tasks to exit,
> > > as they are all trying to squeeze things into zswap to bring the group's
> > > memory consumption below memory.max.
> > >
> > > Allowing these exiting programs to push some memory from their own
> > > cgroup into swap allows them to quickly bring the cgroup's memory
> > > consumption below memory.max, and exit in seconds rather than hours.
> > >
> > > Signed-off-by: Rik van Riel <riel@surriel.com>
> >
> > Thanks for sending a v2.
> >
> > I still think maybe this needs to be fixed on the memcg side, at least
> > by not making exiting tasks try really hard to reclaim memory to the
> > point where this becomes a problem. IIUC there could be other reasons
> > why reclaim may take too long, but maybe not as pathological as this
> > case to be fair. I will let the memcg maintainers chime in for this.
> >
> > If there's a fundamental reason why this cannot be fixed on the memcg
> > side, I don't object to this change.
> >
> > Nhat, any objections on your end? I think your fleet workloads were
> > the first users of this interface. Does this break their expectations?
>
> Yes, I don't think we can do this, unfortunately :( There can be a
> variety of reasons why a user might want to prohibit disk swap for
> a certain cgroup, and we can't assume it's okay to make exceptions.
>
> There might also not *be* any disk swap to overflow into after Nhat's
> virtual swap patches. Presumably zram would still have the issue too.
>
> So I'm also inclined to think this needs a reclaim/memcg-side fix. We
> have a somewhat tumultuous history of policy in that space:
>
> commit 7775face207922ea62a4e96b9cd45abfdc7b9840
> Author: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Date: Tue Mar 5 15:46:47 2019 -0800
>
> memcg: killed threads should not invoke memcg OOM killer
>
> allowed dying tasks to simply force all charges and move on. This
> turned out to be too aggressive; there were instances of exiting,
> uncontained memcg tasks causing global OOMs. This led to that:
>
> commit a4ebf1b6ca1e011289677239a2a361fde4a88076
> Author: Vasily Averin <vasily.averin@linux.dev>
> Date: Fri Nov 5 13:38:09 2021 -0700
>
> memcg: prohibit unconditional exceeding the limit of dying tasks
>
> which reverted the bypass rather thoroughly. Now NO dying tasks, *not
> even OOM victims*, can force charges. I am not sure this is correct,
> either:
>
> If we return -ENOMEM to an OOM victim in a fault, the fault handler
> will re-trigger OOM, which will find the existing OOM victim and do
> nothing, then restart the fault. This is a memory deadlock. The page
> allocator gives OOM victims access to reserves for that reason.
>
> Actually, it looks even worse. For some reason we're not triggering
> OOM from dying tasks:
>
> ret = task_is_dying() || out_of_memory(&oc);
>
> Even though dying tasks are in no way privileged or allowed to exit
> expeditiously. Why shouldn't they trigger the OOM killer like anybody
> else trying to allocate memory?
>
> As it stands, it seems we have dying tasks getting trapped in an
> endless fault->reclaim cycle; with no access to the OOM killer and no
> access to reserves. Presumably this is what's going on here?
>
> I think we want something like this:
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 53db98d2c4a1..be6b6e72bde5 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1596,11 +1596,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> if (mem_cgroup_margin(memcg) >= (1 << order))
> goto unlock;
>
> - /*
> - * A few threads which were not waiting at mutex_lock_killable() can
> - * fail to bail out. Therefore, check again after holding oom_lock.
> - */
> - ret = task_is_dying() || out_of_memory(&oc);
> + ret = out_of_memory(&oc);
I like the idea, but at first glance it might reintroduce the problem
fixed by 7775face2079 ("memcg: killed threads should not invoke memcg OOM killer").
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v2] memcg: allow exiting tasks to write back data to swap
2024-12-13 0:32 ` Roman Gushchin
@ 2024-12-13 4:42 ` Johannes Weiner
0 siblings, 0 replies; 29+ messages in thread
From: Johannes Weiner @ 2024-12-13 4:42 UTC (permalink / raw)
To: Roman Gushchin
Cc: Yosry Ahmed, Rik van Riel, Balbir Singh, Michal Hocko,
Shakeel Butt, Muchun Song, Andrew Morton, cgroups, linux-mm,
linux-kernel, kernel-team, Nhat Pham
On Fri, Dec 13, 2024 at 12:32:11AM +0000, Roman Gushchin wrote:
> On Thu, Dec 12, 2024 at 01:30:12PM -0500, Johannes Weiner wrote:
> > On Thu, Dec 12, 2024 at 09:06:25AM -0800, Yosry Ahmed wrote:
> > > On Thu, Dec 12, 2024 at 8:58 AM Rik van Riel <riel@surriel.com> wrote:
> > > >
> > > > A task already in exit can get stuck trying to allocate pages, if its
> > > > cgroup is at the memory.max limit, the cgroup is using zswap, but
> > > > zswap writeback is disabled, and the remaining memory in the cgroup is
> > > > not compressible.
> > > >
> > > > This seems like an unlikely confluence of events, but it can happen
> > > > quite easily if a cgroup is OOM killed due to exceeding its memory.max
> > > > limit, and all the tasks in the cgroup are trying to exit simultaneously.
> > > >
> > > > When this happens, it can sometimes take hours for tasks to exit,
> > > > as they are all trying to squeeze things into zswap to bring the group's
> > > > memory consumption below memory.max.
> > > >
> > > > Allowing these exiting programs to push some memory from their own
> > > > cgroup into swap allows them to quickly bring the cgroup's memory
> > > > consumption below memory.max, and exit in seconds rather than hours.
> > > >
> > > > Signed-off-by: Rik van Riel <riel@surriel.com>
> > >
> > > Thanks for sending a v2.
> > >
> > > I still think maybe this needs to be fixed on the memcg side, at least
> > > by not making exiting tasks try really hard to reclaim memory to the
> > > point where this becomes a problem. IIUC there could be other reasons
> > > why reclaim may take too long, but maybe not as pathological as this
> > > case to be fair. I will let the memcg maintainers chime in for this.
> > >
> > > If there's a fundamental reason why this cannot be fixed on the memcg
> > > side, I don't object to this change.
> > >
> > > Nhat, any objections on your end? I think your fleet workloads were
> > > the first users of this interface. Does this break their expectations?
> >
> > Yes, I don't think we can do this, unfortunately :( There can be a
> > variety of reasons why a user might want to prohibit disk swap for
> > a certain cgroup, and we can't assume it's okay to make exceptions.
> >
> > There might also not *be* any disk swap to overflow into after Nhat's
> > virtual swap patches. Presumably zram would still have the issue too.
> >
> > So I'm also inclined to think this needs a reclaim/memcg-side fix. We
> > have a somewhat tumultuous history of policy in that space:
> >
> > commit 7775face207922ea62a4e96b9cd45abfdc7b9840
> > Author: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > Date: Tue Mar 5 15:46:47 2019 -0800
> >
> > memcg: killed threads should not invoke memcg OOM killer
> >
> > allowed dying tasks to simply force all charges and move on. This
> > turned out to be too aggressive; there were instances of exiting,
> > uncontained memcg tasks causing global OOMs. This led to that:
> >
> > commit a4ebf1b6ca1e011289677239a2a361fde4a88076
> > Author: Vasily Averin <vasily.averin@linux.dev>
> > Date: Fri Nov 5 13:38:09 2021 -0700
> >
> > memcg: prohibit unconditional exceeding the limit of dying tasks
> >
> > which reverted the bypass rather thoroughly. Now NO dying tasks, *not
> > even OOM victims*, can force charges. I am not sure this is correct,
> > either:
> >
> > If we return -ENOMEM to an OOM victim in a fault, the fault handler
> > will re-trigger OOM, which will find the existing OOM victim and do
> > nothing, then restart the fault. This is a memory deadlock. The page
> > allocator gives OOM victims access to reserves for that reason.
> >
> > Actually, it looks even worse. For some reason we're not triggering
> > OOM from dying tasks:
> >
> > ret = task_is_dying() || out_of_memory(&oc);
> >
> > Even though dying tasks are in no way privileged or allowed to exit
> > expeditiously. Why shouldn't they trigger the OOM killer like anybody
> > else trying to allocate memory?
> >
> > As it stands, it seems we have dying tasks getting trapped in an
> > endless fault->reclaim cycle; with no access to the OOM killer and no
> > access to reserves. Presumably this is what's going on here?
> >
> > I think we want something like this:
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 53db98d2c4a1..be6b6e72bde5 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -1596,11 +1596,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> > if (mem_cgroup_margin(memcg) >= (1 << order))
> > goto unlock;
> >
> > - /*
> > - * A few threads which were not waiting at mutex_lock_killable() can
> > - * fail to bail out. Therefore, check again after holding oom_lock.
> > - */
> > - ret = task_is_dying() || out_of_memory(&oc);
> > + ret = out_of_memory(&oc);
>
> I like the idea, but at first glance it might reintroduce the problem
> fixed by 7775face2079 ("memcg: killed threads should not invoke memcg OOM killer").
The race and warning pointed out in the changelog might have been
sufficiently mitigated by this more recent commit:
commit 1378b37d03e8147c67fde60caf0474ea879163d8
Author: Yafang Shao <laoar.shao@gmail.com>
Date: Thu Aug 6 23:22:08 2020 -0700
memcg, oom: check memcg margin for parallel oom
If not, another possibility would be this:
ret = tsk_is_oom_victim(task) || out_of_memory(&oc);
But we should probably first restore reliable forward progress on
dying tasks, then worry about the spurious printk later.
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v2] memcg: allow exiting tasks to write back data to swap
2024-12-12 18:30 ` Johannes Weiner
2024-12-12 21:35 ` Shakeel Butt
2024-12-13 0:32 ` Roman Gushchin
@ 2024-12-16 15:39 ` Michal Hocko
2025-01-14 16:09 ` Johannes Weiner
2 siblings, 1 reply; 29+ messages in thread
From: Michal Hocko @ 2024-12-16 15:39 UTC (permalink / raw)
To: Johannes Weiner
Cc: Yosry Ahmed, Rik van Riel, Balbir Singh, Roman Gushchin,
Shakeel Butt, Muchun Song, Andrew Morton, cgroups, linux-mm,
linux-kernel, kernel-team, Nhat Pham
On Thu 12-12-24 13:30:12, Johannes Weiner wrote:
[...]
> So I'm also inclined to think this needs a reclaim/memcg-side fix. We
> have a somewhat tumultuous history of policy in that space:
>
> commit 7775face207922ea62a4e96b9cd45abfdc7b9840
> Author: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Date: Tue Mar 5 15:46:47 2019 -0800
>
> memcg: killed threads should not invoke memcg OOM killer
>
> allowed dying tasks to simply force all charges and move on. This
> turned out to be too aggressive; there were instances of exiting,
> uncontained memcg tasks causing global OOMs. This led to that:
>
> commit a4ebf1b6ca1e011289677239a2a361fde4a88076
> Author: Vasily Averin <vasily.averin@linux.dev>
> Date: Fri Nov 5 13:38:09 2021 -0700
>
> memcg: prohibit unconditional exceeding the limit of dying tasks
>
> which reverted the bypass rather thoroughly. Now NO dying tasks, *not
> even OOM victims*, can force charges. I am not sure this is correct,
> either:
IIRC the reason going this route was a lack of per-memcg oom reserves.
Global oom victims are getting some slack because the amount of reserves
is bounded. This is not the case for memcgs though.
> If we return -ENOMEM to an OOM victim in a fault, the fault handler
> will re-trigger OOM, which will find the existing OOM victim and do
> nothing, then restart the fault.
IIRC the task will handle the pending SIGKILL if the #PF fails. If the
charge happens from the exit path then we rely on ENOMEM returned from
gup as a signal to back off. Do we have any caller that keeps retrying
on ENOMEM?
> This is a memory deadlock. The page
> allocator gives OOM victims access to reserves for that reason.
> Actually, it looks even worse. For some reason we're not triggering
> OOM from dying tasks:
>
> ret = task_is_dying() || out_of_memory(&oc);
>
> Even though dying tasks are in no way privileged or allowed to exit
> expeditiously. Why shouldn't they trigger the OOM killer like anybody
> else trying to allocate memory?
Good question! I suspect this early bail out is based on an assumption
that a dying task will free up the memory soon, so the oom killer is
unnecessary.
> As it stands, it seems we have dying tasks getting trapped in an
> endless fault->reclaim cycle; with no access to the OOM killer and no
> access to reserves. Presumably this is what's going on here?
As mentioned above this seems really surprising and it would indicate
that something in the exit path would keep retrying when getting ENOMEM
from gup or GFP_ACCOUNT allocation. GFP_NOFAIL requests are allowed to
over-consume.
> I think we want something like this:
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 53db98d2c4a1..be6b6e72bde5 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1596,11 +1596,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> if (mem_cgroup_margin(memcg) >= (1 << order))
> goto unlock;
>
> - /*
> - * A few threads which were not waiting at mutex_lock_killable() can
> - * fail to bail out. Therefore, check again after holding oom_lock.
> - */
> - ret = task_is_dying() || out_of_memory(&oc);
> + ret = out_of_memory(&oc);
I am not against this as it would allow doing an async oom_reaper memory
reclaim in the worst case. This could potentially reintroduce the "No
victim available" case described by 7775face2079 ("memcg: killed threads
should not invoke memcg OOM killer") but that seemed to be a very
specific and artificial usecase IIRC.
>
> unlock:
> mutex_unlock(&oom_lock);
> @@ -2198,6 +2194,9 @@ int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
> if (unlikely(current->flags & PF_MEMALLOC))
> goto force;
>
> + if (unlikely(tsk_is_oom_victim(current)))
> + goto force;
> +
> if (unlikely(task_in_memcg_oom(current)))
> goto nomem;
This is more problematic as it doesn't cap a potential runaway and
eventual global OOM which is not really great. In the past this could be
possible through vmalloc which didn't bail out early for killed tasks.
That risk has been mitigated by dd544141b9eb ("vmalloc: back off when
the current task is OOM-killed"). I would like to keep some sort of
protection from those runaways. Whether that is a limited "reserve" for
oom victims that would be per memcg, or do not let them consume above the
hard limit at all. Fundamentally a limited reserve doesn't solve the
underlying problem, it just makes it less likely, so the latter would be
preferred by me TBH.
Before we do that it would be really good to understand the source of
those retries. Maybe I am missing something really obvious but those
shouldn't really happen.
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v2] memcg: allow exiting tasks to write back data to swap
2024-12-16 15:39 ` Michal Hocko
@ 2025-01-14 16:09 ` Johannes Weiner
2025-01-14 16:46 ` Michal Hocko
0 siblings, 1 reply; 29+ messages in thread
From: Johannes Weiner @ 2025-01-14 16:09 UTC (permalink / raw)
To: Michal Hocko
Cc: Yosry Ahmed, Rik van Riel, Balbir Singh, Roman Gushchin,
Shakeel Butt, Muchun Song, Andrew Morton, cgroups, linux-mm,
linux-kernel, kernel-team, Nhat Pham
Hi,
On Mon, Dec 16, 2024 at 04:39:12PM +0100, Michal Hocko wrote:
> On Thu 12-12-24 13:30:12, Johannes Weiner wrote:
> [...]
> > So I'm also inclined to think this needs a reclaim/memcg-side fix. We
> > have a somewhat tumultuous history of policy in that space:
> >
> > commit 7775face207922ea62a4e96b9cd45abfdc7b9840
> > Author: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > Date: Tue Mar 5 15:46:47 2019 -0800
> >
> > memcg: killed threads should not invoke memcg OOM killer
> >
> > allowed dying tasks to simply force all charges and move on. This
> > turned out to be too aggressive; there were instances of exiting,
> > uncontained memcg tasks causing global OOMs. This led to that:
> >
> > commit a4ebf1b6ca1e011289677239a2a361fde4a88076
> > Author: Vasily Averin <vasily.averin@linux.dev>
> > Date: Fri Nov 5 13:38:09 2021 -0700
> >
> > memcg: prohibit unconditional exceeding the limit of dying tasks
> >
> > which reverted the bypass rather thoroughly. Now NO dying tasks, *not
> > even OOM victims*, can force charges. I am not sure this is correct,
> > either:
>
> IIRC the reason going this route was a lack of per-memcg oom reserves.
> Global oom victims are getting some slack because the amount of reserves
> is bounded. This is not the case for memcgs though.
>
> > If we return -ENOMEM to an OOM victim in a fault, the fault handler
> > will re-trigger OOM, which will find the existing OOM victim and do
> > nothing, then restart the fault.
>
> IIRC the task will handle the pending SIGKILL if the #PF fails. If the
> charge happens from the exit path then we rely on ENOMEM returned from
> gup as a signal to back off. Do we have any caller that keeps retrying
> on ENOMEM?
We managed to extract a stack trace of the livelocked task:
obj_cgroup_may_zswap
zswap_store
swap_writepage
shrink_folio_list
shrink_lruvec
shrink_node
do_try_to_free_pages
try_to_free_mem_cgroup_pages
charge_memcg
mem_cgroup_swapin_charge_folio
__read_swap_cache_async
swapin_readahead
do_swap_page
handle_mm_fault
do_user_addr_fault
exc_page_fault
asm_exc_page_fault
__get_user
futex_cleanup
futex_exit_release
do_exit
do_group_exit
get_signal
arch_do_signal_or_restart
exit_to_user_mode_prepare
syscall_exit_to_user_mode
do_syscall
entry_SYSCALL_64
syscall
Both memory.max and memory.zswap.max are hit. I don't see how this
could ever make forward progress - the futex fault will retry until it
succeeds. The only workaround for this state right now is to manually
raise memory.max to let the fault succeed and the exit complete.
> > This is a memory deadlock. The page
> > allocator gives OOM victims access to reserves for that reason.
>
> > Actually, it looks even worse. For some reason we're not triggering
> > OOM from dying tasks:
> >
> > ret = task_is_dying() || out_of_memory(&oc);
> >
> > Even though dying tasks are in no way privileged or allowed to exit
> > expeditiously. Why shouldn't they trigger the OOM killer like anybody
> > else trying to allocate memory?
>
> Good question! I suspect this early bail out is based on an assumption
> that a dying task will free up the memory soon, so the oom killer is
> unnecessary.
Correct. It's not about the kill. The important thing is that at least
one exiting task is getting the extra memory headroom usually afforded
to the OOM victim, to guarantee forward progress in the exit path.
> > As it stands, it seems we have dying tasks getting trapped in an
> > endless fault->reclaim cycle; with no access to the OOM killer and no
> > access to reserves. Presumably this is what's going on here?
>
> As mentioned above this seems really surprising and it would indicate
> that something in the exit path would keep retrying when getting ENOMEM
> from gup or GFP_ACCOUNT allocation. GFP_NOFAIL requests are allowed to
> over-consume.
I hope the path is clear from the stack trace above.
> > I think we want something like this:
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 53db98d2c4a1..be6b6e72bde5 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -1596,11 +1596,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> > if (mem_cgroup_margin(memcg) >= (1 << order))
> > goto unlock;
> >
> > - /*
> > - * A few threads which were not waiting at mutex_lock_killable() can
> > - * fail to bail out. Therefore, check again after holding oom_lock.
> > - */
> > - ret = task_is_dying() || out_of_memory(&oc);
> > + ret = out_of_memory(&oc);
>
> I am not against this, as it would allow an async oom_reaper memory
> reclaim in the worst case. This could potentially reintroduce the "No
> victim available" case described by 7775face2079 ("memcg: killed threads
> should not invoke memcg OOM killer") but that seemed to be a very
> specific and artificial usecase IIRC.
+1
> > unlock:
> > mutex_unlock(&oom_lock);
> > @@ -2198,6 +2194,9 @@ int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
> > if (unlikely(current->flags & PF_MEMALLOC))
> > goto force;
> >
> > + if (unlikely(tsk_is_oom_victim(current)))
> > + goto force;
> > +
> > if (unlikely(task_in_memcg_oom(current)))
> > goto nomem;
>
> This is more problematic as it doesn't cap a potential runaway and
> eventual global OOM which is not really great. In the past this could be
> possible through vmalloc which didn't bail out early for killed tasks.
> That risk has been mitigated by dd544141b9eb ("vmalloc: back off when
> the current task is OOM-killed"). I would like to keep some sort of
> protection from those runaways. Whether that is a limited "reserve" for
> oom victims that would be per memcg, or to not let them consume above the
> hard limit at all. Fundamentally, a limited reserve doesn't solve the
> underlying problem, it just makes it less likely, so the latter would be
> preferred by me TBH.
Right. There is no way to limit an OOM victim without risking a memory
deadlock.
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v2] memcg: allow exiting tasks to write back data to swap
2025-01-14 16:09 ` Johannes Weiner
@ 2025-01-14 16:46 ` Michal Hocko
2025-01-14 16:51 ` Rik van Riel
2025-01-14 16:54 ` Michal Hocko
0 siblings, 2 replies; 29+ messages in thread
From: Michal Hocko @ 2025-01-14 16:46 UTC (permalink / raw)
To: Johannes Weiner
Cc: Yosry Ahmed, Rik van Riel, Balbir Singh, Roman Gushchin,
Shakeel Butt, Muchun Song, Andrew Morton, cgroups, linux-mm,
linux-kernel, kernel-team, Nhat Pham
On Tue 14-01-25 11:09:55, Johannes Weiner wrote:
> Hi,
>
> On Mon, Dec 16, 2024 at 04:39:12PM +0100, Michal Hocko wrote:
> > On Thu 12-12-24 13:30:12, Johannes Weiner wrote:
[...]
> > > If we return -ENOMEM to an OOM victim in a fault, the fault handler
> > > will re-trigger OOM, which will find the existing OOM victim and do
> > > nothing, then restart the fault.
> >
> > IIRC the task will handle the pending SIGKILL if the #PF fails. If the
> > charge happens from the exit path then we rely on ENOMEM returned from
> > gup as a signal to back off. Do we have any caller that keeps retrying
> > on ENOMEM?
>
> We managed to extract a stack trace of the livelocked task:
>
> obj_cgroup_may_swap
> zswap_store
> swap_writepage
> shrink_folio_list
> shrink_lruvec
> shrink_node
> do_try_to_free_pages
> try_to_free_mem_cgroup_pages
OK, so this is the reclaim path, and it fails for the reasons you mention
below. This will retry several times until it hits mem_cgroup_oom, which
will bail in mem_cgroup_out_of_memory because task_is_dying() returns
true, then retry the charge + reclaim (as the oom killer hasn't done
anything) with passed_oom = true this time, eventually reach the nomem
path, and return ENOMEM. This should propagate -ENOMEM down the path
> charge_memcg
> mem_cgroup_swapin_charge_folio
> __read_swap_cache_async
> swapin_readahead
> do_swap_page
> handle_mm_fault
> do_user_addr_fault
> exc_page_fault
> asm_exc_page_fault
> __get_user
All the way here, returning the failure to futex_cleanup, which doesn't
retry __get_user on failure AFAICS (exit_robust_list). But I might be
missing something; it's been quite some time since I've looked into the
futex code.
> futex_cleanup
> futex_exit_release
> do_exit
> do_group_exit
> get_signal
> arch_do_signal_or_restart
> exit_to_user_mode_prepare
> syscall_exit_to_user_mode
> do_syscall
> entry_SYSCALL_64
> syscall
>
> Both memory.max and memory.zswap.max are hit. I don't see how this
> could ever make forward progress - the futex fault will retry until it
> succeeds.
I must be missing something but I do not see the retry, could you point
me where this is happening please?
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v2] memcg: allow exiting tasks to write back data to swap
2025-01-14 16:46 ` Michal Hocko
@ 2025-01-14 16:51 ` Rik van Riel
2025-01-14 17:00 ` Michal Hocko
2025-01-14 16:54 ` Michal Hocko
1 sibling, 1 reply; 29+ messages in thread
From: Rik van Riel @ 2025-01-14 16:51 UTC (permalink / raw)
To: Michal Hocko, Johannes Weiner
Cc: Yosry Ahmed, Balbir Singh, Roman Gushchin, Shakeel Butt,
Muchun Song, Andrew Morton, cgroups, linux-mm, linux-kernel,
kernel-team, Nhat Pham
On Tue, 2025-01-14 at 17:46 +0100, Michal Hocko wrote:
> On Tue 14-01-25 11:09:55, Johannes Weiner wrote:
>
> >
> > We managed to extract a stack trace of the livelocked task:
> >
> > obj_cgroup_may_swap
> > zswap_store
> > swap_writepage
> > shrink_folio_list
> > shrink_lruvec
> > shrink_node
> > do_try_to_free_pages
> > try_to_free_mem_cgroup_pages
>
> OK, so this is the reclaim path and it fails due to reasons you
> mention
> below. This will retry several times until it hits mem_cgroup_oom
> which
> will bail in mem_cgroup_out_of_memory because of task_is_dying
> (returns
> true) and retry the charge + reclaim (as the oom killer hasn't done
> anything) with passed_oom = true this time and eventually got to
> nomem
> path and returns ENOMEM. This should propagate -ENOMEM down the path
>
> > charge_memcg
> > mem_cgroup_swapin_charge_folio
> > __read_swap_cache_async
> > swapin_readahead
> > do_swap_page
> > handle_mm_fault
> > do_user_addr_fault
> > exc_page_fault
> > asm_exc_page_fault
> > __get_user
>
> All the way here and return the failure to futex_cleanup which
> doesn't
> retry __get_user on the failure AFAICS (exit_robust_list). But I
> might
> be missing something, it's been quite some time since I've looked
> into
> futex code.
Can you explain how -ENOMEM would get propagated down
past the page fault handler?
This isn't get_user_pages(), which can just pass
-ENOMEM on to the caller.
If there is code to pass -ENOMEM on past the page
fault exception handler, I have not been able to
find it. How does this work?
>
> > futex_cleanup
> > futex_exit_release
> > do_exit
> > do_group_exit
> > get_signal
> > arch_do_signal_or_restart
> > exit_to_user_mode_prepare
> > syscall_exit_to_user_mode
> > do_syscall
> > entry_SYSCALL_64
> > syscall
> >
> > Both memory.max and memory.zswap.max are hit. I don't see how this
> > could ever make forward progress - the futex fault will retry until
> > it
> > succeeds.
>
> I must be missing something but I do not see the retry, could you
> point
> me where this is happening please?
>
--
All Rights Reversed.
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v2] memcg: allow exiting tasks to write back data to swap
2025-01-14 16:51 ` Rik van Riel
@ 2025-01-14 17:00 ` Michal Hocko
2025-01-14 17:11 ` Rik van Riel
0 siblings, 1 reply; 29+ messages in thread
From: Michal Hocko @ 2025-01-14 17:00 UTC (permalink / raw)
To: Rik van Riel
Cc: Johannes Weiner, Yosry Ahmed, Balbir Singh, Roman Gushchin,
Shakeel Butt, Muchun Song, Andrew Morton, cgroups, linux-mm,
linux-kernel, kernel-team, Nhat Pham
On Tue 14-01-25 11:51:18, Rik van Riel wrote:
> On Tue, 2025-01-14 at 17:46 +0100, Michal Hocko wrote:
> > On Tue 14-01-25 11:09:55, Johannes Weiner wrote:
> >
> > >
> > > We managed to extract a stack trace of the livelocked task:
> > >
> > > obj_cgroup_may_swap
> > > zswap_store
> > > swap_writepage
> > > shrink_folio_list
> > > shrink_lruvec
> > > shrink_node
> > > do_try_to_free_pages
> > > try_to_free_mem_cgroup_pages
> >
> > OK, so this is the reclaim path and it fails due to reasons you
> > mention
> > below. This will retry several times until it hits mem_cgroup_oom
> > which
> > will bail in mem_cgroup_out_of_memory because of task_is_dying
> > (returns
> > true) and retry the charge + reclaim (as the oom killer hasn't done
> > anything) with passed_oom = true this time and eventually got to
> > nomem
> > path and returns ENOMEM. This should propagate -ENOMEM down the path
> >
> > > charge_memcg
> > > mem_cgroup_swapin_charge_folio
> > > __read_swap_cache_async
> > > swapin_readahead
> > > do_swap_page
> > > handle_mm_fault
> > > do_user_addr_fault
> > > exc_page_fault
> > > asm_exc_page_fault
> > > __get_user
> >
> > All the way here and return the failure to futex_cleanup which
> > doesn't
> > retry __get_user on the failure AFAICS (exit_robust_list). But I
> > might
> > be missing something, it's been quite some time since I've looked
> > into
> > futex code.
>
> Can you explain how -ENOMEM would get propagated down
> past the page fault handler?
>
> This isn't get_user_pages(), which can just pass
> -ENOMEM on to the caller.
>
> If there is code to pass -ENOMEM on past the page
> fault exception handler, I have not been able to
> find it. How does this work?
This might be me misunderstanding the get_user machinery, but doesn't it
return a failure when the PF handler returns ENOMEM?
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v2] memcg: allow exiting tasks to write back data to swap
2025-01-14 17:00 ` Michal Hocko
@ 2025-01-14 17:11 ` Rik van Riel
2025-01-14 18:13 ` Michal Hocko
0 siblings, 1 reply; 29+ messages in thread
From: Rik van Riel @ 2025-01-14 17:11 UTC (permalink / raw)
To: Michal Hocko
Cc: Johannes Weiner, Yosry Ahmed, Balbir Singh, Roman Gushchin,
Shakeel Butt, Muchun Song, Andrew Morton, cgroups, linux-mm,
linux-kernel, kernel-team, Nhat Pham
On Tue, 2025-01-14 at 18:00 +0100, Michal Hocko wrote:
> On Tue 14-01-25 11:51:18, Rik van Riel wrote:
> > On Tue, 2025-01-14 at 17:46 +0100, Michal Hocko wrote:
> > > On Tue 14-01-25 11:09:55, Johannes Weiner wrote:
> > >
> > > > charge_memcg
> > > > mem_cgroup_swapin_charge_folio
> > > > __read_swap_cache_async
> > > > swapin_readahead
> > > > do_swap_page
> > > > handle_mm_fault
> > > > do_user_addr_fault
> > > > exc_page_fault
> > > > asm_exc_page_fault
> > > > __get_user
> > >
> > > All the way here and return the failure to futex_cleanup which
> > > doesn't
> > > retry __get_user on the failure AFAICS (exit_robust_list). But I
> > > might
> > > be missing something, it's been quite some time since I've looked
> > > into
> > > futex code.
> >
> > Can you explain how -ENOMEM would get propagated down
> > past the page fault handler?
> >
> > This isn't get_user_pages(), which can just pass
> > -ENOMEM on to the caller.
> >
> > If there is code to pass -ENOMEM on past the page
> > fault exception handler, I have not been able to
> > find it. How does this work?
>
> This might be me misunderstanding the get_user machinery, but doesn't it
> return a failure when the PF handler returns ENOMEM?
I believe __get_user simply does a memcpy, and ends
up in the page fault handler.
It does not access userspace explicitly like we do
with functions like get_user_pages.
--
All Rights Reversed.
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v2] memcg: allow exiting tasks to write back data to swap
2025-01-14 17:11 ` Rik van Riel
@ 2025-01-14 18:13 ` Michal Hocko
2025-01-14 19:23 ` Johannes Weiner
0 siblings, 1 reply; 29+ messages in thread
From: Michal Hocko @ 2025-01-14 18:13 UTC (permalink / raw)
To: Rik van Riel
Cc: Johannes Weiner, Yosry Ahmed, Balbir Singh, Roman Gushchin,
Shakeel Butt, Muchun Song, Andrew Morton, cgroups, linux-mm,
linux-kernel, kernel-team, Nhat Pham
On Tue 14-01-25 12:11:54, Rik van Riel wrote:
> On Tue, 2025-01-14 at 18:00 +0100, Michal Hocko wrote:
> > On Tue 14-01-25 11:51:18, Rik van Riel wrote:
> > > On Tue, 2025-01-14 at 17:46 +0100, Michal Hocko wrote:
> > > > On Tue 14-01-25 11:09:55, Johannes Weiner wrote:
> > > >
> > > > > charge_memcg
> > > > > mem_cgroup_swapin_charge_folio
> > > > > __read_swap_cache_async
> > > > > swapin_readahead
> > > > > do_swap_page
> > > > > handle_mm_fault
> > > > > do_user_addr_fault
> > > > > exc_page_fault
> > > > > asm_exc_page_fault
> > > > > __get_user
> > > >
> > > > All the way here and return the failure to futex_cleanup which
> > > > doesn't
> > > > retry __get_user on the failure AFAICS (exit_robust_list). But I
> > > > might
> > > > be missing something, it's been quite some time since I've looked
> > > > into
> > > > futex code.
> > >
> > > Can you explain how -ENOMEM would get propagated down
> > > past the page fault handler?
> > >
> > > This isn't get_user_pages(), which can just pass
> > > -ENOMEM on to the caller.
> > >
> > > If there is code to pass -ENOMEM on past the page
> > > fault exception handler, I have not been able to
> > > find it. How does this work?
> >
> > This might be me misunderstanding the get_user machinery, but doesn't it
> > return a failure when the PF handler returns ENOMEM?
>
> I believe __get_user simply does a memcpy, and ends
> up in the page fault handler.
It's been ages since I've looked into that code and my memory might be
very rusty. But IIRC the page fault would be handled through the
exception table and return EFAULT on failure. But I am not really sure
whether that is the case for all errors returned by the page fault
handler or only for SEGV/SIGBUS. I need to refresh my memory on that.
Anyway, have you tried to reproduce with
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7b3503d12aaf..9c30c442e3b0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1627,7 +1627,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
* A few threads which were not waiting at mutex_lock_killable() can
* fail to bail out. Therefore, check again after holding oom_lock.
*/
- ret = task_is_dying() || out_of_memory(&oc);
+ ret = out_of_memory(&oc);
unlock:
mutex_unlock(&oom_lock);
proposed by Johannes earlier? This should help to trigger the oom reaper
to free up some memory.
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v2] memcg: allow exiting tasks to write back data to swap
2025-01-14 18:13 ` Michal Hocko
@ 2025-01-14 19:23 ` Johannes Weiner
2025-01-14 19:42 ` Michal Hocko
0 siblings, 1 reply; 29+ messages in thread
From: Johannes Weiner @ 2025-01-14 19:23 UTC (permalink / raw)
To: Michal Hocko
Cc: Rik van Riel, Yosry Ahmed, Balbir Singh, Roman Gushchin,
Shakeel Butt, Muchun Song, Andrew Morton, cgroups, linux-mm,
linux-kernel, kernel-team, Nhat Pham
On Tue, Jan 14, 2025 at 07:13:07PM +0100, Michal Hocko wrote:
> Anyway, have you tried to reproduce with
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 7b3503d12aaf..9c30c442e3b0 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1627,7 +1627,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> * A few threads which were not waiting at mutex_lock_killable() can
> * fail to bail out. Therefore, check again after holding oom_lock.
> */
> - ret = task_is_dying() || out_of_memory(&oc);
> + ret = out_of_memory(&oc);
>
> unlock:
> mutex_unlock(&oom_lock);
>
> proposed by Johannes earlier? This should help to trigger the oom reaper
> to free up some memory.
Yes, I was wondering about that too.
If the OOM reaper can be our reliable way of forward progress, we
don't need any reserve or headroom beyond memory.max.
IIRC it can fail if somebody is holding mmap_sem for writing. The exit
path at some point takes that, but also around the time it frees up
all its memory voluntarily, so that should be fine. Are you aware of
other scenarios where it can fail?
What if everything has been swapped out already and there is nothing
to reap? IOW, only unreclaimable/kernel memory remaining in the group.
It still seems to me that allowing the OOM victim (and only the OOM
victim) to bypass memory.max is the only guarantee to progress.
I'm not really concerned about side effects. Any runaway allocation in
the exit path (like the vmalloc one you referenced before) is a much
bigger concern for exceeding the physical OOM reserves in the page
allocator. What's a containment failure for cgroups would be a memory
deadlock at the system level. It's a class of kernel bug that needs
fixing, not something we can really work around in the cgroup code.
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v2] memcg: allow exiting tasks to write back data to swap
2025-01-14 19:23 ` Johannes Weiner
@ 2025-01-14 19:42 ` Michal Hocko
2025-01-15 17:35 ` Rik van Riel
0 siblings, 1 reply; 29+ messages in thread
From: Michal Hocko @ 2025-01-14 19:42 UTC (permalink / raw)
To: Johannes Weiner
Cc: Rik van Riel, Yosry Ahmed, Balbir Singh, Roman Gushchin,
Shakeel Butt, Muchun Song, Andrew Morton, cgroups, linux-mm,
linux-kernel, kernel-team, Nhat Pham
On Tue 14-01-25 14:23:22, Johannes Weiner wrote:
> On Tue, Jan 14, 2025 at 07:13:07PM +0100, Michal Hocko wrote:
> > Anyway, have you tried to reproduce with
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 7b3503d12aaf..9c30c442e3b0 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -1627,7 +1627,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> > * A few threads which were not waiting at mutex_lock_killable() can
> > * fail to bail out. Therefore, check again after holding oom_lock.
> > */
> > - ret = task_is_dying() || out_of_memory(&oc);
> > + ret = out_of_memory(&oc);
> >
> > unlock:
> > mutex_unlock(&oom_lock);
> >
> > proposed by Johannes earlier? This should help to trigger the oom reaper
> > to free up some memory.
>
> Yes, I was wondering about that too.
>
> If the OOM reaper can be our reliable way of forward progress, we
> don't need any reserve or headroom beyond memory.max.
>
> IIRC it can fail if somebody is holding mmap_sem for writing. The exit
> path at some point takes that, but also around the time it frees up
> all its memory voluntarily, so that should be fine. Are you aware of
> other scenarios where it can fail?
Setting MMF_OOM_SKIP is the final moment when the oom reaper can act.
This is after exit_mm_release, which releases the futex. Also, get_user
callers shouldn't be holding the exclusive mmap_lock, as that would
deadlock when the PF path takes the read lock, right?
> What if everything has been swapped out already and there is nothing
> to reap? IOW, only unreclaimable/kernel memory remaining in the group.
Yes, this is possible. It is also possible that the oom victim depletes
the oom reserves globally and fails the allocation, resulting in the same
problem. Reserves do buy some time but do not solve the underlying
issue.
> It still seems to me that allowing the OOM victim (and only the OOM
> victim) to bypass memory.max is the only guarantee to progress.
>
> I'm not really concerned about side effects. Any runaway allocation in
> the exit path (like the vmalloc one you referenced before) is a much
> bigger concern for exceeding the physical OOM reserves in the page
> allocator. What's a containment failure for cgroups would be a memory
> deadlock at the system level. It's a class of kernel bug that needs
> fixing, not something we can really work around in the cgroup code.
I do agree that a memory deadlock is not really a proper way to deal with
the issue. I have to admit that my understanding was based on ENOMEM
being properly propagated out of in-kernel user page faults. It seems I
was wrong about that. On the other hand, wouldn't that be a proper way to
deal with the issue? Relying on allocations never failing is quite
fragile.
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v2] memcg: allow exiting tasks to write back data to swap
2025-01-14 19:42 ` Michal Hocko
@ 2025-01-15 17:35 ` Rik van Riel
2025-01-15 19:41 ` Michal Hocko
0 siblings, 1 reply; 29+ messages in thread
From: Rik van Riel @ 2025-01-15 17:35 UTC (permalink / raw)
To: Michal Hocko, Johannes Weiner
Cc: Yosry Ahmed, Balbir Singh, Roman Gushchin, Shakeel Butt,
Muchun Song, Andrew Morton, cgroups, linux-mm, linux-kernel,
kernel-team, Nhat Pham
On Tue, 2025-01-14 at 20:42 +0100, Michal Hocko wrote:
> I do agree that a memory deadlock is not really a proper way to deal
> with the issue. I have to admit that my understanding was based on
> ENOMEM being properly propagated out of in-kernel user page faults.
It looks like it kind of is.
In case of VM_FAULT_OOM, the page fault code calls
kernelmode_fixup_or_oops(), which a few functions
down calls ex_handler_default(), which advances
regs->ip to the next instruction after the one
that faulted.
Of course, if we have a copy_from_user loop, we
could end up there a bunch of times :)
--
All Rights Reversed.
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v2] memcg: allow exiting tasks to write back data to swap
2025-01-15 17:35 ` Rik van Riel
@ 2025-01-15 19:41 ` Michal Hocko
0 siblings, 0 replies; 29+ messages in thread
From: Michal Hocko @ 2025-01-15 19:41 UTC (permalink / raw)
To: Rik van Riel
Cc: Johannes Weiner, Yosry Ahmed, Balbir Singh, Roman Gushchin,
hakeel Butt, Muchun Song, Andrew Morton, cgroups, linux-mm,
linux-kernel, kernel-team, Nhat Pham
On Wed 15-01-25 12:35:37, Rik van Riel wrote:
> On Tue, 2025-01-14 at 20:42 +0100, Michal Hocko wrote:
> > I do agree that a memory deadlock is not really a proper way to deal
> > with the issue. I have to admit that my understanding was based on
> > ENOMEM being properly propagated out of in-kernel user page faults.
>
> It looks like it kind of is.
>
> In case of VM_FAULT_OOM, the page fault code calls
> kernelmode_fixup_or_oops(), which a few functions
> down calls ex_handler_default(), which advances
> regs->ip to the next instruction after the one
> that faulted.
OK, so we do not have the endless loop. Good. Sorry I didn't get to read
through the fixup tables maze. Thanks for confirming.
> Of course, if we have a copy_from_user loop, we
> could end up there a bunch of times :)
Yes, the robust list might have many elements, and if each and every one
is swapped out then this can take a lot of time while the reclaim path
desperately retries the whole reclaim cycle. All that being said, does
the change (partial revert) suggested by Johannes
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7b3503d12aaf..9c30c442e3b0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1627,7 +1627,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
* A few threads which were not waiting at mutex_lock_killable() can
* fail to bail out. Therefore, check again after holding oom_lock.
*/
- ret = task_is_dying() || out_of_memory(&oc);
+ ret = out_of_memory(&oc);
unlock:
mutex_unlock(&oom_lock);
Or is the exit still taking unbearably long? If yes, maybe we can help
already killed and oom-reaped tasks fail with ENOMEM earlier?
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v2] memcg: allow exiting tasks to write back data to swap
2025-01-14 16:46 ` Michal Hocko
2025-01-14 16:51 ` Rik van Riel
@ 2025-01-14 16:54 ` Michal Hocko
2025-01-14 16:56 ` Rik van Riel
2025-01-14 16:56 ` Michal Hocko
1 sibling, 2 replies; 29+ messages in thread
From: Michal Hocko @ 2025-01-14 16:54 UTC (permalink / raw)
To: Johannes Weiner
Cc: Yosry Ahmed, Rik van Riel, Balbir Singh, Roman Gushchin,
Shakeel Butt, Muchun Song, Andrew Morton, cgroups, linux-mm,
linux-kernel, kernel-team, Nhat Pham
On Tue 14-01-25 17:46:39, Michal Hocko wrote:
> On Tue 14-01-25 11:09:55, Johannes Weiner wrote:
> > Hi,
> >
> > On Mon, Dec 16, 2024 at 04:39:12PM +0100, Michal Hocko wrote:
> > > On Thu 12-12-24 13:30:12, Johannes Weiner wrote:
> [...]
> > > > If we return -ENOMEM to an OOM victim in a fault, the fault handler
> > > > will re-trigger OOM, which will find the existing OOM victim and do
> > > > nothing, then restart the fault.
> > >
> > > IIRC the task will handle the pending SIGKILL if the #PF fails. If the
> > > charge happens from the exit path then we rely on ENOMEM returned from
> > > gup as a signal to back off. Do we have any caller that keeps retrying
> > > on ENOMEM?
> >
> > We managed to extract a stack trace of the livelocked task:
> >
> > obj_cgroup_may_swap
> > zswap_store
> > swap_writepage
> > shrink_folio_list
> > shrink_lruvec
> > shrink_node
> > do_try_to_free_pages
> > try_to_free_mem_cgroup_pages
>
> OK, so this is the reclaim path and it fails due to reasons you mention
> below. This will retry several times until it hits mem_cgroup_oom which
> will bail in mem_cgroup_out_of_memory because of task_is_dying (returns
> true) and retry the charge + reclaim (as the oom killer hasn't done
> anything) with passed_oom = true this time and eventually got to nomem
> path and returns ENOMEM.
Btw. is there any actual reason why we cannot go to nomem without going
to the oom killer (just to bail out) and through the whole cycle again?
That seems arbitrary and simply burns a lot of cycles without much
chance of a better outcome:
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7b3503d12aaf..eb45eaf0acfc 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2268,8 +2268,7 @@ int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
if (gfp_mask & __GFP_RETRY_MAYFAIL)
goto nomem;
- /* Avoid endless loop for tasks bypassed by the oom killer */
- if (passed_oom && task_is_dying())
+ if (task_is_dying())
goto nomem;
/*
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v2] memcg: allow exiting tasks to write back data to swap
2025-01-14 16:54 ` Michal Hocko
@ 2025-01-14 16:56 ` Rik van Riel
2025-01-14 16:56 ` Michal Hocko
1 sibling, 0 replies; 29+ messages in thread
From: Rik van Riel @ 2025-01-14 16:56 UTC (permalink / raw)
To: Michal Hocko, Johannes Weiner
Cc: Yosry Ahmed, Balbir Singh, Roman Gushchin, Shakeel Butt,
Muchun Song, Andrew Morton, cgroups, linux-mm, linux-kernel,
kernel-team, Nhat Pham
On Tue, 2025-01-14 at 17:54 +0100, Michal Hocko wrote:
> O
> Btw. is there any actual reason why we cannot go to nomem without going
> to the oom killer (just to bail out) and through the whole cycle again?
> That seems arbitrary and simply burns a lot of cycles without much
> chance of a better outcome
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 7b3503d12aaf..eb45eaf0acfc 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2268,8 +2268,7 @@ int try_charge_memcg(struct mem_cgroup *memcg,
> gfp_t gfp_mask,
> if (gfp_mask & __GFP_RETRY_MAYFAIL)
> goto nomem;
>
> - /* Avoid endless loop for tasks bypassed by the oom killer
> */
> - if (passed_oom && task_is_dying())
> + if (task_is_dying())
> goto nomem;
>
> /*
When we return from the page fault handler, we
restart the instruction that faulted.
That means we could just end up repeating the
same fault over and over again.
--
All Rights Reversed.
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v2] memcg: allow exiting tasks to write back data to swap
2025-01-14 16:54 ` Michal Hocko
2025-01-14 16:56 ` Rik van Riel
@ 2025-01-14 16:56 ` Michal Hocko
1 sibling, 0 replies; 29+ messages in thread
From: Michal Hocko @ 2025-01-14 16:56 UTC (permalink / raw)
To: Johannes Weiner
Cc: Yosry Ahmed, Rik van Riel, Balbir Singh, Roman Gushchin,
Shakeel Butt, Muchun Song, Andrew Morton, cgroups, linux-mm,
linux-kernel, kernel-team, Nhat Pham
On Tue 14-01-25 17:54:17, Michal Hocko wrote:
> On Tue 14-01-25 17:46:39, Michal Hocko wrote:
> > On Tue 14-01-25 11:09:55, Johannes Weiner wrote:
> > > Hi,
> > >
> > > On Mon, Dec 16, 2024 at 04:39:12PM +0100, Michal Hocko wrote:
> > > > On Thu 12-12-24 13:30:12, Johannes Weiner wrote:
> > [...]
> > > > > If we return -ENOMEM to an OOM victim in a fault, the fault handler
> > > > > will re-trigger OOM, which will find the existing OOM victim and do
> > > > > nothing, then restart the fault.
> > > >
> > > > IIRC the task will handle the pending SIGKILL if the #PF fails. If the
> > > > charge happens from the exit path then we rely on ENOMEM returned from
> > > > gup as a signal to back off. Do we have any caller that keeps retrying
> > > > on ENOMEM?
> > >
> > > We managed to extract a stack trace of the livelocked task:
> > >
> > > obj_cgroup_may_swap
> > > zswap_store
> > > swap_writepage
> > > shrink_folio_list
> > > shrink_lruvec
> > > shrink_node
> > > do_try_to_free_pages
> > > try_to_free_mem_cgroup_pages
> >
> > OK, so this is the reclaim path and it fails due to reasons you mention
> > below. This will retry several times until it hits mem_cgroup_oom which
> > will bail in mem_cgroup_out_of_memory because of task_is_dying (returns
> > true) and retry the charge + reclaim (as the oom killer hasn't done
> > anything) with passed_oom = true this time and eventually got to nomem
> > path and returns ENOMEM.
>
> Btw. is there any actual reason why we cannot go to nomem without going
> to the oom killer (just to bail out) and through the whole cycle again?
> That seems arbitrary and simply burns a lot of cycles without much
> chance of a better outcome
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 7b3503d12aaf..eb45eaf0acfc 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2268,8 +2268,7 @@ int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
> if (gfp_mask & __GFP_RETRY_MAYFAIL)
> goto nomem;
>
> - /* Avoid endless loop for tasks bypassed by the oom killer */
> - if (passed_oom && task_is_dying())
> + if (task_is_dying())
> goto nomem;
>
> /*
Just to clarify: only if we have strong reasons to keep the bail-out in
the oom killer path. If we go with the change proposed in the other
email, this doesn't make sense.
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 29+ messages in thread