From: Yosry Ahmed <yosryahmed@google.com>
To: Rik van Riel <riel@surriel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>,
Michal Hocko <mhocko@kernel.org>,
Roman Gushchin <roman.gushchin@linux.dev>,
Shakeel Butt <shakeel.butt@linux.dev>,
Muchun Song <muchun.song@linux.dev>,
Andrew Morton <akpm@linux-foundation.org>,
cgroups@vger.kernel.org, linux-mm@kvack.org,
linux-kernel@vger.kernel.org, kernel-team@meta.com,
Nhat Pham <nphamcs@gmail.com>
Subject: Re: [PATCH] memcg: allow exiting tasks to write back data to swap
Date: Wed, 11 Dec 2024 09:00:41 -0800 [thread overview]
Message-ID: <CAJD7tkYpk4kZChj9f-2EMp0XET6OUNbHqfVBgdFTEMnN+iomww@mail.gmail.com> (raw)
In-Reply-To: <768a404c6f951e09c4bfc93c84ee1553aa139068.camel@surriel.com>
On Wed, Dec 11, 2024 at 8:34 AM Rik van Riel <riel@surriel.com> wrote:
>
> On Wed, 2024-12-11 at 08:26 -0800, Yosry Ahmed wrote:
> > On Wed, Dec 11, 2024 at 7:54 AM Rik van Riel <riel@surriel.com>
> > wrote:
> > >
> > > +++ b/mm/memcontrol.c
> > > @@ -5371,6 +5371,15 @@ bool
> > > mem_cgroup_zswap_writeback_enabled(struct mem_cgroup *memcg)
> > > if (!zswap_is_enabled())
> > > return true;
> > >
> > > + /*
> > > + * Always allow exiting tasks to push data to swap. A
> > > process in
> > > + * the middle of exit cannot get OOM killed, but may need
> > > to push
> > > + * uncompressible data to swap in order to get the cgroup
> > > memory
> > > + * use below the limit, and make progress with the exit.
> > > + */
> > > + if ((current->flags & PF_EXITING) && memcg ==
> > > mem_cgroup_from_task(current))
> > > + return true;
> > > +
> >
> > I have a few questions:
> > (a) If the task is being OOM killed it should be able to charge
> > memory
> > beyond memory.max, so why do we need to get the usage down below the
> > limit?
> >
> If it is a kernel directed memcg OOM kill, that is
> true.
>
> However, if the exit comes from somewhere else,
> like a userspace oomd kill, we might not hit that
> code path.
Why do we treat dying tasks differently based on the source of the kill?
>
> > Looking at the other thread with Michal, it looks like it's because
> > we
> > have to go into reclaim first before we get to the point of force
> > charging for dying tasks, and we spend too much time in reclaim. Is
> > that correct?
> >
> > If that's the case, I am wondering if the real problem is that we
> > check mem_cgroup_zswap_writeback_enabled() too late in the process.
> > Reclaim ages the LRUs, isolates pages, unmaps them, allocates swap
> > entries, only to realize it cannot swap in swap_writepage().
> >
> > Should we check for this in can_reclaim_anon_pages()? If zswap
> > writeback is disabled and we are already at the memcg limit (or zswap
> > limit for that matter), we should avoid scanning anon memory to begin
> > with. The problem is that if we race with memory being freed we may
> > have some extra OOM kills, but I am not sure how common this case
> > would be.
>
> However, we don't know until the attempted zswap write
> whether the memory is compressible, and whether doing
> a bunch of zswap writes will help us bring our memcg
> down below its memory.max limit.
If we are at memory.max (or memory.zswap.max), we can't compress pages
into zswap anyway, regardless of their compressibility.
So what I am saying is, if we are already at the limit (pages cannot
go into zswap), and writeback is disabled (pages cannot go into
swapfiles), then we should probably avoid scanning the anon LRUs and
spending all those wasted cycles trying to isolate, unmap, and reclaim
them only to fail at the last step.
>
> >
> > (b) Should we use mem_cgroup_is_descendant() or mm_match_memcg() in
> > case we are reclaiming from an ancestor and we hit the limit of that
> > ancestor?
> >
> I don't know if we need or want to reclaim from any
> other memcgs than those of the exiting process itself.
>
> A small blast radius seems like it could be desirable,
> but I'm open to other ideas :)
The exiting process is part of all the ancestor cgroups by the hierarchy.
If we have the following hierarchy:
root
|
A
|
B
Then a process in cgroup B could be getting OOM killed due to hitting
the limit of A, not B. In which case, reclaiming from A helps us get
below the limit. We can check if the cgroup is an ancestor and it hit
its limit, but maybe that's an overkill.
>
> > (c) mem_cgroup_from_task() should be called in an RCU read section
> > (or
> > we need something like rcu_access_point() if we are not dereferencing
> > the pointer).
> >
> I'll add this in v2.
>
> --
> All Rights Reversed.
next prev parent reply other threads:[~2024-12-11 17:01 UTC|newest]
Thread overview: 12+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-12-11 15:53 Rik van Riel
2024-12-11 16:26 ` Yosry Ahmed
2024-12-11 16:34 ` Rik van Riel
2024-12-11 17:00 ` Yosry Ahmed [this message]
2024-12-11 17:19 ` Rik van Riel
2024-12-11 17:30 ` Yosry Ahmed
2024-12-11 17:49 ` Rik van Riel
2024-12-11 23:15 ` Balbir Singh
2024-12-12 1:21 ` Rik van Riel
2024-12-12 3:25 ` Balbir Singh
2024-12-12 14:03 ` Rik van Riel
2024-12-12 23:39 ` Balbir Singh
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=CAJD7tkYpk4kZChj9f-2EMp0XET6OUNbHqfVBgdFTEMnN+iomww@mail.gmail.com \
--to=yosryahmed@google.com \
--cc=akpm@linux-foundation.org \
--cc=cgroups@vger.kernel.org \
--cc=hannes@cmpxchg.org \
--cc=kernel-team@meta.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mhocko@kernel.org \
--cc=muchun.song@linux.dev \
--cc=nphamcs@gmail.com \
--cc=riel@surriel.com \
--cc=roman.gushchin@linux.dev \
--cc=shakeel.butt@linux.dev \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox