linux-mm.kvack.org archive mirror
From: Michal Hocko <mhocko@suse.com>
To: Haifeng Xu <haifeng.xu@shopee.com>
Cc: hannes@cmpxchg.org, roman.gushchin@linux.dev,
	shakeelb@google.com, cgroups@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH 1/2] memcg, oom: unmark under_oom after the oom killer is done
Date: Wed, 27 Sep 2023 15:36:56 +0200
Message-ID: <ZRQv+E1plKLj8Xe3@dhcp22.suse.cz>
In-Reply-To: <fe80b246-3f92-2a83-6e50-3b923edce27c@shopee.com>

On Tue 26-09-23 22:39:11, Haifeng Xu wrote:
> 
> 
> On 2023/9/25 20:37, Michal Hocko wrote:
> > On Mon 25-09-23 20:28:02, Haifeng Xu wrote:
> >>
> >>
> >> On 2023/9/25 19:38, Michal Hocko wrote:
> >>> On Mon 25-09-23 17:03:05, Haifeng Xu wrote:
> >>>>
> >>>>
> >>>> On 2023/9/25 15:57, Michal Hocko wrote:
> >>>>> On Fri 22-09-23 07:05:28, Haifeng Xu wrote:
> >>>>>> When an application in userland receives an oom notification from the
> >>>>>> kernel and reads the oom_control file, it is confusing that under_oom is 0
> >>>>>> even though the oom killer hasn't finished. The reason is that under_oom
> >>>>>> is cleared before invoking mem_cgroup_out_of_memory(), so move the
> >>>>>> clearing of under_oom to after oom handling completes. That way,
> >>>>>> the value of under_oom won't mislead users.
> >>>>>
> >>>>> I do not really remember why we are doing it this way but trying to track
> >>>>> this down shows that we have been doing that since fb2a6fc56be6 ("mm:
> >>>>> memcg: rework and document OOM waiting and wakeup"). So this is an
> >>>>> established behavior for 10 years now. Do we really need to change it
> >>>>> now? The interface is legacy and hopefully no new workloads are
> >>>>> emerging.
> >>>>>
> >>>>> I agree that the placement is surprising but I would rather not change
> >>>>> it unless there is a very good reason to. Do you have an actual
> >>>>> workload which depends on the ordering? And if so, how do you deal with
> >>>>> the timing when the consumer of the notification only gets woken up after
> >>>>> mem_cgroup_out_of_memory completes?
> >>>>
> >>>> Yes, when the oom event is triggered, we check under_oom every 10 seconds. If it
> >>>> has been cleared, we then create a new process with a smaller memory allocation to avoid another oom.
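
For concreteness, that flow looks roughly like the sketch below from userspace:
register the v1 oom notification through cgroup.event_control and then poll
under_oom from memory.oom_control. The mount point and the "mygroup" name are
placeholders, and the 10-second period simply mirrors what is described above.

/*
 * Minimal sketch, not production code: wait for a memcg v1 oom
 * notification, then poll under_oom until it drops back to 0.
 * GRP is a placeholder path; adjust it to the real hierarchy.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/eventfd.h>
#include <unistd.h>

#define GRP "/sys/fs/cgroup/memory/mygroup"

static int read_under_oom(void)
{
	char line[256];
	int val = -1;
	FILE *f = fopen(GRP "/memory.oom_control", "r");

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f))
		if (sscanf(line, "under_oom %d", &val) == 1)
			break;
	fclose(f);
	return val;
}

int main(void)
{
	char reg[64];
	uint64_t cnt;
	int efd = eventfd(0, 0);
	int ofd = open(GRP "/memory.oom_control", O_RDONLY);
	int cfd = open(GRP "/cgroup.event_control", O_WRONLY);

	if (efd < 0 || ofd < 0 || cfd < 0) {
		perror("setup");
		return 1;
	}
	/* v1 registration format: "<eventfd fd> <oom_control fd>" */
	snprintf(reg, sizeof(reg), "%d %d", efd, ofd);
	if (write(cfd, reg, strlen(reg)) < 0) {
		perror("cgroup.event_control");
		return 1;
	}

	/* blocks until an oom event fires in the group */
	if (read(efd, &cnt, sizeof(cnt)) != sizeof(cnt))
		return 1;
	while (read_under_oom() == 1)	/* poll as described above */
		sleep(10);

	printf("under_oom cleared, relaunch with a smaller footprint\n");
	return 0;
}

Note that, as pointed out in the reply quoted below, under_oom can stay at 1
well after mem_cgroup_out_of_memory() returns, until memory is actually
uncharged, so this loop may keep spinning for longer than the kill itself.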
> >>>
> >>> OK, I do understand what you mean and I could have made myself
> >>> clearer previously. Even if the state is cleared _after_
> >>> mem_cgroup_out_of_memory, you won't get what you need, I am
> >>> afraid. The memcg stays under OOM until memory is freed (uncharged)
> >>> from that memcg. mem_cgroup_out_of_memory itself doesn't really free
> >>> any memory on its own. It relies on the task waking up and dying, or on
> >>> the oom_reaper doing the work on its behalf. All of that is time dependent.
> >>> under_oom would have to be reimplemented to be cleared when memory is
> >>> uncharged to meet your demands, something that has never really been
> >>> the semantic.
> >>>
> >>
> >> Yes, but at least before we create the new process there is a better chance that some memory has been freed.
> > 
> > The time window we are talking about is the call to
> > mem_cgroup_out_of_memory which, depending on the number of evaluated
> > processes, could be very short. So what kind of practical
> > difference does this make for your workload? Is this measurable in any
> > way?
> 
> The oom events in this group seem less frequent than before.

Let me see if I follow. You are launching new workloads after an oom
happens, as soon as under_oom becomes 0. With the patch applied you see
fewer oom invocations, which implies that fewer re-launches hit the
still-under-oom situation? I would also expect that those are compared
over the same time period. Do you have any actual numbers to present?
Are they statistically representative?

I really have to say that I am skeptical about the presented use case.
Optimizing around oom events seems like a very wrong way to scale a
workload. The timing of oom handling is subject to change at any time,
and so might whatever you are optimizing for.

That being said, I do not see any obvious problem with the patch. IMO we
should rather not apply it because it slightly changes a long-term
behavior for something that is in legacy mode now. But I will not Nack
it either, as it is just a trivial thing. I just do not like the idea of
changing the timing of under_oom clearing just to fine-tune some
workloads.
 
> >>> Btw. is this something new that you are developing on top of v1? And if
> >>> yes, why don't you use v2?
> >>>
> >>
> >> yes, v2 doesn't have the "cgroup.event_control" file.
> > 
> > Yes, it doesn't. But why is it necessary? Relying on v1 just for this is
> > far from ideal as v1 is deprecated and mostly frozen. Why do you need to
> > rely on the oom notifications (or oom behavior in general) in the first
> > place? Could you share more about your workload and your requirements?
> > 
> 
> For example, we want to run processes in the group but the parameters related to
> memory allocation are hard to decide, so we use the notifications to tell us when we
> need to adjust the parameters automatically, instead of having to create the new processes
> manually.

I do understand that, but OOM is just way too late to tune anything
upon. Cgroup v2 has the notion of a high limit which can throttle memory
allocations well before the hard limit is hit, and this, along with PSI
metrics, could give you much better insight into the memory pressure
in a memcg.
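
A rough sketch of what that can look like from userspace, assuming a v2 group
mounted at /sys/fs/cgroup/mygroup; the 512M high limit and the avg10 threshold
below are arbitrary values picked purely for illustration:

/*
 * Minimal sketch: cap the group with memory.high so allocations get
 * throttled and reclaimed early, then read the PSI memory.pressure
 * file to decide when to adapt the workload. Paths and numbers are
 * placeholders.
 */
#include <stdio.h>

#define GRP "/sys/fs/cgroup/mygroup"

int main(void)
{
	char line[256];
	double avg10 = 0.0;
	FILE *f;

	/* Throttle well before memory.max would ever be reached. */
	f = fopen(GRP "/memory.high", "w");
	if (!f) {
		perror("memory.high");
		return 1;
	}
	fputs("512M\n", f);
	fclose(f);

	/* PSI format: "some avg10=... avg60=... avg300=... total=..." */
	f = fopen(GRP "/memory.pressure", "r");
	if (!f) {
		perror("memory.pressure");
		return 1;
	}
	while (fgets(line, sizeof(line), f))
		if (sscanf(line, "some avg10=%lf", &avg10) == 1)
			break;
	fclose(f);

	if (avg10 > 10.0)	/* arbitrary threshold for illustration */
		printf("memory pressure building (some avg10=%.2f), adapt now\n",
		       avg10);
	return 0;
}

The point is that memory.high triggers throttling and reclaim gradually, so the
pressure numbers give a signal to adapt long before memory.max and the oom
killer come into play.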

-- 
Michal Hocko
SUSE Labs



Thread overview: 14+ messages
2023-09-22  7:05 Haifeng Xu
2023-09-22 23:17 ` Roman Gushchin
2023-09-23  8:05   ` Haifeng Xu
2023-09-25  7:57 ` Michal Hocko
2023-09-25  9:03   ` Haifeng Xu
2023-09-25 11:38     ` Michal Hocko
2023-09-25 12:28       ` Haifeng Xu
2023-09-25 12:37         ` Michal Hocko
2023-09-26 14:39           ` Haifeng Xu
2023-09-27 13:36             ` Michal Hocko [this message]
2023-09-28  3:03               ` Haifeng Xu
2023-10-03  7:50                 ` Michal Hocko
2023-10-11  1:59                   ` Haifeng Xu
2023-10-25 21:48                     ` Andrew Morton
