From: Thomas Gleixner <tglx@linutronix.de>
To: Michal Hocko <mhocko@suse.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>,
Nico Pache <npache@redhat.com>,
linux-mm@kvack.org, Andrea Arcangeli <aarcange@redhat.com>,
Joel Savitz <jsavitz@redhat.com>,
Andrew Morton <akpm@linux-foundation.org>,
linux-kernel@vger.kernel.org, Rafael Aquini <aquini@redhat.com>,
Waiman Long <longman@redhat.com>, Baoquan He <bhe@redhat.com>,
Christoph von Recklinghausen <crecklin@redhat.com>,
Don Dutile <ddutile@redhat.com>,
"Herton R . Krzesinski" <herton@redhat.com>,
Ingo Molnar <mingo@redhat.com>,
Peter Zijlstra <peterz@infradead.org>,
Darren Hart <dvhart@infradead.org>,
Andre Almeida <andrealmeid@collabora.com>,
David Rientjes <rientjes@google.com>
Subject: Re: [PATCH v5] mm/oom_kill.c: futex: Close a race between do_exit and the oom_reaper
Date: Wed, 23 Mar 2022 11:30:49 +0100 [thread overview]
Message-ID: <87ils59d0m.ffs@tglx> (raw)
In-Reply-To: <YjrlqAMyJg3GKZVs@dhcp22.suse.cz>
Michal!
On Wed, Mar 23 2022 at 10:17, Michal Hocko wrote:
> Let me skip over futex part which I need to digest and only focus on the
> oom side of the things for clarification.
The most important thing to know about futexes is: They are cursed.
> On Tue 22-03-22 23:43:18, Thomas Gleixner wrote:
> [...]
>> > While some places can be handled by changing uninterruptible waiting to
>> > killable there are places which are not really fixable, e.g. lock
>> > chain dependency which leads to memory allocation.
>>
>> I'm not following. Which lock chain dependency causes memory allocation?
>
> Consider an oom victim is blocked on a lock or waiting for an event to
> happen but the lock holder is stuck allocating or the wake up depends on
> an allocation. Many sleeping locks are doing GFP_KERNEL allocations.
Fair enough.
>> That will prevent the enforced race in most cases and allow the exiting
>> and/or killed processes to clean up themselves. Not pretty, but it should
>> reduce the chance of the reaper to win the race with the exiting and/or
>> killed process significantly.
>>
>> It's not going to work when the problem is combined with a heavy VM
>> overload situation which keeps a guest (or one/some of its vCPUs) away
>> from being scheduled. See below for a discussion of guarantees.
>>
>> If it failed to do so when the sleep returns, then you still can reap
>> it.
>
> Yes, this is certainly an option. Please note that the oom_reaper is not
> the only way to trigger this. process_mrelease syscall performs the same
> operation from the userspace. Arguably process_mrelease could be used
> sanely/correctly because the userspace oom killer can do pre-cleanup
> steps before going to final SIGKILL & process_mrelease. One way would be
> to send SIGTERM in the first step and allow the victim to perform its
> cleanup.
A potential staged approach would be:

    Send SIGTERM
    wait some time
    Send SIGKILL
    wait some time
    sys_process_mrelease()

Needs proper documentation though.
>> That said, the robust list is no guarantee. It's a best effort approach
>> which works well most of the time at least for the "normal" issues where
>> a task holding a futex dies unexpectedly. But there is no guarantee that
>> it works under all circumstances, e.g. OOM.
>
> OK, so this is an important note. I am all fine by documenting this
> restriction. It is not like oom victims couldn't cause other disruptions
> by leaving inconsistent/stale state behind.
Correct. Futexes are a small part of the overall damage.
>> Wrong. The whole point of robust lists is to handle the "normal" case
>> gracefully. A process being OOM killed is _NOT_ in the "normal"
>> category.
>>
>> Neither is it "normal" that a VM is scheduled out long enough to miss a
>> 1 second deadline. That might be considered normal by cloud folks, but
>> that's absolutely not normal from an OS POV. Again, that's not an OS
>> problem, that's an operator/admin problem.
>
> Thanks for this clarification. I would tend to agree. Following a
> previous example that oom victims can leave inconsistent state behind
> which can influence other processes. I am wondering what kind of
> expectations about the lock protected state can we make when the holder
> of the lock has been interrupted at any random place in the critical
> section.
Right. If a futex is released by the exit cleanup, it is marked with
FUTEX_OWNER_DIED. User space is supposed to handle this.
pthread_mutex_lock() returns EOWNERDEAD to the caller if the owner died
bit is set. It's the caller's responsibility to deal with potentially
corrupted or inconsistent state.
>> So let me summarize what I think needs to be done in priority order:
>>
>> #1 Delay the oom reaper so the normal case of a process being able to
>> exit is not subject to a pointless race to death.
>>
>> #2 If #1 does not result in the desired outcome, reap the mm (which is
>> what we have now).
>>
>> #3 If it's expected that #2 will allow the stuck process to make
>> progress on the way towards cleanup, then do not reap any VMA
>> containing a robust list head of a thread which belongs to the
>> exiting and/or killed process.
>>
>> The remaining cases, i.e. the lock chain example I pointed out above or
>> the stuck forever task are going to be rare and fall under the
>> collateral damage and no guarantee rule.
>
> I do agree that delaying oom_reaper wake up is the simplest thing to do
> at this stage and it could catch most of the failures. We still have the
> process_mrelease syscall case but I guess we can document this as a
> caveat into the man page.
Yes. The user space oom killer should better know what it is doing :)
Thanks,
tglx