From: Nico Pache <npache@redhat.com>
To: Waiman Long <longman@redhat.com>, Michal Hocko <mhocko@suse.com>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	akpm@linux-foundation.org, jsavitz@redhat.com,
	peterz@infradead.org, tglx@linutronix.de, mingo@redhat.com,
	dvhart@infradead.org, dave@stgolabs.net,
	andrealmeid@collabora.com
Subject: Re: [PATCH v3] mm/oom: do not oom reap task with an unresolved robust futex
Date: Mon, 17 Jan 2022 17:56:28 -0500	[thread overview]
Message-ID: <43a6c470-9fc2-6195-9a25-5321d17540e5@redhat.com> (raw)
In-Reply-To: <ad639326-bea8-9bfb-23e3-4e2b216d9645@redhat.com>



On 1/17/22 11:05, Waiman Long wrote:
> On 1/17/22 03:52, Michal Hocko wrote:
>> On Fri 14-01-22 13:01:35, Nico Pache wrote:
>>> In the case that two or more processes share a futex located within
>>> a shared mmaped region, such as a process that shares a lock between
>>> itself and child processes, we have observed that when a process holding
>>> the lock is oom killed, at least one waiter is never alerted to this new
>>> development and simply continues to wait.
>>>
>>> This is visible via pthreads by checking the __owner field of the
>>> pthread_mutex_t structure within a waiting process, perhaps with gdb.
>>>
>>> We identify reproduction of this issue by checking a waiting process of
>>> a test program and viewing the contents of the pthread_mutex_t, taking note
>>> of the value in the owner field, and then checking dmesg to see if the
>>> owner has already been killed.
>> I believe we really need to find out why the original holder of the
>> futex is not woken up to release the lock when exiting.
> 
> For a robust futex lock holder or waiter that is to be killed, it is not the
> responsibility of the task itself to wake up and release the lock. It is the
> kernel that recognizes that the task is holding or waiting for the robust futex
> and cleans things up.
> 
> 
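To make the scenario concrete for anyone following along, the setup described
in the commit message is roughly the following (a minimal sketch, not the
linked reproducer itself; error checking omitted). In the non-racy case it is
the kernel's exit-time robust list handling that lets the survivor observe
EOWNERDEAD:

  #include <errno.h>
  #include <pthread.h>
  #include <stdio.h>
  #include <sys/mman.h>
  #include <sys/wait.h>
  #include <unistd.h>

  int main(void)
  {
      pthread_mutexattr_t attr;
      pthread_mutex_t *m;

      /* The futex word lives in a shared anonymous mapping. */
      m = mmap(NULL, sizeof(*m), PROT_READ | PROT_WRITE,
               MAP_SHARED | MAP_ANONYMOUS, -1, 0);

      pthread_mutexattr_init(&attr);
      pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
      pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
      pthread_mutex_init(m, &attr);

      if (fork() == 0) {
          /* The child takes the lock and dies while holding it. */
          pthread_mutex_lock(m);
          _exit(0);
      }
      wait(NULL);

      /*
       * When the child's robust list is processed at exit, the futex
       * word is marked FUTEX_OWNER_DIED and the lock attempt below
       * returns EOWNERDEAD. In the failure described above, a waiter
       * that is already blocked on the futex never gets the
       * corresponding wakeup.
       */
      if (pthread_mutex_lock(m) == EOWNERDEAD) {
          pthread_mutex_consistent(m);
          printf("recovered lock from dead owner\n");
      }
      return 0;
  }

(Build with -pthread; the real reproducer additionally needs the oom kill to
race with the owner's exit path.)
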
>>> As mentioned by Michal in his patchset introducing the oom reaper,
>>> commit aac4536355496 ("mm, oom: introduce oom reaper"), the purpose of the
>>> oom reaper is to try and free memory more quickly; however, in the case
>>> that a robust futex is being used, we want to avoid utilizing the
>>> concurrent oom reaper. This is due to a race that can occur between the
>>> SIGKILL handling the robust futex, and the oom reaper freeing the memory
>>> needed to maintain the robust list.
>> The OOM reaper only unmaps private memory. It doesn't touch shared
>> mappings. So how could it interfere?
>>
> The futex itself may be in shared memory; however, the robust list entry can be
> in private memory. So when the robust list is being scanned in this case, we can
> end up in a use-after-free situation.

I believe this is true. The userspace allocation for the pthread structure is
made as a private mapping:
https://elixir.bootlin.com/glibc/latest/source/nptl/allocatestack.c#L368
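
In other words, the futex word can sit in the MAP_SHARED region while
everything the kernel has to walk at exit time sits in the thread's private
mapping. Roughly, as I understand it (the structures are the UAPI ones from
include/uapi/linux/futex.h; register_robust_list() is only an illustrative
name for what glibc does per thread):

  #include <linux/futex.h>        /* struct robust_list_head */
  #include <sys/syscall.h>
  #include <unistd.h>

  /*
   * include/uapi/linux/futex.h:
   *
   *     struct robust_list { struct robust_list *next; };
   *     struct robust_list_head {
   *             struct robust_list list;
   *             long futex_offset;
   *             struct robust_list *list_op_pending;
   *     };
   *
   * glibc keeps a robust_list_head inside each thread's struct pthread,
   * i.e. inside the private stack/TLS allocation linked above, and
   * registers it with the kernel:
   */
  static void register_robust_list(struct robust_list_head *head)
  {
      /* head, and the list entries it chains, point into private memory */
      syscall(SYS_set_robust_list, head, sizeof(*head));
  }

So if exit_robust_list() only runs after the oom reaper has already reaped that
private mapping, it no longer sees the real entries, handle_futex_death() never
marks the futex word in the shared mapping, and the waiters stay blocked.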

>>> In the case that the oom victim is using a robust futex and the exit path
>>> triggered by the SIGKILL has not yet handled the futex death, the
>>> tsk->robust_list should be non-NULL. This issue can be tricky to reproduce,
>>> but with the modification in this patch we have been unable to reproduce it.
>> We really need a deeper analysis of the underlying problem because right
>> now I do not really see why the oom reaper should interfere with shared
>> futex.
> As I said above, the robust list processing can involve private memory.
>>
>>> Add a check in wake_oom_reaper() for a non-NULL tsk->robust_list and return
>>> early to avoid waking the oom reaper.
>>>
>>> Reproducer: https://gitlab.com/jsavitz/oom_futex_reproducer
>>>
>>> Co-developed-by: Joel Savitz <jsavitz@redhat.com>
>>> Signed-off-by: Joel Savitz <jsavitz@redhat.com>
>>> Signed-off-by: Nico Pache <npache@redhat.com>
>>> ---
>>>   mm/oom_kill.c | 15 +++++++++++++++
>>>   1 file changed, 15 insertions(+)
>>>
>>> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
>>> index 1ddabefcfb5a..3cdaac9c7de5 100644
>>> --- a/mm/oom_kill.c
>>> +++ b/mm/oom_kill.c
>>> @@ -667,6 +667,21 @@ static void wake_oom_reaper(struct task_struct *tsk)
>>>       if (test_and_set_bit(MMF_OOM_REAP_QUEUED, &tsk->signal->oom_mm->flags))
>>>           return;
>>> +
>>> +#ifdef CONFIG_FUTEX
>>> +    /*
>>> +     * If the oom victim's exit path has not yet finished handling
>>> +     * its robust futexes, it is not safe to reap the mm concurrently.
>>> +     * Do not wake the oom reaper while the task still has a robust
>>> +     * list registered.
>>> +     */
>>> +    if (tsk->robust_list)
>>> +        return;
>>> +#ifdef CONFIG_COMPAT
>>> +    if (tsk->compat_robust_list)
>>> +        return;
>>> +#endif
>>> +#endif
>> If this turns out to be really needed, which I do not really see at the
>> moment, then this is not the right way to handle this situation. The oom
>> victim could get stuck and the oom killer wouldn't be able to move
>> forward. If anything the victim would need to get MMF_OOM_SKIP set.

I will try this, but I don't immediately see any difference between returning
early here and setting the bit, queueing the task on the oom_reaper_list, and
then skipping it there based on the flag. Do you mind explaining how the early
return could lead to the oom killer getting stuck?
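
Also, to make sure I'm reading the suggestion correctly: is it roughly the
following (untested sketch on top of the hunk above)?

  #ifdef CONFIG_FUTEX
      if (tsk->robust_list) {
          /*
           * Let the exiting task's futex_cleanup() deal with the robust
           * list first, but set MMF_OOM_SKIP so the oom killer is free
           * to move on to another victim rather than wait for this mm
           * to be reaped.
           */
          set_bit(MMF_OOM_SKIP, &tsk->signal->oom_mm->flags);
          return;
      }
  #endif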

Cheers,
-- Nico
> 
> There can be other ways to do that, but letting the normal kill signal processing
> finish its job and properly invoke futex_cleanup() is certainly one possible
> solution.
> 
> Cheers,
> Longman
> 


