From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 79390C433EF for ; Mon, 21 Mar 2022 08:56:02 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 7B9D26B0072; Mon, 21 Mar 2022 04:56:01 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 7428F6B0073; Mon, 21 Mar 2022 04:56:01 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 5BD936B0074; Mon, 21 Mar 2022 04:56:01 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (relay.hostedemail.com [64.99.140.28]) by kanga.kvack.org (Postfix) with ESMTP id 49C4C6B0072 for ; Mon, 21 Mar 2022 04:56:01 -0400 (EDT) Received: from smtpin09.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay06.hostedemail.com (Postfix) with ESMTP id 2015F23BA9 for ; Mon, 21 Mar 2022 08:56:01 +0000 (UTC) X-FDA: 79267786122.09.558E139 Received: from smtp-out2.suse.de (smtp-out2.suse.de [195.135.220.29]) by imf25.hostedemail.com (Postfix) with ESMTP id 8BFA9A001D for ; Mon, 21 Mar 2022 08:56:00 +0000 (UTC) Received: from relay2.suse.de (relay2.suse.de [149.44.160.134]) by smtp-out2.suse.de (Postfix) with ESMTP id 21E0E1F37C; Mon, 21 Mar 2022 08:55:59 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.com; s=susede1; t=1647852959; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=4lXn8H7X0Cxzr80o+IrzVtUQtim0twN1s6sfwci9lLw=; b=nfdEZ03cTSQUD18tX1cLxj3cH0gtcr2EperTXe2rkHlZ8do31B1BYzocA15T1xZRXd/CrM DtMwTsQJkldSHRP5Oq/Ntu6eju6gCJiYENn728M8KLbf/aQG8+VMzkNk8CQXHbfSHNNmGU XpmdlKGWI4+EU6At+MTHoG5ejhNaloI= Received: from suse.cz (unknown [10.100.201.86]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by relay2.suse.de (Postfix) with ESMTPS id 7E205A3B8A; Mon, 21 Mar 2022 08:55:58 +0000 (UTC) Date: Mon, 21 Mar 2022 09:55:57 +0100 From: Michal Hocko To: Nico Pache Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Rafael Aquini , Waiman Long , Baoquan He , Christoph von Recklinghausen , Don Dutile , "Herton R . Krzesinski" , Thomas Gleixner , Ingo Molnar , Peter Zijlstra , Darren Hart , Davidlohr Bueso , Andre Almeida , David Rientjes , Andrea Arcangeli , Andrew Morton , Joel Savitz Subject: Re: [PATCH v5] mm/oom_kill.c: futex: Close a race between do_exit and the oom_reaper Message-ID: References: <20220318033621.626006-1-npache@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20220318033621.626006-1-npache@redhat.com> X-Rspam-User: X-Rspamd-Server: rspam10 X-Rspamd-Queue-Id: 8BFA9A001D X-Stat-Signature: 8k9iqjkrna4iofe7cxonkxm1sqsjshmn Authentication-Results: imf25.hostedemail.com; dkim=pass header.d=suse.com header.s=susede1 header.b=nfdEZ03c; dmarc=pass (policy=quarantine) header.from=suse.com; spf=pass (imf25.hostedemail.com: domain of mhocko@suse.com designates 195.135.220.29 as permitted sender) smtp.mailfrom=mhocko@suse.com X-HE-Tag: 1647852960-326427 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On Thu 17-03-22 21:36:21, Nico Pache wrote: > The pthread struct is allocated on PRIVATE|ANONYMOUS memory [1] which can > be targeted by the oom reaper. This mapping is used to store the futex > robust list; the kernel does not keep a copy of the robust list and instead > references a userspace address to maintain the robustness during a process > death. A race can occur between exit_mm and the oom reaper that allows > the oom reaper to free the memory of the futex robust list before the > exit path has handled the futex death: > > CPU1 CPU2 > ------------------------------------------------------------------------ > page_fault > out_of_memory > do_exit "signal" > wake_oom_reaper > oom_reaper > oom_reap_task_mm (invalidates mm) > exit_mm > exit_mm_release > futex_exit_release > futex_cleanup > exit_robust_list > get_user (EFAULT- can't access memory) I still think it is useful to be explicit about the consequences of the EFAULT here. Did you want to mention that a failing get_user in this path would result in a hang because nobody is woken up when the current holder of the lock terminates. > While in the oom reaper thread, we must handle the futex cleanup without > sleeping. To achieve this, add the boolean `try` to futex_exit_begin(). > This will control weather or not we use a trylock. Also introduce > try_futex_exit_release() which will utilize the trylock version of the > futex_cleanup_begin(). Also call kthread_use_mm in this context to assure > the get_user call in futex_cleanup() does not return with EFAULT. This alone is not sufficient. get_user can sleep in the #PF handler path (e.g. by waiting for swap in). Or is there any guarantee that the page is never swapped out? If we cannot rule out #PF then this is not a viable way to address the problem I am afraid. [...] > @@ -587,6 +588,18 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm) > goto out_unlock; > } > > + /* We can't reap a process holding a robust_list; the pthread > + * struct is allocated in userspace using PRIVATE | ANONYMOUS > + * memory which when reaped before futex_cleanup() can leave > + * the waiting process stuck. Try to perform the futex_cleanup, > + * and if unsuccessful, skip the OOM reaping. > + */ > + if (task_has_robust_list(tsk) && !try_futex_exit_release(tsk)) { > + trace_skip_task_reaping(tsk->pid); > + pr_info("oom_reaper: skipping task as it contains a robust list"); > + goto out_finish; > + } > + > trace_start_task_reaping(tsk->pid); > > /* failed to reap part of the address space. Try again later */ Please also note that this all is done after mmap_lock has been already taken so a page fault could deadlock on the mmap_lock. The more I am thinking about this the more I am getting convinced that we should rather approach this differently and skip over vmas which can be holding the list. Have you considered this option? -- Michal Hocko SUSE Labs