Date: Wed, 23 Mar 2022 12:11:57 +0100
From: Peter Zijlstra
To: Michal Hocko
Cc: Thomas Gleixner, Davidlohr Bueso, Nico Pache, linux-mm@kvack.org,
    Andrea Arcangeli, Joel Savitz, Andrew Morton,
    linux-kernel@vger.kernel.org, Rafael Aquini, Waiman Long, Baoquan He,
    Christoph von Recklinghausen, Don Dutile, "Herton R. Krzesinski",
    Ingo Molnar, Darren Hart, Andre Almeida, David Rientjes
Subject: Re: [PATCH v5] mm/oom_kill.c: futex: Close a race between do_exit and the oom_reaper

On Wed, Mar 23, 2022 at 10:17:28AM +0100, Michal Hocko wrote:
> > Neither is it "normal" that a VM is scheduled out long enough to miss a
> > 1 second deadline. That might be considered normal by cloud folks, but
> > that's absolutely not normal from an OS POV. Again, that's not an OS
> > problem, that's an operator/admin problem.
>
> Thanks for this clarification. I would tend to agree. Following a
> previous example that oom victims can leave inconsistent state behind
> which can influence other processes. I am wondering what kind of
> expectations about the lock protected state can we make when the holder
> of the lock has been interrupted at any random place in the critical
> section.

Right, this is why the new owner gets the OWNER_DIED bit so it can see
something really dodgy happened. Getting that means it needs to validate
state consistency or just print a nice error and fully terminate things.

So robust futexes:

 - rely on userspace to maintain a linked list of held locks,

 - rely on lock acquire to check OWNER_DIED and handle state
   inconsistency.

If userspace manages to screw up either one of those, it's game over.
Nothing we can do about it.

Software really has to be built to deal with this, it doesn't magically
work (IOW, in 99% of the cases it just doesn't work right).

> [...]
> > > And just to be clear, this is clearly a bug in the oom_reaper per se.
> > > Originally I thought that relaxing the locking (using trylock and
> > > retry/bail out on failure) would help but as I've learned earlier this
> > > day this is not really possible because of #PF at least. The most self
> > > contained solution would be to skip over vmas which are backing the
> > > robust list which would allow the regular exit path to do the proper
> > > cleanup.
> >
> > That's not sufficient because you have to guarantee that the relevant
> > shared futex is accessible. See the lock chain example above.
>
> Yeah, my previous understanding was that the whole linked list lives in
> the single mapping and we can just look at their addresses.

Nope; shared futexes live in shared memory and as such the robust_list
entry must live there too. That is, the robust_list entry is embedded in
the lock itself along the lines of:

	struct robust_mutex {
		u32 futex;
		struct robust_list list;
	};

and then you register the robust_list_head with:

	.futex_offset = offsetof(struct robust_mutex, futex) -
			offsetof(struct robust_mutex, list);

or somesuch (glibc does all this). And the locks themselves are spread
all over the place.
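
As a rough userspace sketch of why the offset is needed: the list only
links the embedded robust_list entries, so whoever walks it has to add
futex_offset to each entry address to reach the futex word. The helper
names below (robust_head_init, robust_list_add, entry_to_futex,
demo_futex_offset) are hypothetical. The sketch deliberately does not
call set_robust_list(), since that would replace the list glibc already
registered, and it leaves out the list_op_pending handling and the PI
bit in the entry pointer that the real protocol uses.

```c
#include <assert.h>
#include <linux/futex.h>	/* struct robust_list, struct robust_list_head */
#include <stddef.h>
#include <stdint.h>

/* Same layout as the struct above, with a userspace integer type. */
struct robust_mutex {
	uint32_t futex;
	struct robust_list list;
};

/* Empty circular list plus the entry-to-futex offset (negative here). */
static void robust_head_init(struct robust_list_head *head)
{
	head->list.next = &head->list;
	head->futex_offset = (long)offsetof(struct robust_mutex, futex) -
			     (long)offsetof(struct robust_mutex, list);
	head->list_op_pending = NULL;
}

/* What userspace does on lock acquire (simplified). */
static void robust_list_add(struct robust_list_head *head,
			    struct robust_mutex *m)
{
	m->list.next = head->list.next;
	head->list.next = &m->list;
}

/* How a list walker gets from a list entry back to the futex word. */
static uint32_t *entry_to_futex(struct robust_list *entry, long futex_offset)
{
	return (uint32_t *)((char *)entry + futex_offset);
}

/* Walks the list and sums the futex words reached through the offset. */
static int demo_futex_offset(void)
{
	struct robust_list_head head;
	struct robust_mutex a = { .futex = 1 }, b = { .futex = 2 };
	struct robust_list *e;
	int sum = 0;

	robust_head_init(&head);
	assert(head.futex_offset < 0);	/* futex precedes the list entry */
	robust_list_add(&head, &a);
	robust_list_add(&head, &b);

	for (e = head.list.next; e != &head.list; e = e->next)
		sum += (int)*entry_to_futex(e, head.futex_offset);

	return sum;
}
```

In this sketch both locks sit on the stack; with shared futexes each
entry would sit in whatever shared mapping holds its lock, which is why
the entries cannot be assumed to live in a single vma.
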