Date: Fri, 7 Feb 2020 09:52:34 +0100
From: Peter Zijlstra
To: Matthew Wilcox
Cc: SeongJae Park, Michal Hocko, Vlastimil Babka, "Kirill A. Shutemov",
 linux-mm@kvack.org
Subject: Re: Re: Splitting the mmap_sem
Message-ID: <20200207085234.GB14914@hirez.programming.kicks-ass.net>
References: <20200109170715.GV4951@dhcp22.suse.cz>
 <20200109173206.3731-1-sj38.park@gmail.com>
 <20200109201320.GO6788@bombadil.infradead.org>
 <20200206135920.GS14914@hirez.programming.kicks-ass.net>
 <20200206201536.GX8731@bombadil.infradead.org>
 <20200206205529.GZ14914@hirez.programming.kicks-ass.net>
 <20200206212024.GB8731@bombadil.infradead.org>
In-Reply-To: <20200206212024.GB8731@bombadil.infradead.org>

On Thu, Feb 06, 2020 at 01:20:24PM -0800, Matthew Wilcox wrote:
> On Thu, Feb 06, 2020 at 09:55:29PM +0100, Peter Zijlstra wrote:
> > On Thu, Feb 06, 2020 at 12:15:36PM -0800, Matthew Wilcox wrote:
> > > then, at the beginning of a page fault call srcu_read_lock(&vma_srcu);
> > > walk the tree as we do now, allocate memory for PTEs, sleep waiting for
> > > pages to arrive back from disc, etc, etc, then at the end of the fault,
> > > call srcu_read_unlock(&vma_srcu).
> >
> > So far so good,...
> >
> > > munmap() would consist of removing the
> > > VMA from the tree, then calling synchronize_srcu() to wait for all faults
> > > to finish, then putting the backing file, etc, etc and freeing the VMA.
> >
> > call_srcu(), and the (s)rcu callback will then fput() and such things
> > more.
> >
> > synchronize_srcu() (like synchronize_rcu()) is stupid slow and would
> > make munmap()/exit()/etc.. unusable.
>
> I'll need to think about that a bit.  I was convinced we needed to wait
> for the current pagefaults to finish before we could return from munmap().
> I need to convince myself that it's OK to return to userspace while the
> page faults for that range are still proceeding on other CPUs.

File I/O might still be in progress, but any actual fault will result in
SIGSEGV instead of installing a PTE.

It is not fundamentally different from any threaded use-after-free race.

> > > This seems pretty reasonable, and investigation could actually proceed
> > > before the Maple tree work lands.  Today, that would be:
> > >
> > > srcu_read_lock(&vma_srcu);
> > > down_read(&mm->mmap_sem);
> > > find_vma(mm, address);
> > > up_read(&mm->mmap_sem);
> > > ... rest of fault handler path ...
> > > srcu_read_unlock(&vma_srcu);
> > >
> > > Kind of a pain because we still call find_vma() in the per-arch page
> > > fault handler, but for prototyping, we'd only have to do one or two
> > > architectures.
> >
> > If you look at the earlier speculative page-fault patches by Laurent,
> > which were based on my still earlier patches, you'll find most of this
> > there.
> >
> > The tricky bit was validating everything on the second page-table walk,
> > to see that nothing had fundamentally changed, specifically the VMA,
> > before installing the PTE. If you do this without mmap_sem, you need to
> > hold ptlock to pin stuff while validating everything you did earlier.
>
> The patches Laurent posted used regular RCU and a per-VMA refcount, not
> SRCU.

Those are his later patches, and I distinctly disagree with that
approach.

If you look at the patches here:

  https://lkml.kernel.org/r/cover.1479465699.git.ldufour@linux.vnet.ibm.com

you'll find it uses SRCU.

> If you use SRCU, why would you need a second page table walk?

Because SRCU only ensures that the VMA object remains extant; it does
not prevent modification of it. Normally that guarantee is provided by
mmap_sem, but we're not going to use that.

Instead, what we serialize on is the (split) ptlock.
So we do the first page-walk and ptlock to verify the vma-lookup, then
we drop ptlock and do the file-io, then we page-walk and take ptlock
again, verify the vma (again) and install the PTE. If anything goes
wrong, we bail.

See this patch:

  https://lkml.kernel.org/r/301fb863785f37c319b493bd0d43167353871804.1479465699.git.ldufour@linux.vnet.ibm.com
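
For readers following along, the walk / drop / re-walk sequence above can
be sketched as a single-threaded userspace toy model. This is not kernel
code and it is not taken from the linked patch: struct toy_vma, struct
toy_pt, the seq counter, speculative_fault() and the two io_* callbacks
are all hypothetical names made up for illustration, and the "ptlock" is a
plain flag rather than a real spinlock. How the real patches detect that
the VMA changed is not spelled out in this mail; a modification counter is
only one assumed way of doing it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins; none of these are the kernel's own types. */
struct toy_vma {
	unsigned long start, end;	/* covers [start, end) */
	unsigned int seq;		/* assumption: bumped on every change */
	bool unmapped;			/* set once removed from the tree */
};

struct toy_pt {
	bool locked;			/* stands in for the (split) ptlock */
	bool pte_present;
};

enum fault_result { FAULT_INSTALLED, FAULT_BAIL };

/* Callback standing in for file I/O done with no lock held. */
typedef void (*io_fn)(struct toy_vma *vma);

static void pt_lock(struct toy_pt *pt)
{
	assert(!pt->locked);
	pt->locked = true;
}

static void pt_unlock(struct toy_pt *pt)
{
	assert(pt->locked);
	pt->locked = false;
}

/* True if the VMA is still mapped, unmodified since 'seq', and covers addr. */
static bool vma_still_valid(const struct toy_vma *vma, unsigned long addr,
			    unsigned int seq)
{
	return !vma->unmapped && vma->seq == seq &&
	       addr >= vma->start && addr < vma->end;
}

static enum fault_result speculative_fault(struct toy_vma *vma,
					   struct toy_pt *pt,
					   unsigned long addr, io_fn do_io)
{
	unsigned int seq;

	/* First walk: take ptlock, verify the lookup, remember what we saw. */
	pt_lock(pt);
	seq = vma->seq;
	if (!vma_still_valid(vma, addr, seq)) {
		pt_unlock(pt);
		return FAULT_BAIL;
	}
	pt_unlock(pt);

	/* File I/O with no lock held: the VMA may change under us here. */
	if (do_io != NULL)
		do_io(vma);

	/* Second walk: retake ptlock, verify again, only then install. */
	pt_lock(pt);
	if (!vma_still_valid(vma, addr, seq)) {
		pt_unlock(pt);
		return FAULT_BAIL;	/* anything went wrong: bail */
	}
	pt->pte_present = true;
	pt_unlock(pt);
	return FAULT_INSTALLED;
}

/* I/O during which nothing else touches the VMA. */
static void io_quiet(struct toy_vma *vma)
{
	(void)vma;
}

/* I/O during which a concurrent munmap() is simulated. */
static void io_racing_munmap(struct toy_vma *vma)
{
	vma->unmapped = true;
	vma->seq++;
}
```

Calling speculative_fault() with io_quiet models the uncontended case and
installs the PTE; calling it with io_racing_munmap models a munmap() that
lands while the I/O is in flight, so the second verification fails and
nothing is installed. In the real thing the caller would then raise the
fault error or fall back to the regular path.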