From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-2.0 required=3.0 tests=DKIM_INVALID,DKIM_SIGNED, HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS, USER_AGENT_SANE_1 autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 6503FC47404 for ; Fri, 4 Oct 2019 11:42:56 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 1734620867 for ; Fri, 4 Oct 2019 11:42:56 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=fail reason="signature verification failed" (2048-bit key) header.d=infradead.org header.i=@infradead.org header.b="Kebviybu" DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 1734620867 Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=infradead.org Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id A4ACE6B0007; Fri, 4 Oct 2019 07:42:55 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 9D3748E0003; Fri, 4 Oct 2019 07:42:55 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 89A1C6B000A; Fri, 4 Oct 2019 07:42:55 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0076.hostedemail.com [216.40.44.76]) by kanga.kvack.org (Postfix) with ESMTP id 621FF6B0007 for ; Fri, 4 Oct 2019 07:42:55 -0400 (EDT) Received: from smtpin19.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay03.hostedemail.com (Postfix) with SMTP id ECA63824CA26 for ; Fri, 4 Oct 2019 11:42:54 +0000 (UTC) X-FDA: 76005915468.19.pigs22_367c67ee1a323 X-HE-Tag: pigs22_367c67ee1a323 X-Filterd-Recvd-Size: 7484 Received: from bombadil.infradead.org (bombadil.infradead.org [198.137.202.133]) by imf46.hostedemail.com (Postfix) with ESMTP for ; Fri, 4 Oct 2019 11:42:54 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=infradead.org; s=bombadil.20170209; h=In-Reply-To:Content-Transfer-Encoding :Content-Type:MIME-Version:References:Message-ID:Subject:Cc:To:From:Date: Sender:Reply-To:Content-ID:Content-Description:Resent-Date:Resent-From: Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID:List-Id:List-Help: List-Unsubscribe:List-Subscribe:List-Post:List-Owner:List-Archive; bh=WIvDWFtv2p/WZ58W6pXljHGiVfYoXqP27Nes4L6u4OM=; b=Kebviybu4eI80VBS6jZGt8Kia9 v/roRs8L242mCUK4xtYI4O+imHzI2lAOI39SuiNHc8S1fwlKmGAMn8npPxn94N7HECIm3tu0JvSGx nnjeGrRUnp9ZN7PEdoSSPhCKMrn8ZVUbD9DzYRsy8RBMQFSoHerrtdvSbrFn0e3z/FcPuFXr3iqgO rWeTgJXWD3+/H1p214aM+GJbVD9m8HWYXEpVOPHvcZcKAwzz5JgD4dGD6fgM6LIP2wG9480tfvcq/ SS3KO8+/ivJaZezrOFqCYQnpx4FpJvD5/0BGyKhYzhdklY+XRNHXlY3/Ru9gJfbdZKdvuDDm0Ppeg mr/QvjXA==; Received: from j217100.upc-j.chello.nl ([24.132.217.100] helo=noisy.programming.kicks-ass.net) by bombadil.infradead.org with esmtpsa (Exim 4.92.2 #3 (Red Hat Linux)) id 1iGLyp-0006q5-RM; Fri, 04 Oct 2019 11:42:40 +0000 Received: from hirez.programming.kicks-ass.net (hirez.programming.kicks-ass.net [192.168.1.225]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by noisy.programming.kicks-ass.net (Postfix) with ESMTPS id 57BA030034F; Fri, 4 Oct 2019 13:41:47 +0200 (CEST) Received: by hirez.programming.kicks-ass.net (Postfix, from userid 1000) id 05D67203E50D2; Fri, 4 Oct 2019 13:42:37 +0200 (CEST) Date: Fri, 4 Oct 2019 13:42:36 +0200 From: Peter Zijlstra To: Leonardo Bras Cc: Song Liu , Michal Hocko , "Dmitry V. Levin" , Keith Busch , linux-mm@kvack.org, Paul Mackerras , Christoph Lameter , Ira Weiny , Dan Williams , Elena Reshetova , linux-arch@vger.kernel.org, Santosh Sivaraj , Davidlohr Bueso , "Aneesh Kumar K.V" , Bartlomiej Zolnierkiewicz , Mike Rapoport , Jason Gunthorpe , Vlastimil Babka , Mahesh Salgaonkar , Andrey Ryabinin , Alexey Dobriyan , Ingo Molnar , Andrea Arcangeli , Ralph Campbell , Arnd Bergmann , Jann Horn , John Hubbard , Jesper Dangaard Brouer , Nicholas Piggin , =?iso-8859-1?B?Suly9G1l?= Glisse , Mathieu Desnoyers , kvm-ppc@vger.kernel.org, Thomas Gleixner , Reza Arbab , Allison Randal , Christian Brauner , Greg Kroah-Hartman , linux-kernel@vger.kernel.org, Logan Gunthorpe , Souptick Joarder , Andrew Morton , linuxppc-dev@lists.ozlabs.org, Roman Gushchin , "Kirill A. Shutemov" , Al Viro Subject: Re: [PATCH v5 00/11] Introduces new count-based method for tracking lockless pagetable walks Message-ID: <20191004114236.GD19463@hirez.programming.kicks-ass.net> References: <20191003013325.2614-1-leonardo@linux.ibm.com> <20191003072952.GN4536@hirez.programming.kicks-ass.net> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable In-Reply-To: User-Agent: Mutt/1.10.1 (2018-07-13) X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On Thu, Oct 03, 2019 at 05:36:31PM -0300, Leonardo Bras wrote: > > Also, I'm not sure I understand things properly. > >=20 > > So serialize_against_pte_lookup() wants to wait for all currently > > out-standing __find_linux_pte() instances (which are very similar to > > gup_fast). > >=20 > > It seems to want to do this before flushing the THP TLB for some reason; > > why? Should not THP observe the normal page table freeing rules which > > includes a RCU-like grace period like this already. > >=20 > > Why is THP special here? This doesn't seem adequately explained. >=20 > "It's necessary to monitor lockless pagetable walks, in order to avoid > doing THP splitting/collapsing during them." >=20 > If a there is a THP split/collapse during the lockless pagetable walk, > the returned ptep can be a pointing to an invalid pte.=20 So the whole premise of lockless page-table walks (gup_fast) is that it can work on in-flux page-tables. Specifically gup_fast() never returns PTEs, only struct page *, and only if it can increment the page refcount. In order to enable this, page-table pages are RCU(-like) freed, such that even if we access page-tables that have (concurrently) been unlinked, we'll not UaF (see asm-generic/tlb.h, the comment at HAVE_RCU_TABLE_FREE). IOW, the worst case if not getting a struct page *. I really don't see how THP splitting/collapsing is special here, either we see the PMD and find a struct page * or we see a PTE and find the same struct page * (compound page head). The only thing that needs to be guaranteed is that both PTEs and PMD page-tables are valid. Is this not so? > To avoid that, the pmd is updated, then serialize_against_pte_lookup is > ran. Serialize runs a do_nothing in all cpu in cpu_mask.=20 >=20 > So, after all cpus finish running do_nothing(), there is a guarantee > that if there is any 'lockless pagetable walk' it is running on top of > a updated version of this pmd, and so, collapsing/splitting THP is > safe. But why would it matter?! It would find the same struct page * through either version of the page-tables. *confused* > > Also, specifically to munmap(), this seems entirely superfluous, > > munmap() uses the normal page-table freeing code and should be entirely > > fine without additional waiting. >=20 > To be honest, I remember it being needed in munmap case, but I really > don't remember the details. I will take a deeper look and come back > with this answer.=20 munmap does normal mmu_gather page-table teardown, the THP PMD should be RCU-like freed just like any other PMD. Which should be perfectly safe vs lockless page-table walks. If you can find anything there that isn't right, please explain that in detail and we'll need to look hard at fixing _that_. > > Furthermore, Power never accurately tracks mm_cpumask(), so using that > > makes the whole thing more expensive than it needs to be. Also, I > > suppose that is buggered vs file backed THP. >=20 > That accuracy of mm_cpumask is above my knowledge right now. =3D) Basically PowerPC only ever sets bits in there, unlike x86 that also clears bits (at expense, but it's worth it for us).