Date: Thu, 23 Jan 2025 22:45:31 +0100
From: Peter Zijlstra <peterz@infradead.org>
To: Roman Gushchin
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Andrew Morton,
	Jann Horn, Will Deacon, "Aneesh Kumar K.V", Nick Piggin,
	Hugh Dickins, linux-arch@vger.kernel.org
Subject: Re: [PATCH v2] mmu_gather: move tlb flush for VM_PFNMAP/VM_MIXEDMAP vmas into free_pgtables()
Message-ID: <20250123214531.GA969@noisy.programming.kicks-ass.net>
References: <20250122232716.1321171-1-roman.gushchin@linux.dev>
In-Reply-To: <20250122232716.1321171-1-roman.gushchin@linux.dev>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
On Wed, Jan 22, 2025 at 11:27:16PM +0000, Roman Gushchin wrote:
> Commit b67fbebd4cf9 ("mmu_gather: Force tlb-flush VM_PFNMAP vmas")
> added a forced tlbflush to tlb_vma_end(), which is required to avoid a
> race between munmap() and unmap_mapping_range(). However it added some
> overhead to other paths where tlb_vma_end() is used, but vmas are not
> removed, e.g. madvise(MADV_DONTNEED).
>
> Fix this by moving the tlb flush out of tlb_end_vma() into
> free_pgtables(), somewhat similar to the stable version of the
> original commit: e.g. stable commit 895428ee124a ("mm: Force TLB flush
> for PFNMAP mappings before unlink_file_vma()").

> diff --git a/mm/memory.c b/mm/memory.c
> index 398c031be9ba..c2a9effb2e32 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -365,6 +365,13 @@ void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
>  {
>  	struct unlink_vma_file_batch vb;
>  
> +	/*
> +	 * VM_PFNMAP and VM_MIXEDMAP maps require a special handling here:
> +	 * force flush TLBs for such ranges to avoid munmap() vs
> +	 * unmap_mapping_range() races.
> +	 */
> +	tlb_flush_mmu_pfnmap(tlb);
> +
>  	do {
>  		unsigned long addr = vma->vm_start;
>  		struct vm_area_struct *next;

So I'm not sure I much like this name; it is fairly random and does very
much not convey the reason we're calling this.

Anyway, going back to reading the original commit (because this
changelog isn't helping me much), the problem appears to be that
unlinking the vma will make unmap_mapping_range() skip things (no vma,
nothing to do, etc.), return early, and bad things happen.

So am I right in thinking we need this tlb flush before all those
unlink_{anon,file}_vma*() calls, and that the whole free_pgd_range()
that goes and invalidates the page-tables is too late?

So how about we do something like so instead?
---
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 709830274b75..481a24f2b839 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -58,6 +58,11 @@
  *    Defaults to flushing at tlb_end_vma() to reset the range; helps when
  *    there's large holes between the VMAs.
  *
+ *  - tlb_free_vma()
+ *
+ *    tlb_free_vma() marks the start of unlinking the vma and freeing
+ *    page-tables.
+ *
  *  - tlb_remove_table()
  *
  *    tlb_remove_table() is the basic primitive to free page-table directories
@@ -384,7 +389,10 @@ static inline void __tlb_reset_range(struct mmu_gather *tlb)
 	 * Do not reset mmu_gather::vma_* fields here, we do not
 	 * call into tlb_start_vma() again to set them if there is an
 	 * intermediate flush.
+	 *
+	 * Except for vma_pfn, that only cares if there's pending TLBI.
 	 */
+	tlb->vma_pfn = 0;
 }
 
 #ifdef CONFIG_MMU_GATHER_NO_RANGE
@@ -449,7 +457,12 @@ tlb_update_vma_flags(struct mmu_gather *tlb, struct vm_area_struct *vma)
 	 */
 	tlb->vma_huge = is_vm_hugetlb_page(vma);
 	tlb->vma_exec = !!(vma->vm_flags & VM_EXEC);
-	tlb->vma_pfn  = !!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP));
+
+	/*
+	 * Track if there's at least one VM_PFNMAP/VM_MIXEDMAP vma
+	 * in the tracked range, see tlb_free_vma().
+	 */
+	tlb->vma_pfn |= !!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP));
 }
 
 static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
@@ -548,23 +561,39 @@ static inline void tlb_start_vma(struct mmu_gather *tlb, struct vm_area_struct *
 }
 
 static inline void tlb_end_vma(struct mmu_gather *tlb, struct vm_area_struct *vma)
+{
+	if (tlb->fullmm || IS_ENABLED(CONFIG_MMU_GATHER_MERGE_VMAS))
+		return;
+
+	/*
+	 * Do a TLB flush and reset the range at VMA boundaries; this avoids
+	 * the ranges growing with the unused space between consecutive VMAs,
+	 * but also the mmu_gather::vma_* flags from tlb_start_vma() rely on
+	 * this.
+	 */
+	tlb_flush_mmu_tlbonly(tlb);
+}
+
+static inline void tlb_free_vma(struct mmu_gather *tlb, struct vm_area_struct *vma)
 {
 	if (tlb->fullmm)
 		return;
 
 	/*
 	 * VM_PFNMAP is more fragile because the core mm will not track the
-	 * page mapcount -- there might not be page-frames for these PFNs after
-	 * all. Force flush TLBs for such ranges to avoid munmap() vs
-	 * unmap_mapping_range() races.
+	 * page mapcount -- there might not be page-frames for these PFNs
+	 * after all.
+	 *
+	 * Specifically, there is a race between munmap() and
+	 * unmap_mapping_range(), where munmap() will unlink the VMA, such
+	 * that unmap_mapping_range() will no longer observe the VMA and
+	 * no-op, without observing the TLBI, returning prematurely.
+	 *
+	 * So if we're about to unlink such a VMA, and we have pending
+	 * TLBI for such a vma, flush things now.
 	 */
-	if (tlb->vma_pfn || !IS_ENABLED(CONFIG_MMU_GATHER_MERGE_VMAS)) {
-		/*
-		 * Do a TLB flush and reset the range at VMA boundaries; this avoids
-		 * the ranges growing with the unused space between consecutive VMAs.
-		 */
+	if ((vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) && tlb->vma_pfn)
 		tlb_flush_mmu_tlbonly(tlb);
-	}
 }
 
 /*
diff --git a/mm/memory.c b/mm/memory.c
index 398c031be9ba..41b037803fd9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -365,6 +365,8 @@ void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
 {
 	struct unlink_vma_file_batch vb;
 
+	tlb_free_vma(tlb, vma);
+
 	do {
 		unsigned long addr = vma->vm_start;
 		struct vm_area_struct *next;