Date: Fri, 26 Apr 2024 15:00:42 +0100
From: Matthew Wilcox <willy@infradead.org>
To: Peter Xu
Cc: "Liam R. Howlett", linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	Andrew Morton, Suren Baghdasaryan, Lokesh Gidra, Alistair Popple
Subject: Re: [PATCH] mm: Always sanity check anon_vma first for per-vma locks
References: <20240410170621.2011171-1-peterx@redhat.com>
	<20240411171319.almhz23xulg4f7op@revolver>
On Fri, Apr 12, 2024 at 04:14:16AM +0100, Matthew Wilcox wrote:
> Suren, what would you think to this?
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 6e2fe960473d..e495adcbe968 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5821,15 +5821,6 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
>  	if (!vma_start_read(vma))
>  		goto inval;
>
> -	/*
> -	 * find_mergeable_anon_vma uses adjacent vmas which are not locked.
> -	 * This check must happen after vma_start_read(); otherwise, a
> -	 * concurrent mremap() with MREMAP_DONTUNMAP could dissociate the VMA
> -	 * from its anon_vma.
> -	 */
> -	if (unlikely(vma_is_anonymous(vma) && !vma->anon_vma))
> -		goto inval_end_read;
> -
>  	/* Check since vm_start/vm_end might change before we lock the VMA */
>  	if (unlikely(address < vma->vm_start || address >= vma->vm_end))
>  		goto inval_end_read;
>
> That takes a few insns out of the page fault path (good!) at the cost
> of one extra trip around the fault handler for the first fault on an
> anon vma.  It makes the file & anon paths more similar to each other
> (good!)
>
> We'd need some data to be sure it's really a win, but less code is
> always good.

Intel's 0day got back to me with data and it's ridiculously good.
Headline figure: over 3x throughput improvement with vm-scalability

https://lore.kernel.org/all/202404261055.c5e24608-oliver.sang@intel.com/

I can't see why it's that good.  It shouldn't be that good.  I'm seeing
big numbers here:

      4366 ± 2%    +565.6%      29061        perf-stat.overall.cycles-between-cache-misses

and the code being deleted is only checking vma->vm_ops and
vma->anon_vma.  Surely that cache line is referenced so frequently
during pagefault that deleting a reference here will make no difference
at all?

We've clearly got an inlining change.
viz:

     72.57          -72.6        0.00        perf-profile.calltrace.cycles-pp.exc_page_fault.asm_exc_page_fault.do_access
     73.28          -72.6        0.70        perf-profile.calltrace.cycles-pp.asm_exc_page_fault.do_access
     72.55          -72.5        0.00        perf-profile.calltrace.cycles-pp.do_user_addr_fault.exc_page_fault.asm_exc_page_fault.do_access
     69.93          -69.9        0.00        perf-profile.calltrace.cycles-pp.lock_mm_and_find_vma.do_user_addr_fault.exc_page_fault.asm_exc_page_fault.do_access
     69.12          -69.1        0.00        perf-profile.calltrace.cycles-pp.down_read_killable.lock_mm_and_find_vma.do_user_addr_fault.exc_page_fault.asm_exc_page_fault
     68.78          -68.8        0.00        perf-profile.calltrace.cycles-pp.rwsem_down_read_slowpath.down_read_killable.lock_mm_and_find_vma.do_user_addr_fault.exc_page_fault
     65.78          -65.8        0.00        perf-profile.calltrace.cycles-pp._raw_spin_lock_irq.rwsem_down_read_slowpath.down_read_killable.lock_mm_and_find_vma.do_user_addr_fault
     65.43          -65.4        0.00        perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irq.rwsem_down_read_slowpath.down_read_killable.lock_mm_and_find_vma
     11.22          +86.5       97.68        perf-profile.calltrace.cycles-pp.down_write_killable.vm_mmap_pgoff.ksys_mmap_pgoff.do_syscall_64.entry_SYSCALL_64_after_hwframe
     11.14          +86.5       97.66        perf-profile.calltrace.cycles-pp.rwsem_down_write_slowpath.down_write_killable.vm_mmap_pgoff.ksys_mmap_pgoff.do_syscall_64
      3.17 ± 2%     +94.0       97.12        perf-profile.calltrace.cycles-pp.osq_lock.rwsem_optimistic_spin.rwsem_down_write_slowpath.down_write_killable.vm_mmap_pgoff
      3.45 ± 2%     +94.1       97.59        perf-profile.calltrace.cycles-pp.rwsem_optimistic_spin.rwsem_down_write_slowpath.down_write_killable.vm_mmap_pgoff.ksys_mmap_pgoff
      0.00          +98.2       98.15        perf-profile.calltrace.cycles-pp.vm_mmap_pgoff.ksys_mmap_pgoff.do_syscall_64.entry_SYSCALL_64_after_hwframe
      0.00          +98.2       98.16        perf-profile.calltrace.cycles-pp.ksys_mmap_pgoff.do_syscall_64.entry_SYSCALL_64_after_hwframe

so maybe the compiler has been able to eliminate some loads from
contended cachelines?
    703147          -87.6%      87147 ± 2%   perf-stat.ps.context-switches
    663.67 ± 5%   +7551.9%      50783        vm-scalability.time.involuntary_context_switches
 1.105e+08          -86.7%   14697764 ± 2%   vm-scalability.time.voluntary_context_switches

indicates to me that we're taking the mmap rwsem far less often (those
would be accounted as voluntary context switches).  So maybe the cache
miss reduction is a consequence of just running for longer before being
preempted.

I still don't understand why we have to take the mmap_sem less often.
Is there perhaps a VMA for which we have a NULL vm_ops, but don't set
an anon_vma on a page fault?