From: Suren Baghdasaryan <surenb@google.com>
Date: Fri, 26 Apr 2024 08:07:45 -0700
Subject: Re: [PATCH] mm: Always sanity check anon_vma first for per-vma locks
To: Matthew Wilcox
Cc: Peter Xu, "Liam R. Howlett", linux-kernel@vger.kernel.org, linux-mm@kvack.org, Andrew Morton, Lokesh Gidra, Alistair Popple
References: <20240410170621.2011171-1-peterx@redhat.com> <20240411171319.almhz23xulg4f7op@revolver>
On Fri, Apr 26, 2024 at 7:00 AM Matthew Wilcox wrote:
>
> On Fri, Apr 12, 2024 at 04:14:16AM +0100, Matthew Wilcox wrote:
> > Suren, what would you think to this?
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 6e2fe960473d..e495adcbe968 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -5821,15 +5821,6 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> >  	if (!vma_start_read(vma))
> >  		goto inval;
> >
> > -	/*
> > -	 * find_mergeable_anon_vma uses adjacent vmas which are not locked.
> > -	 * This check must happen after vma_start_read(); otherwise, a
> > -	 * concurrent mremap() with MREMAP_DONTUNMAP could dissociate the VMA
> > -	 * from its anon_vma.
> > -	 */
> > -	if (unlikely(vma_is_anonymous(vma) && !vma->anon_vma))
> > -		goto inval_end_read;
> > -
> >  	/* Check since vm_start/vm_end might change before we lock the VMA */
> >  	if (unlikely(address < vma->vm_start || address >= vma->vm_end))
> >  		goto inval_end_read;
> >
> > That takes a few insns out of the page fault path (good!) at the cost
> > of one extra trip around the fault handler for the first fault on an
> > anon vma.  It makes the file & anon paths more similar to each other
> > (good!)
> >
> > We'd need some data to be sure it's really a win, but less code is
> > always good.
>
> Intel's 0day got back to me with data and it's ridiculously good.
> Headline figure: over 3x throughput improvement with vm-scalability
> https://lore.kernel.org/all/202404261055.c5e24608-oliver.sang@intel.com/
>
> I can't see why it's that good.  It shouldn't be that good.  I'm
> seeing big numbers here:
>
>       4366 ± 2%    +565.6%      29061        perf-stat.overall.cycles-between-cache-misses
>
> and the code being deleted is only checking vma->vm_ops and
> vma->anon_vma.  Surely that cache line is referenced so frequently
> during pagefault that deleting a reference here will make no difference
> at all?

That indeed looks overly good.
Sorry, I didn't have a chance to run the benchmarks on my side yet
because of the ongoing Android bootcamp this week.

> We've clearly got an inlining change.  viz:
>
>      72.57           -72.6        0.00        perf-profile.calltrace.cycles-pp.exc_page_fault.asm_exc_page_fault.do_access
>      73.28           -72.6        0.70        perf-profile.calltrace.cycles-pp.asm_exc_page_fault.do_access
>      72.55           -72.5        0.00        perf-profile.calltrace.cycles-pp.do_user_addr_fault.exc_page_fault.asm_exc_page_fault.do_access
>      69.93           -69.9        0.00        perf-profile.calltrace.cycles-pp.lock_mm_and_find_vma.do_user_addr_fault.exc_page_fault.asm_exc_page_fault.do_access
>      69.12           -69.1        0.00        perf-profile.calltrace.cycles-pp.down_read_killable.lock_mm_and_find_vma.do_user_addr_fault.exc_page_fault.asm_exc_page_fault
>      68.78           -68.8        0.00        perf-profile.calltrace.cycles-pp.rwsem_down_read_slowpath.down_read_killable.lock_mm_and_find_vma.do_user_addr_fault.exc_page_fault
>      65.78           -65.8        0.00        perf-profile.calltrace.cycles-pp._raw_spin_lock_irq.rwsem_down_read_slowpath.down_read_killable.lock_mm_and_find_vma.do_user_addr_fault
>      65.43           -65.4        0.00        perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irq.rwsem_down_read_slowpath.down_read_killable.lock_mm_and_find_vma
>
>      11.22           +86.5       97.68        perf-profile.calltrace.cycles-pp.down_write_killable.vm_mmap_pgoff.ksys_mmap_pgoff.do_syscall_64.entry_SYSCALL_64_after_hwframe
>      11.14           +86.5       97.66        perf-profile.calltrace.cycles-pp.rwsem_down_write_slowpath.down_write_killable.vm_mmap_pgoff.ksys_mmap_pgoff.do_syscall_64
>       3.17 ± 2%      +94.0       97.12        perf-profile.calltrace.cycles-pp.osq_lock.rwsem_optimistic_spin.rwsem_down_write_slowpath.down_write_killable.vm_mmap_pgoff
>       3.45 ± 2%      +94.1       97.59        perf-profile.calltrace.cycles-pp.rwsem_optimistic_spin.rwsem_down_write_slowpath.down_write_killable.vm_mmap_pgoff.ksys_mmap_pgoff
>       0.00           +98.2       98.15        perf-profile.calltrace.cycles-pp.vm_mmap_pgoff.ksys_mmap_pgoff.do_syscall_64.entry_SYSCALL_64_after_hwframe
>
>       0.00           +98.2       98.16        perf-profile.calltrace.cycles-pp.ksys_mmap_pgoff.do_syscall_64.entry_SYSCALL_64_after_hwframe
>
> so maybe the compiler has been able to eliminate some loads from
> contended cachelines?
>
>     703147           -87.6%      87147 ± 2%   perf-stat.ps.context-switches
>     663.67 ± 5%    +7551.9%      50783        vm-scalability.time.involuntary_context_switches
>  1.105e+08           -86.7%   14697764 ± 2%   vm-scalability.time.voluntary_context_switches
>
> indicates to me that we're taking the mmap rwsem far less often (those
> would be accounted as voluntary context switches).
>
> So maybe the cache miss reduction is a consequence of just running for
> longer before being preempted.
>
> I still don't understand why we have to take the mmap_sem less often.
> Is there perhaps a VMA for which we have a NULL vm_ops, but don't set
> an anon_vma on a page fault?

I think the only path in either do_anonymous_page() or
do_huge_pmd_anonymous_page() that skips calling anon_vma_prepare() is
the "Use the zero-page for reads" here:
https://elixir.bootlin.com/linux/latest/source/mm/memory.c#L4265.
I didn't look into this particular benchmark yet but will try it out
once I have some time to benchmark your change.

>