From: Mateusz Guzik <mjguzik@gmail.com>
Date: Fri, 9 Aug 2024 10:14:58 +0200
Subject: Re: [RFC PATCH] vm: align vma allocation and move the lock back into the struct
To: Suren Baghdasaryan
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Liam.Howlett@oracle.com, vbabka@suse.cz, pedro.falcato@gmail.com, Lorenzo Stoakes
References: <20240808185949.1094891-1-mjguzik@gmail.com>

On Fri, Aug 9, 2024 at 5:57 AM Suren Baghdasaryan wrote:
>
> On Thu, Aug 8, 2024 at 9:19 PM Suren Baghdasaryan wrote:
> >
> > On Thu, Aug 8, 2024 at 1:04 PM Mateusz Guzik wrote:
> > >
> > > On Thu, Aug 8, 2024 at 9:39 PM Suren Baghdasaryan wrote:
> > > >
> > > > On Thu, Aug 8, 2024 at 7:00 PM Mateusz Guzik wrote:
> > > > >
> > > > > ACHTUNG: this is more of a request for benchmarking than a patch
> > > > > proposal at this stage
> > > > >
> > > > > I was pointed at your patch which moved the vma lock to a separate
> > > > > allocation [1]. The commit message does not say anything about making
> > > > > sure the object itself is allocated with proper alignment and I found
> > > > > that the cache creation lacks the HWCACHE_ALIGN flag, which may or may
> > > > > not be the problem.
> > > > >
> > > > > I verified with a simple one-liner that on a stock kernel the vmas keep
> > > > > roaming around with a 16-byte alignment:
> > > > > # bpftrace -e 'kretprobe:vm_area_alloc { @[retval & 0x3f] = count(); }'
> > > > > @[16]: 39
> > > > > @[0]: 46
> > > > > @[32]: 53
> > > > > @[48]: 56
> > > > >
> > > > > Note the stock vma lock cache also lacks the alignment flag. While I
> > > > > have not verified experimentally, if they are also roaming it would mean
> > > > > that 2 unrelated vmas can false-share locks. If the patch below is a
> > > > > bust, the flag should probably be added to that one.
> > > > >
> > > > > The patch has slapped-around vma lock cache removal + HWALLOC for the
> > > > > vma cache. I left a pointer to not change relative offsets between
> > > > > current fields. It does compile without CONFIG_PER_VMA_LOCK.
> > > > >
> > > > > Vlastimil says you tested a case where the struct got bloated to 256
> > > > > bytes, but the lock remained separate. It is unclear to me if this
> > > > > happened with allocations made with the HWCACHE_ALIGN flag though.
> > > > >
> > > > > There is 0 urgency on my end, but it would be nice if you could try
> > > > > this out with your test rig.
> > > >
> > > > Hi Mateusz,
> > > > Sure, I'll give it a spin but I'm not optimistic. Your code looks
> > > > almost identical to my latest attempt where I tried placing vm_lock
> > > > into different cachelines, including a separate one, and using
> > > > HWCACHE_ALIGN. And yet all my attempts showed a regression.
> > > > Just FYI, the test I'm using is the pft-threads test from the mmtests
> > > > suite. I'll post results this evening.
> > > > Thanks,
> > > > Suren.
> > >
> > > Ok, well maybe you did not leave the pointer in place? :)
> >
> > True, maybe that will make a difference. I'll let you know soon.
> >
> > >
> > > It is plausible the problem is on vs off cpu behavior of rwsems --
> > > there is a corner case where they neglect to spin. It is plausible
> > > perf goes down simply because there is less on cpu time.
> > >
> > > Thus when you bench can you make sure to use time(1)?
> >
> > Sure, will do once I'm home. Thanks for the hints!
>
> Unfortunately the same regression shows its ugly face:
>
> compare-mmtests.pl Hmean results:
> Hmean     faults/cpu-1    471264.4904 (   0.00%)   473085.6736 *   0.39%*
> Hmean     faults/cpu-4    434571.7116 (   0.00%)   431214.3974 *  -0.77%*
> Hmean     faults/cpu-7    407755.3217 (   0.00%)   395773.4052 *  -2.94%*
> Hmean     faults/cpu-12   335604.9251 (   0.00%)   285426.3358 * -14.95%*
> Hmean     faults/cpu-21   187588.9077 (   0.00%)   171227.7179 *  -8.72%*
> Hmean     faults/cpu-30   140875.7878 (   0.00%)   124120.3437 * -11.89%*
> Hmean     faults/cpu-48   106175.5493 (   0.00%)    93073.1499 * -12.34%*
> Hmean     faults/cpu-56    92585.2536 (   0.00%)    82837.4299 * -10.53%*
> Hmean     faults/sec-1    470924.4946 (   0.00%)   472730.9937 *   0.38%*
> Hmean     faults/sec-4   1714823.8198 (   0.00%)  1693226.7248 *  -1.26%*
> Hmean     faults/sec-7   2801395.1898 (   0.00%)  2717561.9417 *  -2.99%*
> Hmean     faults/sec-12  3934168.2690 (   0.00%)  3319710.7540 * -15.62%*
> Hmean     faults/sec-21  3736832.4592 (   0.00%)  3444687.9145 *  -7.82%*
> Hmean     faults/sec-30  3845187.2636 (   0.00%)  3403585.7064 * -11.48%*
> Hmean     faults/sec-48  4712317.7461 (   0.00%)  4180658.4710 * -11.28%*
> Hmean     faults/sec-56  4873233.9844 (   0.00%)  4423608.6568 *  -9.23%*
>
> This is the time(1) output with the baseline:
> 920.47user 7748.31system 18:30.85elapsed 780%CPU (0avgtext+0avgdata
> 26385096maxresident)k
> 140848inputs+19744outputs (66major+1583463207minor)pagefaults 0swaps
>
> This is the time(1) output with your change:
> 1025.73user 8618.74system 19:10.79elapsed 838%CPU (0avgtext+0avgdata
> 26385116maxresident)k
> 16584inputs+19512outputs (61major+1583468687minor)pagefaults 0swaps
>
> Maybe it has something to do with NUMA? The system I'm running has 2
> NUMA nodes:

hrmpf. a final cheap stab I forgot to mention: plausibly this is all
about the adjacent cacheline prefetcher. google-fu temporarily fails me,
but there was a one-liner to toggle that on Linux. Worst case you can
flip it in the BIOS.
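for reference, what I had in mind is along these lines -- a sketch from
memory, so treat the MSR number (0x1a4, aka MISC_FEATURE_CONTROL) and
the bit (1 = L2 adjacent cache line prefetcher disable) as assumptions
to double-check against Intel's prefetcher control documentation before
trusting any numbers:

# modprobe msr
# rdmsr -a 0x1a4        (note the current value on every core first)
# wrmsr -a 0x1a4 0x2    (set bit 1: adjacent-line prefetch off)
... rerun pft-threads ...
# wrmsr -a 0x1a4 0x0    (restore; assumes the register read 0 above)

rdmsr/wrmsr ship with msr-tools.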
if that does not change anything, I'm going to grab a numa box of
similar scale to poke around myself, but I don't have an ETA.

even so, do you have a handy one-liner to run the case with 56 threads?

*maybe* comparing instructions which generate cache misses before/after
will explain what's up.
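to be clear, also just a sketch -- the generic cache-misses event is a
stand-in and a raw PMU event for that uarch may be more telling:

# perf record -e cache-misses -a -- ./run-mmtests.sh --no-monitor --config configs/config-workload-pft-threads
# perf report --stdio --sort dso,symbol

perf annotate then drills from a hot symbol down to the individual
instructions, and with a perf.data from each kernel, perf diff lines
the two runs up side by side.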
> $ lscpu
> Architecture:            x86_64
> CPU op-mode(s):          32-bit, 64-bit
> Address sizes:           46 bits physical, 48 bits virtual
> Byte Order:              Little Endian
> CPU(s):                  56
> On-line CPU(s) list:     0-55
> Vendor ID:               GenuineIntel
> Model name:              Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
> CPU family:              6
> Model:                   79
> Thread(s) per core:      2
> Core(s) per socket:      14
> Socket(s):               2
> Stepping:                1
> CPU max MHz:             3500.0000
> CPU min MHz:             1200.0000
> BogoMIPS:                5188.26
> Flags:                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr
>                          pge mca cmov pat pse36 clflush dts acpi mmx fxsr
>                          sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp
>                          lm constant_tsc arch_perfmon pebs bts rep_good
>                          nopl xtopology nonstop_tsc cpuid aperfmperf pni
>                          pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2
>                          ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1
>                          sse4_2 x2apic movbe popcnt tsc_deadline_timer aes
>                          xsave avx f16c rdrand lahf_lm abm 3dnowprefetch
>                          cpuid_fault epb cat_l3 cdp_l3 pti intel_ppin ssbd
>                          ibrs ibpb stibp tpr_shadow flexpriority ept vpid
>                          ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep
>                          bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap
>                          intel_pt xsaveopt cqm_llc cqm_occup_llc
>                          cqm_mbm_total cqm_mbm_local dtherm ida arat pln
>                          pts vnmi md_clear flush_l1d
> Virtualization features:
>   Virtualization:        VT-x
> Caches (sum of all):
>   L1d:                   896 KiB (28 instances)
>   L1i:                   896 KiB (28 instances)
>   L2:                    7 MiB (28 instances)
>   L3:                    70 MiB (2 instances)
> NUMA:
>   NUMA node(s):          2
>   NUMA node0 CPU(s):     0-13,28-41
>   NUMA node1 CPU(s):     14-27,42-55
>
> Any ideas?
>
> >
> > >
> > > For example with zsh I got:
> > > ./run-mmtests.sh --no-monitor --config configs/config-workload-pft-threads
> > >
> > > 39.35s user 445.45s system 390% cpu 124.04s (2:04.04) total
> > >
> > > I verified with offcputime-bpfcc -K that indeed there is a bunch of
> > > pft going off cpu from down_read/down_write even at the modest scale
> > > this was running in my case.
> > >
> > > > >
> > > > > 1: https://lore.kernel.org/all/20230227173632.3292573-34-surenb@google.com/T/#u
> > > > >
> > > > > ---
> > > > >  include/linux/mm.h       | 18 +++++++--------
> > > > >  include/linux/mm_types.h | 10 ++++-----
> > > > >  kernel/fork.c            | 47 ++++------------------------------
> > > > >  mm/userfaultfd.c         |  6 ++---
> > > > >  4 files changed, 19 insertions(+), 62 deletions(-)
> > > > >
> > > > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > > > index 43b40334e9b2..6d8b668d3deb 100644
> > > > > --- a/include/linux/mm.h
> > > > > +++ b/include/linux/mm.h
> > > > > @@ -687,7 +687,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> > > > >         if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq))
> > > > >                 return false;
> > > > >
> > > > > -       if (unlikely(down_read_trylock(&vma->vm_lock->lock) == 0))
> > > > > +       if (unlikely(down_read_trylock(&vma->vm_lock) == 0))
> > > > >                 return false;
> > > > >
> > > > >         /*
> > > > > @@ -702,7 +702,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> > > > >          * This pairs with RELEASE semantics in vma_end_write_all().
> > > > >          */
> > > > >         if (unlikely(vma->vm_lock_seq == smp_load_acquire(&vma->vm_mm->mm_lock_seq))) {
> > > > > -               up_read(&vma->vm_lock->lock);
> > > > > +               up_read(&vma->vm_lock);
> > > > >                 return false;
> > > > >         }
> > > > >         return true;
> > > > > @@ -711,7 +711,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> > > > >  static inline void vma_end_read(struct vm_area_struct *vma)
> > > > >  {
> > > > >         rcu_read_lock(); /* keeps vma alive till the end of up_read */
> > > > > -       up_read(&vma->vm_lock->lock);
> > > > > +       up_read(&vma->vm_lock);
> > > > >         rcu_read_unlock();
> > > > >  }
> > > > >
> > > > > @@ -740,7 +740,7 @@ static inline void vma_start_write(struct vm_area_struct *vma)
> > > > >         if (__is_vma_write_locked(vma, &mm_lock_seq))
> > > > >                 return;
> > > > >
> > > > > -       down_write(&vma->vm_lock->lock);
> > > > > +       down_write(&vma->vm_lock);
> > > > >         /*
> > > > >          * We should use WRITE_ONCE() here because we can have concurrent reads
> > > > >          * from the early lockless pessimistic check in vma_start_read().
> > > > > @@ -748,7 +748,7 @@ static inline void vma_start_write(struct vm_area_struct *vma)
> > > > >          * we should use WRITE_ONCE() for cleanliness and to keep KCSAN happy.
> > > > >          */
> > > > >         WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
> > > > > -       up_write(&vma->vm_lock->lock);
> > > > > +       up_write(&vma->vm_lock);
> > > > >  }
> > > > >
> > > > >  static inline void vma_assert_write_locked(struct vm_area_struct *vma)
> > > > > @@ -760,7 +760,7 @@ static inline void vma_assert_write_locked(struct vm_area_struct *vma)
> > > > >
> > > > >  static inline void vma_assert_locked(struct vm_area_struct *vma)
> > > > >  {
> > > > > -       if (!rwsem_is_locked(&vma->vm_lock->lock))
> > > > > +       if (!rwsem_is_locked(&vma->vm_lock))
> > > > >                 vma_assert_write_locked(vma);
> > > > >  }
> > > > >
> > > > > @@ -827,10 +827,6 @@ static inline void assert_fault_locked(struct vm_fault *vmf)
> > > > >
> > > > >  extern const struct vm_operations_struct vma_dummy_vm_ops;
> > > > >
> > > > > -/*
> > > > > - * WARNING: vma_init does not initialize vma->vm_lock.
> > > > > - * Use vm_area_alloc()/vm_area_free() if vma needs locking.
> > > > > - */
> > > > >  static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
> > > > >  {
> > > > >         memset(vma, 0, sizeof(*vma));
> > > > > @@ -839,6 +835,8 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
> > > > >         INIT_LIST_HEAD(&vma->anon_vma_chain);
> > > > >         vma_mark_detached(vma, false);
> > > > >         vma_numab_state_init(vma);
> > > > > +       init_rwsem(&vma->vm_lock);
> > > > > +       vma->vm_lock_seq = -1;
> > > > >  }
> > > > >
> > > > >  /* Use when VMA is not part of the VMA tree and needs no locking */
> > > > > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > > > > index 003619fab20e..caffdb4eeb94 100644
> > > > > --- a/include/linux/mm_types.h
> > > > > +++ b/include/linux/mm_types.h
> > > > > @@ -615,10 +615,6 @@ static inline struct anon_vma_name *anon_vma_name_alloc(const char *name)
> > > > >  }
> > > > >  #endif
> > > > >
> > > > > -struct vma_lock {
> > > > > -       struct rw_semaphore lock;
> > > > > -};
> > > > > -
> > > > >  struct vma_numab_state {
> > > > >         /*
> > > > >          * Initialised as time in 'jiffies' after which VMA
> > > > > @@ -716,8 +712,7 @@ struct vm_area_struct {
> > > > >          * slowpath.
> > > > >          */
> > > > >         int vm_lock_seq;
> > > > > -       /* Unstable RCU readers are allowed to read this. */
> > > > > -       struct vma_lock *vm_lock;
> > > > > +       void *vm_dummy;
> > > > >  #endif
> > > > >
> > > > >         /*
> > > > > @@ -770,6 +765,9 @@ struct vm_area_struct {
> > > > >         struct vma_numab_state *numab_state;    /* NUMA Balancing state */
> > > > >  #endif
> > > > >         struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
> > > > > +#ifdef CONFIG_PER_VMA_LOCK
> > > > > +       struct rw_semaphore vm_lock ____cacheline_aligned_in_smp;
> > > > > +#endif
> > > > >  } __randomize_layout;
> > > > >
> > > > >  #ifdef CONFIG_NUMA
> > > > > diff --git a/kernel/fork.c b/kernel/fork.c
> > > > > index 92bfe56c9fed..eab04a24d5f1 100644
> > > > > --- a/kernel/fork.c
> > > > > +++ b/kernel/fork.c
> > > > > @@ -436,35 +436,6 @@ static struct kmem_cache *vm_area_cachep;
> > > > >  /* SLAB cache for mm_struct structures (tsk->mm) */
> > > > >  static struct kmem_cache *mm_cachep;
> > > > >
> > > > > -#ifdef CONFIG_PER_VMA_LOCK
> > > > > -
> > > > > -/* SLAB cache for vm_area_struct.lock */
> > > > > -static struct kmem_cache *vma_lock_cachep;
> > > > > -
> > > > > -static bool vma_lock_alloc(struct vm_area_struct *vma)
> > > > > -{
> > > > > -       vma->vm_lock = kmem_cache_alloc(vma_lock_cachep, GFP_KERNEL);
> > > > > -       if (!vma->vm_lock)
> > > > > -               return false;
> > > > > -
> > > > > -       init_rwsem(&vma->vm_lock->lock);
> > > > > -       vma->vm_lock_seq = -1;
> > > > > -
> > > > > -       return true;
> > > > > -}
> > > > > -
> > > > > -static inline void vma_lock_free(struct vm_area_struct *vma)
> > > > > -{
> > > > > -       kmem_cache_free(vma_lock_cachep, vma->vm_lock);
> > > > > -}
> > > > > -
> > > > > -#else /* CONFIG_PER_VMA_LOCK */
> > > > > -
> > > > > -static inline bool vma_lock_alloc(struct vm_area_struct *vma) { return true; }
> > > > > -static inline void vma_lock_free(struct vm_area_struct *vma) {}
> > > > > -
> > > > > -#endif /* CONFIG_PER_VMA_LOCK */
> > > > > -
> > > > >  struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
> > > > >  {
> > > > >         struct vm_area_struct *vma;
> > > > > @@ -474,10 +445,6 @@ struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
> > > > >                 return NULL;
> > > > >
> > > > >         vma_init(vma, mm);
> > > > > -       if (!vma_lock_alloc(vma)) {
> > > > > -               kmem_cache_free(vm_area_cachep, vma);
> > > > > -               return NULL;
> > > > > -       }
> > > > >
> > > > >         return vma;
> > > > >  }
> > > > > @@ -496,10 +463,8 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
> > > > >          * will be reinitialized.
> > > > >          */
> > > > >         data_race(memcpy(new, orig, sizeof(*new)));
> > > > > -       if (!vma_lock_alloc(new)) {
> > > > > -               kmem_cache_free(vm_area_cachep, new);
> > > > > -               return NULL;
> > > > > -       }
> > > > > +       init_rwsem(&new->vm_lock);
> > > > > +       new->vm_lock_seq = -1;
> > > > >         INIT_LIST_HEAD(&new->anon_vma_chain);
> > > > >         vma_numab_state_init(new);
> > > > >         dup_anon_vma_name(orig, new);
> > > > > @@ -511,7 +476,6 @@ void __vm_area_free(struct vm_area_struct *vma)
> > > > >  {
> > > > >         vma_numab_state_free(vma);
> > > > >         free_anon_vma_name(vma);
> > > > > -       vma_lock_free(vma);
> > > > >         kmem_cache_free(vm_area_cachep, vma);
> > > > >  }
> > > > >
> > > > > @@ -522,7 +486,7 @@ static void vm_area_free_rcu_cb(struct rcu_head *head)
> > > > >                                           vm_rcu);
> > > > >
> > > > >         /* The vma should not be locked while being destroyed. */
> > > > > -       VM_BUG_ON_VMA(rwsem_is_locked(&vma->vm_lock->lock), vma);
> > > > > +       VM_BUG_ON_VMA(rwsem_is_locked(&vma->vm_lock), vma);
> > > > >         __vm_area_free(vma);
> > > > >  }
> > > > >  #endif
> > > > > @@ -3192,10 +3156,7 @@ void __init proc_caches_init(void)
> > > > >                         SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT,
> > > > >                         NULL);
> > > > >
> > > > > -       vm_area_cachep = KMEM_CACHE(vm_area_struct, SLAB_PANIC|SLAB_ACCOUNT);
> > > > > -#ifdef CONFIG_PER_VMA_LOCK
> > > > > -       vma_lock_cachep = KMEM_CACHE(vma_lock, SLAB_PANIC|SLAB_ACCOUNT);
> > > > > -#endif
> > > > > +       vm_area_cachep = KMEM_CACHE(vm_area_struct, SLAB_PANIC|SLAB_ACCOUNT|SLAB_HWCACHE_ALIGN);
> > > > >         mmap_init();
> > > > >         nsproxy_cache_init();
> > > > >  }
> > > > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > > > > index 3b7715ecf292..e95ecb2063d2 100644
> > > > > --- a/mm/userfaultfd.c
> > > > > +++ b/mm/userfaultfd.c
> > > > > @@ -92,7 +92,7 @@ static struct vm_area_struct *uffd_lock_vma(struct mm_struct *mm,
> > > > >                  * mmap_lock, which guarantees that nobody can lock the
> > > > >                  * vma for write (vma_start_write()) under us.
> > > > >                  */
> > > > > -               down_read(&vma->vm_lock->lock);
> > > > > +               down_read(&vma->vm_lock);
> > > > >         }
> > > > >
> > > > >         mmap_read_unlock(mm);
> > > > > @@ -1468,9 +1468,9 @@ static int uffd_move_lock(struct mm_struct *mm,
> > > > >                  * See comment in uffd_lock_vma() as to why not using
> > > > >                  * vma_start_read() here.
> > > > >                  */
> > > > > -               down_read(&(*dst_vmap)->vm_lock->lock);
> > > > > +               down_read(&(*dst_vmap)->vm_lock);
> > > > >                 if (*dst_vmap != *src_vmap)
> > > > > -                       down_read_nested(&(*src_vmap)->vm_lock->lock,
> > > > > +                       down_read_nested(&(*src_vmap)->vm_lock,
> > > > >                                          SINGLE_DEPTH_NESTING);
> > > > >         }
> > > > >         mmap_read_unlock(mm);
> > > > > --
> > > > > 2.43.0
> > > > >
> > > >
> > >
> > > --
> > > Mateusz Guzik

-- 
Mateusz Guzik