From: Suren Baghdasaryan <surenb@google.com>
Date: Fri, 9 Aug 2024 03:57:27 +0000
Subject: Re: [RFC PATCH] vm: align vma allocation and move the lock back into the struct
To: Mateusz Guzik <mjguzik@gmail.com>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Liam.Howlett@oracle.com, vbabka@suse.cz, lstoakes@gmail.com, pedro.falcato@gmail.com

On Thu, Aug 8, 2024 at 9:19 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Thu, Aug 8, 2024 at 1:04 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
> >
> > On Thu, Aug 8, 2024 at 9:39 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > >
> > > On Thu, Aug 8, 2024 at 7:00 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
> > > >
> > > > ACHTUNG: this is more of a request for benchmarking than a patch
> > > > proposal at this stage.
> > > >
> > > > I was pointed at your patch which moved the vma lock to a separate
> > > > allocation [1]. The commit message does not say anything about making
> > > > sure the object itself is allocated with proper alignment, and I found
> > > > that the cache creation lacks the HWCACHE_ALIGN flag, which may or may
> > > > not be the problem.
> > > >
> > > > I verified with a simple one-liner that on a stock kernel the vmas keep
> > > > roaming around with a 16-byte alignment:
> > > > # bpftrace -e 'kretprobe:vm_area_alloc { @[retval & 0x3f] = count(); }'
> > > > @[16]: 39
> > > > @[0]: 46
> > > > @[32]: 53
> > > > @[48]: 56
> > > >
> > > > Note the stock vma lock cache also lacks the alignment flag. While I
> > > > have not verified it experimentally, if the locks are also roaming it
> > > > would mean that 2 unrelated vmas can false-share locks. If the patch
> > > > below is a bust, the flag should probably be added to that cache.
> > > >
> > > > The patch has slapped-around vma lock cache removal + HWCACHE_ALIGN
> > > > for the vma cache. I left a dummy pointer in place so as not to change
> > > > the relative offsets between the current fields. It does compile
> > > > without CONFIG_PER_VMA_LOCK.
> > > >
> > > > Vlastimil says you tested a case where the struct got bloated to 256
> > > > bytes, but the lock remained separate. It is unclear to me whether that
> > > > happened with allocations made with the HWCACHE_ALIGN flag though.
> > > >
> > > > There is 0 urgency on my end, but it would be nice if you could try
> > > > this out with your test rig.
> > >
> > > Hi Mateusz,
> > > Sure, I'll give it a spin, but I'm not optimistic. Your code looks
> > > almost identical to my latest attempt, where I tried placing vm_lock
> > > into different cachelines, including a separate one, and using
> > > HWCACHE_ALIGN. And yet all my attempts showed a regression.
> > > Just FYI, the test I'm using is the pft-threads test from the mmtests
> > > suite. I'll post results this evening.
> > > Thanks,
> > > Suren.
> >
> > Ok, well maybe you did not leave the pointer in place? :)
>
> True, maybe that will make a difference. I'll let you know soon.
>
> > It is plausible the problem is on- vs off-CPU behavior of rwsems --
> > there is a corner case where they neglect to spin. It is plausible
> > perf goes down simply because there is less on-CPU time.
> >
> > Thus when you bench, can you make sure to time(1) it?
>
> Sure, will do once I'm home. Thanks for the hints!
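
As a sanity check on what the alignment change buys, here is a userspace
toy model (nothing kernel-specific; pthread_rwlock_t stands in for the
rwsem and the field names are made up) showing that a 64-byte-aligned
lock member starts on its own cache line, so two adjacent objects cannot
false-share it:

#include <pthread.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdio.h>

struct toy_vma {
	unsigned long vm_start;	/* hot, read-mostly fields */
	unsigned long vm_end;
	int vm_lock_seq;
	/* like ____cacheline_aligned_in_smp: push the lock to a new line */
	alignas(64) pthread_rwlock_t vm_lock;
};

int main(void)
{
	printf("sizeof=%zu lock offset=%zu\n",
	       sizeof(struct toy_vma), offsetof(struct toy_vma, vm_lock));
	return 0;
}

On x86-64 this prints a lock offset of 64, i.e. the start of the second
cache line, which is what ____cacheline_aligned_in_smp arranges for the
embedded vm_lock in the patch quoted below.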
Unfortunately the same regression shows its ugly face. compare-mmtests.pl
Hmean results:

Hmean     faults/cpu-1     471264.4904 (   0.00%)   473085.6736 *    0.39%*
Hmean     faults/cpu-4     434571.7116 (   0.00%)   431214.3974 *   -0.77%*
Hmean     faults/cpu-7     407755.3217 (   0.00%)   395773.4052 *   -2.94%*
Hmean     faults/cpu-12    335604.9251 (   0.00%)   285426.3358 *  -14.95%*
Hmean     faults/cpu-21    187588.9077 (   0.00%)   171227.7179 *   -8.72%*
Hmean     faults/cpu-30    140875.7878 (   0.00%)   124120.3437 *  -11.89%*
Hmean     faults/cpu-48    106175.5493 (   0.00%)    93073.1499 *  -12.34%*
Hmean     faults/cpu-56     92585.2536 (   0.00%)    82837.4299 *  -10.53%*
Hmean     faults/sec-1     470924.4946 (   0.00%)   472730.9937 *    0.38%*
Hmean     faults/sec-4    1714823.8198 (   0.00%)  1693226.7248 *   -1.26%*
Hmean     faults/sec-7    2801395.1898 (   0.00%)  2717561.9417 *   -2.99%*
Hmean     faults/sec-12   3934168.2690 (   0.00%)  3319710.7540 *  -15.62%*
Hmean     faults/sec-21   3736832.4592 (   0.00%)  3444687.9145 *   -7.82%*
Hmean     faults/sec-30   3845187.2636 (   0.00%)  3403585.7064 *  -11.48%*
Hmean     faults/sec-48   4712317.7461 (   0.00%)  4180658.4710 *  -11.28%*
Hmean     faults/sec-56   4873233.9844 (   0.00%)  4423608.6568 *   -9.23%*

This is the time(1) output with the baseline:

920.47user 7748.31system 18:30.85elapsed 780%CPU (0avgtext+0avgdata 26385096maxresident)k
140848inputs+19744outputs (66major+1583463207minor)pagefaults 0swaps

This is the time(1) output with your change:

1025.73user 8618.74system 19:10.79elapsed 838%CPU (0avgtext+0avgdata 26385116maxresident)k
16584inputs+19512outputs (61major+1583468687minor)pagefaults 0swaps

Maybe it has something to do with NUMA? The system I'm running on has 2
NUMA nodes:

$ lscpu
Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          46 bits physical, 48 bits virtual
  Byte Order:             Little Endian
CPU(s):                   56
  On-line CPU(s) list:    0-55
Vendor ID:                GenuineIntel
  Model name:             Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
    CPU family:           6
    Model:                79
    Thread(s) per core:   2
    Core(s) per socket:   14
    Socket(s):            2
    Stepping:             1
    CPU max MHz:          3500.0000
    CPU min MHz:          1200.0000
    BogoMIPS:             5188.26
    Flags:                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 pti intel_ppin ssbd ibrs ibpb stibp tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap intel_pt xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts vnmi md_clear flush_l1d
Virtualization features:
  Virtualization:         VT-x
Caches (sum of all):
  L1d:                    896 KiB (28 instances)
  L1i:                    896 KiB (28 instances)
  L2:                     7 MiB (28 instances)
  L3:                     70 MiB (2 instances)
NUMA:
  NUMA node(s):           2
  NUMA node0 CPU(s):      0-13,28-41
  NUMA node1 CPU(s):      14-27,42-55

Any ideas?

> > For example with zsh I got:
> > ./run-mmtests.sh --no-monitor --config configs/config-workload-pft-threads
> >
> > 39.35s user 445.45s system 390% cpu 124.04s (2:04.04) total
> >
> > I verified with offcputime-bpfcc -K that indeed there is a bunch of
> > pft going off cpu from down_read/down_write even at the modest scale
> > this was running in my case.
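
If rwsems going off CPU is the problem here as well, the contended
slowpaths should light up during the run. Something like this should show
it directly (untested sketch; it assumes rwsem_down_read_slowpath and
rwsem_down_write_slowpath from kernel/locking/rwsem.c are not inlined and
can take kprobes on this kernel):

# the rwsem fastpaths never enter these functions, so any hits here
# mean the lock was contended and the task may have gone to sleep
bpftrace -e 'kprobe:rwsem_down_read_slowpath,
    kprobe:rwsem_down_write_slowpath { @[probe] = count(); }'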
> > > >
> > > > 1: https://lore.kernel.org/all/20230227173632.3292573-34-surenb@google.com/T/#u
> > > >
> > > > ---
> > > >  include/linux/mm.h       | 18 +++++++--------
> > > >  include/linux/mm_types.h | 10 ++++-----
> > > >  kernel/fork.c            | 47 ++++------------------------------------
> > > >  mm/userfaultfd.c         |  6 ++---
> > > >  4 files changed, 19 insertions(+), 62 deletions(-)
> > > >
> > > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > > index 43b40334e9b2..6d8b668d3deb 100644
> > > > --- a/include/linux/mm.h
> > > > +++ b/include/linux/mm.h
> > > > @@ -687,7 +687,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> > > >         if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq))
> > > >                 return false;
> > > >
> > > > -       if (unlikely(down_read_trylock(&vma->vm_lock->lock) == 0))
> > > > +       if (unlikely(down_read_trylock(&vma->vm_lock) == 0))
> > > >                 return false;
> > > >
> > > >         /*
> > > > @@ -702,7 +702,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> > > >          * This pairs with RELEASE semantics in vma_end_write_all().
> > > >          */
> > > >         if (unlikely(vma->vm_lock_seq == smp_load_acquire(&vma->vm_mm->mm_lock_seq))) {
> > > > -               up_read(&vma->vm_lock->lock);
> > > > +               up_read(&vma->vm_lock);
> > > >                 return false;
> > > >         }
> > > >         return true;
> > > > @@ -711,7 +711,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> > > >  static inline void vma_end_read(struct vm_area_struct *vma)
> > > >  {
> > > >         rcu_read_lock(); /* keeps vma alive till the end of up_read */
> > > > -       up_read(&vma->vm_lock->lock);
> > > > +       up_read(&vma->vm_lock);
> > > >         rcu_read_unlock();
> > > >  }
> > > >
> > > > @@ -740,7 +740,7 @@ static inline void vma_start_write(struct vm_area_struct *vma)
> > > >         if (__is_vma_write_locked(vma, &mm_lock_seq))
> > > >                 return;
> > > >
> > > > -       down_write(&vma->vm_lock->lock);
> > > > +       down_write(&vma->vm_lock);
> > > >         /*
> > > >          * We should use WRITE_ONCE() here because we can have concurrent reads
> > > >          * from the early lockless pessimistic check in vma_start_read().
> > > > @@ -748,7 +748,7 @@ static inline void vma_start_write(struct vm_area_struct *vma)
> > > >          * we should use WRITE_ONCE() for cleanliness and to keep KCSAN happy.
> > > >          */
> > > >         WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
> > > > -       up_write(&vma->vm_lock->lock);
> > > > +       up_write(&vma->vm_lock);
> > > >  }
> > > >
> > > >  static inline void vma_assert_write_locked(struct vm_area_struct *vma)
> > > > @@ -760,7 +760,7 @@ static inline void vma_assert_write_locked(struct vm_area_struct *vma)
> > > >
> > > >  static inline void vma_assert_locked(struct vm_area_struct *vma)
> > > >  {
> > > > -       if (!rwsem_is_locked(&vma->vm_lock->lock))
> > > > +       if (!rwsem_is_locked(&vma->vm_lock))
> > > >                 vma_assert_write_locked(vma);
> > > >  }
> > > >
> > > > @@ -827,10 +827,6 @@ static inline void assert_fault_locked(struct vm_fault *vmf)
> > > >
> > > >  extern const struct vm_operations_struct vma_dummy_vm_ops;
> > > >
> > > > -/*
> > > > - * WARNING: vma_init does not initialize vma->vm_lock.
> > > > - * Use vm_area_alloc()/vm_area_free() if vma needs locking.
> > > > - */
> > > >  static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
> > > >  {
> > > >         memset(vma, 0, sizeof(*vma));
> > > > @@ -839,6 +835,8 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
> > > >         INIT_LIST_HEAD(&vma->anon_vma_chain);
> > > >         vma_mark_detached(vma, false);
> > > >         vma_numab_state_init(vma);
> > > > +       init_rwsem(&vma->vm_lock);
> > > > +       vma->vm_lock_seq = -1;
> > > >  }
> > > >
> > > >  /* Use when VMA is not part of the VMA tree and needs no locking */
> > > > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > > > index 003619fab20e..caffdb4eeb94 100644
> > > > --- a/include/linux/mm_types.h
> > > > +++ b/include/linux/mm_types.h
> > > > @@ -615,10 +615,6 @@ static inline struct anon_vma_name *anon_vma_name_alloc(const char *name)
> > > >  }
> > > >  #endif
> > > >
> > > > -struct vma_lock {
> > > > -       struct rw_semaphore lock;
> > > > -};
> > > > -
> > > >  struct vma_numab_state {
> > > >         /*
> > > >          * Initialised as time in 'jiffies' after which VMA
> > > > @@ -716,8 +712,7 @@ struct vm_area_struct {
> > > >          * slowpath.
> > > >          */
> > > >         int vm_lock_seq;
> > > > -       /* Unstable RCU readers are allowed to read this. */
> > > > -       struct vma_lock *vm_lock;
> > > > +       void *vm_dummy;
> > > >  #endif
> > > >
> > > >         /*
> > > > @@ -770,6 +765,9 @@ struct vm_area_struct {
> > > >         struct vma_numab_state *numab_state;    /* NUMA Balancing state */
> > > >  #endif
> > > >         struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
> > > > +#ifdef CONFIG_PER_VMA_LOCK
> > > > +       struct rw_semaphore vm_lock ____cacheline_aligned_in_smp;
> > > > +#endif
> > > >  } __randomize_layout;
> > > >
> > > >  #ifdef CONFIG_NUMA
> > > > diff --git a/kernel/fork.c b/kernel/fork.c
> > > > index 92bfe56c9fed..eab04a24d5f1 100644
> > > > --- a/kernel/fork.c
> > > > +++ b/kernel/fork.c
> > > > @@ -436,35 +436,6 @@ static struct kmem_cache *vm_area_cachep;
> > > >  /* SLAB cache for mm_struct structures (tsk->mm) */
> > > >  static struct kmem_cache *mm_cachep;
> > > >
> > > > -#ifdef CONFIG_PER_VMA_LOCK
> > > > -
> > > > -/* SLAB cache for vm_area_struct.lock */
> > > > -static struct kmem_cache *vma_lock_cachep;
> > > > -
> > > > -static bool vma_lock_alloc(struct vm_area_struct *vma)
> > > > -{
> > > > -       vma->vm_lock = kmem_cache_alloc(vma_lock_cachep, GFP_KERNEL);
> > > > -       if (!vma->vm_lock)
> > > > -               return false;
> > > > -
> > > > -       init_rwsem(&vma->vm_lock->lock);
> > > > -       vma->vm_lock_seq = -1;
> > > > -
> > > > -       return true;
> > > > -}
> > > > -
> > > > -static inline void vma_lock_free(struct vm_area_struct *vma)
> > > > -{
> > > > -       kmem_cache_free(vma_lock_cachep, vma->vm_lock);
> > > > -}
> > > > -
> > > > -#else /* CONFIG_PER_VMA_LOCK */
> > > > -
> > > > -static inline bool vma_lock_alloc(struct vm_area_struct *vma) { return true; }
> > > > -static inline void vma_lock_free(struct vm_area_struct *vma) {}
> > > > -
> > > > -#endif /* CONFIG_PER_VMA_LOCK */
> > > > -
> > > >  struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
> > > >  {
> > > >         struct vm_area_struct *vma;
> > > > @@ -474,10 +445,6 @@ struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
> > > >                 return NULL;
> > > >
> > > >         vma_init(vma, mm);
> > > > -       if (!vma_lock_alloc(vma)) {
> > > > -               kmem_cache_free(vm_area_cachep, vma);
> > > > -               return NULL;
> > > > -       }
> > > >
> > > >         return vma;
> > > >  }
> > > > @@ -496,10 +463,8 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
> > > >          * will be reinitialized.
> > > >          */
> > > >         data_race(memcpy(new, orig, sizeof(*new)));
> > > > -       if (!vma_lock_alloc(new)) {
> > > > -               kmem_cache_free(vm_area_cachep, new);
> > > > -               return NULL;
> > > > -       }
> > > > +       init_rwsem(&new->vm_lock);
> > > > +       new->vm_lock_seq = -1;
> > > >         INIT_LIST_HEAD(&new->anon_vma_chain);
> > > >         vma_numab_state_init(new);
> > > >         dup_anon_vma_name(orig, new);
> > > > @@ -511,7 +476,6 @@ void __vm_area_free(struct vm_area_struct *vma)
> > > >  {
> > > >         vma_numab_state_free(vma);
> > > >         free_anon_vma_name(vma);
> > > > -       vma_lock_free(vma);
> > > >         kmem_cache_free(vm_area_cachep, vma);
> > > >  }
> > > >
> > > > @@ -522,7 +486,7 @@ static void vm_area_free_rcu_cb(struct rcu_head *head)
> > > >                                                   vm_rcu);
> > > >
> > > >         /* The vma should not be locked while being destroyed. */
> > > > -       VM_BUG_ON_VMA(rwsem_is_locked(&vma->vm_lock->lock), vma);
> > > > +       VM_BUG_ON_VMA(rwsem_is_locked(&vma->vm_lock), vma);
> > > >         __vm_area_free(vma);
> > > >  }
> > > >  #endif
> > > > @@ -3192,10 +3156,7 @@ void __init proc_caches_init(void)
> > > >                         SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT,
> > > >                         NULL);
> > > >
> > > > -       vm_area_cachep = KMEM_CACHE(vm_area_struct, SLAB_PANIC|SLAB_ACCOUNT);
> > > > -#ifdef CONFIG_PER_VMA_LOCK
> > > > -       vma_lock_cachep = KMEM_CACHE(vma_lock, SLAB_PANIC|SLAB_ACCOUNT);
> > > > -#endif
> > > > +       vm_area_cachep = KMEM_CACHE(vm_area_struct, SLAB_PANIC|SLAB_ACCOUNT|SLAB_HWCACHE_ALIGN);
> > > >         mmap_init();
> > > >         nsproxy_cache_init();
> > > >  }
> > > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > > > index 3b7715ecf292..e95ecb2063d2 100644
> > > > --- a/mm/userfaultfd.c
> > > > +++ b/mm/userfaultfd.c
> > > > @@ -92,7 +92,7 @@ static struct vm_area_struct *uffd_lock_vma(struct mm_struct *mm,
> > > >                  * mmap_lock, which guarantees that nobody can lock the
> > > >                  * vma for write (vma_start_write()) under us.
> > > >                  */
> > > > -               down_read(&vma->vm_lock->lock);
> > > > +               down_read(&vma->vm_lock);
> > > >         }
> > > >
> > > >         mmap_read_unlock(mm);
> > > > @@ -1468,9 +1468,9 @@ static int uffd_move_lock(struct mm_struct *mm,
> > > >                  * See comment in uffd_lock_vma() as to why not using
> > > >                  * vma_start_read() here.
> > > >                  */
> > > > -               down_read(&(*dst_vmap)->vm_lock->lock);
> > > > +               down_read(&(*dst_vmap)->vm_lock);
> > > >                 if (*dst_vmap != *src_vmap)
> > > > -                       down_read_nested(&(*src_vmap)->vm_lock->lock,
> > > > +                       down_read_nested(&(*src_vmap)->vm_lock,
> > > >                                          SINGLE_DEPTH_NESTING);
> > > >         }
> > > >         mmap_read_unlock(mm);
> > > > --
> > > > 2.43.0
> > > >
> >
> > --
> > Mateusz Guzik
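
And to follow up on my NUMA question above, one thing I can try in order
to rule the topology in or out is pinning the whole benchmark to a single
node (sketch only; assumes numactl is installed on the box and that the
workload fits in node0's memory):

$ numactl --cpunodebind=0 --membind=0 \
      ./run-mmtests.sh --no-monitor --config configs/config-workload-pft-threads

If the regression disappears with both CPUs and memory bound to one node,
cross-node traffic on the now-embedded lock would be the prime suspect;
if it stays, the effect is local to a socket.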