* [PATCH v7 01/17] mm: introduce vma_start_read_locked{_nested} helpers
2024-12-26 17:06 [PATCH v7 00/17] move per-vma lock into vm_area_struct Suren Baghdasaryan
@ 2024-12-26 17:06 ` Suren Baghdasaryan
2025-01-08 14:59 ` Liam R. Howlett
2024-12-26 17:06 ` [PATCH v7 02/17] mm: move per-vma lock into vm_area_struct Suren Baghdasaryan
` (15 subsequent siblings)
16 siblings, 1 reply; 65+ messages in thread
From: Suren Baghdasaryan @ 2024-12-26 17:06 UTC (permalink / raw)
To: akpm
Cc: peterz, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team, surenb
Introduce helper functions which can be used to read-lock a VMA when
holding mmap_lock for read. Replace direct accesses to vma->vm_lock with
these new helpers.
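A minimal usage sketch of the intended pattern, mirroring what uffd_lock_vma()
below does (illustrative only; mm, address and the find_vma() lookup are
assumptions for the example, not part of this patch):

        struct vm_area_struct *vma;

        mmap_read_lock(mm);
        vma = find_vma(mm, address);
        if (vma) {
                /* cannot fail: mmap_lock blocks concurrent vma_start_write() */
                vma_start_read_locked(vma);
        }
        mmap_read_unlock(mm);

        if (vma) {
                /* operate on the vma under its per-VMA read lock */
                vma_end_read(vma);
        }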
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
---
include/linux/mm.h | 24 ++++++++++++++++++++++++
mm/userfaultfd.c | 22 +++++-----------------
2 files changed, 29 insertions(+), 17 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 406b981af881..a48e207d25f2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -735,6 +735,30 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
return true;
}
+/*
+ * Use only while holding mmap read lock which guarantees that locking will not
+ * fail (nobody can concurrently write-lock the vma). vma_start_read() should
+ * not be used in such cases because it might fail due to mm_lock_seq overflow.
+ * This functionality is used to obtain vma read lock and drop the mmap read lock.
+ */
+static inline void vma_start_read_locked_nested(struct vm_area_struct *vma, int subclass)
+{
+ mmap_assert_locked(vma->vm_mm);
+ down_read_nested(&vma->vm_lock->lock, subclass);
+}
+
+/*
+ * Use only while holding mmap read lock which guarantees that locking will not
+ * fail (nobody can concurrently write-lock the vma). vma_start_read() should
+ * not be used in such cases because it might fail due to mm_lock_seq overflow.
+ * This functionality is used to obtain vma read lock and drop the mmap read lock.
+ */
+static inline void vma_start_read_locked(struct vm_area_struct *vma)
+{
+ mmap_assert_locked(vma->vm_mm);
+ down_read(&vma->vm_lock->lock);
+}
+
static inline void vma_end_read(struct vm_area_struct *vma)
{
rcu_read_lock(); /* keeps vma alive till the end of up_read */
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index af3dfc3633db..4527c385935b 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -84,16 +84,8 @@ static struct vm_area_struct *uffd_lock_vma(struct mm_struct *mm,
mmap_read_lock(mm);
vma = find_vma_and_prepare_anon(mm, address);
- if (!IS_ERR(vma)) {
- /*
- * We cannot use vma_start_read() as it may fail due to
- * false locked (see comment in vma_start_read()). We
- * can avoid that by directly locking vm_lock under
- * mmap_lock, which guarantees that nobody can lock the
- * vma for write (vma_start_write()) under us.
- */
- down_read(&vma->vm_lock->lock);
- }
+ if (!IS_ERR(vma))
+ vma_start_read_locked(vma);
mmap_read_unlock(mm);
return vma;
@@ -1491,14 +1483,10 @@ static int uffd_move_lock(struct mm_struct *mm,
mmap_read_lock(mm);
err = find_vmas_mm_locked(mm, dst_start, src_start, dst_vmap, src_vmap);
if (!err) {
- /*
- * See comment in uffd_lock_vma() as to why not using
- * vma_start_read() here.
- */
- down_read(&(*dst_vmap)->vm_lock->lock);
+ vma_start_read_locked(*dst_vmap);
if (*dst_vmap != *src_vmap)
- down_read_nested(&(*src_vmap)->vm_lock->lock,
- SINGLE_DEPTH_NESTING);
+ vma_start_read_locked_nested(*src_vmap,
+ SINGLE_DEPTH_NESTING);
}
mmap_read_unlock(mm);
return err;
--
2.47.1.613.gc27f4b7a9f-goog
* Re: [PATCH v7 01/17] mm: introduce vma_start_read_locked{_nested} helpers
2024-12-26 17:06 ` [PATCH v7 01/17] mm: introduce vma_start_read_locked{_nested} helpers Suren Baghdasaryan
@ 2025-01-08 14:59 ` Liam R. Howlett
0 siblings, 0 replies; 65+ messages in thread
From: Liam R. Howlett @ 2025-01-08 14:59 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: akpm, peterz, willy, lorenzo.stoakes, mhocko, vbabka, hannes,
mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team
* Suren Baghdasaryan <surenb@google.com> [241226 12:07]:
> Introduce helper functions which can be used to read-lock a VMA when
> holding mmap_lock for read. Replace direct accesses to vma->vm_lock with
> these new helpers.
>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> ---
> include/linux/mm.h | 24 ++++++++++++++++++++++++
> mm/userfaultfd.c | 22 +++++-----------------
> 2 files changed, 29 insertions(+), 17 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 406b981af881..a48e207d25f2 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -735,6 +735,30 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> return true;
> }
>
> +/*
> + * Use only while holding mmap read lock which guarantees that locking will not
> + * fail (nobody can concurrently write-lock the vma). vma_start_read() should
> + * not be used in such cases because it might fail due to mm_lock_seq overflow.
> + * This functionality is used to obtain vma read lock and drop the mmap read lock.
> + */
> +static inline void vma_start_read_locked_nested(struct vm_area_struct *vma, int subclass)
> +{
> + mmap_assert_locked(vma->vm_mm);
> + down_read_nested(&vma->vm_lock->lock, subclass);
> +}
> +
> +/*
> + * Use only while holding mmap read lock which guarantees that locking will not
> + * fail (nobody can concurrently write-lock the vma). vma_start_read() should
> + * not be used in such cases because it might fail due to mm_lock_seq overflow.
> + * This functionality is used to obtain vma read lock and drop the mmap read lock.
> + */
> +static inline void vma_start_read_locked(struct vm_area_struct *vma)
> +{
> + mmap_assert_locked(vma->vm_mm);
> + down_read(&vma->vm_lock->lock);
> +}
> +
> static inline void vma_end_read(struct vm_area_struct *vma)
> {
> rcu_read_lock(); /* keeps vma alive till the end of up_read */
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index af3dfc3633db..4527c385935b 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -84,16 +84,8 @@ static struct vm_area_struct *uffd_lock_vma(struct mm_struct *mm,
>
> mmap_read_lock(mm);
> vma = find_vma_and_prepare_anon(mm, address);
> - if (!IS_ERR(vma)) {
> - /*
> - * We cannot use vma_start_read() as it may fail due to
> - * false locked (see comment in vma_start_read()). We
> - * can avoid that by directly locking vm_lock under
> - * mmap_lock, which guarantees that nobody can lock the
> - * vma for write (vma_start_write()) under us.
> - */
> - down_read(&vma->vm_lock->lock);
> - }
> + if (!IS_ERR(vma))
> + vma_start_read_locked(vma);
>
> mmap_read_unlock(mm);
> return vma;
> @@ -1491,14 +1483,10 @@ static int uffd_move_lock(struct mm_struct *mm,
> mmap_read_lock(mm);
> err = find_vmas_mm_locked(mm, dst_start, src_start, dst_vmap, src_vmap);
> if (!err) {
> - /*
> - * See comment in uffd_lock_vma() as to why not using
> - * vma_start_read() here.
> - */
> - down_read(&(*dst_vmap)->vm_lock->lock);
> + vma_start_read_locked(*dst_vmap);
> if (*dst_vmap != *src_vmap)
> - down_read_nested(&(*src_vmap)->vm_lock->lock,
> - SINGLE_DEPTH_NESTING);
> + vma_start_read_locked_nested(*src_vmap,
> + SINGLE_DEPTH_NESTING);
> }
> mmap_read_unlock(mm);
> return err;
> --
> 2.47.1.613.gc27f4b7a9f-goog
>
>
* [PATCH v7 02/17] mm: move per-vma lock into vm_area_struct
2024-12-26 17:06 [PATCH v7 00/17] move per-vma lock into vm_area_struct Suren Baghdasaryan
2024-12-26 17:06 ` [PATCH v7 01/17] mm: introduce vma_start_read_locked{_nested} helpers Suren Baghdasaryan
@ 2024-12-26 17:06 ` Suren Baghdasaryan
2025-01-08 14:59 ` Liam R. Howlett
2024-12-26 17:06 ` [PATCH v7 03/17] mm: mark vma as detached until it's added into vma tree Suren Baghdasaryan
` (14 subsequent siblings)
16 siblings, 1 reply; 65+ messages in thread
From: Suren Baghdasaryan @ 2024-12-26 17:06 UTC (permalink / raw)
To: akpm
Cc: peterz, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team, surenb
Back when per-vma locks were introduced, vm_lock was moved out of
vm_area_struct in [1] because of the performance regression caused by
false cacheline sharing. Recent investigation [2] revealed that the
regression is limited to a rather old Broadwell microarchitecture and
even there it can be mitigated by disabling adjacent cacheline
prefetching, see [3].
Splitting a single logical structure into multiple ones leads to more
complicated management, extra pointer dereferences and overall less
maintainable code. When that split-away part is a lock, it complicates
things even further. With no performance benefits, there are no reasons
for this split. Merging the vm_lock back into vm_area_struct also allows
vm_area_struct to use SLAB_TYPESAFE_BY_RCU later in this patchset. Move
vm_lock back into vm_area_struct, aligning it at the cacheline boundary
and changing the cache to be cacheline-aligned as well. With kernel
compiled using defconfig, this causes VMA memory consumption to grow from
160 (vm_area_struct) + 40 (vm_lock) bytes to 256 bytes:
slabinfo before:
<name> ... <objsize> <objperslab> <pagesperslab> : ...
vma_lock ... 40 102 1 : ...
vm_area_struct ... 160 51 2 : ...
slabinfo after moving vm_lock:
<name> ... <objsize> <objperslab> <pagesperslab> : ...
vm_area_struct ... 256 32 2 : ...
Aggregate VMA memory consumption per 1000 VMAs grows from 50 to 64 pages,
which is 5.5MB per 100000 VMAs. Note that the size of this structure is
dependent on the kernel configuration and typically the original size is
higher than 160 bytes. Therefore these calculations are close to the
worst case scenario. A more realistic vm_area_struct usage before this
change is:
<name> ... <objsize> <objperslab> <pagesperslab> : ...
vma_lock ... 40 102 1 : ...
vm_area_struct ... 176 46 2 : ...
Aggregate VMA memory consumption per 1000 VMAs grows from 54 to 64 pages,
which is 3.9MB per 100000 VMAs. This memory consumption growth can be
addressed later by optimizing the vm_lock.
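For reference, the page counts above follow directly from the slabinfo
fields shown (objperslab, pagesperslab), assuming 4KB pages:

  before: 1000 VMAs need 20 slabs * 2 pages (vm_area_struct)
                       + 10 slabs * 1 page  (vma_lock)       = 50 pages
  after:  1000 VMAs need 32 slabs * 2 pages (vm_area_struct) = 64 pages
  growth: 14 pages per 1000 VMAs = 1400 pages per 100000 VMAs ~= 5.5MB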
[1] https://lore.kernel.org/all/20230227173632.3292573-34-surenb@google.com/
[2] https://lore.kernel.org/all/ZsQyI%2F087V34JoIt@xsang-OptiPlex-9020/
[3] https://lore.kernel.org/all/CAJuCfpEisU8Lfe96AYJDZ+OM4NoPmnw9bP53cT_kbfP_pR+-2g@mail.gmail.com/
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
---
include/linux/mm.h | 28 ++++++++++--------
include/linux/mm_types.h | 6 ++--
kernel/fork.c | 49 ++++----------------------------
tools/testing/vma/vma_internal.h | 33 +++++----------------
4 files changed, 32 insertions(+), 84 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index a48e207d25f2..f3f92ba8f5fe 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -697,6 +697,12 @@ static inline void vma_numab_state_free(struct vm_area_struct *vma) {}
#endif /* CONFIG_NUMA_BALANCING */
#ifdef CONFIG_PER_VMA_LOCK
+static inline void vma_lock_init(struct vm_area_struct *vma)
+{
+ init_rwsem(&vma->vm_lock.lock);
+ vma->vm_lock_seq = UINT_MAX;
+}
+
/*
* Try to read-lock a vma. The function is allowed to occasionally yield false
* locked result to avoid performance overhead, in which case we fall back to
@@ -714,7 +720,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq.sequence))
return false;
- if (unlikely(down_read_trylock(&vma->vm_lock->lock) == 0))
+ if (unlikely(down_read_trylock(&vma->vm_lock.lock) == 0))
return false;
/*
@@ -729,7 +735,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
* This pairs with RELEASE semantics in vma_end_write_all().
*/
if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&vma->vm_mm->mm_lock_seq))) {
- up_read(&vma->vm_lock->lock);
+ up_read(&vma->vm_lock.lock);
return false;
}
return true;
@@ -744,7 +750,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
static inline void vma_start_read_locked_nested(struct vm_area_struct *vma, int subclass)
{
mmap_assert_locked(vma->vm_mm);
- down_read_nested(&vma->vm_lock->lock, subclass);
+ down_read_nested(&vma->vm_lock.lock, subclass);
}
/*
@@ -756,13 +762,13 @@ static inline void vma_start_read_locked_nested(struct vm_area_struct *vma, int
static inline void vma_start_read_locked(struct vm_area_struct *vma)
{
mmap_assert_locked(vma->vm_mm);
- down_read(&vma->vm_lock->lock);
+ down_read(&vma->vm_lock.lock);
}
static inline void vma_end_read(struct vm_area_struct *vma)
{
rcu_read_lock(); /* keeps vma alive till the end of up_read */
- up_read(&vma->vm_lock->lock);
+ up_read(&vma->vm_lock.lock);
rcu_read_unlock();
}
@@ -791,7 +797,7 @@ static inline void vma_start_write(struct vm_area_struct *vma)
if (__is_vma_write_locked(vma, &mm_lock_seq))
return;
- down_write(&vma->vm_lock->lock);
+ down_write(&vma->vm_lock.lock);
/*
* We should use WRITE_ONCE() here because we can have concurrent reads
* from the early lockless pessimistic check in vma_start_read().
@@ -799,7 +805,7 @@ static inline void vma_start_write(struct vm_area_struct *vma)
* we should use WRITE_ONCE() for cleanliness and to keep KCSAN happy.
*/
WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
- up_write(&vma->vm_lock->lock);
+ up_write(&vma->vm_lock.lock);
}
static inline void vma_assert_write_locked(struct vm_area_struct *vma)
@@ -811,7 +817,7 @@ static inline void vma_assert_write_locked(struct vm_area_struct *vma)
static inline void vma_assert_locked(struct vm_area_struct *vma)
{
- if (!rwsem_is_locked(&vma->vm_lock->lock))
+ if (!rwsem_is_locked(&vma->vm_lock.lock))
vma_assert_write_locked(vma);
}
@@ -844,6 +850,7 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
#else /* CONFIG_PER_VMA_LOCK */
+static inline void vma_lock_init(struct vm_area_struct *vma) {}
static inline bool vma_start_read(struct vm_area_struct *vma)
{ return false; }
static inline void vma_end_read(struct vm_area_struct *vma) {}
@@ -878,10 +885,6 @@ static inline void assert_fault_locked(struct vm_fault *vmf)
extern const struct vm_operations_struct vma_dummy_vm_ops;
-/*
- * WARNING: vma_init does not initialize vma->vm_lock.
- * Use vm_area_alloc()/vm_area_free() if vma needs locking.
- */
static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
{
memset(vma, 0, sizeof(*vma));
@@ -890,6 +893,7 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
INIT_LIST_HEAD(&vma->anon_vma_chain);
vma_mark_detached(vma, false);
vma_numab_state_init(vma);
+ vma_lock_init(vma);
}
/* Use when VMA is not part of the VMA tree and needs no locking */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 5f1b2dc788e2..6573d95f1d1e 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -730,8 +730,6 @@ struct vm_area_struct {
* slowpath.
*/
unsigned int vm_lock_seq;
- /* Unstable RCU readers are allowed to read this. */
- struct vma_lock *vm_lock;
#endif
/*
@@ -784,6 +782,10 @@ struct vm_area_struct {
struct vma_numab_state *numab_state; /* NUMA Balancing state */
#endif
struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
+#ifdef CONFIG_PER_VMA_LOCK
+ /* Unstable RCU readers are allowed to read this. */
+ struct vma_lock vm_lock ____cacheline_aligned_in_smp;
+#endif
} __randomize_layout;
#ifdef CONFIG_NUMA
diff --git a/kernel/fork.c b/kernel/fork.c
index ded49f18cd95..40a8e615499f 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -436,35 +436,6 @@ static struct kmem_cache *vm_area_cachep;
/* SLAB cache for mm_struct structures (tsk->mm) */
static struct kmem_cache *mm_cachep;
-#ifdef CONFIG_PER_VMA_LOCK
-
-/* SLAB cache for vm_area_struct.lock */
-static struct kmem_cache *vma_lock_cachep;
-
-static bool vma_lock_alloc(struct vm_area_struct *vma)
-{
- vma->vm_lock = kmem_cache_alloc(vma_lock_cachep, GFP_KERNEL);
- if (!vma->vm_lock)
- return false;
-
- init_rwsem(&vma->vm_lock->lock);
- vma->vm_lock_seq = UINT_MAX;
-
- return true;
-}
-
-static inline void vma_lock_free(struct vm_area_struct *vma)
-{
- kmem_cache_free(vma_lock_cachep, vma->vm_lock);
-}
-
-#else /* CONFIG_PER_VMA_LOCK */
-
-static inline bool vma_lock_alloc(struct vm_area_struct *vma) { return true; }
-static inline void vma_lock_free(struct vm_area_struct *vma) {}
-
-#endif /* CONFIG_PER_VMA_LOCK */
-
struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
{
struct vm_area_struct *vma;
@@ -474,10 +445,6 @@ struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
return NULL;
vma_init(vma, mm);
- if (!vma_lock_alloc(vma)) {
- kmem_cache_free(vm_area_cachep, vma);
- return NULL;
- }
return vma;
}
@@ -496,10 +463,7 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
* will be reinitialized.
*/
data_race(memcpy(new, orig, sizeof(*new)));
- if (!vma_lock_alloc(new)) {
- kmem_cache_free(vm_area_cachep, new);
- return NULL;
- }
+ vma_lock_init(new);
INIT_LIST_HEAD(&new->anon_vma_chain);
vma_numab_state_init(new);
dup_anon_vma_name(orig, new);
@@ -511,7 +475,6 @@ void __vm_area_free(struct vm_area_struct *vma)
{
vma_numab_state_free(vma);
free_anon_vma_name(vma);
- vma_lock_free(vma);
kmem_cache_free(vm_area_cachep, vma);
}
@@ -522,7 +485,7 @@ static void vm_area_free_rcu_cb(struct rcu_head *head)
vm_rcu);
/* The vma should not be locked while being destroyed. */
- VM_BUG_ON_VMA(rwsem_is_locked(&vma->vm_lock->lock), vma);
+ VM_BUG_ON_VMA(rwsem_is_locked(&vma->vm_lock.lock), vma);
__vm_area_free(vma);
}
#endif
@@ -3188,11 +3151,9 @@ void __init proc_caches_init(void)
sizeof(struct fs_struct), 0,
SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT,
NULL);
-
- vm_area_cachep = KMEM_CACHE(vm_area_struct, SLAB_PANIC|SLAB_ACCOUNT);
-#ifdef CONFIG_PER_VMA_LOCK
- vma_lock_cachep = KMEM_CACHE(vma_lock, SLAB_PANIC|SLAB_ACCOUNT);
-#endif
+ vm_area_cachep = KMEM_CACHE(vm_area_struct,
+ SLAB_HWCACHE_ALIGN|SLAB_NO_MERGE|SLAB_PANIC|
+ SLAB_ACCOUNT);
mmap_init();
nsproxy_cache_init();
}
diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
index ae635eecbfa8..d19ce6fcab83 100644
--- a/tools/testing/vma/vma_internal.h
+++ b/tools/testing/vma/vma_internal.h
@@ -270,10 +270,10 @@ struct vm_area_struct {
/*
* Can only be written (using WRITE_ONCE()) while holding both:
* - mmap_lock (in write mode)
- * - vm_lock->lock (in write mode)
+ * - vm_lock.lock (in write mode)
* Can be read reliably while holding one of:
* - mmap_lock (in read or write mode)
- * - vm_lock->lock (in read or write mode)
+ * - vm_lock.lock (in read or write mode)
* Can be read unreliably (using READ_ONCE()) for pessimistic bailout
* while holding nothing (except RCU to keep the VMA struct allocated).
*
@@ -282,7 +282,7 @@ struct vm_area_struct {
* slowpath.
*/
unsigned int vm_lock_seq;
- struct vma_lock *vm_lock;
+ struct vma_lock vm_lock;
#endif
/*
@@ -459,17 +459,10 @@ static inline struct vm_area_struct *vma_next(struct vma_iterator *vmi)
return mas_find(&vmi->mas, ULONG_MAX);
}
-static inline bool vma_lock_alloc(struct vm_area_struct *vma)
+static inline void vma_lock_init(struct vm_area_struct *vma)
{
- vma->vm_lock = calloc(1, sizeof(struct vma_lock));
-
- if (!vma->vm_lock)
- return false;
-
- init_rwsem(&vma->vm_lock->lock);
+ init_rwsem(&vma->vm_lock.lock);
vma->vm_lock_seq = UINT_MAX;
-
- return true;
}
static inline void vma_assert_write_locked(struct vm_area_struct *);
@@ -492,6 +485,7 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
vma->vm_ops = &vma_dummy_vm_ops;
INIT_LIST_HEAD(&vma->anon_vma_chain);
vma_mark_detached(vma, false);
+ vma_lock_init(vma);
}
static inline struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
@@ -502,10 +496,6 @@ static inline struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
return NULL;
vma_init(vma, mm);
- if (!vma_lock_alloc(vma)) {
- free(vma);
- return NULL;
- }
return vma;
}
@@ -518,10 +508,7 @@ static inline struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
return NULL;
memcpy(new, orig, sizeof(*new));
- if (!vma_lock_alloc(new)) {
- free(new);
- return NULL;
- }
+ vma_lock_init(new);
INIT_LIST_HEAD(&new->anon_vma_chain);
return new;
@@ -691,14 +678,8 @@ static inline void mpol_put(struct mempolicy *)
{
}
-static inline void vma_lock_free(struct vm_area_struct *vma)
-{
- free(vma->vm_lock);
-}
-
static inline void __vm_area_free(struct vm_area_struct *vma)
{
- vma_lock_free(vma);
free(vma);
}
--
2.47.1.613.gc27f4b7a9f-goog
* Re: [PATCH v7 02/17] mm: move per-vma lock into vm_area_struct
2024-12-26 17:06 ` [PATCH v7 02/17] mm: move per-vma lock into vm_area_struct Suren Baghdasaryan
@ 2025-01-08 14:59 ` Liam R. Howlett
0 siblings, 0 replies; 65+ messages in thread
From: Liam R. Howlett @ 2025-01-08 14:59 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: akpm, peterz, willy, lorenzo.stoakes, mhocko, vbabka, hannes,
mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team
* Suren Baghdasaryan <surenb@google.com> [241226 12:07]:
> Back when per-vma locks were introduced, vm_lock was moved out of
> vm_area_struct in [1] because of the performance regression caused by
> false cacheline sharing. Recent investigation [2] revealed that the
> regression is limited to a rather old Broadwell microarchitecture and
> even there it can be mitigated by disabling adjacent cacheline
> prefetching, see [3].
>
> Splitting a single logical structure into multiple ones leads to more
> complicated management, extra pointer dereferences and overall less
> maintainable code. When that split-away part is a lock, it complicates
> things even further. With no performance benefits, there are no reasons
> for this split. Merging the vm_lock back into vm_area_struct also allows
> vm_area_struct to use SLAB_TYPESAFE_BY_RCU later in this patchset. Move
> vm_lock back into vm_area_struct, aligning it at the cacheline boundary
> and changing the cache to be cacheline-aligned as well. With kernel
> compiled using defconfig, this causes VMA memory consumption to grow from
> 160 (vm_area_struct) + 40 (vm_lock) bytes to 256 bytes:
>
> slabinfo before:
> <name> ... <objsize> <objperslab> <pagesperslab> : ...
> vma_lock ... 40 102 1 : ...
> vm_area_struct ... 160 51 2 : ...
>
> slabinfo after moving vm_lock:
> <name> ... <objsize> <objperslab> <pagesperslab> : ...
> vm_area_struct ... 256 32 2 : ...
>
> Aggregate VMA memory consumption per 1000 VMAs grows from 50 to 64 pages,
> which is 5.5MB per 100000 VMAs. Note that the size of this structure is
> dependent on the kernel configuration and typically the original size is
> higher than 160 bytes. Therefore these calculations are close to the
> worst case scenario. A more realistic vm_area_struct usage before this
> change is:
>
> <name> ... <objsize> <objperslab> <pagesperslab> : ...
> vma_lock ... 40 102 1 : ...
> vm_area_struct ... 176 46 2 : ...
>
> Aggregate VMA memory consumption per 1000 VMAs grows from 54 to 64 pages,
> which is 3.9MB per 100000 VMAs. This memory consumption growth can be
> addressed later by optimizing the vm_lock.
>
> [1] https://lore.kernel.org/all/20230227173632.3292573-34-surenb@google.com/
> [2] https://lore.kernel.org/all/ZsQyI%2F087V34JoIt@xsang-OptiPlex-9020/
> [3] https://lore.kernel.org/all/CAJuCfpEisU8Lfe96AYJDZ+OM4NoPmnw9bP53cT_kbfP_pR+-2g@mail.gmail.com/
>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> ---
> include/linux/mm.h | 28 ++++++++++--------
> include/linux/mm_types.h | 6 ++--
> kernel/fork.c | 49 ++++----------------------------
> tools/testing/vma/vma_internal.h | 33 +++++----------------
> 4 files changed, 32 insertions(+), 84 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index a48e207d25f2..f3f92ba8f5fe 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -697,6 +697,12 @@ static inline void vma_numab_state_free(struct vm_area_struct *vma) {}
> #endif /* CONFIG_NUMA_BALANCING */
>
> #ifdef CONFIG_PER_VMA_LOCK
> +static inline void vma_lock_init(struct vm_area_struct *vma)
> +{
> + init_rwsem(&vma->vm_lock.lock);
> + vma->vm_lock_seq = UINT_MAX;
> +}
> +
> /*
> * Try to read-lock a vma. The function is allowed to occasionally yield false
> * locked result to avoid performance overhead, in which case we fall back to
> @@ -714,7 +720,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq.sequence))
> return false;
>
> - if (unlikely(down_read_trylock(&vma->vm_lock->lock) == 0))
> + if (unlikely(down_read_trylock(&vma->vm_lock.lock) == 0))
> return false;
>
> /*
> @@ -729,7 +735,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> * This pairs with RELEASE semantics in vma_end_write_all().
> */
> if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&vma->vm_mm->mm_lock_seq))) {
> - up_read(&vma->vm_lock->lock);
> + up_read(&vma->vm_lock.lock);
> return false;
> }
> return true;
> @@ -744,7 +750,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> static inline void vma_start_read_locked_nested(struct vm_area_struct *vma, int subclass)
> {
> mmap_assert_locked(vma->vm_mm);
> - down_read_nested(&vma->vm_lock->lock, subclass);
> + down_read_nested(&vma->vm_lock.lock, subclass);
> }
>
> /*
> @@ -756,13 +762,13 @@ static inline void vma_start_read_locked_nested(struct vm_area_struct *vma, int
> static inline void vma_start_read_locked(struct vm_area_struct *vma)
> {
> mmap_assert_locked(vma->vm_mm);
> - down_read(&vma->vm_lock->lock);
> + down_read(&vma->vm_lock.lock);
> }
>
> static inline void vma_end_read(struct vm_area_struct *vma)
> {
> rcu_read_lock(); /* keeps vma alive till the end of up_read */
> - up_read(&vma->vm_lock->lock);
> + up_read(&vma->vm_lock.lock);
> rcu_read_unlock();
> }
>
> @@ -791,7 +797,7 @@ static inline void vma_start_write(struct vm_area_struct *vma)
> if (__is_vma_write_locked(vma, &mm_lock_seq))
> return;
>
> - down_write(&vma->vm_lock->lock);
> + down_write(&vma->vm_lock.lock);
> /*
> * We should use WRITE_ONCE() here because we can have concurrent reads
> * from the early lockless pessimistic check in vma_start_read().
> @@ -799,7 +805,7 @@ static inline void vma_start_write(struct vm_area_struct *vma)
> * we should use WRITE_ONCE() for cleanliness and to keep KCSAN happy.
> */
> WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
> - up_write(&vma->vm_lock->lock);
> + up_write(&vma->vm_lock.lock);
> }
>
> static inline void vma_assert_write_locked(struct vm_area_struct *vma)
> @@ -811,7 +817,7 @@ static inline void vma_assert_write_locked(struct vm_area_struct *vma)
>
> static inline void vma_assert_locked(struct vm_area_struct *vma)
> {
> - if (!rwsem_is_locked(&vma->vm_lock->lock))
> + if (!rwsem_is_locked(&vma->vm_lock.lock))
> vma_assert_write_locked(vma);
> }
>
> @@ -844,6 +850,7 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
>
> #else /* CONFIG_PER_VMA_LOCK */
>
> +static inline void vma_lock_init(struct vm_area_struct *vma) {}
> static inline bool vma_start_read(struct vm_area_struct *vma)
> { return false; }
> static inline void vma_end_read(struct vm_area_struct *vma) {}
> @@ -878,10 +885,6 @@ static inline void assert_fault_locked(struct vm_fault *vmf)
>
> extern const struct vm_operations_struct vma_dummy_vm_ops;
>
> -/*
> - * WARNING: vma_init does not initialize vma->vm_lock.
> - * Use vm_area_alloc()/vm_area_free() if vma needs locking.
> - */
> static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
> {
> memset(vma, 0, sizeof(*vma));
> @@ -890,6 +893,7 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
> INIT_LIST_HEAD(&vma->anon_vma_chain);
> vma_mark_detached(vma, false);
> vma_numab_state_init(vma);
> + vma_lock_init(vma);
> }
>
> /* Use when VMA is not part of the VMA tree and needs no locking */
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 5f1b2dc788e2..6573d95f1d1e 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -730,8 +730,6 @@ struct vm_area_struct {
> * slowpath.
> */
> unsigned int vm_lock_seq;
> - /* Unstable RCU readers are allowed to read this. */
> - struct vma_lock *vm_lock;
> #endif
>
> /*
> @@ -784,6 +782,10 @@ struct vm_area_struct {
> struct vma_numab_state *numab_state; /* NUMA Balancing state */
> #endif
> struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
> +#ifdef CONFIG_PER_VMA_LOCK
> + /* Unstable RCU readers are allowed to read this. */
> + struct vma_lock vm_lock ____cacheline_aligned_in_smp;
> +#endif
> } __randomize_layout;
>
> #ifdef CONFIG_NUMA
> diff --git a/kernel/fork.c b/kernel/fork.c
> index ded49f18cd95..40a8e615499f 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -436,35 +436,6 @@ static struct kmem_cache *vm_area_cachep;
> /* SLAB cache for mm_struct structures (tsk->mm) */
> static struct kmem_cache *mm_cachep;
>
> -#ifdef CONFIG_PER_VMA_LOCK
> -
> -/* SLAB cache for vm_area_struct.lock */
> -static struct kmem_cache *vma_lock_cachep;
> -
> -static bool vma_lock_alloc(struct vm_area_struct *vma)
> -{
> - vma->vm_lock = kmem_cache_alloc(vma_lock_cachep, GFP_KERNEL);
> - if (!vma->vm_lock)
> - return false;
> -
> - init_rwsem(&vma->vm_lock->lock);
> - vma->vm_lock_seq = UINT_MAX;
> -
> - return true;
> -}
> -
> -static inline void vma_lock_free(struct vm_area_struct *vma)
> -{
> - kmem_cache_free(vma_lock_cachep, vma->vm_lock);
> -}
> -
> -#else /* CONFIG_PER_VMA_LOCK */
> -
> -static inline bool vma_lock_alloc(struct vm_area_struct *vma) { return true; }
> -static inline void vma_lock_free(struct vm_area_struct *vma) {}
> -
> -#endif /* CONFIG_PER_VMA_LOCK */
> -
> struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
> {
> struct vm_area_struct *vma;
> @@ -474,10 +445,6 @@ struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
> return NULL;
>
> vma_init(vma, mm);
> - if (!vma_lock_alloc(vma)) {
> - kmem_cache_free(vm_area_cachep, vma);
> - return NULL;
> - }
>
> return vma;
> }
> @@ -496,10 +463,7 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
> * will be reinitialized.
> */
> data_race(memcpy(new, orig, sizeof(*new)));
> - if (!vma_lock_alloc(new)) {
> - kmem_cache_free(vm_area_cachep, new);
> - return NULL;
> - }
> + vma_lock_init(new);
> INIT_LIST_HEAD(&new->anon_vma_chain);
> vma_numab_state_init(new);
> dup_anon_vma_name(orig, new);
> @@ -511,7 +475,6 @@ void __vm_area_free(struct vm_area_struct *vma)
> {
> vma_numab_state_free(vma);
> free_anon_vma_name(vma);
> - vma_lock_free(vma);
> kmem_cache_free(vm_area_cachep, vma);
> }
>
> @@ -522,7 +485,7 @@ static void vm_area_free_rcu_cb(struct rcu_head *head)
> vm_rcu);
>
> /* The vma should not be locked while being destroyed. */
> - VM_BUG_ON_VMA(rwsem_is_locked(&vma->vm_lock->lock), vma);
> + VM_BUG_ON_VMA(rwsem_is_locked(&vma->vm_lock.lock), vma);
> __vm_area_free(vma);
> }
> #endif
> @@ -3188,11 +3151,9 @@ void __init proc_caches_init(void)
> sizeof(struct fs_struct), 0,
> SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT,
> NULL);
> -
> - vm_area_cachep = KMEM_CACHE(vm_area_struct, SLAB_PANIC|SLAB_ACCOUNT);
> -#ifdef CONFIG_PER_VMA_LOCK
> - vma_lock_cachep = KMEM_CACHE(vma_lock, SLAB_PANIC|SLAB_ACCOUNT);
> -#endif
> + vm_area_cachep = KMEM_CACHE(vm_area_struct,
> + SLAB_HWCACHE_ALIGN|SLAB_NO_MERGE|SLAB_PANIC|
> + SLAB_ACCOUNT);
> mmap_init();
> nsproxy_cache_init();
> }
> diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
> index ae635eecbfa8..d19ce6fcab83 100644
> --- a/tools/testing/vma/vma_internal.h
> +++ b/tools/testing/vma/vma_internal.h
> @@ -270,10 +270,10 @@ struct vm_area_struct {
> /*
> * Can only be written (using WRITE_ONCE()) while holding both:
> * - mmap_lock (in write mode)
> - * - vm_lock->lock (in write mode)
> + * - vm_lock.lock (in write mode)
> * Can be read reliably while holding one of:
> * - mmap_lock (in read or write mode)
> - * - vm_lock->lock (in read or write mode)
> + * - vm_lock.lock (in read or write mode)
> * Can be read unreliably (using READ_ONCE()) for pessimistic bailout
> * while holding nothing (except RCU to keep the VMA struct allocated).
> *
> @@ -282,7 +282,7 @@ struct vm_area_struct {
> * slowpath.
> */
> unsigned int vm_lock_seq;
> - struct vma_lock *vm_lock;
> + struct vma_lock vm_lock;
> #endif
>
> /*
> @@ -459,17 +459,10 @@ static inline struct vm_area_struct *vma_next(struct vma_iterator *vmi)
> return mas_find(&vmi->mas, ULONG_MAX);
> }
>
> -static inline bool vma_lock_alloc(struct vm_area_struct *vma)
> +static inline void vma_lock_init(struct vm_area_struct *vma)
> {
> - vma->vm_lock = calloc(1, sizeof(struct vma_lock));
> -
> - if (!vma->vm_lock)
> - return false;
> -
> - init_rwsem(&vma->vm_lock->lock);
> + init_rwsem(&vma->vm_lock.lock);
> vma->vm_lock_seq = UINT_MAX;
> -
> - return true;
> }
>
> static inline void vma_assert_write_locked(struct vm_area_struct *);
> @@ -492,6 +485,7 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
> vma->vm_ops = &vma_dummy_vm_ops;
> INIT_LIST_HEAD(&vma->anon_vma_chain);
> vma_mark_detached(vma, false);
> + vma_lock_init(vma);
> }
>
> static inline struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
> @@ -502,10 +496,6 @@ static inline struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
> return NULL;
>
> vma_init(vma, mm);
> - if (!vma_lock_alloc(vma)) {
> - free(vma);
> - return NULL;
> - }
>
> return vma;
> }
> @@ -518,10 +508,7 @@ static inline struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
> return NULL;
>
> memcpy(new, orig, sizeof(*new));
> - if (!vma_lock_alloc(new)) {
> - free(new);
> - return NULL;
> - }
> + vma_lock_init(new);
> INIT_LIST_HEAD(&new->anon_vma_chain);
>
> return new;
> @@ -691,14 +678,8 @@ static inline void mpol_put(struct mempolicy *)
> {
> }
>
> -static inline void vma_lock_free(struct vm_area_struct *vma)
> -{
> - free(vma->vm_lock);
> -}
> -
> static inline void __vm_area_free(struct vm_area_struct *vma)
> {
> - vma_lock_free(vma);
> free(vma);
> }
>
> --
> 2.47.1.613.gc27f4b7a9f-goog
>
* [PATCH v7 03/17] mm: mark vma as detached until it's added into vma tree
2024-12-26 17:06 [PATCH v7 00/17] move per-vma lock into vm_area_struct Suren Baghdasaryan
2024-12-26 17:06 ` [PATCH v7 01/17] mm: introduce vma_start_read_locked{_nested} helpers Suren Baghdasaryan
2024-12-26 17:06 ` [PATCH v7 02/17] mm: move per-vma lock into vm_area_struct Suren Baghdasaryan
@ 2024-12-26 17:06 ` Suren Baghdasaryan
2025-01-08 15:01 ` Liam R. Howlett
2024-12-26 17:06 ` [PATCH v7 04/17] mm: modify vma_iter_store{_gfp} to indicate if it's storing a new vma Suren Baghdasaryan
` (13 subsequent siblings)
16 siblings, 1 reply; 65+ messages in thread
From: Suren Baghdasaryan @ 2024-12-26 17:06 UTC (permalink / raw)
To: akpm
Cc: peterz, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team, surenb
The current implementation does not set the detached flag when a VMA is
first allocated. This does not represent the real state of the VMA, which
is detached until it is added into the mm's VMA tree. Fix this by marking
new VMAs as detached and resetting the detached flag only after the VMA is
added into the tree.
Introduce vma_mark_attached() to make the API more readable and to
simplify possible future cleanup when vma->vm_mm might be used to indicate
detached vma and vma_mark_attached() will need an additional mm parameter.
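A rough sketch of the resulting lifecycle (illustrative only; vmi and the
surrounding code are assumed, this is not a literal excerpt from the patch):

        vma = vm_area_alloc(mm);        /* new vma starts with vma->detached == true */

        vma_start_write(vma);
        vma_iter_store(&vmi, vma);      /* storing into the tree marks it attached */

        /* ... */

        vma_start_write(vma);
        vma_mark_detached(vma);         /* detaching requires the vma write lock */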
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
---
include/linux/mm.h | 27 ++++++++++++++++++++-------
kernel/fork.c | 4 ++++
mm/memory.c | 2 +-
mm/vma.c | 6 +++---
mm/vma.h | 2 ++
tools/testing/vma/vma_internal.h | 17 ++++++++++++-----
6 files changed, 42 insertions(+), 16 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index f3f92ba8f5fe..081178b0eec4 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -821,12 +821,21 @@ static inline void vma_assert_locked(struct vm_area_struct *vma)
vma_assert_write_locked(vma);
}
-static inline void vma_mark_detached(struct vm_area_struct *vma, bool detached)
+static inline void vma_mark_attached(struct vm_area_struct *vma)
+{
+ vma->detached = false;
+}
+
+static inline void vma_mark_detached(struct vm_area_struct *vma)
{
/* When detaching vma should be write-locked */
- if (detached)
- vma_assert_write_locked(vma);
- vma->detached = detached;
+ vma_assert_write_locked(vma);
+ vma->detached = true;
+}
+
+static inline bool is_vma_detached(struct vm_area_struct *vma)
+{
+ return vma->detached;
}
static inline void release_fault_lock(struct vm_fault *vmf)
@@ -857,8 +866,8 @@ static inline void vma_end_read(struct vm_area_struct *vma) {}
static inline void vma_start_write(struct vm_area_struct *vma) {}
static inline void vma_assert_write_locked(struct vm_area_struct *vma)
{ mmap_assert_write_locked(vma->vm_mm); }
-static inline void vma_mark_detached(struct vm_area_struct *vma,
- bool detached) {}
+static inline void vma_mark_attached(struct vm_area_struct *vma) {}
+static inline void vma_mark_detached(struct vm_area_struct *vma) {}
static inline struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
unsigned long address)
@@ -891,7 +900,10 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
vma->vm_mm = mm;
vma->vm_ops = &vma_dummy_vm_ops;
INIT_LIST_HEAD(&vma->anon_vma_chain);
- vma_mark_detached(vma, false);
+#ifdef CONFIG_PER_VMA_LOCK
+ /* vma is not locked, can't use vma_mark_detached() */
+ vma->detached = true;
+#endif
vma_numab_state_init(vma);
vma_lock_init(vma);
}
@@ -1086,6 +1098,7 @@ static inline int vma_iter_bulk_store(struct vma_iterator *vmi,
if (unlikely(mas_is_err(&vmi->mas)))
return -ENOMEM;
+ vma_mark_attached(vma);
return 0;
}
diff --git a/kernel/fork.c b/kernel/fork.c
index 40a8e615499f..f2f9e7b427ad 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -465,6 +465,10 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
data_race(memcpy(new, orig, sizeof(*new)));
vma_lock_init(new);
INIT_LIST_HEAD(&new->anon_vma_chain);
+#ifdef CONFIG_PER_VMA_LOCK
+ /* vma is not locked, can't use vma_mark_detached() */
+ new->detached = true;
+#endif
vma_numab_state_init(new);
dup_anon_vma_name(orig, new);
diff --git a/mm/memory.c b/mm/memory.c
index 2a20e3810534..d0dee2282325 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6349,7 +6349,7 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
goto inval;
/* Check if the VMA got isolated after we found it */
- if (vma->detached) {
+ if (is_vma_detached(vma)) {
vma_end_read(vma);
count_vm_vma_lock_event(VMA_LOCK_MISS);
/* The area was replaced with another one */
diff --git a/mm/vma.c b/mm/vma.c
index 0caaeea899a9..476146c25283 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -327,7 +327,7 @@ static void vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi,
if (vp->remove) {
again:
- vma_mark_detached(vp->remove, true);
+ vma_mark_detached(vp->remove);
if (vp->file) {
uprobe_munmap(vp->remove, vp->remove->vm_start,
vp->remove->vm_end);
@@ -1220,7 +1220,7 @@ static void reattach_vmas(struct ma_state *mas_detach)
mas_set(mas_detach, 0);
mas_for_each(mas_detach, vma, ULONG_MAX)
- vma_mark_detached(vma, false);
+ vma_mark_attached(vma);
__mt_destroy(mas_detach->tree);
}
@@ -1295,7 +1295,7 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
if (error)
goto munmap_gather_failed;
- vma_mark_detached(next, true);
+ vma_mark_detached(next);
nrpages = vma_pages(next);
vms->nr_pages += nrpages;
diff --git a/mm/vma.h b/mm/vma.h
index 61ed044b6145..24636a2b0acf 100644
--- a/mm/vma.h
+++ b/mm/vma.h
@@ -157,6 +157,7 @@ static inline int vma_iter_store_gfp(struct vma_iterator *vmi,
if (unlikely(mas_is_err(&vmi->mas)))
return -ENOMEM;
+ vma_mark_attached(vma);
return 0;
}
@@ -389,6 +390,7 @@ static inline void vma_iter_store(struct vma_iterator *vmi,
__mas_set_range(&vmi->mas, vma->vm_start, vma->vm_end - 1);
mas_store_prealloc(&vmi->mas, vma);
+ vma_mark_attached(vma);
}
static inline unsigned long vma_iter_addr(struct vma_iterator *vmi)
diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
index d19ce6fcab83..2a624f9304da 100644
--- a/tools/testing/vma/vma_internal.h
+++ b/tools/testing/vma/vma_internal.h
@@ -465,13 +465,17 @@ static inline void vma_lock_init(struct vm_area_struct *vma)
vma->vm_lock_seq = UINT_MAX;
}
+static inline void vma_mark_attached(struct vm_area_struct *vma)
+{
+ vma->detached = false;
+}
+
static inline void vma_assert_write_locked(struct vm_area_struct *);
-static inline void vma_mark_detached(struct vm_area_struct *vma, bool detached)
+static inline void vma_mark_detached(struct vm_area_struct *vma)
{
/* When detaching vma should be write-locked */
- if (detached)
- vma_assert_write_locked(vma);
- vma->detached = detached;
+ vma_assert_write_locked(vma);
+ vma->detached = true;
}
extern const struct vm_operations_struct vma_dummy_vm_ops;
@@ -484,7 +488,8 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
vma->vm_mm = mm;
vma->vm_ops = &vma_dummy_vm_ops;
INIT_LIST_HEAD(&vma->anon_vma_chain);
- vma_mark_detached(vma, false);
+ /* vma is not locked, can't use vma_mark_detached() */
+ vma->detached = true;
vma_lock_init(vma);
}
@@ -510,6 +515,8 @@ static inline struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
memcpy(new, orig, sizeof(*new));
vma_lock_init(new);
INIT_LIST_HEAD(&new->anon_vma_chain);
+ /* vma is not locked, can't use vma_mark_detached() */
+ new->detached = true;
return new;
}
--
2.47.1.613.gc27f4b7a9f-goog
* Re: [PATCH v7 03/17] mm: mark vma as detached until it's added into vma tree
2024-12-26 17:06 ` [PATCH v7 03/17] mm: mark vma as detached until it's added into vma tree Suren Baghdasaryan
@ 2025-01-08 15:01 ` Liam R. Howlett
0 siblings, 0 replies; 65+ messages in thread
From: Liam R. Howlett @ 2025-01-08 15:01 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: akpm, peterz, willy, lorenzo.stoakes, mhocko, vbabka, hannes,
mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team
* Suren Baghdasaryan <surenb@google.com> [241226 12:07]:
> The current implementation does not set the detached flag when a VMA is
> first allocated. This does not represent the real state of the VMA, which
> is detached until it is added into the mm's VMA tree. Fix this by marking
> new VMAs as detached and resetting the detached flag only after the VMA is
> added into the tree.
>
> Introduce vma_mark_attached() to make the API more readable and to
> simplify possible future cleanup when vma->vm_mm might be used to indicate
> detached vma and vma_mark_attached() will need an additional mm parameter.
>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> ---
> include/linux/mm.h | 27 ++++++++++++++++++++-------
> kernel/fork.c | 4 ++++
> mm/memory.c | 2 +-
> mm/vma.c | 6 +++---
> mm/vma.h | 2 ++
> tools/testing/vma/vma_internal.h | 17 ++++++++++++-----
> 6 files changed, 42 insertions(+), 16 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index f3f92ba8f5fe..081178b0eec4 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -821,12 +821,21 @@ static inline void vma_assert_locked(struct vm_area_struct *vma)
> vma_assert_write_locked(vma);
> }
>
> -static inline void vma_mark_detached(struct vm_area_struct *vma, bool detached)
> +static inline void vma_mark_attached(struct vm_area_struct *vma)
> +{
> + vma->detached = false;
> +}
> +
> +static inline void vma_mark_detached(struct vm_area_struct *vma)
> {
> /* When detaching vma should be write-locked */
> - if (detached)
> - vma_assert_write_locked(vma);
> - vma->detached = detached;
> + vma_assert_write_locked(vma);
> + vma->detached = true;
> +}
> +
> +static inline bool is_vma_detached(struct vm_area_struct *vma)
> +{
> + return vma->detached;
> }
>
> static inline void release_fault_lock(struct vm_fault *vmf)
> @@ -857,8 +866,8 @@ static inline void vma_end_read(struct vm_area_struct *vma) {}
> static inline void vma_start_write(struct vm_area_struct *vma) {}
> static inline void vma_assert_write_locked(struct vm_area_struct *vma)
> { mmap_assert_write_locked(vma->vm_mm); }
> -static inline void vma_mark_detached(struct vm_area_struct *vma,
> - bool detached) {}
> +static inline void vma_mark_attached(struct vm_area_struct *vma) {}
> +static inline void vma_mark_detached(struct vm_area_struct *vma) {}
>
> static inline struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> unsigned long address)
> @@ -891,7 +900,10 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
> vma->vm_mm = mm;
> vma->vm_ops = &vma_dummy_vm_ops;
> INIT_LIST_HEAD(&vma->anon_vma_chain);
> - vma_mark_detached(vma, false);
> +#ifdef CONFIG_PER_VMA_LOCK
> + /* vma is not locked, can't use vma_mark_detached() */
> + vma->detached = true;
> +#endif
> vma_numab_state_init(vma);
> vma_lock_init(vma);
> }
> @@ -1086,6 +1098,7 @@ static inline int vma_iter_bulk_store(struct vma_iterator *vmi,
> if (unlikely(mas_is_err(&vmi->mas)))
> return -ENOMEM;
>
> + vma_mark_attached(vma);
> return 0;
> }
>
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 40a8e615499f..f2f9e7b427ad 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -465,6 +465,10 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
> data_race(memcpy(new, orig, sizeof(*new)));
> vma_lock_init(new);
> INIT_LIST_HEAD(&new->anon_vma_chain);
> +#ifdef CONFIG_PER_VMA_LOCK
> + /* vma is not locked, can't use vma_mark_detached() */
> + new->detached = true;
> +#endif
> vma_numab_state_init(new);
> dup_anon_vma_name(orig, new);
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 2a20e3810534..d0dee2282325 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -6349,7 +6349,7 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> goto inval;
>
> /* Check if the VMA got isolated after we found it */
> - if (vma->detached) {
> + if (is_vma_detached(vma)) {
> vma_end_read(vma);
> count_vm_vma_lock_event(VMA_LOCK_MISS);
> /* The area was replaced with another one */
> diff --git a/mm/vma.c b/mm/vma.c
> index 0caaeea899a9..476146c25283 100644
> --- a/mm/vma.c
> +++ b/mm/vma.c
> @@ -327,7 +327,7 @@ static void vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi,
>
> if (vp->remove) {
> again:
> - vma_mark_detached(vp->remove, true);
> + vma_mark_detached(vp->remove);
> if (vp->file) {
> uprobe_munmap(vp->remove, vp->remove->vm_start,
> vp->remove->vm_end);
> @@ -1220,7 +1220,7 @@ static void reattach_vmas(struct ma_state *mas_detach)
>
> mas_set(mas_detach, 0);
> mas_for_each(mas_detach, vma, ULONG_MAX)
> - vma_mark_detached(vma, false);
> + vma_mark_attached(vma);
>
> __mt_destroy(mas_detach->tree);
> }
> @@ -1295,7 +1295,7 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
> if (error)
> goto munmap_gather_failed;
>
> - vma_mark_detached(next, true);
> + vma_mark_detached(next);
> nrpages = vma_pages(next);
>
> vms->nr_pages += nrpages;
> diff --git a/mm/vma.h b/mm/vma.h
> index 61ed044b6145..24636a2b0acf 100644
> --- a/mm/vma.h
> +++ b/mm/vma.h
> @@ -157,6 +157,7 @@ static inline int vma_iter_store_gfp(struct vma_iterator *vmi,
> if (unlikely(mas_is_err(&vmi->mas)))
> return -ENOMEM;
>
> + vma_mark_attached(vma);
> return 0;
> }
>
> @@ -389,6 +390,7 @@ static inline void vma_iter_store(struct vma_iterator *vmi,
>
> __mas_set_range(&vmi->mas, vma->vm_start, vma->vm_end - 1);
> mas_store_prealloc(&vmi->mas, vma);
> + vma_mark_attached(vma);
> }
>
> static inline unsigned long vma_iter_addr(struct vma_iterator *vmi)
> diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
> index d19ce6fcab83..2a624f9304da 100644
> --- a/tools/testing/vma/vma_internal.h
> +++ b/tools/testing/vma/vma_internal.h
> @@ -465,13 +465,17 @@ static inline void vma_lock_init(struct vm_area_struct *vma)
> vma->vm_lock_seq = UINT_MAX;
> }
>
> +static inline void vma_mark_attached(struct vm_area_struct *vma)
> +{
> + vma->detached = false;
> +}
> +
> static inline void vma_assert_write_locked(struct vm_area_struct *);
> -static inline void vma_mark_detached(struct vm_area_struct *vma, bool detached)
> +static inline void vma_mark_detached(struct vm_area_struct *vma)
> {
> /* When detaching vma should be write-locked */
> - if (detached)
> - vma_assert_write_locked(vma);
> - vma->detached = detached;
> + vma_assert_write_locked(vma);
> + vma->detached = true;
> }
>
> extern const struct vm_operations_struct vma_dummy_vm_ops;
> @@ -484,7 +488,8 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
> vma->vm_mm = mm;
> vma->vm_ops = &vma_dummy_vm_ops;
> INIT_LIST_HEAD(&vma->anon_vma_chain);
> - vma_mark_detached(vma, false);
> + /* vma is not locked, can't use vma_mark_detached() */
> + vma->detached = true;
> vma_lock_init(vma);
> }
>
> @@ -510,6 +515,8 @@ static inline struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
> memcpy(new, orig, sizeof(*new));
> vma_lock_init(new);
> INIT_LIST_HEAD(&new->anon_vma_chain);
> + /* vma is not locked, can't use vma_mark_detached() */
> + new->detached = true;
Annoying that this is done in different locations, but that's a cleanup
for another time.
>
> return new;
> }
> --
> 2.47.1.613.gc27f4b7a9f-goog
>
* [PATCH v7 04/17] mm: modify vma_iter_store{_gfp} to indicate if it's storing a new vma
2024-12-26 17:06 [PATCH v7 00/17] move per-vma lock into vm_area_struct Suren Baghdasaryan
` (2 preceding siblings ...)
2024-12-26 17:06 ` [PATCH v7 03/17] mm: mark vma as detached until it's added into vma tree Suren Baghdasaryan
@ 2024-12-26 17:06 ` Suren Baghdasaryan
2025-01-07 16:48 ` Vlastimil Babka
2025-01-07 16:49 ` Liam R. Howlett
2024-12-26 17:06 ` [PATCH v7 05/17] mm: mark vmas detached upon exit Suren Baghdasaryan
` (12 subsequent siblings)
16 siblings, 2 replies; 65+ messages in thread
From: Suren Baghdasaryan @ 2024-12-26 17:06 UTC (permalink / raw)
To: akpm
Cc: peterz, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team, surenb
The vma_iter_store() functions can be used both when adding a new vma and
when updating an existing one. However, for existing ones we do not need
to mark them attached, as they are already marked that way. Add a parameter
to distinguish the two usages and skip vma_mark_attached() when not needed.
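An illustrative sketch of the two call patterns after this change
(simplified from the hunks below; vmi, vma and address are assumed):

        /* inserting a newly allocated vma: mark it attached */
        vma_start_write(vma);
        vma_iter_store(&vmi, vma, true);

        /* resizing a vma that is already in the tree: no re-marking needed */
        vma->vm_end = address;
        vma_iter_store(&vmi, vma, false);       /* vma_assert_attached() still checks it */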
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
include/linux/mm.h | 12 ++++++++++++
mm/nommu.c | 4 ++--
mm/vma.c | 16 ++++++++--------
mm/vma.h | 13 +++++++++----
4 files changed, 31 insertions(+), 14 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 081178b0eec4..c50edfedd99d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -821,6 +821,16 @@ static inline void vma_assert_locked(struct vm_area_struct *vma)
vma_assert_write_locked(vma);
}
+static inline void vma_assert_attached(struct vm_area_struct *vma)
+{
+ VM_BUG_ON_VMA(vma->detached, vma);
+}
+
+static inline void vma_assert_detached(struct vm_area_struct *vma)
+{
+ VM_BUG_ON_VMA(!vma->detached, vma);
+}
+
static inline void vma_mark_attached(struct vm_area_struct *vma)
{
vma->detached = false;
@@ -866,6 +876,8 @@ static inline void vma_end_read(struct vm_area_struct *vma) {}
static inline void vma_start_write(struct vm_area_struct *vma) {}
static inline void vma_assert_write_locked(struct vm_area_struct *vma)
{ mmap_assert_write_locked(vma->vm_mm); }
+static inline void vma_assert_attached(struct vm_area_struct *vma) {}
+static inline void vma_assert_detached(struct vm_area_struct *vma) {}
static inline void vma_mark_attached(struct vm_area_struct *vma) {}
static inline void vma_mark_detached(struct vm_area_struct *vma) {}
diff --git a/mm/nommu.c b/mm/nommu.c
index 9cb6e99215e2..72c8c505836c 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1191,7 +1191,7 @@ unsigned long do_mmap(struct file *file,
setup_vma_to_mm(vma, current->mm);
current->mm->map_count++;
/* add the VMA to the tree */
- vma_iter_store(&vmi, vma);
+ vma_iter_store(&vmi, vma, true);
/* we flush the region from the icache only when the first executable
* mapping of it is made */
@@ -1356,7 +1356,7 @@ static int split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
setup_vma_to_mm(vma, mm);
setup_vma_to_mm(new, mm);
- vma_iter_store(vmi, new);
+ vma_iter_store(vmi, new, true);
mm->map_count++;
return 0;
diff --git a/mm/vma.c b/mm/vma.c
index 476146c25283..ce113dd8c471 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -306,7 +306,7 @@ static void vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi,
* us to insert it before dropping the locks
* (it may either follow vma or precede it).
*/
- vma_iter_store(vmi, vp->insert);
+ vma_iter_store(vmi, vp->insert, true);
mm->map_count++;
}
@@ -660,14 +660,14 @@ static int commit_merge(struct vma_merge_struct *vmg,
vma_set_range(vmg->vma, vmg->start, vmg->end, vmg->pgoff);
if (expanded)
- vma_iter_store(vmg->vmi, vmg->vma);
+ vma_iter_store(vmg->vmi, vmg->vma, false);
if (adj_start) {
adjust->vm_start += adj_start;
adjust->vm_pgoff += PHYS_PFN(adj_start);
if (adj_start < 0) {
WARN_ON(expanded);
- vma_iter_store(vmg->vmi, adjust);
+ vma_iter_store(vmg->vmi, adjust, false);
}
}
@@ -1689,7 +1689,7 @@ int vma_link(struct mm_struct *mm, struct vm_area_struct *vma)
return -ENOMEM;
vma_start_write(vma);
- vma_iter_store(&vmi, vma);
+ vma_iter_store(&vmi, vma, true);
vma_link_file(vma);
mm->map_count++;
validate_mm(mm);
@@ -2368,7 +2368,7 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap)
/* Lock the VMA since it is modified after insertion into VMA tree */
vma_start_write(vma);
- vma_iter_store(vmi, vma);
+ vma_iter_store(vmi, vma, true);
map->mm->map_count++;
vma_link_file(vma);
@@ -2542,7 +2542,7 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
vm_flags_init(vma, flags);
vma->vm_page_prot = vm_get_page_prot(flags);
vma_start_write(vma);
- if (vma_iter_store_gfp(vmi, vma, GFP_KERNEL))
+ if (vma_iter_store_gfp(vmi, vma, GFP_KERNEL, true))
goto mas_store_fail;
mm->map_count++;
@@ -2785,7 +2785,7 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
anon_vma_interval_tree_pre_update_vma(vma);
vma->vm_end = address;
/* Overwrite old entry in mtree. */
- vma_iter_store(&vmi, vma);
+ vma_iter_store(&vmi, vma, false);
anon_vma_interval_tree_post_update_vma(vma);
perf_event_mmap(vma);
@@ -2865,7 +2865,7 @@ int expand_downwards(struct vm_area_struct *vma, unsigned long address)
vma->vm_start = address;
vma->vm_pgoff -= grow;
/* Overwrite old entry in mtree. */
- vma_iter_store(&vmi, vma);
+ vma_iter_store(&vmi, vma, false);
anon_vma_interval_tree_post_update_vma(vma);
perf_event_mmap(vma);
diff --git a/mm/vma.h b/mm/vma.h
index 24636a2b0acf..18c9e49b1eae 100644
--- a/mm/vma.h
+++ b/mm/vma.h
@@ -145,7 +145,7 @@ __must_check int vma_shrink(struct vma_iterator *vmi,
unsigned long start, unsigned long end, pgoff_t pgoff);
static inline int vma_iter_store_gfp(struct vma_iterator *vmi,
- struct vm_area_struct *vma, gfp_t gfp)
+ struct vm_area_struct *vma, gfp_t gfp, bool new_vma)
{
if (vmi->mas.status != ma_start &&
@@ -157,7 +157,10 @@ static inline int vma_iter_store_gfp(struct vma_iterator *vmi,
if (unlikely(mas_is_err(&vmi->mas)))
return -ENOMEM;
- vma_mark_attached(vma);
+ if (new_vma)
+ vma_mark_attached(vma);
+ vma_assert_attached(vma);
+
return 0;
}
@@ -366,7 +369,7 @@ static inline struct vm_area_struct *vma_iter_load(struct vma_iterator *vmi)
/* Store a VMA with preallocated memory */
static inline void vma_iter_store(struct vma_iterator *vmi,
- struct vm_area_struct *vma)
+ struct vm_area_struct *vma, bool new_vma)
{
#if defined(CONFIG_DEBUG_VM_MAPLE_TREE)
@@ -390,7 +393,9 @@ static inline void vma_iter_store(struct vma_iterator *vmi,
__mas_set_range(&vmi->mas, vma->vm_start, vma->vm_end - 1);
mas_store_prealloc(&vmi->mas, vma);
- vma_mark_attached(vma);
+ if (new_vma)
+ vma_mark_attached(vma);
+ vma_assert_attached(vma);
}
static inline unsigned long vma_iter_addr(struct vma_iterator *vmi)
--
2.47.1.613.gc27f4b7a9f-goog
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v7 04/17] mm: modify vma_iter_store{_gfp} to indicate if it's storing a new vma
2024-12-26 17:06 ` [PATCH v7 04/17] mm: modify vma_iter_store{_gfp} to indicate if it's storing a new vma Suren Baghdasaryan
@ 2025-01-07 16:48 ` Vlastimil Babka
2025-01-07 16:49 ` Liam R. Howlett
1 sibling, 0 replies; 65+ messages in thread
From: Vlastimil Babka @ 2025-01-07 16:48 UTC (permalink / raw)
To: Suren Baghdasaryan, akpm
Cc: peterz, willy, liam.howlett, lorenzo.stoakes, mhocko, hannes,
mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team
On 12/26/24 18:06, Suren Baghdasaryan wrote:
> vma_iter_store() functions can be used both when adding a new vma and
> when updating an existing one. However for existing ones we do not need
> to mark them attached as they are already marked that way. Add a parameter
> to distinguish the usage and skip vma_mark_attached() when not needed.
>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Seems like overkill? Looks like the common case is the "true" case, so add a
variant for the false case? Also, it seems no _store_gfp caller uses false, so
that doesn't need to exist anyway?
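For illustration, the last point would amount to keeping vma_iter_store_gfp()
flag-free, roughly like this (a sketch only, with the unchanged maple tree
store elided, not a tested respin):

static inline int vma_iter_store_gfp(struct vma_iterator *vmi,
		struct vm_area_struct *vma, gfp_t gfp)
{
	/* ... range check and maple tree store as in the current code ... */
	if (unlikely(mas_is_err(&vmi->mas)))
		return -ENOMEM;

	/* every caller inserts a new vma here, so always mark it attached */
	vma_mark_attached(vma);
	return 0;
}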
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v7 04/17] mm: modify vma_iter_store{_gfp} to indicate if it's storing a new vma
2024-12-26 17:06 ` [PATCH v7 04/17] mm: modify vma_iter_store{_gfp} to indicate if it's storing a new vma Suren Baghdasaryan
2025-01-07 16:48 ` Vlastimil Babka
@ 2025-01-07 16:49 ` Liam R. Howlett
2025-01-07 17:12 ` Suren Baghdasaryan
1 sibling, 1 reply; 65+ messages in thread
From: Liam R. Howlett @ 2025-01-07 16:49 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: akpm, peterz, willy, lorenzo.stoakes, mhocko, vbabka, hannes,
mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team
* Suren Baghdasaryan <surenb@google.com> [241226 12:07]:
> vma_iter_store() functions can be used both when adding a new vma and
> when updating an existing one. However for existing ones we do not need
> to mark them attached as they are already marked that way. Add a parameter
> to distinguish the usage and skip vma_mark_attached() when not needed.
I really don't like boolean flags - especially to such a small function.
The passing of flags complicates things and is not self documenting. Can
we make a new vma_iter_store_detach() that just calls vma_iter_store()
then does the detach?
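One possible shape of a flag-free split, using only helpers already present in
this patch (the names are hypothetical and the debug checks from the current
vma_iter_store() are omitted; this is a sketch, not the actual respin):

/* overwrite an entry that is already in the tree and attached */
static inline void vma_iter_store_overwrite(struct vma_iterator *vmi,
					    struct vm_area_struct *vma)
{
	vma_assert_attached(vma);
	__mas_set_range(&vmi->mas, vma->vm_start, vma->vm_end - 1);
	mas_store_prealloc(&vmi->mas, vma);
}

/* insert a brand new vma and mark it attached afterwards */
static inline void vma_iter_store_new(struct vma_iterator *vmi,
				      struct vm_area_struct *vma)
{
	__mas_set_range(&vmi->mas, vma->vm_start, vma->vm_end - 1);
	mas_store_prealloc(&vmi->mas, vma);
	vma_mark_attached(vma);
}

Callers would then pick the variant by name instead of passing true/false.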
>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
> include/linux/mm.h | 12 ++++++++++++
> mm/nommu.c | 4 ++--
> mm/vma.c | 16 ++++++++--------
> mm/vma.h | 13 +++++++++----
> 4 files changed, 31 insertions(+), 14 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 081178b0eec4..c50edfedd99d 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -821,6 +821,16 @@ static inline void vma_assert_locked(struct vm_area_struct *vma)
> vma_assert_write_locked(vma);
> }
>
> +static inline void vma_assert_attached(struct vm_area_struct *vma)
> +{
> + VM_BUG_ON_VMA(vma->detached, vma);
> +}
> +
> +static inline void vma_assert_detached(struct vm_area_struct *vma)
> +{
> + VM_BUG_ON_VMA(!vma->detached, vma);
> +}
> +
> static inline void vma_mark_attached(struct vm_area_struct *vma)
> {
> vma->detached = false;
> @@ -866,6 +876,8 @@ static inline void vma_end_read(struct vm_area_struct *vma) {}
> static inline void vma_start_write(struct vm_area_struct *vma) {}
> static inline void vma_assert_write_locked(struct vm_area_struct *vma)
> { mmap_assert_write_locked(vma->vm_mm); }
> +static inline void vma_assert_attached(struct vm_area_struct *vma) {}
> +static inline void vma_assert_detached(struct vm_area_struct *vma) {}
> static inline void vma_mark_attached(struct vm_area_struct *vma) {}
> static inline void vma_mark_detached(struct vm_area_struct *vma) {}
>
> diff --git a/mm/nommu.c b/mm/nommu.c
> index 9cb6e99215e2..72c8c505836c 100644
> --- a/mm/nommu.c
> +++ b/mm/nommu.c
> @@ -1191,7 +1191,7 @@ unsigned long do_mmap(struct file *file,
> setup_vma_to_mm(vma, current->mm);
> current->mm->map_count++;
> /* add the VMA to the tree */
> - vma_iter_store(&vmi, vma);
> + vma_iter_store(&vmi, vma, true);
>
> /* we flush the region from the icache only when the first executable
> * mapping of it is made */
> @@ -1356,7 +1356,7 @@ static int split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
>
> setup_vma_to_mm(vma, mm);
> setup_vma_to_mm(new, mm);
> - vma_iter_store(vmi, new);
> + vma_iter_store(vmi, new, true);
> mm->map_count++;
> return 0;
>
> diff --git a/mm/vma.c b/mm/vma.c
> index 476146c25283..ce113dd8c471 100644
> --- a/mm/vma.c
> +++ b/mm/vma.c
> @@ -306,7 +306,7 @@ static void vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi,
> * us to insert it before dropping the locks
> * (it may either follow vma or precede it).
> */
> - vma_iter_store(vmi, vp->insert);
> + vma_iter_store(vmi, vp->insert, true);
> mm->map_count++;
> }
>
> @@ -660,14 +660,14 @@ static int commit_merge(struct vma_merge_struct *vmg,
> vma_set_range(vmg->vma, vmg->start, vmg->end, vmg->pgoff);
>
> if (expanded)
> - vma_iter_store(vmg->vmi, vmg->vma);
> + vma_iter_store(vmg->vmi, vmg->vma, false);
>
> if (adj_start) {
> adjust->vm_start += adj_start;
> adjust->vm_pgoff += PHYS_PFN(adj_start);
> if (adj_start < 0) {
> WARN_ON(expanded);
> - vma_iter_store(vmg->vmi, adjust);
> + vma_iter_store(vmg->vmi, adjust, false);
> }
> }
>
> @@ -1689,7 +1689,7 @@ int vma_link(struct mm_struct *mm, struct vm_area_struct *vma)
> return -ENOMEM;
>
> vma_start_write(vma);
> - vma_iter_store(&vmi, vma);
> + vma_iter_store(&vmi, vma, true);
> vma_link_file(vma);
> mm->map_count++;
> validate_mm(mm);
> @@ -2368,7 +2368,7 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap)
>
> /* Lock the VMA since it is modified after insertion into VMA tree */
> vma_start_write(vma);
> - vma_iter_store(vmi, vma);
> + vma_iter_store(vmi, vma, true);
> map->mm->map_count++;
> vma_link_file(vma);
>
> @@ -2542,7 +2542,7 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
> vm_flags_init(vma, flags);
> vma->vm_page_prot = vm_get_page_prot(flags);
> vma_start_write(vma);
> - if (vma_iter_store_gfp(vmi, vma, GFP_KERNEL))
> + if (vma_iter_store_gfp(vmi, vma, GFP_KERNEL, true))
> goto mas_store_fail;
>
> mm->map_count++;
> @@ -2785,7 +2785,7 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
> anon_vma_interval_tree_pre_update_vma(vma);
> vma->vm_end = address;
> /* Overwrite old entry in mtree. */
> - vma_iter_store(&vmi, vma);
> + vma_iter_store(&vmi, vma, false);
> anon_vma_interval_tree_post_update_vma(vma);
>
> perf_event_mmap(vma);
> @@ -2865,7 +2865,7 @@ int expand_downwards(struct vm_area_struct *vma, unsigned long address)
> vma->vm_start = address;
> vma->vm_pgoff -= grow;
> /* Overwrite old entry in mtree. */
> - vma_iter_store(&vmi, vma);
> + vma_iter_store(&vmi, vma, false);
> anon_vma_interval_tree_post_update_vma(vma);
>
> perf_event_mmap(vma);
> diff --git a/mm/vma.h b/mm/vma.h
> index 24636a2b0acf..18c9e49b1eae 100644
> --- a/mm/vma.h
> +++ b/mm/vma.h
> @@ -145,7 +145,7 @@ __must_check int vma_shrink(struct vma_iterator *vmi,
> unsigned long start, unsigned long end, pgoff_t pgoff);
>
> static inline int vma_iter_store_gfp(struct vma_iterator *vmi,
> - struct vm_area_struct *vma, gfp_t gfp)
> + struct vm_area_struct *vma, gfp_t gfp, bool new_vma)
>
> {
> if (vmi->mas.status != ma_start &&
> @@ -157,7 +157,10 @@ static inline int vma_iter_store_gfp(struct vma_iterator *vmi,
> if (unlikely(mas_is_err(&vmi->mas)))
> return -ENOMEM;
>
> - vma_mark_attached(vma);
> + if (new_vma)
> + vma_mark_attached(vma);
> + vma_assert_attached(vma);
> +
> return 0;
> }
>
> @@ -366,7 +369,7 @@ static inline struct vm_area_struct *vma_iter_load(struct vma_iterator *vmi)
>
> /* Store a VMA with preallocated memory */
> static inline void vma_iter_store(struct vma_iterator *vmi,
> - struct vm_area_struct *vma)
> + struct vm_area_struct *vma, bool new_vma)
> {
>
> #if defined(CONFIG_DEBUG_VM_MAPLE_TREE)
> @@ -390,7 +393,9 @@ static inline void vma_iter_store(struct vma_iterator *vmi,
>
> __mas_set_range(&vmi->mas, vma->vm_start, vma->vm_end - 1);
> mas_store_prealloc(&vmi->mas, vma);
> - vma_mark_attached(vma);
> + if (new_vma)
> + vma_mark_attached(vma);
> + vma_assert_attached(vma);
> }
>
> static inline unsigned long vma_iter_addr(struct vma_iterator *vmi)
> --
> 2.47.1.613.gc27f4b7a9f-goog
>
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v7 04/17] mm: modify vma_iter_store{_gfp} to indicate if it's storing a new vma
2025-01-07 16:49 ` Liam R. Howlett
@ 2025-01-07 17:12 ` Suren Baghdasaryan
0 siblings, 0 replies; 65+ messages in thread
From: Suren Baghdasaryan @ 2025-01-07 17:12 UTC (permalink / raw)
To: Liam R. Howlett, Suren Baghdasaryan, akpm, peterz, willy,
lorenzo.stoakes, mhocko, vbabka, hannes, mjguzik, oliver.sang,
mgorman, david, peterx, oleg, dave, paulmck, brauner, dhowells,
hdanton, hughd, lokeshgidra, minchan, jannh, shakeel.butt,
souravpanda, pasha.tatashin, klarasmodin, corbet, linux-doc,
linux-mm, linux-kernel, kernel-team
On Tue, Jan 7, 2025 at 8:50 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> * Suren Baghdasaryan <surenb@google.com> [241226 12:07]:
> > vma_iter_store() functions can be used both when adding a new vma and
> > when updating an existing one. However for existing ones we do not need
> > to mark them attached as they are already marked that way. Add a parameter
> > to distinguish the usage and skip vma_mark_attached() when not needed.
>
> I really don't like boolean flags - especially to such a small function.
>
> The passing of flags complicates things and is not self documenting. Can
> we make a new vma_iter_store_detach() that just calls vma_iter_store()
> then does the detach?
Sure, I'll do that. Thanks for the feedback!
>
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > ---
> > include/linux/mm.h | 12 ++++++++++++
> > mm/nommu.c | 4 ++--
> > mm/vma.c | 16 ++++++++--------
> > mm/vma.h | 13 +++++++++----
> > 4 files changed, 31 insertions(+), 14 deletions(-)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 081178b0eec4..c50edfedd99d 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -821,6 +821,16 @@ static inline void vma_assert_locked(struct vm_area_struct *vma)
> > vma_assert_write_locked(vma);
> > }
> >
> > +static inline void vma_assert_attached(struct vm_area_struct *vma)
> > +{
> > + VM_BUG_ON_VMA(vma->detached, vma);
> > +}
> > +
> > +static inline void vma_assert_detached(struct vm_area_struct *vma)
> > +{
> > + VM_BUG_ON_VMA(!vma->detached, vma);
> > +}
> > +
> > static inline void vma_mark_attached(struct vm_area_struct *vma)
> > {
> > vma->detached = false;
> > @@ -866,6 +876,8 @@ static inline void vma_end_read(struct vm_area_struct *vma) {}
> > static inline void vma_start_write(struct vm_area_struct *vma) {}
> > static inline void vma_assert_write_locked(struct vm_area_struct *vma)
> > { mmap_assert_write_locked(vma->vm_mm); }
> > +static inline void vma_assert_attached(struct vm_area_struct *vma) {}
> > +static inline void vma_assert_detached(struct vm_area_struct *vma) {}
> > static inline void vma_mark_attached(struct vm_area_struct *vma) {}
> > static inline void vma_mark_detached(struct vm_area_struct *vma) {}
> >
> > diff --git a/mm/nommu.c b/mm/nommu.c
> > index 9cb6e99215e2..72c8c505836c 100644
> > --- a/mm/nommu.c
> > +++ b/mm/nommu.c
> > @@ -1191,7 +1191,7 @@ unsigned long do_mmap(struct file *file,
> > setup_vma_to_mm(vma, current->mm);
> > current->mm->map_count++;
> > /* add the VMA to the tree */
> > - vma_iter_store(&vmi, vma);
> > + vma_iter_store(&vmi, vma, true);
> >
> > /* we flush the region from the icache only when the first executable
> > * mapping of it is made */
> > @@ -1356,7 +1356,7 @@ static int split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
> >
> > setup_vma_to_mm(vma, mm);
> > setup_vma_to_mm(new, mm);
> > - vma_iter_store(vmi, new);
> > + vma_iter_store(vmi, new, true);
> > mm->map_count++;
> > return 0;
> >
> > diff --git a/mm/vma.c b/mm/vma.c
> > index 476146c25283..ce113dd8c471 100644
> > --- a/mm/vma.c
> > +++ b/mm/vma.c
> > @@ -306,7 +306,7 @@ static void vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi,
> > * us to insert it before dropping the locks
> > * (it may either follow vma or precede it).
> > */
> > - vma_iter_store(vmi, vp->insert);
> > + vma_iter_store(vmi, vp->insert, true);
> > mm->map_count++;
> > }
> >
> > @@ -660,14 +660,14 @@ static int commit_merge(struct vma_merge_struct *vmg,
> > vma_set_range(vmg->vma, vmg->start, vmg->end, vmg->pgoff);
> >
> > if (expanded)
> > - vma_iter_store(vmg->vmi, vmg->vma);
> > + vma_iter_store(vmg->vmi, vmg->vma, false);
> >
> > if (adj_start) {
> > adjust->vm_start += adj_start;
> > adjust->vm_pgoff += PHYS_PFN(adj_start);
> > if (adj_start < 0) {
> > WARN_ON(expanded);
> > - vma_iter_store(vmg->vmi, adjust);
> > + vma_iter_store(vmg->vmi, adjust, false);
> > }
> > }
> >
> > @@ -1689,7 +1689,7 @@ int vma_link(struct mm_struct *mm, struct vm_area_struct *vma)
> > return -ENOMEM;
> >
> > vma_start_write(vma);
> > - vma_iter_store(&vmi, vma);
> > + vma_iter_store(&vmi, vma, true);
> > vma_link_file(vma);
> > mm->map_count++;
> > validate_mm(mm);
> > @@ -2368,7 +2368,7 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap)
> >
> > /* Lock the VMA since it is modified after insertion into VMA tree */
> > vma_start_write(vma);
> > - vma_iter_store(vmi, vma);
> > + vma_iter_store(vmi, vma, true);
> > map->mm->map_count++;
> > vma_link_file(vma);
> >
> > @@ -2542,7 +2542,7 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > vm_flags_init(vma, flags);
> > vma->vm_page_prot = vm_get_page_prot(flags);
> > vma_start_write(vma);
> > - if (vma_iter_store_gfp(vmi, vma, GFP_KERNEL))
> > + if (vma_iter_store_gfp(vmi, vma, GFP_KERNEL, true))
> > goto mas_store_fail;
> >
> > mm->map_count++;
> > @@ -2785,7 +2785,7 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
> > anon_vma_interval_tree_pre_update_vma(vma);
> > vma->vm_end = address;
> > /* Overwrite old entry in mtree. */
> > - vma_iter_store(&vmi, vma);
> > + vma_iter_store(&vmi, vma, false);
> > anon_vma_interval_tree_post_update_vma(vma);
> >
> > perf_event_mmap(vma);
> > @@ -2865,7 +2865,7 @@ int expand_downwards(struct vm_area_struct *vma, unsigned long address)
> > vma->vm_start = address;
> > vma->vm_pgoff -= grow;
> > /* Overwrite old entry in mtree. */
> > - vma_iter_store(&vmi, vma);
> > + vma_iter_store(&vmi, vma, false);
> > anon_vma_interval_tree_post_update_vma(vma);
> >
> > perf_event_mmap(vma);
> > diff --git a/mm/vma.h b/mm/vma.h
> > index 24636a2b0acf..18c9e49b1eae 100644
> > --- a/mm/vma.h
> > +++ b/mm/vma.h
> > @@ -145,7 +145,7 @@ __must_check int vma_shrink(struct vma_iterator *vmi,
> > unsigned long start, unsigned long end, pgoff_t pgoff);
> >
> > static inline int vma_iter_store_gfp(struct vma_iterator *vmi,
> > - struct vm_area_struct *vma, gfp_t gfp)
> > + struct vm_area_struct *vma, gfp_t gfp, bool new_vma)
> >
> > {
> > if (vmi->mas.status != ma_start &&
> > @@ -157,7 +157,10 @@ static inline int vma_iter_store_gfp(struct vma_iterator *vmi,
> > if (unlikely(mas_is_err(&vmi->mas)))
> > return -ENOMEM;
> >
> > - vma_mark_attached(vma);
> > + if (new_vma)
> > + vma_mark_attached(vma);
> > + vma_assert_attached(vma);
> > +
> > return 0;
> > }
> >
> > @@ -366,7 +369,7 @@ static inline struct vm_area_struct *vma_iter_load(struct vma_iterator *vmi)
> >
> > /* Store a VMA with preallocated memory */
> > static inline void vma_iter_store(struct vma_iterator *vmi,
> > - struct vm_area_struct *vma)
> > + struct vm_area_struct *vma, bool new_vma)
> > {
> >
> > #if defined(CONFIG_DEBUG_VM_MAPLE_TREE)
> > @@ -390,7 +393,9 @@ static inline void vma_iter_store(struct vma_iterator *vmi,
> >
> > __mas_set_range(&vmi->mas, vma->vm_start, vma->vm_end - 1);
> > mas_store_prealloc(&vmi->mas, vma);
> > - vma_mark_attached(vma);
> > + if (new_vma)
> > + vma_mark_attached(vma);
> > + vma_assert_attached(vma);
> > }
> >
> > static inline unsigned long vma_iter_addr(struct vma_iterator *vmi)
> > --
> > 2.47.1.613.gc27f4b7a9f-goog
> >
^ permalink raw reply [flat|nested] 65+ messages in thread
* [PATCH v7 05/17] mm: mark vmas detached upon exit
2024-12-26 17:06 [PATCH v7 00/17] move per-vma lock into vm_area_struct Suren Baghdasaryan
` (3 preceding siblings ...)
2024-12-26 17:06 ` [PATCH v7 04/17] mm: modify vma_iter_store{_gfp} to indicate if it's storing a new vma Suren Baghdasaryan
@ 2024-12-26 17:06 ` Suren Baghdasaryan
2025-01-07 17:08 ` Vlastimil Babka
2024-12-26 17:06 ` [PATCH v7 06/17] mm/nommu: fix the last places where vma is not locked before being attached Suren Baghdasaryan
` (11 subsequent siblings)
16 siblings, 1 reply; 65+ messages in thread
From: Suren Baghdasaryan @ 2024-12-26 17:06 UTC (permalink / raw)
To: akpm
Cc: peterz, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team, surenb
When exit_mmap() removes vmas belonging to an exiting task, it does not
mark them as detached since they can't be reached by other tasks and they
will be freed shortly. Once we introduce vma reuse, all vmas will have to
be in detached state before they are freed, to ensure that a vma is in a
consistent state when it is reused. Add the missing vma_mark_detached() before freeing the
vma.
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
mm/vma.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/mm/vma.c b/mm/vma.c
index ce113dd8c471..4a3deb6f9662 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -413,9 +413,10 @@ void remove_vma(struct vm_area_struct *vma, bool unreachable)
if (vma->vm_file)
fput(vma->vm_file);
mpol_put(vma_policy(vma));
- if (unreachable)
+ if (unreachable) {
+ vma_mark_detached(vma);
__vm_area_free(vma);
- else
+ } else
vm_area_free(vma);
}
--
2.47.1.613.gc27f4b7a9f-goog
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v7 05/17] mm: mark vmas detached upon exit
2024-12-26 17:06 ` [PATCH v7 05/17] mm: mark vmas detached upon exit Suren Baghdasaryan
@ 2025-01-07 17:08 ` Vlastimil Babka
2025-01-07 17:13 ` Suren Baghdasaryan
0 siblings, 1 reply; 65+ messages in thread
From: Vlastimil Babka @ 2025-01-07 17:08 UTC (permalink / raw)
To: Suren Baghdasaryan, akpm
Cc: peterz, willy, liam.howlett, lorenzo.stoakes, mhocko, hannes,
mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team
On 12/26/24 18:06, Suren Baghdasaryan wrote:
> When exit_mmap() removes vmas belonging to an exiting task, it does not
> mark them as detached since they can't be reached by other tasks and they
> will be freed shortly. Once we introduce vma reuse, all vmas will have to
> be in detached state before they are freed to ensure vma when reused is
> in a consistent state. Add missing vma_mark_detached() before freeing the
> vma.
>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> ---
> mm/vma.c | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/mm/vma.c b/mm/vma.c
> index ce113dd8c471..4a3deb6f9662 100644
> --- a/mm/vma.c
> +++ b/mm/vma.c
> @@ -413,9 +413,10 @@ void remove_vma(struct vm_area_struct *vma, bool unreachable)
> if (vma->vm_file)
> fput(vma->vm_file);
> mpol_put(vma_policy(vma));
> - if (unreachable)
> + if (unreachable) {
> + vma_mark_detached(vma);
> __vm_area_free(vma);
> - else
> + } else
> vm_area_free(vma);
Nit: per style this should be { } too now.
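For reference, the fully braced form the nit asks for looks like this (same
logic as the hunk above, just following the multi-branch brace rule):

	if (unreachable) {
		vma_mark_detached(vma);
		__vm_area_free(vma);
	} else {
		vm_area_free(vma);
	}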
> }
>
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v7 05/17] mm: mark vmas detached upon exit
2025-01-07 17:08 ` Vlastimil Babka
@ 2025-01-07 17:13 ` Suren Baghdasaryan
0 siblings, 0 replies; 65+ messages in thread
From: Suren Baghdasaryan @ 2025-01-07 17:13 UTC (permalink / raw)
To: Vlastimil Babka
Cc: akpm, peterz, willy, liam.howlett, lorenzo.stoakes, mhocko,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team
On Tue, Jan 7, 2025 at 9:08 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 12/26/24 18:06, Suren Baghdasaryan wrote:
> > When exit_mmap() removes vmas belonging to an exiting task, it does not
> > mark them as detached since they can't be reached by other tasks and they
> > will be freed shortly. Once we introduce vma reuse, all vmas will have to
> > be in detached state before they are freed to ensure vma when reused is
> > in a consistent state. Add missing vma_mark_detached() before freeing the
> > vma.
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
>
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
>
> > ---
> > mm/vma.c | 5 +++--
> > 1 file changed, 3 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/vma.c b/mm/vma.c
> > index ce113dd8c471..4a3deb6f9662 100644
> > --- a/mm/vma.c
> > +++ b/mm/vma.c
> > @@ -413,9 +413,10 @@ void remove_vma(struct vm_area_struct *vma, bool unreachable)
> > if (vma->vm_file)
> > fput(vma->vm_file);
> > mpol_put(vma_policy(vma));
> > - if (unreachable)
> > + if (unreachable) {
> > + vma_mark_detached(vma);
> > __vm_area_free(vma);
> > - else
> > + } else
> > vm_area_free(vma);
>
> Nit: per style this should be { } too now.
Thanks. Will fix it in the next version since I have to respin the
series anyway.
>
> > }
> >
>
^ permalink raw reply [flat|nested] 65+ messages in thread
* [PATCH v7 06/17] mm/nommu: fix the last places where vma is not locked before being attached
2024-12-26 17:06 [PATCH v7 00/17] move per-vma lock into vm_area_struct Suren Baghdasaryan
` (4 preceding siblings ...)
2024-12-26 17:06 ` [PATCH v7 05/17] mm: mark vmas detached upon exit Suren Baghdasaryan
@ 2024-12-26 17:06 ` Suren Baghdasaryan
2025-01-07 17:51 ` Liam R. Howlett
2024-12-26 17:06 ` [PATCH v7 07/17] types: move struct rcuwait into types.h Suren Baghdasaryan
` (10 subsequent siblings)
16 siblings, 1 reply; 65+ messages in thread
From: Suren Baghdasaryan @ 2024-12-26 17:06 UTC (permalink / raw)
To: akpm
Cc: peterz, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team, surenb
nommu configuration has two places where vma gets attached to the vma tree
without write-locking it. Add the missing locks to ensure vma is always
locked before it's attached.
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
mm/nommu.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/mm/nommu.c b/mm/nommu.c
index 72c8c505836c..1754e84e5758 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1189,6 +1189,7 @@ unsigned long do_mmap(struct file *file,
goto error_just_free;
setup_vma_to_mm(vma, current->mm);
+ vma_start_write(vma);
current->mm->map_count++;
/* add the VMA to the tree */
vma_iter_store(&vmi, vma, true);
@@ -1356,6 +1357,7 @@ static int split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
setup_vma_to_mm(vma, mm);
setup_vma_to_mm(new, mm);
+ vma_start_write(new);
vma_iter_store(vmi, new, true);
mm->map_count++;
return 0;
--
2.47.1.613.gc27f4b7a9f-goog
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v7 06/17] mm/nommu: fix the last places where vma is not locked before being attached
2024-12-26 17:06 ` [PATCH v7 06/17] mm/nommu: fix the last places where vma is not locked before being attached Suren Baghdasaryan
@ 2025-01-07 17:51 ` Liam R. Howlett
2025-01-07 18:05 ` Suren Baghdasaryan
0 siblings, 1 reply; 65+ messages in thread
From: Liam R. Howlett @ 2025-01-07 17:51 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: akpm, peterz, willy, lorenzo.stoakes, mhocko, vbabka, hannes,
mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team
* Suren Baghdasaryan <surenb@google.com> [241226 12:07]:
> nommu configuration has two places where vma gets attached to the vma tree
> without write-locking it. Add the missing locks to ensure vma is always
> locked before it's attached.
Does the delete side need to write lock as well?
>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
> mm/nommu.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/mm/nommu.c b/mm/nommu.c
> index 72c8c505836c..1754e84e5758 100644
> --- a/mm/nommu.c
> +++ b/mm/nommu.c
> @@ -1189,6 +1189,7 @@ unsigned long do_mmap(struct file *file,
> goto error_just_free;
>
> setup_vma_to_mm(vma, current->mm);
> + vma_start_write(vma);
> current->mm->map_count++;
> /* add the VMA to the tree */
> vma_iter_store(&vmi, vma, true);
> @@ -1356,6 +1357,7 @@ static int split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
>
> setup_vma_to_mm(vma, mm);
> setup_vma_to_mm(new, mm);
> + vma_start_write(new);
> vma_iter_store(vmi, new, true);
> mm->map_count++;
> return 0;
> --
> 2.47.1.613.gc27f4b7a9f-goog
>
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v7 06/17] mm/nommu: fix the last places where vma is not locked before being attached
2025-01-07 17:51 ` Liam R. Howlett
@ 2025-01-07 18:05 ` Suren Baghdasaryan
0 siblings, 0 replies; 65+ messages in thread
From: Suren Baghdasaryan @ 2025-01-07 18:05 UTC (permalink / raw)
To: Liam R. Howlett, Suren Baghdasaryan, akpm, peterz, willy,
lorenzo.stoakes, mhocko, vbabka, hannes, mjguzik, oliver.sang,
mgorman, david, peterx, oleg, dave, paulmck, brauner, dhowells,
hdanton, hughd, lokeshgidra, minchan, jannh, shakeel.butt,
souravpanda, pasha.tatashin, klarasmodin, corbet, linux-doc,
linux-mm, linux-kernel, kernel-team
On Tue, Jan 7, 2025 at 9:51 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> * Suren Baghdasaryan <surenb@google.com> [241226 12:07]:
> > nommu configuration has two places where vma gets attached to the vma tree
> > without write-locking it. Add the missing locks to ensure vma is always
> > locked before it's attached.
>
> Does the delete side need to write lock as well?
Ugh. I just realized that CONFIG_PER_VMA_LOCK depends on CONFIG_MMU,
so this patch is not needed because all these per-vma functions are
NoOps when CONFIG_PER_VMA_LOCK=n. I'll drop it in the next version.
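(For context, the !CONFIG_PER_VMA_LOCK stubs, visible in the patch 04 hunk
earlier in this thread, already turn these helpers into no-ops:

static inline void vma_start_write(struct vm_area_struct *vma) {}
static inline void vma_mark_attached(struct vm_area_struct *vma) {}
static inline void vma_mark_detached(struct vm_area_struct *vma) {}

so a nommu build never takes the locking paths this patch was adding.)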
>
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > ---
> > mm/nommu.c | 2 ++
> > 1 file changed, 2 insertions(+)
> >
> > diff --git a/mm/nommu.c b/mm/nommu.c
> > index 72c8c505836c..1754e84e5758 100644
> > --- a/mm/nommu.c
> > +++ b/mm/nommu.c
> > @@ -1189,6 +1189,7 @@ unsigned long do_mmap(struct file *file,
> > goto error_just_free;
> >
> > setup_vma_to_mm(vma, current->mm);
> > + vma_start_write(vma);
> > current->mm->map_count++;
> > /* add the VMA to the tree */
> > vma_iter_store(&vmi, vma, true);
> > @@ -1356,6 +1357,7 @@ static int split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
> >
> > setup_vma_to_mm(vma, mm);
> > setup_vma_to_mm(new, mm);
> > + vma_start_write(new);
> > vma_iter_store(vmi, new, true);
> > mm->map_count++;
> > return 0;
> > --
> > 2.47.1.613.gc27f4b7a9f-goog
> >
^ permalink raw reply [flat|nested] 65+ messages in thread
* [PATCH v7 07/17] types: move struct rcuwait into types.h
2024-12-26 17:06 [PATCH v7 00/17] move per-vma lock into vm_area_struct Suren Baghdasaryan
` (5 preceding siblings ...)
2024-12-26 17:06 ` [PATCH v7 06/17] mm/nommu: fix the last places where vma is not locked before being attached Suren Baghdasaryan
@ 2024-12-26 17:06 ` Suren Baghdasaryan
2024-12-27 18:35 ` Davidlohr Bueso
2025-01-08 15:02 ` Liam R. Howlett
2024-12-26 17:07 ` [PATCH v7 08/17] mm: allow vma_start_read_locked/vma_start_read_locked_nested to fail Suren Baghdasaryan
` (9 subsequent siblings)
16 siblings, 2 replies; 65+ messages in thread
From: Suren Baghdasaryan @ 2024-12-26 17:06 UTC (permalink / raw)
To: akpm
Cc: peterz, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team, surenb
Move rcuwait struct definition into types.h so that rcuwait can be used
without including rcuwait.h which includes other headers. Without this
change mm_types.h can't use rcuwait due to the following circular
dependency:
mm_types.h -> rcuwait.h -> signal.h -> mm_types.h
Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
include/linux/rcuwait.h | 13 +------------
include/linux/types.h | 12 ++++++++++++
2 files changed, 13 insertions(+), 12 deletions(-)
diff --git a/include/linux/rcuwait.h b/include/linux/rcuwait.h
index 27343424225c..9ad134a04b41 100644
--- a/include/linux/rcuwait.h
+++ b/include/linux/rcuwait.h
@@ -4,18 +4,7 @@
#include <linux/rcupdate.h>
#include <linux/sched/signal.h>
-
-/*
- * rcuwait provides a way of blocking and waking up a single
- * task in an rcu-safe manner.
- *
- * The only time @task is non-nil is when a user is blocked (or
- * checking if it needs to) on a condition, and reset as soon as we
- * know that the condition has succeeded and are awoken.
- */
-struct rcuwait {
- struct task_struct __rcu *task;
-};
+#include <linux/types.h>
#define __RCUWAIT_INITIALIZER(name) \
{ .task = NULL, }
diff --git a/include/linux/types.h b/include/linux/types.h
index 2d7b9ae8714c..f1356a9a5730 100644
--- a/include/linux/types.h
+++ b/include/linux/types.h
@@ -248,5 +248,17 @@ typedef void (*swap_func_t)(void *a, void *b, int size);
typedef int (*cmp_r_func_t)(const void *a, const void *b, const void *priv);
typedef int (*cmp_func_t)(const void *a, const void *b);
+/*
+ * rcuwait provides a way of blocking and waking up a single
+ * task in an rcu-safe manner.
+ *
+ * The only time @task is non-nil is when a user is blocked (or
+ * checking if it needs to) on a condition, and reset as soon as we
+ * know that the condition has succeeded and are awoken.
+ */
+struct rcuwait {
+ struct task_struct __rcu *task;
+};
+
#endif /* __ASSEMBLY__ */
#endif /* _LINUX_TYPES_H */
--
2.47.1.613.gc27f4b7a9f-goog
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v7 07/17] types: move struct rcuwait into types.h
2024-12-26 17:06 ` [PATCH v7 07/17] types: move struct rcuwait into types.h Suren Baghdasaryan
@ 2024-12-27 18:35 ` Davidlohr Bueso
2025-01-08 15:02 ` Liam R. Howlett
1 sibling, 0 replies; 65+ messages in thread
From: Davidlohr Bueso @ 2024-12-27 18:35 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: akpm, peterz, willy, liam.howlett, lorenzo.stoakes, mhocko,
vbabka, hannes, mjguzik, oliver.sang, mgorman, david, peterx,
oleg, paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra,
minchan, jannh, shakeel.butt, souravpanda, pasha.tatashin,
klarasmodin, corbet, linux-doc, linux-mm, linux-kernel,
kernel-team
On Thu, 26 Dec 2024, Suren Baghdasaryan wrote:
>Move rcuwait struct definition into types.h so that rcuwait can be used
>without including rcuwait.h which includes other headers. Without this
>change mm_types.h can't use rcuwait due to the following circular
>dependency:
>
>mm_types.h -> rcuwait.h -> signal.h -> mm_types.h
>
>Suggested-by: Matthew Wilcox <willy@infradead.org>
>Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v7 07/17] types: move struct rcuwait into types.h
2024-12-26 17:06 ` [PATCH v7 07/17] types: move struct rcuwait into types.h Suren Baghdasaryan
2024-12-27 18:35 ` Davidlohr Bueso
@ 2025-01-08 15:02 ` Liam R. Howlett
1 sibling, 0 replies; 65+ messages in thread
From: Liam R. Howlett @ 2025-01-08 15:02 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: akpm, peterz, willy, lorenzo.stoakes, mhocko, vbabka, hannes,
mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team
* Suren Baghdasaryan <surenb@google.com> [241226 12:07]:
> Move rcuwait struct definition into types.h so that rcuwait can be used
> without including rcuwait.h which includes other headers. Without this
> change mm_types.h can't use rcuwait due to the following circular
> dependency:
>
> mm_types.h -> rcuwait.h -> signal.h -> mm_types.h
>
> Suggested-by: Matthew Wilcox <willy@infradead.org>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> ---
> include/linux/rcuwait.h | 13 +------------
> include/linux/types.h | 12 ++++++++++++
> 2 files changed, 13 insertions(+), 12 deletions(-)
>
> diff --git a/include/linux/rcuwait.h b/include/linux/rcuwait.h
> index 27343424225c..9ad134a04b41 100644
> --- a/include/linux/rcuwait.h
> +++ b/include/linux/rcuwait.h
> @@ -4,18 +4,7 @@
>
> #include <linux/rcupdate.h>
> #include <linux/sched/signal.h>
> -
> -/*
> - * rcuwait provides a way of blocking and waking up a single
> - * task in an rcu-safe manner.
> - *
> - * The only time @task is non-nil is when a user is blocked (or
> - * checking if it needs to) on a condition, and reset as soon as we
> - * know that the condition has succeeded and are awoken.
> - */
> -struct rcuwait {
> - struct task_struct __rcu *task;
> -};
> +#include <linux/types.h>
>
> #define __RCUWAIT_INITIALIZER(name) \
> { .task = NULL, }
> diff --git a/include/linux/types.h b/include/linux/types.h
> index 2d7b9ae8714c..f1356a9a5730 100644
> --- a/include/linux/types.h
> +++ b/include/linux/types.h
> @@ -248,5 +248,17 @@ typedef void (*swap_func_t)(void *a, void *b, int size);
> typedef int (*cmp_r_func_t)(const void *a, const void *b, const void *priv);
> typedef int (*cmp_func_t)(const void *a, const void *b);
>
> +/*
> + * rcuwait provides a way of blocking and waking up a single
> + * task in an rcu-safe manner.
> + *
> + * The only time @task is non-nil is when a user is blocked (or
> + * checking if it needs to) on a condition, and reset as soon as we
> + * know that the condition has succeeded and are awoken.
> + */
> +struct rcuwait {
> + struct task_struct __rcu *task;
> +};
> +
> #endif /* __ASSEMBLY__ */
> #endif /* _LINUX_TYPES_H */
> --
> 2.47.1.613.gc27f4b7a9f-goog
>
^ permalink raw reply [flat|nested] 65+ messages in thread
* [PATCH v7 08/17] mm: allow vma_start_read_locked/vma_start_read_locked_nested to fail
2024-12-26 17:06 [PATCH v7 00/17] move per-vma lock into vm_area_struct Suren Baghdasaryan
` (6 preceding siblings ...)
2024-12-26 17:06 ` [PATCH v7 07/17] types: move struct rcuwait into types.h Suren Baghdasaryan
@ 2024-12-26 17:07 ` Suren Baghdasaryan
2025-01-07 17:28 ` Vlastimil Babka
2024-12-26 17:07 ` [PATCH v7 09/17] mm: move mmap_init_lock() out of the header file Suren Baghdasaryan
` (8 subsequent siblings)
16 siblings, 1 reply; 65+ messages in thread
From: Suren Baghdasaryan @ 2024-12-26 17:07 UTC (permalink / raw)
To: akpm
Cc: peterz, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team, surenb
With the upcoming replacement of vm_lock with vm_refcnt, we need to handle the
possibility of vma_start_read_locked/vma_start_read_locked_nested failing
due to refcount overflow. Prepare for that possibility by changing these
APIs and adjusting their users.
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Cc: Lokesh Gidra <lokeshgidra@google.com>
---
include/linux/mm.h | 6 ++++--
mm/userfaultfd.c | 17 ++++++++++++-----
2 files changed, 16 insertions(+), 7 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index c50edfedd99d..ab27de9729d8 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -747,10 +747,11 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
* not be used in such cases because it might fail due to mm_lock_seq overflow.
* This functionality is used to obtain vma read lock and drop the mmap read lock.
*/
-static inline void vma_start_read_locked_nested(struct vm_area_struct *vma, int subclass)
+static inline bool vma_start_read_locked_nested(struct vm_area_struct *vma, int subclass)
{
mmap_assert_locked(vma->vm_mm);
down_read_nested(&vma->vm_lock.lock, subclass);
+ return true;
}
/*
@@ -759,10 +760,11 @@ static inline void vma_start_read_locked_nested(struct vm_area_struct *vma, int
* not be used in such cases because it might fail due to mm_lock_seq overflow.
* This functionality is used to obtain vma read lock and drop the mmap read lock.
*/
-static inline void vma_start_read_locked(struct vm_area_struct *vma)
+static inline bool vma_start_read_locked(struct vm_area_struct *vma)
{
mmap_assert_locked(vma->vm_mm);
down_read(&vma->vm_lock.lock);
+ return true;
}
static inline void vma_end_read(struct vm_area_struct *vma)
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 4527c385935b..38207d8be205 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -85,7 +85,8 @@ static struct vm_area_struct *uffd_lock_vma(struct mm_struct *mm,
mmap_read_lock(mm);
vma = find_vma_and_prepare_anon(mm, address);
if (!IS_ERR(vma))
- vma_start_read_locked(vma);
+ if (!vma_start_read_locked(vma))
+ vma = ERR_PTR(-EAGAIN);
mmap_read_unlock(mm);
return vma;
@@ -1483,10 +1484,16 @@ static int uffd_move_lock(struct mm_struct *mm,
mmap_read_lock(mm);
err = find_vmas_mm_locked(mm, dst_start, src_start, dst_vmap, src_vmap);
if (!err) {
- vma_start_read_locked(*dst_vmap);
- if (*dst_vmap != *src_vmap)
- vma_start_read_locked_nested(*src_vmap,
- SINGLE_DEPTH_NESTING);
+ if (vma_start_read_locked(*dst_vmap)) {
+ if (*dst_vmap != *src_vmap) {
+ if (!vma_start_read_locked_nested(*src_vmap,
+ SINGLE_DEPTH_NESTING)) {
+ vma_end_read(*dst_vmap);
+ err = -EAGAIN;
+ }
+ }
+ } else
+ err = -EAGAIN;
}
mmap_read_unlock(mm);
return err;
--
2.47.1.613.gc27f4b7a9f-goog
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v7 08/17] mm: allow vma_start_read_locked/vma_start_read_locked_nested to fail
2024-12-26 17:07 ` [PATCH v7 08/17] mm: allow vma_start_read_locked/vma_start_read_locked_nested to fail Suren Baghdasaryan
@ 2025-01-07 17:28 ` Vlastimil Babka
2025-01-07 17:31 ` Suren Baghdasaryan
0 siblings, 1 reply; 65+ messages in thread
From: Vlastimil Babka @ 2025-01-07 17:28 UTC (permalink / raw)
To: Suren Baghdasaryan, akpm
Cc: peterz, willy, liam.howlett, lorenzo.stoakes, mhocko, hannes,
mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team
On 12/26/24 18:07, Suren Baghdasaryan wrote:
> With upcoming replacement of vm_lock with vm_refcnt, we need to handle a
> possibility of vma_start_read_locked/vma_start_read_locked_nested failing
> due to refcount overflow. Prepare for such possibility by changing these
> APIs and adjusting their users.
>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> Cc: Lokesh Gidra <lokeshgidra@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
(I don't know if this can break some uffd users?)
> @@ -1483,10 +1484,16 @@ static int uffd_move_lock(struct mm_struct *mm,
> mmap_read_lock(mm);
> err = find_vmas_mm_locked(mm, dst_start, src_start, dst_vmap, src_vmap);
> if (!err) {
> - vma_start_read_locked(*dst_vmap);
> - if (*dst_vmap != *src_vmap)
> - vma_start_read_locked_nested(*src_vmap,
> - SINGLE_DEPTH_NESTING);
> + if (vma_start_read_locked(*dst_vmap)) {
> + if (*dst_vmap != *src_vmap) {
> + if (!vma_start_read_locked_nested(*src_vmap,
> + SINGLE_DEPTH_NESTING)) {
> + vma_end_read(*dst_vmap);
> + err = -EAGAIN;
> + }
> + }
> + } else
> + err = -EAGAIN;
Nit: { } per style
> }
> mmap_read_unlock(mm);
> return err;
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v7 08/17] mm: allow vma_start_read_locked/vma_start_read_locked_nested to fail
2025-01-07 17:28 ` Vlastimil Babka
@ 2025-01-07 17:31 ` Suren Baghdasaryan
0 siblings, 0 replies; 65+ messages in thread
From: Suren Baghdasaryan @ 2025-01-07 17:31 UTC (permalink / raw)
To: Vlastimil Babka
Cc: akpm, peterz, willy, liam.howlett, lorenzo.stoakes, mhocko,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team
On Tue, Jan 7, 2025 at 9:28 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 12/26/24 18:07, Suren Baghdasaryan wrote:
> > With upcoming replacement of vm_lock with vm_refcnt, we need to handle a
> > possibility of vma_start_read_locked/vma_start_read_locked_nested failing
> > due to refcount overflow. Prepare for such possibility by changing these
> > APIs and adjusting their users.
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > Cc: Lokesh Gidra <lokeshgidra@google.com>
>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
>
> (I don't know if this can break some uffd users?)
I checked with Lokesh who added that uffd code and he confirmed that
uffd will handle errors correctly.
>
> > @@ -1483,10 +1484,16 @@ static int uffd_move_lock(struct mm_struct *mm,
> > mmap_read_lock(mm);
> > err = find_vmas_mm_locked(mm, dst_start, src_start, dst_vmap, src_vmap);
> > if (!err) {
> > - vma_start_read_locked(*dst_vmap);
> > - if (*dst_vmap != *src_vmap)
> > - vma_start_read_locked_nested(*src_vmap,
> > - SINGLE_DEPTH_NESTING);
> > + if (vma_start_read_locked(*dst_vmap)) {
> > + if (*dst_vmap != *src_vmap) {
> > + if (!vma_start_read_locked_nested(*src_vmap,
> > + SINGLE_DEPTH_NESTING)) {
> > + vma_end_read(*dst_vmap);
> > + err = -EAGAIN;
> > + }
> > + }
> > + } else
> > + err = -EAGAIN;
>
> Nit: { } per style
Ack.
>
> > }
> > mmap_read_unlock(mm);
> > return err;
>
^ permalink raw reply [flat|nested] 65+ messages in thread
* [PATCH v7 09/17] mm: move mmap_init_lock() out of the header file
2024-12-26 17:06 [PATCH v7 00/17] move per-vma lock into vm_area_struct Suren Baghdasaryan
` (7 preceding siblings ...)
2024-12-26 17:07 ` [PATCH v7 08/17] mm: allow vma_start_read_locked/vma_start_read_locked_nested to fail Suren Baghdasaryan
@ 2024-12-26 17:07 ` Suren Baghdasaryan
2025-01-07 17:30 ` Vlastimil Babka
2024-12-26 17:07 ` [PATCH v7 10/17] mm: uninline the main body of vma_start_write() Suren Baghdasaryan
` (7 subsequent siblings)
16 siblings, 1 reply; 65+ messages in thread
From: Suren Baghdasaryan @ 2024-12-26 17:07 UTC (permalink / raw)
To: akpm
Cc: peterz, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team, surenb
mmap_init_lock() is used only from mm_init() in fork.c, so it does not
have to reside in the header file. This move lets us avoid including
additional headers in mmap_lock.h later, when mmap_init_lock() will need to
initialize an rcuwait object.
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
include/linux/mmap_lock.h | 6 ------
kernel/fork.c | 6 ++++++
2 files changed, 6 insertions(+), 6 deletions(-)
diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
index 45a21faa3ff6..4706c6769902 100644
--- a/include/linux/mmap_lock.h
+++ b/include/linux/mmap_lock.h
@@ -122,12 +122,6 @@ static inline bool mmap_lock_speculate_retry(struct mm_struct *mm, unsigned int
#endif /* CONFIG_PER_VMA_LOCK */
-static inline void mmap_init_lock(struct mm_struct *mm)
-{
- init_rwsem(&mm->mmap_lock);
- mm_lock_seqcount_init(mm);
-}
-
static inline void mmap_write_lock(struct mm_struct *mm)
{
__mmap_lock_trace_start_locking(mm, true);
diff --git a/kernel/fork.c b/kernel/fork.c
index f2f9e7b427ad..d4c75428ccaf 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1219,6 +1219,12 @@ static void mm_init_uprobes_state(struct mm_struct *mm)
#endif
}
+static inline void mmap_init_lock(struct mm_struct *mm)
+{
+ init_rwsem(&mm->mmap_lock);
+ mm_lock_seqcount_init(mm);
+}
+
static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
struct user_namespace *user_ns)
{
--
2.47.1.613.gc27f4b7a9f-goog
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v7 09/17] mm: move mmap_init_lock() out of the header file
2024-12-26 17:07 ` [PATCH v7 09/17] mm: move mmap_init_lock() out of the header file Suren Baghdasaryan
@ 2025-01-07 17:30 ` Vlastimil Babka
0 siblings, 0 replies; 65+ messages in thread
From: Vlastimil Babka @ 2025-01-07 17:30 UTC (permalink / raw)
To: Suren Baghdasaryan, akpm
Cc: peterz, willy, liam.howlett, lorenzo.stoakes, mhocko, hannes,
mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team
On 12/26/24 18:07, Suren Baghdasaryan wrote:
> mmap_init_lock() is used only from mm_init() in fork.c, therefore it does
> not have to reside in the header file. This move lets us avoid including
> additional headers in mmap_lock.h later, when mmap_init_lock() needs to
> initialize rcuwait object.
>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Ideally a file named kernel/fork.c wouldn't contain so much mm specific
code, but that could be perhaps addressed separately later. The reasons here
sound pragmatic.
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> ---
> include/linux/mmap_lock.h | 6 ------
> kernel/fork.c | 6 ++++++
> 2 files changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
> index 45a21faa3ff6..4706c6769902 100644
> --- a/include/linux/mmap_lock.h
> +++ b/include/linux/mmap_lock.h
> @@ -122,12 +122,6 @@ static inline bool mmap_lock_speculate_retry(struct mm_struct *mm, unsigned int
>
> #endif /* CONFIG_PER_VMA_LOCK */
>
> -static inline void mmap_init_lock(struct mm_struct *mm)
> -{
> - init_rwsem(&mm->mmap_lock);
> - mm_lock_seqcount_init(mm);
> -}
> -
> static inline void mmap_write_lock(struct mm_struct *mm)
> {
> __mmap_lock_trace_start_locking(mm, true);
> diff --git a/kernel/fork.c b/kernel/fork.c
> index f2f9e7b427ad..d4c75428ccaf 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1219,6 +1219,12 @@ static void mm_init_uprobes_state(struct mm_struct *mm)
> #endif
> }
>
> +static inline void mmap_init_lock(struct mm_struct *mm)
> +{
> + init_rwsem(&mm->mmap_lock);
> + mm_lock_seqcount_init(mm);
> +}
> +
> static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
> struct user_namespace *user_ns)
> {
^ permalink raw reply [flat|nested] 65+ messages in thread
* [PATCH v7 10/17] mm: uninline the main body of vma_start_write()
2024-12-26 17:06 [PATCH v7 00/17] move per-vma lock into vm_area_struct Suren Baghdasaryan
` (8 preceding siblings ...)
2024-12-26 17:07 ` [PATCH v7 09/17] mm: move mmap_init_lock() out of the header file Suren Baghdasaryan
@ 2024-12-26 17:07 ` Suren Baghdasaryan
2025-01-07 17:35 ` Vlastimil Babka
2024-12-26 17:07 ` [PATCH v7 11/17] refcount: introduce __refcount_{add|inc}_not_zero_limited Suren Baghdasaryan
` (6 subsequent siblings)
16 siblings, 1 reply; 65+ messages in thread
From: Suren Baghdasaryan @ 2024-12-26 17:07 UTC (permalink / raw)
To: akpm
Cc: peterz, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team, surenb
vma_start_write() is used in many places and will grow in size very soon.
It is not used in performance critical paths and uninlining it should
limit the future code size growth.
No functional changes.
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
include/linux/mm.h | 12 +++---------
mm/memory.c | 14 ++++++++++++++
2 files changed, 17 insertions(+), 9 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ab27de9729d8..ea4c4228b125 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -787,6 +787,8 @@ static bool __is_vma_write_locked(struct vm_area_struct *vma, unsigned int *mm_l
return (vma->vm_lock_seq == *mm_lock_seq);
}
+void __vma_start_write(struct vm_area_struct *vma, unsigned int mm_lock_seq);
+
/*
* Begin writing to a VMA.
* Exclude concurrent readers under the per-VMA lock until the currently
@@ -799,15 +801,7 @@ static inline void vma_start_write(struct vm_area_struct *vma)
if (__is_vma_write_locked(vma, &mm_lock_seq))
return;
- down_write(&vma->vm_lock.lock);
- /*
- * We should use WRITE_ONCE() here because we can have concurrent reads
- * from the early lockless pessimistic check in vma_start_read().
- * We don't really care about the correctness of that early check, but
- * we should use WRITE_ONCE() for cleanliness and to keep KCSAN happy.
- */
- WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
- up_write(&vma->vm_lock.lock);
+ __vma_start_write(vma, mm_lock_seq);
}
static inline void vma_assert_write_locked(struct vm_area_struct *vma)
diff --git a/mm/memory.c b/mm/memory.c
index d0dee2282325..236fdecd44d6 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6328,6 +6328,20 @@ struct vm_area_struct *lock_mm_and_find_vma(struct mm_struct *mm,
#endif
#ifdef CONFIG_PER_VMA_LOCK
+void __vma_start_write(struct vm_area_struct *vma, unsigned int mm_lock_seq)
+{
+ down_write(&vma->vm_lock.lock);
+ /*
+ * We should use WRITE_ONCE() here because we can have concurrent reads
+ * from the early lockless pessimistic check in vma_start_read().
+ * We don't really care about the correctness of that early check, but
+ * we should use WRITE_ONCE() for cleanliness and to keep KCSAN happy.
+ */
+ WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
+ up_write(&vma->vm_lock.lock);
+}
+EXPORT_SYMBOL_GPL(__vma_start_write);
+
/*
* Lookup and lock a VMA under RCU protection. Returned VMA is guaranteed to be
* stable and not isolated. If the VMA is not found or is being modified the
--
2.47.1.613.gc27f4b7a9f-goog
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v7 10/17] mm: uninline the main body of vma_start_write()
2024-12-26 17:07 ` [PATCH v7 10/17] mm: uninline the main body of vma_start_write() Suren Baghdasaryan
@ 2025-01-07 17:35 ` Vlastimil Babka
2025-01-07 17:45 ` Suren Baghdasaryan
2025-04-08 4:39 ` Eric Naim
0 siblings, 2 replies; 65+ messages in thread
From: Vlastimil Babka @ 2025-01-07 17:35 UTC (permalink / raw)
To: Suren Baghdasaryan, akpm
Cc: peterz, willy, liam.howlett, lorenzo.stoakes, mhocko, hannes,
mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team
On 12/26/24 18:07, Suren Baghdasaryan wrote:
> vma_start_write() is used in many places and will grow in size very soon.
> It is not used in performance critical paths and uninlining it should
> limit the future code size growth.
> No functional changes.
>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -6328,6 +6328,20 @@ struct vm_area_struct *lock_mm_and_find_vma(struct mm_struct *mm,
> #endif
>
> #ifdef CONFIG_PER_VMA_LOCK
> +void __vma_start_write(struct vm_area_struct *vma, unsigned int mm_lock_seq)
> +{
> + down_write(&vma->vm_lock.lock);
> + /*
> + * We should use WRITE_ONCE() here because we can have concurrent reads
> + * from the early lockless pessimistic check in vma_start_read().
> + * We don't really care about the correctness of that early check, but
> + * we should use WRITE_ONCE() for cleanliness and to keep KCSAN happy.
> + */
> + WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
> + up_write(&vma->vm_lock.lock);
> +}
> +EXPORT_SYMBOL_GPL(__vma_start_write);
Do any modules need it? If not we shouldn't export.
> /*
> * Lookup and lock a VMA under RCU protection. Returned VMA is guaranteed to be
> * stable and not isolated. If the VMA is not found or is being modified the
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v7 10/17] mm: uninline the main body of vma_start_write()
2025-01-07 17:35 ` Vlastimil Babka
@ 2025-01-07 17:45 ` Suren Baghdasaryan
2025-01-07 18:51 ` Suren Baghdasaryan
2025-04-08 4:39 ` Eric Naim
1 sibling, 1 reply; 65+ messages in thread
From: Suren Baghdasaryan @ 2025-01-07 17:45 UTC (permalink / raw)
To: Vlastimil Babka
Cc: akpm, peterz, willy, liam.howlett, lorenzo.stoakes, mhocko,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team
On Tue, Jan 7, 2025 at 9:35 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 12/26/24 18:07, Suren Baghdasaryan wrote:
> > vma_start_write() is used in many places and will grow in size very soon.
> > It is not used in performance critical paths and uninlining it should
> > limit the future code size growth.
> > No functional changes.
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
>
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
>
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -6328,6 +6328,20 @@ struct vm_area_struct *lock_mm_and_find_vma(struct mm_struct *mm,
> > #endif
> >
> > #ifdef CONFIG_PER_VMA_LOCK
> > +void __vma_start_write(struct vm_area_struct *vma, unsigned int mm_lock_seq)
> > +{
> > + down_write(&vma->vm_lock.lock);
> > + /*
> > + * We should use WRITE_ONCE() here because we can have concurrent reads
> > + * from the early lockless pessimistic check in vma_start_read().
> > + * We don't really care about the correctness of that early check, but
> > + * we should use WRITE_ONCE() for cleanliness and to keep KCSAN happy.
> > + */
> > + WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
> > + up_write(&vma->vm_lock.lock);
> > +}
> > +EXPORT_SYMBOL_GPL(__vma_start_write);
>
> Do any modules need it? If not we shouldn't export.
I'm pretty sure I added it because of the allmodconfig build failure
but let me rerun it and see which module was using it.
>
> > /*
> > * Lookup and lock a VMA under RCU protection. Returned VMA is guaranteed to be
> > * stable and not isolated. If the VMA is not found or is being modified the
>
^ permalink raw reply [flat|nested] 65+ messages in thread* Re: [PATCH v7 10/17] mm: uninline the main body of vma_start_write()
2025-01-07 17:45 ` Suren Baghdasaryan
@ 2025-01-07 18:51 ` Suren Baghdasaryan
0 siblings, 0 replies; 65+ messages in thread
From: Suren Baghdasaryan @ 2025-01-07 18:51 UTC (permalink / raw)
To: Vlastimil Babka
Cc: akpm, peterz, willy, liam.howlett, lorenzo.stoakes, mhocko,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team
On Tue, Jan 7, 2025 at 9:45 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Tue, Jan 7, 2025 at 9:35 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> >
> > On 12/26/24 18:07, Suren Baghdasaryan wrote:
> > > vma_start_write() is used in many places and will grow in size very soon.
> > > It is not used in performance critical paths and uninlining it should
> > > limit the future code size growth.
> > > No functional changes.
> > >
> > > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> >
> > Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> >
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -6328,6 +6328,20 @@ struct vm_area_struct *lock_mm_and_find_vma(struct mm_struct *mm,
> > > #endif
> > >
> > > #ifdef CONFIG_PER_VMA_LOCK
> > > +void __vma_start_write(struct vm_area_struct *vma, unsigned int mm_lock_seq)
> > > +{
> > > + down_write(&vma->vm_lock.lock);
> > > + /*
> > > + * We should use WRITE_ONCE() here because we can have concurrent reads
> > > + * from the early lockless pessimistic check in vma_start_read().
> > > + * We don't really care about the correctness of that early check, but
> > > + * we should use WRITE_ONCE() for cleanliness and to keep KCSAN happy.
> > > + */
> > > + WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
> > > + up_write(&vma->vm_lock.lock);
> > > +}
> > > +EXPORT_SYMBOL_GPL(__vma_start_write);
> >
> > Do any modules need it? If not we shouldn't export.
>
> I'm pretty sure I added it because of the allmodconfig build failure
> but let me rerun it and see which module was using it.
Yeah, when building allmodconfig without this export I get:
ERROR: modpost: "__vma_start_write" [fs/ext4/ext4.ko] undefined!
ERROR: modpost: "__vma_start_write" [fs/cramfs/cramfs.ko] undefined!
ERROR: modpost: "__vma_start_write" [fs/fuse/fuse.ko] undefined!
ERROR: modpost: "__vma_start_write" [fs/orangefs/orangefs.ko] undefined!
ERROR: modpost: "__vma_start_write" [fs/xfs/xfs.ko] undefined!
ERROR: modpost: "__vma_start_write" [fs/erofs/erofs.ko] undefined!
ERROR: modpost: "__vma_start_write" [drivers/video/fbdev/core/fb.ko] undefined!
ERROR: modpost: "__vma_start_write" [drivers/acpi/pfr_telemetry.ko] undefined!
ERROR: modpost: "__vma_start_write" [drivers/dma/idxd/idxd.ko] undefined!
ERROR: modpost: "__vma_start_write" [drivers/xen/xen-gntdev.ko] undefined!
>
> >
> > > /*
> > > * Lookup and lock a VMA under RCU protection. Returned VMA is guaranteed to be
> > > * stable and not isolated. If the VMA is not found or is being modified the
> >
* Re: [PATCH v7 10/17] mm: uninline the main body of vma_start_write()
2025-01-07 17:35 ` Vlastimil Babka
2025-01-07 17:45 ` Suren Baghdasaryan
@ 2025-04-08 4:39 ` Eric Naim
2025-04-08 6:01 ` Christoph Hellwig
1 sibling, 1 reply; 65+ messages in thread
From: Eric Naim @ 2025-04-08 4:39 UTC (permalink / raw)
To: Vlastimil Babka, Suren Baghdasaryan, akpm
Cc: peterz, willy, liam.howlett, lorenzo.stoakes, mhocko, hannes,
mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team
On 1/8/25 01:35, Vlastimil Babka wrote:
> On 12/26/24 18:07, Suren Baghdasaryan wrote:
>> vma_start_write() is used in many places and will grow in size very soon.
>> It is not used in performance critical paths and uninlining it should
>> limit the future code size growth.
>> No functional changes.
>>
>> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
>
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
>
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -6328,6 +6328,20 @@ struct vm_area_struct *lock_mm_and_find_vma(struct mm_struct *mm,
>> #endif
>>
>> #ifdef CONFIG_PER_VMA_LOCK
>> +void __vma_start_write(struct vm_area_struct *vma, unsigned int mm_lock_seq)
>> +{
>> + down_write(&vma->vm_lock.lock);
>> + /*
>> + * We should use WRITE_ONCE() here because we can have concurrent reads
>> + * from the early lockless pessimistic check in vma_start_read().
>> + * We don't really care about the correctness of that early check, but
>> + * we should use WRITE_ONCE() for cleanliness and to keep KCSAN happy.
>> + */
>> + WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
>> + up_write(&vma->vm_lock.lock);
>> +}
>> +EXPORT_SYMBOL_GPL(__vma_start_write);
>
> Do any modules need it? If not we shouldn't export.
Hi Vlastimil, Suren
The out-of-tree NVIDIA modules seem to rely on this symbol. Is it possible to use EXPORT_SYMBOL() here instead of EXPORT_SYMBOL_GPL()? Below is the modpost error:
MODPOST Module.symvers
WARNING: modpost: missing MODULE_DESCRIPTION() in nvidia.o
WARNING: modpost: missing MODULE_DESCRIPTION() in nvidia-uvm.o
WARNING: modpost: missing MODULE_DESCRIPTION() in nvidia-modeset.o
WARNING: modpost: missing MODULE_DESCRIPTION() in nvidia-drm.o
ERROR: modpost: GPL-incompatible module nvidia.ko uses GPL-only symbol '__vma_start_write'
ERROR: modpost: GPL-incompatible module nvidia-drm.ko uses GPL-only symbol '__vma_start_write'
make[4]: *** [/tmp/makepkg/linux-cachyos-rc-nc/src/linux-6.15-rc1/scripts/Makefile.modpost:147: Module.symvers] Error 1
make[3]: *** [/tmp/makepkg/linux-cachyos-rc-nc/src/linux-6.15-rc1/Makefile:1964: modpost] Error 2
If it's possible I can send a patch that changes that.
>
>> /*
>> * Lookup and lock a VMA under RCU protection. Returned VMA is guaranteed to be
>> * stable and not isolated. If the VMA is not found or is being modified the
>
--
Regards,
Eric
* Re: [PATCH v7 10/17] mm: uninline the main body of vma_start_write()
2025-04-08 4:39 ` Eric Naim
@ 2025-04-08 6:01 ` Christoph Hellwig
2025-04-08 6:25 ` Lorenzo Stoakes
0 siblings, 1 reply; 65+ messages in thread
From: Christoph Hellwig @ 2025-04-08 6:01 UTC (permalink / raw)
To: Eric Naim
Cc: Vlastimil Babka, Suren Baghdasaryan, akpm, peterz, willy,
liam.howlett, lorenzo.stoakes, mhocko, hannes, mjguzik,
oliver.sang, mgorman, david, peterx, oleg, dave, paulmck,
brauner, dhowells, hdanton, hughd, lokeshgidra, minchan, jannh,
shakeel.butt, souravpanda, pasha.tatashin, klarasmodin, corbet,
linux-doc, linux-mm, linux-kernel, kernel-team
On Tue, Apr 08, 2025 at 12:39:25PM +0800, Eric Naim wrote:
> The out-of-tree NVIDIA modules seem to rely on this symbol. Is it possible to use EXPORT_SYMBOL() here instead of EXPORT_SYMBOL_GPL()? Below is the modpost error:
No. They don't have any business using this.
In fact vma_start_write should not be exported at all, just the
vm_flags_{set,clear,mod} helpers.
* Re: [PATCH v7 10/17] mm: uninline the main body of vma_start_write()
2025-04-08 6:01 ` Christoph Hellwig
@ 2025-04-08 6:25 ` Lorenzo Stoakes
2025-04-08 7:52 ` Eric Naim
0 siblings, 1 reply; 65+ messages in thread
From: Lorenzo Stoakes @ 2025-04-08 6:25 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Eric Naim, Vlastimil Babka, Suren Baghdasaryan, akpm, peterz,
willy, liam.howlett, mhocko, hannes, mjguzik, oliver.sang,
mgorman, david, peterx, oleg, dave, paulmck, brauner, dhowells,
hdanton, hughd, lokeshgidra, minchan, jannh, shakeel.butt,
souravpanda, pasha.tatashin, klarasmodin, corbet, linux-doc,
linux-mm, linux-kernel, kernel-team
On Mon, Apr 07, 2025 at 11:01:46PM -0700, Christoph Hellwig wrote:
> On Tue, Apr 08, 2025 at 12:39:25PM +0800, Eric Naim wrote:
> > The out-of-tree NVIDIA modules seem to rely on this symbol. Is it possible to use EXPORT_SYMBOL() here instead of EXPORT_SYMBOL_GPL()? Below is the modpost error:
>
> No. They don't have any business using this.
What on _earth_ are they using this for? Is this just via the VMA flag
manipulation functions? If it's something else, it's an unintended use of this.
Anyway, generally speaking - agreed, this is absolutely a no-go Eric. In my view
we simply should not be using EXPORT_SYMBOL() for _any_ new symbols whatsoever.
Out-of-tree modules are simply a non-consideration for core mm code, this is a
GPL open source project. If I had my way we'd simply revoke _all_
EXPORT_SYMBOL()'s, not add new ones.
>
> In fact vma_start_write should not be exported at all, just the
> vm_flags_{set,clear,mod} helpers.
Yup, I'd rather we just kept vma_start_write() mm-internal, though of course
kernel/fork.c (ugh) needs it (we could probably refactor that in some way to
avoid), and literally just the PPC arch (again maybe we can find a way round
that).
Maybe one for me to look at actually... hmm.
Anyway Eric - I wonder if this is simply the nvidia OOT driver doing a
vm_flags_...() call and then having an issue because the lock is uninlined now?
I guess you are just noticing this is breaking and don't know more, since it's
proprietary code.
Anyway in this case, the OOT driver should just write some GPL wrapper code or
something here. Or better yet - make the driver open source :)
* Re: [PATCH v7 10/17] mm: uninline the main body of vma_start_write()
2025-04-08 6:25 ` Lorenzo Stoakes
@ 2025-04-08 7:52 ` Eric Naim
2025-04-08 17:09 ` Suren Baghdasaryan
0 siblings, 1 reply; 65+ messages in thread
From: Eric Naim @ 2025-04-08 7:52 UTC (permalink / raw)
To: Lorenzo Stoakes, Christoph Hellwig
Cc: Vlastimil Babka, Suren Baghdasaryan, akpm, peterz, willy,
liam.howlett, mhocko, hannes, mjguzik, oliver.sang, mgorman,
david, peterx, oleg, dave, paulmck, brauner, dhowells, hdanton,
hughd, lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
pasha.tatashin, klarasmodin, corbet, linux-doc, linux-mm,
linux-kernel, kernel-team
On 4/8/25 14:25, Lorenzo Stoakes wrote:
> On Mon, Apr 07, 2025 at 11:01:46PM -0700, Christoph Hellwig wrote:
>> On Tue, Apr 08, 2025 at 12:39:25PM +0800, Eric Naim wrote:
> >>> The out-of-tree NVIDIA modules seem to rely on this symbol. Is it possible to use EXPORT_SYMBOL() here instead of EXPORT_SYMBOL_GPL()? Below is the modpost error:
>>
>> No. They don't have any business using this.
>
> What on _earth_ are they using this for? Is this just via the VMA flag
> manipulation functions? If it's something else, it's an unintended use of this.
>
> Anyway, generally speaking - agreed, this is absolutely a no-go Eric. In my view
> we simply should not be using EXPORT_SYMBOL() for _any_ new symbols whatsoever.
>
> Out-of-tree modules are simply a non-consideration for core mm code, this is a
> GPL open source project. If I had my way we'd simply revoke _all_
> EXPORT_SYMBOL()'s, not add new ones.
>
>>
>> In fact vma_start_write should not be exported at all, just the
>> vm_flags_{set,clear,mod} helpers.
>
> Yup, I'd rather we just kept vma_start_write() mm-internal, though of course
> kernel/fork.c (ugh) needs it (we could probably refactor that in some way to
> avoid), and literally just the PPC arch (again maybe we can find a way round
> that).
>
> Maybe one for me to look at actually... hmm.
>
> Anyway Eric - I wonder if this is simply the nvidia OOT driver doing a
> vm_flags_...() call and then having an issue because the lock is uninlined now?
>
> I guess you are just noticing this is breaking and don't know more, since it's
> proprietary code.
This seems to be the case; looking a bit deeper, it looks like the driver code
is calling at least one of the vm_flags_*() helpers, vm_flags_set(). I couldn't
find any direct calls to {,__}vma_start_write at first and was a bit confused.
>
> Anyway in this case, the OOT driver should just write some GPL wrapper code or
> something here. Or better yet - make the driver open source :)
Yeah, as obvious as it seems, it doesn't happen on their open-sourced code :)
Either way, I'm thankful for the replies. NVIDIA *should* have probably noticed
this already, and it would probably be fixed in the next driver release or two,
so I'll just let sleeping dogs lie.
--
Regards,
Eric
* Re: [PATCH v7 10/17] mm: uninline the main body of vma_start_write()
2025-04-08 7:52 ` Eric Naim
@ 2025-04-08 17:09 ` Suren Baghdasaryan
0 siblings, 0 replies; 65+ messages in thread
From: Suren Baghdasaryan @ 2025-04-08 17:09 UTC (permalink / raw)
To: Eric Naim
Cc: Lorenzo Stoakes, Christoph Hellwig, Vlastimil Babka, akpm,
peterz, willy, liam.howlett, mhocko, hannes, mjguzik,
oliver.sang, mgorman, david, peterx, oleg, dave, paulmck,
brauner, dhowells, hdanton, hughd, lokeshgidra, minchan, jannh,
shakeel.butt, souravpanda, pasha.tatashin, klarasmodin, corbet,
linux-doc, linux-mm, linux-kernel, kernel-team
On Tue, Apr 8, 2025 at 12:52 AM Eric Naim <dnaim@cachyos.org> wrote:
>
> On 4/8/25 14:25, Lorenzo Stoakes wrote:
> > On Mon, Apr 07, 2025 at 11:01:46PM -0700, Christoph Hellwig wrote:
> >> On Tue, Apr 08, 2025 at 12:39:25PM +0800, Eric Naim wrote:
> >>> The out-of-tree NVIDIA modules seem to rely on this symbol. Is it possible to use EXPORT_SYMBOL() here instead of EXPORT_SYMBOL_GPL()? Below is the modpost error:
> >>
> >> No. They don't have any business using this.
> >
> > What on _earth_ are they using this for? Is this just via the VMA flag
> > manipulation functions? If it's something else, it's an unintended use of this.
> >
> > Anyway, generally speaking - agreed, this is absolutely a no-go Eric. In my view
> > we simply should not be using EXPORT_SYMBOL() for _any_ new symbols whatsoever.
> >
> > Out-of-tree modules are simply a non-consideration for core mm code, this is a
> > GPL open source project. If I had my way we'd simply revoke _all_
> > EXPORT_SYMBOL()'s, not add new ones.
> >
> >>
> >> In fact vma_start_write should not be exported at all, just the
> >> vm_flags_{set,clear,mod} helpers.
> >
> > Yup, I'd rather we just kept vma_start_write() mm-internal, though of course
> > kernel/fork.c (ugh) needs it (we could probably refactor that in some way to
> > avoid), and literally just the PPC arch (again maybe we can find a way round
> > that).
> >
> > Maybe one for me to look at actually... hmm.
> >
> > Anyway Eric - I wonder if this is simply the nvidia OOT driver doing a
> > vm_flags_...() call and then having an issue because the lock is uninlined now?
> >
> > I guess you are just noticing this is breaking and don't know more, since it's
> > proprietary code.
>
>
> This seems to be the case; looking a bit deeper, it looks like the driver code
> is calling at least one of the vm_flags_*() helpers, vm_flags_set(). I couldn't
> find any direct calls to {,__}vma_start_write at first and was a bit confused.
I agree that EXPORT_SYMBOL_GPL() should be enough as far as exporting goes.
>
> >
> > Anyway in this case, the OOT driver should just write some GPL wrapper code or
> > something here. Or better yet - make the driver open source :)
>
> Yeah, as obvious as it seems, it doesn't happen on their open-sourced code :)
>
> Either way, I'm thankful for the replies. NVIDIA *should* have probably noticed
> this already, and it would probably be fixed in the next driver release or two,
> so I'll just let sleeping dogs lie.
>
>
> --
> Regards,
> Eric
* [PATCH v7 11/17] refcount: introduce __refcount_{add|inc}_not_zero_limited
2024-12-26 17:06 [PATCH v7 00/17] move per-vma lock into vm_area_struct Suren Baghdasaryan
` (9 preceding siblings ...)
2024-12-26 17:07 ` [PATCH v7 10/17] mm: uninline the main body of vma_start_write() Suren Baghdasaryan
@ 2024-12-26 17:07 ` Suren Baghdasaryan
2025-01-08 9:16 ` Vlastimil Babka
2024-12-26 17:07 ` [PATCH v7 12/17] mm: replace vm_lock and detached flag with a reference count Suren Baghdasaryan
` (5 subsequent siblings)
16 siblings, 1 reply; 65+ messages in thread
From: Suren Baghdasaryan @ 2024-12-26 17:07 UTC (permalink / raw)
To: akpm
Cc: peterz, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team, surenb
Introduce functions to increase refcount but with a top limit above
which they will fail to increase. Setting the limit to 0 indicates
no limit.
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
include/linux/refcount.h | 20 +++++++++++++++++++-
1 file changed, 19 insertions(+), 1 deletion(-)
diff --git a/include/linux/refcount.h b/include/linux/refcount.h
index 35f039ecb272..e51a49179307 100644
--- a/include/linux/refcount.h
+++ b/include/linux/refcount.h
@@ -137,13 +137,19 @@ static inline unsigned int refcount_read(const refcount_t *r)
}
static inline __must_check __signed_wrap
-bool __refcount_add_not_zero(int i, refcount_t *r, int *oldp)
+bool __refcount_add_not_zero_limited(int i, refcount_t *r, int *oldp,
+ int limit)
{
int old = refcount_read(r);
do {
if (!old)
break;
+ if (limit && old + i > limit) {
+ if (oldp)
+ *oldp = old;
+ return false;
+ }
} while (!atomic_try_cmpxchg_relaxed(&r->refs, &old, old + i));
if (oldp)
@@ -155,6 +161,12 @@ bool __refcount_add_not_zero(int i, refcount_t *r, int *oldp)
return old;
}
+static inline __must_check __signed_wrap
+bool __refcount_add_not_zero(int i, refcount_t *r, int *oldp)
+{
+ return __refcount_add_not_zero_limited(i, r, oldp, 0);
+}
+
/**
* refcount_add_not_zero - add a value to a refcount unless it is 0
* @i: the value to add to the refcount
@@ -213,6 +225,12 @@ static inline void refcount_add(int i, refcount_t *r)
__refcount_add(i, r, NULL);
}
+static inline __must_check bool __refcount_inc_not_zero_limited(refcount_t *r,
+ int *oldp, int limit)
+{
+ return __refcount_add_not_zero_limited(1, r, oldp, limit);
+}
+
static inline __must_check bool __refcount_inc_not_zero(refcount_t *r, int *oldp)
{
return __refcount_add_not_zero(1, r, oldp);
--
2.47.1.613.gc27f4b7a9f-goog
* Re: [PATCH v7 11/17] refcount: introduce __refcount_{add|inc}_not_zero_limited
2024-12-26 17:07 ` [PATCH v7 11/17] refcount: introduce __refcount_{add|inc}_not_zero_limited Suren Baghdasaryan
@ 2025-01-08 9:16 ` Vlastimil Babka
2025-01-08 15:06 ` Matthew Wilcox
0 siblings, 1 reply; 65+ messages in thread
From: Vlastimil Babka @ 2025-01-08 9:16 UTC (permalink / raw)
To: Suren Baghdasaryan, akpm
Cc: peterz, willy, liam.howlett, lorenzo.stoakes, mhocko, hannes,
mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team
On 12/26/24 18:07, Suren Baghdasaryan wrote:
> Introduce functions to increase refcount but with a top limit above
> which they will fail to increase. Setting the limit to 0 indicates
> no limit.
>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
> include/linux/refcount.h | 20 +++++++++++++++++++-
> 1 file changed, 19 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/refcount.h b/include/linux/refcount.h
> index 35f039ecb272..e51a49179307 100644
> --- a/include/linux/refcount.h
> +++ b/include/linux/refcount.h
> @@ -137,13 +137,19 @@ static inline unsigned int refcount_read(const refcount_t *r)
> }
>
> static inline __must_check __signed_wrap
> -bool __refcount_add_not_zero(int i, refcount_t *r, int *oldp)
> +bool __refcount_add_not_zero_limited(int i, refcount_t *r, int *oldp,
> + int limit)
> {
> int old = refcount_read(r);
>
> do {
> if (!old)
> break;
> + if (limit && old + i > limit) {
Should this be e.g. "old > limit - i" to avoid overflow and false negative
if someone sets limit close to INT_MAX?
> + if (oldp)
> + *oldp = old;
> + return false;
> + }
> } while (!atomic_try_cmpxchg_relaxed(&r->refs, &old, old + i));
>
> if (oldp)
> @@ -155,6 +161,12 @@ bool __refcount_add_not_zero(int i, refcount_t *r, int *oldp)
> return old;
> }
>
> +static inline __must_check __signed_wrap
> +bool __refcount_add_not_zero(int i, refcount_t *r, int *oldp)
> +{
> + return __refcount_add_not_zero_limited(i, r, oldp, 0);
> +}
> +
> /**
> * refcount_add_not_zero - add a value to a refcount unless it is 0
> * @i: the value to add to the refcount
> @@ -213,6 +225,12 @@ static inline void refcount_add(int i, refcount_t *r)
> __refcount_add(i, r, NULL);
> }
>
> +static inline __must_check bool __refcount_inc_not_zero_limited(refcount_t *r,
> + int *oldp, int limit)
> +{
> + return __refcount_add_not_zero_limited(1, r, oldp, limit);
> +}
> +
> static inline __must_check bool __refcount_inc_not_zero(refcount_t *r, int *oldp)
> {
> return __refcount_add_not_zero(1, r, oldp);
* Re: [PATCH v7 11/17] refcount: introduce __refcount_{add|inc}_not_zero_limited
2025-01-08 9:16 ` Vlastimil Babka
@ 2025-01-08 15:06 ` Matthew Wilcox
2025-01-08 15:45 ` Suren Baghdasaryan
2025-01-10 13:32 ` David Laight
0 siblings, 2 replies; 65+ messages in thread
From: Matthew Wilcox @ 2025-01-08 15:06 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Suren Baghdasaryan, akpm, peterz, liam.howlett, lorenzo.stoakes,
mhocko, hannes, mjguzik, oliver.sang, mgorman, david, peterx,
oleg, dave, paulmck, brauner, dhowells, hdanton, hughd,
lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
pasha.tatashin, klarasmodin, corbet, linux-doc, linux-mm,
linux-kernel, kernel-team
On Wed, Jan 08, 2025 at 10:16:04AM +0100, Vlastimil Babka wrote:
> > static inline __must_check __signed_wrap
> > -bool __refcount_add_not_zero(int i, refcount_t *r, int *oldp)
> > +bool __refcount_add_not_zero_limited(int i, refcount_t *r, int *oldp,
> > + int limit)
> > {
> > int old = refcount_read(r);
> >
> > do {
> > if (!old)
> > break;
> > + if (limit && old + i > limit) {
>
> Should this be e.g. "old > limit - i" to avoid overflow and false negative
> if someone sets limit close to INT_MAX?
Although 'i' might also be INT_MAX, whereas we know that old < limit.
So "i > limit - old" is the correct condition to check, IMO.
I'd further suggest that using a limit of 0 to mean "unlimited" introduces
an unnecessary arithmetic operation. Make 'limit' inclusive instead
of exclusive, pass INT_MAX instead of 0, and Vlastimil's suggestion,
and this becomes:
if (i > limit - old)
> > + if (oldp)
> > + *oldp = old;
> > + return false;
> > + }
> > } while (!atomic_try_cmpxchg_relaxed(&r->refs, &old, old + i));
...
> > +static inline __must_check __signed_wrap
> > +bool __refcount_add_not_zero(int i, refcount_t *r, int *oldp)
> > +{
> > + return __refcount_add_not_zero_limited(i, r, oldp, 0);
Just to be clear, this becomes:
return __refcount_add_not_zero_limited(i, r, oldp, INT_MAX);
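(For anyone following along, a small userspace illustration of the overflow being
discussed: the values are made up, the wrapping add only emulates the kernel's
__signed_wrap semantics, and plain ints stand in for refcount_t.)

#include <limits.h>
#include <stdio.h>

/* Both checks try to answer: would "old + i" exceed "limit"? */
static int check_wrapping(int i, int old, int limit)
{
        /* Emulates the kernel's wrapping signed add (__signed_wrap).
         * With old close to INT_MAX the sum wraps negative, so the
         * comparison wrongly reports "still within the limit". */
        int sum = (int)((unsigned int)old + (unsigned int)i);

        return sum > limit;
}

static int check_safe(int i, int old, int limit)
{
        /* old is known to be <= limit, so limit - old cannot wrap. */
        return i > limit - old;
}

int main(void)
{
        int old = INT_MAX - 1, i = 2, limit = INT_MAX;

        printf("wrapping form says over-limit: %d\n",
               check_wrapping(i, old, limit));   /* 0 - false negative */
        printf("safe form says over-limit:     %d\n",
               check_safe(i, old, limit));       /* 1 - correctly refused */
        return 0;
}

So with a limit near INT_MAX the "old + i > limit" form lets the increment
through, while "i > limit - old" rejects it.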
* Re: [PATCH v7 11/17] refcount: introduce __refcount_{add|inc}_not_zero_limited
2025-01-08 15:06 ` Matthew Wilcox
@ 2025-01-08 15:45 ` Suren Baghdasaryan
2025-01-10 13:32 ` David Laight
1 sibling, 0 replies; 65+ messages in thread
From: Suren Baghdasaryan @ 2025-01-08 15:45 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Vlastimil Babka, akpm, peterz, liam.howlett, lorenzo.stoakes,
mhocko, hannes, mjguzik, oliver.sang, mgorman, david, peterx,
oleg, dave, paulmck, brauner, dhowells, hdanton, hughd,
lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
pasha.tatashin, klarasmodin, corbet, linux-doc, linux-mm,
linux-kernel, kernel-team
On Wed, Jan 8, 2025 at 7:06 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Wed, Jan 08, 2025 at 10:16:04AM +0100, Vlastimil Babka wrote:
> > > static inline __must_check __signed_wrap
> > > -bool __refcount_add_not_zero(int i, refcount_t *r, int *oldp)
> > > +bool __refcount_add_not_zero_limited(int i, refcount_t *r, int *oldp,
> > > + int limit)
> > > {
> > > int old = refcount_read(r);
> > >
> > > do {
> > > if (!old)
> > > break;
> > > + if (limit && old + i > limit) {
> >
> > Should this be e.g. "old > limit - i" to avoid overflow and false negative
> > if someone sets limit close to INT_MAX?
>
> Although 'i' might also be INT_MAX, whereas we know that old < limit.
> So "i > limit - old" is the correct condition to check, IMO.
>
> I'd further suggest that using a limit of 0 to mean "unlimited" introduces
> an unnecessary arithmetic operation. Make 'limit' inclusive instead
> of exclusive, pass INT_MAX instead of 0, and Vlastimil's suggestion,
> and this becomes:
>
> if (i > limit - old)
Thanks for the suggestions, Vlastimil and Matthew! Yes, this looks
much better. Will use it in the next version.
>
> > > + if (oldp)
> > > + *oldp = old;
> > > + return false;
> > > + }
> > > } while (!atomic_try_cmpxchg_relaxed(&r->refs, &old, old + i));
>
> ...
>
> > > +static inline __must_check __signed_wrap
> > > +bool __refcount_add_not_zero(int i, refcount_t *r, int *oldp)
> > > +{
> > > + return __refcount_add_not_zero_limited(i, r, oldp, 0);
>
> Just to be clear, this becomes:
>
> return __refcount_add_not_zero_limited(i, r, oldp, INT_MAX);
Ack.
>
* Re: [PATCH v7 11/17] refcount: introduce __refcount_{add|inc}_not_zero_limited
2025-01-08 15:06 ` Matthew Wilcox
2025-01-08 15:45 ` Suren Baghdasaryan
@ 2025-01-10 13:32 ` David Laight
2025-01-10 16:29 ` Suren Baghdasaryan
1 sibling, 1 reply; 65+ messages in thread
From: David Laight @ 2025-01-10 13:32 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Vlastimil Babka, Suren Baghdasaryan, akpm, peterz, liam.howlett,
lorenzo.stoakes, mhocko, hannes, mjguzik, oliver.sang, mgorman,
david, peterx, oleg, dave, paulmck, brauner, dhowells, hdanton,
hughd, lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
pasha.tatashin, klarasmodin, corbet, linux-doc, linux-mm,
linux-kernel, kernel-team
On Wed, 8 Jan 2025 15:06:17 +0000
Matthew Wilcox <willy@infradead.org> wrote:
> On Wed, Jan 08, 2025 at 10:16:04AM +0100, Vlastimil Babka wrote:
> > > static inline __must_check __signed_wrap
> > > -bool __refcount_add_not_zero(int i, refcount_t *r, int *oldp)
> > > +bool __refcount_add_not_zero_limited(int i, refcount_t *r, int *oldp,
> > > + int limit)
> > > {
> > > int old = refcount_read(r);
> > >
> > > do {
> > > if (!old)
> > > break;
> > > + if (limit && old + i > limit) {
> >
> > Should this be e.g. "old > limit - i" to avoid overflow and false negative
> > if someone sets limit close to INT_MAX?
>
> Although 'i' might also be INT_MAX, whereas we know that old < limit.
> So "i > limit - old" is the correct condition to check, IMO.
>
> I'd further suggest that using a limit of 0 to mean "unlimited" introduces
> an unnecessary arithmetic operation. Make 'limit' inclusive instead
> of exclusive, pass INT_MAX instead of 0, and Vlastimil's suggestion,
> and this becomes:
>
> if (i > limit - old)
>
...
The problem with that is the compiler is unlikely to optimise it away.
Perhaps:
if (statically_true(!limit || limit == INT_MAX))
continue;
if (i > limit - old) {
...
David
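(A hedged userspace approximation of what the constant check buys: when the limit
argument is the compile-time constant INT_MAX, as in the plain
__refcount_add_not_zero() wrapper, the whole bound test can be folded away. The
statically_true() below is only a local stand-in for the kernel macro, built on
the same __builtin_constant_p() idea.)

#include <limits.h>
#include <stdbool.h>
#include <stdio.h>

/* Local stand-in for the kernel helper (GCC/Clang builtin). */
#define statically_true(x) (__builtin_constant_p(x) && (x))

static inline bool over_limit(int i, int old, int limit)
{
        /* For a constant limit == INT_MAX this folds to "return false"
         * under optimization, so the subtraction and branch disappear. */
        if (statically_true(limit == INT_MAX))
                return false;
        return i > limit - old;
}

int main(void)
{
        printf("%d\n", over_limit(1, 100, INT_MAX)); /* 0, check elided at -O2 */
        printf("%d\n", over_limit(5, 3, 7));         /* 1, 3 + 5 exceeds 7 */
        return 0;
}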
* Re: [PATCH v7 11/17] refcount: introduce __refcount_{add|inc}_not_zero_limited
2025-01-10 13:32 ` David Laight
@ 2025-01-10 16:29 ` Suren Baghdasaryan
0 siblings, 0 replies; 65+ messages in thread
From: Suren Baghdasaryan @ 2025-01-10 16:29 UTC (permalink / raw)
To: David Laight
Cc: Matthew Wilcox, Vlastimil Babka, akpm, peterz, liam.howlett,
lorenzo.stoakes, mhocko, hannes, mjguzik, oliver.sang, mgorman,
david, peterx, oleg, dave, paulmck, brauner, dhowells, hdanton,
hughd, lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
pasha.tatashin, klarasmodin, corbet, linux-doc, linux-mm,
linux-kernel, kernel-team
On Fri, Jan 10, 2025 at 5:32 AM David Laight
<david.laight.linux@gmail.com> wrote:
>
> On Wed, 8 Jan 2025 15:06:17 +0000
> Matthew Wilcox <willy@infradead.org> wrote:
>
> > On Wed, Jan 08, 2025 at 10:16:04AM +0100, Vlastimil Babka wrote:
> > > > static inline __must_check __signed_wrap
> > > > -bool __refcount_add_not_zero(int i, refcount_t *r, int *oldp)
> > > > +bool __refcount_add_not_zero_limited(int i, refcount_t *r, int *oldp,
> > > > + int limit)
> > > > {
> > > > int old = refcount_read(r);
> > > >
> > > > do {
> > > > if (!old)
> > > > break;
> > > > + if (limit && old + i > limit) {
> > >
> > > Should this be e.g. "old > limit - i" to avoid overflow and false negative
> > > if someone sets limit close to INT_MAX?
> >
> > Although 'i' might also be INT_MAX, whereas we know that old < limit.
> > So "i > limit - old" is the correct condition to check, IMO.
> >
> > I'd further suggest that using a limit of 0 to mean "unlimited" introduces
> > an unnecessary arithmetic operation. Make 'limit' inclusive instead
> > of exclusive, pass INT_MAX instead of 0, and Vlastimil's suggestion,
> > and this becomes:
> >
> > if (i > limit - old)
> >
> ...
>
> The problem with that is the compiler is unlikely to optimise it away.
> Perhaps:
> if (statically_true(!limit || limit == INT_MAX))
> continue;
> if (i > limit - old) {
> ...
Thanks for the comment! I think it makes sense.
For the reference, the new version of this patch is here:
https://lore.kernel.org/all/20250109023025.2242447-11-surenb@google.com/
If I apply your suggestion to that version it should look like this:
+bool __refcount_add_not_zero_limited(int i, refcount_t *r, int *oldp,
+ int limit)
{
int old = refcount_read(r);
do {
if (!old)
break;
+
+ if (statically_true(limit == INT_MAX))
+ continue;
+
+ if (i > limit - old) {
+ if (oldp)
+ *oldp = old;
+ return false;
+ }
} while (!atomic_try_cmpxchg_relaxed(&r->refs, &old, old + i));
I'll update the patch with this and let's see if everyone agrees.
>
> David
>
>
* [PATCH v7 12/17] mm: replace vm_lock and detached flag with a reference count
2024-12-26 17:06 [PATCH v7 00/17] move per-vma lock into vm_area_struct Suren Baghdasaryan
` (10 preceding siblings ...)
2024-12-26 17:07 ` [PATCH v7 11/17] refcount: introduce __refcount_{add|inc}_not_zero_limited Suren Baghdasaryan
@ 2024-12-26 17:07 ` Suren Baghdasaryan
2025-01-06 0:38 ` Wei Yang
` (2 more replies)
2024-12-26 17:07 ` [PATCH v7 13/17] mm/debug: print vm_refcnt state when dumping the vma Suren Baghdasaryan
` (4 subsequent siblings)
16 siblings, 3 replies; 65+ messages in thread
From: Suren Baghdasaryan @ 2024-12-26 17:07 UTC (permalink / raw)
To: akpm
Cc: peterz, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team, surenb
rw_semaphore is a sizable structure of 40 bytes and consumes
considerable space for each vm_area_struct. However vma_lock has
two important specifics which can be used to replace rw_semaphore
with a simpler structure:
1. Readers never wait. They try to take the vma_lock and fall back to
mmap_lock if that fails.
2. Only one writer at a time will ever try to write-lock a vma_lock
because writers first take mmap_lock in write mode.
Because of these requirements, full rw_semaphore functionality is not
needed and we can replace rw_semaphore and the vma->detached flag with
a refcount (vm_refcnt).
When vma is in detached state, vm_refcnt is 0 and only a call to
vma_mark_attached() can take it out of this state. Note that unlike
before, now we enforce both vma_mark_attached() and vma_mark_detached()
to be done only after vma has been write-locked. vma_mark_attached()
changes vm_refcnt to 1 to indicate that it has been attached to the vma
tree. When a reader takes read lock, it increments vm_refcnt, unless the
top usable bit of vm_refcnt (0x40000000) is set, indicating presence of
a writer. When writer takes write lock, it both increments vm_refcnt and
sets the top usable bit to indicate its presence. If there are readers,
writer will wait using newly introduced mm->vma_writer_wait. Since all
writers take mmap_lock in write mode first, there can be only one writer
at a time. The last reader to release the lock will signal the writer
to wake up.
refcount might overflow if there are many competing readers, in which case
read-locking will fail. Readers are expected to handle such failures.
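A minimal userspace sketch of this counting scheme, for illustration only:
single-threaded, with the rcuwait-based sleeping/wakeup elided and the constants
copied from the patch; the real reader/writer paths are vma_start_read() and
__vma_start_write() in the diff below.

#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define VMA_LOCK_OFFSET 0x40000000
#define VMA_REF_LIMIT   (VMA_LOCK_OFFSET - 2)

static atomic_int vm_refcnt;   /* 0 = detached, 1 = attached with no users */

static bool reader_trylock(void)
{
        int old = atomic_load(&vm_refcnt);

        do {
                /* Detached vma, writer present, or too many readers:
                 * the caller falls back to taking mmap_lock instead. */
                if (old == 0 || old + 1 > VMA_REF_LIMIT)
                        return false;
        } while (!atomic_compare_exchange_weak(&vm_refcnt, &old, old + 1));
        return true;
}

static void reader_unlock(void)
{
        /* The last reader leaving while the writer bit is set would
         * wake the writer sleeping on mm->vma_writer_wait. */
        atomic_fetch_sub(&vm_refcnt, 1);
}

static void writer_lock(void)
{
        /* Announce the writer; the real code then waits until only the
         * writer bit and the attached count remain. */
        atomic_fetch_add(&vm_refcnt, VMA_LOCK_OFFSET);
}

static void writer_unlock(void)
{
        atomic_fetch_sub(&vm_refcnt, VMA_LOCK_OFFSET);
}

int main(void)
{
        atomic_store(&vm_refcnt, 1);       /* vma_mark_attached() */
        assert(reader_trylock());          /* count is now 2 */
        writer_lock();                     /* writer bit set, waits for reader */
        assert(!reader_trylock());         /* new readers are refused */
        reader_unlock();                   /* last reader gone, writer runs */
        writer_unlock();
        assert(atomic_load(&vm_refcnt) == 1);
        printf("ok\n");
        return 0;
}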
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
include/linux/mm.h | 100 +++++++++++++++++++++----------
include/linux/mm_types.h | 22 ++++---
kernel/fork.c | 13 ++--
mm/init-mm.c | 1 +
mm/memory.c | 68 +++++++++++++++++----
tools/testing/vma/linux/atomic.h | 5 ++
tools/testing/vma/vma_internal.h | 66 +++++++++++---------
7 files changed, 185 insertions(+), 90 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ea4c4228b125..99f4720d7e51 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -32,6 +32,7 @@
#include <linux/memremap.h>
#include <linux/slab.h>
#include <linux/cacheinfo.h>
+#include <linux/rcuwait.h>
struct mempolicy;
struct anon_vma;
@@ -697,12 +698,34 @@ static inline void vma_numab_state_free(struct vm_area_struct *vma) {}
#endif /* CONFIG_NUMA_BALANCING */
#ifdef CONFIG_PER_VMA_LOCK
-static inline void vma_lock_init(struct vm_area_struct *vma)
+static inline void vma_lockdep_init(struct vm_area_struct *vma)
{
- init_rwsem(&vma->vm_lock.lock);
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+ static struct lock_class_key lockdep_key;
+
+ lockdep_init_map(&vma->vmlock_dep_map, "vm_lock", &lockdep_key, 0);
+#endif
+}
+
+static inline void vma_init_lock(struct vm_area_struct *vma, bool reset_refcnt)
+{
+ if (reset_refcnt)
+ refcount_set(&vma->vm_refcnt, 0);
vma->vm_lock_seq = UINT_MAX;
}
+static inline void vma_refcount_put(struct vm_area_struct *vma)
+{
+ int refcnt;
+
+ if (!__refcount_dec_and_test(&vma->vm_refcnt, &refcnt)) {
+ rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
+
+ if (refcnt & VMA_LOCK_OFFSET)
+ rcuwait_wake_up(&vma->vm_mm->vma_writer_wait);
+ }
+}
+
/*
* Try to read-lock a vma. The function is allowed to occasionally yield false
* locked result to avoid performance overhead, in which case we fall back to
@@ -710,6 +733,8 @@ static inline void vma_lock_init(struct vm_area_struct *vma)
*/
static inline bool vma_start_read(struct vm_area_struct *vma)
{
+ int oldcnt;
+
/*
* Check before locking. A race might cause false locked result.
* We can use READ_ONCE() for the mm_lock_seq here, and don't need
@@ -720,13 +745,20 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq.sequence))
return false;
- if (unlikely(down_read_trylock(&vma->vm_lock.lock) == 0))
+
+ rwsem_acquire_read(&vma->vmlock_dep_map, 0, 0, _RET_IP_);
+ /* Limit at VMA_REF_LIMIT to leave one count for a writer */
+ if (unlikely(!__refcount_inc_not_zero_limited(&vma->vm_refcnt, &oldcnt,
+ VMA_REF_LIMIT))) {
+ rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
return false;
+ }
+ lock_acquired(&vma->vmlock_dep_map, _RET_IP_);
/*
- * Overflow might produce false locked result.
+ * Overflow of vm_lock_seq/mm_lock_seq might produce false locked result.
* False unlocked result is impossible because we modify and check
- * vma->vm_lock_seq under vma->vm_lock protection and mm->mm_lock_seq
+ * vma->vm_lock_seq under vma->vm_refcnt protection and mm->mm_lock_seq
* modification invalidates all existing locks.
*
* We must use ACQUIRE semantics for the mm_lock_seq so that if we are
@@ -734,10 +766,12 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
* after it has been unlocked.
* This pairs with RELEASE semantics in vma_end_write_all().
*/
- if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&vma->vm_mm->mm_lock_seq))) {
- up_read(&vma->vm_lock.lock);
+ if (unlikely(oldcnt & VMA_LOCK_OFFSET ||
+ vma->vm_lock_seq == raw_read_seqcount(&vma->vm_mm->mm_lock_seq))) {
+ vma_refcount_put(vma);
return false;
}
+
return true;
}
@@ -749,8 +783,17 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
*/
static inline bool vma_start_read_locked_nested(struct vm_area_struct *vma, int subclass)
{
+ int oldcnt;
+
mmap_assert_locked(vma->vm_mm);
- down_read_nested(&vma->vm_lock.lock, subclass);
+ rwsem_acquire_read(&vma->vmlock_dep_map, subclass, 0, _RET_IP_);
+ /* Limit at VMA_REF_LIMIT to leave one count for a writer */
+ if (unlikely(!__refcount_inc_not_zero_limited(&vma->vm_refcnt, &oldcnt,
+ VMA_REF_LIMIT))) {
+ rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
+ return false;
+ }
+ lock_acquired(&vma->vmlock_dep_map, _RET_IP_);
return true;
}
@@ -762,15 +805,13 @@ static inline bool vma_start_read_locked_nested(struct vm_area_struct *vma, int
*/
static inline bool vma_start_read_locked(struct vm_area_struct *vma)
{
- mmap_assert_locked(vma->vm_mm);
- down_read(&vma->vm_lock.lock);
- return true;
+ return vma_start_read_locked_nested(vma, 0);
}
static inline void vma_end_read(struct vm_area_struct *vma)
{
rcu_read_lock(); /* keeps vma alive till the end of up_read */
- up_read(&vma->vm_lock.lock);
+ vma_refcount_put(vma);
rcu_read_unlock();
}
@@ -813,36 +854,33 @@ static inline void vma_assert_write_locked(struct vm_area_struct *vma)
static inline void vma_assert_locked(struct vm_area_struct *vma)
{
- if (!rwsem_is_locked(&vma->vm_lock.lock))
+ if (refcount_read(&vma->vm_refcnt) <= 1)
vma_assert_write_locked(vma);
}
+/*
+ * WARNING: to avoid racing with vma_mark_attached()/vma_mark_detached(), these
+ * assertions should be made either under mmap_write_lock or when the object
+ * has been isolated under mmap_write_lock, ensuring no competing writers.
+ */
static inline void vma_assert_attached(struct vm_area_struct *vma)
{
- VM_BUG_ON_VMA(vma->detached, vma);
+ VM_BUG_ON_VMA(!refcount_read(&vma->vm_refcnt), vma);
}
static inline void vma_assert_detached(struct vm_area_struct *vma)
{
- VM_BUG_ON_VMA(!vma->detached, vma);
+ VM_BUG_ON_VMA(refcount_read(&vma->vm_refcnt), vma);
}
static inline void vma_mark_attached(struct vm_area_struct *vma)
{
- vma->detached = false;
-}
-
-static inline void vma_mark_detached(struct vm_area_struct *vma)
-{
- /* When detaching vma should be write-locked */
vma_assert_write_locked(vma);
- vma->detached = true;
+ vma_assert_detached(vma);
+ refcount_set(&vma->vm_refcnt, 1);
}
-static inline bool is_vma_detached(struct vm_area_struct *vma)
-{
- return vma->detached;
-}
+void vma_mark_detached(struct vm_area_struct *vma);
static inline void release_fault_lock(struct vm_fault *vmf)
{
@@ -865,7 +903,8 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
#else /* CONFIG_PER_VMA_LOCK */
-static inline void vma_lock_init(struct vm_area_struct *vma) {}
+static inline void vma_lockdep_init(struct vm_area_struct *vma) {}
+static inline void vma_init_lock(struct vm_area_struct *vma, bool reset_refcnt) {}
static inline bool vma_start_read(struct vm_area_struct *vma)
{ return false; }
static inline void vma_end_read(struct vm_area_struct *vma) {}
@@ -908,12 +947,9 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
vma->vm_mm = mm;
vma->vm_ops = &vma_dummy_vm_ops;
INIT_LIST_HEAD(&vma->anon_vma_chain);
-#ifdef CONFIG_PER_VMA_LOCK
- /* vma is not locked, can't use vma_mark_detached() */
- vma->detached = true;
-#endif
vma_numab_state_init(vma);
- vma_lock_init(vma);
+ vma_lockdep_init(vma);
+ vma_init_lock(vma, false);
}
/* Use when VMA is not part of the VMA tree and needs no locking */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 6573d95f1d1e..b5312421dec6 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -19,6 +19,7 @@
#include <linux/workqueue.h>
#include <linux/seqlock.h>
#include <linux/percpu_counter.h>
+#include <linux/types.h>
#include <asm/mmu.h>
@@ -629,9 +630,8 @@ static inline struct anon_vma_name *anon_vma_name_alloc(const char *name)
}
#endif
-struct vma_lock {
- struct rw_semaphore lock;
-};
+#define VMA_LOCK_OFFSET 0x40000000
+#define VMA_REF_LIMIT (VMA_LOCK_OFFSET - 2)
struct vma_numab_state {
/*
@@ -709,19 +709,13 @@ struct vm_area_struct {
};
#ifdef CONFIG_PER_VMA_LOCK
- /*
- * Flag to indicate areas detached from the mm->mm_mt tree.
- * Unstable RCU readers are allowed to read this.
- */
- bool detached;
-
/*
* Can only be written (using WRITE_ONCE()) while holding both:
* - mmap_lock (in write mode)
- * - vm_lock->lock (in write mode)
+ * - vm_refcnt bit at VMA_LOCK_OFFSET is set
* Can be read reliably while holding one of:
* - mmap_lock (in read or write mode)
- * - vm_lock->lock (in read or write mode)
+ * - vm_refcnt bit at VMA_LOCK_OFFSET is set or vm_refcnt > 1
* Can be read unreliably (using READ_ONCE()) for pessimistic bailout
* while holding nothing (except RCU to keep the VMA struct allocated).
*
@@ -784,7 +778,10 @@ struct vm_area_struct {
struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
#ifdef CONFIG_PER_VMA_LOCK
/* Unstable RCU readers are allowed to read this. */
- struct vma_lock vm_lock ____cacheline_aligned_in_smp;
+ refcount_t vm_refcnt ____cacheline_aligned_in_smp;
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+ struct lockdep_map vmlock_dep_map;
+#endif
#endif
} __randomize_layout;
@@ -919,6 +916,7 @@ struct mm_struct {
* by mmlist_lock
*/
#ifdef CONFIG_PER_VMA_LOCK
+ struct rcuwait vma_writer_wait;
/*
* This field has lock-like semantics, meaning it is sometimes
* accessed with ACQUIRE/RELEASE semantics.
diff --git a/kernel/fork.c b/kernel/fork.c
index d4c75428ccaf..7a0800d48112 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -463,12 +463,8 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
* will be reinitialized.
*/
data_race(memcpy(new, orig, sizeof(*new)));
- vma_lock_init(new);
+ vma_init_lock(new, true);
INIT_LIST_HEAD(&new->anon_vma_chain);
-#ifdef CONFIG_PER_VMA_LOCK
- /* vma is not locked, can't use vma_mark_detached() */
- new->detached = true;
-#endif
vma_numab_state_init(new);
dup_anon_vma_name(orig, new);
@@ -477,6 +473,8 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
void __vm_area_free(struct vm_area_struct *vma)
{
+ /* The vma should be detached while being destroyed. */
+ vma_assert_detached(vma);
vma_numab_state_free(vma);
free_anon_vma_name(vma);
kmem_cache_free(vm_area_cachep, vma);
@@ -488,8 +486,6 @@ static void vm_area_free_rcu_cb(struct rcu_head *head)
struct vm_area_struct *vma = container_of(head, struct vm_area_struct,
vm_rcu);
- /* The vma should not be locked while being destroyed. */
- VM_BUG_ON_VMA(rwsem_is_locked(&vma->vm_lock.lock), vma);
__vm_area_free(vma);
}
#endif
@@ -1223,6 +1219,9 @@ static inline void mmap_init_lock(struct mm_struct *mm)
{
init_rwsem(&mm->mmap_lock);
mm_lock_seqcount_init(mm);
+#ifdef CONFIG_PER_VMA_LOCK
+ rcuwait_init(&mm->vma_writer_wait);
+#endif
}
static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
diff --git a/mm/init-mm.c b/mm/init-mm.c
index 6af3ad675930..4600e7605cab 100644
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -40,6 +40,7 @@ struct mm_struct init_mm = {
.arg_lock = __SPIN_LOCK_UNLOCKED(init_mm.arg_lock),
.mmlist = LIST_HEAD_INIT(init_mm.mmlist),
#ifdef CONFIG_PER_VMA_LOCK
+ .vma_writer_wait = __RCUWAIT_INITIALIZER(init_mm.vma_writer_wait),
.mm_lock_seq = SEQCNT_ZERO(init_mm.mm_lock_seq),
#endif
.user_ns = &init_user_ns,
diff --git a/mm/memory.c b/mm/memory.c
index 236fdecd44d6..2def47b5dff0 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6328,9 +6328,39 @@ struct vm_area_struct *lock_mm_and_find_vma(struct mm_struct *mm,
#endif
#ifdef CONFIG_PER_VMA_LOCK
+static inline bool __vma_enter_locked(struct vm_area_struct *vma, unsigned int tgt_refcnt)
+{
+ /*
+ * If vma is detached then only vma_mark_attached() can raise the
+ * vm_refcnt. mmap_write_lock prevents racing with vma_mark_attached().
+ */
+ if (!refcount_inc_not_zero(&vma->vm_refcnt))
+ return false;
+
+ rwsem_acquire(&vma->vmlock_dep_map, 0, 0, _RET_IP_);
+ /* vma is attached, set the writer present bit */
+ refcount_add(VMA_LOCK_OFFSET, &vma->vm_refcnt);
+ rcuwait_wait_event(&vma->vm_mm->vma_writer_wait,
+ refcount_read(&vma->vm_refcnt) == tgt_refcnt,
+ TASK_UNINTERRUPTIBLE);
+ lock_acquired(&vma->vmlock_dep_map, _RET_IP_);
+
+ return true;
+}
+
+static inline void __vma_exit_locked(struct vm_area_struct *vma, bool *detached)
+{
+ *detached = refcount_sub_and_test(VMA_LOCK_OFFSET + 1, &vma->vm_refcnt);
+ rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
+}
+
void __vma_start_write(struct vm_area_struct *vma, unsigned int mm_lock_seq)
{
- down_write(&vma->vm_lock.lock);
+ bool locked;
+
+ /* Wait until refcnt is (VMA_LOCK_OFFSET + 2) => attached with no readers */
+ locked = __vma_enter_locked(vma, VMA_LOCK_OFFSET + 2);
+
/*
* We should use WRITE_ONCE() here because we can have concurrent reads
* from the early lockless pessimistic check in vma_start_read().
@@ -6338,10 +6368,36 @@ void __vma_start_write(struct vm_area_struct *vma, unsigned int mm_lock_seq)
* we should use WRITE_ONCE() for cleanliness and to keep KCSAN happy.
*/
WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
- up_write(&vma->vm_lock.lock);
+
+ if (locked) {
+ bool detached;
+
+ __vma_exit_locked(vma, &detached);
+ VM_BUG_ON_VMA(detached, vma); /* vma should remain attached */
+ }
}
EXPORT_SYMBOL_GPL(__vma_start_write);
+void vma_mark_detached(struct vm_area_struct *vma)
+{
+ vma_assert_write_locked(vma);
+ vma_assert_attached(vma);
+
+ /* We are the only writer, so no need to use vma_refcount_put(). */
+ if (unlikely(!refcount_dec_and_test(&vma->vm_refcnt))) {
+ /*
+ * Wait until refcnt is (VMA_LOCK_OFFSET + 1) => detached with
+ * no readers
+ */
+ if (__vma_enter_locked(vma, VMA_LOCK_OFFSET + 1)) {
+ bool detached;
+
+ __vma_exit_locked(vma, &detached);
+ VM_BUG_ON_VMA(!detached, vma);
+ }
+ }
+}
+
/*
* Lookup and lock a VMA under RCU protection. Returned VMA is guaranteed to be
* stable and not isolated. If the VMA is not found or is being modified the
@@ -6354,7 +6410,6 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
struct vm_area_struct *vma;
rcu_read_lock();
-retry:
vma = mas_walk(&mas);
if (!vma)
goto inval;
@@ -6362,13 +6417,6 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
if (!vma_start_read(vma))
goto inval;
- /* Check if the VMA got isolated after we found it */
- if (is_vma_detached(vma)) {
- vma_end_read(vma);
- count_vm_vma_lock_event(VMA_LOCK_MISS);
- /* The area was replaced with another one */
- goto retry;
- }
/*
* At this point, we have a stable reference to a VMA: The VMA is
* locked and we know it hasn't already been isolated.
diff --git a/tools/testing/vma/linux/atomic.h b/tools/testing/vma/linux/atomic.h
index e01f66f98982..2e2021553196 100644
--- a/tools/testing/vma/linux/atomic.h
+++ b/tools/testing/vma/linux/atomic.h
@@ -9,4 +9,9 @@
#define atomic_set(x, y) do {} while (0)
#define U8_MAX UCHAR_MAX
+#ifndef atomic_cmpxchg_relaxed
+#define atomic_cmpxchg_relaxed uatomic_cmpxchg
+#define atomic_cmpxchg_release uatomic_cmpxchg
+#endif /* atomic_cmpxchg_relaxed */
+
#endif /* _LINUX_ATOMIC_H */
diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
index 2a624f9304da..1e8cd2f013fa 100644
--- a/tools/testing/vma/vma_internal.h
+++ b/tools/testing/vma/vma_internal.h
@@ -25,7 +25,7 @@
#include <linux/maple_tree.h>
#include <linux/mm.h>
#include <linux/rbtree.h>
-#include <linux/rwsem.h>
+#include <linux/refcount.h>
extern unsigned long stack_guard_gap;
#ifdef CONFIG_MMU
@@ -132,10 +132,6 @@ typedef __bitwise unsigned int vm_fault_t;
*/
#define pr_warn_once pr_err
-typedef struct refcount_struct {
- atomic_t refs;
-} refcount_t;
-
struct kref {
refcount_t refcount;
};
@@ -228,15 +224,12 @@ struct mm_struct {
unsigned long def_flags;
};
-struct vma_lock {
- struct rw_semaphore lock;
-};
-
-
struct file {
struct address_space *f_mapping;
};
+#define VMA_LOCK_OFFSET 0x40000000
+
struct vm_area_struct {
/* The first cache line has the info for VMA tree walking. */
@@ -264,16 +257,13 @@ struct vm_area_struct {
};
#ifdef CONFIG_PER_VMA_LOCK
- /* Flag to indicate areas detached from the mm->mm_mt tree */
- bool detached;
-
/*
* Can only be written (using WRITE_ONCE()) while holding both:
* - mmap_lock (in write mode)
- * - vm_lock.lock (in write mode)
+ * - vm_refcnt bit at VMA_LOCK_OFFSET is set
* Can be read reliably while holding one of:
* - mmap_lock (in read or write mode)
- * - vm_lock.lock (in read or write mode)
+ * - vm_refcnt bit at VMA_LOCK_OFFSET is set or vm_refcnt > 1
* Can be read unreliably (using READ_ONCE()) for pessimistic bailout
* while holding nothing (except RCU to keep the VMA struct allocated).
*
@@ -282,7 +272,6 @@ struct vm_area_struct {
* slowpath.
*/
unsigned int vm_lock_seq;
- struct vma_lock vm_lock;
#endif
/*
@@ -335,6 +324,10 @@ struct vm_area_struct {
struct vma_numab_state *numab_state; /* NUMA Balancing state */
#endif
struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
+#ifdef CONFIG_PER_VMA_LOCK
+ /* Unstable RCU readers are allowed to read this. */
+ refcount_t vm_refcnt;
+#endif
} __randomize_layout;
struct vm_fault {};
@@ -459,23 +452,41 @@ static inline struct vm_area_struct *vma_next(struct vma_iterator *vmi)
return mas_find(&vmi->mas, ULONG_MAX);
}
-static inline void vma_lock_init(struct vm_area_struct *vma)
+/*
+ * WARNING: to avoid racing with vma_mark_attached()/vma_mark_detached(), these
+ * assertions should be made either under mmap_write_lock or when the object
+ * has been isolated under mmap_write_lock, ensuring no competing writers.
+ */
+static inline void vma_assert_attached(struct vm_area_struct *vma)
{
- init_rwsem(&vma->vm_lock.lock);
- vma->vm_lock_seq = UINT_MAX;
+ VM_BUG_ON_VMA(!refcount_read(&vma->vm_refcnt), vma);
}
-static inline void vma_mark_attached(struct vm_area_struct *vma)
+static inline void vma_assert_detached(struct vm_area_struct *vma)
{
- vma->detached = false;
+ VM_BUG_ON_VMA(refcount_read(&vma->vm_refcnt), vma);
}
static inline void vma_assert_write_locked(struct vm_area_struct *);
+static inline void vma_mark_attached(struct vm_area_struct *vma)
+{
+ vma_assert_write_locked(vma);
+ vma_assert_detached(vma);
+ refcount_set(&vma->vm_refcnt, 1);
+}
+
static inline void vma_mark_detached(struct vm_area_struct *vma)
{
- /* When detaching vma should be write-locked */
vma_assert_write_locked(vma);
- vma->detached = true;
+ vma_assert_attached(vma);
+
+ /* We are the only writer, so no need to use vma_refcount_put(). */
+ if (unlikely(!refcount_dec_and_test(&vma->vm_refcnt))) {
+ /*
+ * Reader must have temporarily raised vm_refcnt but it will
+ * drop it without using the vma since vma is write-locked.
+ */
+ }
}
extern const struct vm_operations_struct vma_dummy_vm_ops;
@@ -488,9 +499,7 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
vma->vm_mm = mm;
vma->vm_ops = &vma_dummy_vm_ops;
INIT_LIST_HEAD(&vma->anon_vma_chain);
- /* vma is not locked, can't use vma_mark_detached() */
- vma->detached = true;
- vma_lock_init(vma);
+ vma->vm_lock_seq = UINT_MAX;
}
static inline struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
@@ -513,10 +522,9 @@ static inline struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
return NULL;
memcpy(new, orig, sizeof(*new));
- vma_lock_init(new);
+ refcount_set(&new->vm_refcnt, 0);
+ new->vm_lock_seq = UINT_MAX;
INIT_LIST_HEAD(&new->anon_vma_chain);
- /* vma is not locked, can't use vma_mark_detached() */
- new->detached = true;
return new;
}
--
2.47.1.613.gc27f4b7a9f-goog
* Re: [PATCH v7 12/17] mm: replace vm_lock and detached flag with a reference count
2024-12-26 17:07 ` [PATCH v7 12/17] mm: replace vm_lock and detached flag with a reference count Suren Baghdasaryan
@ 2025-01-06 0:38 ` Wei Yang
2025-01-06 17:26 ` Suren Baghdasaryan
2025-01-07 18:44 ` Liam R. Howlett
2025-01-08 11:52 ` Vlastimil Babka
2 siblings, 1 reply; 65+ messages in thread
From: Wei Yang @ 2025-01-06 0:38 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: akpm, peterz, willy, liam.howlett, lorenzo.stoakes, mhocko,
vbabka, hannes, mjguzik, oliver.sang, mgorman, david, peterx,
oleg, dave, paulmck, brauner, dhowells, hdanton, hughd,
lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
pasha.tatashin, klarasmodin, corbet, linux-doc, linux-mm,
linux-kernel, kernel-team
On Thu, Dec 26, 2024 at 09:07:04AM -0800, Suren Baghdasaryan wrote:
[...]
> /*
> * Try to read-lock a vma. The function is allowed to occasionally yield false
> * locked result to avoid performance overhead, in which case we fall back to
>@@ -710,6 +733,8 @@ static inline void vma_lock_init(struct vm_area_struct *vma)
> */
> static inline bool vma_start_read(struct vm_area_struct *vma)
> {
>+ int oldcnt;
>+
> /*
> * Check before locking. A race might cause false locked result.
> * We can use READ_ONCE() for the mm_lock_seq here, and don't need
>@@ -720,13 +745,20 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq.sequence))
> return false;
>
>- if (unlikely(down_read_trylock(&vma->vm_lock.lock) == 0))
>+
>+ rwsem_acquire_read(&vma->vmlock_dep_map, 0, 0, _RET_IP_);
>+ /* Limit at VMA_REF_LIMIT to leave one count for a writer */
>+ if (unlikely(!__refcount_inc_not_zero_limited(&vma->vm_refcnt, &oldcnt,
>+ VMA_REF_LIMIT))) {
>+ rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
> return false;
>+ }
>+ lock_acquired(&vma->vmlock_dep_map, _RET_IP_);
>
> /*
>- * Overflow might produce false locked result.
>+ * Overflow of vm_lock_seq/mm_lock_seq might produce false locked result.
> * False unlocked result is impossible because we modify and check
>- * vma->vm_lock_seq under vma->vm_lock protection and mm->mm_lock_seq
>+ * vma->vm_lock_seq under vma->vm_refcnt protection and mm->mm_lock_seq
> * modification invalidates all existing locks.
> *
> * We must use ACQUIRE semantics for the mm_lock_seq so that if we are
>@@ -734,10 +766,12 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> * after it has been unlocked.
> * This pairs with RELEASE semantics in vma_end_write_all().
> */
>- if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&vma->vm_mm->mm_lock_seq))) {
>- up_read(&vma->vm_lock.lock);
>+ if (unlikely(oldcnt & VMA_LOCK_OFFSET ||
>+ vma->vm_lock_seq == raw_read_seqcount(&vma->vm_mm->mm_lock_seq))) {
I am not sure it is worth mentioning. In case it is too trivial, just ignore it.
If (oldcnt & VMA_LOCK_OFFSET), oldcnt + 1 > VMA_REF_LIMIT. This means
__refcount_inc_not_zero_limited() above would return false.
If my understanding is correct, we don't need to check it here.
>+ vma_refcount_put(vma);
> return false;
> }
>+
> return true;
> }
>
[...]
--
Wei Yang
Help you, Help me
* Re: [PATCH v7 12/17] mm: replace vm_lock and detached flag with a reference count
2025-01-06 0:38 ` Wei Yang
@ 2025-01-06 17:26 ` Suren Baghdasaryan
0 siblings, 0 replies; 65+ messages in thread
From: Suren Baghdasaryan @ 2025-01-06 17:26 UTC (permalink / raw)
To: Wei Yang
Cc: akpm, peterz, willy, liam.howlett, lorenzo.stoakes, mhocko,
vbabka, hannes, mjguzik, oliver.sang, mgorman, david, peterx,
oleg, dave, paulmck, brauner, dhowells, hdanton, hughd,
lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
pasha.tatashin, klarasmodin, corbet, linux-doc, linux-mm,
linux-kernel, kernel-team
On Sun, Jan 5, 2025 at 4:38 PM Wei Yang <richard.weiyang@gmail.com> wrote:
>
> On Thu, Dec 26, 2024 at 09:07:04AM -0800, Suren Baghdasaryan wrote:
> [...]
> > /*
> > * Try to read-lock a vma. The function is allowed to occasionally yield false
> > * locked result to avoid performance overhead, in which case we fall back to
> >@@ -710,6 +733,8 @@ static inline void vma_lock_init(struct vm_area_struct *vma)
> > */
> > static inline bool vma_start_read(struct vm_area_struct *vma)
> > {
> >+ int oldcnt;
> >+
> > /*
> > * Check before locking. A race might cause false locked result.
> > * We can use READ_ONCE() for the mm_lock_seq here, and don't need
> >@@ -720,13 +745,20 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> > if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq.sequence))
> > return false;
> >
> >- if (unlikely(down_read_trylock(&vma->vm_lock.lock) == 0))
> >+
> >+ rwsem_acquire_read(&vma->vmlock_dep_map, 0, 0, _RET_IP_);
> >+ /* Limit at VMA_REF_LIMIT to leave one count for a writer */
> >+ if (unlikely(!__refcount_inc_not_zero_limited(&vma->vm_refcnt, &oldcnt,
> >+ VMA_REF_LIMIT))) {
> >+ rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
> > return false;
> >+ }
> >+ lock_acquired(&vma->vmlock_dep_map, _RET_IP_);
> >
> > /*
> >- * Overflow might produce false locked result.
> >+ * Overflow of vm_lock_seq/mm_lock_seq might produce false locked result.
> > * False unlocked result is impossible because we modify and check
> >- * vma->vm_lock_seq under vma->vm_lock protection and mm->mm_lock_seq
> >+ * vma->vm_lock_seq under vma->vm_refcnt protection and mm->mm_lock_seq
> > * modification invalidates all existing locks.
> > *
> > * We must use ACQUIRE semantics for the mm_lock_seq so that if we are
> >@@ -734,10 +766,12 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> > * after it has been unlocked.
> > * This pairs with RELEASE semantics in vma_end_write_all().
> > */
> >- if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&vma->vm_mm->mm_lock_seq))) {
> >- up_read(&vma->vm_lock.lock);
> >+ if (unlikely(oldcnt & VMA_LOCK_OFFSET ||
> >+ vma->vm_lock_seq == raw_read_seqcount(&vma->vm_mm->mm_lock_seq))) {
>
> I am not sure it is worth mentioning; if it is too trivial, just ignore this.
>
> If (oldcnt & VMA_LOCK_OFFSET) is set, then oldcnt + 1 > VMA_REF_LIMIT, which
> means __refcount_inc_not_zero_limited() above would have returned false.
>
> If my understanding is correct, we don't need to check it here.
Yes, you are correct, (oldcnt & VMA_LOCK_OFFSET) is not really needed
here. I'll send a small fixup removing this check and adding a comment
before __refcount_inc_not_zero_limited() explaining that it will fail
if VMA_LOCK_OFFSET is set.
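Roughly something like this (untested sketch, not the final fixup):

        rwsem_acquire_read(&vma->vmlock_dep_map, 0, 0, _RET_IP_);
        /*
         * Limit at VMA_REF_LIMIT to leave one count for a writer. If the
         * writer bit (VMA_LOCK_OFFSET) is already set, oldcnt is above the
         * limit and the increment fails, so no separate VMA_LOCK_OFFSET
         * check is needed afterwards.
         */
        if (unlikely(!__refcount_inc_not_zero_limited(&vma->vm_refcnt, &oldcnt,
                                                      VMA_REF_LIMIT))) {
                rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
                return false;
        }
        lock_acquired(&vma->vmlock_dep_map, _RET_IP_);
        ...
        if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&vma->vm_mm->mm_lock_seq))) {
                vma_refcount_put(vma);
                return false;
        }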
Thanks,
Suren.
>
> >+ vma_refcount_put(vma);
> > return false;
> > }
> >+
> > return true;
> > }
> >
> [...]
>
> --
> Wei Yang
> Help you, Help me
* Re: [PATCH v7 12/17] mm: replace vm_lock and detached flag with a reference count
2024-12-26 17:07 ` [PATCH v7 12/17] mm: replace vm_lock and detached flag with a reference count Suren Baghdasaryan
2025-01-06 0:38 ` Wei Yang
@ 2025-01-07 18:44 ` Liam R. Howlett
2025-01-07 19:38 ` Suren Baghdasaryan
2025-01-08 11:52 ` Vlastimil Babka
2 siblings, 1 reply; 65+ messages in thread
From: Liam R. Howlett @ 2025-01-07 18:44 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: akpm, peterz, willy, lorenzo.stoakes, mhocko, vbabka, hannes,
mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team
* Suren Baghdasaryan <surenb@google.com> [241226 12:07]:
> rw_semaphore is a sizable structure of 40 bytes and consumes
> considerable space for each vm_area_struct. However vma_lock has
> two important specifics which can be used to replace rw_semaphore
> with a simpler structure:
> 1. Readers never wait. They try to take the vma_lock and fall back to
> mmap_lock if that fails.
> 2. Only one writer at a time will ever try to write-lock a vma_lock
> because writers first take mmap_lock in write mode.
> Because of these requirements, full rw_semaphore functionality is not
> needed and we can replace rw_semaphore and the vma->detached flag with
> a refcount (vm_refcnt).
> When vma is in detached state, vm_refcnt is 0 and only a call to
> vma_mark_attached() can take it out of this state. Note that unlike
> before, now we enforce both vma_mark_attached() and vma_mark_detached()
> to be done only after vma has been write-locked. vma_mark_attached()
> changes vm_refcnt to 1 to indicate that it has been attached to the vma
> tree. When a reader takes read lock, it increments vm_refcnt, unless the
> top usable bit of vm_refcnt (0x40000000) is set, indicating presence of
> a writer. When writer takes write lock, it both increments vm_refcnt and
> sets the top usable bit to indicate its presence. If there are readers,
> writer will wait using newly introduced mm->vma_writer_wait. Since all
> writers take mmap_lock in write mode first, there can be only one writer
> at a time. The last reader to release the lock will signal the writer
> to wake up.
> refcount might overflow if there are many competing readers, in which case
> read-locking will fail. Readers are expected to handle such failures.
I find the above a bit hard to parse.
What I understand is:
1. all accesses increment the ref count.
2. readers cannot increment the ref count unless the writer bit is 0 (no
write present)
3. writers must wait for the ref count to reach 2 (the tree + writer
reference) before proceeding.
4. increment overflow must be handled by the readers.
>
> Suggested-by: Peter Zijlstra <peterz@infradead.org>
> Suggested-by: Matthew Wilcox <willy@infradead.org>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
> include/linux/mm.h | 100 +++++++++++++++++++++----------
> include/linux/mm_types.h | 22 ++++---
> kernel/fork.c | 13 ++--
> mm/init-mm.c | 1 +
> mm/memory.c | 68 +++++++++++++++++----
> tools/testing/vma/linux/atomic.h | 5 ++
> tools/testing/vma/vma_internal.h | 66 +++++++++++---------
> 7 files changed, 185 insertions(+), 90 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index ea4c4228b125..99f4720d7e51 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -32,6 +32,7 @@
> #include <linux/memremap.h>
> #include <linux/slab.h>
> #include <linux/cacheinfo.h>
> +#include <linux/rcuwait.h>
>
> struct mempolicy;
> struct anon_vma;
> @@ -697,12 +698,34 @@ static inline void vma_numab_state_free(struct vm_area_struct *vma) {}
> #endif /* CONFIG_NUMA_BALANCING */
>
> #ifdef CONFIG_PER_VMA_LOCK
> -static inline void vma_lock_init(struct vm_area_struct *vma)
> +static inline void vma_lockdep_init(struct vm_area_struct *vma)
> {
> - init_rwsem(&vma->vm_lock.lock);
> +#ifdef CONFIG_DEBUG_LOCK_ALLOC
> + static struct lock_class_key lockdep_key;
> +
> + lockdep_init_map(&vma->vmlock_dep_map, "vm_lock", &lockdep_key, 0);
> +#endif
> +}
> +
> +static inline void vma_init_lock(struct vm_area_struct *vma, bool reset_refcnt)
> +{
> + if (reset_refcnt)
> + refcount_set(&vma->vm_refcnt, 0);
> vma->vm_lock_seq = UINT_MAX;
> }
>
> +static inline void vma_refcount_put(struct vm_area_struct *vma)
> +{
> + int refcnt;
> +
> + if (!__refcount_dec_and_test(&vma->vm_refcnt, &refcnt)) {
> + rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
> +
> + if (refcnt & VMA_LOCK_OFFSET)
Couldn't we only wake on refcnt == VMA_LOCK_OFFSET + 2?
Right now you will wake on every departed reader, I think? We know
refcnt is only going down if VMA_LOCK_OFFSET is set.
Also, maybe a #define for VMA_LOCK_WRITER_ONLY or some better name?
> + rcuwait_wake_up(&vma->vm_mm->vma_writer_wait);
> + }
> +}
> +
> /*
> * Try to read-lock a vma. The function is allowed to occasionally yield false
> * locked result to avoid performance overhead, in which case we fall back to
> @@ -710,6 +733,8 @@ static inline void vma_lock_init(struct vm_area_struct *vma)
> */
> static inline bool vma_start_read(struct vm_area_struct *vma)
> {
> + int oldcnt;
> +
> /*
> * Check before locking. A race might cause false locked result.
> * We can use READ_ONCE() for the mm_lock_seq here, and don't need
> @@ -720,13 +745,20 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq.sequence))
> return false;
>
> - if (unlikely(down_read_trylock(&vma->vm_lock.lock) == 0))
> +
> + rwsem_acquire_read(&vma->vmlock_dep_map, 0, 0, _RET_IP_);
> + /* Limit at VMA_REF_LIMIT to leave one count for a writer */
> + if (unlikely(!__refcount_inc_not_zero_limited(&vma->vm_refcnt, &oldcnt,
> + VMA_REF_LIMIT))) {
> + rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
> return false;
> + }
> + lock_acquired(&vma->vmlock_dep_map, _RET_IP_);
>
> /*
> - * Overflow might produce false locked result.
> + * Overflow of vm_lock_seq/mm_lock_seq might produce false locked result.
> * False unlocked result is impossible because we modify and check
> - * vma->vm_lock_seq under vma->vm_lock protection and mm->mm_lock_seq
> + * vma->vm_lock_seq under vma->vm_refcnt protection and mm->mm_lock_seq
> * modification invalidates all existing locks.
> *
> * We must use ACQUIRE semantics for the mm_lock_seq so that if we are
> @@ -734,10 +766,12 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> * after it has been unlocked.
> * This pairs with RELEASE semantics in vma_end_write_all().
> */
> - if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&vma->vm_mm->mm_lock_seq))) {
> - up_read(&vma->vm_lock.lock);
> + if (unlikely(oldcnt & VMA_LOCK_OFFSET ||
> + vma->vm_lock_seq == raw_read_seqcount(&vma->vm_mm->mm_lock_seq))) {
> + vma_refcount_put(vma);
> return false;
> }
> +
> return true;
> }
>
> @@ -749,8 +783,17 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> */
> static inline bool vma_start_read_locked_nested(struct vm_area_struct *vma, int subclass)
> {
> + int oldcnt;
> +
> mmap_assert_locked(vma->vm_mm);
> - down_read_nested(&vma->vm_lock.lock, subclass);
> + rwsem_acquire_read(&vma->vmlock_dep_map, subclass, 0, _RET_IP_);
> + /* Limit at VMA_REF_LIMIT to leave one count for a writer */
> + if (unlikely(!__refcount_inc_not_zero_limited(&vma->vm_refcnt, &oldcnt,
> + VMA_REF_LIMIT))) {
> + rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
> + return false;
> + }
> + lock_acquired(&vma->vmlock_dep_map, _RET_IP_);
> return true;
> }
>
> @@ -762,15 +805,13 @@ static inline bool vma_start_read_locked_nested(struct vm_area_struct *vma, int
> */
> static inline bool vma_start_read_locked(struct vm_area_struct *vma)
> {
> - mmap_assert_locked(vma->vm_mm);
> - down_read(&vma->vm_lock.lock);
> - return true;
> + return vma_start_read_locked_nested(vma, 0);
> }
>
> static inline void vma_end_read(struct vm_area_struct *vma)
> {
> rcu_read_lock(); /* keeps vma alive till the end of up_read */
> - up_read(&vma->vm_lock.lock);
> + vma_refcount_put(vma);
> rcu_read_unlock();
> }
>
> @@ -813,36 +854,33 @@ static inline void vma_assert_write_locked(struct vm_area_struct *vma)
>
> static inline void vma_assert_locked(struct vm_area_struct *vma)
> {
> - if (!rwsem_is_locked(&vma->vm_lock.lock))
> + if (refcount_read(&vma->vm_refcnt) <= 1)
> vma_assert_write_locked(vma);
> }
>
> +/*
> + * WARNING: to avoid racing with vma_mark_attached()/vma_mark_detached(), these
> + * assertions should be made either under mmap_write_lock or when the object
> + * has been isolated under mmap_write_lock, ensuring no competing writers.
> + */
> static inline void vma_assert_attached(struct vm_area_struct *vma)
> {
> - VM_BUG_ON_VMA(vma->detached, vma);
> + VM_BUG_ON_VMA(!refcount_read(&vma->vm_refcnt), vma);
> }
>
> static inline void vma_assert_detached(struct vm_area_struct *vma)
> {
> - VM_BUG_ON_VMA(!vma->detached, vma);
> + VM_BUG_ON_VMA(refcount_read(&vma->vm_refcnt), vma);
> }
>
> static inline void vma_mark_attached(struct vm_area_struct *vma)
> {
> - vma->detached = false;
> -}
> -
> -static inline void vma_mark_detached(struct vm_area_struct *vma)
> -{
> - /* When detaching vma should be write-locked */
> vma_assert_write_locked(vma);
> - vma->detached = true;
> + vma_assert_detached(vma);
> + refcount_set(&vma->vm_refcnt, 1);
> }
>
> -static inline bool is_vma_detached(struct vm_area_struct *vma)
> -{
> - return vma->detached;
> -}
> +void vma_mark_detached(struct vm_area_struct *vma);
>
> static inline void release_fault_lock(struct vm_fault *vmf)
> {
> @@ -865,7 +903,8 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
>
> #else /* CONFIG_PER_VMA_LOCK */
>
> -static inline void vma_lock_init(struct vm_area_struct *vma) {}
> +static inline void vma_lockdep_init(struct vm_area_struct *vma) {}
> +static inline void vma_init_lock(struct vm_area_struct *vma, bool reset_refcnt) {}
> static inline bool vma_start_read(struct vm_area_struct *vma)
> { return false; }
> static inline void vma_end_read(struct vm_area_struct *vma) {}
> @@ -908,12 +947,9 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
> vma->vm_mm = mm;
> vma->vm_ops = &vma_dummy_vm_ops;
> INIT_LIST_HEAD(&vma->anon_vma_chain);
> -#ifdef CONFIG_PER_VMA_LOCK
> - /* vma is not locked, can't use vma_mark_detached() */
> - vma->detached = true;
> -#endif
> vma_numab_state_init(vma);
> - vma_lock_init(vma);
> + vma_lockdep_init(vma);
> + vma_init_lock(vma, false);
> }
>
> /* Use when VMA is not part of the VMA tree and needs no locking */
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 6573d95f1d1e..b5312421dec6 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -19,6 +19,7 @@
> #include <linux/workqueue.h>
> #include <linux/seqlock.h>
> #include <linux/percpu_counter.h>
> +#include <linux/types.h>
>
> #include <asm/mmu.h>
>
> @@ -629,9 +630,8 @@ static inline struct anon_vma_name *anon_vma_name_alloc(const char *name)
> }
> #endif
>
> -struct vma_lock {
> - struct rw_semaphore lock;
> -};
> +#define VMA_LOCK_OFFSET 0x40000000
> +#define VMA_REF_LIMIT (VMA_LOCK_OFFSET - 2)
>
> struct vma_numab_state {
> /*
> @@ -709,19 +709,13 @@ struct vm_area_struct {
> };
>
> #ifdef CONFIG_PER_VMA_LOCK
> - /*
> - * Flag to indicate areas detached from the mm->mm_mt tree.
> - * Unstable RCU readers are allowed to read this.
> - */
> - bool detached;
> -
> /*
> * Can only be written (using WRITE_ONCE()) while holding both:
> * - mmap_lock (in write mode)
> - * - vm_lock->lock (in write mode)
> + * - vm_refcnt bit at VMA_LOCK_OFFSET is set
> * Can be read reliably while holding one of:
> * - mmap_lock (in read or write mode)
> - * - vm_lock->lock (in read or write mode)
> + * - vm_refcnt bit at VMA_LOCK_OFFSET is set or vm_refcnt > 1
> * Can be read unreliably (using READ_ONCE()) for pessimistic bailout
> * while holding nothing (except RCU to keep the VMA struct allocated).
> *
> @@ -784,7 +778,10 @@ struct vm_area_struct {
> struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
> #ifdef CONFIG_PER_VMA_LOCK
> /* Unstable RCU readers are allowed to read this. */
> - struct vma_lock vm_lock ____cacheline_aligned_in_smp;
> + refcount_t vm_refcnt ____cacheline_aligned_in_smp;
> +#ifdef CONFIG_DEBUG_LOCK_ALLOC
> + struct lockdep_map vmlock_dep_map;
> +#endif
> #endif
> } __randomize_layout;
>
> @@ -919,6 +916,7 @@ struct mm_struct {
> * by mmlist_lock
> */
> #ifdef CONFIG_PER_VMA_LOCK
> + struct rcuwait vma_writer_wait;
> /*
> * This field has lock-like semantics, meaning it is sometimes
> * accessed with ACQUIRE/RELEASE semantics.
> diff --git a/kernel/fork.c b/kernel/fork.c
> index d4c75428ccaf..7a0800d48112 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -463,12 +463,8 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
> * will be reinitialized.
> */
> data_race(memcpy(new, orig, sizeof(*new)));
> - vma_lock_init(new);
> + vma_init_lock(new, true);
> INIT_LIST_HEAD(&new->anon_vma_chain);
> -#ifdef CONFIG_PER_VMA_LOCK
> - /* vma is not locked, can't use vma_mark_detached() */
> - new->detached = true;
> -#endif
> vma_numab_state_init(new);
> dup_anon_vma_name(orig, new);
>
> @@ -477,6 +473,8 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
>
> void __vm_area_free(struct vm_area_struct *vma)
> {
> + /* The vma should be detached while being destroyed. */
> + vma_assert_detached(vma);
> vma_numab_state_free(vma);
> free_anon_vma_name(vma);
> kmem_cache_free(vm_area_cachep, vma);
> @@ -488,8 +486,6 @@ static void vm_area_free_rcu_cb(struct rcu_head *head)
> struct vm_area_struct *vma = container_of(head, struct vm_area_struct,
> vm_rcu);
>
> - /* The vma should not be locked while being destroyed. */
> - VM_BUG_ON_VMA(rwsem_is_locked(&vma->vm_lock.lock), vma);
> __vm_area_free(vma);
> }
> #endif
> @@ -1223,6 +1219,9 @@ static inline void mmap_init_lock(struct mm_struct *mm)
> {
> init_rwsem(&mm->mmap_lock);
> mm_lock_seqcount_init(mm);
> +#ifdef CONFIG_PER_VMA_LOCK
> + rcuwait_init(&mm->vma_writer_wait);
> +#endif
> }
>
> static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
> diff --git a/mm/init-mm.c b/mm/init-mm.c
> index 6af3ad675930..4600e7605cab 100644
> --- a/mm/init-mm.c
> +++ b/mm/init-mm.c
> @@ -40,6 +40,7 @@ struct mm_struct init_mm = {
> .arg_lock = __SPIN_LOCK_UNLOCKED(init_mm.arg_lock),
> .mmlist = LIST_HEAD_INIT(init_mm.mmlist),
> #ifdef CONFIG_PER_VMA_LOCK
> + .vma_writer_wait = __RCUWAIT_INITIALIZER(init_mm.vma_writer_wait),
> .mm_lock_seq = SEQCNT_ZERO(init_mm.mm_lock_seq),
> #endif
> .user_ns = &init_user_ns,
> diff --git a/mm/memory.c b/mm/memory.c
> index 236fdecd44d6..2def47b5dff0 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -6328,9 +6328,39 @@ struct vm_area_struct *lock_mm_and_find_vma(struct mm_struct *mm,
> #endif
>
> #ifdef CONFIG_PER_VMA_LOCK
> +static inline bool __vma_enter_locked(struct vm_area_struct *vma, unsigned int tgt_refcnt)
> +{
> + /*
> + * If vma is detached then only vma_mark_attached() can raise the
> + * vm_refcnt. mmap_write_lock prevents racing with vma_mark_attached().
> + */
> + if (!refcount_inc_not_zero(&vma->vm_refcnt))
> + return false;
Can't the write lock overflow the ref count too?
> +
> + rwsem_acquire(&vma->vmlock_dep_map, 0, 0, _RET_IP_);
> + /* vma is attached, set the writer present bit */
> + refcount_add(VMA_LOCK_OFFSET, &vma->vm_refcnt);
> + rcuwait_wait_event(&vma->vm_mm->vma_writer_wait,
> + refcount_read(&vma->vm_refcnt) == tgt_refcnt,
> + TASK_UNINTERRUPTIBLE);
> + lock_acquired(&vma->vmlock_dep_map, _RET_IP_);
> +
> + return true;
> +}
> +
> +static inline void __vma_exit_locked(struct vm_area_struct *vma, bool *detached)
> +{
> + *detached = refcount_sub_and_test(VMA_LOCK_OFFSET + 1, &vma->vm_refcnt);
> + rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
> +}
> +
> void __vma_start_write(struct vm_area_struct *vma, unsigned int mm_lock_seq)
> {
> - down_write(&vma->vm_lock.lock);
> + bool locked;
> +
> + /* Wait until refcnt is (VMA_LOCK_OFFSET + 2) => attached with no readers */
> + locked = __vma_enter_locked(vma, VMA_LOCK_OFFSET + 2);
Does it need to take a ref count at all? Could we just set the write
bit and wait for it to become 1 instead? That is, 1 would represent
detached or writer is about to attach/detach it.
If we do need it to be ref counted for the writer, we could set the
write bit and then wait for the ref to be 1 before incrementing it to 2?
I think this would be safer as we know there is only one writer and the
readers can only decrease after setting the write bit.
> +
> /*
> * We should use WRITE_ONCE() here because we can have concurrent reads
> * from the early lockless pessimistic check in vma_start_read().
> @@ -6338,10 +6368,36 @@ void __vma_start_write(struct vm_area_struct *vma, unsigned int mm_lock_seq)
> * we should use WRITE_ONCE() for cleanliness and to keep KCSAN happy.
> */
> WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
> - up_write(&vma->vm_lock.lock);
> +
> + if (locked) {
> + bool detached;
> +
> + __vma_exit_locked(vma, &detached);
> + VM_BUG_ON_VMA(detached, vma); /* vma should remain attached */
> + }
> }
> EXPORT_SYMBOL_GPL(__vma_start_write);
>
> +void vma_mark_detached(struct vm_area_struct *vma)
> +{
> + vma_assert_write_locked(vma);
> + vma_assert_attached(vma);
> +
> + /* We are the only writer, so no need to use vma_refcount_put(). */
> + if (unlikely(!refcount_dec_and_test(&vma->vm_refcnt))) {
> + /*
> + * Wait until refcnt is (VMA_LOCK_OFFSET + 1) => detached with
> + * no readers
> + */
> + if (__vma_enter_locked(vma, VMA_LOCK_OFFSET + 1)) {
> + bool detached;
> +
> + __vma_exit_locked(vma, &detached);
> + VM_BUG_ON_VMA(!detached, vma);
> + }
> + }
> +}
> +
> /*
> * Lookup and lock a VMA under RCU protection. Returned VMA is guaranteed to be
> * stable and not isolated. If the VMA is not found or is being modified the
> @@ -6354,7 +6410,6 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> struct vm_area_struct *vma;
>
> rcu_read_lock();
> -retry:
> vma = mas_walk(&mas);
> if (!vma)
> goto inval;
> @@ -6362,13 +6417,6 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> if (!vma_start_read(vma))
> goto inval;
>
> - /* Check if the VMA got isolated after we found it */
> - if (is_vma_detached(vma)) {
> - vma_end_read(vma);
> - count_vm_vma_lock_event(VMA_LOCK_MISS);
> - /* The area was replaced with another one */
> - goto retry;
> - }
> /*
> * At this point, we have a stable reference to a VMA: The VMA is
> * locked and we know it hasn't already been isolated.
> diff --git a/tools/testing/vma/linux/atomic.h b/tools/testing/vma/linux/atomic.h
> index e01f66f98982..2e2021553196 100644
> --- a/tools/testing/vma/linux/atomic.h
> +++ b/tools/testing/vma/linux/atomic.h
> @@ -9,4 +9,9 @@
> #define atomic_set(x, y) do {} while (0)
> #define U8_MAX UCHAR_MAX
>
> +#ifndef atomic_cmpxchg_relaxed
> +#define atomic_cmpxchg_relaxed uatomic_cmpxchg
> +#define atomic_cmpxchg_release uatomic_cmpxchg
> +#endif /* atomic_cmpxchg_relaxed */
> +
> #endif /* _LINUX_ATOMIC_H */
> diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
> index 2a624f9304da..1e8cd2f013fa 100644
> --- a/tools/testing/vma/vma_internal.h
> +++ b/tools/testing/vma/vma_internal.h
> @@ -25,7 +25,7 @@
> #include <linux/maple_tree.h>
> #include <linux/mm.h>
> #include <linux/rbtree.h>
> -#include <linux/rwsem.h>
> +#include <linux/refcount.h>
>
> extern unsigned long stack_guard_gap;
> #ifdef CONFIG_MMU
> @@ -132,10 +132,6 @@ typedef __bitwise unsigned int vm_fault_t;
> */
> #define pr_warn_once pr_err
>
> -typedef struct refcount_struct {
> - atomic_t refs;
> -} refcount_t;
> -
> struct kref {
> refcount_t refcount;
> };
> @@ -228,15 +224,12 @@ struct mm_struct {
> unsigned long def_flags;
> };
>
> -struct vma_lock {
> - struct rw_semaphore lock;
> -};
> -
> -
> struct file {
> struct address_space *f_mapping;
> };
>
> +#define VMA_LOCK_OFFSET 0x40000000
> +
> struct vm_area_struct {
> /* The first cache line has the info for VMA tree walking. */
>
> @@ -264,16 +257,13 @@ struct vm_area_struct {
> };
>
> #ifdef CONFIG_PER_VMA_LOCK
> - /* Flag to indicate areas detached from the mm->mm_mt tree */
> - bool detached;
> -
> /*
> * Can only be written (using WRITE_ONCE()) while holding both:
> * - mmap_lock (in write mode)
> - * - vm_lock.lock (in write mode)
> + * - vm_refcnt bit at VMA_LOCK_OFFSET is set
> * Can be read reliably while holding one of:
> * - mmap_lock (in read or write mode)
> - * - vm_lock.lock (in read or write mode)
> + * - vm_refcnt bit at VMA_LOCK_OFFSET is set or vm_refcnt > 1
> * Can be read unreliably (using READ_ONCE()) for pessimistic bailout
> * while holding nothing (except RCU to keep the VMA struct allocated).
> *
> @@ -282,7 +272,6 @@ struct vm_area_struct {
> * slowpath.
> */
> unsigned int vm_lock_seq;
> - struct vma_lock vm_lock;
> #endif
>
> /*
> @@ -335,6 +324,10 @@ struct vm_area_struct {
> struct vma_numab_state *numab_state; /* NUMA Balancing state */
> #endif
> struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
> +#ifdef CONFIG_PER_VMA_LOCK
> + /* Unstable RCU readers are allowed to read this. */
> + refcount_t vm_refcnt;
> +#endif
> } __randomize_layout;
>
> struct vm_fault {};
> @@ -459,23 +452,41 @@ static inline struct vm_area_struct *vma_next(struct vma_iterator *vmi)
> return mas_find(&vmi->mas, ULONG_MAX);
> }
>
> -static inline void vma_lock_init(struct vm_area_struct *vma)
> +/*
> + * WARNING: to avoid racing with vma_mark_attached()/vma_mark_detached(), these
> + * assertions should be made either under mmap_write_lock or when the object
> + * has been isolated under mmap_write_lock, ensuring no competing writers.
> + */
> +static inline void vma_assert_attached(struct vm_area_struct *vma)
> {
> - init_rwsem(&vma->vm_lock.lock);
> - vma->vm_lock_seq = UINT_MAX;
> + VM_BUG_ON_VMA(!refcount_read(&vma->vm_refcnt), vma);
> }
>
> -static inline void vma_mark_attached(struct vm_area_struct *vma)
> +static inline void vma_assert_detached(struct vm_area_struct *vma)
> {
> - vma->detached = false;
> + VM_BUG_ON_VMA(refcount_read(&vma->vm_refcnt), vma);
> }
>
> static inline void vma_assert_write_locked(struct vm_area_struct *);
> +static inline void vma_mark_attached(struct vm_area_struct *vma)
> +{
> + vma_assert_write_locked(vma);
> + vma_assert_detached(vma);
> + refcount_set(&vma->vm_refcnt, 1);
> +}
> +
> static inline void vma_mark_detached(struct vm_area_struct *vma)
> {
> - /* When detaching vma should be write-locked */
> vma_assert_write_locked(vma);
> - vma->detached = true;
> + vma_assert_attached(vma);
> +
> + /* We are the only writer, so no need to use vma_refcount_put(). */
> + if (unlikely(!refcount_dec_and_test(&vma->vm_refcnt))) {
> + /*
> + * Reader must have temporarily raised vm_refcnt but it will
> + * drop it without using the vma since vma is write-locked.
> + */
> + }
> }
>
> extern const struct vm_operations_struct vma_dummy_vm_ops;
> @@ -488,9 +499,7 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
> vma->vm_mm = mm;
> vma->vm_ops = &vma_dummy_vm_ops;
> INIT_LIST_HEAD(&vma->anon_vma_chain);
> - /* vma is not locked, can't use vma_mark_detached() */
> - vma->detached = true;
> - vma_lock_init(vma);
> + vma->vm_lock_seq = UINT_MAX;
> }
>
> static inline struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
> @@ -513,10 +522,9 @@ static inline struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
> return NULL;
>
> memcpy(new, orig, sizeof(*new));
> - vma_lock_init(new);
> + refcount_set(&new->vm_refcnt, 0);
> + new->vm_lock_seq = UINT_MAX;
> INIT_LIST_HEAD(&new->anon_vma_chain);
> - /* vma is not locked, can't use vma_mark_detached() */
> - new->detached = true;
>
> return new;
> }
> --
> 2.47.1.613.gc27f4b7a9f-goog
>
* Re: [PATCH v7 12/17] mm: replace vm_lock and detached flag with a reference count
2025-01-07 18:44 ` Liam R. Howlett
@ 2025-01-07 19:38 ` Suren Baghdasaryan
0 siblings, 0 replies; 65+ messages in thread
From: Suren Baghdasaryan @ 2025-01-07 19:38 UTC (permalink / raw)
To: Liam R. Howlett, Suren Baghdasaryan, akpm, peterz, willy,
lorenzo.stoakes, mhocko, vbabka, hannes, mjguzik, oliver.sang,
mgorman, david, peterx, oleg, dave, paulmck, brauner, dhowells,
hdanton, hughd, lokeshgidra, minchan, jannh, shakeel.butt,
souravpanda, pasha.tatashin, klarasmodin, corbet, linux-doc,
linux-mm, linux-kernel, kernel-team
On Tue, Jan 7, 2025 at 10:44 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> * Suren Baghdasaryan <surenb@google.com> [241226 12:07]:
> > rw_semaphore is a sizable structure of 40 bytes and consumes
> > considerable space for each vm_area_struct. However vma_lock has
> > two important specifics which can be used to replace rw_semaphore
> > with a simpler structure:
> > 1. Readers never wait. They try to take the vma_lock and fall back to
> > mmap_lock if that fails.
> > 2. Only one writer at a time will ever try to write-lock a vma_lock
> > because writers first take mmap_lock in write mode.
> > Because of these requirements, full rw_semaphore functionality is not
> > needed and we can replace rw_semaphore and the vma->detached flag with
> > a refcount (vm_refcnt).
> > When vma is in detached state, vm_refcnt is 0 and only a call to
> > vma_mark_attached() can take it out of this state. Note that unlike
> > before, now we enforce both vma_mark_attached() and vma_mark_detached()
> > to be done only after vma has been write-locked. vma_mark_attached()
> > changes vm_refcnt to 1 to indicate that it has been attached to the vma
> > tree. When a reader takes read lock, it increments vm_refcnt, unless the
> > top usable bit of vm_refcnt (0x40000000) is set, indicating presence of
> > a writer. When writer takes write lock, it both increments vm_refcnt and
> > sets the top usable bit to indicate its presence. If there are readers,
> > writer will wait using newly introduced mm->vma_writer_wait. Since all
> > writers take mmap_lock in write mode first, there can be only one writer
> > at a time. The last reader to release the lock will signal the writer
> > to wake up.
> > refcount might overflow if there are many competing readers, in which case
> > read-locking will fail. Readers are expected to handle such failures.
>
> I find the above a bit hard to parse.
>
> What I understand is:
> 1. all accesses increment the ref count.
> 2. readers cannot increment the ref count unless the writer bit is 0 (no
> write present)
> 3. writers must wait for the ref count to reach 2 (the tree + writer
> reference) before proceeding.
> 4. increment overflow must be handled by the readers.
Your understanding is correct. I will add this summary to the
description to make it more understandable.
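Roughly, in terms of vm_refcnt values (an illustrative summary, not code from
the patch):

        /*
         * vm_refcnt == 0                 : vma is detached
         * vm_refcnt == 1                 : attached, not read- or write-locked
         * 1 < vm_refcnt <= VMA_REF_LIMIT : attached, read-locked by
         *                                  (vm_refcnt - 1) readers
         * vm_refcnt & VMA_LOCK_OFFSET    : writer present; new readers back off
         *                                  and the writer waits for
         *                                  VMA_LOCK_OFFSET + 2 (attached, no
         *                                  readers) or, in vma_mark_detached(),
         *                                  for VMA_LOCK_OFFSET + 1
         */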
>
> >
> > Suggested-by: Peter Zijlstra <peterz@infradead.org>
> > Suggested-by: Matthew Wilcox <willy@infradead.org>
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > ---
> > include/linux/mm.h | 100 +++++++++++++++++++++----------
> > include/linux/mm_types.h | 22 ++++---
> > kernel/fork.c | 13 ++--
> > mm/init-mm.c | 1 +
> > mm/memory.c | 68 +++++++++++++++++----
> > tools/testing/vma/linux/atomic.h | 5 ++
> > tools/testing/vma/vma_internal.h | 66 +++++++++++---------
> > 7 files changed, 185 insertions(+), 90 deletions(-)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index ea4c4228b125..99f4720d7e51 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -32,6 +32,7 @@
> > #include <linux/memremap.h>
> > #include <linux/slab.h>
> > #include <linux/cacheinfo.h>
> > +#include <linux/rcuwait.h>
> >
> > struct mempolicy;
> > struct anon_vma;
> > @@ -697,12 +698,34 @@ static inline void vma_numab_state_free(struct vm_area_struct *vma) {}
> > #endif /* CONFIG_NUMA_BALANCING */
> >
> > #ifdef CONFIG_PER_VMA_LOCK
> > -static inline void vma_lock_init(struct vm_area_struct *vma)
> > +static inline void vma_lockdep_init(struct vm_area_struct *vma)
> > {
> > - init_rwsem(&vma->vm_lock.lock);
> > +#ifdef CONFIG_DEBUG_LOCK_ALLOC
> > + static struct lock_class_key lockdep_key;
> > +
> > + lockdep_init_map(&vma->vmlock_dep_map, "vm_lock", &lockdep_key, 0);
> > +#endif
> > +}
> > +
> > +static inline void vma_init_lock(struct vm_area_struct *vma, bool reset_refcnt)
> > +{
> > + if (reset_refcnt)
> > + refcount_set(&vma->vm_refcnt, 0);
> > vma->vm_lock_seq = UINT_MAX;
> > }
> >
> > +static inline void vma_refcount_put(struct vm_area_struct *vma)
> > +{
> > + int refcnt;
> > +
> > + if (!__refcount_dec_and_test(&vma->vm_refcnt, &refcnt)) {
> > + rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
> > +
> > + if (refcnt & VMA_LOCK_OFFSET)
>
> Couldn't we only wake on refcnt == VMA_LOCK_OFFSET + 2?
> Right now you will wake on every departed reader, I think? We know
> refcnt is only going down if VMA_LOCK_OFFSET is set.
I think we can, except vma_mark_detached() can be waiting for readers
after detaching the vma, so the refcount value it's waiting for is
VMA_LOCK_OFFSET + 1 (detached vma with no readers).
I think if we change the condition to (refcnt & VMA_LOCK_OFFSET) &&
(refcnt <= VMA_LOCK_OFFSET + 2) that would work. We might occasionally
wake a writer unnecessarily if it's waiting inside vma_mark_detached()
for VMA_LOCK_OFFSET + 1 but this situation is rare (see my later
comment).
>
> Also, maybe a #define for VMA_LOCK_WRITER_ONLY or some better name?
Perhaps a function would be better?
static inline bool is_vma_writer_only(int refcnt)
{
        return (refcnt & VMA_LOCK_OFFSET) && (refcnt <= VMA_LOCK_OFFSET + 2);
}
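and vma_refcount_put() would then become something like this (sketch):

static inline void vma_refcount_put(struct vm_area_struct *vma)
{
        int refcnt;

        if (!__refcount_dec_and_test(&vma->vm_refcnt, &refcnt)) {
                rwsem_release(&vma->vmlock_dep_map, _RET_IP_);

                /* Wake the writer only once readers are (almost) gone */
                if (is_vma_writer_only(refcnt))
                        rcuwait_wake_up(&vma->vm_mm->vma_writer_wait);
        }
}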
>
>
> > + rcuwait_wake_up(&vma->vm_mm->vma_writer_wait);
> > + }
> > +}
> > +
> > /*
> > * Try to read-lock a vma. The function is allowed to occasionally yield false
> > * locked result to avoid performance overhead, in which case we fall back to
> > @@ -710,6 +733,8 @@ static inline void vma_lock_init(struct vm_area_struct *vma)
> > */
> > static inline bool vma_start_read(struct vm_area_struct *vma)
> > {
> > + int oldcnt;
> > +
> > /*
> > * Check before locking. A race might cause false locked result.
> > * We can use READ_ONCE() for the mm_lock_seq here, and don't need
> > @@ -720,13 +745,20 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> > if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq.sequence))
> > return false;
> >
> > - if (unlikely(down_read_trylock(&vma->vm_lock.lock) == 0))
> > +
> > + rwsem_acquire_read(&vma->vmlock_dep_map, 0, 0, _RET_IP_);
> > + /* Limit at VMA_REF_LIMIT to leave one count for a writer */
> > + if (unlikely(!__refcount_inc_not_zero_limited(&vma->vm_refcnt, &oldcnt,
> > + VMA_REF_LIMIT))) {
> > + rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
> > return false;
> > + }
> > + lock_acquired(&vma->vmlock_dep_map, _RET_IP_);
> >
> > /*
> > - * Overflow might produce false locked result.
> > + * Overflow of vm_lock_seq/mm_lock_seq might produce false locked result.
> > * False unlocked result is impossible because we modify and check
> > - * vma->vm_lock_seq under vma->vm_lock protection and mm->mm_lock_seq
> > + * vma->vm_lock_seq under vma->vm_refcnt protection and mm->mm_lock_seq
> > * modification invalidates all existing locks.
> > *
> > * We must use ACQUIRE semantics for the mm_lock_seq so that if we are
> > @@ -734,10 +766,12 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> > * after it has been unlocked.
> > * This pairs with RELEASE semantics in vma_end_write_all().
> > */
> > - if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&vma->vm_mm->mm_lock_seq))) {
> > - up_read(&vma->vm_lock.lock);
> > + if (unlikely(oldcnt & VMA_LOCK_OFFSET ||
> > + vma->vm_lock_seq == raw_read_seqcount(&vma->vm_mm->mm_lock_seq))) {
> > + vma_refcount_put(vma);
> > return false;
> > }
> > +
> > return true;
> > }
> >
> > @@ -749,8 +783,17 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> > */
> > static inline bool vma_start_read_locked_nested(struct vm_area_struct *vma, int subclass)
> > {
> > + int oldcnt;
> > +
> > mmap_assert_locked(vma->vm_mm);
> > - down_read_nested(&vma->vm_lock.lock, subclass);
> > + rwsem_acquire_read(&vma->vmlock_dep_map, subclass, 0, _RET_IP_);
> > + /* Limit at VMA_REF_LIMIT to leave one count for a writer */
> > + if (unlikely(!__refcount_inc_not_zero_limited(&vma->vm_refcnt, &oldcnt,
> > + VMA_REF_LIMIT))) {
> > + rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
> > + return false;
> > + }
> > + lock_acquired(&vma->vmlock_dep_map, _RET_IP_);
> > return true;
> > }
> >
> > @@ -762,15 +805,13 @@ static inline bool vma_start_read_locked_nested(struct vm_area_struct *vma, int
> > */
> > static inline bool vma_start_read_locked(struct vm_area_struct *vma)
> > {
> > - mmap_assert_locked(vma->vm_mm);
> > - down_read(&vma->vm_lock.lock);
> > - return true;
> > + return vma_start_read_locked_nested(vma, 0);
> > }
> >
> > static inline void vma_end_read(struct vm_area_struct *vma)
> > {
> > rcu_read_lock(); /* keeps vma alive till the end of up_read */
> > - up_read(&vma->vm_lock.lock);
> > + vma_refcount_put(vma);
> > rcu_read_unlock();
> > }
> >
> > @@ -813,36 +854,33 @@ static inline void vma_assert_write_locked(struct vm_area_struct *vma)
> >
> > static inline void vma_assert_locked(struct vm_area_struct *vma)
> > {
> > - if (!rwsem_is_locked(&vma->vm_lock.lock))
> > + if (refcount_read(&vma->vm_refcnt) <= 1)
> > vma_assert_write_locked(vma);
> > }
> >
> > +/*
> > + * WARNING: to avoid racing with vma_mark_attached()/vma_mark_detached(), these
> > + * assertions should be made either under mmap_write_lock or when the object
> > + * has been isolated under mmap_write_lock, ensuring no competing writers.
> > + */
> > static inline void vma_assert_attached(struct vm_area_struct *vma)
> > {
> > - VM_BUG_ON_VMA(vma->detached, vma);
> > + VM_BUG_ON_VMA(!refcount_read(&vma->vm_refcnt), vma);
> > }
> >
> > static inline void vma_assert_detached(struct vm_area_struct *vma)
> > {
> > - VM_BUG_ON_VMA(!vma->detached, vma);
> > + VM_BUG_ON_VMA(refcount_read(&vma->vm_refcnt), vma);
> > }
> >
> > static inline void vma_mark_attached(struct vm_area_struct *vma)
> > {
> > - vma->detached = false;
> > -}
> > -
> > -static inline void vma_mark_detached(struct vm_area_struct *vma)
> > -{
> > - /* When detaching vma should be write-locked */
> > vma_assert_write_locked(vma);
> > - vma->detached = true;
> > + vma_assert_detached(vma);
> > + refcount_set(&vma->vm_refcnt, 1);
> > }
> >
> > -static inline bool is_vma_detached(struct vm_area_struct *vma)
> > -{
> > - return vma->detached;
> > -}
> > +void vma_mark_detached(struct vm_area_struct *vma);
> >
> > static inline void release_fault_lock(struct vm_fault *vmf)
> > {
> > @@ -865,7 +903,8 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> >
> > #else /* CONFIG_PER_VMA_LOCK */
> >
> > -static inline void vma_lock_init(struct vm_area_struct *vma) {}
> > +static inline void vma_lockdep_init(struct vm_area_struct *vma) {}
> > +static inline void vma_init_lock(struct vm_area_struct *vma, bool reset_refcnt) {}
> > static inline bool vma_start_read(struct vm_area_struct *vma)
> > { return false; }
> > static inline void vma_end_read(struct vm_area_struct *vma) {}
> > @@ -908,12 +947,9 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
> > vma->vm_mm = mm;
> > vma->vm_ops = &vma_dummy_vm_ops;
> > INIT_LIST_HEAD(&vma->anon_vma_chain);
> > -#ifdef CONFIG_PER_VMA_LOCK
> > - /* vma is not locked, can't use vma_mark_detached() */
> > - vma->detached = true;
> > -#endif
> > vma_numab_state_init(vma);
> > - vma_lock_init(vma);
> > + vma_lockdep_init(vma);
> > + vma_init_lock(vma, false);
> > }
> >
> > /* Use when VMA is not part of the VMA tree and needs no locking */
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index 6573d95f1d1e..b5312421dec6 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -19,6 +19,7 @@
> > #include <linux/workqueue.h>
> > #include <linux/seqlock.h>
> > #include <linux/percpu_counter.h>
> > +#include <linux/types.h>
> >
> > #include <asm/mmu.h>
> >
> > @@ -629,9 +630,8 @@ static inline struct anon_vma_name *anon_vma_name_alloc(const char *name)
> > }
> > #endif
> >
> > -struct vma_lock {
> > - struct rw_semaphore lock;
> > -};
> > +#define VMA_LOCK_OFFSET 0x40000000
> > +#define VMA_REF_LIMIT (VMA_LOCK_OFFSET - 2)
> >
> > struct vma_numab_state {
> > /*
> > @@ -709,19 +709,13 @@ struct vm_area_struct {
> > };
> >
> > #ifdef CONFIG_PER_VMA_LOCK
> > - /*
> > - * Flag to indicate areas detached from the mm->mm_mt tree.
> > - * Unstable RCU readers are allowed to read this.
> > - */
> > - bool detached;
> > -
> > /*
> > * Can only be written (using WRITE_ONCE()) while holding both:
> > * - mmap_lock (in write mode)
> > - * - vm_lock->lock (in write mode)
> > + * - vm_refcnt bit at VMA_LOCK_OFFSET is set
> > * Can be read reliably while holding one of:
> > * - mmap_lock (in read or write mode)
> > - * - vm_lock->lock (in read or write mode)
> > + * - vm_refcnt bit at VMA_LOCK_OFFSET is set or vm_refcnt > 1
> > * Can be read unreliably (using READ_ONCE()) for pessimistic bailout
> > * while holding nothing (except RCU to keep the VMA struct allocated).
> > *
> > @@ -784,7 +778,10 @@ struct vm_area_struct {
> > struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
> > #ifdef CONFIG_PER_VMA_LOCK
> > /* Unstable RCU readers are allowed to read this. */
> > - struct vma_lock vm_lock ____cacheline_aligned_in_smp;
> > + refcount_t vm_refcnt ____cacheline_aligned_in_smp;
> > +#ifdef CONFIG_DEBUG_LOCK_ALLOC
> > + struct lockdep_map vmlock_dep_map;
> > +#endif
> > #endif
> > } __randomize_layout;
> >
> > @@ -919,6 +916,7 @@ struct mm_struct {
> > * by mmlist_lock
> > */
> > #ifdef CONFIG_PER_VMA_LOCK
> > + struct rcuwait vma_writer_wait;
> > /*
> > * This field has lock-like semantics, meaning it is sometimes
> > * accessed with ACQUIRE/RELEASE semantics.
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index d4c75428ccaf..7a0800d48112 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -463,12 +463,8 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
> > * will be reinitialized.
> > */
> > data_race(memcpy(new, orig, sizeof(*new)));
> > - vma_lock_init(new);
> > + vma_init_lock(new, true);
> > INIT_LIST_HEAD(&new->anon_vma_chain);
> > -#ifdef CONFIG_PER_VMA_LOCK
> > - /* vma is not locked, can't use vma_mark_detached() */
> > - new->detached = true;
> > -#endif
> > vma_numab_state_init(new);
> > dup_anon_vma_name(orig, new);
> >
> > @@ -477,6 +473,8 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
> >
> > void __vm_area_free(struct vm_area_struct *vma)
> > {
> > + /* The vma should be detached while being destroyed. */
> > + vma_assert_detached(vma);
> > vma_numab_state_free(vma);
> > free_anon_vma_name(vma);
> > kmem_cache_free(vm_area_cachep, vma);
> > @@ -488,8 +486,6 @@ static void vm_area_free_rcu_cb(struct rcu_head *head)
> > struct vm_area_struct *vma = container_of(head, struct vm_area_struct,
> > vm_rcu);
> >
> > - /* The vma should not be locked while being destroyed. */
> > - VM_BUG_ON_VMA(rwsem_is_locked(&vma->vm_lock.lock), vma);
> > __vm_area_free(vma);
> > }
> > #endif
> > @@ -1223,6 +1219,9 @@ static inline void mmap_init_lock(struct mm_struct *mm)
> > {
> > init_rwsem(&mm->mmap_lock);
> > mm_lock_seqcount_init(mm);
> > +#ifdef CONFIG_PER_VMA_LOCK
> > + rcuwait_init(&mm->vma_writer_wait);
> > +#endif
> > }
> >
> > static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
> > diff --git a/mm/init-mm.c b/mm/init-mm.c
> > index 6af3ad675930..4600e7605cab 100644
> > --- a/mm/init-mm.c
> > +++ b/mm/init-mm.c
> > @@ -40,6 +40,7 @@ struct mm_struct init_mm = {
> > .arg_lock = __SPIN_LOCK_UNLOCKED(init_mm.arg_lock),
> > .mmlist = LIST_HEAD_INIT(init_mm.mmlist),
> > #ifdef CONFIG_PER_VMA_LOCK
> > + .vma_writer_wait = __RCUWAIT_INITIALIZER(init_mm.vma_writer_wait),
> > .mm_lock_seq = SEQCNT_ZERO(init_mm.mm_lock_seq),
> > #endif
> > .user_ns = &init_user_ns,
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 236fdecd44d6..2def47b5dff0 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -6328,9 +6328,39 @@ struct vm_area_struct *lock_mm_and_find_vma(struct mm_struct *mm,
> > #endif
> >
> > #ifdef CONFIG_PER_VMA_LOCK
> > +static inline bool __vma_enter_locked(struct vm_area_struct *vma, unsigned int tgt_refcnt)
> > +{
> > + /*
> > + * If vma is detached then only vma_mark_attached() can raise the
> > + * vm_refcnt. mmap_write_lock prevents racing with vma_mark_attached().
> > + */
> > + if (!refcount_inc_not_zero(&vma->vm_refcnt))
> > + return false;
>
> Can't the write lock overflow the ref count too?
No. VMA_REF_LIMIT is VMA_LOCK_OFFSET - 2 with one count reserved for a
possible writer (see the comment in vma_start_read()) and there can be
only one writer at a time. So, readers can raise the ref count up to
VMA_LOCK_OFFSET - 2 and a writer can raise it one more to
VMA_LOCK_OFFSET - 1 but not higher.
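In other words (illustrative arithmetic, using the values from the patch):

        /*
         * readers alone       : vm_refcnt <= VMA_REF_LIMIT == VMA_LOCK_OFFSET - 2
         * + the single writer : vm_refcnt <= VMA_LOCK_OFFSET - 1
         *
         * so plain increments can never spill into the VMA_LOCK_OFFSET bit;
         * only the writer's explicit refcount_add(VMA_LOCK_OFFSET) sets it.
         */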
>
> > +
> > + rwsem_acquire(&vma->vmlock_dep_map, 0, 0, _RET_IP_);
> > + /* vma is attached, set the writer present bit */
> > + refcount_add(VMA_LOCK_OFFSET, &vma->vm_refcnt);
> > + rcuwait_wait_event(&vma->vm_mm->vma_writer_wait,
> > + refcount_read(&vma->vm_refcnt) == tgt_refcnt,
> > + TASK_UNINTERRUPTIBLE);
> > + lock_acquired(&vma->vmlock_dep_map, _RET_IP_);
> > +
> > + return true;
> > +}
> > +
> > +static inline void __vma_exit_locked(struct vm_area_struct *vma, bool *detached)
> > +{
> > + *detached = refcount_sub_and_test(VMA_LOCK_OFFSET + 1, &vma->vm_refcnt);
> > + rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
> > +}
> > +
> > void __vma_start_write(struct vm_area_struct *vma, unsigned int mm_lock_seq)
> > {
> > - down_write(&vma->vm_lock.lock);
> > + bool locked;
> > +
> > + /* Wait until refcnt is (VMA_LOCK_OFFSET + 2) => attached with no readers */
> > + locked = __vma_enter_locked(vma, VMA_LOCK_OFFSET + 2);
>
> Does it need to take a ref count at all? Could we just set the write
> bit and wait for it to become 1 instead? That is, 1 would represent
> detached or writer is about to attach/detach it.
Yeah, I think you are right. We can use VMA_LOCK_OFFSET alone without
taking the ref count for the writer. Something like this:
static inline bool __vma_enter_locked(struct vm_area_struct *vma,
                                      unsigned int tgt_refcnt)
{
        /*
         * If vma is detached then only vma_mark_attached() can raise the
         * vm_refcnt. mmap_write_lock prevents racing with vma_mark_attached().
         */
        if (!refcount_add_not_zero(VMA_LOCK_OFFSET, &vma->vm_refcnt))
                return false;

        rwsem_acquire(&vma->vmlock_dep_map, 0, 0, _RET_IP_);
        rcuwait_wait_event(&vma->vm_mm->vma_writer_wait,
                           refcount_read(&vma->vm_refcnt) == tgt_refcnt,
                           TASK_UNINTERRUPTIBLE);
        lock_acquired(&vma->vmlock_dep_map, _RET_IP_);

        return true;
}

static inline void __vma_exit_locked(struct vm_area_struct *vma, bool *detached)
{
        *detached = refcount_sub_and_test(VMA_LOCK_OFFSET, &vma->vm_refcnt);
        rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
}
I'll try that and see if we missed anything.
>
> If we do need it to be ref counted for the writer, we could set the
> write bit and then wait for the ref to be 1 before incrementing it to 2?
> I think this would be safer as we know there is only one writer and the
> readers can only decrease after setting the write bit.
>
> > +
> > /*
> > * We should use WRITE_ONCE() here because we can have concurrent reads
> > * from the early lockless pessimistic check in vma_start_read().
> > @@ -6338,10 +6368,36 @@ void __vma_start_write(struct vm_area_struct *vma, unsigned int mm_lock_seq)
> > * we should use WRITE_ONCE() for cleanliness and to keep KCSAN happy.
> > */
> > WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
> > - up_write(&vma->vm_lock.lock);
> > +
> > + if (locked) {
> > + bool detached;
> > +
> > + __vma_exit_locked(vma, &detached);
> > + VM_BUG_ON_VMA(detached, vma); /* vma should remain attached */
> > + }
> > }
> > EXPORT_SYMBOL_GPL(__vma_start_write);
> >
> > +void vma_mark_detached(struct vm_area_struct *vma)
> > +{
> > + vma_assert_write_locked(vma);
> > + vma_assert_attached(vma);
> > +
> > + /* We are the only writer, so no need to use vma_refcount_put(). */
I should probably add a comment explaining why the condition below is
very unlikely (readers can only temporarily increment vm_refcnt before
they realize the vma is write-locked and drop the reference again, which
is a very narrow window).
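Something like this (sketch of the comment only, wording to be refined):

        /*
         * We are the only writer, so no need to use vma_refcount_put().
         * The refcount can be elevated here only by a reader that raced with
         * us; it will notice the vma is write-locked and drop its reference
         * without using the vma, so this window is very narrow.
         */
        if (unlikely(!refcount_dec_and_test(&vma->vm_refcnt))) {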
> > + if (unlikely(!refcount_dec_and_test(&vma->vm_refcnt))) {
> > + /*
> > + * Wait until refcnt is (VMA_LOCK_OFFSET + 1) => detached with
> > + * no readers
> > + */
> > + if (__vma_enter_locked(vma, VMA_LOCK_OFFSET + 1)) {
> > + bool detached;
> > +
> > + __vma_exit_locked(vma, &detached);
> > + VM_BUG_ON_VMA(!detached, vma);
> > + }
> > + }
> > +}
> > +
> > /*
> > * Lookup and lock a VMA under RCU protection. Returned VMA is guaranteed to be
> > * stable and not isolated. If the VMA is not found or is being modified the
> > @@ -6354,7 +6410,6 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> > struct vm_area_struct *vma;
> >
> > rcu_read_lock();
> > -retry:
> > vma = mas_walk(&mas);
> > if (!vma)
> > goto inval;
> > @@ -6362,13 +6417,6 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> > if (!vma_start_read(vma))
> > goto inval;
> >
> > - /* Check if the VMA got isolated after we found it */
> > - if (is_vma_detached(vma)) {
> > - vma_end_read(vma);
> > - count_vm_vma_lock_event(VMA_LOCK_MISS);
> > - /* The area was replaced with another one */
> > - goto retry;
> > - }
> > /*
> > * At this point, we have a stable reference to a VMA: The VMA is
> > * locked and we know it hasn't already been isolated.
> > diff --git a/tools/testing/vma/linux/atomic.h b/tools/testing/vma/linux/atomic.h
> > index e01f66f98982..2e2021553196 100644
> > --- a/tools/testing/vma/linux/atomic.h
> > +++ b/tools/testing/vma/linux/atomic.h
> > @@ -9,4 +9,9 @@
> > #define atomic_set(x, y) do {} while (0)
> > #define U8_MAX UCHAR_MAX
> >
> > +#ifndef atomic_cmpxchg_relaxed
> > +#define atomic_cmpxchg_relaxed uatomic_cmpxchg
> > +#define atomic_cmpxchg_release uatomic_cmpxchg
> > +#endif /* atomic_cmpxchg_relaxed */
> > +
> > #endif /* _LINUX_ATOMIC_H */
> > diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
> > index 2a624f9304da..1e8cd2f013fa 100644
> > --- a/tools/testing/vma/vma_internal.h
> > +++ b/tools/testing/vma/vma_internal.h
> > @@ -25,7 +25,7 @@
> > #include <linux/maple_tree.h>
> > #include <linux/mm.h>
> > #include <linux/rbtree.h>
> > -#include <linux/rwsem.h>
> > +#include <linux/refcount.h>
> >
> > extern unsigned long stack_guard_gap;
> > #ifdef CONFIG_MMU
> > @@ -132,10 +132,6 @@ typedef __bitwise unsigned int vm_fault_t;
> > */
> > #define pr_warn_once pr_err
> >
> > -typedef struct refcount_struct {
> > - atomic_t refs;
> > -} refcount_t;
> > -
> > struct kref {
> > refcount_t refcount;
> > };
> > @@ -228,15 +224,12 @@ struct mm_struct {
> > unsigned long def_flags;
> > };
> >
> > -struct vma_lock {
> > - struct rw_semaphore lock;
> > -};
> > -
> > -
> > struct file {
> > struct address_space *f_mapping;
> > };
> >
> > +#define VMA_LOCK_OFFSET 0x40000000
> > +
> > struct vm_area_struct {
> > /* The first cache line has the info for VMA tree walking. */
> >
> > @@ -264,16 +257,13 @@ struct vm_area_struct {
> > };
> >
> > #ifdef CONFIG_PER_VMA_LOCK
> > - /* Flag to indicate areas detached from the mm->mm_mt tree */
> > - bool detached;
> > -
> > /*
> > * Can only be written (using WRITE_ONCE()) while holding both:
> > * - mmap_lock (in write mode)
> > - * - vm_lock.lock (in write mode)
> > + * - vm_refcnt bit at VMA_LOCK_OFFSET is set
> > * Can be read reliably while holding one of:
> > * - mmap_lock (in read or write mode)
> > - * - vm_lock.lock (in read or write mode)
> > + * - vm_refcnt bit at VMA_LOCK_OFFSET is set or vm_refcnt > 1
> > * Can be read unreliably (using READ_ONCE()) for pessimistic bailout
> > * while holding nothing (except RCU to keep the VMA struct allocated).
> > *
> > @@ -282,7 +272,6 @@ struct vm_area_struct {
> > * slowpath.
> > */
> > unsigned int vm_lock_seq;
> > - struct vma_lock vm_lock;
> > #endif
> >
> > /*
> > @@ -335,6 +324,10 @@ struct vm_area_struct {
> > struct vma_numab_state *numab_state; /* NUMA Balancing state */
> > #endif
> > struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
> > +#ifdef CONFIG_PER_VMA_LOCK
> > + /* Unstable RCU readers are allowed to read this. */
> > + refcount_t vm_refcnt;
> > +#endif
> > } __randomize_layout;
> >
> > struct vm_fault {};
> > @@ -459,23 +452,41 @@ static inline struct vm_area_struct *vma_next(struct vma_iterator *vmi)
> > return mas_find(&vmi->mas, ULONG_MAX);
> > }
> >
> > -static inline void vma_lock_init(struct vm_area_struct *vma)
> > +/*
> > + * WARNING: to avoid racing with vma_mark_attached()/vma_mark_detached(), these
> > + * assertions should be made either under mmap_write_lock or when the object
> > + * has been isolated under mmap_write_lock, ensuring no competing writers.
> > + */
> > +static inline void vma_assert_attached(struct vm_area_struct *vma)
> > {
> > - init_rwsem(&vma->vm_lock.lock);
> > - vma->vm_lock_seq = UINT_MAX;
> > + VM_BUG_ON_VMA(!refcount_read(&vma->vm_refcnt), vma);
> > }
> >
> > -static inline void vma_mark_attached(struct vm_area_struct *vma)
> > +static inline void vma_assert_detached(struct vm_area_struct *vma)
> > {
> > - vma->detached = false;
> > + VM_BUG_ON_VMA(refcount_read(&vma->vm_refcnt), vma);
> > }
> >
> > static inline void vma_assert_write_locked(struct vm_area_struct *);
> > +static inline void vma_mark_attached(struct vm_area_struct *vma)
> > +{
> > + vma_assert_write_locked(vma);
> > + vma_assert_detached(vma);
> > + refcount_set(&vma->vm_refcnt, 1);
> > +}
> > +
> > static inline void vma_mark_detached(struct vm_area_struct *vma)
> > {
> > - /* When detaching vma should be write-locked */
> > vma_assert_write_locked(vma);
> > - vma->detached = true;
> > + vma_assert_attached(vma);
> > +
> > + /* We are the only writer, so no need to use vma_refcount_put(). */
> > + if (unlikely(!refcount_dec_and_test(&vma->vm_refcnt))) {
> > + /*
> > + * Reader must have temporarily raised vm_refcnt but it will
> > + * drop it without using the vma since vma is write-locked.
> > + */
> > + }
> > }
> >
> > extern const struct vm_operations_struct vma_dummy_vm_ops;
> > @@ -488,9 +499,7 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
> > vma->vm_mm = mm;
> > vma->vm_ops = &vma_dummy_vm_ops;
> > INIT_LIST_HEAD(&vma->anon_vma_chain);
> > - /* vma is not locked, can't use vma_mark_detached() */
> > - vma->detached = true;
> > - vma_lock_init(vma);
> > + vma->vm_lock_seq = UINT_MAX;
> > }
> >
> > static inline struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
> > @@ -513,10 +522,9 @@ static inline struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
> > return NULL;
> >
> > memcpy(new, orig, sizeof(*new));
> > - vma_lock_init(new);
> > + refcount_set(&new->vm_refcnt, 0);
> > + new->vm_lock_seq = UINT_MAX;
> > INIT_LIST_HEAD(&new->anon_vma_chain);
> > - /* vma is not locked, can't use vma_mark_detached() */
> > - new->detached = true;
> >
> > return new;
> > }
> > --
> > 2.47.1.613.gc27f4b7a9f-goog
> >
* Re: [PATCH v7 12/17] mm: replace vm_lock and detached flag with a reference count
2024-12-26 17:07 ` [PATCH v7 12/17] mm: replace vm_lock and detached flag with a reference count Suren Baghdasaryan
2025-01-06 0:38 ` Wei Yang
2025-01-07 18:44 ` Liam R. Howlett
@ 2025-01-08 11:52 ` Vlastimil Babka
2025-01-08 17:53 ` Suren Baghdasaryan
2 siblings, 1 reply; 65+ messages in thread
From: Vlastimil Babka @ 2025-01-08 11:52 UTC (permalink / raw)
To: Suren Baghdasaryan, akpm
Cc: peterz, willy, liam.howlett, lorenzo.stoakes, mhocko, hannes,
mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team
On 12/26/24 18:07, Suren Baghdasaryan wrote:
> rw_semaphore is a sizable structure of 40 bytes and consumes
> considerable space for each vm_area_struct. However vma_lock has
> two important specifics which can be used to replace rw_semaphore
> with a simpler structure:
> 1. Readers never wait. They try to take the vma_lock and fall back to
> mmap_lock if that fails.
> 2. Only one writer at a time will ever try to write-lock a vma_lock
> because writers first take mmap_lock in write mode.
> Because of these requirements, full rw_semaphore functionality is not
> needed and we can replace rw_semaphore and the vma->detached flag with
> a refcount (vm_refcnt).
> When vma is in detached state, vm_refcnt is 0 and only a call to
> vma_mark_attached() can take it out of this state. Note that unlike
> before, now we enforce both vma_mark_attached() and vma_mark_detached()
> to be done only after vma has been write-locked. vma_mark_attached()
> changes vm_refcnt to 1 to indicate that it has been attached to the vma
> tree. When a reader takes read lock, it increments vm_refcnt, unless the
> top usable bit of vm_refcnt (0x40000000) is set, indicating presence of
> a writer. When writer takes write lock, it both increments vm_refcnt and
> sets the top usable bit to indicate its presence. If there are readers,
> writer will wait using newly introduced mm->vma_writer_wait. Since all
> writers take mmap_lock in write mode first, there can be only one writer
> at a time. The last reader to release the lock will signal the writer
> to wake up.
> refcount might overflow if there are many competing readers, in which case
> read-locking will fail. Readers are expected to handle such failures.
>
> Suggested-by: Peter Zijlstra <peterz@infradead.org>
> Suggested-by: Matthew Wilcox <willy@infradead.org>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> */
> static inline bool vma_start_read(struct vm_area_struct *vma)
> {
> + int oldcnt;
> +
> /*
> * Check before locking. A race might cause false locked result.
> * We can use READ_ONCE() for the mm_lock_seq here, and don't need
> @@ -720,13 +745,20 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq.sequence))
> return false;
>
> - if (unlikely(down_read_trylock(&vma->vm_lock.lock) == 0))
> +
> + rwsem_acquire_read(&vma->vmlock_dep_map, 0, 0, _RET_IP_);
I don't know much about lockdep, but I see that down_read() does
rwsem_acquire_read(&sem->dep_map, 0, 0, _RET_IP_);
down_read_trylock() does
rwsem_acquire_read(&sem->dep_map, 0, 1, _RET_IP_);
This is passing the down_read()-like variant but it behaves like a trylock, no?
> + /* Limit at VMA_REF_LIMIT to leave one count for a writer */
It's mainly so the count does not grow to the point where the VMA_LOCK_OFFSET bit
could become false-positively set by readers, right? The "leave one count" sounds
like an implementation detail of VMA_REF_LIMIT and will change if Liam's
suggestion is proven feasible?
> + if (unlikely(!__refcount_inc_not_zero_limited(&vma->vm_refcnt, &oldcnt,
> + VMA_REF_LIMIT))) {
> + rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
> return false;
> + }
> + lock_acquired(&vma->vmlock_dep_map, _RET_IP_);
>
> /*
> - * Overflow might produce false locked result.
> + * Overflow of vm_lock_seq/mm_lock_seq might produce false locked result.
> * False unlocked result is impossible because we modify and check
> - * vma->vm_lock_seq under vma->vm_lock protection and mm->mm_lock_seq
> + * vma->vm_lock_seq under vma->vm_refcnt protection and mm->mm_lock_seq
> * modification invalidates all existing locks.
> *
> * We must use ACQUIRE semantics for the mm_lock_seq so that if we are
> @@ -734,10 +766,12 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> * after it has been unlocked.
> * This pairs with RELEASE semantics in vma_end_write_all().
> */
> - if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&vma->vm_mm->mm_lock_seq))) {
> - up_read(&vma->vm_lock.lock);
> + if (unlikely(oldcnt & VMA_LOCK_OFFSET ||
> + vma->vm_lock_seq == raw_read_seqcount(&vma->vm_mm->mm_lock_seq))) {
> + vma_refcount_put(vma);
> return false;
> }
> +
> return true;
> }
>
> @@ -749,8 +783,17 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> */
> static inline bool vma_start_read_locked_nested(struct vm_area_struct *vma, int subclass)
> {
> + int oldcnt;
> +
> mmap_assert_locked(vma->vm_mm);
> - down_read_nested(&vma->vm_lock.lock, subclass);
> + rwsem_acquire_read(&vma->vmlock_dep_map, subclass, 0, _RET_IP_);
Same as above?
> + /* Limit at VMA_REF_LIMIT to leave one count for a writer */
Also
> + if (unlikely(!__refcount_inc_not_zero_limited(&vma->vm_refcnt, &oldcnt,
> + VMA_REF_LIMIT))) {
> + rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
> + return false;
> + }
> + lock_acquired(&vma->vmlock_dep_map, _RET_IP_);
> return true;
> }
>
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v7 12/17] mm: replace vm_lock and detached flag with a reference count
2025-01-08 11:52 ` Vlastimil Babka
@ 2025-01-08 17:53 ` Suren Baghdasaryan
0 siblings, 0 replies; 65+ messages in thread
From: Suren Baghdasaryan @ 2025-01-08 17:53 UTC (permalink / raw)
To: Vlastimil Babka
Cc: akpm, peterz, willy, liam.howlett, lorenzo.stoakes, mhocko,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team
On Wed, Jan 8, 2025 at 3:52 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 12/26/24 18:07, Suren Baghdasaryan wrote:
> > rw_semaphore is a sizable structure of 40 bytes and consumes
> > considerable space for each vm_area_struct. However vma_lock has
> > two important specifics which can be used to replace rw_semaphore
> > with a simpler structure:
> > 1. Readers never wait. They try to take the vma_lock and fall back to
> > mmap_lock if that fails.
> > 2. Only one writer at a time will ever try to write-lock a vma_lock
> > because writers first take mmap_lock in write mode.
> > Because of these requirements, full rw_semaphore functionality is not
> > needed and we can replace rw_semaphore and the vma->detached flag with
> > a refcount (vm_refcnt).
> > When vma is in detached state, vm_refcnt is 0 and only a call to
> > vma_mark_attached() can take it out of this state. Note that unlike
> > before, now we enforce both vma_mark_attached() and vma_mark_detached()
> > to be done only after vma has been write-locked. vma_mark_attached()
> > changes vm_refcnt to 1 to indicate that it has been attached to the vma
> > tree. When a reader takes read lock, it increments vm_refcnt, unless the
> > top usable bit of vm_refcnt (0x40000000) is set, indicating presence of
> > a writer. When writer takes write lock, it both increments vm_refcnt and
> > sets the top usable bit to indicate its presence. If there are readers,
> > writer will wait using newly introduced mm->vma_writer_wait. Since all
> > writers take mmap_lock in write mode first, there can be only one writer
> > at a time. The last reader to release the lock will signal the writer
> > to wake up.
> > refcount might overflow if there are many competing readers, in which case
> > read-locking will fail. Readers are expected to handle such failures.
> >
> > Suggested-by: Peter Zijlstra <peterz@infradead.org>
> > Suggested-by: Matthew Wilcox <willy@infradead.org>
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
>
> > */
> > static inline bool vma_start_read(struct vm_area_struct *vma)
> > {
> > + int oldcnt;
> > +
> > /*
> > * Check before locking. A race might cause false locked result.
> > * We can use READ_ONCE() for the mm_lock_seq here, and don't need
> > @@ -720,13 +745,20 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> > if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq.sequence))
> > return false;
> >
> > - if (unlikely(down_read_trylock(&vma->vm_lock.lock) == 0))
> > +
> > + rwsem_acquire_read(&vma->vmlock_dep_map, 0, 0, _RET_IP_);
>
> I don't know much about lockdep, but I see that down_read() does
>
> rwsem_acquire_read(&sem->dep_map, 0, 0, _RET_IP_);
>
> down_read_trylock() does
>
> rwsem_acquire_read(&sem->dep_map, 0, 1, _RET_IP_);
>
> This is passing the down_read()-like variant but it behaves like a trylock, no?
Yes, you are correct, this should behave like a trylock. I'll fix it.
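If it helps to visualize, the fix could be as small as flipping the trylock argument in the annotation (a sketch based on this exchange, not the hunk that was eventually posted):

        -       rwsem_acquire_read(&vma->vmlock_dep_map, 0, 0, _RET_IP_);
        +       /* trylock == 1: acquisition may fail, mirroring down_read_trylock() */
        +       rwsem_acquire_read(&vma->vmlock_dep_map, 0, 1, _RET_IP_);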
>
> > + /* Limit at VMA_REF_LIMIT to leave one count for a writer */
>
> It's mainly so the count does not grow to the point where the VMA_LOCK_OFFSET bit
> could become false-positively set by readers, right?
Correct.
> The "leave one count" sounds
> like an implementation detail of VMA_REF_LIMIT and will change if Liam's
> suggestion is proven feasible?
Yes. I already tested Liam's suggestion and it seems to be working
fine. This comment will be gone in the next revision.
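For orientation, one way the limit and the writer bit could relate (these definitions are an assumption for illustration, not quoted from the series):

        #define VMA_LOCK_OFFSET 0x40000000              /* writer-presence bit        */
        #define VMA_REF_LIMIT   (VMA_LOCK_OFFSET - 1)   /* cap reader counts below it */

Keeping reader increments capped below the bit is what avoids the "false positively set" scenario mentioned above.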
>
> > + if (unlikely(!__refcount_inc_not_zero_limited(&vma->vm_refcnt, &oldcnt,
> > + VMA_REF_LIMIT))) {
> > + rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
> > return false;
> > + }
> > + lock_acquired(&vma->vmlock_dep_map, _RET_IP_);
> >
> > /*
> > - * Overflow might produce false locked result.
> > + * Overflow of vm_lock_seq/mm_lock_seq might produce false locked result.
> > * False unlocked result is impossible because we modify and check
> > - * vma->vm_lock_seq under vma->vm_lock protection and mm->mm_lock_seq
> > + * vma->vm_lock_seq under vma->vm_refcnt protection and mm->mm_lock_seq
> > * modification invalidates all existing locks.
> > *
> > * We must use ACQUIRE semantics for the mm_lock_seq so that if we are
> > @@ -734,10 +766,12 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> > * after it has been unlocked.
> > * This pairs with RELEASE semantics in vma_end_write_all().
> > */
> > - if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&vma->vm_mm->mm_lock_seq))) {
> > - up_read(&vma->vm_lock.lock);
> > + if (unlikely(oldcnt & VMA_LOCK_OFFSET ||
> > + vma->vm_lock_seq == raw_read_seqcount(&vma->vm_mm->mm_lock_seq))) {
> > + vma_refcount_put(vma);
> > return false;
> > }
> > +
> > return true;
> > }
> >
> > @@ -749,8 +783,17 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> > */
> > static inline bool vma_start_read_locked_nested(struct vm_area_struct *vma, int subclass)
> > {
> > + int oldcnt;
> > +
> > mmap_assert_locked(vma->vm_mm);
> > - down_read_nested(&vma->vm_lock.lock, subclass);
> > + rwsem_acquire_read(&vma->vmlock_dep_map, subclass, 0, _RET_IP_);
>
> Same as above?
Ack.
>
> > + /* Limit at VMA_REF_LIMIT to leave one count for a writer */
>
> Also
Ack.
>
> > + if (unlikely(!__refcount_inc_not_zero_limited(&vma->vm_refcnt, &oldcnt,
> > + VMA_REF_LIMIT))) {
> > + rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
> > + return false;
> > + }
> > + lock_acquired(&vma->vmlock_dep_map, _RET_IP_);
> > return true;
> > }
> >
^ permalink raw reply [flat|nested] 65+ messages in thread
* [PATCH v7 13/17] mm/debug: print vm_refcnt state when dumping the vma
2024-12-26 17:06 [PATCH v7 00/17] move per-vma lock into vm_area_struct Suren Baghdasaryan
` (11 preceding siblings ...)
2024-12-26 17:07 ` [PATCH v7 12/17] mm: replace vm_lock and detached flag with a reference count Suren Baghdasaryan
@ 2024-12-26 17:07 ` Suren Baghdasaryan
2024-12-26 19:40 ` kernel test robot
2024-12-26 20:13 ` kernel test robot
2024-12-26 17:07 ` [PATCH v7 14/17] mm: remove extra vma_numab_state_init() call Suren Baghdasaryan
` (3 subsequent siblings)
16 siblings, 2 replies; 65+ messages in thread
From: Suren Baghdasaryan @ 2024-12-26 17:07 UTC (permalink / raw)
To: akpm
Cc: peterz, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team, surenb
vm_refcnt encodes a number of useful states:
- whether vma is attached or detached
- the number of current vma readers
- presence of a vma writer
Let's include it in the vma dump.
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
mm/debug.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/mm/debug.c b/mm/debug.c
index 95b6ab809c0e..68b3ba3cf603 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -181,12 +181,12 @@ void dump_vma(const struct vm_area_struct *vma)
pr_emerg("vma %px start %px end %px mm %px\n"
"prot %lx anon_vma %px vm_ops %px\n"
"pgoff %lx file %px private_data %px\n"
- "flags: %#lx(%pGv)\n",
+ "flags: %#lx(%pGv) refcnt %x\n",
vma, (void *)vma->vm_start, (void *)vma->vm_end, vma->vm_mm,
(unsigned long)pgprot_val(vma->vm_page_prot),
vma->anon_vma, vma->vm_ops, vma->vm_pgoff,
vma->vm_file, vma->vm_private_data,
- vma->vm_flags, &vma->vm_flags);
+ vma->vm_flags, &vma->vm_flags, refcount_read(&vma->vm_refcnt));
}
EXPORT_SYMBOL(dump_vma);
--
2.47.1.613.gc27f4b7a9f-goog
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v7 13/17] mm/debug: print vm_refcnt state when dumping the vma
2024-12-26 17:07 ` [PATCH v7 13/17] mm/debug: print vm_refcnt state when dumping the vma Suren Baghdasaryan
@ 2024-12-26 19:40 ` kernel test robot
2024-12-26 19:51 ` Suren Baghdasaryan
2024-12-26 20:13 ` kernel test robot
1 sibling, 1 reply; 65+ messages in thread
From: kernel test robot @ 2024-12-26 19:40 UTC (permalink / raw)
To: Suren Baghdasaryan, akpm
Cc: llvm, oe-kbuild-all, peterz, willy, liam.howlett,
lorenzo.stoakes, mhocko, vbabka, hannes, mjguzik, oliver.sang,
mgorman, david, peterx, oleg, dave, paulmck, brauner, dhowells,
hdanton, hughd, lokeshgidra, minchan, jannh, shakeel.butt,
souravpanda, pasha.tatashin, klarasmodin, corbet, linux-doc,
linux-mm
Hi Suren,
kernel test robot noticed the following build errors:
[auto build test ERROR on 431614f1580a03c1a653340c55ea76bd12a9403f]
url: https://github.com/intel-lab-lkp/linux/commits/Suren-Baghdasaryan/mm-introduce-vma_start_read_locked-_nested-helpers/20241227-010925
base: 431614f1580a03c1a653340c55ea76bd12a9403f
patch link: https://lore.kernel.org/r/20241226170710.1159679-14-surenb%40google.com
patch subject: [PATCH v7 13/17] mm/debug: print vm_refcnt state when dumping the vma
config: hexagon-randconfig-002-20241227 (https://download.01.org/0day-ci/archive/20241227/202412270328.HiJX1mRt-lkp@intel.com/config)
compiler: clang version 17.0.6 (https://github.com/llvm/llvm-project 6009708b4367171ccdbf4b5905cb6a803753fe18)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241227/202412270328.HiJX1mRt-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202412270328.HiJX1mRt-lkp@intel.com/
All errors (new ones prefixed by >>):
>> mm/debug.c:189:54: error: no member named 'vm_refcnt' in 'struct vm_area_struct'
189 | vma->vm_flags, &vma->vm_flags, refcount_read(&vma->vm_refcnt));
| ~~~ ^
include/linux/printk.h:514:35: note: expanded from macro 'pr_emerg'
514 | printk(KERN_EMERG pr_fmt(fmt), ##__VA_ARGS__)
| ^~~~~~~~~~~
include/linux/printk.h:501:60: note: expanded from macro 'printk'
501 | #define printk(fmt, ...) printk_index_wrap(_printk, fmt, ##__VA_ARGS__)
| ^~~~~~~~~~~
include/linux/printk.h:473:19: note: expanded from macro 'printk_index_wrap'
473 | _p_func(_fmt, ##__VA_ARGS__); \
| ^~~~~~~~~~~
1 error generated.
vim +189 mm/debug.c
178
179 void dump_vma(const struct vm_area_struct *vma)
180 {
181 pr_emerg("vma %px start %px end %px mm %px\n"
182 "prot %lx anon_vma %px vm_ops %px\n"
183 "pgoff %lx file %px private_data %px\n"
184 "flags: %#lx(%pGv) refcnt %x\n",
185 vma, (void *)vma->vm_start, (void *)vma->vm_end, vma->vm_mm,
186 (unsigned long)pgprot_val(vma->vm_page_prot),
187 vma->anon_vma, vma->vm_ops, vma->vm_pgoff,
188 vma->vm_file, vma->vm_private_data,
> 189 vma->vm_flags, &vma->vm_flags, refcount_read(&vma->vm_refcnt));
190 }
191 EXPORT_SYMBOL(dump_vma);
192
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v7 13/17] mm/debug: print vm_refcnt state when dumping the vma
2024-12-26 19:40 ` kernel test robot
@ 2024-12-26 19:51 ` Suren Baghdasaryan
2024-12-26 19:54 ` Suren Baghdasaryan
0 siblings, 1 reply; 65+ messages in thread
From: Suren Baghdasaryan @ 2024-12-26 19:51 UTC (permalink / raw)
To: kernel test robot
Cc: akpm, llvm, oe-kbuild-all, peterz, willy, liam.howlett,
lorenzo.stoakes, mhocko, vbabka, hannes, mjguzik, oliver.sang,
mgorman, david, peterx, oleg, dave, paulmck, brauner, dhowells,
hdanton, hughd, lokeshgidra, minchan, jannh, shakeel.butt,
souravpanda, pasha.tatashin, klarasmodin, corbet, linux-doc,
linux-mm
On Thu, Dec 26, 2024 at 11:41 AM kernel test robot <lkp@intel.com> wrote:
>
> Hi Suren,
>
> kernel test robot noticed the following build errors:
>
> [auto build test ERROR on 431614f1580a03c1a653340c55ea76bd12a9403f]
>
> url: https://github.com/intel-lab-lkp/linux/commits/Suren-Baghdasaryan/mm-introduce-vma_start_read_locked-_nested-helpers/20241227-010925
> base: 431614f1580a03c1a653340c55ea76bd12a9403f
> patch link: https://lore.kernel.org/r/20241226170710.1159679-14-surenb%40google.com
> patch subject: [PATCH v7 13/17] mm/debug: print vm_refcnt state when dumping the vma
> config: hexagon-randconfig-002-20241227 (https://download.01.org/0day-ci/archive/20241227/202412270328.HiJX1mRt-lkp@intel.com/config)
> compiler: clang version 17.0.6 (https://github.com/llvm/llvm-project 6009708b4367171ccdbf4b5905cb6a803753fe18)
> reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241227/202412270328.HiJX1mRt-lkp@intel.com/reproduce)
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <lkp@intel.com>
> | Closes: https://lore.kernel.org/oe-kbuild-all/202412270328.HiJX1mRt-lkp@intel.com/
>
> All errors (new ones prefixed by >>):
>
> >> mm/debug.c:189:54: error: no member named 'vm_refcnt' in 'struct vm_area_struct'
Ugh, ok. On architectures with ARCH_SUPPORTS_PER_VMA_LOCK=n this will
happen. Preparing a fix.
> 189 | vma->vm_flags, &vma->vm_flags, refcount_read(&vma->vm_refcnt));
> | ~~~ ^
> include/linux/printk.h:514:35: note: expanded from macro 'pr_emerg'
> 514 | printk(KERN_EMERG pr_fmt(fmt), ##__VA_ARGS__)
> | ^~~~~~~~~~~
> include/linux/printk.h:501:60: note: expanded from macro 'printk'
> 501 | #define printk(fmt, ...) printk_index_wrap(_printk, fmt, ##__VA_ARGS__)
> | ^~~~~~~~~~~
> include/linux/printk.h:473:19: note: expanded from macro 'printk_index_wrap'
> 473 | _p_func(_fmt, ##__VA_ARGS__); \
> | ^~~~~~~~~~~
> 1 error generated.
>
>
> vim +189 mm/debug.c
>
> 178
> 179 void dump_vma(const struct vm_area_struct *vma)
> 180 {
> 181 pr_emerg("vma %px start %px end %px mm %px\n"
> 182 "prot %lx anon_vma %px vm_ops %px\n"
> 183 "pgoff %lx file %px private_data %px\n"
> 184 "flags: %#lx(%pGv) refcnt %x\n",
> 185 vma, (void *)vma->vm_start, (void *)vma->vm_end, vma->vm_mm,
> 186 (unsigned long)pgprot_val(vma->vm_page_prot),
> 187 vma->anon_vma, vma->vm_ops, vma->vm_pgoff,
> 188 vma->vm_file, vma->vm_private_data,
> > 189 vma->vm_flags, &vma->vm_flags, refcount_read(&vma->vm_refcnt));
> 190 }
> 191 EXPORT_SYMBOL(dump_vma);
> 192
>
> --
> 0-DAY CI Kernel Test Service
> https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v7 13/17] mm/debug: print vm_refcnt state when dumping the vma
2024-12-26 19:51 ` Suren Baghdasaryan
@ 2024-12-26 19:54 ` Suren Baghdasaryan
2024-12-26 20:04 ` Suren Baghdasaryan
0 siblings, 1 reply; 65+ messages in thread
From: Suren Baghdasaryan @ 2024-12-26 19:54 UTC (permalink / raw)
To: kernel test robot
Cc: akpm, llvm, oe-kbuild-all, peterz, willy, liam.howlett,
lorenzo.stoakes, mhocko, vbabka, hannes, mjguzik, oliver.sang,
mgorman, david, peterx, oleg, dave, paulmck, brauner, dhowells,
hdanton, hughd, lokeshgidra, minchan, jannh, shakeel.butt,
souravpanda, pasha.tatashin, klarasmodin, corbet, linux-doc,
linux-mm
On Thu, Dec 26, 2024 at 11:51 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Thu, Dec 26, 2024 at 11:41 AM kernel test robot <lkp@intel.com> wrote:
> >
> > Hi Suren,
> >
> > kernel test robot noticed the following build errors:
> >
> > [auto build test ERROR on 431614f1580a03c1a653340c55ea76bd12a9403f]
> >
> > url: https://github.com/intel-lab-lkp/linux/commits/Suren-Baghdasaryan/mm-introduce-vma_start_read_locked-_nested-helpers/20241227-010925
> > base: 431614f1580a03c1a653340c55ea76bd12a9403f
> > patch link: https://lore.kernel.org/r/20241226170710.1159679-14-surenb%40google.com
> > patch subject: [PATCH v7 13/17] mm/debug: print vm_refcnt state when dumping the vma
> > config: hexagon-randconfig-002-20241227 (https://download.01.org/0day-ci/archive/20241227/202412270328.HiJX1mRt-lkp@intel.com/config)
> > compiler: clang version 17.0.6 (https://github.com/llvm/llvm-project 6009708b4367171ccdbf4b5905cb6a803753fe18)
> > reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241227/202412270328.HiJX1mRt-lkp@intel.com/reproduce)
> >
> > If you fix the issue in a separate patch/commit (i.e. not just a new version of
> > the same patch/commit), kindly add following tags
> > | Reported-by: kernel test robot <lkp@intel.com>
> > | Closes: https://lore.kernel.org/oe-kbuild-all/202412270328.HiJX1mRt-lkp@intel.com/
> >
> > All errors (new ones prefixed by >>):
> >
> > >> mm/debug.c:189:54: error: no member named 'vm_refcnt' in 'struct vm_area_struct'
>
> Ugh, ok. On architectures with ARCH_SUPPORTS_PER_VMA_LOCK=n this will
> happen. Preparing a fix.
This error actually also requires CONFIG_DEBUG_VM=y.
>
> > 189 | vma->vm_flags, &vma->vm_flags, refcount_read(&vma->vm_refcnt));
> > | ~~~ ^
> > include/linux/printk.h:514:35: note: expanded from macro 'pr_emerg'
> > 514 | printk(KERN_EMERG pr_fmt(fmt), ##__VA_ARGS__)
> > | ^~~~~~~~~~~
> > include/linux/printk.h:501:60: note: expanded from macro 'printk'
> > 501 | #define printk(fmt, ...) printk_index_wrap(_printk, fmt, ##__VA_ARGS__)
> > | ^~~~~~~~~~~
> > include/linux/printk.h:473:19: note: expanded from macro 'printk_index_wrap'
> > 473 | _p_func(_fmt, ##__VA_ARGS__); \
> > | ^~~~~~~~~~~
> > 1 error generated.
> >
> >
> > vim +189 mm/debug.c
> >
> > 178
> > 179 void dump_vma(const struct vm_area_struct *vma)
> > 180 {
> > 181 pr_emerg("vma %px start %px end %px mm %px\n"
> > 182 "prot %lx anon_vma %px vm_ops %px\n"
> > 183 "pgoff %lx file %px private_data %px\n"
> > 184 "flags: %#lx(%pGv) refcnt %x\n",
> > 185 vma, (void *)vma->vm_start, (void *)vma->vm_end, vma->vm_mm,
> > 186 (unsigned long)pgprot_val(vma->vm_page_prot),
> > 187 vma->anon_vma, vma->vm_ops, vma->vm_pgoff,
> > 188 vma->vm_file, vma->vm_private_data,
> > > 189 vma->vm_flags, &vma->vm_flags, refcount_read(&vma->vm_refcnt));
> > 190 }
> > 191 EXPORT_SYMBOL(dump_vma);
> > 192
> >
> > --
> > 0-DAY CI Kernel Test Service
> > https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v7 13/17] mm/debug: print vm_refcnt state when dumping the vma
2024-12-26 19:54 ` Suren Baghdasaryan
@ 2024-12-26 20:04 ` Suren Baghdasaryan
0 siblings, 0 replies; 65+ messages in thread
From: Suren Baghdasaryan @ 2024-12-26 20:04 UTC (permalink / raw)
To: kernel test robot
Cc: akpm, llvm, oe-kbuild-all, peterz, willy, liam.howlett,
lorenzo.stoakes, mhocko, vbabka, hannes, mjguzik, oliver.sang,
mgorman, david, peterx, oleg, dave, paulmck, brauner, dhowells,
hdanton, hughd, lokeshgidra, minchan, jannh, shakeel.butt,
souravpanda, pasha.tatashin, klarasmodin, corbet, linux-doc,
linux-mm
On Thu, Dec 26, 2024 at 11:54 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Thu, Dec 26, 2024 at 11:51 AM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Thu, Dec 26, 2024 at 11:41 AM kernel test robot <lkp@intel.com> wrote:
> > >
> > > Hi Suren,
> > >
> > > kernel test robot noticed the following build errors:
> > >
> > > [auto build test ERROR on 431614f1580a03c1a653340c55ea76bd12a9403f]
> > >
> > > url: https://github.com/intel-lab-lkp/linux/commits/Suren-Baghdasaryan/mm-introduce-vma_start_read_locked-_nested-helpers/20241227-010925
> > > base: 431614f1580a03c1a653340c55ea76bd12a9403f
> > > patch link: https://lore.kernel.org/r/20241226170710.1159679-14-surenb%40google.com
> > > patch subject: [PATCH v7 13/17] mm/debug: print vm_refcnt state when dumping the vma
> > > config: hexagon-randconfig-002-20241227 (https://download.01.org/0day-ci/archive/20241227/202412270328.HiJX1mRt-lkp@intel.com/config)
> > > compiler: clang version 17.0.6 (https://github.com/llvm/llvm-project 6009708b4367171ccdbf4b5905cb6a803753fe18)
> > > reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241227/202412270328.HiJX1mRt-lkp@intel.com/reproduce)
> > >
> > > If you fix the issue in a separate patch/commit (i.e. not just a new version of
> > > the same patch/commit), kindly add following tags
> > > | Reported-by: kernel test robot <lkp@intel.com>
> > > | Closes: https://lore.kernel.org/oe-kbuild-all/202412270328.HiJX1mRt-lkp@intel.com/
> > >
> > > All errors (new ones prefixed by >>):
> > >
> > > >> mm/debug.c:189:54: error: no member named 'vm_refcnt' in 'struct vm_area_struct'
> >
> > Ugh, ok. On architectures with ARCH_SUPPORTS_PER_VMA_LOCK=n this will
> > happen. Preparing a fix.
>
> This error actually also requires CONFIG_DEBUG_VM=y.
Fixup posted at
https://lore.kernel.org/all/20241226200335.1250078-1-surenb@google.com/
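Independent of that posted fixup, a minimal sketch of one way to keep !CONFIG_PER_VMA_LOCK (with CONFIG_DEBUG_VM=y) builds happy; vma_refcnt_val() is a hypothetical helper used only for this illustration:

        #ifdef CONFIG_PER_VMA_LOCK
        #define vma_refcnt_val(vma)     refcount_read(&(vma)->vm_refcnt)
        #else
        #define vma_refcnt_val(vma)     0       /* struct has no vm_refcnt here */
        #endif

dump_vma() would then print vma_refcnt_val(vma) unconditionally.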
>
> >
> > > 189 | vma->vm_flags, &vma->vm_flags, refcount_read(&vma->vm_refcnt));
> > > | ~~~ ^
> > > include/linux/printk.h:514:35: note: expanded from macro 'pr_emerg'
> > > 514 | printk(KERN_EMERG pr_fmt(fmt), ##__VA_ARGS__)
> > > | ^~~~~~~~~~~
> > > include/linux/printk.h:501:60: note: expanded from macro 'printk'
> > > 501 | #define printk(fmt, ...) printk_index_wrap(_printk, fmt, ##__VA_ARGS__)
> > > | ^~~~~~~~~~~
> > > include/linux/printk.h:473:19: note: expanded from macro 'printk_index_wrap'
> > > 473 | _p_func(_fmt, ##__VA_ARGS__); \
> > > | ^~~~~~~~~~~
> > > 1 error generated.
> > >
> > >
> > > vim +189 mm/debug.c
> > >
> > > 178
> > > 179 void dump_vma(const struct vm_area_struct *vma)
> > > 180 {
> > > 181 pr_emerg("vma %px start %px end %px mm %px\n"
> > > 182 "prot %lx anon_vma %px vm_ops %px\n"
> > > 183 "pgoff %lx file %px private_data %px\n"
> > > 184 "flags: %#lx(%pGv) refcnt %x\n",
> > > 185 vma, (void *)vma->vm_start, (void *)vma->vm_end, vma->vm_mm,
> > > 186 (unsigned long)pgprot_val(vma->vm_page_prot),
> > > 187 vma->anon_vma, vma->vm_ops, vma->vm_pgoff,
> > > 188 vma->vm_file, vma->vm_private_data,
> > > > 189 vma->vm_flags, &vma->vm_flags, refcount_read(&vma->vm_refcnt));
> > > 190 }
> > > 191 EXPORT_SYMBOL(dump_vma);
> > > 192
> > >
> > > --
> > > 0-DAY CI Kernel Test Service
> > > https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v7 13/17] mm/debug: print vm_refcnt state when dumping the vma
2024-12-26 17:07 ` [PATCH v7 13/17] mm/debug: print vm_refcnt state when dumping the vma Suren Baghdasaryan
2024-12-26 19:40 ` kernel test robot
@ 2024-12-26 20:13 ` kernel test robot
1 sibling, 0 replies; 65+ messages in thread
From: kernel test robot @ 2024-12-26 20:13 UTC (permalink / raw)
To: Suren Baghdasaryan, akpm
Cc: oe-kbuild-all, peterz, willy, liam.howlett, lorenzo.stoakes,
mhocko, vbabka, hannes, mjguzik, oliver.sang, mgorman, david,
peterx, oleg, dave, paulmck, brauner, dhowells, hdanton, hughd,
lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
pasha.tatashin, klarasmodin, corbet, linux-doc, linux-mm
Hi Suren,
kernel test robot noticed the following build errors:
[auto build test ERROR on 431614f1580a03c1a653340c55ea76bd12a9403f]
url: https://github.com/intel-lab-lkp/linux/commits/Suren-Baghdasaryan/mm-introduce-vma_start_read_locked-_nested-helpers/20241227-010925
base: 431614f1580a03c1a653340c55ea76bd12a9403f
patch link: https://lore.kernel.org/r/20241226170710.1159679-14-surenb%40google.com
patch subject: [PATCH v7 13/17] mm/debug: print vm_refcnt state when dumping the vma
config: riscv-randconfig-001-20241227 (https://download.01.org/0day-ci/archive/20241227/202412270304.GtgNiDPd-lkp@intel.com/config)
compiler: riscv64-linux-gcc (GCC) 14.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241227/202412270304.GtgNiDPd-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202412270304.GtgNiDPd-lkp@intel.com/
All errors (new ones prefixed by >>):
In file included from include/asm-generic/bug.h:22,
from arch/riscv/include/asm/bug.h:83,
from include/linux/bug.h:5,
from arch/riscv/include/asm/cmpxchg.h:9,
from arch/riscv/include/asm/barrier.h:14,
from include/asm-generic/bitops/generic-non-atomic.h:7,
from include/linux/bitops.h:29,
from include/linux/kernel.h:23,
from mm/debug.c:9:
mm/debug.c: In function 'dump_vma':
>> mm/debug.c:189:66: error: 'const struct vm_area_struct' has no member named 'vm_refcnt'
189 | vma->vm_flags, &vma->vm_flags, refcount_read(&vma->vm_refcnt));
| ^~
include/linux/printk.h:473:33: note: in definition of macro 'printk_index_wrap'
473 | _p_func(_fmt, ##__VA_ARGS__); \
| ^~~~~~~~~~~
include/linux/printk.h:514:9: note: in expansion of macro 'printk'
514 | printk(KERN_EMERG pr_fmt(fmt), ##__VA_ARGS__)
| ^~~~~~
mm/debug.c:181:9: note: in expansion of macro 'pr_emerg'
181 | pr_emerg("vma %px start %px end %px mm %px\n"
| ^~~~~~~~
vim +189 mm/debug.c
178
179 void dump_vma(const struct vm_area_struct *vma)
180 {
181 pr_emerg("vma %px start %px end %px mm %px\n"
182 "prot %lx anon_vma %px vm_ops %px\n"
183 "pgoff %lx file %px private_data %px\n"
184 "flags: %#lx(%pGv) refcnt %x\n",
185 vma, (void *)vma->vm_start, (void *)vma->vm_end, vma->vm_mm,
186 (unsigned long)pgprot_val(vma->vm_page_prot),
187 vma->anon_vma, vma->vm_ops, vma->vm_pgoff,
188 vma->vm_file, vma->vm_private_data,
> 189 vma->vm_flags, &vma->vm_flags, refcount_read(&vma->vm_refcnt));
190 }
191 EXPORT_SYMBOL(dump_vma);
192
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 65+ messages in thread
* [PATCH v7 14/17] mm: remove extra vma_numab_state_init() call
2024-12-26 17:06 [PATCH v7 00/17] move per-vma lock into vm_area_struct Suren Baghdasaryan
` (12 preceding siblings ...)
2024-12-26 17:07 ` [PATCH v7 13/17] mm/debug: print vm_refcnt state when dumping the vma Suren Baghdasaryan
@ 2024-12-26 17:07 ` Suren Baghdasaryan
2025-01-08 18:04 ` Vlastimil Babka
2024-12-26 17:07 ` [PATCH v7 15/17] mm: prepare lock_vma_under_rcu() for vma reuse possibility Suren Baghdasaryan
` (2 subsequent siblings)
16 siblings, 1 reply; 65+ messages in thread
From: Suren Baghdasaryan @ 2024-12-26 17:07 UTC (permalink / raw)
To: akpm
Cc: peterz, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team, surenb
vma_init() already memsets the whole vm_area_struct to 0, so there is
no need for an additional vma_numab_state_init().
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
include/linux/mm.h | 1 -
1 file changed, 1 deletion(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 99f4720d7e51..40bbe815df11 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -947,7 +947,6 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
vma->vm_mm = mm;
vma->vm_ops = &vma_dummy_vm_ops;
INIT_LIST_HEAD(&vma->anon_vma_chain);
- vma_numab_state_init(vma);
vma_lockdep_init(vma);
vma_init_lock(vma, false);
}
--
2.47.1.613.gc27f4b7a9f-goog
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v7 14/17] mm: remove extra vma_numab_state_init() call
2024-12-26 17:07 ` [PATCH v7 14/17] mm: remove extra vma_numab_state_init() call Suren Baghdasaryan
@ 2025-01-08 18:04 ` Vlastimil Babka
0 siblings, 0 replies; 65+ messages in thread
From: Vlastimil Babka @ 2025-01-08 18:04 UTC (permalink / raw)
To: Suren Baghdasaryan, akpm
Cc: peterz, willy, liam.howlett, lorenzo.stoakes, mhocko, hannes,
mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team
On 12/26/24 18:07, Suren Baghdasaryan wrote:
> vma_init() already memsets the whole vm_area_struct to 0, so there is
> no need for an additional vma_numab_state_init().
>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> ---
> include/linux/mm.h | 1 -
> 1 file changed, 1 deletion(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 99f4720d7e51..40bbe815df11 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -947,7 +947,6 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
> vma->vm_mm = mm;
> vma->vm_ops = &vma_dummy_vm_ops;
> INIT_LIST_HEAD(&vma->anon_vma_chain);
> - vma_numab_state_init(vma);
> vma_lockdep_init(vma);
> vma_init_lock(vma, false);
> }
^ permalink raw reply [flat|nested] 65+ messages in thread
* [PATCH v7 15/17] mm: prepare lock_vma_under_rcu() for vma reuse possibility
2024-12-26 17:06 [PATCH v7 00/17] move per-vma lock into vm_area_struct Suren Baghdasaryan
` (13 preceding siblings ...)
2024-12-26 17:07 ` [PATCH v7 14/17] mm: remove extra vma_numab_state_init() call Suren Baghdasaryan
@ 2024-12-26 17:07 ` Suren Baghdasaryan
2025-01-08 18:05 ` Vlastimil Babka
2024-12-26 17:07 ` [PATCH v7 16/17] mm: make vma cache SLAB_TYPESAFE_BY_RCU Suren Baghdasaryan
2024-12-26 17:07 ` [PATCH v7 17/17] docs/mm: document latest changes to vm_lock Suren Baghdasaryan
16 siblings, 1 reply; 65+ messages in thread
From: Suren Baghdasaryan @ 2024-12-26 17:07 UTC (permalink / raw)
To: akpm
Cc: peterz, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team, surenb
Once we make vma cache SLAB_TYPESAFE_BY_RCU, it will be possible for a vma
to be reused and attached to another mm after lock_vma_under_rcu() locks
the vma. lock_vma_under_rcu() should ensure that vma_start_read() is using
the original mm and after locking the vma it should ensure that vma->vm_mm
has not changed from under us.
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
include/linux/mm.h | 10 ++++++----
mm/memory.c | 7 ++++---
2 files changed, 10 insertions(+), 7 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 40bbe815df11..56a7d70ca5bd 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -730,8 +730,10 @@ static inline void vma_refcount_put(struct vm_area_struct *vma)
* Try to read-lock a vma. The function is allowed to occasionally yield false
* locked result to avoid performance overhead, in which case we fall back to
* using mmap_lock. The function should never yield false unlocked result.
+ * False locked result is possible if mm_lock_seq overflows or if vma gets
+ * reused and attached to a different mm before we lock it.
*/
-static inline bool vma_start_read(struct vm_area_struct *vma)
+static inline bool vma_start_read(struct mm_struct *mm, struct vm_area_struct *vma)
{
int oldcnt;
@@ -742,7 +744,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
* we don't rely on for anything - the mm_lock_seq read against which we
* need ordering is below.
*/
- if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq.sequence))
+ if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(mm->mm_lock_seq.sequence))
return false;
@@ -767,7 +769,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
* This pairs with RELEASE semantics in vma_end_write_all().
*/
if (unlikely(oldcnt & VMA_LOCK_OFFSET ||
- vma->vm_lock_seq == raw_read_seqcount(&vma->vm_mm->mm_lock_seq))) {
+ vma->vm_lock_seq == raw_read_seqcount(&mm->mm_lock_seq))) {
vma_refcount_put(vma);
return false;
}
@@ -905,7 +907,7 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
static inline void vma_lockdep_init(struct vm_area_struct *vma) {}
static inline void vma_init_lock(struct vm_area_struct *vma, bool reset_refcnt) {}
-static inline bool vma_start_read(struct vm_area_struct *vma)
+static inline bool vma_start_read(struct mm_struct *mm, struct vm_area_struct *vma)
{ return false; }
static inline void vma_end_read(struct vm_area_struct *vma) {}
static inline void vma_start_write(struct vm_area_struct *vma) {}
diff --git a/mm/memory.c b/mm/memory.c
index 2def47b5dff0..9cc93c2f79f3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6414,7 +6414,7 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
if (!vma)
goto inval;
- if (!vma_start_read(vma))
+ if (!vma_start_read(mm, vma))
goto inval;
/*
@@ -6424,8 +6424,9 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
* fields are accessible for RCU readers.
*/
- /* Check since vm_start/vm_end might change before we lock the VMA */
- if (unlikely(address < vma->vm_start || address >= vma->vm_end))
+ /* Check if the vma we locked is the right one. */
+ if (unlikely(vma->vm_mm != mm ||
+ address < vma->vm_start || address >= vma->vm_end))
goto inval_end_read;
rcu_read_unlock();
--
2.47.1.613.gc27f4b7a9f-goog
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v7 15/17] mm: prepare lock_vma_under_rcu() for vma reuse possibility
2024-12-26 17:07 ` [PATCH v7 15/17] mm: prepare lock_vma_under_rcu() for vma reuse possibility Suren Baghdasaryan
@ 2025-01-08 18:05 ` Vlastimil Babka
0 siblings, 0 replies; 65+ messages in thread
From: Vlastimil Babka @ 2025-01-08 18:05 UTC (permalink / raw)
To: Suren Baghdasaryan, akpm
Cc: peterz, willy, liam.howlett, lorenzo.stoakes, mhocko, hannes,
mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team
On 12/26/24 18:07, Suren Baghdasaryan wrote:
> Once we make vma cache SLAB_TYPESAFE_BY_RCU, it will be possible for a vma
> to be reused and attached to another mm after lock_vma_under_rcu() locks
> the vma. lock_vma_under_rcu() should ensure that vma_start_read() is using
> the original mm and after locking the vma it should ensure that vma->vm_mm
> has not changed from under us.
>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> ---
> include/linux/mm.h | 10 ++++++----
> mm/memory.c | 7 ++++---
> 2 files changed, 10 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 40bbe815df11..56a7d70ca5bd 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -730,8 +730,10 @@ static inline void vma_refcount_put(struct vm_area_struct *vma)
> * Try to read-lock a vma. The function is allowed to occasionally yield false
> * locked result to avoid performance overhead, in which case we fall back to
> * using mmap_lock. The function should never yield false unlocked result.
> + * False locked result is possible if mm_lock_seq overflows or if vma gets
> + * reused and attached to a different mm before we lock it.
> */
> -static inline bool vma_start_read(struct vm_area_struct *vma)
> +static inline bool vma_start_read(struct mm_struct *mm, struct vm_area_struct *vma)
> {
> int oldcnt;
>
> @@ -742,7 +744,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> * we don't rely on for anything - the mm_lock_seq read against which we
> * need ordering is below.
> */
> - if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq.sequence))
> + if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(mm->mm_lock_seq.sequence))
> return false;
>
>
> @@ -767,7 +769,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> * This pairs with RELEASE semantics in vma_end_write_all().
> */
> if (unlikely(oldcnt & VMA_LOCK_OFFSET ||
> - vma->vm_lock_seq == raw_read_seqcount(&vma->vm_mm->mm_lock_seq))) {
> + vma->vm_lock_seq == raw_read_seqcount(&mm->mm_lock_seq))) {
> vma_refcount_put(vma);
> return false;
> }
> @@ -905,7 +907,7 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
>
> static inline void vma_lockdep_init(struct vm_area_struct *vma) {}
> static inline void vma_init_lock(struct vm_area_struct *vma, bool reset_refcnt) {}
> -static inline bool vma_start_read(struct vm_area_struct *vma)
> +static inline bool vma_start_read(struct mm_struct *mm, struct vm_area_struct *vma)
> { return false; }
> static inline void vma_end_read(struct vm_area_struct *vma) {}
> static inline void vma_start_write(struct vm_area_struct *vma) {}
> diff --git a/mm/memory.c b/mm/memory.c
> index 2def47b5dff0..9cc93c2f79f3 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -6414,7 +6414,7 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> if (!vma)
> goto inval;
>
> - if (!vma_start_read(vma))
> + if (!vma_start_read(mm, vma))
> goto inval;
>
> /*
> @@ -6424,8 +6424,9 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> * fields are accessible for RCU readers.
> */
>
> - /* Check since vm_start/vm_end might change before we lock the VMA */
> - if (unlikely(address < vma->vm_start || address >= vma->vm_end))
> + /* Check if the vma we locked is the right one. */
> + if (unlikely(vma->vm_mm != mm ||
> + address < vma->vm_start || address >= vma->vm_end))
> goto inval_end_read;
>
> rcu_read_unlock();
^ permalink raw reply [flat|nested] 65+ messages in thread
* [PATCH v7 16/17] mm: make vma cache SLAB_TYPESAFE_BY_RCU
2024-12-26 17:06 [PATCH v7 00/17] move per-vma lock into vm_area_struct Suren Baghdasaryan
` (14 preceding siblings ...)
2024-12-26 17:07 ` [PATCH v7 15/17] mm: prepare lock_vma_under_rcu() for vma reuse possibility Suren Baghdasaryan
@ 2024-12-26 17:07 ` Suren Baghdasaryan
2025-01-08 14:55 ` Liam R. Howlett
2025-01-08 18:21 ` Vlastimil Babka
2024-12-26 17:07 ` [PATCH v7 17/17] docs/mm: document latest changes to vm_lock Suren Baghdasaryan
16 siblings, 2 replies; 65+ messages in thread
From: Suren Baghdasaryan @ 2024-12-26 17:07 UTC (permalink / raw)
To: akpm
Cc: peterz, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team, surenb
To enable SLAB_TYPESAFE_BY_RCU for vma cache we need to ensure that
object reuse before RCU grace period is over will be detected by
lock_vma_under_rcu(). Current checks are sufficient as long as vma
is detached before it is freed. Implement this guarantee by calling
vma_ensure_detached() before vma is freed and make vm_area_cachep
SLAB_TYPESAFE_BY_RCU. This will facilitate vm_area_struct reuse and
will minimize the number of call_rcu() calls.
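The generic lookup pattern that SLAB_TYPESAFE_BY_RCU relies on looks roughly like this (function names here are placeholders, not APIs from this series):

        rcu_read_lock();
        obj = lookup(key);                      /* may return a freed-and-reused object */
        if (obj && !get_ref_unless_zero(obj))   /* refcount gates access to the object  */
                obj = NULL;
        if (obj && !object_matches(obj, key)) { /* re-validate identity after pinning   */
                put_ref(obj);
                obj = NULL;
        }
        rcu_read_unlock();

In this series the detached state (vm_refcnt == 0) plays the role of the failed refcount, and the vma->vm_mm and address-range checks in lock_vma_under_rcu() play the role of the identity re-check.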
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
include/linux/mm.h | 2 --
include/linux/mm_types.h | 10 +++++++---
include/linux/slab.h | 6 ------
kernel/fork.c | 31 +++++++++----------------------
mm/mmap.c | 3 ++-
mm/vma.c | 10 +++-------
mm/vma.h | 2 +-
tools/testing/vma/vma_internal.h | 7 +------
8 files changed, 23 insertions(+), 48 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 56a7d70ca5bd..017d70e1d432 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -258,8 +258,6 @@ void setup_initial_init_mm(void *start_code, void *end_code,
struct vm_area_struct *vm_area_alloc(struct mm_struct *);
struct vm_area_struct *vm_area_dup(struct vm_area_struct *);
void vm_area_free(struct vm_area_struct *);
-/* Use only if VMA has no other users */
-void __vm_area_free(struct vm_area_struct *vma);
#ifndef CONFIG_MMU
extern struct rb_root nommu_region_tree;
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index b5312421dec6..3ca4695f6d0f 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -574,6 +574,12 @@ static inline void *folio_get_private(struct folio *folio)
typedef unsigned long vm_flags_t;
+/*
+ * freeptr_t represents a SLUB freelist pointer, which might be encoded
+ * and not dereferenceable if CONFIG_SLAB_FREELIST_HARDENED is enabled.
+ */
+typedef struct { unsigned long v; } freeptr_t;
+
/*
* A region containing a mapping of a non-memory backed file under NOMMU
* conditions. These are held in a global tree and are pinned by the VMAs that
@@ -687,9 +693,7 @@ struct vm_area_struct {
unsigned long vm_start;
unsigned long vm_end;
};
-#ifdef CONFIG_PER_VMA_LOCK
- struct rcu_head vm_rcu; /* Used for deferred freeing. */
-#endif
+ freeptr_t vm_freeptr; /* Pointer used by SLAB_TYPESAFE_BY_RCU */
};
/*
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 10a971c2bde3..681b685b6c4e 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -234,12 +234,6 @@ enum _slab_flag_bits {
#define SLAB_NO_OBJ_EXT __SLAB_FLAG_UNUSED
#endif
-/*
- * freeptr_t represents a SLUB freelist pointer, which might be encoded
- * and not dereferenceable if CONFIG_SLAB_FREELIST_HARDENED is enabled.
- */
-typedef struct { unsigned long v; } freeptr_t;
-
/*
* ZERO_SIZE_PTR will be returned for zero sized kmalloc requests.
*
diff --git a/kernel/fork.c b/kernel/fork.c
index 7a0800d48112..da3b1ebfd282 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -471,7 +471,7 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
return new;
}
-void __vm_area_free(struct vm_area_struct *vma)
+void vm_area_free(struct vm_area_struct *vma)
{
/* The vma should be detached while being destroyed. */
vma_assert_detached(vma);
@@ -480,25 +480,6 @@ void __vm_area_free(struct vm_area_struct *vma)
kmem_cache_free(vm_area_cachep, vma);
}
-#ifdef CONFIG_PER_VMA_LOCK
-static void vm_area_free_rcu_cb(struct rcu_head *head)
-{
- struct vm_area_struct *vma = container_of(head, struct vm_area_struct,
- vm_rcu);
-
- __vm_area_free(vma);
-}
-#endif
-
-void vm_area_free(struct vm_area_struct *vma)
-{
-#ifdef CONFIG_PER_VMA_LOCK
- call_rcu(&vma->vm_rcu, vm_area_free_rcu_cb);
-#else
- __vm_area_free(vma);
-#endif
-}
-
static void account_kernel_stack(struct task_struct *tsk, int account)
{
if (IS_ENABLED(CONFIG_VMAP_STACK)) {
@@ -3144,6 +3125,11 @@ void __init mm_cache_init(void)
void __init proc_caches_init(void)
{
+ struct kmem_cache_args args = {
+ .use_freeptr_offset = true,
+ .freeptr_offset = offsetof(struct vm_area_struct, vm_freeptr),
+ };
+
sighand_cachep = kmem_cache_create("sighand_cache",
sizeof(struct sighand_struct), 0,
SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_TYPESAFE_BY_RCU|
@@ -3160,8 +3146,9 @@ void __init proc_caches_init(void)
sizeof(struct fs_struct), 0,
SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT,
NULL);
- vm_area_cachep = KMEM_CACHE(vm_area_struct,
- SLAB_HWCACHE_ALIGN|SLAB_NO_MERGE|SLAB_PANIC|
+ vm_area_cachep = kmem_cache_create("vm_area_struct",
+ sizeof(struct vm_area_struct), &args,
+ SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_TYPESAFE_BY_RCU|
SLAB_ACCOUNT);
mmap_init();
nsproxy_cache_init();
diff --git a/mm/mmap.c b/mm/mmap.c
index 3cc8de07411d..7fdc4207fe98 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1364,7 +1364,8 @@ void exit_mmap(struct mm_struct *mm)
do {
if (vma->vm_flags & VM_ACCOUNT)
nr_accounted += vma_pages(vma);
- remove_vma(vma, /* unreachable = */ true);
+ vma_mark_detached(vma);
+ remove_vma(vma);
count++;
cond_resched();
vma = vma_next(&vmi);
diff --git a/mm/vma.c b/mm/vma.c
index 4a3deb6f9662..e37eb384d118 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -406,18 +406,14 @@ static bool can_vma_merge_right(struct vma_merge_struct *vmg,
/*
* Close a vm structure and free it.
*/
-void remove_vma(struct vm_area_struct *vma, bool unreachable)
+void remove_vma(struct vm_area_struct *vma)
{
might_sleep();
vma_close(vma);
if (vma->vm_file)
fput(vma->vm_file);
mpol_put(vma_policy(vma));
- if (unreachable) {
- vma_mark_detached(vma);
- __vm_area_free(vma);
- } else
- vm_area_free(vma);
+ vm_area_free(vma);
}
/*
@@ -1199,7 +1195,7 @@ static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
/* Remove and clean up vmas */
mas_set(mas_detach, 0);
mas_for_each(mas_detach, vma, ULONG_MAX)
- remove_vma(vma, /* unreachable = */ false);
+ remove_vma(vma);
vm_unacct_memory(vms->nr_accounted);
validate_mm(mm);
diff --git a/mm/vma.h b/mm/vma.h
index 18c9e49b1eae..d6803626151d 100644
--- a/mm/vma.h
+++ b/mm/vma.h
@@ -173,7 +173,7 @@ int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm,
unsigned long start, size_t len, struct list_head *uf,
bool unlock);
-void remove_vma(struct vm_area_struct *vma, bool unreachable);
+void remove_vma(struct vm_area_struct *vma);
void unmap_region(struct ma_state *mas, struct vm_area_struct *vma,
struct vm_area_struct *prev, struct vm_area_struct *next);
diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
index 1e8cd2f013fa..c7c580ec9a2d 100644
--- a/tools/testing/vma/vma_internal.h
+++ b/tools/testing/vma/vma_internal.h
@@ -693,14 +693,9 @@ static inline void mpol_put(struct mempolicy *)
{
}
-static inline void __vm_area_free(struct vm_area_struct *vma)
-{
- free(vma);
-}
-
static inline void vm_area_free(struct vm_area_struct *vma)
{
- __vm_area_free(vma);
+ free(vma);
}
static inline void lru_add_drain(void)
--
2.47.1.613.gc27f4b7a9f-goog
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v7 16/17] mm: make vma cache SLAB_TYPESAFE_BY_RCU
2024-12-26 17:07 ` [PATCH v7 16/17] mm: make vma cache SLAB_TYPESAFE_BY_RCU Suren Baghdasaryan
@ 2025-01-08 14:55 ` Liam R. Howlett
2025-01-08 18:21 ` Vlastimil Babka
1 sibling, 0 replies; 65+ messages in thread
From: Liam R. Howlett @ 2025-01-08 14:55 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: akpm, peterz, willy, lorenzo.stoakes, mhocko, vbabka, hannes,
mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team
* Suren Baghdasaryan <surenb@google.com> [241226 12:07]:
> To enable SLAB_TYPESAFE_BY_RCU for vma cache we need to ensure that
> object reuse before RCU grace period is over will be detected by
> lock_vma_under_rcu(). Current checks are sufficient as long as vma
> is detached before it is freed. Implement this guarantee by calling
> vma_ensure_detached() before vma is freed and make vm_area_cachep
> SLAB_TYPESAFE_BY_RCU. This will facilitate vm_area_struct reuse and
> will minimize the number of call_rcu() calls.
>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> ---
> include/linux/mm.h | 2 --
> include/linux/mm_types.h | 10 +++++++---
> include/linux/slab.h | 6 ------
> kernel/fork.c | 31 +++++++++----------------------
> mm/mmap.c | 3 ++-
> mm/vma.c | 10 +++-------
> mm/vma.h | 2 +-
> tools/testing/vma/vma_internal.h | 7 +------
> 8 files changed, 23 insertions(+), 48 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 56a7d70ca5bd..017d70e1d432 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -258,8 +258,6 @@ void setup_initial_init_mm(void *start_code, void *end_code,
> struct vm_area_struct *vm_area_alloc(struct mm_struct *);
> struct vm_area_struct *vm_area_dup(struct vm_area_struct *);
> void vm_area_free(struct vm_area_struct *);
> -/* Use only if VMA has no other users */
> -void __vm_area_free(struct vm_area_struct *vma);
>
> #ifndef CONFIG_MMU
> extern struct rb_root nommu_region_tree;
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index b5312421dec6..3ca4695f6d0f 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -574,6 +574,12 @@ static inline void *folio_get_private(struct folio *folio)
>
> typedef unsigned long vm_flags_t;
>
> +/*
> + * freeptr_t represents a SLUB freelist pointer, which might be encoded
> + * and not dereferenceable if CONFIG_SLAB_FREELIST_HARDENED is enabled.
> + */
> +typedef struct { unsigned long v; } freeptr_t;
> +
> /*
> * A region containing a mapping of a non-memory backed file under NOMMU
> * conditions. These are held in a global tree and are pinned by the VMAs that
> @@ -687,9 +693,7 @@ struct vm_area_struct {
> unsigned long vm_start;
> unsigned long vm_end;
> };
> -#ifdef CONFIG_PER_VMA_LOCK
> - struct rcu_head vm_rcu; /* Used for deferred freeing. */
> -#endif
> + freeptr_t vm_freeptr; /* Pointer used by SLAB_TYPESAFE_BY_RCU */
> };
>
> /*
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index 10a971c2bde3..681b685b6c4e 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -234,12 +234,6 @@ enum _slab_flag_bits {
> #define SLAB_NO_OBJ_EXT __SLAB_FLAG_UNUSED
> #endif
>
> -/*
> - * freeptr_t represents a SLUB freelist pointer, which might be encoded
> - * and not dereferenceable if CONFIG_SLAB_FREELIST_HARDENED is enabled.
> - */
> -typedef struct { unsigned long v; } freeptr_t;
> -
> /*
> * ZERO_SIZE_PTR will be returned for zero sized kmalloc requests.
> *
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 7a0800d48112..da3b1ebfd282 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -471,7 +471,7 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
> return new;
> }
>
> -void __vm_area_free(struct vm_area_struct *vma)
> +void vm_area_free(struct vm_area_struct *vma)
> {
> /* The vma should be detached while being destroyed. */
> vma_assert_detached(vma);
> @@ -480,25 +480,6 @@ void __vm_area_free(struct vm_area_struct *vma)
> kmem_cache_free(vm_area_cachep, vma);
> }
>
> -#ifdef CONFIG_PER_VMA_LOCK
> -static void vm_area_free_rcu_cb(struct rcu_head *head)
> -{
> - struct vm_area_struct *vma = container_of(head, struct vm_area_struct,
> - vm_rcu);
> -
> - __vm_area_free(vma);
> -}
> -#endif
> -
> -void vm_area_free(struct vm_area_struct *vma)
> -{
> -#ifdef CONFIG_PER_VMA_LOCK
> - call_rcu(&vma->vm_rcu, vm_area_free_rcu_cb);
> -#else
> - __vm_area_free(vma);
> -#endif
> -}
> -
> static void account_kernel_stack(struct task_struct *tsk, int account)
> {
> if (IS_ENABLED(CONFIG_VMAP_STACK)) {
> @@ -3144,6 +3125,11 @@ void __init mm_cache_init(void)
>
> void __init proc_caches_init(void)
> {
> + struct kmem_cache_args args = {
> + .use_freeptr_offset = true,
> + .freeptr_offset = offsetof(struct vm_area_struct, vm_freeptr),
> + };
> +
> sighand_cachep = kmem_cache_create("sighand_cache",
> sizeof(struct sighand_struct), 0,
> SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_TYPESAFE_BY_RCU|
> @@ -3160,8 +3146,9 @@ void __init proc_caches_init(void)
> sizeof(struct fs_struct), 0,
> SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT,
> NULL);
> - vm_area_cachep = KMEM_CACHE(vm_area_struct,
> - SLAB_HWCACHE_ALIGN|SLAB_NO_MERGE|SLAB_PANIC|
> + vm_area_cachep = kmem_cache_create("vm_area_struct",
> + sizeof(struct vm_area_struct), &args,
> + SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_TYPESAFE_BY_RCU|
> SLAB_ACCOUNT);
> mmap_init();
> nsproxy_cache_init();
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 3cc8de07411d..7fdc4207fe98 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1364,7 +1364,8 @@ void exit_mmap(struct mm_struct *mm)
> do {
> if (vma->vm_flags & VM_ACCOUNT)
> nr_accounted += vma_pages(vma);
> - remove_vma(vma, /* unreachable = */ true);
> + vma_mark_detached(vma);
> + remove_vma(vma);
> count++;
> cond_resched();
> vma = vma_next(&vmi);
> diff --git a/mm/vma.c b/mm/vma.c
> index 4a3deb6f9662..e37eb384d118 100644
> --- a/mm/vma.c
> +++ b/mm/vma.c
> @@ -406,18 +406,14 @@ static bool can_vma_merge_right(struct vma_merge_struct *vmg,
> /*
> * Close a vm structure and free it.
> */
> -void remove_vma(struct vm_area_struct *vma, bool unreachable)
> +void remove_vma(struct vm_area_struct *vma)
> {
> might_sleep();
> vma_close(vma);
> if (vma->vm_file)
> fput(vma->vm_file);
> mpol_put(vma_policy(vma));
> - if (unreachable) {
> - vma_mark_detached(vma);
> - __vm_area_free(vma);
> - } else
> - vm_area_free(vma);
> + vm_area_free(vma);
> }
>
> /*
> @@ -1199,7 +1195,7 @@ static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
> /* Remove and clean up vmas */
> mas_set(mas_detach, 0);
> mas_for_each(mas_detach, vma, ULONG_MAX)
> - remove_vma(vma, /* unreachable = */ false);
> + remove_vma(vma);
>
> vm_unacct_memory(vms->nr_accounted);
> validate_mm(mm);
> diff --git a/mm/vma.h b/mm/vma.h
> index 18c9e49b1eae..d6803626151d 100644
> --- a/mm/vma.h
> +++ b/mm/vma.h
> @@ -173,7 +173,7 @@ int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm,
> unsigned long start, size_t len, struct list_head *uf,
> bool unlock);
>
> -void remove_vma(struct vm_area_struct *vma, bool unreachable);
> +void remove_vma(struct vm_area_struct *vma);
>
> void unmap_region(struct ma_state *mas, struct vm_area_struct *vma,
> struct vm_area_struct *prev, struct vm_area_struct *next);
> diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
> index 1e8cd2f013fa..c7c580ec9a2d 100644
> --- a/tools/testing/vma/vma_internal.h
> +++ b/tools/testing/vma/vma_internal.h
> @@ -693,14 +693,9 @@ static inline void mpol_put(struct mempolicy *)
> {
> }
>
> -static inline void __vm_area_free(struct vm_area_struct *vma)
> -{
> - free(vma);
> -}
> -
> static inline void vm_area_free(struct vm_area_struct *vma)
> {
> - __vm_area_free(vma);
> + free(vma);
> }
>
> static inline void lru_add_drain(void)
> --
> 2.47.1.613.gc27f4b7a9f-goog
>
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v7 16/17] mm: make vma cache SLAB_TYPESAFE_BY_RCU
2024-12-26 17:07 ` [PATCH v7 16/17] mm: make vma cache SLAB_TYPESAFE_BY_RCU Suren Baghdasaryan
2025-01-08 14:55 ` Liam R. Howlett
@ 2025-01-08 18:21 ` Vlastimil Babka
2025-01-08 18:44 ` Suren Baghdasaryan
1 sibling, 1 reply; 65+ messages in thread
From: Vlastimil Babka @ 2025-01-08 18:21 UTC (permalink / raw)
To: Suren Baghdasaryan, akpm
Cc: peterz, willy, liam.howlett, lorenzo.stoakes, mhocko, hannes,
mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team
On 12/26/24 18:07, Suren Baghdasaryan wrote:
> To enable SLAB_TYPESAFE_BY_RCU for vma cache we need to ensure that
> object reuse before RCU grace period is over will be detected by
> lock_vma_under_rcu(). Current checks are sufficient as long as vma
> is detached before it is freed. Implement this guarantee by calling
> vma_ensure_detached() before vma is freed and make vm_area_cachep
> SLAB_TYPESAFE_BY_RCU. This will facilitate vm_area_struct reuse and
> will minimize the number of call_rcu() calls.
>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
I've noticed vm_area_dup() went back to the approach of "we memcpy
everything including vma_lock and detached (now the vm_refcnt) followed by a
vma_init_lock(..., true) that does refcount_set(&vma->vm_refcnt, 0);
Is that now safe against a racing lock_vma_under_rcu()? I think it's not?
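
For illustration, a minimal userspace sketch of the approach described above, with hypothetical toy_* names standing in for the real structures (this is not the kernel code): the whole object, refcount included, is memcpy()'d and only afterwards reset to zero.

/*
 * Illustrative userspace model only (hypothetical toy_* names).
 * It mirrors the shape described above: copy everything, including
 * the refcount, then reset the refcount afterwards.
 */
#include <stdatomic.h>
#include <string.h>

struct toy_vma {
        atomic_int refcnt;      /* stands in for vma->vm_refcnt   */
        int seq;                /* stands in for vma->vm_lock_seq */
        /* ... other fields ... */
};

static void toy_vma_dup(struct toy_vma *new, const struct toy_vma *orig)
{
        memcpy(new, orig, sizeof(*new));        /* copies a non-zero refcnt */
        atomic_store(&new->refcnt, 0);          /* models the later         */
                                                /* refcount_set(..., 0)     */
}
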
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v7 16/17] mm: make vma cache SLAB_TYPESAFE_BY_RCU
2025-01-08 18:21 ` Vlastimil Babka
@ 2025-01-08 18:44 ` Suren Baghdasaryan
2025-01-08 19:00 ` Vlastimil Babka
0 siblings, 1 reply; 65+ messages in thread
From: Suren Baghdasaryan @ 2025-01-08 18:44 UTC (permalink / raw)
To: Vlastimil Babka
Cc: akpm, peterz, willy, liam.howlett, lorenzo.stoakes, mhocko,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team
On Wed, Jan 8, 2025 at 10:21 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 12/26/24 18:07, Suren Baghdasaryan wrote:
> > To enable SLAB_TYPESAFE_BY_RCU for vma cache we need to ensure that
> > object reuse before RCU grace period is over will be detected by
> > lock_vma_under_rcu(). Current checks are sufficient as long as vma
> > is detached before it is freed. Implement this guarantee by calling
> > vma_ensure_detached() before vma is freed and make vm_area_cachep
> > SLAB_TYPESAFE_BY_RCU. This will facilitate vm_area_struct reuse and
> > will minimize the number of call_rcu() calls.
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
>
> I've noticed vm_area_dup() went back to the approach of "we memcpy
> everything including vma_lock and detached (now the vm_refcnt) followed by a
> vma_init_lock(..., true) that does refcount_set(&vma->vm_refcnt, 0);
> Is that now safe against a racing lock_vma_under_rcu()? I think it's not?
I think it's safe because vma created by vm_area_dup() is not in the
vma tree yet, so lock_vma_under_rcu() does not see it until it's added
into the tree. Note also that at the time when the new vma gets added
into the tree, the vma has to be write-locked
(vma_iter_store()->vma_mark_attached()->vma_assert_write_locked()).
So, lock_vma_under_rcu() won't use the new vma even after it's added
into the tree until we unlock the vma.
>
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v7 16/17] mm: make vma cache SLAB_TYPESAFE_BY_RCU
2025-01-08 18:44 ` Suren Baghdasaryan
@ 2025-01-08 19:00 ` Vlastimil Babka
2025-01-08 19:17 ` Suren Baghdasaryan
0 siblings, 1 reply; 65+ messages in thread
From: Vlastimil Babka @ 2025-01-08 19:00 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: akpm, peterz, willy, liam.howlett, lorenzo.stoakes, mhocko,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team
On 1/8/25 19:44, Suren Baghdasaryan wrote:
> On Wed, Jan 8, 2025 at 10:21 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>>
>> On 12/26/24 18:07, Suren Baghdasaryan wrote:
>> > To enable SLAB_TYPESAFE_BY_RCU for vma cache we need to ensure that
>> > object reuse before RCU grace period is over will be detected by
>> > lock_vma_under_rcu(). Current checks are sufficient as long as vma
>> > is detached before it is freed. Implement this guarantee by calling
>> > vma_ensure_detached() before vma is freed and make vm_area_cachep
>> > SLAB_TYPESAFE_BY_RCU. This will facilitate vm_area_struct reuse and
>> > will minimize the number of call_rcu() calls.
>> >
>> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
>>
>> I've noticed vm_area_dup() went back to the approach of "we memcpy
>> everything including vma_lock and detached (now the vm_refcnt) followed by a
>> vma_init_lock(..., true) that does refcount_set(&vma->vm_refcnt, 0);
>> Is that now safe against a racing lock_vma_under_rcu()? I think it's not?
>
> I think it's safe because vma created by vm_area_dup() is not in the
> vma tree yet, so lock_vma_under_rcu() does not see it until it's added
> into the tree. Note also that at the time when the new vma gets added
> into the tree, the vma has to be write-locked
> (vma_iter_store()->vma_mark_attached()->vma_assert_write_locked()).
> So, lock_vma_under_rcu() won't use the new vma even after it's added
> into the tree until we unlock the vma.
What about something like this, where vma starts out as attached and thus
reachable:
A:                      B:                      C:
lock_vma_under_rcu()
  vma = mas_walk()
  vma_start_read()
    vm_lock_seq == mm->mm_lock_seq.sequence

                        vma detached and freed

                                                vm_area_dup()
                                                - vma reallocated
                                                - memcpy() copies non-zero refcnt from orig

    __refcount_inc_not_zero_limited() succeeds

                                                vma_init_lock();
                                                  refcount_set(&vma->vm_refcnt, 0);

    - vm_lock_seq validation fails (could it even succeed?)
    vma_refcount_put(vma);
      __refcount_dec_and_test makes refcount -1
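
To make the window concrete, here is a sequential userspace model of the interleaving above (hypothetical toy_* names, not the kernel code); run as-is, the final printf reports a refcount of -1:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

struct toy_vma {
        atomic_int refcnt;              /* vma->vm_refcnt stand-in */
        int seq;
};

/* A's speculative get, standing in for __refcount_inc_not_zero_limited() */
static bool toy_get_not_zero(struct toy_vma *vma)
{
        int old = atomic_load(&vma->refcnt);

        do {
                if (old <= 0)
                        return false;   /* detached: back off */
        } while (!atomic_compare_exchange_weak(&vma->refcnt, &old, old + 1));
        return true;
}

int main(void)
{
        struct toy_vma src = { .seq = 0 };      /* vma C is duplicating   */
        struct toy_vma slot;                    /* reused memory A still  */
                                                /* holds a pointer to     */
        atomic_init(&src.refcnt, 1);            /* attached elsewhere     */
        atomic_init(&slot.refcnt, 0);           /* B detached + freed it  */

        /* C: vm_area_dup() - memcpy() drags src's non-zero refcnt along */
        memcpy(&slot, &src, sizeof(slot));

        /* A: the speculative get now succeeds against the copied count */
        bool got = toy_get_not_zero(&slot);

        /* C: the refcount is reset to 0 only afterwards */
        atomic_store(&slot.refcnt, 0);

        /* A: vm_lock_seq validation fails, so A drops its reference */
        if (got)
                printf("refcnt after put: %d\n",
                       atomic_fetch_sub(&slot.refcnt, 1) - 1);  /* -1 */
        return 0;
}
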
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v7 16/17] mm: make vma cache SLAB_TYPESAFE_BY_RCU
2025-01-08 19:00 ` Vlastimil Babka
@ 2025-01-08 19:17 ` Suren Baghdasaryan
0 siblings, 0 replies; 65+ messages in thread
From: Suren Baghdasaryan @ 2025-01-08 19:17 UTC (permalink / raw)
To: Vlastimil Babka
Cc: akpm, peterz, willy, liam.howlett, lorenzo.stoakes, mhocko,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team
On Wed, Jan 8, 2025 at 11:00 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 1/8/25 19:44, Suren Baghdasaryan wrote:
> > On Wed, Jan 8, 2025 at 10:21 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> >>
> >> On 12/26/24 18:07, Suren Baghdasaryan wrote:
> >> > To enable SLAB_TYPESAFE_BY_RCU for vma cache we need to ensure that
> >> > object reuse before RCU grace period is over will be detected by
> >> > lock_vma_under_rcu(). Current checks are sufficient as long as vma
> >> > is detached before it is freed. Implement this guarantee by calling
> >> > vma_ensure_detached() before vma is freed and make vm_area_cachep
> >> > SLAB_TYPESAFE_BY_RCU. This will facilitate vm_area_struct reuse and
> >> > will minimize the number of call_rcu() calls.
> >> >
> >> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> >>
> >> I've noticed vm_area_dup() went back to the approach of "we memcpy
> >> everything including vma_lock and detached (now the vm_refcnt) followed by a
> >> vma_init_lock(..., true) that does refcount_set(&vma->vm_refcnt, 0);
> >> Is that now safe against a racing lock_vma_under_rcu()? I think it's not?
> >
> > I think it's safe because vma created by vm_area_dup() is not in the
> > vma tree yet, so lock_vma_under_rcu() does not see it until it's added
> > into the tree. Note also that at the time when the new vma gets added
> > into the tree, the vma has to be write-locked
> > (vma_iter_store()->vma_mark_attached()->vma_assert_write_locked()).
> > So, lock_vma_under_rcu() won't use the new vma even after it's added
> > into the tree until we unlock the vma.
>
>
> What about something like this, where vma starts out as attached and thus
> reachable:
Huh, very clever sequence.
>
> A:                      B:                      C:
> lock_vma_under_rcu()
>   vma = mas_walk()
>   vma_start_read()
>     vm_lock_seq == mm->mm_lock_seq.sequence
>
vma_start_write
>                         vma detached and freed
>
>                                                 vm_area_dup()
>                                                 - vma reallocated
>                                                 - memcpy() copies non-zero refcnt from orig
>
>     __refcount_inc_not_zero_limited() succeeds
>
>                                                 vma_init_lock();
>                                                   refcount_set(&vma->vm_refcnt, 0);
>
>     - vm_lock_seq validation fails (could it even succeed?)
It can succeed if task C drops the vma write-lock before A validates
vm_lock_seq.
>     vma_refcount_put(vma);
>       __refcount_dec_and_test makes refcount -1
Yeah, I guess I will have to keep vm_refcnt at 0 across reuse, so
memcpy() in vm_area_dup() should be replaced. I'll make the changes.
Thanks for analyzing this, Vlastimil!
>
>
>
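
A possible shape of that replacement, sketched in the same toy userspace style as above (hypothetical names; the actual fix is whatever ends up in vm_area_dup()): copy the payload field by field and never write the refcount, so a reused slot keeps reading 0 (detached) until the new vma is attached under a write lock.

#include <stdatomic.h>

struct toy_vma {
        atomic_int refcnt;      /* 0 == detached; never copied on reuse */
        int seq;
        /* ... other fields ... */
};

static void toy_vma_copy(struct toy_vma *new, const struct toy_vma *orig)
{
        /*
         * Deliberately no memcpy() of the whole struct: refcnt keeps the
         * 0 left behind when the slot was freed, so a reader racing on a
         * stale pointer keeps failing its inc-not-zero attempt.
         */
        new->seq = orig->seq;
        /* ... copy the remaining payload fields ... */
}
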
^ permalink raw reply [flat|nested] 65+ messages in thread
* [PATCH v7 17/17] docs/mm: document latest changes to vm_lock
2024-12-26 17:06 [PATCH v7 00/17] move per-vma lock into vm_area_struct Suren Baghdasaryan
` (15 preceding siblings ...)
2024-12-26 17:07 ` [PATCH v7 16/17] mm: make vma cache SLAB_TYPESAFE_BY_RCU Suren Baghdasaryan
@ 2024-12-26 17:07 ` Suren Baghdasaryan
2025-01-08 15:46 ` Liam R. Howlett
16 siblings, 1 reply; 65+ messages in thread
From: Suren Baghdasaryan @ 2024-12-26 17:07 UTC (permalink / raw)
To: akpm
Cc: peterz, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team, surenb
Change the documentation to reflect that vm_lock is integrated into vma
and replaced with vm_refcnt.
Document newly introduced vma_start_read_locked{_nested} functions.
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
Documentation/mm/process_addrs.rst | 44 ++++++++++++++++++------------
1 file changed, 26 insertions(+), 18 deletions(-)
diff --git a/Documentation/mm/process_addrs.rst b/Documentation/mm/process_addrs.rst
index 81417fa2ed20..f573de936b5d 100644
--- a/Documentation/mm/process_addrs.rst
+++ b/Documentation/mm/process_addrs.rst
@@ -716,9 +716,14 @@ calls :c:func:`!rcu_read_lock` to ensure that the VMA is looked up in an RCU
critical section, then attempts to VMA lock it via :c:func:`!vma_start_read`,
before releasing the RCU lock via :c:func:`!rcu_read_unlock`.
-VMA read locks hold the read lock on the :c:member:`!vma->vm_lock` semaphore for
-their duration and the caller of :c:func:`!lock_vma_under_rcu` must release it
-via :c:func:`!vma_end_read`.
+In cases when the user already holds mmap read lock, :c:func:`!vma_start_read_locked`
+and :c:func:`!vma_start_read_locked_nested` can be used. These functions do not
+fail due to lock contention but the caller should still check their return values
+in case they fail for other reasons.
+
+VMA read locks increment :c:member:`!vma.vm_refcnt` reference counter for their
+duration and the caller of :c:func:`!lock_vma_under_rcu` must drop it via
+:c:func:`!vma_end_read`.
VMA **write** locks are acquired via :c:func:`!vma_start_write` in instances where a
VMA is about to be modified, unlike :c:func:`!vma_start_read` the lock is always
@@ -726,9 +731,9 @@ acquired. An mmap write lock **must** be held for the duration of the VMA write
lock, releasing or downgrading the mmap write lock also releases the VMA write
lock so there is no :c:func:`!vma_end_write` function.
-Note that a semaphore write lock is not held across a VMA lock. Rather, a
-sequence number is used for serialisation, and the write semaphore is only
-acquired at the point of write lock to update this.
+Note that when write-locking a VMA lock, the :c:member:`!vma.vm_refcnt` is temporarily
+modified so that readers can detect the presence of a writer. The reference counter is
+restored once the vma sequence number used for serialisation is updated.
This ensures the semantics we require - VMA write locks provide exclusive write
access to the VMA.
@@ -738,7 +743,7 @@ Implementation details
The VMA lock mechanism is designed to be a lightweight means of avoiding the use
of the heavily contended mmap lock. It is implemented using a combination of a
-read/write semaphore and sequence numbers belonging to the containing
+reference counter and sequence numbers belonging to the containing
:c:struct:`!struct mm_struct` and the VMA.
Read locks are acquired via :c:func:`!vma_start_read`, which is an optimistic
@@ -779,28 +784,31 @@ release of any VMA locks on its release makes sense, as you would never want to
keep VMAs locked across entirely separate write operations. It also maintains
correct lock ordering.
-Each time a VMA read lock is acquired, we acquire a read lock on the
-:c:member:`!vma->vm_lock` read/write semaphore and hold it, while checking that
-the sequence count of the VMA does not match that of the mm.
+Each time a VMA read lock is acquired, we increment :c:member:`!vma.vm_refcnt`
+reference counter and check that the sequence count of the VMA does not match
+that of the mm.
-If it does, the read lock fails. If it does not, we hold the lock, excluding
-writers, but permitting other readers, who will also obtain this lock under RCU.
+If it does, the read lock fails and :c:member:`!vma.vm_refcnt` is dropped.
+If it does not, we keep the reference counter raised, excluding writers, but
+permitting other readers, who can also obtain this lock under RCU.
Importantly, maple tree operations performed in :c:func:`!lock_vma_under_rcu`
are also RCU safe, so the whole read lock operation is guaranteed to function
correctly.
-On the write side, we acquire a write lock on the :c:member:`!vma->vm_lock`
-read/write semaphore, before setting the VMA's sequence number under this lock,
-also simultaneously holding the mmap write lock.
+On the write side, we set a bit in :c:member:`!vma.vm_refcnt` which can't be
+modified by readers and wait for all readers to drop their reference count.
+Once there are no readers, VMA's sequence number is set to match that of the
+mm. During this entire operation mmap write lock is held.
This way, if any read locks are in effect, :c:func:`!vma_start_write` will sleep
until these are finished and mutual exclusion is achieved.
-After setting the VMA's sequence number, the lock is released, avoiding
-complexity with a long-term held write lock.
+After setting the VMA's sequence number, the bit in :c:member:`!vma.vm_refcnt`
+indicating a writer is cleared. From this point on, VMA's sequence number will
+indicate VMA's write-locked state until mmap write lock is dropped or downgraded.
-This clever combination of a read/write semaphore and sequence count allows for
+This clever combination of a reference counter and sequence count allows for
fast RCU-based per-VMA lock acquisition (especially on page fault, though
utilised elsewhere) with minimal complexity around lock ordering.
--
2.47.1.613.gc27f4b7a9f-goog
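
As a rough illustration of the locking scheme the updated document describes, here is a compact userspace-only model (hypothetical toy_* names, simplified semantics: readers here simply refuse to take a reference while the writer bit is set, rather than relying on the real inc-not-zero limit). It is a sketch of the described behaviour, not the kernel implementation.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define WRITER_BIT (1u << 30)           /* assumption: one high bit marks a writer */

struct toy_vma {
        atomic_uint refcnt;             /* models vma.vm_refcnt   */
        unsigned int seq;               /* models vma.vm_lock_seq */
};

struct toy_mm {
        unsigned int seq;               /* models mm.mm_lock_seq  */
};

/* Read side: take a reference, then validate the sequence number. */
static bool toy_read_trylock(struct toy_vma *vma, struct toy_mm *mm)
{
        unsigned int old = atomic_load(&vma->refcnt);

        do {
                if (old & WRITER_BIT)
                        return false;                   /* writer present */
        } while (!atomic_compare_exchange_weak(&vma->refcnt, &old, old + 1));

        if (vma->seq == mm->seq) {                      /* vma is write-locked */
                atomic_fetch_sub(&vma->refcnt, 1);      /* drop the reference  */
                return false;
        }
        return true;
}

static void toy_read_unlock(struct toy_vma *vma)
{
        atomic_fetch_sub(&vma->refcnt, 1);
}

/*
 * Write side: advertise the writer, wait for readers to drain, publish
 * the mm's sequence number, then clear the writer bit.  The mmap write
 * lock is assumed to be held for the whole operation, as in the text.
 */
static void toy_write_lock(struct toy_vma *vma, struct toy_mm *mm)
{
        atomic_fetch_or(&vma->refcnt, WRITER_BIT);
        while ((atomic_load(&vma->refcnt) & ~WRITER_BIT) != 0)
                ;                       /* toy spin; the real code sleeps */
        vma->seq = mm->seq;             /* readers back off from here on  */
        atomic_fetch_and(&vma->refcnt, ~WRITER_BIT);
}

int main(void)
{
        struct toy_mm mm = { .seq = 1 };
        struct toy_vma vma = { .seq = 0 };

        atomic_init(&vma.refcnt, 0);

        printf("read lock: %d\n", toy_read_trylock(&vma, &mm));         /* 1 */
        toy_read_unlock(&vma);
        toy_write_lock(&vma, &mm);
        printf("read lock after write lock: %d\n",
               toy_read_trylock(&vma, &mm));                            /* 0 */
        return 0;
}
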
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v7 17/17] docs/mm: document latest changes to vm_lock
2024-12-26 17:07 ` [PATCH v7 17/17] docs/mm: document latest changes to vm_lock Suren Baghdasaryan
@ 2025-01-08 15:46 ` Liam R. Howlett
0 siblings, 0 replies; 65+ messages in thread
From: Liam R. Howlett @ 2025-01-08 15:46 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: akpm, peterz, willy, lorenzo.stoakes, mhocko, vbabka, hannes,
mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team
* Suren Baghdasaryan <surenb@google.com> [241226 12:08]:
> Change the documentation to reflect that vm_lock is integrated into vma
> and replaced with vm_refcnt.
> Document newly introduced vma_start_read_locked{_nested} functions.
>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> ---
> Documentation/mm/process_addrs.rst | 44 ++++++++++++++++++------------
> 1 file changed, 26 insertions(+), 18 deletions(-)
>
> diff --git a/Documentation/mm/process_addrs.rst b/Documentation/mm/process_addrs.rst
> index 81417fa2ed20..f573de936b5d 100644
> --- a/Documentation/mm/process_addrs.rst
> +++ b/Documentation/mm/process_addrs.rst
> @@ -716,9 +716,14 @@ calls :c:func:`!rcu_read_lock` to ensure that the VMA is looked up in an RCU
> critical section, then attempts to VMA lock it via :c:func:`!vma_start_read`,
> before releasing the RCU lock via :c:func:`!rcu_read_unlock`.
>
> -VMA read locks hold the read lock on the :c:member:`!vma->vm_lock` semaphore for
> -their duration and the caller of :c:func:`!lock_vma_under_rcu` must release it
> -via :c:func:`!vma_end_read`.
> +In cases when the user already holds mmap read lock, :c:func:`!vma_start_read_locked`
> +and :c:func:`!vma_start_read_locked_nested` can be used. These functions do not
> +fail due to lock contention but the caller should still check their return values
> +in case they fail for other reasons.
> +
> +VMA read locks increment :c:member:`!vma.vm_refcnt` reference counter for their
> +duration and the caller of :c:func:`!lock_vma_under_rcu` must drop it via
> +:c:func:`!vma_end_read`.
>
> VMA **write** locks are acquired via :c:func:`!vma_start_write` in instances where a
> VMA is about to be modified, unlike :c:func:`!vma_start_read` the lock is always
> @@ -726,9 +731,9 @@ acquired. An mmap write lock **must** be held for the duration of the VMA write
> lock, releasing or downgrading the mmap write lock also releases the VMA write
> lock so there is no :c:func:`!vma_end_write` function.
>
> -Note that a semaphore write lock is not held across a VMA lock. Rather, a
> -sequence number is used for serialisation, and the write semaphore is only
> -acquired at the point of write lock to update this.
> +Note that when write-locking a VMA lock, the :c:member:`!vma.vm_refcnt` is temporarily
> +modified so that readers can detect the presence of a writer. The reference counter is
> +restored once the vma sequence number used for serialisation is updated.
>
> This ensures the semantics we require - VMA write locks provide exclusive write
> access to the VMA.
> @@ -738,7 +743,7 @@ Implementation details
>
> The VMA lock mechanism is designed to be a lightweight means of avoiding the use
> of the heavily contended mmap lock. It is implemented using a combination of a
> -read/write semaphore and sequence numbers belonging to the containing
> +reference counter and sequence numbers belonging to the containing
> :c:struct:`!struct mm_struct` and the VMA.
>
> Read locks are acquired via :c:func:`!vma_start_read`, which is an optimistic
> @@ -779,28 +784,31 @@ release of any VMA locks on its release makes sense, as you would never want to
> keep VMAs locked across entirely separate write operations. It also maintains
> correct lock ordering.
>
> -Each time a VMA read lock is acquired, we acquire a read lock on the
> -:c:member:`!vma->vm_lock` read/write semaphore and hold it, while checking that
> -the sequence count of the VMA does not match that of the mm.
> +Each time a VMA read lock is acquired, we increment :c:member:`!vma.vm_refcnt`
> +reference counter and check that the sequence count of the VMA does not match
> +that of the mm.
>
> -If it does, the read lock fails. If it does not, we hold the lock, excluding
> -writers, but permitting other readers, who will also obtain this lock under RCU.
> +If it does, the read lock fails and :c:member:`!vma.vm_refcnt` is dropped.
> +If it does not, we keep the reference counter raised, excluding writers, but
> +permitting other readers, who can also obtain this lock under RCU.
>
> Importantly, maple tree operations performed in :c:func:`!lock_vma_under_rcu`
> are also RCU safe, so the whole read lock operation is guaranteed to function
> correctly.
>
> -On the write side, we acquire a write lock on the :c:member:`!vma->vm_lock`
> -read/write semaphore, before setting the VMA's sequence number under this lock,
> -also simultaneously holding the mmap write lock.
> +On the write side, we set a bit in :c:member:`!vma.vm_refcnt` which can't be
> +modified by readers and wait for all readers to drop their reference count.
> +Once there are no readers, VMA's sequence number is set to match that of the
> +mm. During this entire operation mmap write lock is held.
>
> This way, if any read locks are in effect, :c:func:`!vma_start_write` will sleep
> until these are finished and mutual exclusion is achieved.
>
> -After setting the VMA's sequence number, the lock is released, avoiding
> -complexity with a long-term held write lock.
> +After setting the VMA's sequence number, the bit in :c:member:`!vma.vm_refcnt`
> +indicating a writer is cleared. From this point on, VMA's sequence number will
> +indicate VMA's write-locked state until mmap write lock is dropped or downgraded.
>
> -This clever combination of a read/write semaphore and sequence count allows for
> +This clever combination of a reference counter and sequence count allows for
> fast RCU-based per-VMA lock acquisition (especially on page fault, though
> utilised elsewhere) with minimal complexity around lock ordering.
>
> --
> 2.47.1.613.gc27f4b7a9f-goog
>
^ permalink raw reply [flat|nested] 65+ messages in thread