linux-mm.kvack.org archive mirror
* [PATCH v8 00/16] move per-vma lock into vm_area_struct
@ 2025-01-09  2:30 Suren Baghdasaryan
  2025-01-09  2:30 ` [PATCH v8 01/16] mm: introduce vma_start_read_locked{_nested} helpers Suren Baghdasaryan
                   ` (18 more replies)
  0 siblings, 19 replies; 53+ messages in thread
From: Suren Baghdasaryan @ 2025-01-09  2:30 UTC (permalink / raw)
  To: akpm
  Cc: peterz, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
	hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
	jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
	richard.weiyang, corbet, linux-doc, linux-mm, linux-kernel,
	kernel-team, surenb

Back when per-vma locks were introduced, vm_lock was moved out of
vm_area_struct in [1] because of the performance regression caused by
false cacheline sharing. Recent investigation [2] revealed that the
regression is limited to a rather old Broadwell microarchitecture and
even there it can be mitigated by disabling adjacent cacheline
prefetching, see [3].

Splitting a single logical structure into multiple ones leads to more
complicated management, extra pointer dereferences and overall less
maintainable code. When that split-away part is a lock, it complicates
things even further. With no performance benefit, there is no reason
for this split. Merging the vm_lock back into vm_area_struct also allows
vm_area_struct to use SLAB_TYPESAFE_BY_RCU later in this patchset.

This patchset:
1. moves vm_lock back into vm_area_struct, aligning it at the cacheline
boundary and changing the cache to be cacheline-aligned to minimize
cacheline sharing (see the simplified layout sketch after this list);
2. changes vm_area_struct initialization to mark a new vma as detached
until it is inserted into the vma tree;
3. replaces vm_lock and the vma->detached flag with a reference counter;
4. changes the vm_area_struct cache to SLAB_TYPESAFE_BY_RCU to allow for
vma reuse and to minimize call_rcu() calls.
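
A heavily simplified sketch of the resulting vm_area_struct layout after
the whole series (almost all fields elided; the exact field order and
attributes are defined by the individual patches, so treat this as an
illustration rather than the final definition):

struct vm_area_struct {
        unsigned long vm_start;
        unsigned long vm_end;
        struct mm_struct *vm_mm;
        /* ... */
#ifdef CONFIG_PER_VMA_LOCK
        /* used by vma_start_read()/vma_start_write() to detect write-locking */
        unsigned int vm_lock_seq;
#endif
        /* ... */
#ifdef CONFIG_PER_VMA_LOCK
        /* replaces the old vm_lock pointer and the detached flag */
        refcount_t vm_refcnt ____cacheline_aligned_in_smp;
#endif
} __randomize_layout;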

Pagefault microbenchmarks show performance improvement:
Hmean     faults/cpu-1    507926.5547 (   0.00%)   506519.3692 *  -0.28%*
Hmean     faults/cpu-4    479119.7051 (   0.00%)   481333.6802 *   0.46%*
Hmean     faults/cpu-7    452880.2961 (   0.00%)   455845.6211 *   0.65%*
Hmean     faults/cpu-12   347639.1021 (   0.00%)   352004.2254 *   1.26%*
Hmean     faults/cpu-21   200061.2238 (   0.00%)   229597.0317 *  14.76%*
Hmean     faults/cpu-30   145251.2001 (   0.00%)   164202.5067 *  13.05%*
Hmean     faults/cpu-48   106848.4434 (   0.00%)   120641.5504 *  12.91%*
Hmean     faults/cpu-56    92472.3835 (   0.00%)   103464.7916 *  11.89%*
Hmean     faults/sec-1    507566.1468 (   0.00%)   506139.0811 *  -0.28%*
Hmean     faults/sec-4   1880478.2402 (   0.00%)  1886795.6329 *   0.34%*
Hmean     faults/sec-7   3106394.3438 (   0.00%)  3140550.7485 *   1.10%*
Hmean     faults/sec-12  4061358.4795 (   0.00%)  4112477.0206 *   1.26%*
Hmean     faults/sec-21  3988619.1169 (   0.00%)  4577747.1436 *  14.77%*
Hmean     faults/sec-30  3909839.5449 (   0.00%)  4311052.2787 *  10.26%*
Hmean     faults/sec-48  4761108.4691 (   0.00%)  5283790.5026 *  10.98%*
Hmean     faults/sec-56  4885561.4590 (   0.00%)  5415839.4045 *  10.85%*

Changes since v7 [4]:
- Removed additional parameter for vma_iter_store() and introduced
vma_iter_store_attached() instead, per Vlastimil Babka and
Liam R. Howlett
- Fixed coding style nits, per Vlastimil Babka
- Added Reviewed-bys and Acked-bys, per Vlastimil Babka
- Added Reviewed-bys and Acked-bys, per Liam R. Howlett
- Added Acked-by, per Davidlohr Bueso
- Removed unnecessary patch changing nommu.c
- Folded a fixup patch [5] into the patch it was fixing
- Changed calculation in __refcount_add_not_zero_limited() to avoid
overflow, to change the limit to be inclusive and to use INT_MAX to
indicate no limits, per Vlastimil Babka and Matthew Wilcox
- Folded a fixup patch [6] into the patch it was fixing
- Added vm_refcnt rules summary in the changelog, per Liam R. Howlett
- Changed writers to not increment vm_refcnt and adjusted VMA_REF_LIMIT
to not reserve one count for a writer, per Liam R. Howlett
- Changed vma_refcount_put() to wake up writers only when the last reader
is leaving, per Liam R. Howlett
- Fixed rwsem_acquire_read() parameters when read-locking a vma to match
the way down_read_trylock() does lockdep, per Vlastimil Babka
- Folded vma_lockdep_init() into vma_lock_init() for simplicity
- Brought back vma_copy() to keep vm_refcnt at 0 during reuse,
per Vlastimil Babka

What I did not include in this patchset:
- Liam's suggestion to change the dump_vma() output, since it's unclear to
me what it should look like. The patch is debug-only and not critical for
the rest of the series; we can change the output later or even drop it if
necessary.

[1] https://lore.kernel.org/all/20230227173632.3292573-34-surenb@google.com/
[2] https://lore.kernel.org/all/ZsQyI%2F087V34JoIt@xsang-OptiPlex-9020/
[3] https://lore.kernel.org/all/CAJuCfpEisU8Lfe96AYJDZ+OM4NoPmnw9bP53cT_kbfP_pR+-2g@mail.gmail.com/
[4] https://lore.kernel.org/all/20241226170710.1159679-1-surenb@google.com/
[5] https://lore.kernel.org/all/20250107030415.721474-1-surenb@google.com/
[6] https://lore.kernel.org/all/20241226200335.1250078-1-surenb@google.com/

Patchset applies over mm-unstable after reverting v7
(current SHA range: 588f0086398e - fb2270654630)

Suren Baghdasaryan (16):
  mm: introduce vma_start_read_locked{_nested} helpers
  mm: move per-vma lock into vm_area_struct
  mm: mark vma as detached until it's added into vma tree
  mm: introduce vma_iter_store_attached() to use with attached vmas
  mm: mark vmas detached upon exit
  types: move struct rcuwait into types.h
  mm: allow vma_start_read_locked/vma_start_read_locked_nested to fail
  mm: move mmap_init_lock() out of the header file
  mm: uninline the main body of vma_start_write()
  refcount: introduce __refcount_{add|inc}_not_zero_limited
  mm: replace vm_lock and detached flag with a reference count
  mm/debug: print vm_refcnt state when dumping the vma
  mm: remove extra vma_numab_state_init() call
  mm: prepare lock_vma_under_rcu() for vma reuse possibility
  mm: make vma cache SLAB_TYPESAFE_BY_RCU
  docs/mm: document latest changes to vm_lock

 Documentation/mm/process_addrs.rst |  44 +++++----
 include/linux/mm.h                 | 152 ++++++++++++++++++++++-------
 include/linux/mm_types.h           |  36 ++++---
 include/linux/mmap_lock.h          |   6 --
 include/linux/rcuwait.h            |  13 +--
 include/linux/refcount.h           |  20 +++-
 include/linux/slab.h               |   6 --
 include/linux/types.h              |  12 +++
 kernel/fork.c                      | 128 +++++++++++-------------
 mm/debug.c                         |  12 +++
 mm/init-mm.c                       |   1 +
 mm/memory.c                        |  94 +++++++++++++++---
 mm/mmap.c                          |   3 +-
 mm/userfaultfd.c                   |  32 +++---
 mm/vma.c                           |  23 ++---
 mm/vma.h                           |  15 ++-
 tools/testing/vma/linux/atomic.h   |   5 +
 tools/testing/vma/vma_internal.h   |  93 ++++++++----------
 18 files changed, 435 insertions(+), 260 deletions(-)

-- 
2.47.1.613.gc27f4b7a9f-goog




* [PATCH v8 01/16] mm: introduce vma_start_read_locked{_nested} helpers
  2025-01-09  2:30 [PATCH v8 00/16] move per-vma lock into vm_area_struct Suren Baghdasaryan
@ 2025-01-09  2:30 ` Suren Baghdasaryan
  2025-01-09  2:30 ` [PATCH v8 02/16] mm: move per-vma lock into vm_area_struct Suren Baghdasaryan
                   ` (17 subsequent siblings)
  18 siblings, 0 replies; 53+ messages in thread
From: Suren Baghdasaryan @ 2025-01-09  2:30 UTC (permalink / raw)
  To: akpm
  Cc: peterz, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
	hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
	jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
	richard.weiyang, corbet, linux-doc, linux-mm, linux-kernel,
	kernel-team, surenb, Liam R. Howlett

Introduce helper functions which can be used to read-lock a VMA when
holding mmap_lock for read.  Replace direct accesses to vma->vm_lock with
these new helpers.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
---
 include/linux/mm.h | 24 ++++++++++++++++++++++++
 mm/userfaultfd.c   | 22 +++++-----------------
 2 files changed, 29 insertions(+), 17 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 57b9e4dc4724..b040376ee81f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -735,6 +735,30 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
 	return true;
 }
 
+/*
+ * Use only while holding mmap read lock which guarantees that locking will not
+ * fail (nobody can concurrently write-lock the vma). vma_start_read() should
+ * not be used in such cases because it might fail due to mm_lock_seq overflow.
+ * This functionality is used to obtain vma read lock and drop the mmap read lock.
+ */
+static inline void vma_start_read_locked_nested(struct vm_area_struct *vma, int subclass)
+{
+	mmap_assert_locked(vma->vm_mm);
+	down_read_nested(&vma->vm_lock->lock, subclass);
+}
+
+/*
+ * Use only while holding mmap read lock which guarantees that locking will not
+ * fail (nobody can concurrently write-lock the vma). vma_start_read() should
+ * not be used in such cases because it might fail due to mm_lock_seq overflow.
+ * This functionality is used to obtain vma read lock and drop the mmap read lock.
+ */
+static inline void vma_start_read_locked(struct vm_area_struct *vma)
+{
+	mmap_assert_locked(vma->vm_mm);
+	down_read(&vma->vm_lock->lock);
+}
+
 static inline void vma_end_read(struct vm_area_struct *vma)
 {
 	rcu_read_lock(); /* keeps vma alive till the end of up_read */
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 11b7eb3c8a28..a03c6f1ceb9e 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -84,16 +84,8 @@ static struct vm_area_struct *uffd_lock_vma(struct mm_struct *mm,
 
 	mmap_read_lock(mm);
 	vma = find_vma_and_prepare_anon(mm, address);
-	if (!IS_ERR(vma)) {
-		/*
-		 * We cannot use vma_start_read() as it may fail due to
-		 * false locked (see comment in vma_start_read()). We
-		 * can avoid that by directly locking vm_lock under
-		 * mmap_lock, which guarantees that nobody can lock the
-		 * vma for write (vma_start_write()) under us.
-		 */
-		down_read(&vma->vm_lock->lock);
-	}
+	if (!IS_ERR(vma))
+		vma_start_read_locked(vma);
 
 	mmap_read_unlock(mm);
 	return vma;
@@ -1490,14 +1482,10 @@ static int uffd_move_lock(struct mm_struct *mm,
 	mmap_read_lock(mm);
 	err = find_vmas_mm_locked(mm, dst_start, src_start, dst_vmap, src_vmap);
 	if (!err) {
-		/*
-		 * See comment in uffd_lock_vma() as to why not using
-		 * vma_start_read() here.
-		 */
-		down_read(&(*dst_vmap)->vm_lock->lock);
+		vma_start_read_locked(*dst_vmap);
 		if (*dst_vmap != *src_vmap)
-			down_read_nested(&(*src_vmap)->vm_lock->lock,
-					 SINGLE_DEPTH_NESTING);
+			vma_start_read_locked_nested(*src_vmap,
+						SINGLE_DEPTH_NESTING);
 	}
 	mmap_read_unlock(mm);
 	return err;
-- 
2.47.1.613.gc27f4b7a9f-goog




* [PATCH v8 02/16] mm: move per-vma lock into vm_area_struct
  2025-01-09  2:30 [PATCH v8 00/16] move per-vma lock into vm_area_struct Suren Baghdasaryan
  2025-01-09  2:30 ` [PATCH v8 01/16] mm: introduce vma_start_read_locked{_nested} helpers Suren Baghdasaryan
@ 2025-01-09  2:30 ` Suren Baghdasaryan
  2025-01-09  2:30 ` [PATCH v8 03/16] mm: mark vma as detached until it's added into vma tree Suren Baghdasaryan
                   ` (16 subsequent siblings)
  18 siblings, 0 replies; 53+ messages in thread
From: Suren Baghdasaryan @ 2025-01-09  2:30 UTC (permalink / raw)
  To: akpm
  Cc: peterz, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
	hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
	jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
	richard.weiyang, corbet, linux-doc, linux-mm, linux-kernel,
	kernel-team, surenb, Liam R. Howlett

Back when per-vma locks were introduced, vm_lock was moved out of
vm_area_struct in [1] because of the performance regression caused by
false cacheline sharing.  Recent investigation [2] revealed that the
regression is limited to a rather old Broadwell microarchitecture and
even there it can be mitigated by disabling adjacent cacheline
prefetching, see [3].

Splitting a single logical structure into multiple ones leads to more
complicated management, extra pointer dereferences and overall less
maintainable code.  When that split-away part is a lock, it complicates
things even further.  With no performance benefit, there is no reason
for this split.  Merging the vm_lock back into vm_area_struct also allows
vm_area_struct to use SLAB_TYPESAFE_BY_RCU later in this patchset.  Move
vm_lock back into vm_area_struct, aligning it at the cacheline boundary
and changing the cache to be cacheline-aligned as well.  With kernel
compiled using defconfig, this causes VMA memory consumption to grow from
160 (vm_area_struct) + 40 (vm_lock) bytes to 256 bytes:

    slabinfo before:
     <name>           ... <objsize> <objperslab> <pagesperslab> : ...
     vma_lock         ...     40  102    1 : ...
     vm_area_struct   ...    160   51    2 : ...

    slabinfo after moving vm_lock:
     <name>           ... <objsize> <objperslab> <pagesperslab> : ...
     vm_area_struct   ...    256   32    2 : ...

Aggregate VMA memory consumption per 1000 VMAs grows from 50 to 64 pages,
which is 5.5MB per 100000 VMAs.  Note that the size of this structure is
dependent on the kernel configuration and typically the original size is
higher than 160 bytes.  Therefore these calculations are close to the
worst case scenario.  A more realistic vm_area_struct usage before this
change is:

     <name>           ... <objsize> <objperslab> <pagesperslab> : ...
     vma_lock         ...     40  102    1 : ...
     vm_area_struct   ...    176   46    2 : ...

Aggregate VMA memory consumption per 1000 VMAs grows from 54 to 64 pages,
which is 3.9MB per 100000 VMAs.  This memory consumption growth can be
addressed later by optimizing the vm_lock.
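
For reference, the page counts above follow directly from the slabinfo
numbers, with pages = ceil(nr_vmas / objperslab) * pagesperslab:

     defconfig before:  ceil(1000/51)*2 + ceil(1000/102)*1 = 40 + 10 = 50 pages
     defconfig after:   ceil(1000/32)*2                    = 64 pages
     realistic before:  ceil(1000/46)*2 + ceil(1000/102)*1 = 44 + 10 = 54 pages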

[1] https://lore.kernel.org/all/20230227173632.3292573-34-surenb@google.com/
[2] https://lore.kernel.org/all/ZsQyI%2F087V34JoIt@xsang-OptiPlex-9020/
[3] https://lore.kernel.org/all/CAJuCfpEisU8Lfe96AYJDZ+OM4NoPmnw9bP53cT_kbfP_pR+-2g@mail.gmail.com/

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
---
 include/linux/mm.h               | 28 ++++++++++--------
 include/linux/mm_types.h         |  6 ++--
 kernel/fork.c                    | 49 ++++----------------------------
 tools/testing/vma/vma_internal.h | 33 +++++----------------
 4 files changed, 32 insertions(+), 84 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index b040376ee81f..920e5ddd77cc 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -697,6 +697,12 @@ static inline void vma_numab_state_free(struct vm_area_struct *vma) {}
 #endif /* CONFIG_NUMA_BALANCING */
 
 #ifdef CONFIG_PER_VMA_LOCK
+static inline void vma_lock_init(struct vm_area_struct *vma)
+{
+	init_rwsem(&vma->vm_lock.lock);
+	vma->vm_lock_seq = UINT_MAX;
+}
+
 /*
  * Try to read-lock a vma. The function is allowed to occasionally yield false
  * locked result to avoid performance overhead, in which case we fall back to
@@ -714,7 +720,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
 	if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq.sequence))
 		return false;
 
-	if (unlikely(down_read_trylock(&vma->vm_lock->lock) == 0))
+	if (unlikely(down_read_trylock(&vma->vm_lock.lock) == 0))
 		return false;
 
 	/*
@@ -729,7 +735,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
 	 * This pairs with RELEASE semantics in vma_end_write_all().
 	 */
 	if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&vma->vm_mm->mm_lock_seq))) {
-		up_read(&vma->vm_lock->lock);
+		up_read(&vma->vm_lock.lock);
 		return false;
 	}
 	return true;
@@ -744,7 +750,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
 static inline void vma_start_read_locked_nested(struct vm_area_struct *vma, int subclass)
 {
 	mmap_assert_locked(vma->vm_mm);
-	down_read_nested(&vma->vm_lock->lock, subclass);
+	down_read_nested(&vma->vm_lock.lock, subclass);
 }
 
 /*
@@ -756,13 +762,13 @@ static inline void vma_start_read_locked_nested(struct vm_area_struct *vma, int
 static inline void vma_start_read_locked(struct vm_area_struct *vma)
 {
 	mmap_assert_locked(vma->vm_mm);
-	down_read(&vma->vm_lock->lock);
+	down_read(&vma->vm_lock.lock);
 }
 
 static inline void vma_end_read(struct vm_area_struct *vma)
 {
 	rcu_read_lock(); /* keeps vma alive till the end of up_read */
-	up_read(&vma->vm_lock->lock);
+	up_read(&vma->vm_lock.lock);
 	rcu_read_unlock();
 }
 
@@ -791,7 +797,7 @@ static inline void vma_start_write(struct vm_area_struct *vma)
 	if (__is_vma_write_locked(vma, &mm_lock_seq))
 		return;
 
-	down_write(&vma->vm_lock->lock);
+	down_write(&vma->vm_lock.lock);
 	/*
 	 * We should use WRITE_ONCE() here because we can have concurrent reads
 	 * from the early lockless pessimistic check in vma_start_read().
@@ -799,7 +805,7 @@ static inline void vma_start_write(struct vm_area_struct *vma)
 	 * we should use WRITE_ONCE() for cleanliness and to keep KCSAN happy.
 	 */
 	WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
-	up_write(&vma->vm_lock->lock);
+	up_write(&vma->vm_lock.lock);
 }
 
 static inline void vma_assert_write_locked(struct vm_area_struct *vma)
@@ -811,7 +817,7 @@ static inline void vma_assert_write_locked(struct vm_area_struct *vma)
 
 static inline void vma_assert_locked(struct vm_area_struct *vma)
 {
-	if (!rwsem_is_locked(&vma->vm_lock->lock))
+	if (!rwsem_is_locked(&vma->vm_lock.lock))
 		vma_assert_write_locked(vma);
 }
 
@@ -844,6 +850,7 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
 
 #else /* CONFIG_PER_VMA_LOCK */
 
+static inline void vma_lock_init(struct vm_area_struct *vma) {}
 static inline bool vma_start_read(struct vm_area_struct *vma)
 		{ return false; }
 static inline void vma_end_read(struct vm_area_struct *vma) {}
@@ -878,10 +885,6 @@ static inline void assert_fault_locked(struct vm_fault *vmf)
 
 extern const struct vm_operations_struct vma_dummy_vm_ops;
 
-/*
- * WARNING: vma_init does not initialize vma->vm_lock.
- * Use vm_area_alloc()/vm_area_free() if vma needs locking.
- */
 static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
 {
 	memset(vma, 0, sizeof(*vma));
@@ -890,6 +893,7 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
 	INIT_LIST_HEAD(&vma->anon_vma_chain);
 	vma_mark_detached(vma, false);
 	vma_numab_state_init(vma);
+	vma_lock_init(vma);
 }
 
 /* Use when VMA is not part of the VMA tree and needs no locking */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 70dce20cbfd1..0ca63dee1902 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -738,8 +738,6 @@ struct vm_area_struct {
 	 * slowpath.
 	 */
 	unsigned int vm_lock_seq;
-	/* Unstable RCU readers are allowed to read this. */
-	struct vma_lock *vm_lock;
 #endif
 
 	/*
@@ -792,6 +790,10 @@ struct vm_area_struct {
 	struct vma_numab_state *numab_state;	/* NUMA Balancing state */
 #endif
 	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
+#ifdef CONFIG_PER_VMA_LOCK
+	/* Unstable RCU readers are allowed to read this. */
+	struct vma_lock vm_lock ____cacheline_aligned_in_smp;
+#endif
 } __randomize_layout;
 
 #ifdef CONFIG_NUMA
diff --git a/kernel/fork.c b/kernel/fork.c
index ded49f18cd95..40a8e615499f 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -436,35 +436,6 @@ static struct kmem_cache *vm_area_cachep;
 /* SLAB cache for mm_struct structures (tsk->mm) */
 static struct kmem_cache *mm_cachep;
 
-#ifdef CONFIG_PER_VMA_LOCK
-
-/* SLAB cache for vm_area_struct.lock */
-static struct kmem_cache *vma_lock_cachep;
-
-static bool vma_lock_alloc(struct vm_area_struct *vma)
-{
-	vma->vm_lock = kmem_cache_alloc(vma_lock_cachep, GFP_KERNEL);
-	if (!vma->vm_lock)
-		return false;
-
-	init_rwsem(&vma->vm_lock->lock);
-	vma->vm_lock_seq = UINT_MAX;
-
-	return true;
-}
-
-static inline void vma_lock_free(struct vm_area_struct *vma)
-{
-	kmem_cache_free(vma_lock_cachep, vma->vm_lock);
-}
-
-#else /* CONFIG_PER_VMA_LOCK */
-
-static inline bool vma_lock_alloc(struct vm_area_struct *vma) { return true; }
-static inline void vma_lock_free(struct vm_area_struct *vma) {}
-
-#endif /* CONFIG_PER_VMA_LOCK */
-
 struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
 {
 	struct vm_area_struct *vma;
@@ -474,10 +445,6 @@ struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
 		return NULL;
 
 	vma_init(vma, mm);
-	if (!vma_lock_alloc(vma)) {
-		kmem_cache_free(vm_area_cachep, vma);
-		return NULL;
-	}
 
 	return vma;
 }
@@ -496,10 +463,7 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
 	 * will be reinitialized.
 	 */
 	data_race(memcpy(new, orig, sizeof(*new)));
-	if (!vma_lock_alloc(new)) {
-		kmem_cache_free(vm_area_cachep, new);
-		return NULL;
-	}
+	vma_lock_init(new);
 	INIT_LIST_HEAD(&new->anon_vma_chain);
 	vma_numab_state_init(new);
 	dup_anon_vma_name(orig, new);
@@ -511,7 +475,6 @@ void __vm_area_free(struct vm_area_struct *vma)
 {
 	vma_numab_state_free(vma);
 	free_anon_vma_name(vma);
-	vma_lock_free(vma);
 	kmem_cache_free(vm_area_cachep, vma);
 }
 
@@ -522,7 +485,7 @@ static void vm_area_free_rcu_cb(struct rcu_head *head)
 						  vm_rcu);
 
 	/* The vma should not be locked while being destroyed. */
-	VM_BUG_ON_VMA(rwsem_is_locked(&vma->vm_lock->lock), vma);
+	VM_BUG_ON_VMA(rwsem_is_locked(&vma->vm_lock.lock), vma);
 	__vm_area_free(vma);
 }
 #endif
@@ -3188,11 +3151,9 @@ void __init proc_caches_init(void)
 			sizeof(struct fs_struct), 0,
 			SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT,
 			NULL);
-
-	vm_area_cachep = KMEM_CACHE(vm_area_struct, SLAB_PANIC|SLAB_ACCOUNT);
-#ifdef CONFIG_PER_VMA_LOCK
-	vma_lock_cachep = KMEM_CACHE(vma_lock, SLAB_PANIC|SLAB_ACCOUNT);
-#endif
+	vm_area_cachep = KMEM_CACHE(vm_area_struct,
+			SLAB_HWCACHE_ALIGN|SLAB_NO_MERGE|SLAB_PANIC|
+			SLAB_ACCOUNT);
 	mmap_init();
 	nsproxy_cache_init();
 }
diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
index 2404347fa2c7..96aeb28c81f9 100644
--- a/tools/testing/vma/vma_internal.h
+++ b/tools/testing/vma/vma_internal.h
@@ -274,10 +274,10 @@ struct vm_area_struct {
 	/*
 	 * Can only be written (using WRITE_ONCE()) while holding both:
 	 *  - mmap_lock (in write mode)
-	 *  - vm_lock->lock (in write mode)
+	 *  - vm_lock.lock (in write mode)
 	 * Can be read reliably while holding one of:
 	 *  - mmap_lock (in read or write mode)
-	 *  - vm_lock->lock (in read or write mode)
+	 *  - vm_lock.lock (in read or write mode)
 	 * Can be read unreliably (using READ_ONCE()) for pessimistic bailout
 	 * while holding nothing (except RCU to keep the VMA struct allocated).
 	 *
@@ -286,7 +286,7 @@ struct vm_area_struct {
 	 * slowpath.
 	 */
 	unsigned int vm_lock_seq;
-	struct vma_lock *vm_lock;
+	struct vma_lock vm_lock;
 #endif
 
 	/*
@@ -463,17 +463,10 @@ static inline struct vm_area_struct *vma_next(struct vma_iterator *vmi)
 	return mas_find(&vmi->mas, ULONG_MAX);
 }
 
-static inline bool vma_lock_alloc(struct vm_area_struct *vma)
+static inline void vma_lock_init(struct vm_area_struct *vma)
 {
-	vma->vm_lock = calloc(1, sizeof(struct vma_lock));
-
-	if (!vma->vm_lock)
-		return false;
-
-	init_rwsem(&vma->vm_lock->lock);
+	init_rwsem(&vma->vm_lock.lock);
 	vma->vm_lock_seq = UINT_MAX;
-
-	return true;
 }
 
 static inline void vma_assert_write_locked(struct vm_area_struct *);
@@ -496,6 +489,7 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
 	vma->vm_ops = &vma_dummy_vm_ops;
 	INIT_LIST_HEAD(&vma->anon_vma_chain);
 	vma_mark_detached(vma, false);
+	vma_lock_init(vma);
 }
 
 static inline struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
@@ -506,10 +500,6 @@ static inline struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
 		return NULL;
 
 	vma_init(vma, mm);
-	if (!vma_lock_alloc(vma)) {
-		free(vma);
-		return NULL;
-	}
 
 	return vma;
 }
@@ -522,10 +512,7 @@ static inline struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
 		return NULL;
 
 	memcpy(new, orig, sizeof(*new));
-	if (!vma_lock_alloc(new)) {
-		free(new);
-		return NULL;
-	}
+	vma_lock_init(new);
 	INIT_LIST_HEAD(&new->anon_vma_chain);
 
 	return new;
@@ -695,14 +682,8 @@ static inline void mpol_put(struct mempolicy *)
 {
 }
 
-static inline void vma_lock_free(struct vm_area_struct *vma)
-{
-	free(vma->vm_lock);
-}
-
 static inline void __vm_area_free(struct vm_area_struct *vma)
 {
-	vma_lock_free(vma);
 	free(vma);
 }
 
-- 
2.47.1.613.gc27f4b7a9f-goog




* [PATCH v8 03/16] mm: mark vma as detached until it's added into vma tree
  2025-01-09  2:30 [PATCH v8 00/16] move per-vma lock into vm_area_struct Suren Baghdasaryan
  2025-01-09  2:30 ` [PATCH v8 01/16] mm: introduce vma_start_read_locked{_nested} helpers Suren Baghdasaryan
  2025-01-09  2:30 ` [PATCH v8 02/16] mm: move per-vma lock into vm_area_struct Suren Baghdasaryan
@ 2025-01-09  2:30 ` Suren Baghdasaryan
  2025-01-09  2:30 ` [PATCH v8 04/16] mm: introduce vma_iter_store_attached() to use with attached vmas Suren Baghdasaryan
                   ` (15 subsequent siblings)
  18 siblings, 0 replies; 53+ messages in thread
From: Suren Baghdasaryan @ 2025-01-09  2:30 UTC (permalink / raw)
  To: akpm
  Cc: peterz, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
	hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
	jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
	richard.weiyang, corbet, linux-doc, linux-mm, linux-kernel,
	kernel-team, surenb, Liam R. Howlett

The current implementation does not set the detached flag when a VMA is
first allocated.  This does not represent the real state of the VMA,
which is detached until it is added into the mm's VMA tree.  Fix this by
marking new VMAs as detached and resetting the detached flag only after
the VMA is added into a tree.

Introduce vma_mark_attached() to make the API more readable and to
simplify a possible future cleanup when vma->vm_mm might be used to
indicate a detached vma and vma_mark_attached() will need an additional
mm parameter.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
---
 include/linux/mm.h               | 27 ++++++++++++++++++++-------
 kernel/fork.c                    |  4 ++++
 mm/memory.c                      |  2 +-
 mm/vma.c                         |  6 +++---
 mm/vma.h                         |  2 ++
 tools/testing/vma/vma_internal.h | 17 ++++++++++++-----
 6 files changed, 42 insertions(+), 16 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 920e5ddd77cc..a9d8dd5745f7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -821,12 +821,21 @@ static inline void vma_assert_locked(struct vm_area_struct *vma)
 		vma_assert_write_locked(vma);
 }
 
-static inline void vma_mark_detached(struct vm_area_struct *vma, bool detached)
+static inline void vma_mark_attached(struct vm_area_struct *vma)
+{
+	vma->detached = false;
+}
+
+static inline void vma_mark_detached(struct vm_area_struct *vma)
 {
 	/* When detaching vma should be write-locked */
-	if (detached)
-		vma_assert_write_locked(vma);
-	vma->detached = detached;
+	vma_assert_write_locked(vma);
+	vma->detached = true;
+}
+
+static inline bool is_vma_detached(struct vm_area_struct *vma)
+{
+	return vma->detached;
 }
 
 static inline void release_fault_lock(struct vm_fault *vmf)
@@ -857,8 +866,8 @@ static inline void vma_end_read(struct vm_area_struct *vma) {}
 static inline void vma_start_write(struct vm_area_struct *vma) {}
 static inline void vma_assert_write_locked(struct vm_area_struct *vma)
 		{ mmap_assert_write_locked(vma->vm_mm); }
-static inline void vma_mark_detached(struct vm_area_struct *vma,
-				     bool detached) {}
+static inline void vma_mark_attached(struct vm_area_struct *vma) {}
+static inline void vma_mark_detached(struct vm_area_struct *vma) {}
 
 static inline struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
 		unsigned long address)
@@ -891,7 +900,10 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
 	vma->vm_mm = mm;
 	vma->vm_ops = &vma_dummy_vm_ops;
 	INIT_LIST_HEAD(&vma->anon_vma_chain);
-	vma_mark_detached(vma, false);
+#ifdef CONFIG_PER_VMA_LOCK
+	/* vma is not locked, can't use vma_mark_detached() */
+	vma->detached = true;
+#endif
 	vma_numab_state_init(vma);
 	vma_lock_init(vma);
 }
@@ -1086,6 +1098,7 @@ static inline int vma_iter_bulk_store(struct vma_iterator *vmi,
 	if (unlikely(mas_is_err(&vmi->mas)))
 		return -ENOMEM;
 
+	vma_mark_attached(vma);
 	return 0;
 }
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 40a8e615499f..f2f9e7b427ad 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -465,6 +465,10 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
 	data_race(memcpy(new, orig, sizeof(*new)));
 	vma_lock_init(new);
 	INIT_LIST_HEAD(&new->anon_vma_chain);
+#ifdef CONFIG_PER_VMA_LOCK
+	/* vma is not locked, can't use vma_mark_detached() */
+	new->detached = true;
+#endif
 	vma_numab_state_init(new);
 	dup_anon_vma_name(orig, new);
 
diff --git a/mm/memory.c b/mm/memory.c
index 1342d451b1bd..105b99064ce5 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6391,7 +6391,7 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
 		goto inval;
 
 	/* Check if the VMA got isolated after we found it */
-	if (vma->detached) {
+	if (is_vma_detached(vma)) {
 		vma_end_read(vma);
 		count_vm_vma_lock_event(VMA_LOCK_MISS);
 		/* The area was replaced with another one */
diff --git a/mm/vma.c b/mm/vma.c
index af1d549b179c..d603494e69d7 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -327,7 +327,7 @@ static void vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi,
 
 	if (vp->remove) {
 again:
-		vma_mark_detached(vp->remove, true);
+		vma_mark_detached(vp->remove);
 		if (vp->file) {
 			uprobe_munmap(vp->remove, vp->remove->vm_start,
 				      vp->remove->vm_end);
@@ -1221,7 +1221,7 @@ static void reattach_vmas(struct ma_state *mas_detach)
 
 	mas_set(mas_detach, 0);
 	mas_for_each(mas_detach, vma, ULONG_MAX)
-		vma_mark_detached(vma, false);
+		vma_mark_attached(vma);
 
 	__mt_destroy(mas_detach->tree);
 }
@@ -1296,7 +1296,7 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
 		if (error)
 			goto munmap_gather_failed;
 
-		vma_mark_detached(next, true);
+		vma_mark_detached(next);
 		nrpages = vma_pages(next);
 
 		vms->nr_pages += nrpages;
diff --git a/mm/vma.h b/mm/vma.h
index a2e8710b8c47..2a2668de8d2c 100644
--- a/mm/vma.h
+++ b/mm/vma.h
@@ -157,6 +157,7 @@ static inline int vma_iter_store_gfp(struct vma_iterator *vmi,
 	if (unlikely(mas_is_err(&vmi->mas)))
 		return -ENOMEM;
 
+	vma_mark_attached(vma);
 	return 0;
 }
 
@@ -389,6 +390,7 @@ static inline void vma_iter_store(struct vma_iterator *vmi,
 
 	__mas_set_range(&vmi->mas, vma->vm_start, vma->vm_end - 1);
 	mas_store_prealloc(&vmi->mas, vma);
+	vma_mark_attached(vma);
 }
 
 static inline unsigned long vma_iter_addr(struct vma_iterator *vmi)
diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
index 96aeb28c81f9..47c8b03ffbbd 100644
--- a/tools/testing/vma/vma_internal.h
+++ b/tools/testing/vma/vma_internal.h
@@ -469,13 +469,17 @@ static inline void vma_lock_init(struct vm_area_struct *vma)
 	vma->vm_lock_seq = UINT_MAX;
 }
 
+static inline void vma_mark_attached(struct vm_area_struct *vma)
+{
+	vma->detached = false;
+}
+
 static inline void vma_assert_write_locked(struct vm_area_struct *);
-static inline void vma_mark_detached(struct vm_area_struct *vma, bool detached)
+static inline void vma_mark_detached(struct vm_area_struct *vma)
 {
 	/* When detaching vma should be write-locked */
-	if (detached)
-		vma_assert_write_locked(vma);
-	vma->detached = detached;
+	vma_assert_write_locked(vma);
+	vma->detached = true;
 }
 
 extern const struct vm_operations_struct vma_dummy_vm_ops;
@@ -488,7 +492,8 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
 	vma->vm_mm = mm;
 	vma->vm_ops = &vma_dummy_vm_ops;
 	INIT_LIST_HEAD(&vma->anon_vma_chain);
-	vma_mark_detached(vma, false);
+	/* vma is not locked, can't use vma_mark_detached() */
+	vma->detached = true;
 	vma_lock_init(vma);
 }
 
@@ -514,6 +519,8 @@ static inline struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
 	memcpy(new, orig, sizeof(*new));
 	vma_lock_init(new);
 	INIT_LIST_HEAD(&new->anon_vma_chain);
+	/* vma is not locked, can't use vma_mark_detached() */
+	new->detached = true;
 
 	return new;
 }
-- 
2.47.1.613.gc27f4b7a9f-goog




* [PATCH v8 04/16] mm: introduce vma_iter_store_attached() to use with attached vmas
  2025-01-09  2:30 [PATCH v8 00/16] move per-vma lock into vm_area_struct Suren Baghdasaryan
                   ` (2 preceding siblings ...)
  2025-01-09  2:30 ` [PATCH v8 03/16] mm: mark vma as detached until it's added into vma tree Suren Baghdasaryan
@ 2025-01-09  2:30 ` Suren Baghdasaryan
  2025-01-09 14:01   ` Vlastimil Babka
  2025-01-09  2:30 ` [PATCH v8 05/16] mm: mark vmas detached upon exit Suren Baghdasaryan
                   ` (14 subsequent siblings)
  18 siblings, 1 reply; 53+ messages in thread
From: Suren Baghdasaryan @ 2025-01-09  2:30 UTC (permalink / raw)
  To: akpm
  Cc: peterz, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
	hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
	jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
	richard.weiyang, corbet, linux-doc, linux-mm, linux-kernel,
	kernel-team, surenb

The vma_iter_store() functions can be used both when adding a new vma
and when updating an existing one. However, for existing ones we do not
need to mark them attached, as they are already marked that way.
Introduce vma_iter_store_attached() to be used with already attached
vmas.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 include/linux/mm.h | 12 ++++++++++++
 mm/vma.c           |  8 ++++----
 mm/vma.h           | 11 +++++++++--
 3 files changed, 25 insertions(+), 6 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index a9d8dd5745f7..e0d403c1ff63 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -821,6 +821,16 @@ static inline void vma_assert_locked(struct vm_area_struct *vma)
 		vma_assert_write_locked(vma);
 }
 
+static inline void vma_assert_attached(struct vm_area_struct *vma)
+{
+	VM_BUG_ON_VMA(vma->detached, vma);
+}
+
+static inline void vma_assert_detached(struct vm_area_struct *vma)
+{
+	VM_BUG_ON_VMA(!vma->detached, vma);
+}
+
 static inline void vma_mark_attached(struct vm_area_struct *vma)
 {
 	vma->detached = false;
@@ -866,6 +876,8 @@ static inline void vma_end_read(struct vm_area_struct *vma) {}
 static inline void vma_start_write(struct vm_area_struct *vma) {}
 static inline void vma_assert_write_locked(struct vm_area_struct *vma)
 		{ mmap_assert_write_locked(vma->vm_mm); }
+static inline void vma_assert_attached(struct vm_area_struct *vma) {}
+static inline void vma_assert_detached(struct vm_area_struct *vma) {}
 static inline void vma_mark_attached(struct vm_area_struct *vma) {}
 static inline void vma_mark_detached(struct vm_area_struct *vma) {}
 
diff --git a/mm/vma.c b/mm/vma.c
index d603494e69d7..b9cf552e120c 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -660,14 +660,14 @@ static int commit_merge(struct vma_merge_struct *vmg,
 	vma_set_range(vmg->vma, vmg->start, vmg->end, vmg->pgoff);
 
 	if (expanded)
-		vma_iter_store(vmg->vmi, vmg->vma);
+		vma_iter_store_attached(vmg->vmi, vmg->vma);
 
 	if (adj_start) {
 		adjust->vm_start += adj_start;
 		adjust->vm_pgoff += PHYS_PFN(adj_start);
 		if (adj_start < 0) {
 			WARN_ON(expanded);
-			vma_iter_store(vmg->vmi, adjust);
+			vma_iter_store_attached(vmg->vmi, adjust);
 		}
 	}
 
@@ -2845,7 +2845,7 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
 				anon_vma_interval_tree_pre_update_vma(vma);
 				vma->vm_end = address;
 				/* Overwrite old entry in mtree. */
-				vma_iter_store(&vmi, vma);
+				vma_iter_store_attached(&vmi, vma);
 				anon_vma_interval_tree_post_update_vma(vma);
 
 				perf_event_mmap(vma);
@@ -2925,7 +2925,7 @@ int expand_downwards(struct vm_area_struct *vma, unsigned long address)
 				vma->vm_start = address;
 				vma->vm_pgoff -= grow;
 				/* Overwrite old entry in mtree. */
-				vma_iter_store(&vmi, vma);
+				vma_iter_store_attached(&vmi, vma);
 				anon_vma_interval_tree_post_update_vma(vma);
 
 				perf_event_mmap(vma);
diff --git a/mm/vma.h b/mm/vma.h
index 2a2668de8d2c..63dd38d5230c 100644
--- a/mm/vma.h
+++ b/mm/vma.h
@@ -365,9 +365,10 @@ static inline struct vm_area_struct *vma_iter_load(struct vma_iterator *vmi)
 }
 
 /* Store a VMA with preallocated memory */
-static inline void vma_iter_store(struct vma_iterator *vmi,
-				  struct vm_area_struct *vma)
+static inline void vma_iter_store_attached(struct vma_iterator *vmi,
+					   struct vm_area_struct *vma)
 {
+	vma_assert_attached(vma);
 
 #if defined(CONFIG_DEBUG_VM_MAPLE_TREE)
 	if (MAS_WARN_ON(&vmi->mas, vmi->mas.status != ma_start &&
@@ -390,7 +391,13 @@ static inline void vma_iter_store(struct vma_iterator *vmi,
 
 	__mas_set_range(&vmi->mas, vma->vm_start, vma->vm_end - 1);
 	mas_store_prealloc(&vmi->mas, vma);
+}
+
+static inline void vma_iter_store(struct vma_iterator *vmi,
+				  struct vm_area_struct *vma)
+{
 	vma_mark_attached(vma);
+	vma_iter_store_attached(vmi, vma);
 }
 
 static inline unsigned long vma_iter_addr(struct vma_iterator *vmi)
-- 
2.47.1.613.gc27f4b7a9f-goog




* [PATCH v8 05/16] mm: mark vmas detached upon exit
  2025-01-09  2:30 [PATCH v8 00/16] move per-vma lock into vm_area_struct Suren Baghdasaryan
                   ` (3 preceding siblings ...)
  2025-01-09  2:30 ` [PATCH v8 04/16] mm: introduce vma_iter_store_attached() to use with attached vmas Suren Baghdasaryan
@ 2025-01-09  2:30 ` Suren Baghdasaryan
  2025-01-09  2:30 ` [PATCH v8 06/16] types: move struct rcuwait into types.h Suren Baghdasaryan
                   ` (13 subsequent siblings)
  18 siblings, 0 replies; 53+ messages in thread
From: Suren Baghdasaryan @ 2025-01-09  2:30 UTC (permalink / raw)
  To: akpm
  Cc: peterz, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
	hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
	jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
	richard.weiyang, corbet, linux-doc, linux-mm, linux-kernel,
	kernel-team, surenb

When exit_mmap() removes vmas belonging to an exiting task, it does not
mark them as detached since they can't be reached by other tasks and they
will be freed shortly. Once we introduce vma reuse, all vmas will have to
be in a detached state before they are freed, to ensure that a vma is in
a consistent state when it is reused. Add the missing vma_mark_detached()
before freeing the vma.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/vma.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/vma.c b/mm/vma.c
index b9cf552e120c..93ff42ac2002 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -413,10 +413,12 @@ void remove_vma(struct vm_area_struct *vma, bool unreachable)
 	if (vma->vm_file)
 		fput(vma->vm_file);
 	mpol_put(vma_policy(vma));
-	if (unreachable)
+	if (unreachable) {
+		vma_mark_detached(vma);
 		__vm_area_free(vma);
-	else
+	} else {
 		vm_area_free(vma);
+	}
 }
 
 /*
-- 
2.47.1.613.gc27f4b7a9f-goog




* [PATCH v8 06/16] types: move struct rcuwait into types.h
  2025-01-09  2:30 [PATCH v8 00/16] move per-vma lock into vm_area_struct Suren Baghdasaryan
                   ` (4 preceding siblings ...)
  2025-01-09  2:30 ` [PATCH v8 05/16] mm: mark vmas detached upon exit Suren Baghdasaryan
@ 2025-01-09  2:30 ` Suren Baghdasaryan
  2025-01-09  2:30 ` [PATCH v8 07/16] mm: allow vma_start_read_locked/vma_start_read_locked_nested to fail Suren Baghdasaryan
                   ` (12 subsequent siblings)
  18 siblings, 0 replies; 53+ messages in thread
From: Suren Baghdasaryan @ 2025-01-09  2:30 UTC (permalink / raw)
  To: akpm
  Cc: peterz, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
	hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
	jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
	richard.weiyang, corbet, linux-doc, linux-mm, linux-kernel,
	kernel-team, surenb, Liam R. Howlett

Move the rcuwait struct definition into types.h so that rcuwait can be
used without including rcuwait.h, which pulls in other headers. Without
this change mm_types.h can't use rcuwait due to the following circular
dependency:

mm_types.h -> rcuwait.h -> signal.h -> mm_types.h

Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Acked-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
---
 include/linux/rcuwait.h | 13 +------------
 include/linux/types.h   | 12 ++++++++++++
 2 files changed, 13 insertions(+), 12 deletions(-)

diff --git a/include/linux/rcuwait.h b/include/linux/rcuwait.h
index 27343424225c..9ad134a04b41 100644
--- a/include/linux/rcuwait.h
+++ b/include/linux/rcuwait.h
@@ -4,18 +4,7 @@
 
 #include <linux/rcupdate.h>
 #include <linux/sched/signal.h>
-
-/*
- * rcuwait provides a way of blocking and waking up a single
- * task in an rcu-safe manner.
- *
- * The only time @task is non-nil is when a user is blocked (or
- * checking if it needs to) on a condition, and reset as soon as we
- * know that the condition has succeeded and are awoken.
- */
-struct rcuwait {
-	struct task_struct __rcu *task;
-};
+#include <linux/types.h>
 
 #define __RCUWAIT_INITIALIZER(name)		\
 	{ .task = NULL, }
diff --git a/include/linux/types.h b/include/linux/types.h
index 2d7b9ae8714c..f1356a9a5730 100644
--- a/include/linux/types.h
+++ b/include/linux/types.h
@@ -248,5 +248,17 @@ typedef void (*swap_func_t)(void *a, void *b, int size);
 typedef int (*cmp_r_func_t)(const void *a, const void *b, const void *priv);
 typedef int (*cmp_func_t)(const void *a, const void *b);
 
+/*
+ * rcuwait provides a way of blocking and waking up a single
+ * task in an rcu-safe manner.
+ *
+ * The only time @task is non-nil is when a user is blocked (or
+ * checking if it needs to) on a condition, and reset as soon as we
+ * know that the condition has succeeded and are awoken.
+ */
+struct rcuwait {
+	struct task_struct __rcu *task;
+};
+
 #endif /*  __ASSEMBLY__ */
 #endif /* _LINUX_TYPES_H */
-- 
2.47.1.613.gc27f4b7a9f-goog




* [PATCH v8 07/16] mm: allow vma_start_read_locked/vma_start_read_locked_nested to fail
  2025-01-09  2:30 [PATCH v8 00/16] move per-vma lock into vm_area_struct Suren Baghdasaryan
                   ` (5 preceding siblings ...)
  2025-01-09  2:30 ` [PATCH v8 06/16] types: move struct rcuwait into types.h Suren Baghdasaryan
@ 2025-01-09  2:30 ` Suren Baghdasaryan
  2025-01-09  2:30 ` [PATCH v8 08/16] mm: move mmap_init_lock() out of the header file Suren Baghdasaryan
                   ` (11 subsequent siblings)
  18 siblings, 0 replies; 53+ messages in thread
From: Suren Baghdasaryan @ 2025-01-09  2:30 UTC (permalink / raw)
  To: akpm
  Cc: peterz, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
	hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
	jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
	richard.weiyang, corbet, linux-doc, linux-mm, linux-kernel,
	kernel-team, surenb

With the upcoming replacement of vm_lock with vm_refcnt, we need to
handle the possibility of vma_start_read_locked/vma_start_read_locked_nested
failing due to refcount overflow. Prepare for that possibility by changing
these APIs and adjusting their users.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Lokesh Gidra <lokeshgidra@google.com>
---
 include/linux/mm.h |  6 ++++--
 mm/userfaultfd.c   | 18 +++++++++++++-----
 2 files changed, 17 insertions(+), 7 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index e0d403c1ff63..6e6edfd4f3d9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -747,10 +747,11 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
  * not be used in such cases because it might fail due to mm_lock_seq overflow.
  * This functionality is used to obtain vma read lock and drop the mmap read lock.
  */
-static inline void vma_start_read_locked_nested(struct vm_area_struct *vma, int subclass)
+static inline bool vma_start_read_locked_nested(struct vm_area_struct *vma, int subclass)
 {
 	mmap_assert_locked(vma->vm_mm);
 	down_read_nested(&vma->vm_lock.lock, subclass);
+	return true;
 }
 
 /*
@@ -759,10 +760,11 @@ static inline void vma_start_read_locked_nested(struct vm_area_struct *vma, int
  * not be used in such cases because it might fail due to mm_lock_seq overflow.
  * This functionality is used to obtain vma read lock and drop the mmap read lock.
  */
-static inline void vma_start_read_locked(struct vm_area_struct *vma)
+static inline bool vma_start_read_locked(struct vm_area_struct *vma)
 {
 	mmap_assert_locked(vma->vm_mm);
 	down_read(&vma->vm_lock.lock);
+	return true;
 }
 
 static inline void vma_end_read(struct vm_area_struct *vma)
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index a03c6f1ceb9e..eb2ca37b32ee 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -85,7 +85,8 @@ static struct vm_area_struct *uffd_lock_vma(struct mm_struct *mm,
 	mmap_read_lock(mm);
 	vma = find_vma_and_prepare_anon(mm, address);
 	if (!IS_ERR(vma))
-		vma_start_read_locked(vma);
+		if (!vma_start_read_locked(vma))
+			vma = ERR_PTR(-EAGAIN);
 
 	mmap_read_unlock(mm);
 	return vma;
@@ -1482,10 +1483,17 @@ static int uffd_move_lock(struct mm_struct *mm,
 	mmap_read_lock(mm);
 	err = find_vmas_mm_locked(mm, dst_start, src_start, dst_vmap, src_vmap);
 	if (!err) {
-		vma_start_read_locked(*dst_vmap);
-		if (*dst_vmap != *src_vmap)
-			vma_start_read_locked_nested(*src_vmap,
-						SINGLE_DEPTH_NESTING);
+		if (vma_start_read_locked(*dst_vmap)) {
+			if (*dst_vmap != *src_vmap) {
+				if (!vma_start_read_locked_nested(*src_vmap,
+							SINGLE_DEPTH_NESTING)) {
+					vma_end_read(*dst_vmap);
+					err = -EAGAIN;
+				}
+			}
+		} else {
+			err = -EAGAIN;
+		}
 	}
 	mmap_read_unlock(mm);
 	return err;
-- 
2.47.1.613.gc27f4b7a9f-goog




* [PATCH v8 08/16] mm: move mmap_init_lock() out of the header file
  2025-01-09  2:30 [PATCH v8 00/16] move per-vma lock into vm_area_struct Suren Baghdasaryan
                   ` (6 preceding siblings ...)
  2025-01-09  2:30 ` [PATCH v8 07/16] mm: allow vma_start_read_locked/vma_start_read_locked_nested to fail Suren Baghdasaryan
@ 2025-01-09  2:30 ` Suren Baghdasaryan
  2025-01-09  2:30 ` [PATCH v8 09/16] mm: uninline the main body of vma_start_write() Suren Baghdasaryan
                   ` (10 subsequent siblings)
  18 siblings, 0 replies; 53+ messages in thread
From: Suren Baghdasaryan @ 2025-01-09  2:30 UTC (permalink / raw)
  To: akpm
  Cc: peterz, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
	hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
	jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
	richard.weiyang, corbet, linux-doc, linux-mm, linux-kernel,
	kernel-team, surenb

mmap_init_lock() is used only from mm_init() in fork.c, so it does not
have to reside in the header file. This move lets us avoid including
additional headers in mmap_lock.h later, when mmap_init_lock() needs to
initialize an rcuwait object.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/mmap_lock.h | 6 ------
 kernel/fork.c             | 6 ++++++
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
index 45a21faa3ff6..4706c6769902 100644
--- a/include/linux/mmap_lock.h
+++ b/include/linux/mmap_lock.h
@@ -122,12 +122,6 @@ static inline bool mmap_lock_speculate_retry(struct mm_struct *mm, unsigned int
 
 #endif /* CONFIG_PER_VMA_LOCK */
 
-static inline void mmap_init_lock(struct mm_struct *mm)
-{
-	init_rwsem(&mm->mmap_lock);
-	mm_lock_seqcount_init(mm);
-}
-
 static inline void mmap_write_lock(struct mm_struct *mm)
 {
 	__mmap_lock_trace_start_locking(mm, true);
diff --git a/kernel/fork.c b/kernel/fork.c
index f2f9e7b427ad..d4c75428ccaf 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1219,6 +1219,12 @@ static void mm_init_uprobes_state(struct mm_struct *mm)
 #endif
 }
 
+static inline void mmap_init_lock(struct mm_struct *mm)
+{
+	init_rwsem(&mm->mmap_lock);
+	mm_lock_seqcount_init(mm);
+}
+
 static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	struct user_namespace *user_ns)
 {
-- 
2.47.1.613.gc27f4b7a9f-goog




* [PATCH v8 09/16] mm: uninline the main body of vma_start_write()
  2025-01-09  2:30 [PATCH v8 00/16] move per-vma lock into vm_area_struct Suren Baghdasaryan
                   ` (7 preceding siblings ...)
  2025-01-09  2:30 ` [PATCH v8 08/16] mm: move mmap_init_lock() out of the header file Suren Baghdasaryan
@ 2025-01-09  2:30 ` Suren Baghdasaryan
  2025-01-09  2:30 ` [PATCH v8 10/16] refcount: introduce __refcount_{add|inc}_not_zero_limited Suren Baghdasaryan
                   ` (9 subsequent siblings)
  18 siblings, 0 replies; 53+ messages in thread
From: Suren Baghdasaryan @ 2025-01-09  2:30 UTC (permalink / raw)
  To: akpm
  Cc: peterz, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
	hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
	jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
	richard.weiyang, corbet, linux-doc, linux-mm, linux-kernel,
	kernel-team, surenb

vma_start_write() is used in many places and will grow in size very soon.
It is not used in performance critical paths and uninlining it should
limit the future code size growth.
No functional changes.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/mm.h | 12 +++---------
 mm/memory.c        | 14 ++++++++++++++
 2 files changed, 17 insertions(+), 9 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 6e6edfd4f3d9..bc8067de41c5 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -787,6 +787,8 @@ static bool __is_vma_write_locked(struct vm_area_struct *vma, unsigned int *mm_l
 	return (vma->vm_lock_seq == *mm_lock_seq);
 }
 
+void __vma_start_write(struct vm_area_struct *vma, unsigned int mm_lock_seq);
+
 /*
  * Begin writing to a VMA.
  * Exclude concurrent readers under the per-VMA lock until the currently
@@ -799,15 +801,7 @@ static inline void vma_start_write(struct vm_area_struct *vma)
 	if (__is_vma_write_locked(vma, &mm_lock_seq))
 		return;
 
-	down_write(&vma->vm_lock.lock);
-	/*
-	 * We should use WRITE_ONCE() here because we can have concurrent reads
-	 * from the early lockless pessimistic check in vma_start_read().
-	 * We don't really care about the correctness of that early check, but
-	 * we should use WRITE_ONCE() for cleanliness and to keep KCSAN happy.
-	 */
-	WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
-	up_write(&vma->vm_lock.lock);
+	__vma_start_write(vma, mm_lock_seq);
 }
 
 static inline void vma_assert_write_locked(struct vm_area_struct *vma)
diff --git a/mm/memory.c b/mm/memory.c
index 105b99064ce5..26569a44fb5c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6370,6 +6370,20 @@ struct vm_area_struct *lock_mm_and_find_vma(struct mm_struct *mm,
 #endif
 
 #ifdef CONFIG_PER_VMA_LOCK
+void __vma_start_write(struct vm_area_struct *vma, unsigned int mm_lock_seq)
+{
+	down_write(&vma->vm_lock.lock);
+	/*
+	 * We should use WRITE_ONCE() here because we can have concurrent reads
+	 * from the early lockless pessimistic check in vma_start_read().
+	 * We don't really care about the correctness of that early check, but
+	 * we should use WRITE_ONCE() for cleanliness and to keep KCSAN happy.
+	 */
+	WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
+	up_write(&vma->vm_lock.lock);
+}
+EXPORT_SYMBOL_GPL(__vma_start_write);
+
 /*
  * Lookup and lock a VMA under RCU protection. Returned VMA is guaranteed to be
  * stable and not isolated. If the VMA is not found or is being modified the
-- 
2.47.1.613.gc27f4b7a9f-goog




* [PATCH v8 10/16] refcount: introduce __refcount_{add|inc}_not_zero_limited
  2025-01-09  2:30 [PATCH v8 00/16] move per-vma lock into vm_area_struct Suren Baghdasaryan
                   ` (8 preceding siblings ...)
  2025-01-09  2:30 ` [PATCH v8 09/16] mm: uninline the main body of vma_start_write() Suren Baghdasaryan
@ 2025-01-09  2:30 ` Suren Baghdasaryan
  2025-01-09 14:42   ` Vlastimil Babka
  2025-01-09  2:30 ` [PATCH v8 11/16] mm: replace vm_lock and detached flag with a reference count Suren Baghdasaryan
                   ` (8 subsequent siblings)
  18 siblings, 1 reply; 53+ messages in thread
From: Suren Baghdasaryan @ 2025-01-09  2:30 UTC (permalink / raw)
  To: akpm
  Cc: peterz, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
	hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
	jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
	richard.weiyang, corbet, linux-doc, linux-mm, linux-kernel,
	kernel-team, surenb

Introduce functions that increase a refcount but only up to a given top
limit, above which they fail (the limit is inclusive). Setting the limit
to INT_MAX indicates no limit.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 include/linux/refcount.h | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/include/linux/refcount.h b/include/linux/refcount.h
index 35f039ecb272..4934247848cf 100644
--- a/include/linux/refcount.h
+++ b/include/linux/refcount.h
@@ -137,13 +137,19 @@ static inline unsigned int refcount_read(const refcount_t *r)
 }
 
 static inline __must_check __signed_wrap
-bool __refcount_add_not_zero(int i, refcount_t *r, int *oldp)
+bool __refcount_add_not_zero_limited(int i, refcount_t *r, int *oldp,
+				     int limit)
 {
 	int old = refcount_read(r);
 
 	do {
 		if (!old)
 			break;
+		if (i > limit - old) {
+			if (oldp)
+				*oldp = old;
+			return false;
+		}
 	} while (!atomic_try_cmpxchg_relaxed(&r->refs, &old, old + i));
 
 	if (oldp)
@@ -155,6 +161,12 @@ bool __refcount_add_not_zero(int i, refcount_t *r, int *oldp)
 	return old;
 }
 
+static inline __must_check __signed_wrap
+bool __refcount_add_not_zero(int i, refcount_t *r, int *oldp)
+{
+	return __refcount_add_not_zero_limited(i, r, oldp, INT_MAX);
+}
+
 /**
  * refcount_add_not_zero - add a value to a refcount unless it is 0
  * @i: the value to add to the refcount
@@ -213,6 +225,12 @@ static inline void refcount_add(int i, refcount_t *r)
 	__refcount_add(i, r, NULL);
 }
 
+static inline __must_check bool __refcount_inc_not_zero_limited(refcount_t *r,
+								int *oldp, int limit)
+{
+	return __refcount_add_not_zero_limited(1, r, oldp, limit);
+}
+
 static inline __must_check bool __refcount_inc_not_zero(refcount_t *r, int *oldp)
 {
 	return __refcount_add_not_zero(1, r, oldp);
-- 
2.47.1.613.gc27f4b7a9f-goog



^ permalink raw reply	[flat|nested] 53+ messages in thread

* [PATCH v8 11/16] mm: replace vm_lock and detached flag with a reference count
  2025-01-09  2:30 [PATCH v8 00/16] move per-vma lock into vm_area_struct Suren Baghdasaryan
                   ` (9 preceding siblings ...)
  2025-01-09  2:30 ` [PATCH v8 10/16] refcount: introduce __refcount_{add|inc}_not_zero_limited Suren Baghdasaryan
@ 2025-01-09  2:30 ` Suren Baghdasaryan
  2025-01-09 10:35   ` Hillf Danton
  2025-01-10 14:34   ` Vlastimil Babka
  2025-01-09  2:30 ` [PATCH v8 12/16] mm/debug: print vm_refcnt state when dumping the vma Suren Baghdasaryan
                   ` (7 subsequent siblings)
  18 siblings, 2 replies; 53+ messages in thread
From: Suren Baghdasaryan @ 2025-01-09  2:30 UTC (permalink / raw)
  To: akpm
  Cc: peterz, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
	hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
	jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
	richard.weiyang, corbet, linux-doc, linux-mm, linux-kernel,
	kernel-team, surenb

rw_semaphore is a sizable structure of 40 bytes and consumes
considerable space in each vm_area_struct. However, vma_lock has
two important properties which allow rw_semaphore to be replaced
with a simpler structure:
1. Readers never wait. They try to take the vma_lock and fall back to
mmap_lock if that fails.
2. Only one writer at a time will ever try to write-lock a vma_lock,
because writers first take mmap_lock in write mode.
Because of these properties, full rw_semaphore functionality is not
needed and we can replace rw_semaphore and the vma->detached flag with
a refcount (vm_refcnt).
When a vma is in the detached state, vm_refcnt is 0 and only a call to
vma_mark_attached() can take it out of this state. Note that unlike
before, we now enforce that both vma_mark_attached() and
vma_mark_detached() are done only after the vma has been write-locked.
vma_mark_attached() changes vm_refcnt to 1 to indicate that the vma has
been attached to the vma tree. When a reader takes a read lock, it
increments vm_refcnt, unless the top usable bit of vm_refcnt
(0x40000000) is set, indicating the presence of a writer. When a writer
takes a write lock, it sets the top usable bit to indicate its presence.
If there are readers, the writer will wait using the newly introduced
mm->vma_writer_wait. Since all writers take mmap_lock in write mode
first, there can be only one writer at a time. The last reader to
release the lock will signal the writer to wake up.
The refcount might overflow if there are many competing readers, in
which case read-locking will fail. Readers are expected to handle such
failures.
In summary:
1. all readers increment the vm_refcnt;
2. writer sets top usable (writer) bit of vm_refcnt;
3. readers cannot increment the vm_refcnt if the writer bit is set;
4. in the presence of readers, writer must wait for the vm_refcnt to drop
to 1 (ignoring the writer bit), indicating an attached vma with no readers;
5. vm_refcnt overflow is handled by the readers.
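
As a purely illustrative aid (not part of this patch), the encoding
summarized above can be restated as code; the helpers below are
hypothetical:

	/*
	 * Hypothetical helpers restating the vm_refcnt encoding:
	 *   0                   - detached, no readers
	 *   1                   - attached, no readers
	 *   1 + N               - attached, N readers
	 *   VMA_LOCK_OFFSET + y - writer bit set; y is the attach/reader
	 *                         count the writer waits to drain to 1 or 0
	 */
	static inline bool vm_refcnt_has_writer(int refcnt)
	{
		return refcnt & VMA_LOCK_OFFSET;	/* top usable bit */
	}

	/* Number of readers, assuming the vma is attached. */
	static inline int vm_refcnt_readers_attached(int refcnt)
	{
		/* strip the writer bit, drop the attach count */
		return (refcnt & ~VMA_LOCK_OFFSET) - 1;
	}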

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 include/linux/mm.h               | 98 ++++++++++++++++++++++----------
 include/linux/mm_types.h         | 22 ++++---
 kernel/fork.c                    | 13 ++---
 mm/init-mm.c                     |  1 +
 mm/memory.c                      | 77 +++++++++++++++++++++----
 tools/testing/vma/linux/atomic.h |  5 ++
 tools/testing/vma/vma_internal.h | 66 +++++++++++----------
 7 files changed, 193 insertions(+), 89 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index bc8067de41c5..ec7c064792ff 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -32,6 +32,7 @@
 #include <linux/memremap.h>
 #include <linux/slab.h>
 #include <linux/cacheinfo.h>
+#include <linux/rcuwait.h>
 
 struct mempolicy;
 struct anon_vma;
@@ -697,12 +698,41 @@ static inline void vma_numab_state_free(struct vm_area_struct *vma) {}
 #endif /* CONFIG_NUMA_BALANCING */
 
 #ifdef CONFIG_PER_VMA_LOCK
-static inline void vma_lock_init(struct vm_area_struct *vma)
+static inline void vma_lock_init(struct vm_area_struct *vma, bool reset_refcnt)
 {
-	init_rwsem(&vma->vm_lock.lock);
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+	static struct lock_class_key lockdep_key;
+
+	lockdep_init_map(&vma->vmlock_dep_map, "vm_lock", &lockdep_key, 0);
+#endif
+	if (reset_refcnt)
+		refcount_set(&vma->vm_refcnt, 0);
 	vma->vm_lock_seq = UINT_MAX;
 }
 
+static inline bool is_vma_writer_only(int refcnt)
+{
+	/*
+	 * With a writer and no readers, refcnt is VMA_LOCK_OFFSET if the vma
+	 * is detached and (VMA_LOCK_OFFSET + 1) if it is attached. Waiting on
+	 * a detached vma happens only in vma_mark_detached() and is a rare
+	 * case, therefore most of the time there will be no unnecessary wakeup.
+	 */
+	return refcnt & VMA_LOCK_OFFSET && refcnt <= VMA_LOCK_OFFSET + 1;
+}
+
+static inline void vma_refcount_put(struct vm_area_struct *vma)
+{
+	int oldcnt;
+
+	if (!__refcount_dec_and_test(&vma->vm_refcnt, &oldcnt)) {
+		rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
+
+		if (is_vma_writer_only(oldcnt - 1))
+			rcuwait_wake_up(&vma->vm_mm->vma_writer_wait);
+	}
+}
+
 /*
  * Try to read-lock a vma. The function is allowed to occasionally yield false
  * locked result to avoid performance overhead, in which case we fall back to
@@ -710,6 +740,8 @@ static inline void vma_lock_init(struct vm_area_struct *vma)
  */
 static inline bool vma_start_read(struct vm_area_struct *vma)
 {
+	int oldcnt;
+
 	/*
 	 * Check before locking. A race might cause false locked result.
 	 * We can use READ_ONCE() for the mm_lock_seq here, and don't need
@@ -720,13 +752,19 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
 	if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq.sequence))
 		return false;
 
-	if (unlikely(down_read_trylock(&vma->vm_lock.lock) == 0))
+	/*
+	 * If VMA_LOCK_OFFSET is set, __refcount_inc_not_zero_limited() will fail
+	 * because VMA_REF_LIMIT is less than VMA_LOCK_OFFSET.
+	 */
+	if (unlikely(!__refcount_inc_not_zero_limited(&vma->vm_refcnt, &oldcnt,
+						      VMA_REF_LIMIT)))
 		return false;
 
+	rwsem_acquire_read(&vma->vmlock_dep_map, 0, 1, _RET_IP_);
 	/*
-	 * Overflow might produce false locked result.
+	 * Overflow of vm_lock_seq/mm_lock_seq might produce false locked result.
 	 * False unlocked result is impossible because we modify and check
-	 * vma->vm_lock_seq under vma->vm_lock protection and mm->mm_lock_seq
+	 * vma->vm_lock_seq under vma->vm_refcnt protection and mm->mm_lock_seq
 	 * modification invalidates all existing locks.
 	 *
 	 * We must use ACQUIRE semantics for the mm_lock_seq so that if we are
@@ -735,9 +773,10 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
 	 * This pairs with RELEASE semantics in vma_end_write_all().
 	 */
 	if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&vma->vm_mm->mm_lock_seq))) {
-		up_read(&vma->vm_lock.lock);
+		vma_refcount_put(vma);
 		return false;
 	}
+
 	return true;
 }
 
@@ -749,8 +788,14 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
  */
 static inline bool vma_start_read_locked_nested(struct vm_area_struct *vma, int subclass)
 {
+	int oldcnt;
+
 	mmap_assert_locked(vma->vm_mm);
-	down_read_nested(&vma->vm_lock.lock, subclass);
+	if (unlikely(!__refcount_inc_not_zero_limited(&vma->vm_refcnt, &oldcnt,
+						      VMA_REF_LIMIT)))
+		return false;
+
+	rwsem_acquire_read(&vma->vmlock_dep_map, 0, 1, _RET_IP_);
 	return true;
 }
 
@@ -762,15 +807,13 @@ static inline bool vma_start_read_locked_nested(struct vm_area_struct *vma, int
  */
 static inline bool vma_start_read_locked(struct vm_area_struct *vma)
 {
-	mmap_assert_locked(vma->vm_mm);
-	down_read(&vma->vm_lock.lock);
-	return true;
+	return vma_start_read_locked_nested(vma, 0);
 }
 
 static inline void vma_end_read(struct vm_area_struct *vma)
 {
 	rcu_read_lock(); /* keeps vma alive till the end of up_read */
-	up_read(&vma->vm_lock.lock);
+	vma_refcount_put(vma);
 	rcu_read_unlock();
 }
 
@@ -813,36 +856,33 @@ static inline void vma_assert_write_locked(struct vm_area_struct *vma)
 
 static inline void vma_assert_locked(struct vm_area_struct *vma)
 {
-	if (!rwsem_is_locked(&vma->vm_lock.lock))
+	if (refcount_read(&vma->vm_refcnt) <= 1)
 		vma_assert_write_locked(vma);
 }
 
+/*
+ * WARNING: to avoid racing with vma_mark_attached()/vma_mark_detached(), these
+ * assertions should be made either under mmap_write_lock or when the object
+ * has been isolated under mmap_write_lock, ensuring no competing writers.
+ */
 static inline void vma_assert_attached(struct vm_area_struct *vma)
 {
-	VM_BUG_ON_VMA(vma->detached, vma);
+	VM_BUG_ON_VMA(!refcount_read(&vma->vm_refcnt), vma);
 }
 
 static inline void vma_assert_detached(struct vm_area_struct *vma)
 {
-	VM_BUG_ON_VMA(!vma->detached, vma);
+	VM_BUG_ON_VMA(refcount_read(&vma->vm_refcnt), vma);
 }
 
 static inline void vma_mark_attached(struct vm_area_struct *vma)
 {
-	vma->detached = false;
-}
-
-static inline void vma_mark_detached(struct vm_area_struct *vma)
-{
-	/* When detaching vma should be write-locked */
 	vma_assert_write_locked(vma);
-	vma->detached = true;
+	vma_assert_detached(vma);
+	refcount_set(&vma->vm_refcnt, 1);
 }
 
-static inline bool is_vma_detached(struct vm_area_struct *vma)
-{
-	return vma->detached;
-}
+void vma_mark_detached(struct vm_area_struct *vma);
 
 static inline void release_fault_lock(struct vm_fault *vmf)
 {
@@ -865,7 +905,7 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
 
 #else /* CONFIG_PER_VMA_LOCK */
 
-static inline void vma_lock_init(struct vm_area_struct *vma) {}
+static inline void vma_lock_init(struct vm_area_struct *vma, bool reset_refcnt) {}
 static inline bool vma_start_read(struct vm_area_struct *vma)
 		{ return false; }
 static inline void vma_end_read(struct vm_area_struct *vma) {}
@@ -908,12 +948,8 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
 	vma->vm_mm = mm;
 	vma->vm_ops = &vma_dummy_vm_ops;
 	INIT_LIST_HEAD(&vma->anon_vma_chain);
-#ifdef CONFIG_PER_VMA_LOCK
-	/* vma is not locked, can't use vma_mark_detached() */
-	vma->detached = true;
-#endif
 	vma_numab_state_init(vma);
-	vma_lock_init(vma);
+	vma_lock_init(vma, false);
 }
 
 /* Use when VMA is not part of the VMA tree and needs no locking */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 0ca63dee1902..2d83d79d1899 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -19,6 +19,7 @@
 #include <linux/workqueue.h>
 #include <linux/seqlock.h>
 #include <linux/percpu_counter.h>
+#include <linux/types.h>
 
 #include <asm/mmu.h>
 
@@ -637,9 +638,8 @@ static inline struct anon_vma_name *anon_vma_name_alloc(const char *name)
 }
 #endif
 
-struct vma_lock {
-	struct rw_semaphore lock;
-};
+#define VMA_LOCK_OFFSET	0x40000000
+#define VMA_REF_LIMIT	(VMA_LOCK_OFFSET - 1)
 
 struct vma_numab_state {
 	/*
@@ -717,19 +717,13 @@ struct vm_area_struct {
 	};
 
 #ifdef CONFIG_PER_VMA_LOCK
-	/*
-	 * Flag to indicate areas detached from the mm->mm_mt tree.
-	 * Unstable RCU readers are allowed to read this.
-	 */
-	bool detached;
-
 	/*
 	 * Can only be written (using WRITE_ONCE()) while holding both:
 	 *  - mmap_lock (in write mode)
-	 *  - vm_lock->lock (in write mode)
+	 *  - vm_refcnt bit at VMA_LOCK_OFFSET is set
 	 * Can be read reliably while holding one of:
 	 *  - mmap_lock (in read or write mode)
-	 *  - vm_lock->lock (in read or write mode)
+	 *  - vm_refcnt bit at VMA_LOCK_OFFSET is set or vm_refcnt > 1
 	 * Can be read unreliably (using READ_ONCE()) for pessimistic bailout
 	 * while holding nothing (except RCU to keep the VMA struct allocated).
 	 *
@@ -792,7 +786,10 @@ struct vm_area_struct {
 	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
 #ifdef CONFIG_PER_VMA_LOCK
 	/* Unstable RCU readers are allowed to read this. */
-	struct vma_lock vm_lock ____cacheline_aligned_in_smp;
+	refcount_t vm_refcnt ____cacheline_aligned_in_smp;
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+	struct lockdep_map vmlock_dep_map;
+#endif
 #endif
 } __randomize_layout;
 
@@ -927,6 +924,7 @@ struct mm_struct {
 					  * by mmlist_lock
 					  */
 #ifdef CONFIG_PER_VMA_LOCK
+		struct rcuwait vma_writer_wait;
 		/*
 		 * This field has lock-like semantics, meaning it is sometimes
 		 * accessed with ACQUIRE/RELEASE semantics.
diff --git a/kernel/fork.c b/kernel/fork.c
index d4c75428ccaf..9d9275783cf8 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -463,12 +463,8 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
 	 * will be reinitialized.
 	 */
 	data_race(memcpy(new, orig, sizeof(*new)));
-	vma_lock_init(new);
+	vma_lock_init(new, true);
 	INIT_LIST_HEAD(&new->anon_vma_chain);
-#ifdef CONFIG_PER_VMA_LOCK
-	/* vma is not locked, can't use vma_mark_detached() */
-	new->detached = true;
-#endif
 	vma_numab_state_init(new);
 	dup_anon_vma_name(orig, new);
 
@@ -477,6 +473,8 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
 
 void __vm_area_free(struct vm_area_struct *vma)
 {
+	/* The vma should be detached while being destroyed. */
+	vma_assert_detached(vma);
 	vma_numab_state_free(vma);
 	free_anon_vma_name(vma);
 	kmem_cache_free(vm_area_cachep, vma);
@@ -488,8 +486,6 @@ static void vm_area_free_rcu_cb(struct rcu_head *head)
 	struct vm_area_struct *vma = container_of(head, struct vm_area_struct,
 						  vm_rcu);
 
-	/* The vma should not be locked while being destroyed. */
-	VM_BUG_ON_VMA(rwsem_is_locked(&vma->vm_lock.lock), vma);
 	__vm_area_free(vma);
 }
 #endif
@@ -1223,6 +1219,9 @@ static inline void mmap_init_lock(struct mm_struct *mm)
 {
 	init_rwsem(&mm->mmap_lock);
 	mm_lock_seqcount_init(mm);
+#ifdef CONFIG_PER_VMA_LOCK
+	rcuwait_init(&mm->vma_writer_wait);
+#endif
 }
 
 static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
diff --git a/mm/init-mm.c b/mm/init-mm.c
index 6af3ad675930..4600e7605cab 100644
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -40,6 +40,7 @@ struct mm_struct init_mm = {
 	.arg_lock	=  __SPIN_LOCK_UNLOCKED(init_mm.arg_lock),
 	.mmlist		= LIST_HEAD_INIT(init_mm.mmlist),
 #ifdef CONFIG_PER_VMA_LOCK
+	.vma_writer_wait = __RCUWAIT_INITIALIZER(init_mm.vma_writer_wait),
 	.mm_lock_seq	= SEQCNT_ZERO(init_mm.mm_lock_seq),
 #endif
 	.user_ns	= &init_user_ns,
diff --git a/mm/memory.c b/mm/memory.c
index 26569a44fb5c..fe1b47c34052 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6370,9 +6370,41 @@ struct vm_area_struct *lock_mm_and_find_vma(struct mm_struct *mm,
 #endif
 
 #ifdef CONFIG_PER_VMA_LOCK
+static inline bool __vma_enter_locked(struct vm_area_struct *vma, unsigned int tgt_refcnt)
+{
+	/*
+	 * If vma is detached then only vma_mark_attached() can raise the
+	 * vm_refcnt. mmap_write_lock prevents racing with vma_mark_attached().
+	 */
+	if (!refcount_add_not_zero(VMA_LOCK_OFFSET, &vma->vm_refcnt))
+		return false;
+
+	rwsem_acquire(&vma->vmlock_dep_map, 0, 0, _RET_IP_);
+	rcuwait_wait_event(&vma->vm_mm->vma_writer_wait,
+		   refcount_read(&vma->vm_refcnt) == tgt_refcnt,
+		   TASK_UNINTERRUPTIBLE);
+	lock_acquired(&vma->vmlock_dep_map, _RET_IP_);
+
+	return true;
+}
+
+static inline void __vma_exit_locked(struct vm_area_struct *vma, bool *detached)
+{
+	*detached = refcount_sub_and_test(VMA_LOCK_OFFSET, &vma->vm_refcnt);
+	rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
+}
+
 void __vma_start_write(struct vm_area_struct *vma, unsigned int mm_lock_seq)
 {
-	down_write(&vma->vm_lock.lock);
+	bool locked;
+
+	/*
+	 * __vma_enter_locked() returns false immediately if the vma is not
+	 * attached, otherwise it waits until refcnt is (VMA_LOCK_OFFSET + 1)
+	 * indicating that vma is attached with no readers.
+	 */
+	locked = __vma_enter_locked(vma, VMA_LOCK_OFFSET + 1);
+
 	/*
 	 * We should use WRITE_ONCE() here because we can have concurrent reads
 	 * from the early lockless pessimistic check in vma_start_read().
@@ -6380,10 +6412,43 @@ void __vma_start_write(struct vm_area_struct *vma, unsigned int mm_lock_seq)
 	 * we should use WRITE_ONCE() for cleanliness and to keep KCSAN happy.
 	 */
 	WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
-	up_write(&vma->vm_lock.lock);
+
+	if (locked) {
+		bool detached;
+
+		__vma_exit_locked(vma, &detached);
+		VM_BUG_ON_VMA(detached, vma); /* vma should remain attached */
+	}
 }
 EXPORT_SYMBOL_GPL(__vma_start_write);
 
+void vma_mark_detached(struct vm_area_struct *vma)
+{
+	vma_assert_write_locked(vma);
+	vma_assert_attached(vma);
+
+	/*
+	 * We are the only writer, so no need to use vma_refcount_put().
+	 * The condition below is unlikely because the vma has been already
+	 * write-locked and readers can increment vm_refcnt only temporarily
+	 * before they check vm_lock_seq, realize the vma is locked and drop
+	 * back the vm_refcnt. That is a narrow window for observing a raised
+	 * vm_refcnt.
+	 */
+	if (unlikely(!refcount_dec_and_test(&vma->vm_refcnt))) {
+		/*
+		 * Wait until refcnt is VMA_LOCK_OFFSET => detached with no
+		 * readers.
+		 */
+		if (__vma_enter_locked(vma, VMA_LOCK_OFFSET)) {
+			bool detached;
+
+			__vma_exit_locked(vma, &detached);
+			VM_BUG_ON_VMA(!detached, vma);
+		}
+	}
+}
+
 /*
  * Lookup and lock a VMA under RCU protection. Returned VMA is guaranteed to be
  * stable and not isolated. If the VMA is not found or is being modified the
@@ -6396,7 +6461,6 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
 	struct vm_area_struct *vma;
 
 	rcu_read_lock();
-retry:
 	vma = mas_walk(&mas);
 	if (!vma)
 		goto inval;
@@ -6404,13 +6468,6 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
 	if (!vma_start_read(vma))
 		goto inval;
 
-	/* Check if the VMA got isolated after we found it */
-	if (is_vma_detached(vma)) {
-		vma_end_read(vma);
-		count_vm_vma_lock_event(VMA_LOCK_MISS);
-		/* The area was replaced with another one */
-		goto retry;
-	}
 	/*
 	 * At this point, we have a stable reference to a VMA: The VMA is
 	 * locked and we know it hasn't already been isolated.
diff --git a/tools/testing/vma/linux/atomic.h b/tools/testing/vma/linux/atomic.h
index 3e1b6adc027b..788c597c4fde 100644
--- a/tools/testing/vma/linux/atomic.h
+++ b/tools/testing/vma/linux/atomic.h
@@ -9,4 +9,9 @@
 #define atomic_set(x, y) uatomic_set(x, y)
 #define U8_MAX UCHAR_MAX
 
+#ifndef atomic_cmpxchg_relaxed
+#define  atomic_cmpxchg_relaxed		uatomic_cmpxchg
+#define  atomic_cmpxchg_release         uatomic_cmpxchg
+#endif /* atomic_cmpxchg_relaxed */
+
 #endif	/* _LINUX_ATOMIC_H */
diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
index 47c8b03ffbbd..2ce032943861 100644
--- a/tools/testing/vma/vma_internal.h
+++ b/tools/testing/vma/vma_internal.h
@@ -25,7 +25,7 @@
 #include <linux/maple_tree.h>
 #include <linux/mm.h>
 #include <linux/rbtree.h>
-#include <linux/rwsem.h>
+#include <linux/refcount.h>
 
 extern unsigned long stack_guard_gap;
 #ifdef CONFIG_MMU
@@ -134,10 +134,6 @@ typedef __bitwise unsigned int vm_fault_t;
  */
 #define pr_warn_once pr_err
 
-typedef struct refcount_struct {
-	atomic_t refs;
-} refcount_t;
-
 struct kref {
 	refcount_t refcount;
 };
@@ -232,15 +228,12 @@ struct mm_struct {
 	unsigned long flags; /* Must use atomic bitops to access */
 };
 
-struct vma_lock {
-	struct rw_semaphore lock;
-};
-
-
 struct file {
 	struct address_space	*f_mapping;
 };
 
+#define VMA_LOCK_OFFSET	0x40000000
+
 struct vm_area_struct {
 	/* The first cache line has the info for VMA tree walking. */
 
@@ -268,16 +261,13 @@ struct vm_area_struct {
 	};
 
 #ifdef CONFIG_PER_VMA_LOCK
-	/* Flag to indicate areas detached from the mm->mm_mt tree */
-	bool detached;
-
 	/*
 	 * Can only be written (using WRITE_ONCE()) while holding both:
 	 *  - mmap_lock (in write mode)
-	 *  - vm_lock.lock (in write mode)
+	 *  - vm_refcnt bit at VMA_LOCK_OFFSET is set
 	 * Can be read reliably while holding one of:
 	 *  - mmap_lock (in read or write mode)
-	 *  - vm_lock.lock (in read or write mode)
+	 *  - vm_refcnt bit at VMA_LOCK_OFFSET is set or vm_refcnt > 1
 	 * Can be read unreliably (using READ_ONCE()) for pessimistic bailout
 	 * while holding nothing (except RCU to keep the VMA struct allocated).
 	 *
@@ -286,7 +276,6 @@ struct vm_area_struct {
 	 * slowpath.
 	 */
 	unsigned int vm_lock_seq;
-	struct vma_lock vm_lock;
 #endif
 
 	/*
@@ -339,6 +328,10 @@ struct vm_area_struct {
 	struct vma_numab_state *numab_state;	/* NUMA Balancing state */
 #endif
 	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
+#ifdef CONFIG_PER_VMA_LOCK
+	/* Unstable RCU readers are allowed to read this. */
+	refcount_t vm_refcnt;
+#endif
 } __randomize_layout;
 
 struct vm_fault {};
@@ -463,23 +456,41 @@ static inline struct vm_area_struct *vma_next(struct vma_iterator *vmi)
 	return mas_find(&vmi->mas, ULONG_MAX);
 }
 
-static inline void vma_lock_init(struct vm_area_struct *vma)
+/*
+ * WARNING: to avoid racing with vma_mark_attached()/vma_mark_detached(), these
+ * assertions should be made either under mmap_write_lock or when the object
+ * has been isolated under mmap_write_lock, ensuring no competing writers.
+ */
+static inline void vma_assert_attached(struct vm_area_struct *vma)
 {
-	init_rwsem(&vma->vm_lock.lock);
-	vma->vm_lock_seq = UINT_MAX;
+	VM_BUG_ON_VMA(!refcount_read(&vma->vm_refcnt), vma);
 }
 
-static inline void vma_mark_attached(struct vm_area_struct *vma)
+static inline void vma_assert_detached(struct vm_area_struct *vma)
 {
-	vma->detached = false;
+	VM_BUG_ON_VMA(refcount_read(&vma->vm_refcnt), vma);
 }
 
 static inline void vma_assert_write_locked(struct vm_area_struct *);
+static inline void vma_mark_attached(struct vm_area_struct *vma)
+{
+	vma_assert_write_locked(vma);
+	vma_assert_detached(vma);
+	refcount_set(&vma->vm_refcnt, 1);
+}
+
 static inline void vma_mark_detached(struct vm_area_struct *vma)
 {
-	/* When detaching vma should be write-locked */
 	vma_assert_write_locked(vma);
-	vma->detached = true;
+	vma_assert_attached(vma);
+
+	/* We are the only writer, so no need to use vma_refcount_put(). */
+	if (unlikely(!refcount_dec_and_test(&vma->vm_refcnt))) {
+		/*
+		 * Reader must have temporarily raised vm_refcnt but it will
+		 * drop it without using the vma since vma is write-locked.
+		 */
+	}
 }
 
 extern const struct vm_operations_struct vma_dummy_vm_ops;
@@ -492,9 +503,7 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
 	vma->vm_mm = mm;
 	vma->vm_ops = &vma_dummy_vm_ops;
 	INIT_LIST_HEAD(&vma->anon_vma_chain);
-	/* vma is not locked, can't use vma_mark_detached() */
-	vma->detached = true;
-	vma_lock_init(vma);
+	vma->vm_lock_seq = UINT_MAX;
 }
 
 static inline struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
@@ -517,10 +526,9 @@ static inline struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
 		return NULL;
 
 	memcpy(new, orig, sizeof(*new));
-	vma_lock_init(new);
+	refcount_set(&new->vm_refcnt, 0);
+	new->vm_lock_seq = UINT_MAX;
 	INIT_LIST_HEAD(&new->anon_vma_chain);
-	/* vma is not locked, can't use vma_mark_detached() */
-	new->detached = true;
 
 	return new;
 }
-- 
2.47.1.613.gc27f4b7a9f-goog



^ permalink raw reply	[flat|nested] 53+ messages in thread

* [PATCH v8 12/16] mm/debug: print vm_refcnt state when dumping the vma
  2025-01-09  2:30 [PATCH v8 00/16] move per-vma lock into vm_area_struct Suren Baghdasaryan
                   ` (10 preceding siblings ...)
  2025-01-09  2:30 ` [PATCH v8 11/16] mm: replace vm_lock and detached flag with a reference count Suren Baghdasaryan
@ 2025-01-09  2:30 ` Suren Baghdasaryan
  2025-01-09  2:30 ` [PATCH v8 13/16] mm: remove extra vma_numab_state_init() call Suren Baghdasaryan
                   ` (6 subsequent siblings)
  18 siblings, 0 replies; 53+ messages in thread
From: Suren Baghdasaryan @ 2025-01-09  2:30 UTC (permalink / raw)
  To: akpm
  Cc: peterz, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
	hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
	jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
	richard.weiyang, corbet, linux-doc, linux-mm, linux-kernel,
	kernel-team, surenb

vm_refcnt encodes a number of useful states:
- whether the vma is attached or detached
- the number of current vma readers
- the presence of a vma writer
Let's include it in the vma dump.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 mm/debug.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/mm/debug.c b/mm/debug.c
index 8d2acf432385..325d7bf22038 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -178,6 +178,17 @@ EXPORT_SYMBOL(dump_page);
 
 void dump_vma(const struct vm_area_struct *vma)
 {
+#ifdef CONFIG_PER_VMA_LOCK
+	pr_emerg("vma %px start %px end %px mm %px\n"
+		"prot %lx anon_vma %px vm_ops %px\n"
+		"pgoff %lx file %px private_data %px\n"
+		"flags: %#lx(%pGv) refcnt %x\n",
+		vma, (void *)vma->vm_start, (void *)vma->vm_end, vma->vm_mm,
+		(unsigned long)pgprot_val(vma->vm_page_prot),
+		vma->anon_vma, vma->vm_ops, vma->vm_pgoff,
+		vma->vm_file, vma->vm_private_data,
+		vma->vm_flags, &vma->vm_flags, refcount_read(&vma->vm_refcnt));
+#else
 	pr_emerg("vma %px start %px end %px mm %px\n"
 		"prot %lx anon_vma %px vm_ops %px\n"
 		"pgoff %lx file %px private_data %px\n"
@@ -187,6 +198,7 @@ void dump_vma(const struct vm_area_struct *vma)
 		vma->anon_vma, vma->vm_ops, vma->vm_pgoff,
 		vma->vm_file, vma->vm_private_data,
 		vma->vm_flags, &vma->vm_flags);
+#endif
 }
 EXPORT_SYMBOL(dump_vma);
 
-- 
2.47.1.613.gc27f4b7a9f-goog



^ permalink raw reply	[flat|nested] 53+ messages in thread

* [PATCH v8 13/16] mm: remove extra vma_numab_state_init() call
  2025-01-09  2:30 [PATCH v8 00/16] move per-vma lock into vm_area_struct Suren Baghdasaryan
                   ` (11 preceding siblings ...)
  2025-01-09  2:30 ` [PATCH v8 12/16] mm/debug: print vm_refcnt state when dumping the vma Suren Baghdasaryan
@ 2025-01-09  2:30 ` Suren Baghdasaryan
  2025-01-09  2:30 ` [PATCH v8 14/16] mm: prepare lock_vma_under_rcu() for vma reuse possibility Suren Baghdasaryan
                   ` (5 subsequent siblings)
  18 siblings, 0 replies; 53+ messages in thread
From: Suren Baghdasaryan @ 2025-01-09  2:30 UTC (permalink / raw)
  To: akpm
  Cc: peterz, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
	hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
	jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
	richard.weiyang, corbet, linux-doc, linux-mm, linux-kernel,
	kernel-team, surenb

vma_init() already memsets the whole vm_area_struct to 0, so there is
no need for an additional vma_numab_state_init() call.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/mm.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ec7c064792ff..aca65cc0a26e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -948,7 +948,6 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
 	vma->vm_mm = mm;
 	vma->vm_ops = &vma_dummy_vm_ops;
 	INIT_LIST_HEAD(&vma->anon_vma_chain);
-	vma_numab_state_init(vma);
 	vma_lock_init(vma, false);
 }
 
-- 
2.47.1.613.gc27f4b7a9f-goog



^ permalink raw reply	[flat|nested] 53+ messages in thread

* [PATCH v8 14/16] mm: prepare lock_vma_under_rcu() for vma reuse possibility
  2025-01-09  2:30 [PATCH v8 00/16] move per-vma lock into vm_area_struct Suren Baghdasaryan
                   ` (12 preceding siblings ...)
  2025-01-09  2:30 ` [PATCH v8 13/16] mm: remove extra vma_numab_state_init() call Suren Baghdasaryan
@ 2025-01-09  2:30 ` Suren Baghdasaryan
  2025-01-09  2:30 ` [PATCH v8 15/16] mm: make vma cache SLAB_TYPESAFE_BY_RCU Suren Baghdasaryan
                   ` (4 subsequent siblings)
  18 siblings, 0 replies; 53+ messages in thread
From: Suren Baghdasaryan @ 2025-01-09  2:30 UTC (permalink / raw)
  To: akpm
  Cc: peterz, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
	hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
	jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
	richard.weiyang, corbet, linux-doc, linux-mm, linux-kernel,
	kernel-team, surenb

Once we make the vma cache SLAB_TYPESAFE_BY_RCU, it will be possible for
a vma to be reused and attached to another mm after lock_vma_under_rcu()
locks the vma. lock_vma_under_rcu() should ensure that vma_start_read()
is using the original mm and, after locking the vma, it should verify
that vma->vm_mm has not changed from under us.
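
A condensed sketch of the resulting lookup flow (mirroring the diff
below; it assumes the code runs under rcu_read_lock() and omits the
error paths):

	vma = mas_walk(&mas);
	if (vma && vma_start_read(mm, vma)) {	/* pass the original mm */
		/* The vma may have been reused; check it is still the right one. */
		if (vma->vm_mm != mm ||
		    address < vma->vm_start || address >= vma->vm_end)
			vma_end_read(vma);	/* wrong vma, fall back to mmap_lock */
	}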

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/mm.h | 10 ++++++----
 mm/memory.c        |  7 ++++---
 2 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index aca65cc0a26e..1d6b1563b956 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -737,8 +737,10 @@ static inline void vma_refcount_put(struct vm_area_struct *vma)
  * Try to read-lock a vma. The function is allowed to occasionally yield false
  * locked result to avoid performance overhead, in which case we fall back to
  * using mmap_lock. The function should never yield false unlocked result.
+ * False locked result is possible if mm_lock_seq overflows or if vma gets
+ * reused and attached to a different mm before we lock it.
  */
-static inline bool vma_start_read(struct vm_area_struct *vma)
+static inline bool vma_start_read(struct mm_struct *mm, struct vm_area_struct *vma)
 {
 	int oldcnt;
 
@@ -749,7 +751,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
 	 * we don't rely on for anything - the mm_lock_seq read against which we
 	 * need ordering is below.
 	 */
-	if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq.sequence))
+	if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(mm->mm_lock_seq.sequence))
 		return false;
 
 	/*
@@ -772,7 +774,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
 	 * after it has been unlocked.
 	 * This pairs with RELEASE semantics in vma_end_write_all().
 	 */
-	if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&vma->vm_mm->mm_lock_seq))) {
+	if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&mm->mm_lock_seq))) {
 		vma_refcount_put(vma);
 		return false;
 	}
@@ -906,7 +908,7 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
 #else /* CONFIG_PER_VMA_LOCK */
 
 static inline void vma_lock_init(struct vm_area_struct *vma, bool reset_refcnt) {}
-static inline bool vma_start_read(struct vm_area_struct *vma)
+static inline bool vma_start_read(struct mm_struct *mm, struct vm_area_struct *vma)
 		{ return false; }
 static inline void vma_end_read(struct vm_area_struct *vma) {}
 static inline void vma_start_write(struct vm_area_struct *vma) {}
diff --git a/mm/memory.c b/mm/memory.c
index fe1b47c34052..a8e7e794178e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6465,7 +6465,7 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
 	if (!vma)
 		goto inval;
 
-	if (!vma_start_read(vma))
+	if (!vma_start_read(mm, vma))
 		goto inval;
 
 	/*
@@ -6475,8 +6475,9 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
 	 * fields are accessible for RCU readers.
 	 */
 
-	/* Check since vm_start/vm_end might change before we lock the VMA */
-	if (unlikely(address < vma->vm_start || address >= vma->vm_end))
+	/* Check if the vma we locked is the right one. */
+	if (unlikely(vma->vm_mm != mm ||
+		     address < vma->vm_start || address >= vma->vm_end))
 		goto inval_end_read;
 
 	rcu_read_unlock();
-- 
2.47.1.613.gc27f4b7a9f-goog



^ permalink raw reply	[flat|nested] 53+ messages in thread

* [PATCH v8 15/16] mm: make vma cache SLAB_TYPESAFE_BY_RCU
  2025-01-09  2:30 [PATCH v8 00/16] move per-vma lock into vm_area_struct Suren Baghdasaryan
                   ` (13 preceding siblings ...)
  2025-01-09  2:30 ` [PATCH v8 14/16] mm: prepare lock_vma_under_rcu() for vma reuse possibility Suren Baghdasaryan
@ 2025-01-09  2:30 ` Suren Baghdasaryan
  2025-01-10 15:32   ` Vlastimil Babka
  2025-01-10 17:47   ` Liam R. Howlett
  2025-01-09  2:30 ` [PATCH v8 16/16] docs/mm: document latest changes to vm_lock Suren Baghdasaryan
                   ` (3 subsequent siblings)
  18 siblings, 2 replies; 53+ messages in thread
From: Suren Baghdasaryan @ 2025-01-09  2:30 UTC (permalink / raw)
  To: akpm
  Cc: peterz, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
	hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
	jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
	richard.weiyang, corbet, linux-doc, linux-mm, linux-kernel,
	kernel-team, surenb

To enable SLAB_TYPESAFE_BY_RCU for the vma cache we need to ensure that
object reuse before the RCU grace period is over will be detected by
lock_vma_under_rcu().
Current checks are sufficient as long as the vma is detached before it
is freed. The only place this is not currently happening is in
exit_mmap(). Add the missing vma_mark_detached() in exit_mmap().
Another issue which might trick lock_vma_under_rcu() during vma reuse
is vm_area_dup(), which copies the entire content of the vma into a new
one, overwriting the new vma's vm_refcnt and temporarily making it
appear attached. This might trick a racing lock_vma_under_rcu() into
operating on a reused vma if it found the vma before it got reused. To
prevent this situation, we should ensure that vm_refcnt stays in the
detached state (0) when the vma is copied and advances to the attached
state only after the vma is added into the vma tree. Introduce
vma_copy(), which preserves the new vma's vm_refcnt, and use it in
vm_area_dup(). Since all vmas are in the detached state with no current
readers when they are freed, lock_vma_under_rcu() will not be able to
take vm_refcnt after the vma got detached even if the vma is reused.
Finally, make vm_area_cachep SLAB_TYPESAFE_BY_RCU. This will facilitate
vm_area_struct reuse and will minimize the number of call_rcu() calls.
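
Illustrative timeline of the reuse race being closed here (a sketch, not
part of the patch; function names match the series):

	/*
	 * reader:    rcu_read_lock(); vma = mas_walk(&mas);
	 * other CPU: vma_mark_detached(vma); vm_area_free(vma);
	 *            object reused by vm_area_dup() -> vma_copy() leaves
	 *            vm_refcnt == 0 (detached)
	 * reader:    vma_start_read(mm, vma) fails because a zero refcount
	 *            cannot be incremented, so the reader falls back to
	 *            mmap_lock instead of using the reused vma
	 */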

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 include/linux/mm.h               |  2 -
 include/linux/mm_types.h         | 10 +++--
 include/linux/slab.h             |  6 ---
 kernel/fork.c                    | 72 ++++++++++++++++++++------------
 mm/mmap.c                        |  3 +-
 mm/vma.c                         | 11 ++---
 mm/vma.h                         |  2 +-
 tools/testing/vma/vma_internal.h |  7 +---
 8 files changed, 59 insertions(+), 54 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1d6b1563b956..a674558e4c05 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -258,8 +258,6 @@ void setup_initial_init_mm(void *start_code, void *end_code,
 struct vm_area_struct *vm_area_alloc(struct mm_struct *);
 struct vm_area_struct *vm_area_dup(struct vm_area_struct *);
 void vm_area_free(struct vm_area_struct *);
-/* Use only if VMA has no other users */
-void __vm_area_free(struct vm_area_struct *vma);
 
 #ifndef CONFIG_MMU
 extern struct rb_root nommu_region_tree;
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 2d83d79d1899..93bfcd0c1fde 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -582,6 +582,12 @@ static inline void *folio_get_private(struct folio *folio)
 
 typedef unsigned long vm_flags_t;
 
+/*
+ * freeptr_t represents a SLUB freelist pointer, which might be encoded
+ * and not dereferenceable if CONFIG_SLAB_FREELIST_HARDENED is enabled.
+ */
+typedef struct { unsigned long v; } freeptr_t;
+
 /*
  * A region containing a mapping of a non-memory backed file under NOMMU
  * conditions.  These are held in a global tree and are pinned by the VMAs that
@@ -695,9 +701,7 @@ struct vm_area_struct {
 			unsigned long vm_start;
 			unsigned long vm_end;
 		};
-#ifdef CONFIG_PER_VMA_LOCK
-		struct rcu_head vm_rcu;	/* Used for deferred freeing. */
-#endif
+		freeptr_t vm_freeptr; /* Pointer used by SLAB_TYPESAFE_BY_RCU */
 	};
 
 	/*
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 10a971c2bde3..681b685b6c4e 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -234,12 +234,6 @@ enum _slab_flag_bits {
 #define SLAB_NO_OBJ_EXT		__SLAB_FLAG_UNUSED
 #endif
 
-/*
- * freeptr_t represents a SLUB freelist pointer, which might be encoded
- * and not dereferenceable if CONFIG_SLAB_FREELIST_HARDENED is enabled.
- */
-typedef struct { unsigned long v; } freeptr_t;
-
 /*
  * ZERO_SIZE_PTR will be returned for zero sized kmalloc requests.
  *
diff --git a/kernel/fork.c b/kernel/fork.c
index 9d9275783cf8..770b973a099c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -449,6 +449,41 @@ struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
 	return vma;
 }
 
+static void vma_copy(const struct vm_area_struct *src, struct vm_area_struct *dest)
+{
+	dest->vm_mm = src->vm_mm;
+	dest->vm_ops = src->vm_ops;
+	dest->vm_start = src->vm_start;
+	dest->vm_end = src->vm_end;
+	dest->anon_vma = src->anon_vma;
+	dest->vm_pgoff = src->vm_pgoff;
+	dest->vm_file = src->vm_file;
+	dest->vm_private_data = src->vm_private_data;
+	vm_flags_init(dest, src->vm_flags);
+	memcpy(&dest->vm_page_prot, &src->vm_page_prot,
+	       sizeof(dest->vm_page_prot));
+	/*
+	 * src->shared.rb may be modified concurrently, but the clone
+	 * will be reinitialized.
+	 */
+	data_race(memcpy(&dest->shared, &src->shared, sizeof(dest->shared)));
+	memcpy(&dest->vm_userfaultfd_ctx, &src->vm_userfaultfd_ctx,
+	       sizeof(dest->vm_userfaultfd_ctx));
+#ifdef CONFIG_ANON_VMA_NAME
+	dest->anon_name = src->anon_name;
+#endif
+#ifdef CONFIG_SWAP
+	memcpy(&dest->swap_readahead_info, &src->swap_readahead_info,
+	       sizeof(dest->swap_readahead_info));
+#endif
+#ifndef CONFIG_MMU
+	dest->vm_region = src->vm_region;
+#endif
+#ifdef CONFIG_NUMA
+	dest->vm_policy = src->vm_policy;
+#endif
+}
+
 struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
 {
 	struct vm_area_struct *new = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
@@ -458,11 +493,7 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
 
 	ASSERT_EXCLUSIVE_WRITER(orig->vm_flags);
 	ASSERT_EXCLUSIVE_WRITER(orig->vm_file);
-	/*
-	 * orig->shared.rb may be modified concurrently, but the clone
-	 * will be reinitialized.
-	 */
-	data_race(memcpy(new, orig, sizeof(*new)));
+	vma_copy(orig, new);
 	vma_lock_init(new, true);
 	INIT_LIST_HEAD(&new->anon_vma_chain);
 	vma_numab_state_init(new);
@@ -471,7 +502,7 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
 	return new;
 }
 
-void __vm_area_free(struct vm_area_struct *vma)
+void vm_area_free(struct vm_area_struct *vma)
 {
 	/* The vma should be detached while being destroyed. */
 	vma_assert_detached(vma);
@@ -480,25 +511,6 @@ void __vm_area_free(struct vm_area_struct *vma)
 	kmem_cache_free(vm_area_cachep, vma);
 }
 
-#ifdef CONFIG_PER_VMA_LOCK
-static void vm_area_free_rcu_cb(struct rcu_head *head)
-{
-	struct vm_area_struct *vma = container_of(head, struct vm_area_struct,
-						  vm_rcu);
-
-	__vm_area_free(vma);
-}
-#endif
-
-void vm_area_free(struct vm_area_struct *vma)
-{
-#ifdef CONFIG_PER_VMA_LOCK
-	call_rcu(&vma->vm_rcu, vm_area_free_rcu_cb);
-#else
-	__vm_area_free(vma);
-#endif
-}
-
 static void account_kernel_stack(struct task_struct *tsk, int account)
 {
 	if (IS_ENABLED(CONFIG_VMAP_STACK)) {
@@ -3144,6 +3156,11 @@ void __init mm_cache_init(void)
 
 void __init proc_caches_init(void)
 {
+	struct kmem_cache_args args = {
+		.use_freeptr_offset = true,
+		.freeptr_offset = offsetof(struct vm_area_struct, vm_freeptr),
+	};
+
 	sighand_cachep = kmem_cache_create("sighand_cache",
 			sizeof(struct sighand_struct), 0,
 			SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_TYPESAFE_BY_RCU|
@@ -3160,8 +3177,9 @@ void __init proc_caches_init(void)
 			sizeof(struct fs_struct), 0,
 			SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT,
 			NULL);
-	vm_area_cachep = KMEM_CACHE(vm_area_struct,
-			SLAB_HWCACHE_ALIGN|SLAB_NO_MERGE|SLAB_PANIC|
+	vm_area_cachep = kmem_cache_create("vm_area_struct",
+			sizeof(struct vm_area_struct), &args,
+			SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_TYPESAFE_BY_RCU|
 			SLAB_ACCOUNT);
 	mmap_init();
 	nsproxy_cache_init();
diff --git a/mm/mmap.c b/mm/mmap.c
index cda01071c7b1..7aa36216ecc0 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1305,7 +1305,8 @@ void exit_mmap(struct mm_struct *mm)
 	do {
 		if (vma->vm_flags & VM_ACCOUNT)
 			nr_accounted += vma_pages(vma);
-		remove_vma(vma, /* unreachable = */ true);
+		vma_mark_detached(vma);
+		remove_vma(vma);
 		count++;
 		cond_resched();
 		vma = vma_next(&vmi);
diff --git a/mm/vma.c b/mm/vma.c
index 93ff42ac2002..0a5158d611e3 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -406,19 +406,14 @@ static bool can_vma_merge_right(struct vma_merge_struct *vmg,
 /*
  * Close a vm structure and free it.
  */
-void remove_vma(struct vm_area_struct *vma, bool unreachable)
+void remove_vma(struct vm_area_struct *vma)
 {
 	might_sleep();
 	vma_close(vma);
 	if (vma->vm_file)
 		fput(vma->vm_file);
 	mpol_put(vma_policy(vma));
-	if (unreachable) {
-		vma_mark_detached(vma);
-		__vm_area_free(vma);
-	} else {
-		vm_area_free(vma);
-	}
+	vm_area_free(vma);
 }
 
 /*
@@ -1201,7 +1196,7 @@ static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
 	/* Remove and clean up vmas */
 	mas_set(mas_detach, 0);
 	mas_for_each(mas_detach, vma, ULONG_MAX)
-		remove_vma(vma, /* unreachable = */ false);
+		remove_vma(vma);
 
 	vm_unacct_memory(vms->nr_accounted);
 	validate_mm(mm);
diff --git a/mm/vma.h b/mm/vma.h
index 63dd38d5230c..f51005b95b39 100644
--- a/mm/vma.h
+++ b/mm/vma.h
@@ -170,7 +170,7 @@ int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm,
 		  unsigned long start, size_t len, struct list_head *uf,
 		  bool unlock);
 
-void remove_vma(struct vm_area_struct *vma, bool unreachable);
+void remove_vma(struct vm_area_struct *vma);
 
 void unmap_region(struct ma_state *mas, struct vm_area_struct *vma,
 		struct vm_area_struct *prev, struct vm_area_struct *next);
diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
index 2ce032943861..49a85ce0d45a 100644
--- a/tools/testing/vma/vma_internal.h
+++ b/tools/testing/vma/vma_internal.h
@@ -697,14 +697,9 @@ static inline void mpol_put(struct mempolicy *)
 {
 }
 
-static inline void __vm_area_free(struct vm_area_struct *vma)
-{
-	free(vma);
-}
-
 static inline void vm_area_free(struct vm_area_struct *vma)
 {
-	__vm_area_free(vma);
+	free(vma);
 }
 
 static inline void lru_add_drain(void)
-- 
2.47.1.613.gc27f4b7a9f-goog



^ permalink raw reply	[flat|nested] 53+ messages in thread

* [PATCH v8 16/16] docs/mm: document latest changes to vm_lock
  2025-01-09  2:30 [PATCH v8 00/16] move per-vma lock into vm_area_struct Suren Baghdasaryan
                   ` (14 preceding siblings ...)
  2025-01-09  2:30 ` [PATCH v8 15/16] mm: make vma cache SLAB_TYPESAFE_BY_RCU Suren Baghdasaryan
@ 2025-01-09  2:30 ` Suren Baghdasaryan
  2025-01-09  2:32 ` [PATCH v8 00/16] move per-vma lock into vm_area_struct Suren Baghdasaryan
                   ` (2 subsequent siblings)
  18 siblings, 0 replies; 53+ messages in thread
From: Suren Baghdasaryan @ 2025-01-09  2:30 UTC (permalink / raw)
  To: akpm
  Cc: peterz, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
	hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
	jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
	richard.weiyang, corbet, linux-doc, linux-mm, linux-kernel,
	kernel-team, surenb, Liam R. Howlett

Change the documentation to reflect that vm_lock has been integrated
into the vma and replaced with vm_refcnt.
Document the newly introduced vma_start_read_locked{_nested} functions.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
---
 Documentation/mm/process_addrs.rst | 44 ++++++++++++++++++------------
 1 file changed, 26 insertions(+), 18 deletions(-)

diff --git a/Documentation/mm/process_addrs.rst b/Documentation/mm/process_addrs.rst
index 81417fa2ed20..f573de936b5d 100644
--- a/Documentation/mm/process_addrs.rst
+++ b/Documentation/mm/process_addrs.rst
@@ -716,9 +716,14 @@ calls :c:func:`!rcu_read_lock` to ensure that the VMA is looked up in an RCU
 critical section, then attempts to VMA lock it via :c:func:`!vma_start_read`,
 before releasing the RCU lock via :c:func:`!rcu_read_unlock`.
 
-VMA read locks hold the read lock on the :c:member:`!vma->vm_lock` semaphore for
-their duration and the caller of :c:func:`!lock_vma_under_rcu` must release it
-via :c:func:`!vma_end_read`.
+In cases when the user already holds mmap read lock, :c:func:`!vma_start_read_locked`
+and :c:func:`!vma_start_read_locked_nested` can be used. These functions do not
+fail due to lock contention but the caller should still check their return values
+in case they fail for other reasons.
+
+VMA read locks increment :c:member:`!vma.vm_refcnt` reference counter for their
+duration and the caller of :c:func:`!lock_vma_under_rcu` must drop it via
+:c:func:`!vma_end_read`.
 
 VMA **write** locks are acquired via :c:func:`!vma_start_write` in instances where a
 VMA is about to be modified, unlike :c:func:`!vma_start_read` the lock is always
@@ -726,9 +731,9 @@ acquired. An mmap write lock **must** be held for the duration of the VMA write
 lock, releasing or downgrading the mmap write lock also releases the VMA write
 lock so there is no :c:func:`!vma_end_write` function.
 
-Note that a semaphore write lock is not held across a VMA lock. Rather, a
-sequence number is used for serialisation, and the write semaphore is only
-acquired at the point of write lock to update this.
+Note that when write-locking a VMA lock, the :c:member:`!vma.vm_refcnt` is temporarily
+modified so that readers can detect the presence of a writer. The reference counter is
+restored once the vma sequence number used for serialisation is updated.
 
 This ensures the semantics we require - VMA write locks provide exclusive write
 access to the VMA.
@@ -738,7 +743,7 @@ Implementation details
 
 The VMA lock mechanism is designed to be a lightweight means of avoiding the use
 of the heavily contended mmap lock. It is implemented using a combination of a
-read/write semaphore and sequence numbers belonging to the containing
+reference counter and sequence numbers belonging to the containing
 :c:struct:`!struct mm_struct` and the VMA.
 
 Read locks are acquired via :c:func:`!vma_start_read`, which is an optimistic
@@ -779,28 +784,31 @@ release of any VMA locks on its release makes sense, as you would never want to
 keep VMAs locked across entirely separate write operations. It also maintains
 correct lock ordering.
 
-Each time a VMA read lock is acquired, we acquire a read lock on the
-:c:member:`!vma->vm_lock` read/write semaphore and hold it, while checking that
-the sequence count of the VMA does not match that of the mm.
+Each time a VMA read lock is acquired, we increment :c:member:`!vma.vm_refcnt`
+reference counter and check that the sequence count of the VMA does not match
+that of the mm.
 
-If it does, the read lock fails. If it does not, we hold the lock, excluding
-writers, but permitting other readers, who will also obtain this lock under RCU.
+If it does, the read lock fails and :c:member:`!vma.vm_refcnt` is dropped.
+If it does not, we keep the reference counter raised, excluding writers, but
+permitting other readers, who can also obtain this lock under RCU.
 
 Importantly, maple tree operations performed in :c:func:`!lock_vma_under_rcu`
 are also RCU safe, so the whole read lock operation is guaranteed to function
 correctly.
 
-On the write side, we acquire a write lock on the :c:member:`!vma->vm_lock`
-read/write semaphore, before setting the VMA's sequence number under this lock,
-also simultaneously holding the mmap write lock.
+On the write side, we set a bit in :c:member:`!vma.vm_refcnt` which can't be
+modified by readers and wait for all readers to drop their reference count.
+Once there are no readers, VMA's sequence number is set to match that of the
+mm. During this entire operation mmap write lock is held.
 
 This way, if any read locks are in effect, :c:func:`!vma_start_write` will sleep
 until these are finished and mutual exclusion is achieved.
 
-After setting the VMA's sequence number, the lock is released, avoiding
-complexity with a long-term held write lock.
+After setting the VMA's sequence number, the bit in :c:member:`!vma.vm_refcnt`
+indicating a writer is cleared. From this point on, VMA's sequence number will
+indicate VMA's write-locked state until mmap write lock is dropped or downgraded.
 
-This clever combination of a read/write semaphore and sequence count allows for
+This clever combination of a reference counter and sequence count allows for
 fast RCU-based per-VMA lock acquisition (especially on page fault, though
 utilised elsewhere) with minimal complexity around lock ordering.
 
-- 
2.47.1.613.gc27f4b7a9f-goog



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v8 00/16] move per-vma lock into vm_area_struct
  2025-01-09  2:30 [PATCH v8 00/16] move per-vma lock into vm_area_struct Suren Baghdasaryan
                   ` (15 preceding siblings ...)
  2025-01-09  2:30 ` [PATCH v8 16/16] docs/mm: document latest changes to vm_lock Suren Baghdasaryan
@ 2025-01-09  2:32 ` Suren Baghdasaryan
  2025-01-09 11:51 ` Peter Zijlstra
  2025-01-09 13:41 ` Vlastimil Babka
  18 siblings, 0 replies; 53+ messages in thread
From: Suren Baghdasaryan @ 2025-01-09  2:32 UTC (permalink / raw)
  To: akpm
  Cc: peterz, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
	hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
	jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
	richard.weiyang, corbet, linux-doc, linux-mm, linux-kernel,
	kernel-team

On Wed, Jan 8, 2025 at 6:30 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> Back when per-vma locks were introduces, vm_lock was moved out of
> vm_area_struct in [1] because of the performance regression caused by
> false cacheline sharing. Recent investigation [2] revealed that the
> regressions is limited to a rather old Broadwell microarchitecture and
> even there it can be mitigated by disabling adjacent cacheline
> prefetching, see [3].
> Splitting single logical structure into multiple ones leads to more
> complicated management, extra pointer dereferences and overall less
> maintainable code. When that split-away part is a lock, it complicates
> things even further. With no performance benefits, there are no reasons
> for this split. Merging the vm_lock back into vm_area_struct also allows
> vm_area_struct to use SLAB_TYPESAFE_BY_RCU later in this patchset.
> This patchset:
> 1. moves vm_lock back into vm_area_struct, aligning it at the cacheline
> boundary and changing the cache to be cacheline-aligned to minimize
> cacheline sharing;
> 2. changes vm_area_struct initialization to mark new vma as detached until
> it is inserted into vma tree;
> 3. replaces vm_lock and vma->detached flag with a reference counter;
> 4. changes vm_area_struct cache to SLAB_TYPESAFE_BY_RCU to allow for their
> reuse and to minimize call_rcu() calls.
>
> Pagefault microbenchmarks show performance improvement:
> Hmean     faults/cpu-1    507926.5547 (   0.00%)   506519.3692 *  -0.28%*
> Hmean     faults/cpu-4    479119.7051 (   0.00%)   481333.6802 *   0.46%*
> Hmean     faults/cpu-7    452880.2961 (   0.00%)   455845.6211 *   0.65%*
> Hmean     faults/cpu-12   347639.1021 (   0.00%)   352004.2254 *   1.26%*
> Hmean     faults/cpu-21   200061.2238 (   0.00%)   229597.0317 *  14.76%*
> Hmean     faults/cpu-30   145251.2001 (   0.00%)   164202.5067 *  13.05%*
> Hmean     faults/cpu-48   106848.4434 (   0.00%)   120641.5504 *  12.91%*
> Hmean     faults/cpu-56    92472.3835 (   0.00%)   103464.7916 *  11.89%*
> Hmean     faults/sec-1    507566.1468 (   0.00%)   506139.0811 *  -0.28%*
> Hmean     faults/sec-4   1880478.2402 (   0.00%)  1886795.6329 *   0.34%*
> Hmean     faults/sec-7   3106394.3438 (   0.00%)  3140550.7485 *   1.10%*
> Hmean     faults/sec-12  4061358.4795 (   0.00%)  4112477.0206 *   1.26%*
> Hmean     faults/sec-21  3988619.1169 (   0.00%)  4577747.1436 *  14.77%*
> Hmean     faults/sec-30  3909839.5449 (   0.00%)  4311052.2787 *  10.26%*
> Hmean     faults/sec-48  4761108.4691 (   0.00%)  5283790.5026 *  10.98%*
> Hmean     faults/sec-56  4885561.4590 (   0.00%)  5415839.4045 *  10.85%*
>
> Changes since v7 [4]:
> - Removed additional parameter for vma_iter_store() and introduced
> vma_iter_store_attached() instead, per Vlastimil Babka and
> Liam R. Howlett
> - Fixed coding style nits, per Vlastimil Babka
> - Added Reviewed-bys and Acked-bys, per Vlastimil Babka
> - Added Reviewed-bys and Acked-bys, per Liam R. Howlett
> - Added Acked-by, per Davidlohr Bueso
> - Removed unnecessary patch changeing nommu.c
> - Folded a fixup patch [5] into the patch it was fixing
> - Changed calculation in __refcount_add_not_zero_limited() to avoid
> overflow, to change the limit to be inclusive and to use INT_MAX to
> indicate no limits, per Vlastimil Babka and Matthew Wilcox
> - Folded a fixup patch [6] into the patch it was fixing
> - Added vm_refcnt rules summary in the changelog, per Liam R. Howlett
> - Changed writers to not increment vm_refcnt and adjusted VMA_REF_LIMIT
> to not reserve one count for a writer, per Liam R. Howlett
> - Changed vma_refcount_put() to wake up writers only when the last reader
> is leaving, per Liam R. Howlett
> - Fixed rwsem_acquire_read() parameters when read-locking a vma to match
> the way down_read_trylock() does lockdep, per Vlastimil Babka
> - Folded vma_lockdep_init() into vma_lock_init() for simplicity
> - Brought back vma_copy() to keep vm_refcount at 0 during reuse,
> per Vlastimil Babka
>
> What I did not include in this patchset:
> - Liam's suggestion to change dump_vma() output since it's unclear to me
> how it should look like. The patch is for debug only and not critical for
> the rest of the series, we can change the output later or even drop it if
> necessary.
>
> [1] https://lore.kernel.org/all/20230227173632.3292573-34-surenb@google.com/
> [2] https://lore.kernel.org/all/ZsQyI%2F087V34JoIt@xsang-OptiPlex-9020/
> [3] https://lore.kernel.org/all/CAJuCfpEisU8Lfe96AYJDZ+OM4NoPmnw9bP53cT_kbfP_pR+-2g@mail.gmail.com/
> [4] https://lore.kernel.org/all/20241226170710.1159679-1-surenb@google.com/
> [5] https://lore.kernel.org/all/20250107030415.721474-1-surenb@google.com/
> [6] https://lore.kernel.org/all/20241226200335.1250078-1-surenb@google.com/
>
> Patchset applies over mm-unstable after reverting v7
> (current SHA range: 588f0086398e - fb2270654630)

^^^ Please note that to apply this patchset over mm-unstable you
should revert the previous version. Thanks!

>
> Suren Baghdasaryan (16):
>   mm: introduce vma_start_read_locked{_nested} helpers
>   mm: move per-vma lock into vm_area_struct
>   mm: mark vma as detached until it's added into vma tree
>   mm: introduce vma_iter_store_attached() to use with attached vmas
>   mm: mark vmas detached upon exit
>   types: move struct rcuwait into types.h
>   mm: allow vma_start_read_locked/vma_start_read_locked_nested to fail
>   mm: move mmap_init_lock() out of the header file
>   mm: uninline the main body of vma_start_write()
>   refcount: introduce __refcount_{add|inc}_not_zero_limited
>   mm: replace vm_lock and detached flag with a reference count
>   mm/debug: print vm_refcnt state when dumping the vma
>   mm: remove extra vma_numab_state_init() call
>   mm: prepare lock_vma_under_rcu() for vma reuse possibility
>   mm: make vma cache SLAB_TYPESAFE_BY_RCU
>   docs/mm: document latest changes to vm_lock
>
>  Documentation/mm/process_addrs.rst |  44 +++++----
>  include/linux/mm.h                 | 152 ++++++++++++++++++++++-------
>  include/linux/mm_types.h           |  36 ++++---
>  include/linux/mmap_lock.h          |   6 --
>  include/linux/rcuwait.h            |  13 +--
>  include/linux/refcount.h           |  20 +++-
>  include/linux/slab.h               |   6 --
>  include/linux/types.h              |  12 +++
>  kernel/fork.c                      | 128 +++++++++++-------------
>  mm/debug.c                         |  12 +++
>  mm/init-mm.c                       |   1 +
>  mm/memory.c                        |  94 +++++++++++++++---
>  mm/mmap.c                          |   3 +-
>  mm/userfaultfd.c                   |  32 +++---
>  mm/vma.c                           |  23 ++---
>  mm/vma.h                           |  15 ++-
>  tools/testing/vma/linux/atomic.h   |   5 +
>  tools/testing/vma/vma_internal.h   |  93 ++++++++----------
>  18 files changed, 435 insertions(+), 260 deletions(-)
>
> --
> 2.47.1.613.gc27f4b7a9f-goog
>


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v8 11/16] mm: replace vm_lock and detached flag with a reference count
  2025-01-09  2:30 ` [PATCH v8 11/16] mm: replace vm_lock and detached flag with a reference count Suren Baghdasaryan
@ 2025-01-09 10:35   ` Hillf Danton
  2025-01-09 16:01     ` Suren Baghdasaryan
  2025-01-10 14:34   ` Vlastimil Babka
  1 sibling, 1 reply; 53+ messages in thread
From: Hillf Danton @ 2025-01-09 10:35 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: peterz, lorenzo.stoakes, hannes, hughd, linux-mm, linux-kernel

On Wed,  8 Jan 2025 18:30:20 -0800 Suren Baghdasaryan <surenb@google.com>
> +
> +static inline void vma_refcount_put(struct vm_area_struct *vma)
> +{
> +	int oldcnt;
> +
> +	if (!__refcount_dec_and_test(&vma->vm_refcnt, &oldcnt)) {
> +		rwsem_release(&vma->vmlock_dep_map, _RET_IP_);

In up_read() the rwsem is released regardless of wakeup, which is different
from what is added here. Nit.

> +
> +		if (is_vma_writer_only(oldcnt - 1))
> +			rcuwait_wake_up(&vma->vm_mm->vma_writer_wait);
> +	}
> +}
...
> @@ -735,9 +773,10 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
>  	 * This pairs with RELEASE semantics in vma_end_write_all().
>  	 */
>  	if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&vma->vm_mm->mm_lock_seq))) {
> -		up_read(&vma->vm_lock.lock);
> +		vma_refcount_put(vma);
>  		return false;
>  	}

void up_read(struct rw_semaphore *sem)
{
	rwsem_release(&sem->dep_map, _RET_IP_);
	__up_read(sem);
}
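
The contrast being drawn, sketched out (illustrative only, not the code
posted in the patch): drop the lockdep annotation unconditionally, the way
up_read() does, rather than only on the non-final put:

static inline void vma_refcount_put(struct vm_area_struct *vma)
{
	int oldcnt;

	/* Mirror up_read(): release the dep_map on every put, whether or
	 * not this was the final reference. */
	rwsem_release(&vma->vmlock_dep_map, _RET_IP_);

	if (!__refcount_dec_and_test(&vma->vm_refcnt, &oldcnt) &&
	    is_vma_writer_only(oldcnt - 1))
		rcuwait_wake_up(&vma->vm_mm->vma_writer_wait);
}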


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v8 00/16] move per-vma lock into vm_area_struct
  2025-01-09  2:30 [PATCH v8 00/16] move per-vma lock into vm_area_struct Suren Baghdasaryan
                   ` (16 preceding siblings ...)
  2025-01-09  2:32 ` [PATCH v8 00/16] move per-vma lock into vm_area_struct Suren Baghdasaryan
@ 2025-01-09 11:51 ` Peter Zijlstra
  2025-01-09 15:48   ` Suren Baghdasaryan
  2025-01-09 13:41 ` Vlastimil Babka
  18 siblings, 1 reply; 53+ messages in thread
From: Peter Zijlstra @ 2025-01-09 11:51 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
	hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
	jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
	richard.weiyang, corbet, linux-doc, linux-mm, linux-kernel,
	kernel-team

On Wed, Jan 08, 2025 at 06:30:09PM -0800, Suren Baghdasaryan wrote:
> Back when per-vma locks were introduces, vm_lock was moved out of
> vm_area_struct in [1] because of the performance regression caused by
> false cacheline sharing. Recent investigation [2] revealed that the
> regressions is limited to a rather old Broadwell microarchitecture and
> even there it can be mitigated by disabling adjacent cacheline
> prefetching, see [3].
> Splitting single logical structure into multiple ones leads to more
> complicated management, extra pointer dereferences and overall less
> maintainable code. When that split-away part is a lock, it complicates
> things even further. With no performance benefits, there are no reasons
> for this split. Merging the vm_lock back into vm_area_struct also allows
> vm_area_struct to use SLAB_TYPESAFE_BY_RCU later in this patchset.
> This patchset:
> 1. moves vm_lock back into vm_area_struct, aligning it at the cacheline
> boundary and changing the cache to be cacheline-aligned to minimize
> cacheline sharing;
> 2. changes vm_area_struct initialization to mark new vma as detached until
> it is inserted into vma tree;
> 3. replaces vm_lock and vma->detached flag with a reference counter;
> 4. changes vm_area_struct cache to SLAB_TYPESAFE_BY_RCU to allow for their
> reuse and to minimize call_rcu() calls.

Does not clean up that reattach nonsense :-(


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v8 00/16] move per-vma lock into vm_area_struct
  2025-01-09  2:30 [PATCH v8 00/16] move per-vma lock into vm_area_struct Suren Baghdasaryan
                   ` (17 preceding siblings ...)
  2025-01-09 11:51 ` Peter Zijlstra
@ 2025-01-09 13:41 ` Vlastimil Babka
  2025-01-09 15:57   ` Suren Baghdasaryan
  2025-01-09 15:59   ` Suren Baghdasaryan
  18 siblings, 2 replies; 53+ messages in thread
From: Vlastimil Babka @ 2025-01-09 13:41 UTC (permalink / raw)
  To: Suren Baghdasaryan, akpm
  Cc: peterz, willy, liam.howlett, lorenzo.stoakes, mhocko, hannes,
	mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
	jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
	richard.weiyang, corbet, linux-doc, linux-mm, linux-kernel,
	kernel-team

Btw the subject became rather incomplete given all the series does :)

On 1/9/25 3:30 AM, Suren Baghdasaryan wrote:
> Back when per-vma locks were introduces, vm_lock was moved out of
> vm_area_struct in [1] because of the performance regression caused by
> false cacheline sharing. Recent investigation [2] revealed that the
> regressions is limited to a rather old Broadwell microarchitecture and
> even there it can be mitigated by disabling adjacent cacheline
> prefetching, see [3].
> Splitting single logical structure into multiple ones leads to more
> complicated management, extra pointer dereferences and overall less
> maintainable code. When that split-away part is a lock, it complicates
> things even further. With no performance benefits, there are no reasons
> for this split. Merging the vm_lock back into vm_area_struct also allows
> vm_area_struct to use SLAB_TYPESAFE_BY_RCU later in this patchset.
> This patchset:
> 1. moves vm_lock back into vm_area_struct, aligning it at the cacheline
> boundary and changing the cache to be cacheline-aligned to minimize
> cacheline sharing;
> 2. changes vm_area_struct initialization to mark new vma as detached until
> it is inserted into vma tree;
> 3. replaces vm_lock and vma->detached flag with a reference counter;
> 4. changes vm_area_struct cache to SLAB_TYPESAFE_BY_RCU to allow for their
> reuse and to minimize call_rcu() calls.
> 
> Pagefault microbenchmarks show performance improvement:
> Hmean     faults/cpu-1    507926.5547 (   0.00%)   506519.3692 *  -0.28%*
> Hmean     faults/cpu-4    479119.7051 (   0.00%)   481333.6802 *   0.46%*
> Hmean     faults/cpu-7    452880.2961 (   0.00%)   455845.6211 *   0.65%*
> Hmean     faults/cpu-12   347639.1021 (   0.00%)   352004.2254 *   1.26%*
> Hmean     faults/cpu-21   200061.2238 (   0.00%)   229597.0317 *  14.76%*
> Hmean     faults/cpu-30   145251.2001 (   0.00%)   164202.5067 *  13.05%*
> Hmean     faults/cpu-48   106848.4434 (   0.00%)   120641.5504 *  12.91%*
> Hmean     faults/cpu-56    92472.3835 (   0.00%)   103464.7916 *  11.89%*
> Hmean     faults/sec-1    507566.1468 (   0.00%)   506139.0811 *  -0.28%*
> Hmean     faults/sec-4   1880478.2402 (   0.00%)  1886795.6329 *   0.34%*
> Hmean     faults/sec-7   3106394.3438 (   0.00%)  3140550.7485 *   1.10%*
> Hmean     faults/sec-12  4061358.4795 (   0.00%)  4112477.0206 *   1.26%*
> Hmean     faults/sec-21  3988619.1169 (   0.00%)  4577747.1436 *  14.77%*
> Hmean     faults/sec-30  3909839.5449 (   0.00%)  4311052.2787 *  10.26%*
> Hmean     faults/sec-48  4761108.4691 (   0.00%)  5283790.5026 *  10.98%*
> Hmean     faults/sec-56  4885561.4590 (   0.00%)  5415839.4045 *  10.85%*

Given how patch 2 discusses memory growth due to moving the lock, should
also patch 11 discuss how the replacement with refcount reduces the
memory footprint? And/or the cover letter could summarize the impact of
the whole series in that aspect? Perhaps the refcount doesn't reduce
anything as it's smaller but sits alone in the cacheline? Could it be
grouped with some non-hot fields instead as a followup, so could we get
to <=192 (non-debug) size without impacting performance?

> Changes since v7 [4]:
> - Removed additional parameter for vma_iter_store() and introduced
> vma_iter_store_attached() instead, per Vlastimil Babka and
> Liam R. Howlett
> - Fixed coding style nits, per Vlastimil Babka
> - Added Reviewed-bys and Acked-bys, per Vlastimil Babka
> - Added Reviewed-bys and Acked-bys, per Liam R. Howlett
> - Added Acked-by, per Davidlohr Bueso
> - Removed unnecessary patch changing nommu.c
> - Folded a fixup patch [5] into the patch it was fixing
> - Changed calculation in __refcount_add_not_zero_limited() to avoid
> overflow, to change the limit to be inclusive and to use INT_MAX to
> indicate no limits, per Vlastimil Babka and Matthew Wilcox
> - Folded a fixup patch [6] into the patch it was fixing
> - Added vm_refcnt rules summary in the changelog, per Liam R. Howlett
> - Changed writers to not increment vm_refcnt and adjusted VMA_REF_LIMIT
> to not reserve one count for a writer, per Liam R. Howlett
> - Changed vma_refcount_put() to wake up writers only when the last reader
> is leaving, per Liam R. Howlett
> - Fixed rwsem_acquire_read() parameters when read-locking a vma to match
> the way down_read_trylock() does lockdep, per Vlastimil Babka
> - Folded vma_lockdep_init() into vma_lock_init() for simplicity
> - Brought back vma_copy() to keep vm_refcount at 0 during reuse,
> per Vlastimil Babka
> 
> What I did not include in this patchset:
> - Liam's suggestion to change dump_vma() output, since it's unclear to me
> what it should look like. The patch is for debug only and not critical for
> the rest of the series; we can change the output later or even drop it if
> necessary.
> 
> [1] https://lore.kernel.org/all/20230227173632.3292573-34-surenb@google.com/
> [2] https://lore.kernel.org/all/ZsQyI%2F087V34JoIt@xsang-OptiPlex-9020/
> [3] https://lore.kernel.org/all/CAJuCfpEisU8Lfe96AYJDZ+OM4NoPmnw9bP53cT_kbfP_pR+-2g@mail.gmail.com/
> [4] https://lore.kernel.org/all/20241226170710.1159679-1-surenb@google.com/
> [5] https://lore.kernel.org/all/20250107030415.721474-1-surenb@google.com/
> [6] https://lore.kernel.org/all/20241226200335.1250078-1-surenb@google.com/
> 
> Patchset applies over mm-unstable after reverting v7
> (current SHA range: 588f0086398e - fb2270654630)
> 
> Suren Baghdasaryan (16):
>   mm: introduce vma_start_read_locked{_nested} helpers
>   mm: move per-vma lock into vm_area_struct
>   mm: mark vma as detached until it's added into vma tree
>   mm: introduce vma_iter_store_attached() to use with attached vmas
>   mm: mark vmas detached upon exit
>   types: move struct rcuwait into types.h
>   mm: allow vma_start_read_locked/vma_start_read_locked_nested to fail
>   mm: move mmap_init_lock() out of the header file
>   mm: uninline the main body of vma_start_write()
>   refcount: introduce __refcount_{add|inc}_not_zero_limited
>   mm: replace vm_lock and detached flag with a reference count
>   mm/debug: print vm_refcnt state when dumping the vma
>   mm: remove extra vma_numab_state_init() call
>   mm: prepare lock_vma_under_rcu() for vma reuse possibility
>   mm: make vma cache SLAB_TYPESAFE_BY_RCU
>   docs/mm: document latest changes to vm_lock
> 
>  Documentation/mm/process_addrs.rst |  44 +++++----
>  include/linux/mm.h                 | 152 ++++++++++++++++++++++-------
>  include/linux/mm_types.h           |  36 ++++---
>  include/linux/mmap_lock.h          |   6 --
>  include/linux/rcuwait.h            |  13 +--
>  include/linux/refcount.h           |  20 +++-
>  include/linux/slab.h               |   6 --
>  include/linux/types.h              |  12 +++
>  kernel/fork.c                      | 128 +++++++++++-------------
>  mm/debug.c                         |  12 +++
>  mm/init-mm.c                       |   1 +
>  mm/memory.c                        |  94 +++++++++++++++---
>  mm/mmap.c                          |   3 +-
>  mm/userfaultfd.c                   |  32 +++---
>  mm/vma.c                           |  23 ++---
>  mm/vma.h                           |  15 ++-
>  tools/testing/vma/linux/atomic.h   |   5 +
>  tools/testing/vma/vma_internal.h   |  93 ++++++++----------
>  18 files changed, 435 insertions(+), 260 deletions(-)
> 



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v8 04/16] mm: introduce vma_iter_store_attached() to use with attached vmas
  2025-01-09  2:30 ` [PATCH v8 04/16] mm: introduce vma_iter_store_attached() to use with attached vmas Suren Baghdasaryan
@ 2025-01-09 14:01   ` Vlastimil Babka
  0 siblings, 0 replies; 53+ messages in thread
From: Vlastimil Babka @ 2025-01-09 14:01 UTC (permalink / raw)
  To: Suren Baghdasaryan, akpm
  Cc: peterz, willy, liam.howlett, lorenzo.stoakes, mhocko, hannes,
	mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
	jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
	richard.weiyang, corbet, linux-doc, linux-mm, linux-kernel,
	kernel-team

On 1/9/25 3:30 AM, Suren Baghdasaryan wrote:
> vma_iter_store() functions can be used both when adding a new vma and
> when updating an existing one. However for existing ones we do not need
> to mark them attached as they are already marked that way. Introduce
> vma_iter_store_attached() to be used with already attached vmas.
> 
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>

Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
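
A sketch of the resulting split (assumed shape: the real helpers may also
set the maple tree range and carry extra debug checks, and
vma_assert_attached() stands in for whatever assertion the patch uses):

/* New vma: store it in the tree, then mark it attached. */
static inline void vma_iter_store(struct vma_iterator *vmi,
				  struct vm_area_struct *vma)
{
	mas_store_prealloc(&vmi->mas, vma);
	vma_mark_attached(vma);
}

/* Already-attached vma: only update the tree entry. */
static inline void vma_iter_store_attached(struct vma_iterator *vmi,
					   struct vm_area_struct *vma)
{
	vma_assert_attached(vma);
	mas_store_prealloc(&vmi->mas, vma);
}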



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v8 10/16] refcount: introduce __refcount_{add|inc}_not_zero_limited
  2025-01-09  2:30 ` [PATCH v8 10/16] refcount: introduce __refcount_{add|inc}_not_zero_limited Suren Baghdasaryan
@ 2025-01-09 14:42   ` Vlastimil Babka
  2025-01-11  1:33     ` Suren Baghdasaryan
  0 siblings, 1 reply; 53+ messages in thread
From: Vlastimil Babka @ 2025-01-09 14:42 UTC (permalink / raw)
  To: Suren Baghdasaryan, akpm
  Cc: peterz, willy, liam.howlett, lorenzo.stoakes, mhocko, hannes,
	mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
	jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
	richard.weiyang, corbet, linux-doc, linux-mm, linux-kernel,
	kernel-team

On 1/9/25 3:30 AM, Suren Baghdasaryan wrote:
> Introduce functions to increase refcount but with a top limit above which
> they will fail to increase (the limit is inclusive). Setting the limit to
> INT_MAX indicates no limit.
> 
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>
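
A minimal sketch of those semantics, written against a bare atomic_t rather
than the refcount_t internals the real helper lives in (saturation handling
omitted): the limit is inclusive, and comparing against limit - i keeps the
check overflow-safe, so INT_MAX behaves as "no limit":

static inline bool add_not_zero_limited_sketch(int i, atomic_t *r,
					       int limit, int *oldp)
{
	int old = atomic_read(r);

	do {
		if (!old)
			return false;	/* zero means "do not resurrect" */
		if (old > limit - i)
			return false;	/* old + i would exceed the inclusive limit */
	} while (!atomic_try_cmpxchg(r, &old, old + i));

	if (oldp)
		*oldp = old;
	return true;
}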



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v8 00/16] move per-vma lock into vm_area_struct
  2025-01-09 11:51 ` Peter Zijlstra
@ 2025-01-09 15:48   ` Suren Baghdasaryan
  2025-01-10 17:01     ` Peter Zijlstra
  0 siblings, 1 reply; 53+ messages in thread
From: Suren Baghdasaryan @ 2025-01-09 15:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: akpm, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
	hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
	jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
	richard.weiyang, corbet, linux-doc, linux-mm, linux-kernel,
	kernel-team

On Thu, Jan 9, 2025 at 3:51 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Jan 08, 2025 at 06:30:09PM -0800, Suren Baghdasaryan wrote:
> > Back when per-vma locks were introduces, vm_lock was moved out of
> > vm_area_struct in [1] because of the performance regression caused by
> > false cacheline sharing. Recent investigation [2] revealed that the
> > regressions is limited to a rather old Broadwell microarchitecture and
> > even there it can be mitigated by disabling adjacent cacheline
> > prefetching, see [3].
> > Splitting single logical structure into multiple ones leads to more
> > complicated management, extra pointer dereferences and overall less
> > maintainable code. When that split-away part is a lock, it complicates
> > things even further. With no performance benefits, there are no reasons
> > for this split. Merging the vm_lock back into vm_area_struct also allows
> > vm_area_struct to use SLAB_TYPESAFE_BY_RCU later in this patchset.
> > This patchset:
> > 1. moves vm_lock back into vm_area_struct, aligning it at the cacheline
> > boundary and changing the cache to be cacheline-aligned to minimize
> > cacheline sharing;
> > 2. changes vm_area_struct initialization to mark new vma as detached until
> > it is inserted into vma tree;
> > 3. replaces vm_lock and vma->detached flag with a reference counter;
> > 4. changes vm_area_struct cache to SLAB_TYPESAFE_BY_RCU to allow for their
> > reuse and to minimize call_rcu() calls.
>
> Does not clean up that reattach nonsense :-(

Oh, no. I think it does. That's why in [1] I introduce
vma_iter_store_attached() to be used on already attached vmas and to
avoid marking them attached again. Also I added assertions in
vma_mark_attached()/vma_mark_detached() to avoid re-attaching or
re-detaching. Unless I misunderstood your comment?

[1] https://lore.kernel.org/all/20250109023025.2242447-5-surenb@google.com/
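
A sketch of the assertion pattern being described (names and placement are
assumptions; at this point in the series the detached state is still a flag
rather than the later refcount):

static inline void vma_mark_attached(struct vm_area_struct *vma)
{
	vma_assert_write_locked(vma);
	VM_WARN_ON_ONCE(!vma->detached);	/* re-attach would be a bug */
	vma->detached = false;
}

static inline void vma_mark_detached(struct vm_area_struct *vma)
{
	vma_assert_write_locked(vma);
	VM_WARN_ON_ONCE(vma->detached);		/* re-detach would be a bug */
	vma->detached = true;
}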


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v8 00/16] move per-vma lock into vm_area_struct
  2025-01-09 13:41 ` Vlastimil Babka
@ 2025-01-09 15:57   ` Suren Baghdasaryan
  2025-01-10  0:14     ` Suren Baghdasaryan
  2025-01-09 15:59   ` Suren Baghdasaryan
  1 sibling, 1 reply; 53+ messages in thread
From: Suren Baghdasaryan @ 2025-01-09 15:57 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: akpm, peterz, willy, liam.howlett, lorenzo.stoakes, mhocko,
	hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
	jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
	richard.weiyang, corbet, linux-doc, linux-mm, linux-kernel,
	kernel-team

On Thu, Jan 9, 2025 at 5:40 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> Btw the subject became rather incomplete given all the series does :)
>
> On 1/9/25 3:30 AM, Suren Baghdasaryan wrote:
> > Back when per-vma locks were introduces, vm_lock was moved out of
> > vm_area_struct in [1] because of the performance regression caused by
> > false cacheline sharing. Recent investigation [2] revealed that the
> > regressions is limited to a rather old Broadwell microarchitecture and
> > even there it can be mitigated by disabling adjacent cacheline
> > prefetching, see [3].
> > Splitting single logical structure into multiple ones leads to more
> > complicated management, extra pointer dereferences and overall less
> > maintainable code. When that split-away part is a lock, it complicates
> > things even further. With no performance benefits, there are no reasons
> > for this split. Merging the vm_lock back into vm_area_struct also allows
> > vm_area_struct to use SLAB_TYPESAFE_BY_RCU later in this patchset.
> > This patchset:
> > 1. moves vm_lock back into vm_area_struct, aligning it at the cacheline
> > boundary and changing the cache to be cacheline-aligned to minimize
> > cacheline sharing;
> > 2. changes vm_area_struct initialization to mark new vma as detached until
> > it is inserted into vma tree;
> > 3. replaces vm_lock and vma->detached flag with a reference counter;
> > 4. changes vm_area_struct cache to SLAB_TYPESAFE_BY_RCU to allow for their
> > reuse and to minimize call_rcu() calls.
> >
> > Pagefault microbenchmarks show performance improvement:
> > Hmean     faults/cpu-1    507926.5547 (   0.00%)   506519.3692 *  -0.28%*
> > Hmean     faults/cpu-4    479119.7051 (   0.00%)   481333.6802 *   0.46%*
> > Hmean     faults/cpu-7    452880.2961 (   0.00%)   455845.6211 *   0.65%*
> > Hmean     faults/cpu-12   347639.1021 (   0.00%)   352004.2254 *   1.26%*
> > Hmean     faults/cpu-21   200061.2238 (   0.00%)   229597.0317 *  14.76%*
> > Hmean     faults/cpu-30   145251.2001 (   0.00%)   164202.5067 *  13.05%*
> > Hmean     faults/cpu-48   106848.4434 (   0.00%)   120641.5504 *  12.91%*
> > Hmean     faults/cpu-56    92472.3835 (   0.00%)   103464.7916 *  11.89%*
> > Hmean     faults/sec-1    507566.1468 (   0.00%)   506139.0811 *  -0.28%*
> > Hmean     faults/sec-4   1880478.2402 (   0.00%)  1886795.6329 *   0.34%*
> > Hmean     faults/sec-7   3106394.3438 (   0.00%)  3140550.7485 *   1.10%*
> > Hmean     faults/sec-12  4061358.4795 (   0.00%)  4112477.0206 *   1.26%*
> > Hmean     faults/sec-21  3988619.1169 (   0.00%)  4577747.1436 *  14.77%*
> > Hmean     faults/sec-30  3909839.5449 (   0.00%)  4311052.2787 *  10.26%*
> > Hmean     faults/sec-48  4761108.4691 (   0.00%)  5283790.5026 *  10.98%*
> > Hmean     faults/sec-56  4885561.4590 (   0.00%)  5415839.4045 *  10.85%*
>
> Given how patch 2 discusses memory growth due to moving the lock, should
> also patch 11 discuss how the replacement with refcount reduces the
> memory footprint? And/or the cover letter could summarize the impact of
> the whole series in that aspect?

That's a good idea. I can amend the cover letter and the description
of patch 11 to include size information.

> Perhaps the refcount doesn't reduce
> anything as it's smaller but sits alone in the cacheline? Could it be
> grouped with some non-hot fields instead as a followup, so could we get
> to <=192 (non-debug) size without impacting performance?

Yes, absolutely. Before this series, vm_area_struct was roughly 168
bytes and vm_lock was 40 bytes. After the changes vm_area_struct
becomes 256 bytes. I was planning to pack the fields as a follow-up
patch similar to an earlier one [1] and bring the size of
vm_area_struct to < 192. I felt this patchset already does many things
and did not include it here but I can add it at the end of this
patchset if you think it's essential.

[1] https://lore.kernel.org/all/20241111205506.3404479-5-surenb@google.com/
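
For scale, assuming 64-byte cachelines (an assumption, not something stated
in the thread): the split layout was 168 bytes of vm_area_struct plus a
separate 40-byte vm_lock allocation, i.e. 208 bytes across two objects; the
merged, cacheline-aligned struct rounds up to 256 bytes (four lines), and
the follow-up packing aims at <= 192 bytes (three lines).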

>
> > Changes since v7 [4]:
> > - Removed additional parameter for vma_iter_store() and introduced
> > vma_iter_store_attached() instead, per Vlastimil Babka and
> > Liam R. Howlett
> > - Fixed coding style nits, per Vlastimil Babka
> > - Added Reviewed-bys and Acked-bys, per Vlastimil Babka
> > - Added Reviewed-bys and Acked-bys, per Liam R. Howlett
> > - Added Acked-by, per Davidlohr Bueso
> > - Removed unnecessary patch changing nommu.c
> > - Folded a fixup patch [5] into the patch it was fixing
> > - Changed calculation in __refcount_add_not_zero_limited() to avoid
> > overflow, to change the limit to be inclusive and to use INT_MAX to
> > indicate no limits, per Vlastimil Babka and Matthew Wilcox
> > - Folded a fixup patch [6] into the patch it was fixing
> > - Added vm_refcnt rules summary in the changelog, per Liam R. Howlett
> > - Changed writers to not increment vm_refcnt and adjusted VMA_REF_LIMIT
> > to not reserve one count for a writer, per Liam R. Howlett
> > - Changed vma_refcount_put() to wake up writers only when the last reader
> > is leaving, per Liam R. Howlett
> > - Fixed rwsem_acquire_read() parameters when read-locking a vma to match
> > the way down_read_trylock() does lockdep, per Vlastimil Babka
> > - Folded vma_lockdep_init() into vma_lock_init() for simplicity
> > - Brought back vma_copy() to keep vm_refcount at 0 during reuse,
> > per Vlastimil Babka
> >
> > What I did not include in this patchset:
> > - Liam's suggestion to change dump_vma() output, since it's unclear to me
> > what it should look like. The patch is for debug only and not critical for
> > the rest of the series; we can change the output later or even drop it if
> > necessary.
> >
> > [1] https://lore.kernel.org/all/20230227173632.3292573-34-surenb@google.com/
> > [2] https://lore.kernel.org/all/ZsQyI%2F087V34JoIt@xsang-OptiPlex-9020/
> > [3] https://lore.kernel.org/all/CAJuCfpEisU8Lfe96AYJDZ+OM4NoPmnw9bP53cT_kbfP_pR+-2g@mail.gmail.com/
> > [4] https://lore.kernel.org/all/20241226170710.1159679-1-surenb@google.com/
> > [5] https://lore.kernel.org/all/20250107030415.721474-1-surenb@google.com/
> > [6] https://lore.kernel.org/all/20241226200335.1250078-1-surenb@google.com/
> >
> > Patchset applies over mm-unstable after reverting v7
> > (current SHA range: 588f0086398e - fb2270654630)
> >
> > Suren Baghdasaryan (16):
> >   mm: introduce vma_start_read_locked{_nested} helpers
> >   mm: move per-vma lock into vm_area_struct
> >   mm: mark vma as detached until it's added into vma tree
> >   mm: introduce vma_iter_store_attached() to use with attached vmas
> >   mm: mark vmas detached upon exit
> >   types: move struct rcuwait into types.h
> >   mm: allow vma_start_read_locked/vma_start_read_locked_nested to fail
> >   mm: move mmap_init_lock() out of the header file
> >   mm: uninline the main body of vma_start_write()
> >   refcount: introduce __refcount_{add|inc}_not_zero_limited
> >   mm: replace vm_lock and detached flag with a reference count
> >   mm/debug: print vm_refcnt state when dumping the vma
> >   mm: remove extra vma_numab_state_init() call
> >   mm: prepare lock_vma_under_rcu() for vma reuse possibility
> >   mm: make vma cache SLAB_TYPESAFE_BY_RCU
> >   docs/mm: document latest changes to vm_lock
> >
> >  Documentation/mm/process_addrs.rst |  44 +++++----
> >  include/linux/mm.h                 | 152 ++++++++++++++++++++++-------
> >  include/linux/mm_types.h           |  36 ++++---
> >  include/linux/mmap_lock.h          |   6 --
> >  include/linux/rcuwait.h            |  13 +--
> >  include/linux/refcount.h           |  20 +++-
> >  include/linux/slab.h               |   6 --
> >  include/linux/types.h              |  12 +++
> >  kernel/fork.c                      | 128 +++++++++++-------------
> >  mm/debug.c                         |  12 +++
> >  mm/init-mm.c                       |   1 +
> >  mm/memory.c                        |  94 +++++++++++++++---
> >  mm/mmap.c                          |   3 +-
> >  mm/userfaultfd.c                   |  32 +++---
> >  mm/vma.c                           |  23 ++---
> >  mm/vma.h                           |  15 ++-
> >  tools/testing/vma/linux/atomic.h   |   5 +
> >  tools/testing/vma/vma_internal.h   |  93 ++++++++----------
> >  18 files changed, 435 insertions(+), 260 deletions(-)
> >
>


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v8 00/16] move per-vma lock into vm_area_struct
  2025-01-09 13:41 ` Vlastimil Babka
  2025-01-09 15:57   ` Suren Baghdasaryan
@ 2025-01-09 15:59   ` Suren Baghdasaryan
  2025-01-10  0:16     ` Suren Baghdasaryan
  1 sibling, 1 reply; 53+ messages in thread
From: Suren Baghdasaryan @ 2025-01-09 15:59 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: akpm, peterz, willy, liam.howlett, lorenzo.stoakes, mhocko,
	hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
	jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
	richard.weiyang, corbet, linux-doc, linux-mm, linux-kernel,
	kernel-team

On Thu, Jan 9, 2025 at 5:40 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> Btw the subject became rather incomplete given all the series does :)

Missed this one. What do you think is worth mentioning here? It's
true, the patchset does many small things but I wanted to outline the
main conceptual changes. Please LMK if you think there are more
changes big enough to be mentioned here.

>
> On 1/9/25 3:30 AM, Suren Baghdasaryan wrote:
> > Back when per-vma locks were introduces, vm_lock was moved out of
> > vm_area_struct in [1] because of the performance regression caused by
> > false cacheline sharing. Recent investigation [2] revealed that the
> > regressions is limited to a rather old Broadwell microarchitecture and
> > even there it can be mitigated by disabling adjacent cacheline
> > prefetching, see [3].
> > Splitting single logical structure into multiple ones leads to more
> > complicated management, extra pointer dereferences and overall less
> > maintainable code. When that split-away part is a lock, it complicates
> > things even further. With no performance benefits, there are no reasons
> > for this split. Merging the vm_lock back into vm_area_struct also allows
> > vm_area_struct to use SLAB_TYPESAFE_BY_RCU later in this patchset.
> > This patchset:
> > 1. moves vm_lock back into vm_area_struct, aligning it at the cacheline
> > boundary and changing the cache to be cacheline-aligned to minimize
> > cacheline sharing;
> > 2. changes vm_area_struct initialization to mark new vma as detached until
> > it is inserted into vma tree;
> > 3. replaces vm_lock and vma->detached flag with a reference counter;
> > 4. changes vm_area_struct cache to SLAB_TYPESAFE_BY_RCU to allow for their
> > reuse and to minimize call_rcu() calls.
> >
> > Pagefault microbenchmarks show performance improvement:
> > Hmean     faults/cpu-1    507926.5547 (   0.00%)   506519.3692 *  -0.28%*
> > Hmean     faults/cpu-4    479119.7051 (   0.00%)   481333.6802 *   0.46%*
> > Hmean     faults/cpu-7    452880.2961 (   0.00%)   455845.6211 *   0.65%*
> > Hmean     faults/cpu-12   347639.1021 (   0.00%)   352004.2254 *   1.26%*
> > Hmean     faults/cpu-21   200061.2238 (   0.00%)   229597.0317 *  14.76%*
> > Hmean     faults/cpu-30   145251.2001 (   0.00%)   164202.5067 *  13.05%*
> > Hmean     faults/cpu-48   106848.4434 (   0.00%)   120641.5504 *  12.91%*
> > Hmean     faults/cpu-56    92472.3835 (   0.00%)   103464.7916 *  11.89%*
> > Hmean     faults/sec-1    507566.1468 (   0.00%)   506139.0811 *  -0.28%*
> > Hmean     faults/sec-4   1880478.2402 (   0.00%)  1886795.6329 *   0.34%*
> > Hmean     faults/sec-7   3106394.3438 (   0.00%)  3140550.7485 *   1.10%*
> > Hmean     faults/sec-12  4061358.4795 (   0.00%)  4112477.0206 *   1.26%*
> > Hmean     faults/sec-21  3988619.1169 (   0.00%)  4577747.1436 *  14.77%*
> > Hmean     faults/sec-30  3909839.5449 (   0.00%)  4311052.2787 *  10.26%*
> > Hmean     faults/sec-48  4761108.4691 (   0.00%)  5283790.5026 *  10.98%*
> > Hmean     faults/sec-56  4885561.4590 (   0.00%)  5415839.4045 *  10.85%*
>
> Given how patch 2 discusses memory growth due to moving the lock, should
> also patch 11 discuss how the replacement with refcount reduces the
> memory footprint? And/or the cover letter could summarize the impact of
> the whole series in that aspect? Perhaps the refcount doesn't reduce
> anything as it's smaller but sits alone in the cacheline? Could it be
> grouped with some non-hot fields instead as a followup, so could we get
> to <=192 (non-debug) size without impacting performance?
>
> > Changes since v7 [4]:
> > - Removed additional parameter for vma_iter_store() and introduced
> > vma_iter_store_attached() instead, per Vlastimil Babka and
> > Liam R. Howlett
> > - Fixed coding style nits, per Vlastimil Babka
> > - Added Reviewed-bys and Acked-bys, per Vlastimil Babka
> > - Added Reviewed-bys and Acked-bys, per Liam R. Howlett
> > - Added Acked-by, per Davidlohr Bueso
> > - Removed unnecessary patch changing nommu.c
> > - Folded a fixup patch [5] into the patch it was fixing
> > - Changed calculation in __refcount_add_not_zero_limited() to avoid
> > overflow, to change the limit to be inclusive and to use INT_MAX to
> > indicate no limits, per Vlastimil Babka and Matthew Wilcox
> > - Folded a fixup patch [6] into the patch it was fixing
> > - Added vm_refcnt rules summary in the changelog, per Liam R. Howlett
> > - Changed writers to not increment vm_refcnt and adjusted VMA_REF_LIMIT
> > to not reserve one count for a writer, per Liam R. Howlett
> > - Changed vma_refcount_put() to wake up writers only when the last reader
> > is leaving, per Liam R. Howlett
> > - Fixed rwsem_acquire_read() parameters when read-locking a vma to match
> > the way down_read_trylock() does lockdep, per Vlastimil Babka
> > - Folded vma_lockdep_init() into vma_lock_init() for simplicity
> > - Brought back vma_copy() to keep vm_refcount at 0 during reuse,
> > per Vlastimil Babka
> >
> > What I did not include in this patchset:
> > - Liam's suggestion to change dump_vma() output, since it's unclear to me
> > what it should look like. The patch is for debug only and not critical for
> > the rest of the series; we can change the output later or even drop it if
> > necessary.
> >
> > [1] https://lore.kernel.org/all/20230227173632.3292573-34-surenb@google.com/
> > [2] https://lore.kernel.org/all/ZsQyI%2F087V34JoIt@xsang-OptiPlex-9020/
> > [3] https://lore.kernel.org/all/CAJuCfpEisU8Lfe96AYJDZ+OM4NoPmnw9bP53cT_kbfP_pR+-2g@mail.gmail.com/
> > [4] https://lore.kernel.org/all/20241226170710.1159679-1-surenb@google.com/
> > [5] https://lore.kernel.org/all/20250107030415.721474-1-surenb@google.com/
> > [6] https://lore.kernel.org/all/20241226200335.1250078-1-surenb@google.com/
> >
> > Patchset applies over mm-unstable after reverting v7
> > (current SHA range: 588f0086398e - fb2270654630)
> >
> > Suren Baghdasaryan (16):
> >   mm: introduce vma_start_read_locked{_nested} helpers
> >   mm: move per-vma lock into vm_area_struct
> >   mm: mark vma as detached until it's added into vma tree
> >   mm: introduce vma_iter_store_attached() to use with attached vmas
> >   mm: mark vmas detached upon exit
> >   types: move struct rcuwait into types.h
> >   mm: allow vma_start_read_locked/vma_start_read_locked_nested to fail
> >   mm: move mmap_init_lock() out of the header file
> >   mm: uninline the main body of vma_start_write()
> >   refcount: introduce __refcount_{add|inc}_not_zero_limited
> >   mm: replace vm_lock and detached flag with a reference count
> >   mm/debug: print vm_refcnt state when dumping the vma
> >   mm: remove extra vma_numab_state_init() call
> >   mm: prepare lock_vma_under_rcu() for vma reuse possibility
> >   mm: make vma cache SLAB_TYPESAFE_BY_RCU
> >   docs/mm: document latest changes to vm_lock
> >
> >  Documentation/mm/process_addrs.rst |  44 +++++----
> >  include/linux/mm.h                 | 152 ++++++++++++++++++++++-------
> >  include/linux/mm_types.h           |  36 ++++---
> >  include/linux/mmap_lock.h          |   6 --
> >  include/linux/rcuwait.h            |  13 +--
> >  include/linux/refcount.h           |  20 +++-
> >  include/linux/slab.h               |   6 --
> >  include/linux/types.h              |  12 +++
> >  kernel/fork.c                      | 128 +++++++++++-------------
> >  mm/debug.c                         |  12 +++
> >  mm/init-mm.c                       |   1 +
> >  mm/memory.c                        |  94 +++++++++++++++---
> >  mm/mmap.c                          |   3 +-
> >  mm/userfaultfd.c                   |  32 +++---
> >  mm/vma.c                           |  23 ++---
> >  mm/vma.h                           |  15 ++-
> >  tools/testing/vma/linux/atomic.h   |   5 +
> >  tools/testing/vma/vma_internal.h   |  93 ++++++++----------
> >  18 files changed, 435 insertions(+), 260 deletions(-)
> >
>


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v8 11/16] mm: replace vm_lock and detached flag with a reference count
  2025-01-09 10:35   ` Hillf Danton
@ 2025-01-09 16:01     ` Suren Baghdasaryan
  0 siblings, 0 replies; 53+ messages in thread
From: Suren Baghdasaryan @ 2025-01-09 16:01 UTC (permalink / raw)
  To: Hillf Danton
  Cc: peterz, lorenzo.stoakes, hannes, hughd, linux-mm, linux-kernel

On Thu, Jan 9, 2025 at 2:36 AM Hillf Danton <hdanton@sina.com> wrote:
>
> On Wed,  8 Jan 2025 18:30:20 -0800 Suren Baghdasaryan <surenb@google.com>
> > +
> > +static inline void vma_refcount_put(struct vm_area_struct *vma)
> > +{
> > +     int oldcnt;
> > +
> > +     if (!__refcount_dec_and_test(&vma->vm_refcnt, &oldcnt)) {
> > +             rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
>
> In up_read() rwsem is released reguardless wakeup, which is different
> than what is added here. Nit.

Good point. I'll send a fixup since it's a small change. Thanks!

>
> > +
> > +             if (is_vma_writer_only(oldcnt - 1))
> > +                     rcuwait_wake_up(&vma->vm_mm->vma_writer_wait);
> > +     }
> > +}
> ...
> > @@ -735,9 +773,10 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> >        * This pairs with RELEASE semantics in vma_end_write_all().
> >        */
> >       if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&vma->vm_mm->mm_lock_seq))) {
> > -             up_read(&vma->vm_lock.lock);
> > +             vma_refcount_put(vma);
> >               return false;
> >       }
>
> void up_read(struct rw_semaphore *sem)
> {
>         rwsem_release(&sem->dep_map, _RET_IP_);
>         __up_read(sem);
> }
>


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v8 00/16] move per-vma lock into vm_area_struct
  2025-01-09 15:57   ` Suren Baghdasaryan
@ 2025-01-10  0:14     ` Suren Baghdasaryan
  0 siblings, 0 replies; 53+ messages in thread
From: Suren Baghdasaryan @ 2025-01-10  0:14 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: akpm, peterz, willy, liam.howlett, lorenzo.stoakes, mhocko,
	hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
	jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
	richard.weiyang, corbet, linux-doc, linux-mm, linux-kernel,
	kernel-team

On Thu, Jan 9, 2025 at 7:57 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Thu, Jan 9, 2025 at 5:40 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> >
> > Btw the subject became rather incomplete given all the series does :)
> >
> > On 1/9/25 3:30 AM, Suren Baghdasaryan wrote:
> > > Back when per-vma locks were introduces, vm_lock was moved out of
> > > vm_area_struct in [1] because of the performance regression caused by
> > > false cacheline sharing. Recent investigation [2] revealed that the
> > > regressions is limited to a rather old Broadwell microarchitecture and
> > > even there it can be mitigated by disabling adjacent cacheline
> > > prefetching, see [3].
> > > Splitting single logical structure into multiple ones leads to more
> > > complicated management, extra pointer dereferences and overall less
> > > maintainable code. When that split-away part is a lock, it complicates
> > > things even further. With no performance benefits, there are no reasons
> > > for this split. Merging the vm_lock back into vm_area_struct also allows
> > > vm_area_struct to use SLAB_TYPESAFE_BY_RCU later in this patchset.
> > > This patchset:
> > > 1. moves vm_lock back into vm_area_struct, aligning it at the cacheline
> > > boundary and changing the cache to be cacheline-aligned to minimize
> > > cacheline sharing;
> > > 2. changes vm_area_struct initialization to mark new vma as detached until
> > > it is inserted into vma tree;
> > > 3. replaces vm_lock and vma->detached flag with a reference counter;
> > > 4. changes vm_area_struct cache to SLAB_TYPESAFE_BY_RCU to allow for their
> > > reuse and to minimize call_rcu() calls.
> > >
> > > Pagefault microbenchmarks show performance improvement:
> > > Hmean     faults/cpu-1    507926.5547 (   0.00%)   506519.3692 *  -0.28%*
> > > Hmean     faults/cpu-4    479119.7051 (   0.00%)   481333.6802 *   0.46%*
> > > Hmean     faults/cpu-7    452880.2961 (   0.00%)   455845.6211 *   0.65%*
> > > Hmean     faults/cpu-12   347639.1021 (   0.00%)   352004.2254 *   1.26%*
> > > Hmean     faults/cpu-21   200061.2238 (   0.00%)   229597.0317 *  14.76%*
> > > Hmean     faults/cpu-30   145251.2001 (   0.00%)   164202.5067 *  13.05%*
> > > Hmean     faults/cpu-48   106848.4434 (   0.00%)   120641.5504 *  12.91%*
> > > Hmean     faults/cpu-56    92472.3835 (   0.00%)   103464.7916 *  11.89%*
> > > Hmean     faults/sec-1    507566.1468 (   0.00%)   506139.0811 *  -0.28%*
> > > Hmean     faults/sec-4   1880478.2402 (   0.00%)  1886795.6329 *   0.34%*
> > > Hmean     faults/sec-7   3106394.3438 (   0.00%)  3140550.7485 *   1.10%*
> > > Hmean     faults/sec-12  4061358.4795 (   0.00%)  4112477.0206 *   1.26%*
> > > Hmean     faults/sec-21  3988619.1169 (   0.00%)  4577747.1436 *  14.77%*
> > > Hmean     faults/sec-30  3909839.5449 (   0.00%)  4311052.2787 *  10.26%*
> > > Hmean     faults/sec-48  4761108.4691 (   0.00%)  5283790.5026 *  10.98%*
> > > Hmean     faults/sec-56  4885561.4590 (   0.00%)  5415839.4045 *  10.85%*
> >
> > Given how patch 2 discusses memory growth due to moving the lock, should
> > also patch 11 discuss how the replacement with refcount reduces the
> > memory footprint? And/or the cover letter could summarize the impact of
> > the whole series in that aspect?
>
> That's a good idea. I can amend the cover letter and the description
> of patch 11 to include size information.
>
> > Perhaps the refcount doesn't reduce
> > anything as it's smaller but sits alone in the cacheline? Could it be
> > grouped with some non-hot fields instead as a followup, so could we get
> > to <=192 (non-debug) size without impacting performance?
>
> Yes, absolutely. Before this series, vm_area_struct was roughly 168
> bytes and vm_lock was 40 bytes. After the changes vm_area_struct
> becomes 256 bytes. I was planning to pack the fields as a follow-up
> patch similar to an earlier one [1] and bring the size of
> vm_area_struct to < 192. I felt this patchset already does many things
> and did not include it here but I can add it at the end of this
> patchset if you think it's essential.
>
> [1] https://lore.kernel.org/all/20241111205506.3404479-5-surenb@google.com/

Actually I tried to rewrite the above patch [1] over the latest
patchset and it's pretty much the same. I think I should include it at
the end of this patchset as it's pretty simple. Planning to post v9
tomorrow morning, so if you don't want it in this patchset, please let
me know.
Thanks!

>
> >
> > > Changes since v7 [4]:
> > > - Removed additional parameter for vma_iter_store() and introduced
> > > vma_iter_store_attached() instead, per Vlastimil Babka and
> > > Liam R. Howlett
> > > - Fixed coding style nits, per Vlastimil Babka
> > > - Added Reviewed-bys and Acked-bys, per Vlastimil Babka
> > > - Added Reviewed-bys and Acked-bys, per Liam R. Howlett
> > > - Added Acked-by, per Davidlohr Bueso
> > > - Removed unnecessary patch changing nommu.c
> > > - Folded a fixup patch [5] into the patch it was fixing
> > > - Changed calculation in __refcount_add_not_zero_limited() to avoid
> > > overflow, to change the limit to be inclusive and to use INT_MAX to
> > > indicate no limits, per Vlastimil Babka and Matthew Wilcox
> > > - Folded a fixup patch [6] into the patch it was fixing
> > > - Added vm_refcnt rules summary in the changelog, per Liam R. Howlett
> > > - Changed writers to not increment vm_refcnt and adjusted VMA_REF_LIMIT
> > > to not reserve one count for a writer, per Liam R. Howlett
> > > - Changed vma_refcount_put() to wake up writers only when the last reader
> > > is leaving, per Liam R. Howlett
> > > - Fixed rwsem_acquire_read() parameters when read-locking a vma to match
> > > the way down_read_trylock() does lockdep, per Vlastimil Babka
> > > - Folded vma_lockdep_init() into vma_lock_init() for simplicity
> > > - Brought back vma_copy() to keep vm_refcount at 0 during reuse,
> > > per Vlastimil Babka
> > >
> > > What I did not include in this patchset:
> > > - Liam's suggestion to change dump_vma() output, since it's unclear to me
> > > what it should look like. The patch is for debug only and not critical for
> > > the rest of the series; we can change the output later or even drop it if
> > > necessary.
> > >
> > > [1] https://lore.kernel.org/all/20230227173632.3292573-34-surenb@google.com/
> > > [2] https://lore.kernel.org/all/ZsQyI%2F087V34JoIt@xsang-OptiPlex-9020/
> > > [3] https://lore.kernel.org/all/CAJuCfpEisU8Lfe96AYJDZ+OM4NoPmnw9bP53cT_kbfP_pR+-2g@mail.gmail.com/
> > > [4] https://lore.kernel.org/all/20241226170710.1159679-1-surenb@google.com/
> > > [5] https://lore.kernel.org/all/20250107030415.721474-1-surenb@google.com/
> > > [6] https://lore.kernel.org/all/20241226200335.1250078-1-surenb@google.com/
> > >
> > > Patchset applies over mm-unstable after reverting v7
> > > (current SHA range: 588f0086398e - fb2270654630)
> > >
> > > Suren Baghdasaryan (16):
> > >   mm: introduce vma_start_read_locked{_nested} helpers
> > >   mm: move per-vma lock into vm_area_struct
> > >   mm: mark vma as detached until it's added into vma tree
> > >   mm: introduce vma_iter_store_attached() to use with attached vmas
> > >   mm: mark vmas detached upon exit
> > >   types: move struct rcuwait into types.h
> > >   mm: allow vma_start_read_locked/vma_start_read_locked_nested to fail
> > >   mm: move mmap_init_lock() out of the header file
> > >   mm: uninline the main body of vma_start_write()
> > >   refcount: introduce __refcount_{add|inc}_not_zero_limited
> > >   mm: replace vm_lock and detached flag with a reference count
> > >   mm/debug: print vm_refcnt state when dumping the vma
> > >   mm: remove extra vma_numab_state_init() call
> > >   mm: prepare lock_vma_under_rcu() for vma reuse possibility
> > >   mm: make vma cache SLAB_TYPESAFE_BY_RCU
> > >   docs/mm: document latest changes to vm_lock
> > >
> > >  Documentation/mm/process_addrs.rst |  44 +++++----
> > >  include/linux/mm.h                 | 152 ++++++++++++++++++++++-------
> > >  include/linux/mm_types.h           |  36 ++++---
> > >  include/linux/mmap_lock.h          |   6 --
> > >  include/linux/rcuwait.h            |  13 +--
> > >  include/linux/refcount.h           |  20 +++-
> > >  include/linux/slab.h               |   6 --
> > >  include/linux/types.h              |  12 +++
> > >  kernel/fork.c                      | 128 +++++++++++-------------
> > >  mm/debug.c                         |  12 +++
> > >  mm/init-mm.c                       |   1 +
> > >  mm/memory.c                        |  94 +++++++++++++++---
> > >  mm/mmap.c                          |   3 +-
> > >  mm/userfaultfd.c                   |  32 +++---
> > >  mm/vma.c                           |  23 ++---
> > >  mm/vma.h                           |  15 ++-
> > >  tools/testing/vma/linux/atomic.h   |   5 +
> > >  tools/testing/vma/vma_internal.h   |  93 ++++++++----------
> > >  18 files changed, 435 insertions(+), 260 deletions(-)
> > >
> >


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v8 00/16] move per-vma lock into vm_area_struct
  2025-01-09 15:59   ` Suren Baghdasaryan
@ 2025-01-10  0:16     ` Suren Baghdasaryan
  2025-01-10 15:36       ` Vlastimil Babka
  0 siblings, 1 reply; 53+ messages in thread
From: Suren Baghdasaryan @ 2025-01-10  0:16 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: akpm, peterz, willy, liam.howlett, lorenzo.stoakes, mhocko,
	hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
	jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
	richard.weiyang, corbet, linux-doc, linux-mm, linux-kernel,
	kernel-team

On Thu, Jan 9, 2025 at 7:59 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Thu, Jan 9, 2025 at 5:40 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> >
> > Btw the subject became rather incomplete given all the series does :)
>
> Missed this one. What do you think is worth mentioning here? It's
> true, the patchset does many small things but I wanted to outline the
> main conceptual changes. Please LMK if you think there are more
> changes big enough to be mentioned here.

I just realized that your comment was only about the subject of this
cover letter. Maybe something like this:

per-vma lock and vm_area_struct cache optimizations

Would that be better?

>
> >
> > On 1/9/25 3:30 AM, Suren Baghdasaryan wrote:
> > > Back when per-vma locks were introduces, vm_lock was moved out of
> > > vm_area_struct in [1] because of the performance regression caused by
> > > false cacheline sharing. Recent investigation [2] revealed that the
> > > regressions is limited to a rather old Broadwell microarchitecture and
> > > even there it can be mitigated by disabling adjacent cacheline
> > > prefetching, see [3].
> > > Splitting single logical structure into multiple ones leads to more
> > > complicated management, extra pointer dereferences and overall less
> > > maintainable code. When that split-away part is a lock, it complicates
> > > things even further. With no performance benefits, there are no reasons
> > > for this split. Merging the vm_lock back into vm_area_struct also allows
> > > vm_area_struct to use SLAB_TYPESAFE_BY_RCU later in this patchset.
> > > This patchset:
> > > 1. moves vm_lock back into vm_area_struct, aligning it at the cacheline
> > > boundary and changing the cache to be cacheline-aligned to minimize
> > > cacheline sharing;
> > > 2. changes vm_area_struct initialization to mark new vma as detached until
> > > it is inserted into vma tree;
> > > 3. replaces vm_lock and vma->detached flag with a reference counter;
> > > 4. changes vm_area_struct cache to SLAB_TYPESAFE_BY_RCU to allow for their
> > > reuse and to minimize call_rcu() calls.
> > >
> > > Pagefault microbenchmarks show performance improvement:
> > > Hmean     faults/cpu-1    507926.5547 (   0.00%)   506519.3692 *  -0.28%*
> > > Hmean     faults/cpu-4    479119.7051 (   0.00%)   481333.6802 *   0.46%*
> > > Hmean     faults/cpu-7    452880.2961 (   0.00%)   455845.6211 *   0.65%*
> > > Hmean     faults/cpu-12   347639.1021 (   0.00%)   352004.2254 *   1.26%*
> > > Hmean     faults/cpu-21   200061.2238 (   0.00%)   229597.0317 *  14.76%*
> > > Hmean     faults/cpu-30   145251.2001 (   0.00%)   164202.5067 *  13.05%*
> > > Hmean     faults/cpu-48   106848.4434 (   0.00%)   120641.5504 *  12.91%*
> > > Hmean     faults/cpu-56    92472.3835 (   0.00%)   103464.7916 *  11.89%*
> > > Hmean     faults/sec-1    507566.1468 (   0.00%)   506139.0811 *  -0.28%*
> > > Hmean     faults/sec-4   1880478.2402 (   0.00%)  1886795.6329 *   0.34%*
> > > Hmean     faults/sec-7   3106394.3438 (   0.00%)  3140550.7485 *   1.10%*
> > > Hmean     faults/sec-12  4061358.4795 (   0.00%)  4112477.0206 *   1.26%*
> > > Hmean     faults/sec-21  3988619.1169 (   0.00%)  4577747.1436 *  14.77%*
> > > Hmean     faults/sec-30  3909839.5449 (   0.00%)  4311052.2787 *  10.26%*
> > > Hmean     faults/sec-48  4761108.4691 (   0.00%)  5283790.5026 *  10.98%*
> > > Hmean     faults/sec-56  4885561.4590 (   0.00%)  5415839.4045 *  10.85%*
> >
> > Given how patch 2 discusses memory growth due to moving the lock, should
> > also patch 11 discuss how the replacement with refcount reduces the
> > memory footprint? And/or the cover letter could summarize the impact of
> > the whole series in that aspect? Perhaps the refcount doesn't reduce
> > anything as it's smaller but sits alone in the cacheline? Could it be
> > grouped with some non-hot fields instead as a followup, so could we get
> > to <=192 (non-debug) size without impacting performance?
> >
> > > Changes since v7 [4]:
> > > - Removed additional parameter for vma_iter_store() and introduced
> > > vma_iter_store_attached() instead, per Vlastimil Babka and
> > > Liam R. Howlett
> > > - Fixed coding style nits, per Vlastimil Babka
> > > - Added Reviewed-bys and Acked-bys, per Vlastimil Babka
> > > - Added Reviewed-bys and Acked-bys, per Liam R. Howlett
> > > - Added Acked-by, per Davidlohr Bueso
> > > - Removed unnecessary patch changing nommu.c
> > > - Folded a fixup patch [5] into the patch it was fixing
> > > - Changed calculation in __refcount_add_not_zero_limited() to avoid
> > > overflow, to change the limit to be inclusive and to use INT_MAX to
> > > indicate no limits, per Vlastimil Babka and Matthew Wilcox
> > > - Folded a fixup patch [6] into the patch it was fixing
> > > - Added vm_refcnt rules summary in the changelog, per Liam R. Howlett
> > > - Changed writers to not increment vm_refcnt and adjusted VMA_REF_LIMIT
> > > to not reserve one count for a writer, per Liam R. Howlett
> > > - Changed vma_refcount_put() to wake up writers only when the last reader
> > > is leaving, per Liam R. Howlett
> > > - Fixed rwsem_acquire_read() parameters when read-locking a vma to match
> > > the way down_read_trylock() does lockdep, per Vlastimil Babka
> > > - Folded vma_lockdep_init() into vma_lock_init() for simplicity
> > > - Brought back vma_copy() to keep vm_refcount at 0 during reuse,
> > > per Vlastimil Babka
> > >
> > > What I did not include in this patchset:
> > > - Liam's suggestion to change dump_vma() output, since it's unclear to me
> > > what it should look like. The patch is for debug only and not critical for
> > > the rest of the series; we can change the output later or even drop it if
> > > necessary.
> > >
> > > [1] https://lore.kernel.org/all/20230227173632.3292573-34-surenb@google.com/
> > > [2] https://lore.kernel.org/all/ZsQyI%2F087V34JoIt@xsang-OptiPlex-9020/
> > > [3] https://lore.kernel.org/all/CAJuCfpEisU8Lfe96AYJDZ+OM4NoPmnw9bP53cT_kbfP_pR+-2g@mail.gmail.com/
> > > [4] https://lore.kernel.org/all/20241226170710.1159679-1-surenb@google.com/
> > > [5] https://lore.kernel.org/all/20250107030415.721474-1-surenb@google.com/
> > > [6] https://lore.kernel.org/all/20241226200335.1250078-1-surenb@google.com/
> > >
> > > Patchset applies over mm-unstable after reverting v7
> > > (current SHA range: 588f0086398e - fb2270654630)
> > >
> > > Suren Baghdasaryan (16):
> > >   mm: introduce vma_start_read_locked{_nested} helpers
> > >   mm: move per-vma lock into vm_area_struct
> > >   mm: mark vma as detached until it's added into vma tree
> > >   mm: introduce vma_iter_store_attached() to use with attached vmas
> > >   mm: mark vmas detached upon exit
> > >   types: move struct rcuwait into types.h
> > >   mm: allow vma_start_read_locked/vma_start_read_locked_nested to fail
> > >   mm: move mmap_init_lock() out of the header file
> > >   mm: uninline the main body of vma_start_write()
> > >   refcount: introduce __refcount_{add|inc}_not_zero_limited
> > >   mm: replace vm_lock and detached flag with a reference count
> > >   mm/debug: print vm_refcnt state when dumping the vma
> > >   mm: remove extra vma_numab_state_init() call
> > >   mm: prepare lock_vma_under_rcu() for vma reuse possibility
> > >   mm: make vma cache SLAB_TYPESAFE_BY_RCU
> > >   docs/mm: document latest changes to vm_lock
> > >
> > >  Documentation/mm/process_addrs.rst |  44 +++++----
> > >  include/linux/mm.h                 | 152 ++++++++++++++++++++++-------
> > >  include/linux/mm_types.h           |  36 ++++---
> > >  include/linux/mmap_lock.h          |   6 --
> > >  include/linux/rcuwait.h            |  13 +--
> > >  include/linux/refcount.h           |  20 +++-
> > >  include/linux/slab.h               |   6 --
> > >  include/linux/types.h              |  12 +++
> > >  kernel/fork.c                      | 128 +++++++++++-------------
> > >  mm/debug.c                         |  12 +++
> > >  mm/init-mm.c                       |   1 +
> > >  mm/memory.c                        |  94 +++++++++++++++---
> > >  mm/mmap.c                          |   3 +-
> > >  mm/userfaultfd.c                   |  32 +++---
> > >  mm/vma.c                           |  23 ++---
> > >  mm/vma.h                           |  15 ++-
> > >  tools/testing/vma/linux/atomic.h   |   5 +
> > >  tools/testing/vma/vma_internal.h   |  93 ++++++++----------
> > >  18 files changed, 435 insertions(+), 260 deletions(-)
> > >
> >


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v8 11/16] mm: replace vm_lock and detached flag with a reference count
  2025-01-09  2:30 ` [PATCH v8 11/16] mm: replace vm_lock and detached flag with a reference count Suren Baghdasaryan
  2025-01-09 10:35   ` Hillf Danton
@ 2025-01-10 14:34   ` Vlastimil Babka
  2025-01-10 15:56     ` Suren Baghdasaryan
  1 sibling, 1 reply; 53+ messages in thread
From: Vlastimil Babka @ 2025-01-10 14:34 UTC (permalink / raw)
  To: Suren Baghdasaryan, akpm
  Cc: peterz, willy, liam.howlett, lorenzo.stoakes, mhocko, hannes,
	mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
	jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
	richard.weiyang, corbet, linux-doc, linux-mm, linux-kernel,
	kernel-team

On 1/9/25 3:30 AM, Suren Baghdasaryan wrote:
> rw_semaphore is a sizable structure of 40 bytes and consumes
> considerable space for each vm_area_struct. However vma_lock has
> two important specifics which can be used to replace rw_semaphore
> with a simpler structure:
> 1. Readers never wait. They try to take the vma_lock and fall back to
> mmap_lock if that fails.
> 2. Only one writer at a time will ever try to write-lock a vma_lock
> because writers first take mmap_lock in write mode.
> Because of these requirements, full rw_semaphore functionality is not
> needed and we can replace rw_semaphore and the vma->detached flag with
> a refcount (vm_refcnt).
> When vma is in detached state, vm_refcnt is 0 and only a call to
> vma_mark_attached() can take it out of this state. Note that unlike
> before, now we enforce both vma_mark_attached() and vma_mark_detached()
> to be done only after vma has been write-locked. vma_mark_attached()
> changes vm_refcnt to 1 to indicate that it has been attached to the vma
> tree. When a reader takes read lock, it increments vm_refcnt, unless the
> top usable bit of vm_refcnt (0x40000000) is set, indicating presence of
> a writer. When writer takes write lock, it sets the top usable bit to
> indicate its presence. If there are readers, writer will wait using newly
> introduced mm->vma_writer_wait. Since all writers take mmap_lock in write
> mode first, there can be only one writer at a time. The last reader to
> release the lock will signal the writer to wake up.
> refcount might overflow if there are many competing readers, in which case
> read-locking will fail. Readers are expected to handle such failures.
> In summary:
> 1. all readers increment the vm_refcnt;
> 2. writer sets top usable (writer) bit of vm_refcnt;
> 3. readers cannot increment the vm_refcnt if the writer bit is set;
> 4. in the presence of readers, writer must wait for the vm_refcnt to drop
> to 1 (ignoring the writer bit), indicating an attached vma with no readers;
> 5. vm_refcnt overflow is handled by the readers.
> 
> Suggested-by: Peter Zijlstra <peterz@infradead.org>
> Suggested-by: Matthew Wilcox <willy@infradead.org>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>

Reviewed-by: Vlastimil Babka <vbabka@suse.cz>

But I think there's a problem that will manifest after patch 15.
Also, I don't feel qualified enough about the lockdep parts (although I
think I spotted another issue with those, below), so it's best if
PeterZ can review those.
Some nits below too.

> +
> +static inline void vma_refcount_put(struct vm_area_struct *vma)
> +{
> +	int oldcnt;
> +
> +	if (!__refcount_dec_and_test(&vma->vm_refcnt, &oldcnt)) {
> +		rwsem_release(&vma->vmlock_dep_map, _RET_IP_);

Shouldn't we rwsem_release always? And also shouldn't it precede the
refcount operation itself?

> +		if (is_vma_writer_only(oldcnt - 1))
> +			rcuwait_wake_up(&vma->vm_mm->vma_writer_wait);

Hmm hmm we should maybe read the vm_mm pointer before dropping the
refcount? In case this races in a way that is_vma_writer_only tests true
but the writer meanwhile finishes and frees the vma. It's safe now but
not after making the cache SLAB_TYPESAFE_BY_RCU ?

> +	}
> +}
> +

>  static inline void vma_end_read(struct vm_area_struct *vma)
>  {
>  	rcu_read_lock(); /* keeps vma alive till the end of up_read */

This should refer to vma_refcount_put(). But after fixing it I think we
could stop doing this altogether? It will no longer keep vma "alive"
with SLAB_TYPESAFE_BY_RCU.

> -	up_read(&vma->vm_lock.lock);
> +	vma_refcount_put(vma);
>  	rcu_read_unlock();
>  }
>  

<snip>

> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -6370,9 +6370,41 @@ struct vm_area_struct *lock_mm_and_find_vma(struct mm_struct *mm,
>  #endif
>  
>  #ifdef CONFIG_PER_VMA_LOCK
> +static inline bool __vma_enter_locked(struct vm_area_struct *vma, unsigned int tgt_refcnt)
> +{
> +	/*
> +	 * If vma is detached then only vma_mark_attached() can raise the
> +	 * vm_refcnt. mmap_write_lock prevents racing with vma_mark_attached().
> +	 */
> +	if (!refcount_add_not_zero(VMA_LOCK_OFFSET, &vma->vm_refcnt))
> +		return false;
> +
> +	rwsem_acquire(&vma->vmlock_dep_map, 0, 0, _RET_IP_);
> +	rcuwait_wait_event(&vma->vm_mm->vma_writer_wait,
> +		   refcount_read(&vma->vm_refcnt) == tgt_refcnt,
> +		   TASK_UNINTERRUPTIBLE);
> +	lock_acquired(&vma->vmlock_dep_map, _RET_IP_);
> +
> +	return true;
> +}
> +
> +static inline void __vma_exit_locked(struct vm_area_struct *vma, bool *detached)
> +{
> +	*detached = refcount_sub_and_test(VMA_LOCK_OFFSET, &vma->vm_refcnt);
> +	rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
> +}
> +
>  void __vma_start_write(struct vm_area_struct *vma, unsigned int mm_lock_seq)
>  {
> -	down_write(&vma->vm_lock.lock);
> +	bool locked;
> +
> +	/*
> +	 * __vma_enter_locked() returns false immediately if the vma is not
> +	 * attached, otherwise it waits until refcnt is (VMA_LOCK_OFFSET + 1)
> +	 * indicating that vma is attached with no readers.
> +	 */
> +	locked = __vma_enter_locked(vma, VMA_LOCK_OFFSET + 1);

Wonder if it would be slightly better if tgt_refcount was just 1 (or 0
below in vma_mark_detached()) and the VMA_LOCK_OFFSET added to it in
__vma_enter_locked() itself as it's the one adding it in the first place.



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v8 15/16] mm: make vma cache SLAB_TYPESAFE_BY_RCU
  2025-01-09  2:30 ` [PATCH v8 15/16] mm: make vma cache SLAB_TYPESAFE_BY_RCU Suren Baghdasaryan
@ 2025-01-10 15:32   ` Vlastimil Babka
  2025-01-10 16:07     ` Suren Baghdasaryan
  2025-01-10 17:47   ` Liam R. Howlett
  1 sibling, 1 reply; 53+ messages in thread
From: Vlastimil Babka @ 2025-01-10 15:32 UTC (permalink / raw)
  To: Suren Baghdasaryan, akpm
  Cc: peterz, willy, liam.howlett, lorenzo.stoakes, mhocko, hannes,
	mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
	jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
	richard.weiyang, corbet, linux-doc, linux-mm, linux-kernel,
	kernel-team

On 1/9/25 3:30 AM, Suren Baghdasaryan wrote:
> To enable SLAB_TYPESAFE_BY_RCU for vma cache we need to ensure that
> object reuse before RCU grace period is over will be detected by
> lock_vma_under_rcu().
> Current checks are sufficient as long as vma is detached before it is
> freed. The only place this is not currently happening is in exit_mmap().
> Add the missing vma_mark_detached() in exit_mmap().
> Another issue which might trick lock_vma_under_rcu() during vma reuse
> is vm_area_dup(), which copies the entire content of the vma into a new
> one, overriding new vma's vm_refcnt and temporarily making it appear as
> attached. This might trick a racing lock_vma_under_rcu() to operate on
> a reused vma if it found the vma before it got reused. To prevent this
> situation, we should ensure that vm_refcnt stays at detached state (0)
> when it is copied and advances to attached state only after it is added
> into the vma tree. Introduce vma_copy() which preserves new vma's
> vm_refcnt and use it in vm_area_dup(). Since all vmas are in detached
> state with no current readers when they are freed, lock_vma_under_rcu()
> will not be able to take vm_refcnt after vma got detached even if vma
> is reused.
> Finally, make vm_area_cachep SLAB_TYPESAFE_BY_RCU. This will facilitate
> vm_area_struct reuse and will minimize the number of call_rcu() calls.
> 
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>

Reviewed-by: Vlastimil Babka <vbabka@suse.cz>

You could also drop the reset_refcnt parameter of vma_lock_init() now,
as the usage in vm_area_dup() should now be just setting 0 over 0. Maybe
a VM_WARN_ON if it's not 0 already?
And a comment in vm_area_struct definition to consider vma_copy() when
adding any new field?

> +	/*
> +	 * src->shared.rb may be modified concurrently, but the clone
> +	 * will be reinitialized.
> +	 */
> +	data_race(memcpy(&dest->shared, &src->shared, sizeof(dest->shared)));

The comment makes it sound as if we didn't need to do it at all? But I
didn't verify. If we do need it in some cases (i.e. the just-allocated
vma might have garbage from a previous lifetime, but src is well defined
and it's a case where it's not reinitialized afterwards), maybe the
comment should say so? Or, if it's either reinitialized later or zeroed at
src, we could memset() the zeroes instead of memcpying them, etc.




^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v8 00/16] move per-vma lock into vm_area_struct
  2025-01-10  0:16     ` Suren Baghdasaryan
@ 2025-01-10 15:36       ` Vlastimil Babka
  2025-01-10 16:08         ` Suren Baghdasaryan
  0 siblings, 1 reply; 53+ messages in thread
From: Vlastimil Babka @ 2025-01-10 15:36 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, peterz, willy, liam.howlett, lorenzo.stoakes, mhocko,
	hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
	jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
	richard.weiyang, corbet, linux-doc, linux-mm, linux-kernel,
	kernel-team



On 1/10/25 1:16 AM, Suren Baghdasaryan wrote:
> On Thu, Jan 9, 2025 at 7:59 AM Suren Baghdasaryan <surenb@google.com> wrote:
>>
>> On Thu, Jan 9, 2025 at 5:40 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>>>
>>> Btw the subject became rather incomplete given all the series does :)
>>
>> Missed this one. What do you think is worth mentioning here? It's
>> true, the patchset does many small things but I wanted to outline the
>> main conceptual changes. Please LMK if you think there are more
>> changes big enough to be mentioned here.
> 
> I just realized that your comment was only about the subject of this
> cover letter. Maybe something like this:
> 
> per-vma lock and vm_area_struct cache optimizations

arguably the biggest change here is:

reimplement per-vma lock as a refcount

but yours is ok too, I don't want to bikeshed



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v8 11/16] mm: replace vm_lock and detached flag with a reference count
  2025-01-10 14:34   ` Vlastimil Babka
@ 2025-01-10 15:56     ` Suren Baghdasaryan
  2025-01-10 16:47       ` Suren Baghdasaryan
  2025-01-10 22:26       ` Vlastimil Babka
  0 siblings, 2 replies; 53+ messages in thread
From: Suren Baghdasaryan @ 2025-01-10 15:56 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: akpm, peterz, willy, liam.howlett, lorenzo.stoakes, mhocko,
	hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
	jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
	richard.weiyang, corbet, linux-doc, linux-mm, linux-kernel,
	kernel-team

On Fri, Jan 10, 2025 at 6:33 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 1/9/25 3:30 AM, Suren Baghdasaryan wrote:
> > rw_semaphore is a sizable structure of 40 bytes and consumes
> > considerable space for each vm_area_struct. However vma_lock has
> > two important specifics which can be used to replace rw_semaphore
> > with a simpler structure:
> > 1. Readers never wait. They try to take the vma_lock and fall back to
> > mmap_lock if that fails.
> > 2. Only one writer at a time will ever try to write-lock a vma_lock
> > because writers first take mmap_lock in write mode.
> > Because of these requirements, full rw_semaphore functionality is not
> > needed and we can replace rw_semaphore and the vma->detached flag with
> > a refcount (vm_refcnt).
> > When vma is in detached state, vm_refcnt is 0 and only a call to
> > vma_mark_attached() can take it out of this state. Note that unlike
> > before, now we enforce both vma_mark_attached() and vma_mark_detached()
> > to be done only after vma has been write-locked. vma_mark_attached()
> > changes vm_refcnt to 1 to indicate that it has been attached to the vma
> > tree. When a reader takes read lock, it increments vm_refcnt, unless the
> > top usable bit of vm_refcnt (0x40000000) is set, indicating presence of
> > a writer. When writer takes write lock, it sets the top usable bit to
> > indicate its presence. If there are readers, writer will wait using newly
> > introduced mm->vma_writer_wait. Since all writers take mmap_lock in write
> > mode first, there can be only one writer at a time. The last reader to
> > release the lock will signal the writer to wake up.
> > refcount might overflow if there are many competing readers, in which case
> > read-locking will fail. Readers are expected to handle such failures.
> > In summary:
> > 1. all readers increment the vm_refcnt;
> > 2. writer sets top usable (writer) bit of vm_refcnt;
> > 3. readers cannot increment the vm_refcnt if the writer bit is set;
> > 4. in the presence of readers, writer must wait for the vm_refcnt to drop
> > to 1 (ignoring the writer bit), indicating an attached vma with no readers;
> > 5. vm_refcnt overflow is handled by the readers.
> >
> > Suggested-by: Peter Zijlstra <peterz@infradead.org>
> > Suggested-by: Matthew Wilcox <willy@infradead.org>
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
>
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
>
> But I think there's a problem that will manifest after patch 15.
> Also I don't feel qualified enough about the lockdep parts though
> (although I think I spotted another issue with those, below) so best if
> PeterZ can review those.
> Some nits below too.
>
> > +
> > +static inline void vma_refcount_put(struct vm_area_struct *vma)
> > +{
> > +     int oldcnt;
> > +
> > +     if (!__refcount_dec_and_test(&vma->vm_refcnt, &oldcnt)) {
> > +             rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
>
> Shouldn't we rwsem_release always? And also shouldn't it precede the
> refcount operation itself?

Yes. Hillf pointed to the same issue. It will be fixed in the next version.

>
> > +             if (is_vma_writer_only(oldcnt - 1))
> > +                     rcuwait_wake_up(&vma->vm_mm->vma_writer_wait);
>
> Hmm hmm we should maybe read the vm_mm pointer before dropping the
> refcount? In case this races in a way that is_vma_writer_only tests true
> but the writer meanwhile finishes and frees the vma. It's safe now but
> not after making the cache SLAB_TYPESAFE_BY_RCU ?

Hmm. But if is_vma_writer_only() is true, that means the writer is
blocked and is waiting for the reader to drop the vm_refcnt. IOW, it
won't proceed and free the vma until the reader calls
rcuwait_wake_up(). Your suggested change is trivial and I can do it
but I want to make sure I'm not missing something. Am I?

>
> > +     }
> > +}
> > +
>
> >  static inline void vma_end_read(struct vm_area_struct *vma)
> >  {
> >       rcu_read_lock(); /* keeps vma alive till the end of up_read */
>
> This should refer to vma_refcount_put(). But after fixing it I think we
> could stop doing this altogether? It will no longer keep vma "alive"
> with SLAB_TYPESAFE_BY_RCU.

Yeah, I think the comment along with rcu_read_lock()/rcu_read_unlock()
here can be safely removed.
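
(With those gone, vma_end_read() would reduce to just the refcount drop,
roughly:)

static inline void vma_end_read(struct vm_area_struct *vma)
{
	vma_refcount_put(vma);
}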

>
> > -     up_read(&vma->vm_lock.lock);
> > +     vma_refcount_put(vma);
> >       rcu_read_unlock();
> >  }
> >
>
> <snip>
>
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -6370,9 +6370,41 @@ struct vm_area_struct *lock_mm_and_find_vma(struct mm_struct *mm,
> >  #endif
> >
> >  #ifdef CONFIG_PER_VMA_LOCK
> > +static inline bool __vma_enter_locked(struct vm_area_struct *vma, unsigned int tgt_refcnt)
> > +{
> > +     /*
> > +      * If vma is detached then only vma_mark_attached() can raise the
> > +      * vm_refcnt. mmap_write_lock prevents racing with vma_mark_attached().
> > +      */
> > +     if (!refcount_add_not_zero(VMA_LOCK_OFFSET, &vma->vm_refcnt))
> > +             return false;
> > +
> > +     rwsem_acquire(&vma->vmlock_dep_map, 0, 0, _RET_IP_);
> > +     rcuwait_wait_event(&vma->vm_mm->vma_writer_wait,
> > +                refcount_read(&vma->vm_refcnt) == tgt_refcnt,
> > +                TASK_UNINTERRUPTIBLE);
> > +     lock_acquired(&vma->vmlock_dep_map, _RET_IP_);
> > +
> > +     return true;
> > +}
> > +
> > +static inline void __vma_exit_locked(struct vm_area_struct *vma, bool *detached)
> > +{
> > +     *detached = refcount_sub_and_test(VMA_LOCK_OFFSET, &vma->vm_refcnt);
> > +     rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
> > +}
> > +
> >  void __vma_start_write(struct vm_area_struct *vma, unsigned int mm_lock_seq)
> >  {
> > -     down_write(&vma->vm_lock.lock);
> > +     bool locked;
> > +
> > +     /*
> > +      * __vma_enter_locked() returns false immediately if the vma is not
> > +      * attached, otherwise it waits until refcnt is (VMA_LOCK_OFFSET + 1)
> > +      * indicating that vma is attached with no readers.
> > +      */
> > +     locked = __vma_enter_locked(vma, VMA_LOCK_OFFSET + 1);
>
> Wonder if it would be slightly better if tgt_refcount was just 1 (or 0
> below in vma_mark_detached()) and the VMA_LOCK_OFFSET added to it in
> __vma_enter_locked() itself as it's the one adding it in the first place.

Well, it won't be called tgt_refcount then. Maybe "bool vma_attached"
and inside __vma_enter_locked() we do:

unsigned int tgt_refcnt = VMA_LOCK_OFFSET + (vma_attached ? 1 : 0);

Is that better?
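
Something like this, i.e. a sketch with the offset math folded into the
helper (the parameter name and exact shape are just a suggestion, not
final code):

static inline bool __vma_enter_locked(struct vm_area_struct *vma, bool vma_attached)
{
	/* An attached vma settles at 1 (plus the writer bit), a detached one at 0 */
	unsigned int tgt_refcnt = VMA_LOCK_OFFSET + (vma_attached ? 1 : 0);

	/*
	 * If vma is detached then only vma_mark_attached() can raise the
	 * vm_refcnt. mmap_write_lock prevents racing with vma_mark_attached().
	 */
	if (!refcount_add_not_zero(VMA_LOCK_OFFSET, &vma->vm_refcnt))
		return false;

	rwsem_acquire(&vma->vmlock_dep_map, 0, 0, _RET_IP_);
	rcuwait_wait_event(&vma->vm_mm->vma_writer_wait,
		   refcount_read(&vma->vm_refcnt) == tgt_refcnt,
		   TASK_UNINTERRUPTIBLE);
	lock_acquired(&vma->vmlock_dep_map, _RET_IP_);

	return true;
}

__vma_start_write() would then pass true and vma_mark_detached() false,
matching the 1 vs 0 above.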

>


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v8 15/16] mm: make vma cache SLAB_TYPESAFE_BY_RCU
  2025-01-10 15:32   ` Vlastimil Babka
@ 2025-01-10 16:07     ` Suren Baghdasaryan
  2025-01-10 22:14       ` Vlastimil Babka
  2025-01-11  3:37       ` Suren Baghdasaryan
  0 siblings, 2 replies; 53+ messages in thread
From: Suren Baghdasaryan @ 2025-01-10 16:07 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: akpm, peterz, willy, liam.howlett, lorenzo.stoakes, mhocko,
	hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
	jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
	richard.weiyang, corbet, linux-doc, linux-mm, linux-kernel,
	kernel-team

On Fri, Jan 10, 2025 at 7:31 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 1/9/25 3:30 AM, Suren Baghdasaryan wrote:
> > To enable SLAB_TYPESAFE_BY_RCU for vma cache we need to ensure that
> > object reuse before RCU grace period is over will be detected by
> > lock_vma_under_rcu().
> > Current checks are sufficient as long as vma is detached before it is
> > freed. The only place this is not currently happening is in exit_mmap().
> > Add the missing vma_mark_detached() in exit_mmap().
> > Another issue which might trick lock_vma_under_rcu() during vma reuse
> > is vm_area_dup(), which copies the entire content of the vma into a new
> > one, overriding new vma's vm_refcnt and temporarily making it appear as
> > attached. This might trick a racing lock_vma_under_rcu() to operate on
> > a reused vma if it found the vma before it got reused. To prevent this
> > situation, we should ensure that vm_refcnt stays at detached state (0)
> > when it is copied and advances to attached state only after it is added
> > into the vma tree. Introduce vma_copy() which preserves new vma's
> > vm_refcnt and use it in vm_area_dup(). Since all vmas are in detached
> > state with no current readers when they are freed, lock_vma_under_rcu()
> > will not be able to take vm_refcnt after vma got detached even if vma
> > is reused.
> > Finally, make vm_area_cachep SLAB_TYPESAFE_BY_RCU. This will facilitate
> > vm_area_struct reuse and will minimize the number of call_rcu() calls.
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
>
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
>
> You could also drop the reset_refcnt parameter of vma_lock_init() now,
> as the usage in vm_area_dup() should now be just setting 0 over 0. Maybe
> a VM_WARN_ON if it's not 0 already?

Yeah, that's a good idea. Will do.

> And a comment in vm_area_struct definition to consider vma_copy() when
> adding any new field?

Sure, will add.

>
> > +     /*
> > +      * src->shared.rb may be modified concurrently, but the clone
> > +      * will be reinitialized.
> > +      */
> > +     data_race(memcpy(&dest->shared, &src->shared, sizeof(dest->shared)));
>
> The comment makes it sound as if we didn't need to do it at all? But I
> didn't verify. If we do need it in some cases (i.e. the just allocated
> vma might have garbage from previous lifetime, but src is well defined
> and it's a case where it's not reinitialized afterwards) maybe the
> comment should say? Or if it's either reinitialized later or zeroes at
> src, we could memset() the zeroes instead of memcpying them, etc.

I see vm_area_dup() being used in dup_mmap() and I think this comment
is about that usage, in case the src vma changes from under us.
However, vm_area_dup() is also used when we simply duplicate an
existing vma while holding the mmap_lock in write mode, like in __split_vma().
In these cases there is no possibility of a race and the copied value
should hold. Maybe I should amend the comment like this:

/*
 * src->shared.rb may be modified concurrently when called from dup_mmap(),
 * but the clone will reinitialize it.
 */

WDYT?

>
>


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v8 00/16] move per-vma lock into vm_area_struct
  2025-01-10 15:36       ` Vlastimil Babka
@ 2025-01-10 16:08         ` Suren Baghdasaryan
  0 siblings, 0 replies; 53+ messages in thread
From: Suren Baghdasaryan @ 2025-01-10 16:08 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: akpm, peterz, willy, liam.howlett, lorenzo.stoakes, mhocko,
	hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
	jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
	richard.weiyang, corbet, linux-doc, linux-mm, linux-kernel,
	kernel-team

On Fri, Jan 10, 2025 at 7:35 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
>
>
> On 1/10/25 1:16 AM, Suren Baghdasaryan wrote:
> > On Thu, Jan 9, 2025 at 7:59 AM Suren Baghdasaryan <surenb@google.com> wrote:
> >>
> >> On Thu, Jan 9, 2025 at 5:40 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> >>>
> >>> Btw the subject became rather incomplete given all the series does :)
> >>
> >> Missed this one. What do you think is worth mentioning here? It's
> >> true, the patchset does many small things but I wanted to outline the
> >> main conceptual changes. Please LMK if you think there are more
> >> changes big enough to be mentioned here.
> >
> > I just realized that your comment was only about the subject of this
> > cover letter. Maybe something like this:
> >
> > per-vma lock and vm_area_struct cache optimizations
>
> arguably the biggest change here is:
>
> reimplement per-vma lock as a refcount

Ok, I'll use that. Thanks!

>
> but yours is ok too, I don't want to bikeshed
>


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v8 11/16] mm: replace vm_lock and detached flag with a reference count
  2025-01-10 15:56     ` Suren Baghdasaryan
@ 2025-01-10 16:47       ` Suren Baghdasaryan
  2025-01-10 16:50         ` Suren Baghdasaryan
  2025-01-10 22:26       ` Vlastimil Babka
  1 sibling, 1 reply; 53+ messages in thread
From: Suren Baghdasaryan @ 2025-01-10 16:47 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: akpm, peterz, willy, liam.howlett, lorenzo.stoakes, mhocko,
	hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
	jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
	richard.weiyang, corbet, linux-doc, linux-mm, linux-kernel,
	kernel-team

On Fri, Jan 10, 2025 at 7:56 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Fri, Jan 10, 2025 at 6:33 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> >
> > On 1/9/25 3:30 AM, Suren Baghdasaryan wrote:
> > > rw_semaphore is a sizable structure of 40 bytes and consumes
> > > considerable space for each vm_area_struct. However vma_lock has
> > > two important specifics which can be used to replace rw_semaphore
> > > with a simpler structure:
> > > 1. Readers never wait. They try to take the vma_lock and fall back to
> > > mmap_lock if that fails.
> > > 2. Only one writer at a time will ever try to write-lock a vma_lock
> > > because writers first take mmap_lock in write mode.
> > > Because of these requirements, full rw_semaphore functionality is not
> > > needed and we can replace rw_semaphore and the vma->detached flag with
> > > a refcount (vm_refcnt).
> > > When vma is in detached state, vm_refcnt is 0 and only a call to
> > > vma_mark_attached() can take it out of this state. Note that unlike
> > > before, now we enforce both vma_mark_attached() and vma_mark_detached()
> > > to be done only after vma has been write-locked. vma_mark_attached()
> > > changes vm_refcnt to 1 to indicate that it has been attached to the vma
> > > tree. When a reader takes read lock, it increments vm_refcnt, unless the
> > > top usable bit of vm_refcnt (0x40000000) is set, indicating presence of
> > > a writer. When writer takes write lock, it sets the top usable bit to
> > > indicate its presence. If there are readers, writer will wait using newly
> > > introduced mm->vma_writer_wait. Since all writers take mmap_lock in write
> > > mode first, there can be only one writer at a time. The last reader to
> > > release the lock will signal the writer to wake up.
> > > refcount might overflow if there are many competing readers, in which case
> > > read-locking will fail. Readers are expected to handle such failures.
> > > In summary:
> > > 1. all readers increment the vm_refcnt;
> > > 2. writer sets top usable (writer) bit of vm_refcnt;
> > > 3. readers cannot increment the vm_refcnt if the writer bit is set;
> > > 4. in the presence of readers, writer must wait for the vm_refcnt to drop
> > > to 1 (ignoring the writer bit), indicating an attached vma with no readers;
> > > 5. vm_refcnt overflow is handled by the readers.
> > >
> > > Suggested-by: Peter Zijlstra <peterz@infradead.org>
> > > Suggested-by: Matthew Wilcox <willy@infradead.org>
> > > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> >
> > Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> >
> > But I think there's a problem that will manifest after patch 15.
> > Also I don't feel qualified enough about the lockdep parts though
> > (although I think I spotted another issue with those, below) so best if
> > PeterZ can review those.
> > Some nits below too.
> >
> > > +
> > > +static inline void vma_refcount_put(struct vm_area_struct *vma)
> > > +{
> > > +     int oldcnt;
> > > +
> > > +     if (!__refcount_dec_and_test(&vma->vm_refcnt, &oldcnt)) {
> > > +             rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
> >
> > Shouldn't we rwsem_release always? And also shouldn't it precede the
> > refcount operation itself?
>
> Yes. Hillf pointed to the same issue. It will be fixed in the next version.
>
> >
> > > +             if (is_vma_writer_only(oldcnt - 1))
> > > +                     rcuwait_wake_up(&vma->vm_mm->vma_writer_wait);
> >
> > Hmm hmm we should maybe read the vm_mm pointer before dropping the
> > refcount? In case this races in a way that is_vma_writer_only tests true
> > but the writer meanwhile finishes and frees the vma. It's safe now but
> > not after making the cache SLAB_TYPESAFE_BY_RCU ?
>
> > Hmm. But if is_vma_writer_only() is true, that means the writer is
> blocked and is waiting for the reader to drop the vm_refcnt. IOW, it
> won't proceed and free the vma until the reader calls
> rcuwait_wake_up(). Your suggested change is trivial and I can do it
> but I want to make sure I'm not missing something. Am I?

Ok, after thinking some more, I think the race you might be referring
to is this:

writer                                           reader

    __vma_enter_locked
        refcount_add_not_zero(VMA_LOCK_OFFSET, ...)
                                                    vma_refcount_put
                                                       __refcount_dec_and_test()
                                                           if
(is_vma_writer_only())
        rcuwait_wait_event(&vma->vm_mm->vma_writer_wait, ...)
    __vma_exit_locked
        refcount_sub_and_test(VMA_LOCK_OFFSET, ...)
    free the vma

rcuwait_wake_up(&vma->vm_mm->vma_writer_wait);

I think it's possible and your suggestion of storing the mm before
doing __refcount_dec_and_test() should work. Thanks for pointing this
out! I'll fix it in the next version.
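
For reference, folding in both points discussed above (always releasing
the dep_map before the refcount operation, and reading vm_mm before the
decrement), the helper could look roughly like this; a sketch of the
direction, not the final patch:

static inline void vma_refcount_put(struct vm_area_struct *vma)
{
	/*
	 * Read vm_mm before dropping the refcount: once the refcount is
	 * dropped, the writer may finish, free the vma and (with
	 * SLAB_TYPESAFE_BY_RCU) the object may get reused.
	 */
	struct mm_struct *mm = vma->vm_mm;
	int oldcnt;

	rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
	if (!__refcount_dec_and_test(&vma->vm_refcnt, &oldcnt)) {
		if (is_vma_writer_only(oldcnt - 1))
			rcuwait_wake_up(&mm->vma_writer_wait);
	}
}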

>
> >
> > > +     }
> > > +}
> > > +
> >
> > >  static inline void vma_end_read(struct vm_area_struct *vma)
> > >  {
> > >       rcu_read_lock(); /* keeps vma alive till the end of up_read */
> >
> > This should refer to vma_refcount_put(). But after fixing it I think we
> > could stop doing this altogether? It will no longer keep vma "alive"
> > with SLAB_TYPESAFE_BY_RCU.
>
> Yeah, I think the comment along with rcu_read_lock()/rcu_read_unlock()
> here can be safely removed.
>
> >
> > > -     up_read(&vma->vm_lock.lock);
> > > +     vma_refcount_put(vma);
> > >       rcu_read_unlock();
> > >  }
> > >
> >
> > <snip>
> >
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -6370,9 +6370,41 @@ struct vm_area_struct *lock_mm_and_find_vma(struct mm_struct *mm,
> > >  #endif
> > >
> > >  #ifdef CONFIG_PER_VMA_LOCK
> > > +static inline bool __vma_enter_locked(struct vm_area_struct *vma, unsigned int tgt_refcnt)
> > > +{
> > > +     /*
> > > +      * If vma is detached then only vma_mark_attached() can raise the
> > > +      * vm_refcnt. mmap_write_lock prevents racing with vma_mark_attached().
> > > +      */
> > > +     if (!refcount_add_not_zero(VMA_LOCK_OFFSET, &vma->vm_refcnt))
> > > +             return false;
> > > +
> > > +     rwsem_acquire(&vma->vmlock_dep_map, 0, 0, _RET_IP_);
> > > +     rcuwait_wait_event(&vma->vm_mm->vma_writer_wait,
> > > +                refcount_read(&vma->vm_refcnt) == tgt_refcnt,
> > > +                TASK_UNINTERRUPTIBLE);
> > > +     lock_acquired(&vma->vmlock_dep_map, _RET_IP_);
> > > +
> > > +     return true;
> > > +}
> > > +
> > > +static inline void __vma_exit_locked(struct vm_area_struct *vma, bool *detached)
> > > +{
> > > +     *detached = refcount_sub_and_test(VMA_LOCK_OFFSET, &vma->vm_refcnt);
> > > +     rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
> > > +}
> > > +
> > >  void __vma_start_write(struct vm_area_struct *vma, unsigned int mm_lock_seq)
> > >  {
> > > -     down_write(&vma->vm_lock.lock);
> > > +     bool locked;
> > > +
> > > +     /*
> > > +      * __vma_enter_locked() returns false immediately if the vma is not
> > > +      * attached, otherwise it waits until refcnt is (VMA_LOCK_OFFSET + 1)
> > > +      * indicating that vma is attached with no readers.
> > > +      */
> > > +     locked = __vma_enter_locked(vma, VMA_LOCK_OFFSET + 1);
> >
> > Wonder if it would be slightly better if tgt_refcount was just 1 (or 0
> > below in vma_mark_detached()) and the VMA_LOCK_OFFSET added to it in
> > __vma_enter_locked() itself as it's the one adding it in the first place.
>
> Well, it won't be called tgt_refcount then. Maybe "bool vma_attached"
> and inside __vma_enter_locked() we do:
>
> unsigned int tgt_refcnt = VMA_LOCK_OFFSET + (vma_attached ? 1 : 0);
>
> Is that better?
>
> >


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v8 11/16] mm: replace vm_lock and detached flag with a reference count
  2025-01-10 16:47       ` Suren Baghdasaryan
@ 2025-01-10 16:50         ` Suren Baghdasaryan
  0 siblings, 0 replies; 53+ messages in thread
From: Suren Baghdasaryan @ 2025-01-10 16:50 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: akpm, peterz, willy, liam.howlett, lorenzo.stoakes, mhocko,
	hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
	jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
	richard.weiyang, corbet, linux-doc, linux-mm, linux-kernel,
	kernel-team

On Fri, Jan 10, 2025 at 8:47 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Fri, Jan 10, 2025 at 7:56 AM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Fri, Jan 10, 2025 at 6:33 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> > >
> > > On 1/9/25 3:30 AM, Suren Baghdasaryan wrote:
> > > > rw_semaphore is a sizable structure of 40 bytes and consumes
> > > > considerable space for each vm_area_struct. However vma_lock has
> > > > two important specifics which can be used to replace rw_semaphore
> > > > with a simpler structure:
> > > > 1. Readers never wait. They try to take the vma_lock and fall back to
> > > > mmap_lock if that fails.
> > > > 2. Only one writer at a time will ever try to write-lock a vma_lock
> > > > because writers first take mmap_lock in write mode.
> > > > Because of these requirements, full rw_semaphore functionality is not
> > > > needed and we can replace rw_semaphore and the vma->detached flag with
> > > > a refcount (vm_refcnt).
> > > > When vma is in detached state, vm_refcnt is 0 and only a call to
> > > > vma_mark_attached() can take it out of this state. Note that unlike
> > > > before, now we enforce both vma_mark_attached() and vma_mark_detached()
> > > > to be done only after vma has been write-locked. vma_mark_attached()
> > > > changes vm_refcnt to 1 to indicate that it has been attached to the vma
> > > > tree. When a reader takes read lock, it increments vm_refcnt, unless the
> > > > top usable bit of vm_refcnt (0x40000000) is set, indicating presence of
> > > > a writer. When writer takes write lock, it sets the top usable bit to
> > > > indicate its presence. If there are readers, writer will wait using newly
> > > > introduced mm->vma_writer_wait. Since all writers take mmap_lock in write
> > > > mode first, there can be only one writer at a time. The last reader to
> > > > release the lock will signal the writer to wake up.
> > > > refcount might overflow if there are many competing readers, in which case
> > > > read-locking will fail. Readers are expected to handle such failures.
> > > > In summary:
> > > > 1. all readers increment the vm_refcnt;
> > > > 2. writer sets top usable (writer) bit of vm_refcnt;
> > > > 3. readers cannot increment the vm_refcnt if the writer bit is set;
> > > > 4. in the presence of readers, writer must wait for the vm_refcnt to drop
> > > > to 1 (ignoring the writer bit), indicating an attached vma with no readers;
> > > > 5. vm_refcnt overflow is handled by the readers.
> > > >
> > > > Suggested-by: Peter Zijlstra <peterz@infradead.org>
> > > > Suggested-by: Matthew Wilcox <willy@infradead.org>
> > > > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > >
> > > Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> > >
> > > But I think there's a problem that will manifest after patch 15.
> > > Also I don't feel qualified enough about the lockdep parts though
> > > (although I think I spotted another issue with those, below) so best if
> > > PeterZ can review those.
> > > Some nits below too.
> > >
> > > > +
> > > > +static inline void vma_refcount_put(struct vm_area_struct *vma)
> > > > +{
> > > > +     int oldcnt;
> > > > +
> > > > +     if (!__refcount_dec_and_test(&vma->vm_refcnt, &oldcnt)) {
> > > > +             rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
> > >
> > > Shouldn't we rwsem_release always? And also shouldn't it precede the
> > > refcount operation itself?
> >
> > Yes. Hillf pointed to the same issue. It will be fixed in the next version.
> >
> > >
> > > > +             if (is_vma_writer_only(oldcnt - 1))
> > > > +                     rcuwait_wake_up(&vma->vm_mm->vma_writer_wait);
> > >
> > > Hmm hmm we should maybe read the vm_mm pointer before dropping the
> > > refcount? In case this races in a way that is_vma_writer_only tests true
> > > but the writer meanwhile finishes and frees the vma. It's safe now but
> > > not after making the cache SLAB_TYPESAFE_BY_RCU ?
> >
> > > Hmm. But if is_vma_writer_only() is true, that means the writer is
> > blocked and is waiting for the reader to drop the vm_refcnt. IOW, it
> > won't proceed and free the vma until the reader calls
> > rcuwait_wake_up(). Your suggested change is trivial and I can do it
> > but I want to make sure I'm not missing something. Am I?
>
> Ok, after thinking some more, I think the race you might be referring
> to is this:
>
> writer                                           reader
>
>     __vma_enter_locked
>         refcount_add_not_zero(VMA_LOCK_OFFSET, ...)
>                                                     vma_refcount_put
>                                                        __refcount_dec_and_test()
>                                                            if
> (is_vma_writer_only())
>         rcuwait_wait_event(&vma->vm_mm->vma_writer_wait, ...)
>     __vma_exit_locked
>         refcount_sub_and_test(VMA_LOCK_OFFSET, ...)
>     free the vma
>
> rcuwait_wake_up(&vma->vm_mm->vma_writer_wait);

Sorry, this should be more readable:

writer             reader

__vma_enter_locked
    refcount_add_not_zero(VMA_LOCK_OFFSET, ...)
                   vma_refcount_put
                       __refcount_dec_and_test()
                           if (is_vma_writer_only())

    rcuwait_wait_event()
__vma_exit_locked
    refcount_sub_and_test(VMA_LOCK_OFFSET, ...)
free the vma

                               rcuwait_wake_up(); <-- access to vma->vm_mm

>
> I think it's possible and your suggestion of storing the mm before
> doing __refcount_dec_and_test() should work. Thanks for pointing this
> out! I'll fix it in the next version.
>
> >
> > >
> > > > +     }
> > > > +}
> > > > +
> > >
> > > >  static inline void vma_end_read(struct vm_area_struct *vma)
> > > >  {
> > > >       rcu_read_lock(); /* keeps vma alive till the end of up_read */
> > >
> > > This should refer to vma_refcount_put(). But after fixing it I think we
> > > could stop doing this altogether? It will no longer keep vma "alive"
> > > with SLAB_TYPESAFE_BY_RCU.
> >
> > Yeah, I think the comment along with rcu_read_lock()/rcu_read_unlock()
> > here can be safely removed.
> >
> > >
> > > > -     up_read(&vma->vm_lock.lock);
> > > > +     vma_refcount_put(vma);
> > > >       rcu_read_unlock();
> > > >  }
> > > >
> > >
> > > <snip>
> > >
> > > > --- a/mm/memory.c
> > > > +++ b/mm/memory.c
> > > > @@ -6370,9 +6370,41 @@ struct vm_area_struct *lock_mm_and_find_vma(struct mm_struct *mm,
> > > >  #endif
> > > >
> > > >  #ifdef CONFIG_PER_VMA_LOCK
> > > > +static inline bool __vma_enter_locked(struct vm_area_struct *vma, unsigned int tgt_refcnt)
> > > > +{
> > > > +     /*
> > > > +      * If vma is detached then only vma_mark_attached() can raise the
> > > > +      * vm_refcnt. mmap_write_lock prevents racing with vma_mark_attached().
> > > > +      */
> > > > +     if (!refcount_add_not_zero(VMA_LOCK_OFFSET, &vma->vm_refcnt))
> > > > +             return false;
> > > > +
> > > > +     rwsem_acquire(&vma->vmlock_dep_map, 0, 0, _RET_IP_);
> > > > +     rcuwait_wait_event(&vma->vm_mm->vma_writer_wait,
> > > > +                refcount_read(&vma->vm_refcnt) == tgt_refcnt,
> > > > +                TASK_UNINTERRUPTIBLE);
> > > > +     lock_acquired(&vma->vmlock_dep_map, _RET_IP_);
> > > > +
> > > > +     return true;
> > > > +}
> > > > +
> > > > +static inline void __vma_exit_locked(struct vm_area_struct *vma, bool *detached)
> > > > +{
> > > > +     *detached = refcount_sub_and_test(VMA_LOCK_OFFSET, &vma->vm_refcnt);
> > > > +     rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
> > > > +}
> > > > +
> > > >  void __vma_start_write(struct vm_area_struct *vma, unsigned int mm_lock_seq)
> > > >  {
> > > > -     down_write(&vma->vm_lock.lock);
> > > > +     bool locked;
> > > > +
> > > > +     /*
> > > > +      * __vma_enter_locked() returns false immediately if the vma is not
> > > > +      * attached, otherwise it waits until refcnt is (VMA_LOCK_OFFSET + 1)
> > > > +      * indicating that vma is attached with no readers.
> > > > +      */
> > > > +     locked = __vma_enter_locked(vma, VMA_LOCK_OFFSET + 1);
> > >
> > > Wonder if it would be slightly better if tgt_refcount was just 1 (or 0
> > > below in vma_mark_detached()) and the VMA_LOCK_OFFSET added to it in
> > > __vma_enter_locked() itself as it's the one adding it in the first place.
> >
> > Well, it won't be called tgt_refcount then. Maybe "bool vma_attached"
> > and inside __vma_enter_locked() we do:
> >
> > unsigned int tgt_refcnt = VMA_LOCK_OFFSET + (vma_attached ? 1 : 0);
> >
> > Is that better?
> >
> > >


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v8 00/16] move per-vma lock into vm_area_struct
  2025-01-09 15:48   ` Suren Baghdasaryan
@ 2025-01-10 17:01     ` Peter Zijlstra
  2025-01-15  8:59       ` Peter Zijlstra
  0 siblings, 1 reply; 53+ messages in thread
From: Peter Zijlstra @ 2025-01-10 17:01 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
	hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
	jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
	richard.weiyang, corbet, linux-doc, linux-mm, linux-kernel,
	kernel-team

On Thu, Jan 09, 2025 at 07:48:32AM -0800, Suren Baghdasaryan wrote:
> On Thu, Jan 9, 2025 at 3:51 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Wed, Jan 08, 2025 at 06:30:09PM -0800, Suren Baghdasaryan wrote:
> > > Back when per-vma locks were introduces, vm_lock was moved out of
> > > vm_area_struct in [1] because of the performance regression caused by
> > > false cacheline sharing. Recent investigation [2] revealed that the
> > > regressions is limited to a rather old Broadwell microarchitecture and
> > > even there it can be mitigated by disabling adjacent cacheline
> > > prefetching, see [3].
> > > Splitting single logical structure into multiple ones leads to more
> > > complicated management, extra pointer dereferences and overall less
> > > maintainable code. When that split-away part is a lock, it complicates
> > > things even further. With no performance benefits, there are no reasons
> > > for this split. Merging the vm_lock back into vm_area_struct also allows
> > > vm_area_struct to use SLAB_TYPESAFE_BY_RCU later in this patchset.
> > > This patchset:
> > > 1. moves vm_lock back into vm_area_struct, aligning it at the cacheline
> > > boundary and changing the cache to be cacheline-aligned to minimize
> > > cacheline sharing;
> > > 2. changes vm_area_struct initialization to mark new vma as detached until
> > > it is inserted into vma tree;
> > > 3. replaces vm_lock and vma->detached flag with a reference counter;
> > > 4. changes vm_area_struct cache to SLAB_TYPESAFE_BY_RCU to allow for their
> > > reuse and to minimize call_rcu() calls.
> >
> > Does not clean up that reattach nonsense :-(
> 
> Oh, no. I think it does. That's why in [1] I introduce
> vma_iter_store_attached() to be used on already attached vmas and to
> avoid marking them attached again. Also I added assertions in
> vma_mark_attached()/vma_mark_detached() to avoid re-attaching or
> re-detaching. Unless I misunderstood your comment?

Hmm, I'll go read the thing again, maybe I missed it.


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v8 15/16] mm: make vma cache SLAB_TYPESAFE_BY_RCU
  2025-01-09  2:30 ` [PATCH v8 15/16] mm: make vma cache SLAB_TYPESAFE_BY_RCU Suren Baghdasaryan
  2025-01-10 15:32   ` Vlastimil Babka
@ 2025-01-10 17:47   ` Liam R. Howlett
  2025-01-10 19:07     ` Suren Baghdasaryan
  1 sibling, 1 reply; 53+ messages in thread
From: Liam R. Howlett @ 2025-01-10 17:47 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, peterz, willy, lorenzo.stoakes, mhocko, vbabka, hannes,
	mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
	jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
	richard.weiyang, corbet, linux-doc, linux-mm, linux-kernel,
	kernel-team

* Suren Baghdasaryan <surenb@google.com> [250108 21:31]:
> To enable SLAB_TYPESAFE_BY_RCU for vma cache we need to ensure that
> object reuse before RCU grace period is over will be detected by
> lock_vma_under_rcu().
> Current checks are sufficient as long as vma is detached before it is
> freed. The only place this is not currently happening is in exit_mmap().
> Add the missing vma_mark_detached() in exit_mmap().
> Another issue which might trick lock_vma_under_rcu() during vma reuse
> is vm_area_dup(), which copies the entire content of the vma into a new
> one, overriding new vma's vm_refcnt and temporarily making it appear as
> attached. This might trick a racing lock_vma_under_rcu() to operate on
> a reused vma if it found the vma before it got reused. To prevent this
> situation, we should ensure that vm_refcnt stays at detached state (0)
> when it is copied and advances to attached state only after it is added
> into the vma tree. Introduce vma_copy() which preserves new vma's
> vm_refcnt and use it in vm_area_dup(). Since all vmas are in detached
> state with no current readers when they are freed, lock_vma_under_rcu()
> will not be able to take vm_refcnt after vma got detached even if vma
> is reused.
> Finally, make vm_area_cachep SLAB_TYPESAFE_BY_RCU. This will facilitate
> vm_area_struct reuse and will minimize the number of call_rcu() calls.
> 
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
>  include/linux/mm.h               |  2 -
>  include/linux/mm_types.h         | 10 +++--
>  include/linux/slab.h             |  6 ---
>  kernel/fork.c                    | 72 ++++++++++++++++++++------------
>  mm/mmap.c                        |  3 +-
>  mm/vma.c                         | 11 ++---
>  mm/vma.h                         |  2 +-
>  tools/testing/vma/vma_internal.h |  7 +---
>  8 files changed, 59 insertions(+), 54 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 1d6b1563b956..a674558e4c05 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -258,8 +258,6 @@ void setup_initial_init_mm(void *start_code, void *end_code,
>  struct vm_area_struct *vm_area_alloc(struct mm_struct *);
>  struct vm_area_struct *vm_area_dup(struct vm_area_struct *);
>  void vm_area_free(struct vm_area_struct *);
> -/* Use only if VMA has no other users */
> -void __vm_area_free(struct vm_area_struct *vma);
>  
>  #ifndef CONFIG_MMU
>  extern struct rb_root nommu_region_tree;
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 2d83d79d1899..93bfcd0c1fde 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -582,6 +582,12 @@ static inline void *folio_get_private(struct folio *folio)
>  
>  typedef unsigned long vm_flags_t;
>  
> +/*
> + * freeptr_t represents a SLUB freelist pointer, which might be encoded
> + * and not dereferenceable if CONFIG_SLAB_FREELIST_HARDENED is enabled.
> + */
> +typedef struct { unsigned long v; } freeptr_t;
> +
>  /*
>   * A region containing a mapping of a non-memory backed file under NOMMU
>   * conditions.  These are held in a global tree and are pinned by the VMAs that
> @@ -695,9 +701,7 @@ struct vm_area_struct {
>  			unsigned long vm_start;
>  			unsigned long vm_end;
>  		};
> -#ifdef CONFIG_PER_VMA_LOCK
> -		struct rcu_head vm_rcu;	/* Used for deferred freeing. */
> -#endif
> +		freeptr_t vm_freeptr; /* Pointer used by SLAB_TYPESAFE_BY_RCU */
>  	};
>  
>  	/*
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index 10a971c2bde3..681b685b6c4e 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -234,12 +234,6 @@ enum _slab_flag_bits {
>  #define SLAB_NO_OBJ_EXT		__SLAB_FLAG_UNUSED
>  #endif
>  
> -/*
> - * freeptr_t represents a SLUB freelist pointer, which might be encoded
> - * and not dereferenceable if CONFIG_SLAB_FREELIST_HARDENED is enabled.
> - */
> -typedef struct { unsigned long v; } freeptr_t;
> -
>  /*
>   * ZERO_SIZE_PTR will be returned for zero sized kmalloc requests.
>   *
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 9d9275783cf8..770b973a099c 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -449,6 +449,41 @@ struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
>  	return vma;
>  }
>  

There exists a copy_vma() which copies the vma to a new area in the mm
in rmap.  Naming this vma_copy() is confusing :)

It might be better to just put this code in the vm_area_dup() or call it
__vm_area_dup(), or __vma_dup() ?

> +static void vma_copy(const struct vm_area_struct *src, struct vm_area_struct *dest)
> +{
> +	dest->vm_mm = src->vm_mm;
> +	dest->vm_ops = src->vm_ops;
> +	dest->vm_start = src->vm_start;
> +	dest->vm_end = src->vm_end;
> +	dest->anon_vma = src->anon_vma;
> +	dest->vm_pgoff = src->vm_pgoff;
> +	dest->vm_file = src->vm_file;
> +	dest->vm_private_data = src->vm_private_data;
> +	vm_flags_init(dest, src->vm_flags);
> +	memcpy(&dest->vm_page_prot, &src->vm_page_prot,
> +	       sizeof(dest->vm_page_prot));
> +	/*
> +	 * src->shared.rb may be modified concurrently, but the clone
> +	 * will be reinitialized.
> +	 */
> +	data_race(memcpy(&dest->shared, &src->shared, sizeof(dest->shared)));
> +	memcpy(&dest->vm_userfaultfd_ctx, &src->vm_userfaultfd_ctx,
> +	       sizeof(dest->vm_userfaultfd_ctx));
> +#ifdef CONFIG_ANON_VMA_NAME
> +	dest->anon_name = src->anon_name;
> +#endif
> +#ifdef CONFIG_SWAP
> +	memcpy(&dest->swap_readahead_info, &src->swap_readahead_info,
> +	       sizeof(dest->swap_readahead_info));
> +#endif
> +#ifndef CONFIG_MMU
> +	dest->vm_region = src->vm_region;
> +#endif
> +#ifdef CONFIG_NUMA
> +	dest->vm_policy = src->vm_policy;
> +#endif
> +}
> +
>  struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
>  {
>  	struct vm_area_struct *new = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
> @@ -458,11 +493,7 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
>  
>  	ASSERT_EXCLUSIVE_WRITER(orig->vm_flags);
>  	ASSERT_EXCLUSIVE_WRITER(orig->vm_file);
> -	/*
> -	 * orig->shared.rb may be modified concurrently, but the clone
> -	 * will be reinitialized.
> -	 */
> -	data_race(memcpy(new, orig, sizeof(*new)));
> +	vma_copy(orig, new);
>  	vma_lock_init(new, true);

I think this suffers from a race still?

That is, we can still race between vm_lock_seq == mm_lock_seq and the
lock acquire, where a free and reuse happens.  In the event that the
reader is caught between the sequence and lock taking, the
vma->vmlock_dep_map may not be replaced and it could see the old lock
(or zero?) and things go bad:

It could try to take vmlock_dep_map == 0 in read mode.

It can take the old lock, detect the refcnt is wrong and release the new
lock.

Thanks,
Liam


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v8 15/16] mm: make vma cache SLAB_TYPESAFE_BY_RCU
  2025-01-10 17:47   ` Liam R. Howlett
@ 2025-01-10 19:07     ` Suren Baghdasaryan
  2025-01-10 19:46       ` Liam R. Howlett
  2025-01-10 19:51       ` Liam R. Howlett
  0 siblings, 2 replies; 53+ messages in thread
From: Suren Baghdasaryan @ 2025-01-10 19:07 UTC (permalink / raw)
  To: Liam R. Howlett, Suren Baghdasaryan, akpm, peterz, willy,
	lorenzo.stoakes, mhocko, vbabka, hannes, mjguzik, oliver.sang,
	mgorman, david, peterx, oleg, dave, paulmck, brauner, dhowells,
	hdanton, hughd, lokeshgidra, minchan, jannh, shakeel.butt,
	souravpanda, pasha.tatashin, klarasmodin, richard.weiyang,
	corbet, linux-doc, linux-mm, linux-kernel, kernel-team

On Fri, Jan 10, 2025 at 9:48 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> * Suren Baghdasaryan <surenb@google.com> [250108 21:31]:
> > To enable SLAB_TYPESAFE_BY_RCU for vma cache we need to ensure that
> > object reuse before RCU grace period is over will be detected by
> > lock_vma_under_rcu().
> > Current checks are sufficient as long as vma is detached before it is
> > freed. The only place this is not currently happening is in exit_mmap().
> > Add the missing vma_mark_detached() in exit_mmap().
> > Another issue which might trick lock_vma_under_rcu() during vma reuse
> > is vm_area_dup(), which copies the entire content of the vma into a new
> > one, overriding new vma's vm_refcnt and temporarily making it appear as
> > attached. This might trick a racing lock_vma_under_rcu() to operate on
> > a reused vma if it found the vma before it got reused. To prevent this
> > situation, we should ensure that vm_refcnt stays at detached state (0)
> > when it is copied and advances to attached state only after it is added
> > into the vma tree. Introduce vma_copy() which preserves new vma's
> > vm_refcnt and use it in vm_area_dup(). Since all vmas are in detached
> > state with no current readers when they are freed, lock_vma_under_rcu()
> > will not be able to take vm_refcnt after vma got detached even if vma
> > is reused.
> > Finally, make vm_area_cachep SLAB_TYPESAFE_BY_RCU. This will facilitate
> > vm_area_struct reuse and will minimize the number of call_rcu() calls.
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > ---
> >  include/linux/mm.h               |  2 -
> >  include/linux/mm_types.h         | 10 +++--
> >  include/linux/slab.h             |  6 ---
> >  kernel/fork.c                    | 72 ++++++++++++++++++++------------
> >  mm/mmap.c                        |  3 +-
> >  mm/vma.c                         | 11 ++---
> >  mm/vma.h                         |  2 +-
> >  tools/testing/vma/vma_internal.h |  7 +---
> >  8 files changed, 59 insertions(+), 54 deletions(-)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 1d6b1563b956..a674558e4c05 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -258,8 +258,6 @@ void setup_initial_init_mm(void *start_code, void *end_code,
> >  struct vm_area_struct *vm_area_alloc(struct mm_struct *);
> >  struct vm_area_struct *vm_area_dup(struct vm_area_struct *);
> >  void vm_area_free(struct vm_area_struct *);
> > -/* Use only if VMA has no other users */
> > -void __vm_area_free(struct vm_area_struct *vma);
> >
> >  #ifndef CONFIG_MMU
> >  extern struct rb_root nommu_region_tree;
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index 2d83d79d1899..93bfcd0c1fde 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -582,6 +582,12 @@ static inline void *folio_get_private(struct folio *folio)
> >
> >  typedef unsigned long vm_flags_t;
> >
> > +/*
> > + * freeptr_t represents a SLUB freelist pointer, which might be encoded
> > + * and not dereferenceable if CONFIG_SLAB_FREELIST_HARDENED is enabled.
> > + */
> > +typedef struct { unsigned long v; } freeptr_t;
> > +
> >  /*
> >   * A region containing a mapping of a non-memory backed file under NOMMU
> >   * conditions.  These are held in a global tree and are pinned by the VMAs that
> > @@ -695,9 +701,7 @@ struct vm_area_struct {
> >                       unsigned long vm_start;
> >                       unsigned long vm_end;
> >               };
> > -#ifdef CONFIG_PER_VMA_LOCK
> > -             struct rcu_head vm_rcu; /* Used for deferred freeing. */
> > -#endif
> > +             freeptr_t vm_freeptr; /* Pointer used by SLAB_TYPESAFE_BY_RCU */
> >       };
> >
> >       /*
> > diff --git a/include/linux/slab.h b/include/linux/slab.h
> > index 10a971c2bde3..681b685b6c4e 100644
> > --- a/include/linux/slab.h
> > +++ b/include/linux/slab.h
> > @@ -234,12 +234,6 @@ enum _slab_flag_bits {
> >  #define SLAB_NO_OBJ_EXT              __SLAB_FLAG_UNUSED
> >  #endif
> >
> > -/*
> > - * freeptr_t represents a SLUB freelist pointer, which might be encoded
> > - * and not dereferenceable if CONFIG_SLAB_FREELIST_HARDENED is enabled.
> > - */
> > -typedef struct { unsigned long v; } freeptr_t;
> > -
> >  /*
> >   * ZERO_SIZE_PTR will be returned for zero sized kmalloc requests.
> >   *
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index 9d9275783cf8..770b973a099c 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -449,6 +449,41 @@ struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
> >       return vma;
> >  }
> >
>
> There exists a copy_vma() which copies the vma to a new area in the mm
> in rmap.  Naming this vma_copy() is confusing :)
>
> It might be better to just put this code in the vm_area_dup() or call it
> __vm_area_dup(), or __vma_dup() ?

Hmm. It's not really duplicating a vma but copying its content (no
allocation). How about __vm_area_copy() to indicate it is copying
vm_area_struct content?

>
> > +static void vma_copy(const struct vm_area_struct *src, struct vm_area_struct *dest)
> > +{
> > +     dest->vm_mm = src->vm_mm;
> > +     dest->vm_ops = src->vm_ops;
> > +     dest->vm_start = src->vm_start;
> > +     dest->vm_end = src->vm_end;
> > +     dest->anon_vma = src->anon_vma;
> > +     dest->vm_pgoff = src->vm_pgoff;
> > +     dest->vm_file = src->vm_file;
> > +     dest->vm_private_data = src->vm_private_data;
> > +     vm_flags_init(dest, src->vm_flags);
> > +     memcpy(&dest->vm_page_prot, &src->vm_page_prot,
> > +            sizeof(dest->vm_page_prot));
> > +     /*
> > +      * src->shared.rb may be modified concurrently, but the clone
> > +      * will be reinitialized.
> > +      */
> > +     data_race(memcpy(&dest->shared, &src->shared, sizeof(dest->shared)));
> > +     memcpy(&dest->vm_userfaultfd_ctx, &src->vm_userfaultfd_ctx,
> > +            sizeof(dest->vm_userfaultfd_ctx));
> > +#ifdef CONFIG_ANON_VMA_NAME
> > +     dest->anon_name = src->anon_name;
> > +#endif
> > +#ifdef CONFIG_SWAP
> > +     memcpy(&dest->swap_readahead_info, &src->swap_readahead_info,
> > +            sizeof(dest->swap_readahead_info));
> > +#endif
> > +#ifndef CONFIG_MMU
> > +     dest->vm_region = src->vm_region;
> > +#endif
> > +#ifdef CONFIG_NUMA
> > +     dest->vm_policy = src->vm_policy;
> > +#endif
> > +}
> > +
> >  struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
> >  {
> >       struct vm_area_struct *new = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
> > @@ -458,11 +493,7 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
> >
> >       ASSERT_EXCLUSIVE_WRITER(orig->vm_flags);
> >       ASSERT_EXCLUSIVE_WRITER(orig->vm_file);
> > -     /*
> > -      * orig->shared.rb may be modified concurrently, but the clone
> > -      * will be reinitialized.
> > -      */
> > -     data_race(memcpy(new, orig, sizeof(*new)));
> > +     vma_copy(orig, new);
> >       vma_lock_init(new, true);
>
> I think this suffers from a race still?
>
> That is, we can still race between vm_lock_seq == mm_lock_seq and the
> lock acquire, where a free and reuse happens.  In the event that the
> reader is caught between the sequence check and taking the lock, the
> vma->vmlock_dep_map may not be replaced and it could see the old lock
> (or zero?) and things go bad:
>
> It could try to take vmlock_dep_map == 0 in read mode.
>
> It can take the old lock, detect the refcnt is wrong and release the new
> lock.

I don't think this race can happen. Notice a call to
vma_assert_detached() inside vm_area_free(), so before vma is freed
and possibly reused, it has to be detached. vma_mark_detached()
ensures that there are no current or future readers by executing the
__vma_enter_locked() + __vma_exit_locked() sequence if vm_refcnt is
not already at 0. Once __vma_exit_locked() is done, vm_refcnt is at 0
and any new reader will be rejected on
__refcount_inc_not_zero_limited(), before even checking vm_lock_seq ==
mm_lock_seq. Even if a reader tries to sneak in between
__vma_enter_locked() and __vma_exit_locked() calls,
__refcount_inc_not_zero_limited() will reject it because
VMA_LOCK_OFFSET is set and VMA_REF_LIMIT will be violated.
IOW, when VMA is freed, it's guaranteed to be detached with no current
or future readers, therefore "race between vm_lock_seq == mm_lock_seq
and the lock acquire, where a free and reuse happens" should not be
possible.
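
To make this concrete, here is a rough sketch of the detach path I'm
describing (paraphrasing the v8 patches, not verbatim kernel code;
__vma_enter_locked()/__vma_exit_locked() are quoted later in this thread):

static void vma_mark_detached(struct vm_area_struct *vma)
{
        vma_assert_write_locked(vma);

        /*
         * Drop the attach reference; if readers still hold vm_refcnt,
         * wait them out before the vma can be freed and reused.
         */
        if (!refcount_dec_and_test(&vma->vm_refcnt)) {
                bool detached;

                /*
                 * Adds VMA_LOCK_OFFSET so that new readers fail
                 * __refcount_inc_not_zero_limited(), then waits until only
                 * the offset remains, i.e. no readers are left.
                 */
                if (__vma_enter_locked(vma, VMA_LOCK_OFFSET)) {
                        __vma_exit_locked(vma, &detached);
                        WARN_ON_ONCE(!detached);
                }
        }
}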

Did I understand your concern correctly and does my explanation make
sense to you?

>
> Thanks,
> Liam


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v8 15/16] mm: make vma cache SLAB_TYPESAFE_BY_RCU
  2025-01-10 19:07     ` Suren Baghdasaryan
@ 2025-01-10 19:46       ` Liam R. Howlett
  2025-01-10 20:34         ` Suren Baghdasaryan
  2025-01-10 19:51       ` Liam R. Howlett
  1 sibling, 1 reply; 53+ messages in thread
From: Liam R. Howlett @ 2025-01-10 19:46 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, peterz, willy, lorenzo.stoakes, mhocko, vbabka, hannes,
	mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
	jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
	richard.weiyang, corbet, linux-doc, linux-mm, linux-kernel,
	kernel-team

* Suren Baghdasaryan <surenb@google.com> [250110 14:08]:
> On Fri, Jan 10, 2025 at 9:48 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> >
> > * Suren Baghdasaryan <surenb@google.com> [250108 21:31]:
> > > To enable SLAB_TYPESAFE_BY_RCU for vma cache we need to ensure that
> > > object reuse before RCU grace period is over will be detected by
> > > lock_vma_under_rcu().
> > > Current checks are sufficient as long as vma is detached before it is
> > > freed. The only place this is not currently happening is in exit_mmap().
> > > Add the missing vma_mark_detached() in exit_mmap().
> > > Another issue which might trick lock_vma_under_rcu() during vma reuse
> > > is vm_area_dup(), which copies the entire content of the vma into a new
> > > one, overriding new vma's vm_refcnt and temporarily making it appear as
> > > attached. This might trick a racing lock_vma_under_rcu() to operate on
> > > a reused vma if it found the vma before it got reused. To prevent this
> > > situation, we should ensure that vm_refcnt stays at detached state (0)
> > > when it is copied and advances to attached state only after it is added
> > > into the vma tree. Introduce vma_copy() which preserves new vma's
> > > vm_refcnt and use it in vm_area_dup(). Since all vmas are in detached
> > > state with no current readers when they are freed, lock_vma_under_rcu()
> > > will not be able to take vm_refcnt after vma got detached even if vma
> > > is reused.
> > > Finally, make vm_area_cachep SLAB_TYPESAFE_BY_RCU. This will facilitate
> > > vm_area_struct reuse and will minimize the number of call_rcu() calls.
> > >
> > > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > > ---
> > >  include/linux/mm.h               |  2 -
> > >  include/linux/mm_types.h         | 10 +++--
> > >  include/linux/slab.h             |  6 ---
> > >  kernel/fork.c                    | 72 ++++++++++++++++++++------------
> > >  mm/mmap.c                        |  3 +-
> > >  mm/vma.c                         | 11 ++---
> > >  mm/vma.h                         |  2 +-
> > >  tools/testing/vma/vma_internal.h |  7 +---
> > >  8 files changed, 59 insertions(+), 54 deletions(-)
> > >
> > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > index 1d6b1563b956..a674558e4c05 100644
> > > --- a/include/linux/mm.h
> > > +++ b/include/linux/mm.h
> > > @@ -258,8 +258,6 @@ void setup_initial_init_mm(void *start_code, void *end_code,
> > >  struct vm_area_struct *vm_area_alloc(struct mm_struct *);
> > >  struct vm_area_struct *vm_area_dup(struct vm_area_struct *);
> > >  void vm_area_free(struct vm_area_struct *);
> > > -/* Use only if VMA has no other users */
> > > -void __vm_area_free(struct vm_area_struct *vma);
> > >
> > >  #ifndef CONFIG_MMU
> > >  extern struct rb_root nommu_region_tree;
> > > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > > index 2d83d79d1899..93bfcd0c1fde 100644
> > > --- a/include/linux/mm_types.h
> > > +++ b/include/linux/mm_types.h
> > > @@ -582,6 +582,12 @@ static inline void *folio_get_private(struct folio *folio)
> > >
> > >  typedef unsigned long vm_flags_t;
> > >
> > > +/*
> > > + * freeptr_t represents a SLUB freelist pointer, which might be encoded
> > > + * and not dereferenceable if CONFIG_SLAB_FREELIST_HARDENED is enabled.
> > > + */
> > > +typedef struct { unsigned long v; } freeptr_t;
> > > +
> > >  /*
> > >   * A region containing a mapping of a non-memory backed file under NOMMU
> > >   * conditions.  These are held in a global tree and are pinned by the VMAs that
> > > @@ -695,9 +701,7 @@ struct vm_area_struct {
> > >                       unsigned long vm_start;
> > >                       unsigned long vm_end;
> > >               };
> > > -#ifdef CONFIG_PER_VMA_LOCK
> > > -             struct rcu_head vm_rcu; /* Used for deferred freeing. */
> > > -#endif
> > > +             freeptr_t vm_freeptr; /* Pointer used by SLAB_TYPESAFE_BY_RCU */
> > >       };
> > >
> > >       /*
> > > diff --git a/include/linux/slab.h b/include/linux/slab.h
> > > index 10a971c2bde3..681b685b6c4e 100644
> > > --- a/include/linux/slab.h
> > > +++ b/include/linux/slab.h
> > > @@ -234,12 +234,6 @@ enum _slab_flag_bits {
> > >  #define SLAB_NO_OBJ_EXT              __SLAB_FLAG_UNUSED
> > >  #endif
> > >
> > > -/*
> > > - * freeptr_t represents a SLUB freelist pointer, which might be encoded
> > > - * and not dereferenceable if CONFIG_SLAB_FREELIST_HARDENED is enabled.
> > > - */
> > > -typedef struct { unsigned long v; } freeptr_t;
> > > -
> > >  /*
> > >   * ZERO_SIZE_PTR will be returned for zero sized kmalloc requests.
> > >   *
> > > diff --git a/kernel/fork.c b/kernel/fork.c
> > > index 9d9275783cf8..770b973a099c 100644
> > > --- a/kernel/fork.c
> > > +++ b/kernel/fork.c
> > > @@ -449,6 +449,41 @@ struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
> > >       return vma;
> > >  }
> > >
> >
> > There exists a copy_vma() which copies the vma to a new area in the mm
> > in rmap.  Naming this vma_copy() is confusing :)
> >
> > It might be better to just put this code in the vm_area_dup() or call it
> > __vm_area_dup(), or __vma_dup() ?
> 
> Hmm. It's not really duplicating a vma but copying its content (no
> allocation). How about __vm_area_copy() to indicate it is copying
> vm_area_struct content?
> 
> >
> > > +static void vma_copy(const struct vm_area_struct *src, struct vm_area_struct *dest)
> > > +{
> > > +     dest->vm_mm = src->vm_mm;
> > > +     dest->vm_ops = src->vm_ops;
> > > +     dest->vm_start = src->vm_start;
> > > +     dest->vm_end = src->vm_end;
> > > +     dest->anon_vma = src->anon_vma;
> > > +     dest->vm_pgoff = src->vm_pgoff;
> > > +     dest->vm_file = src->vm_file;
> > > +     dest->vm_private_data = src->vm_private_data;
> > > +     vm_flags_init(dest, src->vm_flags);
> > > +     memcpy(&dest->vm_page_prot, &src->vm_page_prot,
> > > +            sizeof(dest->vm_page_prot));
> > > +     /*
> > > +      * src->shared.rb may be modified concurrently, but the clone
> > > +      * will be reinitialized.
> > > +      */
> > > +     data_race(memcpy(&dest->shared, &src->shared, sizeof(dest->shared)));
> > > +     memcpy(&dest->vm_userfaultfd_ctx, &src->vm_userfaultfd_ctx,
> > > +            sizeof(dest->vm_userfaultfd_ctx));
> > > +#ifdef CONFIG_ANON_VMA_NAME
> > > +     dest->anon_name = src->anon_name;
> > > +#endif
> > > +#ifdef CONFIG_SWAP
> > > +     memcpy(&dest->swap_readahead_info, &src->swap_readahead_info,
> > > +            sizeof(dest->swap_readahead_info));
> > > +#endif
> > > +#ifndef CONFIG_MMU
> > > +     dest->vm_region = src->vm_region;
> > > +#endif
> > > +#ifdef CONFIG_NUMA
> > > +     dest->vm_policy = src->vm_policy;
> > > +#endif
> > > +}
> > > +
> > >  struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
> > >  {
> > >       struct vm_area_struct *new = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
> > > @@ -458,11 +493,7 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
> > >
> > >       ASSERT_EXCLUSIVE_WRITER(orig->vm_flags);
> > >       ASSERT_EXCLUSIVE_WRITER(orig->vm_file);
> > > -     /*
> > > -      * orig->shared.rb may be modified concurrently, but the clone
> > > -      * will be reinitialized.
> > > -      */
> > > -     data_race(memcpy(new, orig, sizeof(*new)));
> > > +     vma_copy(orig, new);
> > >       vma_lock_init(new, true);
> >
> > I think this suffers from a race still?
> >
> > That is, we can still race between vm_lock_seq == mm_lock_seq and the
> > lock acquire, where a free and reuse happens.  In the event that the
> > reader is caught between the sequence check and taking the lock, the
> > vma->vmlock_dep_map may not be replaced and it could see the old lock
> > (or zero?) and things go bad:
> >
> > It could try to take vmlock_dep_map == 0 in read mode.
> >
> > It can take the old lock, detect the refcnt is wrong and release the new
> > lock.
> 
> I don't think this race can happen. Notice a call to
> vma_assert_detached() inside vm_area_free(), so before vma is freed
> and possibly reused, it has to be detached. vma_mark_detached()
> ensures that there are no current or future readers by executing the
> __vma_enter_locked() + __vma_exit_locked() sequence if vm_refcnt is
> not already at 0. Once __vma_exit_locked() is done, vm_refcnt is at 0
> and any new reader will be rejected on
> __refcount_inc_not_zero_limited(), before even checking vm_lock_seq ==
> mm_lock_seq.

Isn't the vm_lock_seq check before the ref count in vma_start_read()?

From patch 11/16:

@@ -720,13 +752,19 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
 	if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq.sequence))
 		return false;
 
-	if (unlikely(down_read_trylock(&vma->vm_lock.lock) == 0))
+	/*
+	 * If VMA_LOCK_OFFSET is set, __refcount_inc_not_zero_limited() will fail
+	 * because VMA_REF_LIMIT is less than VMA_LOCK_OFFSET.
+	 */
+	if (unlikely(!__refcount_inc_not_zero_limited(&vma->vm_refcnt, &oldcnt,
+						      VMA_REF_LIMIT)))
 		return false;
 
+	rwsem_acquire_read(&vma->vmlock_dep_map, 0, 1, _RET_IP_);



> Even if a reader tries to sneak in between
> __vma_enter_locked() and __vma_exit_locked() calls,
> __refcount_inc_not_zero_limited() will reject it because
> VMA_LOCK_OFFSET is set and VMA_REF_LIMIT will be violated.
> IOW, when VMA is freed, it's guaranteed to be detached with no current
> or future readers, therefore "race between vm_lock_seq == mm_lock_seq
> and the lock acquire, where a free and reuse happens" should not be
> possible.
> 
> Did I understand your concern correctly and does my explanation make
> sense to you?

It is close to what Vlastimil said before.

Here is the sequence for a NULL dereference, refcnt value is not needed:

A:			B:	C:
lock_vma_under_rcu()
  vma = mas_walk()
  vma_start_read()
    vm_lock_seq == mm->mm_lock_seq.sequence

                        vma_start_write
			vma detached and freed

					
				vm_area_dup()
				- vma reallocated
				- zero vma

    rwsem_acquire_read(NULL)


Here is a sequence for unlocking the new lock while locking the old one.
The refcnt failure detects the detached state but does not protect
against the wrong lock use:
A:			B:	C:
lock_vma_under_rcu()
  vma = mas_walk()
  vma_start_read()
    vm_lock_seq == mm->mm_lock_seq.sequence

                        vma_start_write
			vma detached and freed
					
				vm_area_dup()
				- vma reallocated

    rwsem_acquire_read(old lock)
    __refcount_inc_not_zero_limited() fails

				vma_init_lock();

    rwsem_release(new lock)

I don't think avoiding the copy of the ref count from the old vma is
enough to stop these races?

Thanks,
Liam


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v8 15/16] mm: make vma cache SLAB_TYPESAFE_BY_RCU
  2025-01-10 19:07     ` Suren Baghdasaryan
  2025-01-10 19:46       ` Liam R. Howlett
@ 2025-01-10 19:51       ` Liam R. Howlett
  2025-01-10 20:40         ` Suren Baghdasaryan
  1 sibling, 1 reply; 53+ messages in thread
From: Liam R. Howlett @ 2025-01-10 19:51 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, peterz, willy, lorenzo.stoakes, mhocko, vbabka, hannes,
	mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
	jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
	richard.weiyang, corbet, linux-doc, linux-mm, linux-kernel,
	kernel-team

* Suren Baghdasaryan <surenb@google.com> [250110 14:08]:
> On Fri, Jan 10, 2025 at 9:48 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> >
> > * Suren Baghdasaryan <surenb@google.com> [250108 21:31]:
> > > To enable SLAB_TYPESAFE_BY_RCU for vma cache we need to ensure that
> > > object reuse before RCU grace period is over will be detected by
> > > lock_vma_under_rcu().
> > > Current checks are sufficient as long as vma is detached before it is
> > > freed. The only place this is not currently happening is in exit_mmap().
> > > Add the missing vma_mark_detached() in exit_mmap().
> > > Another issue which might trick lock_vma_under_rcu() during vma reuse
> > > is vm_area_dup(), which copies the entire content of the vma into a new
> > > one, overriding new vma's vm_refcnt and temporarily making it appear as
> > > attached. This might trick a racing lock_vma_under_rcu() to operate on
> > > a reused vma if it found the vma before it got reused. To prevent this
> > > situation, we should ensure that vm_refcnt stays at detached state (0)
> > > when it is copied and advances to attached state only after it is added
> > > into the vma tree. Introduce vma_copy() which preserves new vma's
> > > vm_refcnt and use it in vm_area_dup(). Since all vmas are in detached
> > > state with no current readers when they are freed, lock_vma_under_rcu()
> > > will not be able to take vm_refcnt after vma got detached even if vma
> > > is reused.
> > > Finally, make vm_area_cachep SLAB_TYPESAFE_BY_RCU. This will facilitate
> > > vm_area_struct reuse and will minimize the number of call_rcu() calls.
> > >
> > > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > > ---
> > >  include/linux/mm.h               |  2 -
> > >  include/linux/mm_types.h         | 10 +++--
> > >  include/linux/slab.h             |  6 ---
> > >  kernel/fork.c                    | 72 ++++++++++++++++++++------------
> > >  mm/mmap.c                        |  3 +-
> > >  mm/vma.c                         | 11 ++---
> > >  mm/vma.h                         |  2 +-
> > >  tools/testing/vma/vma_internal.h |  7 +---
> > >  8 files changed, 59 insertions(+), 54 deletions(-)
> > >
> > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > index 1d6b1563b956..a674558e4c05 100644
> > > --- a/include/linux/mm.h
> > > +++ b/include/linux/mm.h
> > > @@ -258,8 +258,6 @@ void setup_initial_init_mm(void *start_code, void *end_code,
> > >  struct vm_area_struct *vm_area_alloc(struct mm_struct *);
> > >  struct vm_area_struct *vm_area_dup(struct vm_area_struct *);
> > >  void vm_area_free(struct vm_area_struct *);
> > > -/* Use only if VMA has no other users */
> > > -void __vm_area_free(struct vm_area_struct *vma);
> > >
> > >  #ifndef CONFIG_MMU
> > >  extern struct rb_root nommu_region_tree;
> > > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > > index 2d83d79d1899..93bfcd0c1fde 100644
> > > --- a/include/linux/mm_types.h
> > > +++ b/include/linux/mm_types.h
> > > @@ -582,6 +582,12 @@ static inline void *folio_get_private(struct folio *folio)
> > >
> > >  typedef unsigned long vm_flags_t;
> > >
> > > +/*
> > > + * freeptr_t represents a SLUB freelist pointer, which might be encoded
> > > + * and not dereferenceable if CONFIG_SLAB_FREELIST_HARDENED is enabled.
> > > + */
> > > +typedef struct { unsigned long v; } freeptr_t;
> > > +
> > >  /*
> > >   * A region containing a mapping of a non-memory backed file under NOMMU
> > >   * conditions.  These are held in a global tree and are pinned by the VMAs that
> > > @@ -695,9 +701,7 @@ struct vm_area_struct {
> > >                       unsigned long vm_start;
> > >                       unsigned long vm_end;
> > >               };
> > > -#ifdef CONFIG_PER_VMA_LOCK
> > > -             struct rcu_head vm_rcu; /* Used for deferred freeing. */
> > > -#endif
> > > +             freeptr_t vm_freeptr; /* Pointer used by SLAB_TYPESAFE_BY_RCU */
> > >       };
> > >
> > >       /*
> > > diff --git a/include/linux/slab.h b/include/linux/slab.h
> > > index 10a971c2bde3..681b685b6c4e 100644
> > > --- a/include/linux/slab.h
> > > +++ b/include/linux/slab.h
> > > @@ -234,12 +234,6 @@ enum _slab_flag_bits {
> > >  #define SLAB_NO_OBJ_EXT              __SLAB_FLAG_UNUSED
> > >  #endif
> > >
> > > -/*
> > > - * freeptr_t represents a SLUB freelist pointer, which might be encoded
> > > - * and not dereferenceable if CONFIG_SLAB_FREELIST_HARDENED is enabled.
> > > - */
> > > -typedef struct { unsigned long v; } freeptr_t;
> > > -
> > >  /*
> > >   * ZERO_SIZE_PTR will be returned for zero sized kmalloc requests.
> > >   *
> > > diff --git a/kernel/fork.c b/kernel/fork.c
> > > index 9d9275783cf8..770b973a099c 100644
> > > --- a/kernel/fork.c
> > > +++ b/kernel/fork.c
> > > @@ -449,6 +449,41 @@ struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
> > >       return vma;
> > >  }
> > >
> >
> > There exists a copy_vma() which copies the vma to a new area in the mm
> > in rmap.  Naming this vma_copy() is confusing :)
> >
> > It might be better to just put this code in the vm_area_dup() or call it
> > __vm_area_dup(), or __vma_dup() ?
> 
> Hmm. It's not really duplicating a vma but copying its content (no
> allocation). How about __vm_area_copy() to indicate it is copying
> vm_area_struct content?


Sorry, I missed this.  It's not copying all the content either.

vm_area_init_dup() maybe?

Considering the scope of the series, I'm not sure I want to have a
bike shed conversation... But I also don't want copy_<foo> vs. <foo>_copy
confusion in the future.


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v8 15/16] mm: make vma cache SLAB_TYPESAFE_BY_RCU
  2025-01-10 19:46       ` Liam R. Howlett
@ 2025-01-10 20:34         ` Suren Baghdasaryan
  2025-01-10 20:47           ` Liam R. Howlett
  0 siblings, 1 reply; 53+ messages in thread
From: Suren Baghdasaryan @ 2025-01-10 20:34 UTC (permalink / raw)
  To: Liam R. Howlett, Suren Baghdasaryan, akpm, peterz, willy,
	lorenzo.stoakes, mhocko, vbabka, hannes, mjguzik, oliver.sang,
	mgorman, david, peterx, oleg, dave, paulmck, brauner, dhowells,
	hdanton, hughd, lokeshgidra, minchan, jannh, shakeel.butt,
	souravpanda, pasha.tatashin, klarasmodin, richard.weiyang,
	corbet, linux-doc, linux-mm, linux-kernel, kernel-team

On Fri, Jan 10, 2025 at 11:46 AM 'Liam R. Howlett' via kernel-team
<kernel-team@android.com> wrote:
>
> * Suren Baghdasaryan <surenb@google.com> [250110 14:08]:
> > On Fri, Jan 10, 2025 at 9:48 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> > >
> > > * Suren Baghdasaryan <surenb@google.com> [250108 21:31]:
> > > > To enable SLAB_TYPESAFE_BY_RCU for vma cache we need to ensure that
> > > > object reuse before RCU grace period is over will be detected by
> > > > lock_vma_under_rcu().
> > > > Current checks are sufficient as long as vma is detached before it is
> > > > freed. The only place this is not currently happening is in exit_mmap().
> > > > Add the missing vma_mark_detached() in exit_mmap().
> > > > Another issue which might trick lock_vma_under_rcu() during vma reuse
> > > > is vm_area_dup(), which copies the entire content of the vma into a new
> > > > one, overriding new vma's vm_refcnt and temporarily making it appear as
> > > > attached. This might trick a racing lock_vma_under_rcu() to operate on
> > > > a reused vma if it found the vma before it got reused. To prevent this
> > > > situation, we should ensure that vm_refcnt stays at detached state (0)
> > > > when it is copied and advances to attached state only after it is added
> > > > into the vma tree. Introduce vma_copy() which preserves new vma's
> > > > vm_refcnt and use it in vm_area_dup(). Since all vmas are in detached
> > > > state with no current readers when they are freed, lock_vma_under_rcu()
> > > > will not be able to take vm_refcnt after vma got detached even if vma
> > > > is reused.
> > > > Finally, make vm_area_cachep SLAB_TYPESAFE_BY_RCU. This will facilitate
> > > > vm_area_struct reuse and will minimize the number of call_rcu() calls.
> > > >
> > > > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > > > ---
> > > >  include/linux/mm.h               |  2 -
> > > >  include/linux/mm_types.h         | 10 +++--
> > > >  include/linux/slab.h             |  6 ---
> > > >  kernel/fork.c                    | 72 ++++++++++++++++++++------------
> > > >  mm/mmap.c                        |  3 +-
> > > >  mm/vma.c                         | 11 ++---
> > > >  mm/vma.h                         |  2 +-
> > > >  tools/testing/vma/vma_internal.h |  7 +---
> > > >  8 files changed, 59 insertions(+), 54 deletions(-)
> > > >
> > > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > > index 1d6b1563b956..a674558e4c05 100644
> > > > --- a/include/linux/mm.h
> > > > +++ b/include/linux/mm.h
> > > > @@ -258,8 +258,6 @@ void setup_initial_init_mm(void *start_code, void *end_code,
> > > >  struct vm_area_struct *vm_area_alloc(struct mm_struct *);
> > > >  struct vm_area_struct *vm_area_dup(struct vm_area_struct *);
> > > >  void vm_area_free(struct vm_area_struct *);
> > > > -/* Use only if VMA has no other users */
> > > > -void __vm_area_free(struct vm_area_struct *vma);
> > > >
> > > >  #ifndef CONFIG_MMU
> > > >  extern struct rb_root nommu_region_tree;
> > > > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > > > index 2d83d79d1899..93bfcd0c1fde 100644
> > > > --- a/include/linux/mm_types.h
> > > > +++ b/include/linux/mm_types.h
> > > > @@ -582,6 +582,12 @@ static inline void *folio_get_private(struct folio *folio)
> > > >
> > > >  typedef unsigned long vm_flags_t;
> > > >
> > > > +/*
> > > > + * freeptr_t represents a SLUB freelist pointer, which might be encoded
> > > > + * and not dereferenceable if CONFIG_SLAB_FREELIST_HARDENED is enabled.
> > > > + */
> > > > +typedef struct { unsigned long v; } freeptr_t;
> > > > +
> > > >  /*
> > > >   * A region containing a mapping of a non-memory backed file under NOMMU
> > > >   * conditions.  These are held in a global tree and are pinned by the VMAs that
> > > > @@ -695,9 +701,7 @@ struct vm_area_struct {
> > > >                       unsigned long vm_start;
> > > >                       unsigned long vm_end;
> > > >               };
> > > > -#ifdef CONFIG_PER_VMA_LOCK
> > > > -             struct rcu_head vm_rcu; /* Used for deferred freeing. */
> > > > -#endif
> > > > +             freeptr_t vm_freeptr; /* Pointer used by SLAB_TYPESAFE_BY_RCU */
> > > >       };
> > > >
> > > >       /*
> > > > diff --git a/include/linux/slab.h b/include/linux/slab.h
> > > > index 10a971c2bde3..681b685b6c4e 100644
> > > > --- a/include/linux/slab.h
> > > > +++ b/include/linux/slab.h
> > > > @@ -234,12 +234,6 @@ enum _slab_flag_bits {
> > > >  #define SLAB_NO_OBJ_EXT              __SLAB_FLAG_UNUSED
> > > >  #endif
> > > >
> > > > -/*
> > > > - * freeptr_t represents a SLUB freelist pointer, which might be encoded
> > > > - * and not dereferenceable if CONFIG_SLAB_FREELIST_HARDENED is enabled.
> > > > - */
> > > > -typedef struct { unsigned long v; } freeptr_t;
> > > > -
> > > >  /*
> > > >   * ZERO_SIZE_PTR will be returned for zero sized kmalloc requests.
> > > >   *
> > > > diff --git a/kernel/fork.c b/kernel/fork.c
> > > > index 9d9275783cf8..770b973a099c 100644
> > > > --- a/kernel/fork.c
> > > > +++ b/kernel/fork.c
> > > > @@ -449,6 +449,41 @@ struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
> > > >       return vma;
> > > >  }
> > > >
> > >
> > > There exists a copy_vma() which copies the vma to a new area in the mm
> > > in rmap.  Naming this vma_copy() is confusing :)
> > >
> > > It might be better to just put this code in the vm_area_dup() or call it
> > > __vm_area_dup(), or __vma_dup() ?
> >
> > Hmm. It's not really duplicating a vma but copying its content (no
> > allocation). How about __vm_area_copy() to indicate it is copying
> > vm_area_struct content?
> >
> > >
> > > > +static void vma_copy(const struct vm_area_struct *src, struct vm_area_struct *dest)
> > > > +{
> > > > +     dest->vm_mm = src->vm_mm;
> > > > +     dest->vm_ops = src->vm_ops;
> > > > +     dest->vm_start = src->vm_start;
> > > > +     dest->vm_end = src->vm_end;
> > > > +     dest->anon_vma = src->anon_vma;
> > > > +     dest->vm_pgoff = src->vm_pgoff;
> > > > +     dest->vm_file = src->vm_file;
> > > > +     dest->vm_private_data = src->vm_private_data;
> > > > +     vm_flags_init(dest, src->vm_flags);
> > > > +     memcpy(&dest->vm_page_prot, &src->vm_page_prot,
> > > > +            sizeof(dest->vm_page_prot));
> > > > +     /*
> > > > +      * src->shared.rb may be modified concurrently, but the clone
> > > > +      * will be reinitialized.
> > > > +      */
> > > > +     data_race(memcpy(&dest->shared, &src->shared, sizeof(dest->shared)));
> > > > +     memcpy(&dest->vm_userfaultfd_ctx, &src->vm_userfaultfd_ctx,
> > > > +            sizeof(dest->vm_userfaultfd_ctx));
> > > > +#ifdef CONFIG_ANON_VMA_NAME
> > > > +     dest->anon_name = src->anon_name;
> > > > +#endif
> > > > +#ifdef CONFIG_SWAP
> > > > +     memcpy(&dest->swap_readahead_info, &src->swap_readahead_info,
> > > > +            sizeof(dest->swap_readahead_info));
> > > > +#endif
> > > > +#ifndef CONFIG_MMU
> > > > +     dest->vm_region = src->vm_region;
> > > > +#endif
> > > > +#ifdef CONFIG_NUMA
> > > > +     dest->vm_policy = src->vm_policy;
> > > > +#endif
> > > > +}
> > > > +
> > > >  struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
> > > >  {
> > > >       struct vm_area_struct *new = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
> > > > @@ -458,11 +493,7 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
> > > >
> > > >       ASSERT_EXCLUSIVE_WRITER(orig->vm_flags);
> > > >       ASSERT_EXCLUSIVE_WRITER(orig->vm_file);
> > > > -     /*
> > > > -      * orig->shared.rb may be modified concurrently, but the clone
> > > > -      * will be reinitialized.
> > > > -      */
> > > > -     data_race(memcpy(new, orig, sizeof(*new)));
> > > > +     vma_copy(orig, new);
> > > >       vma_lock_init(new, true);
> > >
> > > I think this suffers from a race still?
> > >
> > > That is, we can still race between vm_lock_seq == mm_lock_seq and the
> > > lock acquire, where a free and reuse happens.  In the event that the
> > > reader is caught between the sequence check and taking the lock, the
> > > vma->vmlock_dep_map may not be replaced and it could see the old lock
> > > (or zero?) and things go bad:
> > >
> > > It could try to take vmlock_dep_map == 0 in read mode.
> > >
> > > It can take the old lock, detect the refcnt is wrong and release the new
> > > lock.
> >
> > I don't think this race can happen. Notice a call to
> > vma_assert_detached() inside vm_area_free(), so before vma is freed
> > and possibly reused, it has to be detached. vma_mark_detached()
> > ensures that there are no current or future readers by executing the
> > __vma_enter_locked() + __vma_exit_locked() sequence if vm_refcnt is
> > not already at 0. Once __vma_exit_locked() is done, vm_refcnt is at 0
> > and any new reader will be rejected on
> > __refcount_inc_not_zero_limited(), before even checking vm_lock_seq ==
> > mm_lock_seq.
>
> Isn't the vm_lock_seq check before the ref count in vma_start_read()?
>
> From patch 11/16:
>
> @@ -720,13 +752,19 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
>         if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq.sequence))
>                 return false;
>
> -       if (unlikely(down_read_trylock(&vma->vm_lock.lock) == 0))
> +       /*
> +        * If VMA_LOCK_OFFSET is set, __refcount_inc_not_zero_limited() will fail
> +        * because VMA_REF_LIMIT is less than VMA_LOCK_OFFSET.
> +        */
> +       if (unlikely(!__refcount_inc_not_zero_limited(&vma->vm_refcnt, &oldcnt,
> +                                                     VMA_REF_LIMIT)))
>                 return false;
>
> +       rwsem_acquire_read(&vma->vmlock_dep_map, 0, 1, _RET_IP_);
>
>
>
> > Even if a reader tries to sneak in between
> > __vma_enter_locked() and __vma_exit_locked() calls,
> > __refcount_inc_not_zero_limited() will reject it because
> > VMA_LOCK_OFFSET is set and VMA_REF_LIMIT will be violated.
> > IOW, when VMA is freed, it's guaranteed to be detached with no current
> > or future readers, therefore "race between vm_lock_seq == mm_lock_seq
> > and the lock acquire, where a free and reuse happens" should not be
> > possible.
> >
> > Did I understand your concern correctly and does my explanation make
> > sense to you?
>
> It is close to what Vlastimil said before.
>
> Here is the sequence for a NULL dereference, refcnt value is not needed:
>
> A:                      B:      C:
> lock_vma_under_rcu()
>   vma = mas_walk()
>   vma_start_read()
>     vm_lock_seq == mm->mm_lock_seq.sequence
>
>                         vma_start_write
>                         vma detached and freed

At this point B makes vm_refcnt==0.

>
>
>                                 vm_area_dup()
>                                 - vma reallocated
>                                 - zero vma

vm_refcnt here is still 0.

>

Here, before calling rwsem_acquire_read(), the reader first has to
succeed in __refcount_inc_not_zero_limited(), and that will fail
because vm_refcnt==0. So rwsem_acquire_read(NULL) will not be called.
What am I missing?

>     rwsem_acquire_read(NULL)
>
>
> Here is a sequence for unlocking the new lock while locking the old one.
> The refcnt failure detects the detached state but does not protect
> against the wrong lock use:
> A:                      B:      C:
> lock_vma_under_rcu()
>   vma = mas_walk()
>   vma_start_read()
>     vm_lock_seq == mm->mm_lock_seq.sequence
>
>                         vma_start_write
>                         vma detached and freed
>
>                                 vm_area_dup()
>                                 - vma reallocated

vm_refcnt is still 0 after reallocation.

>
>     rwsem_acquire_read(old lock)
>     __refcount_inc_not_zero_limited() fails

Are we looking at the same code
(https://lore.kernel.org/all/20250109023025.2242447-12-surenb@google.com/)?
The sequence should be reversed here like this:

     if (!__refcount_inc_not_zero_limited())
              return false;
     rwsem_acquire_read(old lock)

and because __refcount_inc_not_zero_limited() fails,
rwsem_acquire_read(old lock) should never be called.

>
>                                 vma_init_lock();
>
>     rwsem_release(new lock)

The reader will not call rwsem_release(new lock) because it failed to
acquire the lock.

>
> I don't think avoiding the copy of the ref count from the old vma is
> enough to stop these races?
>
> Thanks,
> Liam
>
>


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v8 15/16] mm: make vma cache SLAB_TYPESAFE_BY_RCU
  2025-01-10 19:51       ` Liam R. Howlett
@ 2025-01-10 20:40         ` Suren Baghdasaryan
  2025-01-10 20:48           ` Liam R. Howlett
  0 siblings, 1 reply; 53+ messages in thread
From: Suren Baghdasaryan @ 2025-01-10 20:40 UTC (permalink / raw)
  To: Liam R. Howlett, Suren Baghdasaryan, akpm, peterz, willy,
	lorenzo.stoakes, mhocko, vbabka, hannes, mjguzik, oliver.sang,
	mgorman, david, peterx, oleg, dave, paulmck, brauner, dhowells,
	hdanton, hughd, lokeshgidra, minchan, jannh, shakeel.butt,
	souravpanda, pasha.tatashin, klarasmodin, richard.weiyang,
	corbet, linux-doc, linux-mm, linux-kernel, kernel-team

On Fri, Jan 10, 2025 at 11:51 AM 'Liam R. Howlett' via kernel-team
<kernel-team@android.com> wrote:
>
> * Suren Baghdasaryan <surenb@google.com> [250110 14:08]:
> > On Fri, Jan 10, 2025 at 9:48 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> > >
> > > * Suren Baghdasaryan <surenb@google.com> [250108 21:31]:
> > > > To enable SLAB_TYPESAFE_BY_RCU for vma cache we need to ensure that
> > > > object reuse before RCU grace period is over will be detected by
> > > > lock_vma_under_rcu().
> > > > Current checks are sufficient as long as vma is detached before it is
> > > > freed. The only place this is not currently happening is in exit_mmap().
> > > > Add the missing vma_mark_detached() in exit_mmap().
> > > > Another issue which might trick lock_vma_under_rcu() during vma reuse
> > > > is vm_area_dup(), which copies the entire content of the vma into a new
> > > > one, overriding new vma's vm_refcnt and temporarily making it appear as
> > > > attached. This might trick a racing lock_vma_under_rcu() to operate on
> > > > a reused vma if it found the vma before it got reused. To prevent this
> > > > situation, we should ensure that vm_refcnt stays at detached state (0)
> > > > when it is copied and advances to attached state only after it is added
> > > > into the vma tree. Introduce vma_copy() which preserves new vma's
> > > > vm_refcnt and use it in vm_area_dup(). Since all vmas are in detached
> > > > state with no current readers when they are freed, lock_vma_under_rcu()
> > > > will not be able to take vm_refcnt after vma got detached even if vma
> > > > is reused.
> > > > Finally, make vm_area_cachep SLAB_TYPESAFE_BY_RCU. This will facilitate
> > > > vm_area_struct reuse and will minimize the number of call_rcu() calls.
> > > >
> > > > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > > > ---
> > > >  include/linux/mm.h               |  2 -
> > > >  include/linux/mm_types.h         | 10 +++--
> > > >  include/linux/slab.h             |  6 ---
> > > >  kernel/fork.c                    | 72 ++++++++++++++++++++------------
> > > >  mm/mmap.c                        |  3 +-
> > > >  mm/vma.c                         | 11 ++---
> > > >  mm/vma.h                         |  2 +-
> > > >  tools/testing/vma/vma_internal.h |  7 +---
> > > >  8 files changed, 59 insertions(+), 54 deletions(-)
> > > >
> > > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > > index 1d6b1563b956..a674558e4c05 100644
> > > > --- a/include/linux/mm.h
> > > > +++ b/include/linux/mm.h
> > > > @@ -258,8 +258,6 @@ void setup_initial_init_mm(void *start_code, void *end_code,
> > > >  struct vm_area_struct *vm_area_alloc(struct mm_struct *);
> > > >  struct vm_area_struct *vm_area_dup(struct vm_area_struct *);
> > > >  void vm_area_free(struct vm_area_struct *);
> > > > -/* Use only if VMA has no other users */
> > > > -void __vm_area_free(struct vm_area_struct *vma);
> > > >
> > > >  #ifndef CONFIG_MMU
> > > >  extern struct rb_root nommu_region_tree;
> > > > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > > > index 2d83d79d1899..93bfcd0c1fde 100644
> > > > --- a/include/linux/mm_types.h
> > > > +++ b/include/linux/mm_types.h
> > > > @@ -582,6 +582,12 @@ static inline void *folio_get_private(struct folio *folio)
> > > >
> > > >  typedef unsigned long vm_flags_t;
> > > >
> > > > +/*
> > > > + * freeptr_t represents a SLUB freelist pointer, which might be encoded
> > > > + * and not dereferenceable if CONFIG_SLAB_FREELIST_HARDENED is enabled.
> > > > + */
> > > > +typedef struct { unsigned long v; } freeptr_t;
> > > > +
> > > >  /*
> > > >   * A region containing a mapping of a non-memory backed file under NOMMU
> > > >   * conditions.  These are held in a global tree and are pinned by the VMAs that
> > > > @@ -695,9 +701,7 @@ struct vm_area_struct {
> > > >                       unsigned long vm_start;
> > > >                       unsigned long vm_end;
> > > >               };
> > > > -#ifdef CONFIG_PER_VMA_LOCK
> > > > -             struct rcu_head vm_rcu; /* Used for deferred freeing. */
> > > > -#endif
> > > > +             freeptr_t vm_freeptr; /* Pointer used by SLAB_TYPESAFE_BY_RCU */
> > > >       };
> > > >
> > > >       /*
> > > > diff --git a/include/linux/slab.h b/include/linux/slab.h
> > > > index 10a971c2bde3..681b685b6c4e 100644
> > > > --- a/include/linux/slab.h
> > > > +++ b/include/linux/slab.h
> > > > @@ -234,12 +234,6 @@ enum _slab_flag_bits {
> > > >  #define SLAB_NO_OBJ_EXT              __SLAB_FLAG_UNUSED
> > > >  #endif
> > > >
> > > > -/*
> > > > - * freeptr_t represents a SLUB freelist pointer, which might be encoded
> > > > - * and not dereferenceable if CONFIG_SLAB_FREELIST_HARDENED is enabled.
> > > > - */
> > > > -typedef struct { unsigned long v; } freeptr_t;
> > > > -
> > > >  /*
> > > >   * ZERO_SIZE_PTR will be returned for zero sized kmalloc requests.
> > > >   *
> > > > diff --git a/kernel/fork.c b/kernel/fork.c
> > > > index 9d9275783cf8..770b973a099c 100644
> > > > --- a/kernel/fork.c
> > > > +++ b/kernel/fork.c
> > > > @@ -449,6 +449,41 @@ struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
> > > >       return vma;
> > > >  }
> > > >
> > >
> > > There exists a copy_vma() which copies the vma to a new area in the mm
> > > in rmap.  Naming this vma_copy() is confusing :)
> > >
> > > It might be better to just put this code in the vm_area_dup() or call it
> > > __vm_area_dup(), or __vma_dup() ?
> >
> > Hmm. It's not really duplicating a vma but copying its content (no
> > allocation). How about __vm_area_copy() to indicate it is copying
> > vm_area_struct content?
>
>
> Sorry, I missed this.  It's not copying all the content either.
>
> vm_area_init_dup() maybe?

Ah, how about vm_area_init_from(src, dest)?

>
> Considering the scope of the series, I'm not sure I want to have a
> bike shed conversation... But I also don't want copy_<foo> vs. <foo>_copy
> confusion in the future.
>
>


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v8 15/16] mm: make vma cache SLAB_TYPESAFE_BY_RCU
  2025-01-10 20:34         ` Suren Baghdasaryan
@ 2025-01-10 20:47           ` Liam R. Howlett
  2025-01-10 21:32             ` Suren Baghdasaryan
  0 siblings, 1 reply; 53+ messages in thread
From: Liam R. Howlett @ 2025-01-10 20:47 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, peterz, willy, lorenzo.stoakes, mhocko, vbabka, hannes,
	mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
	jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
	richard.weiyang, corbet, linux-doc, linux-mm, linux-kernel,
	kernel-team

* Suren Baghdasaryan <surenb@google.com> [250110 15:35]:
> On Fri, Jan 10, 2025 at 11:46 AM 'Liam R. Howlett' via kernel-team
> 
> >
> >     rwsem_acquire_read(old lock)
> >     __refcount_inc_not_zero_limited() fails
> 
> Are we looking at the same code
> (https://lore.kernel.org/all/20250109023025.2242447-12-surenb@google.com/)?
> The sequence should be reversed here like this:
> 
>      if (!__refcount_inc_not_zero_limited())
>               return false;
>      rwsem_acquire_read(old lock)
> 
> and because __refcount_inc_not_zero_limited() fails,
> rwsem_acquire_read(old lock) should never be called.
> 

We are not; I had an older version of the patch from the 7th.

Sorry for the confusion.

Thanks,
Liam


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v8 15/16] mm: make vma cache SLAB_TYPESAFE_BY_RCU
  2025-01-10 20:40         ` Suren Baghdasaryan
@ 2025-01-10 20:48           ` Liam R. Howlett
  0 siblings, 0 replies; 53+ messages in thread
From: Liam R. Howlett @ 2025-01-10 20:48 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, peterz, willy, lorenzo.stoakes, mhocko, vbabka, hannes,
	mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
	jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
	richard.weiyang, corbet, linux-doc, linux-mm, linux-kernel,
	kernel-team

* Suren Baghdasaryan <surenb@google.com> [250110 15:40]:
> On Fri, Jan 10, 2025 at 11:51 AM 'Liam R. Howlett' via kernel-team
> <kernel-team@android.com> wrote:
> >
> > * Suren Baghdasaryan <surenb@google.com> [250110 14:08]:
> > > On Fri, Jan 10, 2025 at 9:48 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> > > >
> > > > * Suren Baghdasaryan <surenb@google.com> [250108 21:31]:
> > > > > To enable SLAB_TYPESAFE_BY_RCU for vma cache we need to ensure that
> > > > > object reuse before RCU grace period is over will be detected by
> > > > > lock_vma_under_rcu().
> > > > > Current checks are sufficient as long as vma is detached before it is
> > > > > freed. The only place this is not currently happening is in exit_mmap().
> > > > > Add the missing vma_mark_detached() in exit_mmap().
> > > > > Another issue which might trick lock_vma_under_rcu() during vma reuse
> > > > > is vm_area_dup(), which copies the entire content of the vma into a new
> > > > > one, overriding new vma's vm_refcnt and temporarily making it appear as
> > > > > attached. This might trick a racing lock_vma_under_rcu() to operate on
> > > > > a reused vma if it found the vma before it got reused. To prevent this
> > > > > situation, we should ensure that vm_refcnt stays at detached state (0)
> > > > > when it is copied and advances to attached state only after it is added
> > > > > into the vma tree. Introduce vma_copy() which preserves new vma's
> > > > > vm_refcnt and use it in vm_area_dup(). Since all vmas are in detached
> > > > > state with no current readers when they are freed, lock_vma_under_rcu()
> > > > > will not be able to take vm_refcnt after vma got detached even if vma
> > > > > is reused.
> > > > > Finally, make vm_area_cachep SLAB_TYPESAFE_BY_RCU. This will facilitate
> > > > > vm_area_struct reuse and will minimize the number of call_rcu() calls.
> > > > >
> > > > > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > > > > ---
> > > > >  include/linux/mm.h               |  2 -
> > > > >  include/linux/mm_types.h         | 10 +++--
> > > > >  include/linux/slab.h             |  6 ---
> > > > >  kernel/fork.c                    | 72 ++++++++++++++++++++------------
> > > > >  mm/mmap.c                        |  3 +-
> > > > >  mm/vma.c                         | 11 ++---
> > > > >  mm/vma.h                         |  2 +-
> > > > >  tools/testing/vma/vma_internal.h |  7 +---
> > > > >  8 files changed, 59 insertions(+), 54 deletions(-)
> > > > >
> > > > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > > > index 1d6b1563b956..a674558e4c05 100644
> > > > > --- a/include/linux/mm.h
> > > > > +++ b/include/linux/mm.h
> > > > > @@ -258,8 +258,6 @@ void setup_initial_init_mm(void *start_code, void *end_code,
> > > > >  struct vm_area_struct *vm_area_alloc(struct mm_struct *);
> > > > >  struct vm_area_struct *vm_area_dup(struct vm_area_struct *);
> > > > >  void vm_area_free(struct vm_area_struct *);
> > > > > -/* Use only if VMA has no other users */
> > > > > -void __vm_area_free(struct vm_area_struct *vma);
> > > > >
> > > > >  #ifndef CONFIG_MMU
> > > > >  extern struct rb_root nommu_region_tree;
> > > > > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > > > > index 2d83d79d1899..93bfcd0c1fde 100644
> > > > > --- a/include/linux/mm_types.h
> > > > > +++ b/include/linux/mm_types.h
> > > > > @@ -582,6 +582,12 @@ static inline void *folio_get_private(struct folio *folio)
> > > > >
> > > > >  typedef unsigned long vm_flags_t;
> > > > >
> > > > > +/*
> > > > > + * freeptr_t represents a SLUB freelist pointer, which might be encoded
> > > > > + * and not dereferenceable if CONFIG_SLAB_FREELIST_HARDENED is enabled.
> > > > > + */
> > > > > +typedef struct { unsigned long v; } freeptr_t;
> > > > > +
> > > > >  /*
> > > > >   * A region containing a mapping of a non-memory backed file under NOMMU
> > > > >   * conditions.  These are held in a global tree and are pinned by the VMAs that
> > > > > @@ -695,9 +701,7 @@ struct vm_area_struct {
> > > > >                       unsigned long vm_start;
> > > > >                       unsigned long vm_end;
> > > > >               };
> > > > > -#ifdef CONFIG_PER_VMA_LOCK
> > > > > -             struct rcu_head vm_rcu; /* Used for deferred freeing. */
> > > > > -#endif
> > > > > +             freeptr_t vm_freeptr; /* Pointer used by SLAB_TYPESAFE_BY_RCU */
> > > > >       };
> > > > >
> > > > >       /*
> > > > > diff --git a/include/linux/slab.h b/include/linux/slab.h
> > > > > index 10a971c2bde3..681b685b6c4e 100644
> > > > > --- a/include/linux/slab.h
> > > > > +++ b/include/linux/slab.h
> > > > > @@ -234,12 +234,6 @@ enum _slab_flag_bits {
> > > > >  #define SLAB_NO_OBJ_EXT              __SLAB_FLAG_UNUSED
> > > > >  #endif
> > > > >
> > > > > -/*
> > > > > - * freeptr_t represents a SLUB freelist pointer, which might be encoded
> > > > > - * and not dereferenceable if CONFIG_SLAB_FREELIST_HARDENED is enabled.
> > > > > - */
> > > > > -typedef struct { unsigned long v; } freeptr_t;
> > > > > -
> > > > >  /*
> > > > >   * ZERO_SIZE_PTR will be returned for zero sized kmalloc requests.
> > > > >   *
> > > > > diff --git a/kernel/fork.c b/kernel/fork.c
> > > > > index 9d9275783cf8..770b973a099c 100644
> > > > > --- a/kernel/fork.c
> > > > > +++ b/kernel/fork.c
> > > > > @@ -449,6 +449,41 @@ struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
> > > > >       return vma;
> > > > >  }
> > > > >
> > > >
> > > > There exists a copy_vma() which copies the vma to a new area in the mm
> > > > in rmap.  Naming this vma_copy() is confusing :)
> > > >
> > > > It might be better to just put this code in the vm_area_dup() or call it
> > > > __vm_area_dup(), or __vma_dup() ?
> > >
> > > Hmm. It's not really duplicating a vma but copying its content (no
> > > allocation). How about __vm_area_copy() to indicate it is copying
> > > vm_area_struct content?
> >
> >
> > Sorry, I missed this.  it's not copying all the content either.
> >
> > vm_area_init_dup() maybe?
> 
> Ah, how about vm_area_init_from(src, dest)?
> 

Sure, thanks.


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v8 15/16] mm: make vma cache SLAB_TYPESAFE_BY_RCU
  2025-01-10 20:47           ` Liam R. Howlett
@ 2025-01-10 21:32             ` Suren Baghdasaryan
  0 siblings, 0 replies; 53+ messages in thread
From: Suren Baghdasaryan @ 2025-01-10 21:32 UTC (permalink / raw)
  To: Liam R. Howlett, Suren Baghdasaryan, akpm, peterz, willy,
	lorenzo.stoakes, mhocko, vbabka, hannes, mjguzik, oliver.sang,
	mgorman, david, peterx, oleg, dave, paulmck, brauner, dhowells,
	hdanton, hughd, lokeshgidra, minchan, jannh, shakeel.butt,
	souravpanda, pasha.tatashin, klarasmodin, richard.weiyang,
	corbet, linux-doc, linux-mm, linux-kernel, kernel-team

On Fri, Jan 10, 2025 at 12:47 PM 'Liam R. Howlett' via kernel-team
<kernel-team@android.com> wrote:
>
> * Suren Baghdasaryan <surenb@google.com> [250110 15:35]:
> > On Fri, Jan 10, 2025 at 11:46 AM 'Liam R. Howlett' via kernel-team
> >
> > >
> > >     rwsem_acquire_read(old lock)
> > >     __refcount_inc_not_zero_limited() fails
> >
> > Are we looking at the same code
> > (https://lore.kernel.org/all/20250109023025.2242447-12-surenb@google.com/)?
> > The sequence should be reversed here like this:
> >
> >      if (!__refcount_inc_not_zero_limited())
> >               return false;
> >      rwsem_acquire_read(old lock)
> >
> > and because __refcount_inc_not_zero_limited() fails,
> > rwsem_acquire_read(old lock) should never be called.
> >
>
> We are not, I had an older version of the patch from the 7th.
>
> Sorry for the confusion.

No worries. Thanks for taking a look!

>
> Thanks,
> Liam
>
>


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v8 15/16] mm: make vma cache SLAB_TYPESAFE_BY_RCU
  2025-01-10 16:07     ` Suren Baghdasaryan
@ 2025-01-10 22:14       ` Vlastimil Babka
  2025-01-11  3:37       ` Suren Baghdasaryan
  1 sibling, 0 replies; 53+ messages in thread
From: Vlastimil Babka @ 2025-01-10 22:14 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, peterz, willy, liam.howlett, lorenzo.stoakes, mhocko,
	hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
	jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
	richard.weiyang, corbet, linux-doc, linux-mm, linux-kernel,
	kernel-team

On 1/10/25 17:07, Suren Baghdasaryan wrote:
> On Fri, Jan 10, 2025 at 7:31 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> 
> I see vm_area_dup() being used in dup_mmap() and I think this comment
> is about this usage in case the src vma changes from under us.
> However, vm_area_dup() is also used when we simply duplicate an
> existing vma while holding an mmap_write_lock, like in __split_vma().
> In these cases there is no possibility of a race and the copied value
> should hold. Maybe I should amend this comment like this:
> 
> /*
>  * src->shared.rb may be modified concurrently when called from dup_mmap(),
>  * but the clone will reinitialize it.
>  */
> 
> WDYT?

Sounds good, thanks!
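
So the amended comment would then sit right above the data_race() copy in
the helper, i.e. (assuming it ends up being named vm_area_init_from() as
discussed in the other subthread):

        /*
         * src->shared.rb may be modified concurrently when called from
         * dup_mmap(), but the clone will reinitialize it.
         */
        data_race(memcpy(&dest->shared, &src->shared, sizeof(dest->shared)));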

>>
>>



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v8 11/16] mm: replace vm_lock and detached flag with a reference count
  2025-01-10 15:56     ` Suren Baghdasaryan
  2025-01-10 16:47       ` Suren Baghdasaryan
@ 2025-01-10 22:26       ` Vlastimil Babka
  2025-01-10 22:37         ` Suren Baghdasaryan
  1 sibling, 1 reply; 53+ messages in thread
From: Vlastimil Babka @ 2025-01-10 22:26 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, peterz, willy, liam.howlett, lorenzo.stoakes, mhocko,
	hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
	jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
	richard.weiyang, corbet, linux-doc, linux-mm, linux-kernel,
	kernel-team

On 1/10/25 16:56, Suren Baghdasaryan wrote:
>> > --- a/mm/memory.c
>> > +++ b/mm/memory.c
>> > @@ -6370,9 +6370,41 @@ struct vm_area_struct *lock_mm_and_find_vma(struct mm_struct *mm,
>> >  #endif
>> >
>> >  #ifdef CONFIG_PER_VMA_LOCK
>> > +static inline bool __vma_enter_locked(struct vm_area_struct *vma, unsigned int tgt_refcnt)
>> > +{
>> > +     /*
>> > +      * If vma is detached then only vma_mark_attached() can raise the
>> > +      * vm_refcnt. mmap_write_lock prevents racing with vma_mark_attached().
>> > +      */
>> > +     if (!refcount_add_not_zero(VMA_LOCK_OFFSET, &vma->vm_refcnt))
>> > +             return false;
>> > +
>> > +     rwsem_acquire(&vma->vmlock_dep_map, 0, 0, _RET_IP_);
>> > +     rcuwait_wait_event(&vma->vm_mm->vma_writer_wait,
>> > +                refcount_read(&vma->vm_refcnt) == tgt_refcnt,
>> > +                TASK_UNINTERRUPTIBLE);
>> > +     lock_acquired(&vma->vmlock_dep_map, _RET_IP_);
>> > +
>> > +     return true;
>> > +}
>> > +
>> > +static inline void __vma_exit_locked(struct vm_area_struct *vma, bool *detached)
>> > +{
>> > +     *detached = refcount_sub_and_test(VMA_LOCK_OFFSET, &vma->vm_refcnt);
>> > +     rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
>> > +}
>> > +
>> >  void __vma_start_write(struct vm_area_struct *vma, unsigned int mm_lock_seq)
>> >  {
>> > -     down_write(&vma->vm_lock.lock);
>> > +     bool locked;
>> > +
>> > +     /*
>> > +      * __vma_enter_locked() returns false immediately if the vma is not
>> > +      * attached, otherwise it waits until refcnt is (VMA_LOCK_OFFSET + 1)
>> > +      * indicating that vma is attached with no readers.
>> > +      */
>> > +     locked = __vma_enter_locked(vma, VMA_LOCK_OFFSET + 1);
>>
>> Wonder if it would be slightly better if tgt_refcount was just 1 (or 0
>> below in vma_mark_detached()) and the VMA_LOCK_OFFSET added to it in
>> __vma_enter_locked() itself as it's the one adding it in the first place.
> 
> Well, it won't be called tgt_refcount then. Maybe "bool vma_attached"
> and inside __vma_enter_locked() we do:
> 
> unsigned int tgt_refcnt = VMA_LOCK_OFFSET + (vma_attached ? 1 : 0);
> 
> Is that better?

Yeah, I think so, as it centralizes the target refcount logic into a single
place, __vma_enter_locked().
Hm, but then it's weird that __vma_start_write() would set vma_attached to
true and yet it also handles the case where it's not attached.
Maybe call the parameter "detaching" and switch the 0 and 1?


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v8 11/16] mm: replace vm_lock and detached flag with a reference count
  2025-01-10 22:26       ` Vlastimil Babka
@ 2025-01-10 22:37         ` Suren Baghdasaryan
  0 siblings, 0 replies; 53+ messages in thread
From: Suren Baghdasaryan @ 2025-01-10 22:37 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: akpm, peterz, willy, liam.howlett, lorenzo.stoakes, mhocko,
	hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
	jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
	richard.weiyang, corbet, linux-doc, linux-mm, linux-kernel,
	kernel-team

On Fri, Jan 10, 2025 at 2:26 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 1/10/25 16:56, Suren Baghdasaryan wrote:
> >> > --- a/mm/memory.c
> >> > +++ b/mm/memory.c
> >> > @@ -6370,9 +6370,41 @@ struct vm_area_struct *lock_mm_and_find_vma(struct mm_struct *mm,
> >> >  #endif
> >> >
> >> >  #ifdef CONFIG_PER_VMA_LOCK
> >> > +static inline bool __vma_enter_locked(struct vm_area_struct *vma, unsigned int tgt_refcnt)
> >> > +{
> >> > +     /*
> >> > +      * If vma is detached then only vma_mark_attached() can raise the
> >> > +      * vm_refcnt. mmap_write_lock prevents racing with vma_mark_attached().
> >> > +      */
> >> > +     if (!refcount_add_not_zero(VMA_LOCK_OFFSET, &vma->vm_refcnt))
> >> > +             return false;
> >> > +
> >> > +     rwsem_acquire(&vma->vmlock_dep_map, 0, 0, _RET_IP_);
> >> > +     rcuwait_wait_event(&vma->vm_mm->vma_writer_wait,
> >> > +                refcount_read(&vma->vm_refcnt) == tgt_refcnt,
> >> > +                TASK_UNINTERRUPTIBLE);
> >> > +     lock_acquired(&vma->vmlock_dep_map, _RET_IP_);
> >> > +
> >> > +     return true;
> >> > +}
> >> > +
> >> > +static inline void __vma_exit_locked(struct vm_area_struct *vma, bool *detached)
> >> > +{
> >> > +     *detached = refcount_sub_and_test(VMA_LOCK_OFFSET, &vma->vm_refcnt);
> >> > +     rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
> >> > +}
> >> > +
> >> >  void __vma_start_write(struct vm_area_struct *vma, unsigned int mm_lock_seq)
> >> >  {
> >> > -     down_write(&vma->vm_lock.lock);
> >> > +     bool locked;
> >> > +
> >> > +     /*
> >> > +      * __vma_enter_locked() returns false immediately if the vma is not
> >> > +      * attached, otherwise it waits until refcnt is (VMA_LOCK_OFFSET + 1)
> >> > +      * indicating that vma is attached with no readers.
> >> > +      */
> >> > +     locked = __vma_enter_locked(vma, VMA_LOCK_OFFSET + 1);
> >>
> >> Wonder if it would be slightly better if tgt_refcount was just 1 (or 0
> >> below in vma_mark_detached()) and the VMA_LOCK_OFFSET added to it in
> >> __vma_enter_locked() itself as it's the one adding it in the first place.
> >
> > Well, it won't be called tgt_refcount then. Maybe "bool vma_attached"
> > and inside __vma_enter_locked() we do:
> >
> > unsigned int tgt_refcnt = VMA_LOCK_OFFSET + vma_attached ? 1 : 0;
> >
> > Is that better?
>
> Yeah, I think so, as it centralizes the target refcount logic in a single
> place, __vma_enter_locked().
> Hm, but then it's weird that __vma_start_write() would set vma_attached to
> true even though it also handles the case where the vma is not attached.

Ah, good point.

> Maybe call the parameter "detaching" and switch the 0 and 1?

Yes, that would be less confusing. Thanks for the suggestion, I'll use it.
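For illustration, the agreed-upon change might end up looking roughly like the
sketch below (an assumption about a later revision, not the v8 code as posted;
note that if the target refcount is written as a ternary it needs parentheses,
i.e. VMA_LOCK_OFFSET + (detaching ? 0 : 1), since + binds tighter than ?:):

static inline bool __vma_enter_locked(struct vm_area_struct *vma, bool detaching)
{
        unsigned int tgt_refcnt = VMA_LOCK_OFFSET;

        /* Expect one additional reference if the vma is to stay attached. */
        if (!detaching)
                tgt_refcnt++;

        /*
         * If the vma is detached then only vma_mark_attached() can raise
         * vm_refcnt; mmap_write_lock prevents racing with it.
         */
        if (!refcount_add_not_zero(VMA_LOCK_OFFSET, &vma->vm_refcnt))
                return false;

        rwsem_acquire(&vma->vmlock_dep_map, 0, 0, _RET_IP_);
        rcuwait_wait_event(&vma->vm_mm->vma_writer_wait,
                   refcount_read(&vma->vm_refcnt) == tgt_refcnt,
                   TASK_UNINTERRUPTIBLE);
        lock_acquired(&vma->vmlock_dep_map, _RET_IP_);

        return true;
}

/*
 * Callers would then express intent instead of passing a raw count:
 *   __vma_start_write():  __vma_enter_locked(vma, false);
 *   vma_mark_detached():  __vma_enter_locked(vma, true);
 */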


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v8 10/16] refcount: introduce __refcount_{add|inc}_not_zero_limited
  2025-01-09 14:42   ` Vlastimil Babka
@ 2025-01-11  1:33     ` Suren Baghdasaryan
  0 siblings, 0 replies; 53+ messages in thread
From: Suren Baghdasaryan @ 2025-01-11  1:33 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: akpm, peterz, willy, liam.howlett, lorenzo.stoakes, mhocko,
	hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
	jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
	richard.weiyang, corbet, linux-doc, linux-mm, linux-kernel,
	kernel-team

On Thu, Jan 9, 2025 at 6:41 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 1/9/25 3:30 AM, Suren Baghdasaryan wrote:
> > Introduce functions to increase refcount but with a top limit above which
> > they will fail to increase (the limit is inclusive). Setting the limit to
> > INT_MAX indicates no limit.
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>

Thanks Vlastimil!
Since, per David's suggestion in v7, I'll be adding a new static check
in this patch, I won't carry your Acked-by into v9 until you have a chance
to look at it again.

>
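For readers following along, the semantics described in the quoted commit
message could be sketched on top of a plain atomic_t roughly as below. This is
an illustrative helper under an assumed name, not the proposed refcount_t
implementation, which builds on the existing refcount saturation machinery:

#include <linux/atomic.h>
#include <linux/limits.h>

/*
 * Add @i to @r unless @r is zero or the result would exceed the inclusive
 * @limit; a @limit of INT_MAX means "no limit".
 */
static inline bool add_not_zero_limited_sketch(atomic_t *r, int i, int limit)
{
        int old = atomic_read(r);

        do {
                if (!old)
                        return false;   /* already dropped to zero */
                if (limit != INT_MAX && old > limit - i)
                        return false;   /* would exceed the inclusive limit */
        } while (!atomic_try_cmpxchg(r, &old, old + i));

        return true;
}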


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v8 15/16] mm: make vma cache SLAB_TYPESAFE_BY_RCU
  2025-01-10 16:07     ` Suren Baghdasaryan
  2025-01-10 22:14       ` Vlastimil Babka
@ 2025-01-11  3:37       ` Suren Baghdasaryan
  1 sibling, 0 replies; 53+ messages in thread
From: Suren Baghdasaryan @ 2025-01-11  3:37 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: akpm, peterz, willy, liam.howlett, lorenzo.stoakes, mhocko,
	hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
	jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
	richard.weiyang, corbet, linux-doc, linux-mm, linux-kernel,
	kernel-team

On Fri, Jan 10, 2025 at 8:07 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Fri, Jan 10, 2025 at 7:31 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> >
> > On 1/9/25 3:30 AM, Suren Baghdasaryan wrote:
> > > To enable SLAB_TYPESAFE_BY_RCU for vma cache we need to ensure that
> > > object reuse before RCU grace period is over will be detected by
> > > lock_vma_under_rcu().
> > > Current checks are sufficient as long as vma is detached before it is
> > > freed. The only place this is not currently happening is in exit_mmap().
> > > Add the missing vma_mark_detached() in exit_mmap().
> > > Another issue which might trick lock_vma_under_rcu() during vma reuse
> > > is vm_area_dup(), which copies the entire content of the vma into a new
> > > one, overriding new vma's vm_refcnt and temporarily making it appear as
> > > attached. This might trick a racing lock_vma_under_rcu() to operate on
> > > a reused vma if it found the vma before it got reused. To prevent this
> > > situation, we should ensure that vm_refcnt stays at detached state (0)
> > > when it is copied and advances to attached state only after it is added
> > > into the vma tree. Introduce vma_copy() which preserves new vma's
> > > vm_refcnt and use it in vm_area_dup(). Since all vmas are in detached
> > > state with no current readers when they are freed, lock_vma_under_rcu()
> > > will not be able to take vm_refcnt after vma got detached even if vma
> > > is reused.
> > > Finally, make vm_area_cachep SLAB_TYPESAFE_BY_RCU. This will facilitate
> > > vm_area_struct reuse and will minimize the number of call_rcu() calls.
> > >
> > > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> >
> > Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> >
> > You could also drop the reset_refcnt parameter of vma_lock_init() now,
> > as the usage in vm_area_dup() should now be just setting 0 over 0. Maybe
> > a VM_WARN_ON if it's not 0 already?
>
> Yeah, that's a good idea. Will do.

Ugh, once I made this change, the newly added VM_WARN_ON() immediately
triggered: vm_area_dup() does not memset() the entire new vma to zero, and
kmem_cache_alloc(vm_area_cachep) does not always return a reused vma, so a
freshly allocated object can contain garbage in vm_refcnt.
I could add a vm_area_cachep constructor to always initialize vm_refcnt
to 0, but that would lead to more changes. I think I'll keep reset_refcnt
for now and add the vm_area_cachep constructor as a follow-up optimization
after this patchset is merged.
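For reference, the follow-up mentioned above could take roughly this shape (a
sketch of the idea only; the cache flags shown are assumptions, not part of
this patchset). With SLAB_TYPESAFE_BY_RCU the constructor runs when an object
is first created rather than on every reuse, and because vmas are freed in the
detached state (vm_refcnt == 0) the invariant survives reuse:

/* Hypothetical constructor keeping vm_refcnt at the detached state. */
static void vm_area_ctor(void *obj)
{
        struct vm_area_struct *vma = obj;

        refcount_set(&vma->vm_refcnt, 0);
}

vm_area_cachep = kmem_cache_create("vm_area_struct",
                                   sizeof(struct vm_area_struct), 0,
                                   SLAB_HWCACHE_ALIGN | SLAB_TYPESAFE_BY_RCU |
                                   SLAB_PANIC | SLAB_ACCOUNT,
                                   vm_area_ctor);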

>
> > And a comment in vm_area_struct definition to consider vma_copy() when
> > adding any new field?
>
> Sure, will add.
>
> >
> > > +     /*
> > > +      * src->shared.rb may be modified concurrently, but the clone
> > > +      * will be reinitialized.
> > > +      */
> > > +     data_race(memcpy(&dest->shared, &src->shared, sizeof(dest->shared)));
> >
> > The comment makes it sound as if we didn't need to do it at all? But I
> > didn't verify. If we do need it in some cases (i.e. the just allocated
> > vma might have garbage from previous lifetime, but src is well defined
> > and it's a case where it's not reinitialized afterwards) maybe the
> > comment should say? Or if it's either reinitialized later or zeroes at
> > src, we could memset() the zeroes instead of memcpying them, etc.
>
> I see vm_area_dup() being used in dup_mmap() and I think this comment
> is about this usage in case the src vma changes from under us.
> However, vm_area_dup() is also used when we simply duplicate an
> existing vma while holding an mmap_write_lock, like in __split_vma().
> In these cases there is no possibility of a race and copied value
> should hold. Maybe I should amend this comment like this:
>
> /*
>  * src->shared.rb may be modified concurrently when called from dup_mmap(),
>  * but the clone will reinitialize it.
>  */
>
> WDYT?
>
> >
> >
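Stepping back, the reuse detection the quoted commit message relies on follows
the usual SLAB_TYPESAFE_BY_RCU pattern. A simplified sketch (not the exact
lock_vma_under_rcu() code; it assumes the surrounding context: the maple tree
state in mas, plus mm, address and an inval label):

        /*
         * Under rcu_read_lock() the vma found in the tree may be freed and
         * reused at any moment, so take a reference first, then re-validate.
         */
        vma = mas_walk(&mas);
        if (!vma || !vma_start_read(vma))
                goto inval;             /* gone, detached or write-locked */

        /* The object may now be a reused vma for a different range or mm. */
        if (unlikely(vma->vm_mm != mm ||
                     address < vma->vm_start || address >= vma->vm_end)) {
                vma_end_read(vma);
                goto inval;
        }
        return vma;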


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v8 00/16] move per-vma lock into vm_area_struct
  2025-01-10 17:01     ` Peter Zijlstra
@ 2025-01-15  8:59       ` Peter Zijlstra
  0 siblings, 0 replies; 53+ messages in thread
From: Peter Zijlstra @ 2025-01-15  8:59 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
	hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
	jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
	richard.weiyang, corbet, linux-doc, linux-mm, linux-kernel,
	kernel-team

On Fri, Jan 10, 2025 at 06:01:05PM +0100, Peter Zijlstra wrote:
> On Thu, Jan 09, 2025 at 07:48:32AM -0800, Suren Baghdasaryan wrote:
> > On Thu, Jan 9, 2025 at 3:51 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > On Wed, Jan 08, 2025 at 06:30:09PM -0800, Suren Baghdasaryan wrote:
> > > > Back when per-vma locks were introduces, vm_lock was moved out of
> > > > vm_area_struct in [1] because of the performance regression caused by
> > > > false cacheline sharing. Recent investigation [2] revealed that the
> > > > regressions is limited to a rather old Broadwell microarchitecture and
> > > > even there it can be mitigated by disabling adjacent cacheline
> > > > prefetching, see [3].
> > > > Splitting single logical structure into multiple ones leads to more
> > > > complicated management, extra pointer dereferences and overall less
> > > > maintainable code. When that split-away part is a lock, it complicates
> > > > things even further. With no performance benefits, there are no reasons
> > > > for this split. Merging the vm_lock back into vm_area_struct also allows
> > > > vm_area_struct to use SLAB_TYPESAFE_BY_RCU later in this patchset.
> > > > This patchset:
> > > > 1. moves vm_lock back into vm_area_struct, aligning it at the cacheline
> > > > boundary and changing the cache to be cacheline-aligned to minimize
> > > > cacheline sharing;
> > > > 2. changes vm_area_struct initialization to mark new vma as detached until
> > > > it is inserted into vma tree;
> > > > 3. replaces vm_lock and vma->detached flag with a reference counter;
> > > > 4. changes vm_area_struct cache to SLAB_TYPESAFE_BY_RCU to allow for their
> > > > reuse and to minimize call_rcu() calls.
> > >
> > > Does not clean up that reattach nonsense :-(
> > 
> > Oh, no. I think it does. That's why in [1] I introduce
> > vma_iter_store_attached() to be used on already attached vmas and to
> > avoid marking them attached again. Also I added assertions in
> > vma_mark_attached()/vma_mark_detached() to avoid re-attaching or
> > re-detaching. Unless I misunderstood your comment?
> 
> Hmm, I'll go read the thing again, maybe I missed it.

You're right. I was looking for an approach that removes the need to
reattach altogether by moving the point of no return.

This should do for now.

Let me see if I can find time today to finally do a proper reading.

Thanks!
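For reference, the re-attach assertion mentioned in the quoted exchange might
take roughly this shape under the series' vm_refcnt scheme (a sketch, not the
exact patch):

static void vma_mark_attached(struct vm_area_struct *vma)
{
        vma_assert_write_locked(vma);
        /* Catch attempts to re-attach an already attached vma. */
        VM_WARN_ON_ONCE(refcount_read(&vma->vm_refcnt));
        refcount_set(&vma->vm_refcnt, 1);
}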


^ permalink raw reply	[flat|nested] 53+ messages in thread

end of thread, other threads:[~2025-01-15  9:00 UTC | newest]

Thread overview: 53+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-01-09  2:30 [PATCH v8 00/16] move per-vma lock into vm_area_struct Suren Baghdasaryan
2025-01-09  2:30 ` [PATCH v8 01/16] mm: introduce vma_start_read_locked{_nested} helpers Suren Baghdasaryan
2025-01-09  2:30 ` [PATCH v8 02/16] mm: move per-vma lock into vm_area_struct Suren Baghdasaryan
2025-01-09  2:30 ` [PATCH v8 03/16] mm: mark vma as detached until it's added into vma tree Suren Baghdasaryan
2025-01-09  2:30 ` [PATCH v8 04/16] mm: introduce vma_iter_store_attached() to use with attached vmas Suren Baghdasaryan
2025-01-09 14:01   ` Vlastimil Babka
2025-01-09  2:30 ` [PATCH v8 05/16] mm: mark vmas detached upon exit Suren Baghdasaryan
2025-01-09  2:30 ` [PATCH v8 06/16] types: move struct rcuwait into types.h Suren Baghdasaryan
2025-01-09  2:30 ` [PATCH v8 07/16] mm: allow vma_start_read_locked/vma_start_read_locked_nested to fail Suren Baghdasaryan
2025-01-09  2:30 ` [PATCH v8 08/16] mm: move mmap_init_lock() out of the header file Suren Baghdasaryan
2025-01-09  2:30 ` [PATCH v8 09/16] mm: uninline the main body of vma_start_write() Suren Baghdasaryan
2025-01-09  2:30 ` [PATCH v8 10/16] refcount: introduce __refcount_{add|inc}_not_zero_limited Suren Baghdasaryan
2025-01-09 14:42   ` Vlastimil Babka
2025-01-11  1:33     ` Suren Baghdasaryan
2025-01-09  2:30 ` [PATCH v8 11/16] mm: replace vm_lock and detached flag with a reference count Suren Baghdasaryan
2025-01-09 10:35   ` Hillf Danton
2025-01-09 16:01     ` Suren Baghdasaryan
2025-01-10 14:34   ` Vlastimil Babka
2025-01-10 15:56     ` Suren Baghdasaryan
2025-01-10 16:47       ` Suren Baghdasaryan
2025-01-10 16:50         ` Suren Baghdasaryan
2025-01-10 22:26       ` Vlastimil Babka
2025-01-10 22:37         ` Suren Baghdasaryan
2025-01-09  2:30 ` [PATCH v8 12/16] mm/debug: print vm_refcnt state when dumping the vma Suren Baghdasaryan
2025-01-09  2:30 ` [PATCH v8 13/16] mm: remove extra vma_numab_state_init() call Suren Baghdasaryan
2025-01-09  2:30 ` [PATCH v8 14/16] mm: prepare lock_vma_under_rcu() for vma reuse possibility Suren Baghdasaryan
2025-01-09  2:30 ` [PATCH v8 15/16] mm: make vma cache SLAB_TYPESAFE_BY_RCU Suren Baghdasaryan
2025-01-10 15:32   ` Vlastimil Babka
2025-01-10 16:07     ` Suren Baghdasaryan
2025-01-10 22:14       ` Vlastimil Babka
2025-01-11  3:37       ` Suren Baghdasaryan
2025-01-10 17:47   ` Liam R. Howlett
2025-01-10 19:07     ` Suren Baghdasaryan
2025-01-10 19:46       ` Liam R. Howlett
2025-01-10 20:34         ` Suren Baghdasaryan
2025-01-10 20:47           ` Liam R. Howlett
2025-01-10 21:32             ` Suren Baghdasaryan
2025-01-10 19:51       ` Liam R. Howlett
2025-01-10 20:40         ` Suren Baghdasaryan
2025-01-10 20:48           ` Liam R. Howlett
2025-01-09  2:30 ` [PATCH v8 16/16] docs/mm: document latest changes to vm_lock Suren Baghdasaryan
2025-01-09  2:32 ` [PATCH v8 00/16] move per-vma lock into vm_area_struct Suren Baghdasaryan
2025-01-09 11:51 ` Peter Zijlstra
2025-01-09 15:48   ` Suren Baghdasaryan
2025-01-10 17:01     ` Peter Zijlstra
2025-01-15  8:59       ` Peter Zijlstra
2025-01-09 13:41 ` Vlastimil Babka
2025-01-09 15:57   ` Suren Baghdasaryan
2025-01-10  0:14     ` Suren Baghdasaryan
2025-01-09 15:59   ` Suren Baghdasaryan
2025-01-10  0:16     ` Suren Baghdasaryan
2025-01-10 15:36       ` Vlastimil Babka
2025-01-10 16:08         ` Suren Baghdasaryan
