linux-mm.kvack.org archive mirror
* [PATCH v4 0/5] move per-vma lock into vm_area_struct
@ 2024-11-20  0:08 Suren Baghdasaryan
  2024-11-20  0:08 ` [PATCH v4 1/5] mm: introduce vma_start_read_locked{_nested} helpers Suren Baghdasaryan
                   ` (5 more replies)
  0 siblings, 6 replies; 27+ messages in thread
From: Suren Baghdasaryan @ 2024-11-20  0:08 UTC (permalink / raw)
  To: akpm
  Cc: willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka, hannes,
	mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, minchan, jannh,
	shakeel.butt, souravpanda, pasha.tatashin, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team, surenb

Back when per-vma locks were introduced, vm_lock was moved out of
vm_area_struct in [1] because of the performance regression caused by
false cacheline sharing. Recent investigation [2] revealed that the
regression is limited to a rather old Broadwell microarchitecture and
even there it can be mitigated by disabling adjacent cacheline
prefetching, see [3].
Splitting a single logical structure into multiple ones leads to more
complicated management, extra pointer dereferences and overall less
maintainable code. When that split-away part is a lock, it complicates
things even further. With no performance benefits, there is no reason
for this split. Merging the vm_lock back into vm_area_struct also allows
vm_area_struct to use SLAB_TYPESAFE_BY_RCU later in this patchset.
This patchset:
1. moves vm_lock back into vm_area_struct, aligning it at the cacheline
boundary and changing the cache to be cacheline-aligned to minimize
cacheline sharing;
2. changes vm_area_struct initialization to mark a new vma as detached until
it is inserted into the vma tree;
3. changes vm_area_struct cache to SLAB_TYPESAFE_BY_RCU to allow for their
reuse and to minimize call_rcu() calls.
Pagefault microbenchmarks do not show noticeable performance change.

Changes since v3 [4]
- Added SOBs, per Lorenzo Stoakes and Davidlohr Bueso;
- Replaced vma write-locking in vma_mark_attached() with memory barriers to
order accesses to vma->detached vs vm_mm/vm_start/vm_end

Patch applies over mm-unstable

[1] https://lore.kernel.org/all/20230227173632.3292573-34-surenb@google.com/
[2] https://lore.kernel.org/all/ZsQyI%2F087V34JoIt@xsang-OptiPlex-9020/
[3] https://lore.kernel.org/all/CAJuCfpEisU8Lfe96AYJDZ+OM4NoPmnw9bP53cT_kbfP_pR+-2g@mail.gmail.com/
[4] https://lore.kernel.org/all/20241117080931.600731-1-surenb@google.com/

Suren Baghdasaryan (5):
  mm: introduce vma_start_read_locked{_nested} helpers
  mm: move per-vma lock into vm_area_struct
  mm: mark vma as detached until it's added into vma tree
  mm: make vma cache SLAB_TYPESAFE_BY_RCU
  docs/mm: document latest changes to vm_lock

 Documentation/mm/process_addrs.rst |  10 ++-
 include/linux/mm.h                 | 129 ++++++++++++++++++++++++-----
 include/linux/mm_types.h           |  19 ++---
 kernel/fork.c                      |  88 +++++++-------------
 mm/memory.c                        |  17 ++--
 mm/userfaultfd.c                   |  22 ++---
 mm/vma.c                           |   8 +-
 mm/vma.h                           |   2 +
 tools/testing/vma/vma_internal.h   |  55 +++++-------
 9 files changed, 198 insertions(+), 152 deletions(-)


base-commit: 5a7056135bb69da2ce0a42eb8c07968c1331777b
-- 
2.47.0.338.g60cca15819-goog




* [PATCH v4 1/5] mm: introduce vma_start_read_locked{_nested} helpers
  2024-11-20  0:08 [PATCH v4 0/5] move per-vma lock into vm_area_struct Suren Baghdasaryan
@ 2024-11-20  0:08 ` Suren Baghdasaryan
  2024-11-20 22:11   ` Shakeel Butt
  2024-11-20  0:08 ` [PATCH v4 2/5] mm: move per-vma lock into vm_area_struct Suren Baghdasaryan
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 27+ messages in thread
From: Suren Baghdasaryan @ 2024-11-20  0:08 UTC (permalink / raw)
  To: akpm
  Cc: willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka, hannes,
	mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, minchan, jannh,
	shakeel.butt, souravpanda, pasha.tatashin, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team, surenb

Introduce helper functions which can be used to read-lock a VMA when
holding mmap_lock for read. Replace direct accesses to vma->vm_lock
with these new helpers.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
---
 include/linux/mm.h | 24 ++++++++++++++++++++++++
 mm/userfaultfd.c   | 22 +++++-----------------
 2 files changed, 29 insertions(+), 17 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index fecd47239fa9..1ba2e480ae63 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -722,6 +722,30 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
 	return true;
 }
 
+/*
+ * Use only while holding mmap read lock which guarantees that locking will not
+ * fail (nobody can concurrently write-lock the vma). vma_start_read() should
+ * not be used in such cases because it might fail due to mm_lock_seq overflow.
+ * This functionality is used to obtain vma read lock and drop the mmap read lock.
+ */
+static inline void vma_start_read_locked_nested(struct vm_area_struct *vma, int subclass)
+{
+	mmap_assert_locked(vma->vm_mm);
+	down_read_nested(&vma->vm_lock->lock, subclass);
+}
+
+/*
+ * Use only while holding mmap read lock which guarantees that locking will not
+ * fail (nobody can concurrently write-lock the vma). vma_start_read() should
+ * not be used in such cases because it might fail due to mm_lock_seq overflow.
+ * This functionality is used to obtain vma read lock and drop the mmap read lock.
+ */
+static inline void vma_start_read_locked(struct vm_area_struct *vma)
+{
+	mmap_assert_locked(vma->vm_mm);
+	down_read(&vma->vm_lock->lock);
+}
+
 static inline void vma_end_read(struct vm_area_struct *vma)
 {
 	rcu_read_lock(); /* keeps vma alive till the end of up_read */
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 60a0be33766f..87db4b32b82a 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -84,16 +84,8 @@ static struct vm_area_struct *uffd_lock_vma(struct mm_struct *mm,
 
 	mmap_read_lock(mm);
 	vma = find_vma_and_prepare_anon(mm, address);
-	if (!IS_ERR(vma)) {
-		/*
-		 * We cannot use vma_start_read() as it may fail due to
-		 * false locked (see comment in vma_start_read()). We
-		 * can avoid that by directly locking vm_lock under
-		 * mmap_lock, which guarantees that nobody can lock the
-		 * vma for write (vma_start_write()) under us.
-		 */
-		down_read(&vma->vm_lock->lock);
-	}
+	if (!IS_ERR(vma))
+		vma_start_read_locked(vma);
 
 	mmap_read_unlock(mm);
 	return vma;
@@ -1476,14 +1468,10 @@ static int uffd_move_lock(struct mm_struct *mm,
 	mmap_read_lock(mm);
 	err = find_vmas_mm_locked(mm, dst_start, src_start, dst_vmap, src_vmap);
 	if (!err) {
-		/*
-		 * See comment in uffd_lock_vma() as to why not using
-		 * vma_start_read() here.
-		 */
-		down_read(&(*dst_vmap)->vm_lock->lock);
+		vma_start_read_locked(*dst_vmap);
 		if (*dst_vmap != *src_vmap)
-			down_read_nested(&(*src_vmap)->vm_lock->lock,
-					 SINGLE_DEPTH_NESTING);
+			vma_start_read_locked_nested(*src_vmap,
+						SINGLE_DEPTH_NESTING);
 	}
 	mmap_read_unlock(mm);
 	return err;
-- 
2.47.0.338.g60cca15819-goog




* [PATCH v4 2/5] mm: move per-vma lock into vm_area_struct
  2024-11-20  0:08 [PATCH v4 0/5] move per-vma lock into vm_area_struct Suren Baghdasaryan
  2024-11-20  0:08 ` [PATCH v4 1/5] mm: introduce vma_start_read_locked{_nested} helpers Suren Baghdasaryan
@ 2024-11-20  0:08 ` Suren Baghdasaryan
  2024-11-20 23:32   ` Shakeel Butt
  2024-11-20  0:08 ` [PATCH v4 3/5] mm: mark vma as detached until it's added into vma tree Suren Baghdasaryan
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 27+ messages in thread
From: Suren Baghdasaryan @ 2024-11-20  0:08 UTC (permalink / raw)
  To: akpm
  Cc: willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka, hannes,
	mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, minchan, jannh,
	shakeel.butt, souravpanda, pasha.tatashin, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team, surenb

Back when per-vma locks were introduced, vm_lock was moved out of
vm_area_struct in [1] because of the performance regression caused by
false cacheline sharing. Recent investigation [2] revealed that the
regression is limited to a rather old Broadwell microarchitecture and
even there it can be mitigated by disabling adjacent cacheline
prefetching, see [3].
Splitting a single logical structure into multiple ones leads to more
complicated management, extra pointer dereferences and overall less
maintainable code. When that split-away part is a lock, it complicates
things even further. With no performance benefits, there is no reason
for this split. Merging the vm_lock back into vm_area_struct also allows
vm_area_struct to use SLAB_TYPESAFE_BY_RCU later in this patchset.
Move vm_lock back into vm_area_struct, aligning it at the cacheline
boundary and changing the cache to be cacheline-aligned as well.
With a kernel compiled using defconfig, this causes VMA memory consumption
to grow from 160 (vm_area_struct) + 40 (vm_lock) bytes to 256 bytes:

    slabinfo before:
     <name>           ... <objsize> <objperslab> <pagesperslab> : ...
     vma_lock         ...     40  102    1 : ...
     vm_area_struct   ...    160   51    2 : ...

    slabinfo after moving vm_lock:
     <name>           ... <objsize> <objperslab> <pagesperslab> : ...
     vm_area_struct   ...    256   32    2 : ...

Aggregate VMA memory consumption per 1000 VMAs grows from 50 to 64 pages,
which is 5.5MB per 100000 VMAs. Note that the size of this structure is
dependent on the kernel configuration and typically the original size is
higher than 160 bytes. Therefore these calculations are close to the
worst case scenario. A more realistic vm_area_struct usage before this
change is:

     <name>           ... <objsize> <objperslab> <pagesperslab> : ...
     vma_lock         ...     40  102    1 : ...
     vm_area_struct   ...    176   46    2 : ...

Aggregate VMA memory consumption per 1000 VMAs grows from 54 to 64 pages,
which is 3.9MB per 100000 VMAs.
This memory consumption growth can be addressed later by optimizing the
vm_lock.
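
As a sanity check on these figures, here is a small standalone userspace
calculation (not part of the patch; it assumes 4K pages and the
objperslab/pagesperslab values from the slabinfo snippets above):

    /*
     * Back-of-the-envelope check of the numbers above. Illustrative
     * userspace code only, not kernel code.
     */
    #include <stdio.h>

    static long pages_used(long nr_objs, long objperslab, long pagesperslab)
    {
            long slabs = (nr_objs + objperslab - 1) / objperslab; /* round up */

            return slabs * pagesperslab;
    }

    int main(void)
    {
            /* defconfig: 160-byte vm_area_struct + separate 40-byte vma_lock */
            long before = pages_used(1000, 51, 2) + pages_used(1000, 102, 1);
            /* after the move: a single 256-byte cacheline-aligned object */
            long after = pages_used(1000, 32, 2);

            printf("per 1000 VMAs: %ld -> %ld pages\n", before, after);
            printf("per 100000 VMAs: ~%.1f MB extra\n",
                   (after - before) * 100 * 4096.0 / (1 << 20));
            return 0;
    }

This reproduces the ~5.5MB figure; plugging 46 objects per slab into the
176-byte case gives 54 pages before and 64 after, i.e. the ~3.9MB figure.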

[1] https://lore.kernel.org/all/20230227173632.3292573-34-surenb@google.com/
[2] https://lore.kernel.org/all/ZsQyI%2F087V34JoIt@xsang-OptiPlex-9020/
[3] https://lore.kernel.org/all/CAJuCfpEisU8Lfe96AYJDZ+OM4NoPmnw9bP53cT_kbfP_pR+-2g@mail.gmail.com/

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 include/linux/mm.h               | 28 ++++++++++--------
 include/linux/mm_types.h         |  6 ++--
 kernel/fork.c                    | 49 ++++----------------------------
 tools/testing/vma/vma_internal.h | 33 +++++----------------
 4 files changed, 32 insertions(+), 84 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1ba2e480ae63..737c003b0a1e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -684,6 +684,12 @@ static inline void vma_numab_state_free(struct vm_area_struct *vma) {}
 #endif /* CONFIG_NUMA_BALANCING */
 
 #ifdef CONFIG_PER_VMA_LOCK
+static inline void vma_lock_init(struct vm_area_struct *vma)
+{
+	init_rwsem(&vma->vm_lock.lock);
+	vma->vm_lock_seq = UINT_MAX;
+}
+
 /*
  * Try to read-lock a vma. The function is allowed to occasionally yield false
  * locked result to avoid performance overhead, in which case we fall back to
@@ -701,7 +707,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
 	if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq.sequence))
 		return false;
 
-	if (unlikely(down_read_trylock(&vma->vm_lock->lock) == 0))
+	if (unlikely(down_read_trylock(&vma->vm_lock.lock) == 0))
 		return false;
 
 	/*
@@ -716,7 +722,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
 	 * This pairs with RELEASE semantics in vma_end_write_all().
 	 */
 	if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&vma->vm_mm->mm_lock_seq))) {
-		up_read(&vma->vm_lock->lock);
+		up_read(&vma->vm_lock.lock);
 		return false;
 	}
 	return true;
@@ -731,7 +737,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
 static inline void vma_start_read_locked_nested(struct vm_area_struct *vma, int subclass)
 {
 	mmap_assert_locked(vma->vm_mm);
-	down_read_nested(&vma->vm_lock->lock, subclass);
+	down_read_nested(&vma->vm_lock.lock, subclass);
 }
 
 /*
@@ -743,13 +749,13 @@ static inline void vma_start_read_locked_nested(struct vm_area_struct *vma, int
 static inline void vma_start_read_locked(struct vm_area_struct *vma)
 {
 	mmap_assert_locked(vma->vm_mm);
-	down_read(&vma->vm_lock->lock);
+	down_read(&vma->vm_lock.lock);
 }
 
 static inline void vma_end_read(struct vm_area_struct *vma)
 {
 	rcu_read_lock(); /* keeps vma alive till the end of up_read */
-	up_read(&vma->vm_lock->lock);
+	up_read(&vma->vm_lock.lock);
 	rcu_read_unlock();
 }
 
@@ -778,7 +784,7 @@ static inline void vma_start_write(struct vm_area_struct *vma)
 	if (__is_vma_write_locked(vma, &mm_lock_seq))
 		return;
 
-	down_write(&vma->vm_lock->lock);
+	down_write(&vma->vm_lock.lock);
 	/*
 	 * We should use WRITE_ONCE() here because we can have concurrent reads
 	 * from the early lockless pessimistic check in vma_start_read().
@@ -786,7 +792,7 @@ static inline void vma_start_write(struct vm_area_struct *vma)
 	 * we should use WRITE_ONCE() for cleanliness and to keep KCSAN happy.
 	 */
 	WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
-	up_write(&vma->vm_lock->lock);
+	up_write(&vma->vm_lock.lock);
 }
 
 static inline void vma_assert_write_locked(struct vm_area_struct *vma)
@@ -798,7 +804,7 @@ static inline void vma_assert_write_locked(struct vm_area_struct *vma)
 
 static inline void vma_assert_locked(struct vm_area_struct *vma)
 {
-	if (!rwsem_is_locked(&vma->vm_lock->lock))
+	if (!rwsem_is_locked(&vma->vm_lock.lock))
 		vma_assert_write_locked(vma);
 }
 
@@ -831,6 +837,7 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
 
 #else /* CONFIG_PER_VMA_LOCK */
 
+static inline void vma_lock_init(struct vm_area_struct *vma) {}
 static inline bool vma_start_read(struct vm_area_struct *vma)
 		{ return false; }
 static inline void vma_end_read(struct vm_area_struct *vma) {}
@@ -865,10 +872,6 @@ static inline void assert_fault_locked(struct vm_fault *vmf)
 
 extern const struct vm_operations_struct vma_dummy_vm_ops;
 
-/*
- * WARNING: vma_init does not initialize vma->vm_lock.
- * Use vm_area_alloc()/vm_area_free() if vma needs locking.
- */
 static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
 {
 	memset(vma, 0, sizeof(*vma));
@@ -877,6 +880,7 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
 	INIT_LIST_HEAD(&vma->anon_vma_chain);
 	vma_mark_detached(vma, false);
 	vma_numab_state_init(vma);
+	vma_lock_init(vma);
 }
 
 /* Use when VMA is not part of the VMA tree and needs no locking */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 80fef38d9d64..5c4bfdcfac72 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -716,8 +716,6 @@ struct vm_area_struct {
 	 * slowpath.
 	 */
 	unsigned int vm_lock_seq;
-	/* Unstable RCU readers are allowed to read this. */
-	struct vma_lock *vm_lock;
 #endif
 
 	/*
@@ -770,6 +768,10 @@ struct vm_area_struct {
 	struct vma_numab_state *numab_state;	/* NUMA Balancing state */
 #endif
 	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
+#ifdef CONFIG_PER_VMA_LOCK
+	/* Unstable RCU readers are allowed to read this. */
+	struct vma_lock vm_lock ____cacheline_aligned_in_smp;
+#endif
 } __randomize_layout;
 
 #ifdef CONFIG_NUMA
diff --git a/kernel/fork.c b/kernel/fork.c
index 0061cf2450ef..7823797e31d2 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -436,35 +436,6 @@ static struct kmem_cache *vm_area_cachep;
 /* SLAB cache for mm_struct structures (tsk->mm) */
 static struct kmem_cache *mm_cachep;
 
-#ifdef CONFIG_PER_VMA_LOCK
-
-/* SLAB cache for vm_area_struct.lock */
-static struct kmem_cache *vma_lock_cachep;
-
-static bool vma_lock_alloc(struct vm_area_struct *vma)
-{
-	vma->vm_lock = kmem_cache_alloc(vma_lock_cachep, GFP_KERNEL);
-	if (!vma->vm_lock)
-		return false;
-
-	init_rwsem(&vma->vm_lock->lock);
-	vma->vm_lock_seq = UINT_MAX;
-
-	return true;
-}
-
-static inline void vma_lock_free(struct vm_area_struct *vma)
-{
-	kmem_cache_free(vma_lock_cachep, vma->vm_lock);
-}
-
-#else /* CONFIG_PER_VMA_LOCK */
-
-static inline bool vma_lock_alloc(struct vm_area_struct *vma) { return true; }
-static inline void vma_lock_free(struct vm_area_struct *vma) {}
-
-#endif /* CONFIG_PER_VMA_LOCK */
-
 struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
 {
 	struct vm_area_struct *vma;
@@ -474,10 +445,6 @@ struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
 		return NULL;
 
 	vma_init(vma, mm);
-	if (!vma_lock_alloc(vma)) {
-		kmem_cache_free(vm_area_cachep, vma);
-		return NULL;
-	}
 
 	return vma;
 }
@@ -496,10 +463,7 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
 	 * will be reinitialized.
 	 */
 	data_race(memcpy(new, orig, sizeof(*new)));
-	if (!vma_lock_alloc(new)) {
-		kmem_cache_free(vm_area_cachep, new);
-		return NULL;
-	}
+	vma_lock_init(new);
 	INIT_LIST_HEAD(&new->anon_vma_chain);
 	vma_numab_state_init(new);
 	dup_anon_vma_name(orig, new);
@@ -511,7 +475,6 @@ void __vm_area_free(struct vm_area_struct *vma)
 {
 	vma_numab_state_free(vma);
 	free_anon_vma_name(vma);
-	vma_lock_free(vma);
 	kmem_cache_free(vm_area_cachep, vma);
 }
 
@@ -522,7 +485,7 @@ static void vm_area_free_rcu_cb(struct rcu_head *head)
 						  vm_rcu);
 
 	/* The vma should not be locked while being destroyed. */
-	VM_BUG_ON_VMA(rwsem_is_locked(&vma->vm_lock->lock), vma);
+	VM_BUG_ON_VMA(rwsem_is_locked(&vma->vm_lock.lock), vma);
 	__vm_area_free(vma);
 }
 #endif
@@ -3168,11 +3131,9 @@ void __init proc_caches_init(void)
 			sizeof(struct fs_struct), 0,
 			SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT,
 			NULL);
-
-	vm_area_cachep = KMEM_CACHE(vm_area_struct, SLAB_PANIC|SLAB_ACCOUNT);
-#ifdef CONFIG_PER_VMA_LOCK
-	vma_lock_cachep = KMEM_CACHE(vma_lock, SLAB_PANIC|SLAB_ACCOUNT);
-#endif
+	vm_area_cachep = KMEM_CACHE(vm_area_struct,
+			SLAB_HWCACHE_ALIGN|SLAB_NO_MERGE|SLAB_PANIC|
+			SLAB_ACCOUNT);
 	mmap_init();
 	nsproxy_cache_init();
 }
diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
index 1d9fc97b8e80..11c2c38ca4e8 100644
--- a/tools/testing/vma/vma_internal.h
+++ b/tools/testing/vma/vma_internal.h
@@ -230,10 +230,10 @@ struct vm_area_struct {
 	/*
 	 * Can only be written (using WRITE_ONCE()) while holding both:
 	 *  - mmap_lock (in write mode)
-	 *  - vm_lock->lock (in write mode)
+	 *  - vm_lock.lock (in write mode)
 	 * Can be read reliably while holding one of:
 	 *  - mmap_lock (in read or write mode)
-	 *  - vm_lock->lock (in read or write mode)
+	 *  - vm_lock.lock (in read or write mode)
 	 * Can be read unreliably (using READ_ONCE()) for pessimistic bailout
 	 * while holding nothing (except RCU to keep the VMA struct allocated).
 	 *
@@ -242,7 +242,7 @@ struct vm_area_struct {
 	 * slowpath.
 	 */
 	unsigned int vm_lock_seq;
-	struct vma_lock *vm_lock;
+	struct vma_lock vm_lock;
 #endif
 
 	/*
@@ -408,17 +408,10 @@ static inline struct vm_area_struct *vma_next(struct vma_iterator *vmi)
 	return mas_find(&vmi->mas, ULONG_MAX);
 }
 
-static inline bool vma_lock_alloc(struct vm_area_struct *vma)
+static inline void vma_lock_init(struct vm_area_struct *vma)
 {
-	vma->vm_lock = calloc(1, sizeof(struct vma_lock));
-
-	if (!vma->vm_lock)
-		return false;
-
-	init_rwsem(&vma->vm_lock->lock);
+	init_rwsem(&vma->vm_lock.lock);
 	vma->vm_lock_seq = UINT_MAX;
-
-	return true;
 }
 
 static inline void vma_assert_write_locked(struct vm_area_struct *);
@@ -439,6 +432,7 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
 	vma->vm_ops = &vma_dummy_vm_ops;
 	INIT_LIST_HEAD(&vma->anon_vma_chain);
 	vma_mark_detached(vma, false);
+	vma_lock_init(vma);
 }
 
 static inline struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
@@ -449,10 +443,6 @@ static inline struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
 		return NULL;
 
 	vma_init(vma, mm);
-	if (!vma_lock_alloc(vma)) {
-		free(vma);
-		return NULL;
-	}
 
 	return vma;
 }
@@ -465,10 +455,7 @@ static inline struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
 		return NULL;
 
 	memcpy(new, orig, sizeof(*new));
-	if (!vma_lock_alloc(new)) {
-		free(new);
-		return NULL;
-	}
+	vma_lock_init(new);
 	INIT_LIST_HEAD(&new->anon_vma_chain);
 
 	return new;
@@ -638,14 +625,8 @@ static inline void mpol_put(struct mempolicy *)
 {
 }
 
-static inline void vma_lock_free(struct vm_area_struct *vma)
-{
-	free(vma->vm_lock);
-}
-
 static inline void __vm_area_free(struct vm_area_struct *vma)
 {
-	vma_lock_free(vma);
 	free(vma);
 }
 
-- 
2.47.0.338.g60cca15819-goog




* [PATCH v4 3/5] mm: mark vma as detached until it's added into vma tree
  2024-11-20  0:08 [PATCH v4 0/5] move per-vma lock into vm_area_struct Suren Baghdasaryan
  2024-11-20  0:08 ` [PATCH v4 1/5] mm: introduce vma_start_read_locked{_nested} helpers Suren Baghdasaryan
  2024-11-20  0:08 ` [PATCH v4 2/5] mm: move per-vma lock into vm_area_struct Suren Baghdasaryan
@ 2024-11-20  0:08 ` Suren Baghdasaryan
  2024-11-21  0:13   ` Shakeel Butt
  2024-11-22 16:46   ` Lorenzo Stoakes
  2024-11-20  0:08 ` [PATCH v4 4/5] mm: make vma cache SLAB_TYPESAFE_BY_RCU Suren Baghdasaryan
                   ` (2 subsequent siblings)
  5 siblings, 2 replies; 27+ messages in thread
From: Suren Baghdasaryan @ 2024-11-20  0:08 UTC (permalink / raw)
  To: akpm
  Cc: willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka, hannes,
	mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, minchan, jannh,
	shakeel.butt, souravpanda, pasha.tatashin, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team, surenb

The current implementation does not set the detached flag when a VMA is
first allocated. This does not represent the real state of the VMA, which
is detached until it is added into the mm's VMA tree. Fix this by marking
new VMAs as detached and clearing the detached flag only after the VMA is
added into a tree.
Introduce vma_mark_attached() to make the API more readable and to
simplify a possible future cleanup in which vma->vm_mm might be used to
indicate a detached vma and vma_mark_attached() would need an additional
mm parameter.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 include/linux/mm.h               | 27 ++++++++++++++++++++-------
 kernel/fork.c                    |  4 ++++
 mm/memory.c                      |  2 +-
 mm/vma.c                         |  6 +++---
 mm/vma.h                         |  2 ++
 tools/testing/vma/vma_internal.h | 17 ++++++++++++-----
 6 files changed, 42 insertions(+), 16 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 737c003b0a1e..dd1b6190df28 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -808,12 +808,21 @@ static inline void vma_assert_locked(struct vm_area_struct *vma)
 		vma_assert_write_locked(vma);
 }
 
-static inline void vma_mark_detached(struct vm_area_struct *vma, bool detached)
+static inline void vma_mark_attached(struct vm_area_struct *vma)
+{
+	vma->detached = false;
+}
+
+static inline void vma_mark_detached(struct vm_area_struct *vma)
 {
 	/* When detaching vma should be write-locked */
-	if (detached)
-		vma_assert_write_locked(vma);
-	vma->detached = detached;
+	vma_assert_write_locked(vma);
+	vma->detached = true;
+}
+
+static inline bool is_vma_detached(struct vm_area_struct *vma)
+{
+	return vma->detached;
 }
 
 static inline void release_fault_lock(struct vm_fault *vmf)
@@ -844,8 +853,8 @@ static inline void vma_end_read(struct vm_area_struct *vma) {}
 static inline void vma_start_write(struct vm_area_struct *vma) {}
 static inline void vma_assert_write_locked(struct vm_area_struct *vma)
 		{ mmap_assert_write_locked(vma->vm_mm); }
-static inline void vma_mark_detached(struct vm_area_struct *vma,
-				     bool detached) {}
+static inline void vma_mark_attached(struct vm_area_struct *vma) {}
+static inline void vma_mark_detached(struct vm_area_struct *vma) {}
 
 static inline struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
 		unsigned long address)
@@ -878,7 +887,10 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
 	vma->vm_mm = mm;
 	vma->vm_ops = &vma_dummy_vm_ops;
 	INIT_LIST_HEAD(&vma->anon_vma_chain);
-	vma_mark_detached(vma, false);
+#ifdef CONFIG_PER_VMA_LOCK
+	/* vma is not locked, can't use vma_mark_detached() */
+	vma->detached = true;
+#endif
 	vma_numab_state_init(vma);
 	vma_lock_init(vma);
 }
@@ -1073,6 +1085,7 @@ static inline int vma_iter_bulk_store(struct vma_iterator *vmi,
 	if (unlikely(mas_is_err(&vmi->mas)))
 		return -ENOMEM;
 
+	vma_mark_attached(vma);
 	return 0;
 }
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 7823797e31d2..f0cec673583c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -465,6 +465,10 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
 	data_race(memcpy(new, orig, sizeof(*new)));
 	vma_lock_init(new);
 	INIT_LIST_HEAD(&new->anon_vma_chain);
+#ifdef CONFIG_PER_VMA_LOCK
+	/* vma is not locked, can't use vma_mark_detached() */
+	new->detached = true;
+#endif
 	vma_numab_state_init(new);
 	dup_anon_vma_name(orig, new);
 
diff --git a/mm/memory.c b/mm/memory.c
index 209885a4134f..d0197a0c0996 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6279,7 +6279,7 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
 		goto inval;
 
 	/* Check if the VMA got isolated after we found it */
-	if (vma->detached) {
+	if (is_vma_detached(vma)) {
 		vma_end_read(vma);
 		count_vm_vma_lock_event(VMA_LOCK_MISS);
 		/* The area was replaced with another one */
diff --git a/mm/vma.c b/mm/vma.c
index 8a454a7bbc80..73104d434567 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -295,7 +295,7 @@ static void vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi,
 
 	if (vp->remove) {
 again:
-		vma_mark_detached(vp->remove, true);
+		vma_mark_detached(vp->remove);
 		if (vp->file) {
 			uprobe_munmap(vp->remove, vp->remove->vm_start,
 				      vp->remove->vm_end);
@@ -1220,7 +1220,7 @@ static void reattach_vmas(struct ma_state *mas_detach)
 
 	mas_set(mas_detach, 0);
 	mas_for_each(mas_detach, vma, ULONG_MAX)
-		vma_mark_detached(vma, false);
+		vma_mark_attached(vma);
 
 	__mt_destroy(mas_detach->tree);
 }
@@ -1295,7 +1295,7 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
 		if (error)
 			goto munmap_gather_failed;
 
-		vma_mark_detached(next, true);
+		vma_mark_detached(next);
 		nrpages = vma_pages(next);
 
 		vms->nr_pages += nrpages;
diff --git a/mm/vma.h b/mm/vma.h
index 388d34748674..2e680f357ace 100644
--- a/mm/vma.h
+++ b/mm/vma.h
@@ -162,6 +162,7 @@ static inline int vma_iter_store_gfp(struct vma_iterator *vmi,
 	if (unlikely(mas_is_err(&vmi->mas)))
 		return -ENOMEM;
 
+	vma_mark_attached(vma);
 	return 0;
 }
 
@@ -385,6 +386,7 @@ static inline void vma_iter_store(struct vma_iterator *vmi,
 
 	__mas_set_range(&vmi->mas, vma->vm_start, vma->vm_end - 1);
 	mas_store_prealloc(&vmi->mas, vma);
+	vma_mark_attached(vma);
 }
 
 static inline unsigned long vma_iter_addr(struct vma_iterator *vmi)
diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
index 11c2c38ca4e8..2fed366d20ef 100644
--- a/tools/testing/vma/vma_internal.h
+++ b/tools/testing/vma/vma_internal.h
@@ -414,13 +414,17 @@ static inline void vma_lock_init(struct vm_area_struct *vma)
 	vma->vm_lock_seq = UINT_MAX;
 }
 
+static inline void vma_mark_attached(struct vm_area_struct *vma)
+{
+	vma->detached = false;
+}
+
 static inline void vma_assert_write_locked(struct vm_area_struct *);
-static inline void vma_mark_detached(struct vm_area_struct *vma, bool detached)
+static inline void vma_mark_detached(struct vm_area_struct *vma)
 {
 	/* When detaching vma should be write-locked */
-	if (detached)
-		vma_assert_write_locked(vma);
-	vma->detached = detached;
+	vma_assert_write_locked(vma);
+	vma->detached = true;
 }
 
 extern const struct vm_operations_struct vma_dummy_vm_ops;
@@ -431,7 +435,8 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
 	vma->vm_mm = mm;
 	vma->vm_ops = &vma_dummy_vm_ops;
 	INIT_LIST_HEAD(&vma->anon_vma_chain);
-	vma_mark_detached(vma, false);
+	/* vma is not locked, can't use vma_mark_detached() */
+	vma->detached = true;
 	vma_lock_init(vma);
 }
 
@@ -457,6 +462,8 @@ static inline struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
 	memcpy(new, orig, sizeof(*new));
 	vma_lock_init(new);
 	INIT_LIST_HEAD(&new->anon_vma_chain);
+	/* vma is not locked, can't use vma_mark_detached() */
+	new->detached = true;
 
 	return new;
 }
-- 
2.47.0.338.g60cca15819-goog




* [PATCH v4 4/5] mm: make vma cache SLAB_TYPESAFE_BY_RCU
  2024-11-20  0:08 [PATCH v4 0/5] move per-vma lock into vm_area_struct Suren Baghdasaryan
                   ` (2 preceding siblings ...)
  2024-11-20  0:08 ` [PATCH v4 3/5] mm: mark vma as detached until it's added into vma tree Suren Baghdasaryan
@ 2024-11-20  0:08 ` Suren Baghdasaryan
  2024-11-20  4:36   ` Matthew Wilcox
  2024-11-20 10:16   ` Vlastimil Babka
  2024-11-20  0:08 ` [PATCH v4 5/5] docs/mm: document latest changes to vm_lock Suren Baghdasaryan
  2024-11-20 22:10 ` [PATCH v4 0/5] move per-vma lock into vm_area_struct Shakeel Butt
  5 siblings, 2 replies; 27+ messages in thread
From: Suren Baghdasaryan @ 2024-11-20  0:08 UTC (permalink / raw)
  To: akpm
  Cc: willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka, hannes,
	mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, minchan, jannh,
	shakeel.butt, souravpanda, pasha.tatashin, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team, surenb

To enable SLAB_TYPESAFE_BY_RCU for the vma cache we need to ensure that
object reuse before the RCU grace period is over will be detected inside
lock_vma_under_rcu().
lock_vma_under_rcu() enters RCU read section, finds the vma at the
given address, locks the vma and checks if it got detached or remapped
to cover a different address range. These last checks are there
to ensure that the vma was not modified after we found it but before
locking it.
vma reuse introduces several new possibilities:
1. vma can be reused after it was found but before it is locked;
2. vma can be reused and reinitialized (including changing its vm_mm)
while being locked in vma_start_read();
3. vma can be reused and reinitialized after it was found but before
it is locked, then attached at a new address or to a new mm while being
read-locked.
For case #1, the current checks help detect the cases where:
- vma was reused but not yet added into the tree (detached check);
- vma was reused at a different address range (address check);
We are missing the check for vm_mm to ensure the reused vma was not
attached to a different mm. This patch adds the missing check.
For case #2, we pass mm to vma_start_read() to prevent access to
unstable vma->vm_mm.
For case #3, we ensure the order in which vma->detached flag and
vm_start/vm_end/vm_mm are set and checked. vma gets attached after
vm_start/vm_end/vm_mm were set and lock_vma_under_rcu() should check
vma->detached before checking vm_start/vm_end/vm_mm. This is required
because attaching vma happens without vma write-lock, as opposed to
vma detaching, which requires vma write-lock. This patch adds memory
barriers inside is_vma_detached() and vma_mark_attached() needed to
order reads and writes to vma->detached vs vm_start/vm_end/vm_mm.
After these provisions, SLAB_TYPESAFE_BY_RCU is added to vm_area_cachep.
This will facilitate vm_area_struct reuse and will minimize the number
of call_rcu() calls.
Adding a freeptr_t into vm_area_struct (unioned with vm_start/vm_end)
could be used to avoid bloating the structure; however, custom free
pointers are currently not supported in combination with a ctor
(see the comment for kmem_cache_args.freeptr_offset).
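
To make the case #3 ordering easier to follow, here is a minimal userspace
sketch of the publish/observe pattern described above. This is not kernel
code: the names are simplified and C11 fences stand in for smp_wmb() and
smp_rmb(); the real helpers are vma_mark_attached()/is_vma_detached() in
the diff below.

    #include <stdatomic.h>
    #include <stdbool.h>

    struct vma_sketch {
            unsigned long vm_start, vm_end;
            void *vm_mm;
            atomic_bool detached;
    };

    /* attach side: publish the fields, only then clear "detached" */
    static void mark_attached(struct vma_sketch *vma, void *mm,
                              unsigned long start, unsigned long end)
    {
            vma->vm_mm = mm;
            vma->vm_start = start;
            vma->vm_end = end;
            atomic_thread_fence(memory_order_release);  /* like smp_wmb() */
            atomic_store_explicit(&vma->detached, false, memory_order_relaxed);
    }

    /* reader side: check "detached" first, only then trust the fields */
    static bool covers(struct vma_sketch *vma, void *mm, unsigned long addr)
    {
            if (atomic_load_explicit(&vma->detached, memory_order_relaxed))
                    return false;
            atomic_thread_fence(memory_order_acquire);  /* like smp_rmb() */
            return vma->vm_mm == mm &&
                   addr >= vma->vm_start && addr < vma->vm_end;
    }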

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 include/linux/mm.h               | 60 +++++++++++++++++++++++++++-----
 include/linux/mm_types.h         | 13 +++----
 kernel/fork.c                    | 53 +++++++++++++++++-----------
 mm/memory.c                      | 15 +++++---
 mm/vma.c                         |  2 +-
 tools/testing/vma/vma_internal.h |  7 ++--
 6 files changed, 103 insertions(+), 47 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index dd1b6190df28..2a4794b7a513 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -257,7 +257,7 @@ struct vm_area_struct *vm_area_alloc(struct mm_struct *);
 struct vm_area_struct *vm_area_dup(struct vm_area_struct *);
 void vm_area_free(struct vm_area_struct *);
 /* Use only if VMA has no other users */
-void __vm_area_free(struct vm_area_struct *vma);
+void vm_area_free_unreachable(struct vm_area_struct *vma);
 
 #ifndef CONFIG_MMU
 extern struct rb_root nommu_region_tree;
@@ -690,12 +690,32 @@ static inline void vma_lock_init(struct vm_area_struct *vma)
 	vma->vm_lock_seq = UINT_MAX;
 }
 
+#define VMA_BEFORE_LOCK		offsetof(struct vm_area_struct, vm_lock)
+#define VMA_LOCK_END(vma)	\
+	(((void *)(vma)) + offsetofend(struct vm_area_struct, vm_lock))
+#define VMA_AFTER_LOCK		\
+	(sizeof(struct vm_area_struct) - offsetofend(struct vm_area_struct, vm_lock))
+
+static inline void vma_clear(struct vm_area_struct *vma)
+{
+	/* Preserve vma->vm_lock */
+	memset(vma, 0, VMA_BEFORE_LOCK);
+	memset(VMA_LOCK_END(vma), 0, VMA_AFTER_LOCK);
+}
+
+static inline void vma_copy(struct vm_area_struct *new, struct vm_area_struct *orig)
+{
+	/* Preserve vma->vm_lock */
+	data_race(memcpy(new, orig, VMA_BEFORE_LOCK));
+	data_race(memcpy(VMA_LOCK_END(new), VMA_LOCK_END(orig), VMA_AFTER_LOCK));
+}
+
 /*
  * Try to read-lock a vma. The function is allowed to occasionally yield false
  * locked result to avoid performance overhead, in which case we fall back to
  * using mmap_lock. The function should never yield false unlocked result.
  */
-static inline bool vma_start_read(struct vm_area_struct *vma)
+static inline bool vma_start_read(struct mm_struct *mm, struct vm_area_struct *vma)
 {
 	/*
 	 * Check before locking. A race might cause false locked result.
@@ -704,7 +724,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
 	 * we don't rely on for anything - the mm_lock_seq read against which we
 	 * need ordering is below.
 	 */
-	if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq.sequence))
+	if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(mm->mm_lock_seq.sequence))
 		return false;
 
 	if (unlikely(down_read_trylock(&vma->vm_lock.lock) == 0))
@@ -721,7 +741,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
 	 * after it has been unlocked.
 	 * This pairs with RELEASE semantics in vma_end_write_all().
 	 */
-	if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&vma->vm_mm->mm_lock_seq))) {
+	if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&mm->mm_lock_seq))) {
 		up_read(&vma->vm_lock.lock);
 		return false;
 	}
@@ -810,7 +830,15 @@ static inline void vma_assert_locked(struct vm_area_struct *vma)
 
 static inline void vma_mark_attached(struct vm_area_struct *vma)
 {
-	vma->detached = false;
+	/*
+	 * This pairs with smp_rmb() inside is_vma_detached().
+	 * vma is marked attached after all vma modifications are done and it
+	 * got added into the vma tree. All prior vma modifications should be
+	 * made visible before marking the vma attached.
+	 */
+	smp_wmb();
+	/* This pairs with READ_ONCE() in is_vma_detached(). */
+	WRITE_ONCE(vma->detached, false);
 }
 
 static inline void vma_mark_detached(struct vm_area_struct *vma)
@@ -822,7 +850,18 @@ static inline void vma_mark_detached(struct vm_area_struct *vma)
 
 static inline bool is_vma_detached(struct vm_area_struct *vma)
 {
-	return vma->detached;
+	bool detached;
+
+	/* This pairs with WRITE_ONCE() in vma_mark_attached(). */
+	detached = READ_ONCE(vma->detached);
+	/*
+	 * This pairs with smp_wmb() inside vma_mark_attached() to ensure
+	 * vma->detached is read before vma attributes read later inside
+	 * lock_vma_under_rcu().
+	 */
+	smp_rmb();
+
+	return detached;
 }
 
 static inline void release_fault_lock(struct vm_fault *vmf)
@@ -847,7 +886,11 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
 #else /* CONFIG_PER_VMA_LOCK */
 
 static inline void vma_lock_init(struct vm_area_struct *vma) {}
-static inline bool vma_start_read(struct vm_area_struct *vma)
+static inline void vma_clear(struct vm_area_struct *vma)
+		{ memset(vma, 0, sizeof(*vma)); }
+static inline void vma_copy(struct vm_area_struct *new, struct vm_area_struct *orig)
+		{ data_race(memcpy(new, orig, sizeof(*new))); }
+static inline bool vma_start_read(struct mm_struct *mm, struct vm_area_struct *vma)
 		{ return false; }
 static inline void vma_end_read(struct vm_area_struct *vma) {}
 static inline void vma_start_write(struct vm_area_struct *vma) {}
@@ -883,7 +926,7 @@ extern const struct vm_operations_struct vma_dummy_vm_ops;
 
 static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
 {
-	memset(vma, 0, sizeof(*vma));
+	vma_clear(vma);
 	vma->vm_mm = mm;
 	vma->vm_ops = &vma_dummy_vm_ops;
 	INIT_LIST_HEAD(&vma->anon_vma_chain);
@@ -892,7 +935,6 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
 	vma->detached = true;
 #endif
 	vma_numab_state_init(vma);
-	vma_lock_init(vma);
 }
 
 /* Use when VMA is not part of the VMA tree and needs no locking */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 5c4bfdcfac72..8f6b0c935c2b 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -667,15 +667,10 @@ struct vma_numab_state {
 struct vm_area_struct {
 	/* The first cache line has the info for VMA tree walking. */
 
-	union {
-		struct {
-			/* VMA covers [vm_start; vm_end) addresses within mm */
-			unsigned long vm_start;
-			unsigned long vm_end;
-		};
-#ifdef CONFIG_PER_VMA_LOCK
-		struct rcu_head vm_rcu;	/* Used for deferred freeing. */
-#endif
+	struct {
+		/* VMA covers [vm_start; vm_end) addresses within mm */
+		unsigned long vm_start;
+		unsigned long vm_end;
 	};
 
 	/*
diff --git a/kernel/fork.c b/kernel/fork.c
index f0cec673583c..76c68b041f8a 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -436,6 +436,11 @@ static struct kmem_cache *vm_area_cachep;
 /* SLAB cache for mm_struct structures (tsk->mm) */
 static struct kmem_cache *mm_cachep;
 
+static void vm_area_ctor(void *data)
+{
+	vma_lock_init(data);
+}
+
 struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
 {
 	struct vm_area_struct *vma;
@@ -462,8 +467,7 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
 	 * orig->shared.rb may be modified concurrently, but the clone
 	 * will be reinitialized.
 	 */
-	data_race(memcpy(new, orig, sizeof(*new)));
-	vma_lock_init(new);
+	vma_copy(new, orig);
 	INIT_LIST_HEAD(&new->anon_vma_chain);
 #ifdef CONFIG_PER_VMA_LOCK
 	/* vma is not locked, can't use vma_mark_detached() */
@@ -475,32 +479,37 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
 	return new;
 }
 
-void __vm_area_free(struct vm_area_struct *vma)
+static void __vm_area_free(struct vm_area_struct *vma, bool unreachable)
 {
+#ifdef CONFIG_PER_VMA_LOCK
+	/*
+	 * With SLAB_TYPESAFE_BY_RCU, vma can be reused and we need
+	 * vma->detached to be set before vma is returned into the cache.
+	 * This way reused object won't be used by readers until it's
+	 * initialized and reattached.
+	 * If vma is unreachable, there can be no other users and we
+	 * can set vma->detached directly with no risk of a race.
+	 * If vma is reachable, then it should have been already detached
+	 * under vma write-lock or it was never attached.
+	 */
+	if (unreachable)
+		vma->detached = true;
+	else
+		VM_BUG_ON_VMA(!is_vma_detached(vma), vma);
+#endif
 	vma_numab_state_free(vma);
 	free_anon_vma_name(vma);
 	kmem_cache_free(vm_area_cachep, vma);
 }
 
-#ifdef CONFIG_PER_VMA_LOCK
-static void vm_area_free_rcu_cb(struct rcu_head *head)
+void vm_area_free(struct vm_area_struct *vma)
 {
-	struct vm_area_struct *vma = container_of(head, struct vm_area_struct,
-						  vm_rcu);
-
-	/* The vma should not be locked while being destroyed. */
-	VM_BUG_ON_VMA(rwsem_is_locked(&vma->vm_lock.lock), vma);
-	__vm_area_free(vma);
+	__vm_area_free(vma, false);
 }
-#endif
 
-void vm_area_free(struct vm_area_struct *vma)
+void vm_area_free_unreachable(struct vm_area_struct *vma)
 {
-#ifdef CONFIG_PER_VMA_LOCK
-	call_rcu(&vma->vm_rcu, vm_area_free_rcu_cb);
-#else
-	__vm_area_free(vma);
-#endif
+	__vm_area_free(vma, true);
 }
 
 static void account_kernel_stack(struct task_struct *tsk, int account)
@@ -3135,9 +3144,11 @@ void __init proc_caches_init(void)
 			sizeof(struct fs_struct), 0,
 			SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT,
 			NULL);
-	vm_area_cachep = KMEM_CACHE(vm_area_struct,
-			SLAB_HWCACHE_ALIGN|SLAB_NO_MERGE|SLAB_PANIC|
-			SLAB_ACCOUNT);
+	vm_area_cachep = kmem_cache_create("vm_area_struct",
+			sizeof(struct vm_area_struct), 0,
+			SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_TYPESAFE_BY_RCU|
+			SLAB_ACCOUNT, vm_area_ctor);
+
 	mmap_init();
 	nsproxy_cache_init();
 }
diff --git a/mm/memory.c b/mm/memory.c
index d0197a0c0996..b5fbc71b46bd 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6275,10 +6275,16 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
 	if (!vma)
 		goto inval;
 
-	if (!vma_start_read(vma))
+	if (!vma_start_read(mm, vma))
 		goto inval;
 
-	/* Check if the VMA got isolated after we found it */
+	/*
+	 * Check if the VMA got isolated after we found it.
+	 * Note: vma we found could have been recycled and is being reattached.
+	 * It's possible to attach a vma while it is read-locked, however a
+	 * read-locked vma can't be detached (detaching requires write-locking).
+	 * Therefore if this check passes, we have an attached and stable vma.
+	 */
 	if (is_vma_detached(vma)) {
 		vma_end_read(vma);
 		count_vm_vma_lock_event(VMA_LOCK_MISS);
@@ -6292,8 +6298,9 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
 	 * fields are accessible for RCU readers.
 	 */
 
-	/* Check since vm_start/vm_end might change before we lock the VMA */
-	if (unlikely(address < vma->vm_start || address >= vma->vm_end))
+	/* Check if the vma we locked is the right one. */
+	if (unlikely(vma->vm_mm != mm ||
+		     address < vma->vm_start || address >= vma->vm_end))
 		goto inval_end_read;
 
 	rcu_read_unlock();
diff --git a/mm/vma.c b/mm/vma.c
index 73104d434567..050b83df3df2 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -382,7 +382,7 @@ void remove_vma(struct vm_area_struct *vma, bool unreachable)
 		fput(vma->vm_file);
 	mpol_put(vma_policy(vma));
 	if (unreachable)
-		__vm_area_free(vma);
+		vm_area_free_unreachable(vma);
 	else
 		vm_area_free(vma);
 }
diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
index 2fed366d20ef..fd668d6cafc0 100644
--- a/tools/testing/vma/vma_internal.h
+++ b/tools/testing/vma/vma_internal.h
@@ -632,14 +632,15 @@ static inline void mpol_put(struct mempolicy *)
 {
 }
 
-static inline void __vm_area_free(struct vm_area_struct *vma)
+static inline void vm_area_free(struct vm_area_struct *vma)
 {
 	free(vma);
 }
 
-static inline void vm_area_free(struct vm_area_struct *vma)
+static inline void vm_area_free_unreachable(struct vm_area_struct *vma)
 {
-	__vm_area_free(vma);
+	vma->detached = true;
+	vm_area_free(vma);
 }
 
 static inline void lru_add_drain(void)
-- 
2.47.0.338.g60cca15819-goog




* [PATCH v4 5/5] docs/mm: document latest changes to vm_lock
  2024-11-20  0:08 [PATCH v4 0/5] move per-vma lock into vm_area_struct Suren Baghdasaryan
                   ` (3 preceding siblings ...)
  2024-11-20  0:08 ` [PATCH v4 4/5] mm: make vma cache SLAB_TYPESAFE_BY_RCU Suren Baghdasaryan
@ 2024-11-20  0:08 ` Suren Baghdasaryan
  2024-11-20 22:10 ` [PATCH v4 0/5] move per-vma lock into vm_area_struct Shakeel Butt
  5 siblings, 0 replies; 27+ messages in thread
From: Suren Baghdasaryan @ 2024-11-20  0:08 UTC (permalink / raw)
  To: akpm
  Cc: willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka, hannes,
	mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, minchan, jannh,
	shakeel.butt, souravpanda, pasha.tatashin, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team, surenb

Change the documentation to reflect that vm_lock is integrated into vma.
Document newly introduced vma_start_read_locked{_nested} functions.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 Documentation/mm/process_addrs.rst | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/Documentation/mm/process_addrs.rst b/Documentation/mm/process_addrs.rst
index 1bf7ad010fc0..a18450b6496d 100644
--- a/Documentation/mm/process_addrs.rst
+++ b/Documentation/mm/process_addrs.rst
@@ -686,7 +686,11 @@ calls :c:func:`!rcu_read_lock` to ensure that the VMA is looked up in an RCU
 critical section, then attempts to VMA lock it via :c:func:`!vma_start_read`,
 before releasing the RCU lock via :c:func:`!rcu_read_unlock`.
 
-VMA read locks hold the read lock on the :c:member:`!vma->vm_lock` semaphore for
+In cases when the user already holds mmap read lock, :c:func:`!vma_start_read_locked`
+and :c:func:`!vma_start_read_locked_nested` can be used. These functions always
+succeed in acquiring VMA read lock.
+
+VMA read locks hold the read lock on the :c:member:`!vma.vm_lock` semaphore for
 their duration and the caller of :c:func:`!lock_vma_under_rcu` must release it
 via :c:func:`!vma_end_read`.
 
@@ -750,7 +754,7 @@ keep VMAs locked across entirely separate write operations. It also maintains
 correct lock ordering.
 
 Each time a VMA read lock is acquired, we acquire a read lock on the
-:c:member:`!vma->vm_lock` read/write semaphore and hold it, while checking that
+:c:member:`!vma.vm_lock` read/write semaphore and hold it, while checking that
 the sequence count of the VMA does not match that of the mm.
 
 If it does, the read lock fails. If it does not, we hold the lock, excluding
@@ -760,7 +764,7 @@ Importantly, maple tree operations performed in :c:func:`!lock_vma_under_rcu`
 are also RCU safe, so the whole read lock operation is guaranteed to function
 correctly.
 
-On the write side, we acquire a write lock on the :c:member:`!vma->vm_lock`
+On the write side, we acquire a write lock on the :c:member:`!vma.vm_lock`
 read/write semaphore, before setting the VMA's sequence number under this lock,
 also simultaneously holding the mmap write lock.
 
-- 
2.47.0.338.g60cca15819-goog




* Re: [PATCH v4 4/5] mm: make vma cache SLAB_TYPESAFE_BY_RCU
  2024-11-20  0:08 ` [PATCH v4 4/5] mm: make vma cache SLAB_TYPESAFE_BY_RCU Suren Baghdasaryan
@ 2024-11-20  4:36   ` Matthew Wilcox
  2024-11-20  6:37     ` Suren Baghdasaryan
  2024-11-20 10:16   ` Vlastimil Babka
  1 sibling, 1 reply; 27+ messages in thread
From: Matthew Wilcox @ 2024-11-20  4:36 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, liam.howlett, lorenzo.stoakes, mhocko, vbabka, hannes,
	mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, minchan, jannh,
	shakeel.butt, souravpanda, pasha.tatashin, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team

On Tue, Nov 19, 2024 at 04:08:25PM -0800, Suren Baghdasaryan wrote:
> +static inline void vma_clear(struct vm_area_struct *vma)
> +{
> +	/* Preserve vma->vm_lock */
> +	memset(vma, 0, VMA_BEFORE_LOCK);
> +	memset(VMA_LOCK_END(vma), 0, VMA_AFTER_LOCK);
> +}

This isn't how you're supposed to handle constructors.  You've fixed
the immediate problem rather than writing the code in the intended style.

> +static void vm_area_ctor(void *data)
> +{
> +	vma_lock_init(data);
> +}

After the ctor has run, the object should be in the same state as
it is after it's freed.  If you want to memset the entire thing
then you can do it in the ctor.  But there should be no need to
do it in vma_init().

And there's lots of things you can move from vma_init() to the ctor.
For example, at free time, anon_vma_chain should be an empty list.
So if you init it in the ctor, you can avoid doing it in vma_init().
I'd suggest that vma_numab_state_free() should be the place which
sets vma->numab_state to NULL and we can delete vma_numab_state_init()
entirely.
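
For illustration, a rough sketch of the style described above (one possible
reading of the suggestion, not code from the posted series): the ctor
establishes the invariants that every free object must satisfy, and the
free paths restore them, so vma_init() no longer repeats that work on each
allocation.

    static void vm_area_ctor(void *data)
    {
            struct vm_area_struct *vma = data;

            /* runs once per object, when the slab page is allocated */
            memset(vma, 0, sizeof(*vma));
            vma_lock_init(vma);
            INIT_LIST_HEAD(&vma->anon_vma_chain);
    }

    /* free-path helper returning the field to its ctor-time state */
    static void vma_numab_state_free(struct vm_area_struct *vma)
    {
            kfree(vma->numab_state);
            vma->numab_state = NULL;
    }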




* Re: [PATCH v4 4/5] mm: make vma cache SLAB_TYPESAFE_BY_RCU
  2024-11-20  4:36   ` Matthew Wilcox
@ 2024-11-20  6:37     ` Suren Baghdasaryan
  2024-11-22 22:43       ` Suren Baghdasaryan
  0 siblings, 1 reply; 27+ messages in thread
From: Suren Baghdasaryan @ 2024-11-20  6:37 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: akpm, liam.howlett, lorenzo.stoakes, mhocko, vbabka, hannes,
	mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, minchan, jannh,
	shakeel.butt, souravpanda, pasha.tatashin, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team

On Tue, Nov 19, 2024 at 8:36 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Tue, Nov 19, 2024 at 04:08:25PM -0800, Suren Baghdasaryan wrote:
> > +static inline void vma_clear(struct vm_area_struct *vma)
> > +{
> > +     /* Preserve vma->vm_lock */
> > +     memset(vma, 0, VMA_BEFORE_LOCK);
> > +     memset(VMA_LOCK_END(vma), 0, VMA_AFTER_LOCK);
> > +}
>
> This isn't how you're supposed to handle constructors.  You've fixed
> the immediate problem rather than writing the code in the intended style.

Yeah, I don't like this myself but the only alternative I can think of
is to set the struct members individually.

>
> > +static void vm_area_ctor(void *data)
> > +{
> > +     vma_lock_init(data);
> > +}
>
> After the ctor has run, the object should be in the same state as
> it is after it's freed.  If you want to memset the entire thing
> then you can do it in the ctor.  But there should be no need to
> do it in vma_init().

IIUC, your suggestion is to memset() the vma and initialize vm_lock
inside the ctor. Then when it's time to free the vma, we reset all
members except vm_lock before freeing the vma. As you mention later,
members like anon_vma_chain, which are already clear, also won't need
to be reset at this point. Am I understanding your proposal correctly?

BTW, if so, then vma_copy() will have to also copy vma members individually.

>
> And there's lots of things you can move from vma_init() to the ctor.
> For example, at free time, anon_vma_chain should be an empty list.
> So if you init it in the ctor, you can avoid doing it in vma_init().

True.

> I'd suggest that vma_numab_state_free() should be the place which
> sets vma->numab_state to NULL and we can delete vma_numab_state_init()
> entirely.

Sounds good to me.

Please confirm if I correctly got your idea and I'll update this patch.
Thanks for the feedback!

>



* Re: [PATCH v4 4/5] mm: make vma cache SLAB_TYPESAFE_BY_RCU
  2024-11-20  0:08 ` [PATCH v4 4/5] mm: make vma cache SLAB_TYPESAFE_BY_RCU Suren Baghdasaryan
  2024-11-20  4:36   ` Matthew Wilcox
@ 2024-11-20 10:16   ` Vlastimil Babka
  2024-11-20 15:54     ` Suren Baghdasaryan
  1 sibling, 1 reply; 27+ messages in thread
From: Vlastimil Babka @ 2024-11-20 10:16 UTC (permalink / raw)
  To: Suren Baghdasaryan, akpm
  Cc: willy, liam.howlett, lorenzo.stoakes, mhocko, hannes, mjguzik,
	oliver.sang, mgorman, david, peterx, oleg, dave, paulmck,
	brauner, dhowells, hdanton, hughd, minchan, jannh, shakeel.butt,
	souravpanda, pasha.tatashin, corbet, linux-doc, linux-mm,
	linux-kernel, kernel-team

On 11/20/24 01:08, Suren Baghdasaryan wrote:
> To enable SLAB_TYPESAFE_BY_RCU for vma cache we need to ensure that
> object reuse before RCU grace period is over will be detected inside
> lock_vma_under_rcu().
> lock_vma_under_rcu() enters RCU read section, finds the vma at the
> given address, locks the vma and checks if it got detached or remapped
> to cover a different address range. These last checks are there
> to ensure that the vma was not modified after we found it but before
> locking it.
> vma reuse introduces several new possibilities:
> 1. vma can be reused after it was found but before it is locked;
> 2. vma can be reused and reinitialized (including changing its vm_mm)
> while being locked in vma_start_read();
> 3. vma can be reused and reinitialized after it was found but before
> it is locked, then attached at a new address or to a new mm while being
> read-locked;
> For case #1 current checks will help detecting cases when:
> - vma was reused but not yet added into the tree (detached check)
> - vma was reused at a different address range (address check);
> We are missing the check for vm_mm to ensure the reused vma was not
> attached to a different mm. This patch adds the missing check.
> For case #2, we pass mm to vma_start_read() to prevent access to
> unstable vma->vm_mm.

So we may now be looking at different mm's mm_lock_seq.sequence and return a
false unlocked result, right? I guess the mm validation in
lock_vma_under_rcu() handles that, but maybe the comment of vma_start_read()
needs updating.

> For case #3, we ensure the order in which vma->detached flag and
> vm_start/vm_end/vm_mm are set and checked. vma gets attached after
> vm_start/vm_end/vm_mm were set and lock_vma_under_rcu() should check
> vma->detached before checking vm_start/vm_end/vm_mm. This is required
> because attaching vma happens without vma write-lock, as opposed to
> vma detaching, which requires vma write-lock. This patch adds memory
> barriers inside is_vma_detached() and vma_mark_attached() needed to
> order reads and writes to vma->detached vs vm_start/vm_end/vm_mm.
> After these provisions, SLAB_TYPESAFE_BY_RCU is added to vm_area_cachep.
> This will facilitate vm_area_struct reuse and will minimize the number
> of call_rcu() calls.
> Adding a freeptr_t into vm_area_struct (unioned with vm_start/vm_end)
> could be used to avoids bloating the structure, however currently
> custom free pointers are not supported in combination with a ctor
> (see the comment for kmem_cache_args.freeptr_offset).

I think there's nothing fundamental preventing to support that, there was
just no user of it. We can do it later.

> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -436,6 +436,11 @@ static struct kmem_cache *vm_area_cachep;
>  /* SLAB cache for mm_struct structures (tsk->mm) */
>  static struct kmem_cache *mm_cachep;
>  
> +static void vm_area_ctor(void *data)
> +{
> +	vma_lock_init(data);
> +}
> +
>  struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
>  {
>  	struct vm_area_struct *vma;
> @@ -462,8 +467,7 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
>  	 * orig->shared.rb may be modified concurrently, but the clone
>  	 * will be reinitialized.
>  	 */
> -	data_race(memcpy(new, orig, sizeof(*new)));
> -	vma_lock_init(new);
> +	vma_copy(new, orig);
>  	INIT_LIST_HEAD(&new->anon_vma_chain);
>  #ifdef CONFIG_PER_VMA_LOCK
>  	/* vma is not locked, can't use vma_mark_detached() */

Here we mark it detached but we might have already copied it as attached and
confused a reader?

I think this will be covered by what you said in reply to willy:
"vma_copy() will have to also copy vma members individually."

> @@ -475,32 +479,37 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
>  	return new;
>  }
>  



* Re: [PATCH v4 4/5] mm: make vma cache SLAB_TYPESAFE_BY_RCU
  2024-11-20 10:16   ` Vlastimil Babka
@ 2024-11-20 15:54     ` Suren Baghdasaryan
  0 siblings, 0 replies; 27+ messages in thread
From: Suren Baghdasaryan @ 2024-11-20 15:54 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: akpm, willy, liam.howlett, lorenzo.stoakes, mhocko, hannes,
	mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, minchan, jannh,
	shakeel.butt, souravpanda, pasha.tatashin, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team

On Wed, Nov 20, 2024 at 2:16 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 11/20/24 01:08, Suren Baghdasaryan wrote:
> > To enable SLAB_TYPESAFE_BY_RCU for vma cache we need to ensure that
> > object reuse before RCU grace period is over will be detected inside
> > lock_vma_under_rcu().
> > lock_vma_under_rcu() enters RCU read section, finds the vma at the
> > given address, locks the vma and checks if it got detached or remapped
> > to cover a different address range. These last checks are there
> > to ensure that the vma was not modified after we found it but before
> > locking it.
> > vma reuse introduces several new possibilities:
> > 1. vma can be reused after it was found but before it is locked;
> > 2. vma can be reused and reinitialized (including changing its vm_mm)
> > while being locked in vma_start_read();
> > 3. vma can be reused and reinitialized after it was found but before
> > it is locked, then attached at a new address or to a new mm while being
> > read-locked;
> > For case #1, current checks will help detect cases when:
> > - vma was reused but not yet added into the tree (detached check)
> > - vma was reused at a different address range (address check);
> > We are missing the check for vm_mm to ensure the reused vma was not
> > attached to a different mm. This patch adds the missing check.
> > For case #2, we pass mm to vma_start_read() to prevent access to
> > unstable vma->vm_mm.
>
> So we may now be looking at a different mm's mm_lock_seq.sequence and return a
> false unlocked result, right? I guess the mm validation in
> lock_vma_under_rcu() handles that, but maybe the comment of vma_start_read()
> needs updating.

Correct. I'll add a comment about this.
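
Something along these lines, perhaps (wording is only a sketch):

/*
 * Try to read-lock a vma. The function can occasionally return a false
 * locked result, and, with vma reuse, a false unlocked result when the
 * sequence count it compares against belongs to a different mm that the
 * vma has been reused under. Callers such as lock_vma_under_rcu() must
 * therefore re-check vma->detached, vma->vm_mm and the address range
 * after a successful return.
 */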

>
> > For case #3, we ensure the order in which vma->detached flag and
> > vm_start/vm_end/vm_mm are set and checked. vma gets attached after
> > vm_start/vm_end/vm_mm were set and lock_vma_under_rcu() should check
> > vma->detached before checking vm_start/vm_end/vm_mm. This is required
> > because attaching vma happens without vma write-lock, as opposed to
> > vma detaching, which requires vma write-lock. This patch adds memory
> > barriers inside is_vma_detached() and vma_mark_attached() needed to
> > order reads and writes to vma->detached vs vm_start/vm_end/vm_mm.
> > After these provisions, SLAB_TYPESAFE_BY_RCU is added to vm_area_cachep.
> > This will facilitate vm_area_struct reuse and will minimize the number
> > of call_rcu() calls.
> > Adding a freeptr_t into vm_area_struct (unioned with vm_start/vm_end)
> > could be used to avoid bloating the structure; however, currently
> > custom free pointers are not supported in combination with a ctor
> > (see the comment for kmem_cache_args.freeptr_offset).
>
> I think there's nothing fundamental preventing us from supporting that; there
> was just no user of it. We can do it later.

Oh, ok. I can add it back so that we have one user and then when the
mechanism is implemented it can be used for testing. Adding freeptr_t
has no negative effects and will reduce later churn.
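
For reference, the freeptr_t layout discussed in the commit message would
look roughly like this (illustrative; exact members and flags assumed):

struct vm_area_struct {
        union {
                struct {
                        /* VMA covers [vm_start; vm_end) within the mm */
                        unsigned long vm_start;
                        unsigned long vm_end;
                };
                freeptr_t vm_freeptr;   /* used by SLAB while the vma is free */
        };
        /* ... */
};

/* and, once a custom free pointer can be combined with a ctor: */
struct kmem_cache_args args = {
        .use_freeptr_offset = true,
        .freeptr_offset = offsetof(struct vm_area_struct, vm_freeptr),
        .ctor = vm_area_ctor,
};
vm_area_cachep = kmem_cache_create("vm_area_struct",
                                   sizeof(struct vm_area_struct), &args,
                                   SLAB_HWCACHE_ALIGN | SLAB_PANIC |
                                   SLAB_TYPESAFE_BY_RCU | SLAB_ACCOUNT);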

>
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -436,6 +436,11 @@ static struct kmem_cache *vm_area_cachep;
> >  /* SLAB cache for mm_struct structures (tsk->mm) */
> >  static struct kmem_cache *mm_cachep;
> >
> > +static void vm_area_ctor(void *data)
> > +{
> > +     vma_lock_init(data);
> > +}
> > +
> >  struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
> >  {
> >       struct vm_area_struct *vma;
> > @@ -462,8 +467,7 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
> >        * orig->shared.rb may be modified concurrently, but the clone
> >        * will be reinitialized.
> >        */
> > -     data_race(memcpy(new, orig, sizeof(*new)));
> > -     vma_lock_init(new);
> > +     vma_copy(new, orig);
> >       INIT_LIST_HEAD(&new->anon_vma_chain);
> >  #ifdef CONFIG_PER_VMA_LOCK
> >       /* vma is not locked, can't use vma_mark_detached() */
>
> Here we mark it detached but we might have already copied it as attached and
> confused a reader?

Very true. Thanks for catching this one!

>
> I think this will be covered by what you said in reply to willy:
> "vma_copy() will have to also copy vma members individually."

Yes, I think so. vma_copy() will need to copy most but not all
members. vma->detached will be among those not copied.
Thanks!

>
> > @@ -475,32 +479,37 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
> >       return new;
> >  }
> >


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 0/5] move per-vma lock into vm_area_struct
  2024-11-20  0:08 [PATCH v4 0/5] move per-vma lock into vm_area_struct Suren Baghdasaryan
                   ` (4 preceding siblings ...)
  2024-11-20  0:08 ` [PATCH v4 5/5] docs/mm: document latest changes to vm_lock Suren Baghdasaryan
@ 2024-11-20 22:10 ` Shakeel Butt
  2024-11-20 23:52   ` Suren Baghdasaryan
  2024-11-21  2:00   ` Matthew Wilcox
  5 siblings, 2 replies; 27+ messages in thread
From: Shakeel Butt @ 2024-11-20 22:10 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
	hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, minchan, jannh,
	souravpanda, pasha.tatashin, corbet, linux-doc, linux-mm,
	linux-kernel, kernel-team

On Tue, Nov 19, 2024 at 04:08:21PM -0800, Suren Baghdasaryan wrote:
> Back when per-vma locks were introduced, vm_lock was moved out of
> vm_area_struct in [1] because of the performance regression caused by
> false cacheline sharing. Recent investigation [2] revealed that the
> regression is limited to a rather old Broadwell microarchitecture and
> even there it can be mitigated by disabling adjacent cacheline
> prefetching, see [3].

If 'struct vm_area_struct' is prone to performance issues due to
cacheline misalignment, then we should do something about the
__randomize_layout tag for it. I imagine we can identify the fields
which are performance critical and need to be on the same cacheline (or
on separate cachelines to avoid false sharing), then divide the fields
into different cacheline groups and randomize the layout only within
each group. WDYT?



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 1/5] mm: introduce vma_start_read_locked{_nested} helpers
  2024-11-20  0:08 ` [PATCH v4 1/5] mm: introduce vma_start_read_locked{_nested} helpers Suren Baghdasaryan
@ 2024-11-20 22:11   ` Shakeel Butt
  0 siblings, 0 replies; 27+ messages in thread
From: Shakeel Butt @ 2024-11-20 22:11 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
	hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, minchan, jannh,
	souravpanda, pasha.tatashin, corbet, linux-doc, linux-mm,
	linux-kernel, kernel-team

On Tue, Nov 19, 2024 at 04:08:22PM -0800, Suren Baghdasaryan wrote:
> Introduce helper functions which can be used to read-lock a VMA when
> holding mmap_lock for read. Replace direct accesses to vma->vm_lock
> with these new helpers.
> 
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>

Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 2/5] mm: move per-vma lock into vm_area_struct
  2024-11-20  0:08 ` [PATCH v4 2/5] mm: move per-vma lock into vm_area_struct Suren Baghdasaryan
@ 2024-11-20 23:32   ` Shakeel Butt
  2024-11-20 23:44     ` Suren Baghdasaryan
  0 siblings, 1 reply; 27+ messages in thread
From: Shakeel Butt @ 2024-11-20 23:32 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
	hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, minchan, jannh,
	souravpanda, pasha.tatashin, corbet, linux-doc, linux-mm,
	linux-kernel, kernel-team

On Tue, Nov 19, 2024 at 04:08:23PM -0800, Suren Baghdasaryan wrote:
> Back when per-vma locks were introduced, vm_lock was moved out of
> vm_area_struct in [1] because of the performance regression caused by
> false cacheline sharing. Recent investigation [2] revealed that the
> regression is limited to a rather old Broadwell microarchitecture and
> even there it can be mitigated by disabling adjacent cacheline
> prefetching, see [3].
> Splitting single logical structure into multiple ones leads to more
> complicated management, extra pointer dereferences and overall less
> maintainable code. When that split-away part is a lock, it complicates
> things even further. With no performance benefits, there are no reasons
> for this split. Merging the vm_lock back into vm_area_struct also allows
> vm_area_struct to use SLAB_TYPESAFE_BY_RCU later in this patchset.
> Move vm_lock back into vm_area_struct, aligning it at the cacheline
> boundary and changing the cache to be cacheline-aligned as well.
> With kernel compiled using defconfig, this causes VMA memory consumption
> to grow from 160 (vm_area_struct) + 40 (vm_lock) bytes to 256 bytes:
> 
>     slabinfo before:
>      <name>           ... <objsize> <objperslab> <pagesperslab> : ...
>      vma_lock         ...     40  102    1 : ...
>      vm_area_struct   ...    160   51    2 : ...
> 
>     slabinfo after moving vm_lock:
>      <name>           ... <objsize> <objperslab> <pagesperslab> : ...
>      vm_area_struct   ...    256   32    2 : ...
> 
> Aggregate VMA memory consumption per 1000 VMAs grows from 50 to 64 pages,
> which is 5.5MB per 100000 VMAs. Note that the size of this structure is
> dependent on the kernel configuration and typically the original size is
> higher than 160 bytes. Therefore these calculations are close to the
> worst case scenario. A more realistic vm_area_struct usage before this
> change is:
> 
>      <name>           ... <objsize> <objperslab> <pagesperslab> : ...
>      vma_lock         ...     40  102    1 : ...
>      vm_area_struct   ...    176   46    2 : ...
> 
> Aggregate VMA memory consumption per 1000 VMAs grows from 54 to 64 pages,
> which is 3.9MB per 100000 VMAs.
> This memory consumption growth can be addressed later by optimizing the
> vm_lock.
> 
> [1] https://lore.kernel.org/all/20230227173632.3292573-34-surenb@google.com/
> [2] https://lore.kernel.org/all/ZsQyI%2F087V34JoIt@xsang-OptiPlex-9020/
> [3] https://lore.kernel.org/all/CAJuCfpEisU8Lfe96AYJDZ+OM4NoPmnw9bP53cT_kbfP_pR+-2g@mail.gmail.com/
> 
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>

One question below.

> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -716,8 +716,6 @@ struct vm_area_struct {
>  	 * slowpath.
>  	 */
>  	unsigned int vm_lock_seq;
> -	/* Unstable RCU readers are allowed to read this. */
> -	struct vma_lock *vm_lock;
>  #endif
>  
>  	/*
> @@ -770,6 +768,10 @@ struct vm_area_struct {
>  	struct vma_numab_state *numab_state;	/* NUMA Balancing state */
>  #endif
>  	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
> +#ifdef CONFIG_PER_VMA_LOCK
> +	/* Unstable RCU readers are allowed to read this. */
> +	struct vma_lock vm_lock ____cacheline_aligned_in_smp;
> +#endif
>  } __randomize_layout;

Do we just want 'struct vm_area_struct' to be cacheline aligned or do we
want 'struct vma_lock vm_lock' to be on a separate cacheline as well?



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 2/5] mm: move per-vma lock into vm_area_struct
  2024-11-20 23:32   ` Shakeel Butt
@ 2024-11-20 23:44     ` Suren Baghdasaryan
  2024-11-21  0:04       ` Shakeel Butt
  0 siblings, 1 reply; 27+ messages in thread
From: Suren Baghdasaryan @ 2024-11-20 23:44 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: akpm, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
	hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, minchan, jannh,
	souravpanda, pasha.tatashin, corbet, linux-doc, linux-mm,
	linux-kernel, kernel-team

On Wed, Nov 20, 2024 at 3:33 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Tue, Nov 19, 2024 at 04:08:23PM -0800, Suren Baghdasaryan wrote:
> > Back when per-vma locks were introduced, vm_lock was moved out of
> > vm_area_struct in [1] because of the performance regression caused by
> > false cacheline sharing. Recent investigation [2] revealed that the
> > regression is limited to a rather old Broadwell microarchitecture and
> > even there it can be mitigated by disabling adjacent cacheline
> > prefetching, see [3].
> > Splitting single logical structure into multiple ones leads to more
> > complicated management, extra pointer dereferences and overall less
> > maintainable code. When that split-away part is a lock, it complicates
> > things even further. With no performance benefits, there are no reasons
> > for this split. Merging the vm_lock back into vm_area_struct also allows
> > vm_area_struct to use SLAB_TYPESAFE_BY_RCU later in this patchset.
> > Move vm_lock back into vm_area_struct, aligning it at the cacheline
> > boundary and changing the cache to be cacheline-aligned as well.
> > With kernel compiled using defconfig, this causes VMA memory consumption
> > to grow from 160 (vm_area_struct) + 40 (vm_lock) bytes to 256 bytes:
> >
> >     slabinfo before:
> >      <name>           ... <objsize> <objperslab> <pagesperslab> : ...
> >      vma_lock         ...     40  102    1 : ...
> >      vm_area_struct   ...    160   51    2 : ...
> >
> >     slabinfo after moving vm_lock:
> >      <name>           ... <objsize> <objperslab> <pagesperslab> : ...
> >      vm_area_struct   ...    256   32    2 : ...
> >
> > Aggregate VMA memory consumption per 1000 VMAs grows from 50 to 64 pages,
> > which is 5.5MB per 100000 VMAs. Note that the size of this structure is
> > dependent on the kernel configuration and typically the original size is
> > higher than 160 bytes. Therefore these calculations are close to the
> > worst case scenario. A more realistic vm_area_struct usage before this
> > change is:
> >
> >      <name>           ... <objsize> <objperslab> <pagesperslab> : ...
> >      vma_lock         ...     40  102    1 : ...
> >      vm_area_struct   ...    176   46    2 : ...
> >
> > Aggregate VMA memory consumption per 1000 VMAs grows from 54 to 64 pages,
> > which is 3.9MB per 100000 VMAs.
> > This memory consumption growth can be addressed later by optimizing the
> > vm_lock.
> >
> > [1] https://lore.kernel.org/all/20230227173632.3292573-34-surenb@google.com/
> > [2] https://lore.kernel.org/all/ZsQyI%2F087V34JoIt@xsang-OptiPlex-9020/
> > [3] https://lore.kernel.org/all/CAJuCfpEisU8Lfe96AYJDZ+OM4NoPmnw9bP53cT_kbfP_pR+-2g@mail.gmail.com/
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>
> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>

Thanks!

>
>
> One question below.
>
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -716,8 +716,6 @@ struct vm_area_struct {
> >        * slowpath.
> >        */
> >       unsigned int vm_lock_seq;
> > -     /* Unstable RCU readers are allowed to read this. */
> > -     struct vma_lock *vm_lock;
> >  #endif
> >
> >       /*
> > @@ -770,6 +768,10 @@ struct vm_area_struct {
> >       struct vma_numab_state *numab_state;    /* NUMA Balancing state */
> >  #endif
> >       struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
> > +#ifdef CONFIG_PER_VMA_LOCK
> > +     /* Unstable RCU readers are allowed to read this. */
> > +     struct vma_lock vm_lock ____cacheline_aligned_in_smp;
> > +#endif
> >  } __randomize_layout;
>
> Do we just want 'struct vm_area_struct' to be cacheline aligned or do we
> want 'struct vma_lock vm_lock' to be on a separate cacheline as well?

We want both to minimize cacheline sharing.

>


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 0/5] move per-vma lock into vm_area_struct
  2024-11-20 22:10 ` [PATCH v4 0/5] move per-vma lock into vm_area_struct Shakeel Butt
@ 2024-11-20 23:52   ` Suren Baghdasaryan
  2024-11-21  2:00   ` Matthew Wilcox
  1 sibling, 0 replies; 27+ messages in thread
From: Suren Baghdasaryan @ 2024-11-20 23:52 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: akpm, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
	hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, minchan, jannh,
	souravpanda, pasha.tatashin, corbet, linux-doc, linux-mm,
	linux-kernel, kernel-team

On Wed, Nov 20, 2024 at 2:10 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Tue, Nov 19, 2024 at 04:08:21PM -0800, Suren Baghdasaryan wrote:
> > Back when per-vma locks were introduced, vm_lock was moved out of
> > vm_area_struct in [1] because of the performance regression caused by
> > false cacheline sharing. Recent investigation [2] revealed that the
> > regression is limited to a rather old Broadwell microarchitecture and
> > even there it can be mitigated by disabling adjacent cacheline
> > prefetching, see [3].
>
> If 'struct vm_area_struct' is prone to performance issues due to
> cacheline misalignments then we should do something about the
> __randomize_layout tag for it. I imagine we can identify the fields
> which might be performance critical to be on same cacheline or different
> cacheline due to false sharing then we can divide the fields into
> different cacheline groups and fields can be __randomize_layout within
> the group. WDYT?

I think that's a good idea since shuffling these fields around does
affect performance. I can look into it once these changes get
finalized and the layout gets more stable.

>


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 2/5] mm: move per-vma lock into vm_area_struct
  2024-11-20 23:44     ` Suren Baghdasaryan
@ 2024-11-21  0:04       ` Shakeel Butt
  2024-11-21  0:33         ` Suren Baghdasaryan
  0 siblings, 1 reply; 27+ messages in thread
From: Shakeel Butt @ 2024-11-21  0:04 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
	hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, minchan, jannh,
	souravpanda, pasha.tatashin, corbet, linux-doc, linux-mm,
	linux-kernel, kernel-team

On Wed, Nov 20, 2024 at 03:44:29PM -0800, Suren Baghdasaryan wrote:
> On Wed, Nov 20, 2024 at 3:33 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >
> > On Tue, Nov 19, 2024 at 04:08:23PM -0800, Suren Baghdasaryan wrote:
> > > Back when per-vma locks were introduced, vm_lock was moved out of
> > > vm_area_struct in [1] because of the performance regression caused by
> > > false cacheline sharing. Recent investigation [2] revealed that the
> > > regression is limited to a rather old Broadwell microarchitecture and
> > > even there it can be mitigated by disabling adjacent cacheline
> > > prefetching, see [3].
> > > Splitting single logical structure into multiple ones leads to more
> > > complicated management, extra pointer dereferences and overall less
> > > maintainable code. When that split-away part is a lock, it complicates
> > > things even further. With no performance benefits, there are no reasons
> > > for this split. Merging the vm_lock back into vm_area_struct also allows
> > > vm_area_struct to use SLAB_TYPESAFE_BY_RCU later in this patchset.
> > > Move vm_lock back into vm_area_struct, aligning it at the cacheline
> > > boundary and changing the cache to be cacheline-aligned as well.
> > > With kernel compiled using defconfig, this causes VMA memory consumption
> > > to grow from 160 (vm_area_struct) + 40 (vm_lock) bytes to 256 bytes:
> > >
> > >     slabinfo before:
> > >      <name>           ... <objsize> <objperslab> <pagesperslab> : ...
> > >      vma_lock         ...     40  102    1 : ...
> > >      vm_area_struct   ...    160   51    2 : ...
> > >
> > >     slabinfo after moving vm_lock:
> > >      <name>           ... <objsize> <objperslab> <pagesperslab> : ...
> > >      vm_area_struct   ...    256   32    2 : ...
> > >
> > > Aggregate VMA memory consumption per 1000 VMAs grows from 50 to 64 pages,
> > > which is 5.5MB per 100000 VMAs. Note that the size of this structure is
> > > dependent on the kernel configuration and typically the original size is
> > > higher than 160 bytes. Therefore these calculations are close to the
> > > worst case scenario. A more realistic vm_area_struct usage before this
> > > change is:
> > >
> > >      <name>           ... <objsize> <objperslab> <pagesperslab> : ...
> > >      vma_lock         ...     40  102    1 : ...
> > >      vm_area_struct   ...    176   46    2 : ...
> > >
> > > Aggregate VMA memory consumption per 1000 VMAs grows from 54 to 64 pages,
> > > which is 3.9MB per 100000 VMAs.
> > > This memory consumption growth can be addressed later by optimizing the
> > > vm_lock.
> > >
> > > [1] https://lore.kernel.org/all/20230227173632.3292573-34-surenb@google.com/
> > > [2] https://lore.kernel.org/all/ZsQyI%2F087V34JoIt@xsang-OptiPlex-9020/
> > > [3] https://lore.kernel.org/all/CAJuCfpEisU8Lfe96AYJDZ+OM4NoPmnw9bP53cT_kbfP_pR+-2g@mail.gmail.com/
> > >
> > > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > > Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> >
> > Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
> 
> Thanks!
> 
> >
> >
> > One question below.
> >
> > > --- a/include/linux/mm_types.h
> > > +++ b/include/linux/mm_types.h
> > > @@ -716,8 +716,6 @@ struct vm_area_struct {
> > >        * slowpath.
> > >        */
> > >       unsigned int vm_lock_seq;
> > > -     /* Unstable RCU readers are allowed to read this. */
> > > -     struct vma_lock *vm_lock;
> > >  #endif
> > >
> > >       /*
> > > @@ -770,6 +768,10 @@ struct vm_area_struct {
> > >       struct vma_numab_state *numab_state;    /* NUMA Balancing state */
> > >  #endif
> > >       struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
> > > +#ifdef CONFIG_PER_VMA_LOCK
> > > +     /* Unstable RCU readers are allowed to read this. */
> > > +     struct vma_lock vm_lock ____cacheline_aligned_in_smp;
> > > +#endif
> > >  } __randomize_layout;
> >
> > Do we just want 'struct vm_area_struct' to be cacheline aligned or do we
> > want 'struct vma_lock vm_lock' to be on a separate cacheline as well?
> 
> We want both to minimize cacheline sharing.
> 

For the latter, you will need to add a pad after vm_lock as well, so any
future addition will not share the cacheline with vm_lock. I would do
something like below. This is a nit and can be done later.

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 7654c766cbe2..5cc4fff163a0 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -751,10 +751,12 @@ struct vm_area_struct {
 #endif
 	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
 #ifdef CONFIG_PER_VMA_LOCK
+	CACHELINE_PADDING(__pad1__);
 	/* Unstable RCU readers are allowed to read this. */
-	struct vma_lock vm_lock ____cacheline_aligned_in_smp;
+	struct vma_lock vm_lock;
+	CACHELINE_PADDING(__pad2__);
 #endif
-} __randomize_layout;
+} __randomize_layout ____cacheline_aligned_in_smp;
 
 #ifdef CONFIG_NUMA
 #define vma_policy(vma) ((vma)->vm_policy)


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 3/5] mm: mark vma as detached until it's added into vma tree
  2024-11-20  0:08 ` [PATCH v4 3/5] mm: mark vma as detached until it's added into vma tree Suren Baghdasaryan
@ 2024-11-21  0:13   ` Shakeel Butt
  2024-11-22 16:46   ` Lorenzo Stoakes
  1 sibling, 0 replies; 27+ messages in thread
From: Shakeel Butt @ 2024-11-21  0:13 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
	hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, minchan, jannh,
	souravpanda, pasha.tatashin, corbet, linux-doc, linux-mm,
	linux-kernel, kernel-team

On Tue, Nov 19, 2024 at 04:08:24PM -0800, Suren Baghdasaryan wrote:
> Current implementation does not set detached flag when a VMA is first
> allocated. This does not represent the real state of the VMA, which is
> detached until it is added into mm's VMA tree. Fix this by marking new
> VMAs as detached and resetting detached flag only after VMA is added
> into a tree.
> Introduce vma_mark_attached() to make the API more readable and to
> simplify possible future cleanup when vma->vm_mm might be used to
> indicate detached vma and vma_mark_attached() will need an additional
> mm parameter.
> 
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>

Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 2/5] mm: move per-vma lock into vm_area_struct
  2024-11-21  0:04       ` Shakeel Butt
@ 2024-11-21  0:33         ` Suren Baghdasaryan
  2024-11-21  7:01           ` Shakeel Butt
  0 siblings, 1 reply; 27+ messages in thread
From: Suren Baghdasaryan @ 2024-11-21  0:33 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: akpm, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
	hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, minchan, jannh,
	souravpanda, pasha.tatashin, corbet, linux-doc, linux-mm,
	linux-kernel, kernel-team

On Wed, Nov 20, 2024 at 4:05 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Wed, Nov 20, 2024 at 03:44:29PM -0800, Suren Baghdasaryan wrote:
> > On Wed, Nov 20, 2024 at 3:33 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > >
> > > On Tue, Nov 19, 2024 at 04:08:23PM -0800, Suren Baghdasaryan wrote:
> > > > Back when per-vma locks were introduced, vm_lock was moved out of
> > > > vm_area_struct in [1] because of the performance regression caused by
> > > > false cacheline sharing. Recent investigation [2] revealed that the
> > > > regression is limited to a rather old Broadwell microarchitecture and
> > > > even there it can be mitigated by disabling adjacent cacheline
> > > > prefetching, see [3].
> > > > Splitting single logical structure into multiple ones leads to more
> > > > complicated management, extra pointer dereferences and overall less
> > > > maintainable code. When that split-away part is a lock, it complicates
> > > > things even further. With no performance benefits, there are no reasons
> > > > for this split. Merging the vm_lock back into vm_area_struct also allows
> > > > vm_area_struct to use SLAB_TYPESAFE_BY_RCU later in this patchset.
> > > > Move vm_lock back into vm_area_struct, aligning it at the cacheline
> > > > boundary and changing the cache to be cacheline-aligned as well.
> > > > With kernel compiled using defconfig, this causes VMA memory consumption
> > > > to grow from 160 (vm_area_struct) + 40 (vm_lock) bytes to 256 bytes:
> > > >
> > > >     slabinfo before:
> > > >      <name>           ... <objsize> <objperslab> <pagesperslab> : ...
> > > >      vma_lock         ...     40  102    1 : ...
> > > >      vm_area_struct   ...    160   51    2 : ...
> > > >
> > > >     slabinfo after moving vm_lock:
> > > >      <name>           ... <objsize> <objperslab> <pagesperslab> : ...
> > > >      vm_area_struct   ...    256   32    2 : ...
> > > >
> > > > Aggregate VMA memory consumption per 1000 VMAs grows from 50 to 64 pages,
> > > > which is 5.5MB per 100000 VMAs. Note that the size of this structure is
> > > > dependent on the kernel configuration and typically the original size is
> > > > higher than 160 bytes. Therefore these calculations are close to the
> > > > worst case scenario. A more realistic vm_area_struct usage before this
> > > > change is:
> > > >
> > > >      <name>           ... <objsize> <objperslab> <pagesperslab> : ...
> > > >      vma_lock         ...     40  102    1 : ...
> > > >      vm_area_struct   ...    176   46    2 : ...
> > > >
> > > > Aggregate VMA memory consumption per 1000 VMAs grows from 54 to 64 pages,
> > > > which is 3.9MB per 100000 VMAs.
> > > > This memory consumption growth can be addressed later by optimizing the
> > > > vm_lock.
> > > >
> > > > [1] https://lore.kernel.org/all/20230227173632.3292573-34-surenb@google.com/
> > > > [2] https://lore.kernel.org/all/ZsQyI%2F087V34JoIt@xsang-OptiPlex-9020/
> > > > [3] https://lore.kernel.org/all/CAJuCfpEisU8Lfe96AYJDZ+OM4NoPmnw9bP53cT_kbfP_pR+-2g@mail.gmail.com/
> > > >
> > > > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > > > Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > >
> > > Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
> >
> > Thanks!
> >
> > >
> > >
> > > One question below.
> > >
> > > > --- a/include/linux/mm_types.h
> > > > +++ b/include/linux/mm_types.h
> > > > @@ -716,8 +716,6 @@ struct vm_area_struct {
> > > >        * slowpath.
> > > >        */
> > > >       unsigned int vm_lock_seq;
> > > > -     /* Unstable RCU readers are allowed to read this. */
> > > > -     struct vma_lock *vm_lock;
> > > >  #endif
> > > >
> > > >       /*
> > > > @@ -770,6 +768,10 @@ struct vm_area_struct {
> > > >       struct vma_numab_state *numab_state;    /* NUMA Balancing state */
> > > >  #endif
> > > >       struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
> > > > +#ifdef CONFIG_PER_VMA_LOCK
> > > > +     /* Unstable RCU readers are allowed to read this. */
> > > > +     struct vma_lock vm_lock ____cacheline_aligned_in_smp;
> > > > +#endif
> > > >  } __randomize_layout;
> > >
> > > Do we just want 'struct vm_area_struct' to be cacheline aligned or do we
> > > want 'struct vma_lock vm_lock' to be on a separate cacheline as well?
> >
> > We want both to minimize cacheline sharing.
> >
>
> For the latter, you will need to add a pad after vm_lock as well, so any
> future addition will not share the cacheline with vm_lock. I would do
> something like below. This is a nit and can be done later.
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 7654c766cbe2..5cc4fff163a0 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -751,10 +751,12 @@ struct vm_area_struct {
>  #endif
>         struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
>  #ifdef CONFIG_PER_VMA_LOCK
> +       CACHELINE_PADDING(__pad1__);
>         /* Unstable RCU readers are allowed to read this. */
> -       struct vma_lock vm_lock ____cacheline_aligned_in_smp;
> +       struct vma_lock vm_lock;
> +       CACHELINE_PADDING(__pad2__);
>  #endif
> -} __randomize_layout;
> +} __randomize_layout ____cacheline_aligned_in_smp;

I thought SLAB_HWCACHE_ALIGN for vm_area_cachep added in this patch
would have the same result, no?

>
>  #ifdef CONFIG_NUMA
>  #define vma_policy(vma) ((vma)->vm_policy)


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 0/5] move per-vma lock into vm_area_struct
  2024-11-20 22:10 ` [PATCH v4 0/5] move per-vma lock into vm_area_struct Shakeel Butt
  2024-11-20 23:52   ` Suren Baghdasaryan
@ 2024-11-21  2:00   ` Matthew Wilcox
  2024-11-22 11:56     ` Lorenzo Stoakes
  1 sibling, 1 reply; 27+ messages in thread
From: Matthew Wilcox @ 2024-11-21  2:00 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Suren Baghdasaryan, akpm, liam.howlett, lorenzo.stoakes, mhocko,
	vbabka, hannes, mjguzik, oliver.sang, mgorman, david, peterx,
	oleg, dave, paulmck, brauner, dhowells, hdanton, hughd, minchan,
	jannh, souravpanda, pasha.tatashin, corbet, linux-doc, linux-mm,
	linux-kernel, kernel-team

On Wed, Nov 20, 2024 at 02:10:44PM -0800, Shakeel Butt wrote:
> If 'struct vm_area_struct' is prone to performance issues due to
> cacheline misalignment, then we should do something about the
> __randomize_layout tag for it. I imagine we can identify the fields
> which are performance critical and need to be on the same cacheline (or
> on separate cachelines to avoid false sharing), then divide the fields
> into different cacheline groups and randomize the layout only within
> each group. WDYT?

Pretty sure the people who think security is more important than
performance are the only ones who randomize structs.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 2/5] mm: move per-vma lock into vm_area_struct
  2024-11-21  0:33         ` Suren Baghdasaryan
@ 2024-11-21  7:01           ` Shakeel Butt
  2024-11-21 17:05             ` Suren Baghdasaryan
  0 siblings, 1 reply; 27+ messages in thread
From: Shakeel Butt @ 2024-11-21  7:01 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
	hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, minchan, jannh,
	souravpanda, pasha.tatashin, corbet, linux-doc, linux-mm,
	linux-kernel, kernel-team

On Wed, Nov 20, 2024 at 04:33:37PM -0800, Suren Baghdasaryan wrote:
[...]
> > > >
> > > > Do we just want 'struct vm_area_struct' to be cacheline aligned or do we
> > > > want 'struct vma_lock vm_lock' to be on a separate cacheline as well?
> > >
> > > We want both to minimize cacheline sharing.
> > >
> >
> > For the latter, you will need to add a pad after vm_lock as well, so any
> > future addition will not share the cacheline with vm_lock. I would do
> > something like below. This is a nit and can be done later.
> >
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index 7654c766cbe2..5cc4fff163a0 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -751,10 +751,12 @@ struct vm_area_struct {
> >  #endif
> >         struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
> >  #ifdef CONFIG_PER_VMA_LOCK
> > +       CACHELINE_PADDING(__pad1__);
> >         /* Unstable RCU readers are allowed to read this. */
> > -       struct vma_lock vm_lock ____cacheline_aligned_in_smp;
> > +       struct vma_lock vm_lock;
> > +       CACHELINE_PADDING(__pad2__);
> >  #endif
> > -} __randomize_layout;
> > +} __randomize_layout ____cacheline_aligned_in_smp;
> 
> I thought SLAB_HWCACHE_ALIGN for vm_area_cachep added in this patch
> would have the same result, no?
> 

SLAB_HWCACHE_ALIGN is about the SLUB allocator handing out
cacheline-aligned objects; it does not say anything about the internal
layout of the struct for which the kmem_cache is being created. The
____cacheline_aligned_in_smp tag in your patch makes sure that the
vm_lock field starts on a new cacheline, which can leave a hole between
vm_lock and the previous field if that field does not end at a cacheline
boundary. Please note that if you add a new field after vm_lock (without
a cacheline alignment tag), it will share the cacheline with vm_lock. So
your code does achieve the goal of putting vm_lock at the start of its
own cacheline, but it does not guarantee that vm_lock is the only field
on that cacheline.
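
As a toy illustration (assuming 64-byte cachelines; the struct is made up):

struct example {
        long a;                 /* bytes 0..7: cacheline 0           */
                                /* padding up to the next cacheline  */
        struct vma_lock lock ____cacheline_aligned_in_smp;
                                /* starts at byte 64: cacheline 1    */
        long b;                 /* placed right after 'lock', so it  */
                                /* shares cacheline 1 with the lock  */
                                /* unless explicit padding is added  */
};

/*
 * SLAB_HWCACHE_ALIGN only guarantees that the object handed out by the
 * allocator starts at a cacheline boundary; it neither reorders nor pads
 * the members inside the struct.
 */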



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 2/5] mm: move per-vma lock into vm_area_struct
  2024-11-21  7:01           ` Shakeel Butt
@ 2024-11-21 17:05             ` Suren Baghdasaryan
  2024-11-21 18:25               ` Shakeel Butt
  0 siblings, 1 reply; 27+ messages in thread
From: Suren Baghdasaryan @ 2024-11-21 17:05 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: akpm, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
	hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, minchan, jannh,
	souravpanda, pasha.tatashin, corbet, linux-doc, linux-mm,
	linux-kernel, kernel-team

On Wed, Nov 20, 2024 at 11:02 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Wed, Nov 20, 2024 at 04:33:37PM -0800, Suren Baghdasaryan wrote:
> [...]
> > > > >
> > > > > Do we just want 'struct vm_area_struct' to be cacheline aligned or do we
> > > > > want 'struct vma_lock vm_lock' to be on a separate cacheline as well?
> > > >
> > > > We want both to minimize cacheline sharing.
> > > >
> > >
> > > For the latter, you will need to add a pad after vm_lock as well, so any
> > > future addition will not share the cacheline with vm_lock. I would do
> > > something like below. This is a nit and can be done later.
> > >
> > > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > > index 7654c766cbe2..5cc4fff163a0 100644
> > > --- a/include/linux/mm_types.h
> > > +++ b/include/linux/mm_types.h
> > > @@ -751,10 +751,12 @@ struct vm_area_struct {
> > >  #endif
> > >         struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
> > >  #ifdef CONFIG_PER_VMA_LOCK
> > > +       CACHELINE_PADDING(__pad1__);
> > >         /* Unstable RCU readers are allowed to read this. */
> > > -       struct vma_lock vm_lock ____cacheline_aligned_in_smp;
> > > +       struct vma_lock vm_lock;
> > > +       CACHELINE_PADDING(__pad2__);
> > >  #endif
> > > -} __randomize_layout;
> > > +} __randomize_layout ____cacheline_aligned_in_smp;
> >
> > I thought SLAB_HWCACHE_ALIGN for vm_area_cachep added in this patch
> > would have the same result, no?
> >
>
> SLAB_HWCACHE_ALIGN is about the SLUB allocator handing out
> cacheline-aligned objects; it does not say anything about the internal
> layout of the struct for which the kmem_cache is being created. The
> ____cacheline_aligned_in_smp tag in your patch makes sure that the
> vm_lock field starts on a new cacheline, which can leave a hole between
> vm_lock and the previous field if that field does not end at a cacheline
> boundary. Please note that if you add a new field after vm_lock (without
> a cacheline alignment tag), it will share the cacheline with vm_lock. So
> your code does achieve the goal of putting vm_lock at the start of its
> own cacheline, but it does not guarantee that vm_lock is the only field
> on that cacheline.

Sorry, I should have been more clear. It's ok if some fields which are
rarely accessed in the pagefault path are placed in the same cacheline
as vm_lock. In fact I've done that to pack them better in the
previous version of this patchset here:
https://lore.kernel.org/all/20241111205506.3404479-5-surenb@google.com/
(removed for now based on the feedback). So, vm_lock being the only
field on the cacheline is not my goal. After this patchset I'm
planning to try packing the members better and save some memory.

>


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 2/5] mm: move per-vma lock into vm_area_struct
  2024-11-21 17:05             ` Suren Baghdasaryan
@ 2024-11-21 18:25               ` Shakeel Butt
  0 siblings, 0 replies; 27+ messages in thread
From: Shakeel Butt @ 2024-11-21 18:25 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
	hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, minchan, jannh,
	souravpanda, pasha.tatashin, corbet, linux-doc, linux-mm,
	linux-kernel, kernel-team

On Thu, Nov 21, 2024 at 09:05:21AM -0800, Suren Baghdasaryan wrote:
[...]
> 
> Sorry, I should have been more clear. It's ok if some fields which are
> rarely accessed in the pagefault path are placed in the same cacheline
> as vm_lock. In fact I've done that to pack them better in the
> previous version of this patchset here:
> https://lore.kernel.org/all/20241111205506.3404479-5-surenb@google.com/
> (removed for now based on the feedback). So, vm_lock being the only
> field on the cacheline is not my goal. After this patchset I'm
> planning to try packing the members better and save some memory.
> 

Nah, my bad, somehow I thought you wanted vm_lock to be on a cacheline
of its own.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 0/5] move per-vma lock into vm_area_struct
  2024-11-21  2:00   ` Matthew Wilcox
@ 2024-11-22 11:56     ` Lorenzo Stoakes
  2024-11-22 15:06       ` Suren Baghdasaryan
  0 siblings, 1 reply; 27+ messages in thread
From: Lorenzo Stoakes @ 2024-11-22 11:56 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Shakeel Butt, Suren Baghdasaryan, akpm, liam.howlett, mhocko,
	vbabka, hannes, mjguzik, oliver.sang, mgorman, david, peterx,
	oleg, dave, paulmck, brauner, dhowells, hdanton, hughd, minchan,
	jannh, souravpanda, pasha.tatashin, corbet, linux-doc, linux-mm,
	linux-kernel, kernel-team

On Thu, Nov 21, 2024 at 02:00:46AM +0000, Matthew Wilcox wrote:
> On Wed, Nov 20, 2024 at 02:10:44PM -0800, Shakeel Butt wrote:
> > If 'struct vm_area_struct' is prone to performance issues due to
> > cacheline misalignment, then we should do something about the
> > __randomize_layout tag for it. I imagine we can identify the fields
> > which are performance critical and need to be on the same cacheline (or
> > on separate cachelines to avoid false sharing), then divide the fields
> > into different cacheline groups and randomize the layout only within
> > each group. WDYT?
>
> Pretty sure the people who think security is more important than
> performance are the only ones who randomize structs.

I agree; I don't think we need to concern ourselves with users of this
setting, for precisely this reason.

I wouldn't want supporting this to cause difficulty for users who do not
enable it when, as Matthew says, those who do aren't really concerned
about the perf issues.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 0/5] move per-vma lock into vm_area_struct
  2024-11-22 11:56     ` Lorenzo Stoakes
@ 2024-11-22 15:06       ` Suren Baghdasaryan
  0 siblings, 0 replies; 27+ messages in thread
From: Suren Baghdasaryan @ 2024-11-22 15:06 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Matthew Wilcox, Shakeel Butt, akpm, liam.howlett, mhocko, vbabka,
	hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, minchan, jannh,
	souravpanda, pasha.tatashin, corbet, linux-doc, linux-mm,
	linux-kernel, kernel-team

On Fri, Nov 22, 2024 at 3:57 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Thu, Nov 21, 2024 at 02:00:46AM +0000, Matthew Wilcox wrote:
> > On Wed, Nov 20, 2024 at 02:10:44PM -0800, Shakeel Butt wrote:
> > > If 'struct vm_area_struct' is prone to performance issues due to
> > > cacheline misalignment, then we should do something about the
> > > __randomize_layout tag for it. I imagine we can identify the fields
> > > which are performance critical and need to be on the same cacheline (or
> > > on separate cachelines to avoid false sharing), then divide the fields
> > > into different cacheline groups and randomize the layout only within
> > > each group. WDYT?
> >
> > Pretty sure the people who think security is more important than
> > performance are the only ones who randomize structs.
>
> I agree; I don't think we need to concern ourselves with users of this
> setting, for precisely this reason.
>
> I wouldn't want supporting this to cause difficulty for users who do not
> enable it when, as Matthew says, those who do aren't really concerned
> about the perf issues.

Ack. Will keep it as is. Thanks!


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 3/5] mm: mark vma as detached until it's added into vma tree
  2024-11-20  0:08 ` [PATCH v4 3/5] mm: mark vma as detached until it's added into vma tree Suren Baghdasaryan
  2024-11-21  0:13   ` Shakeel Butt
@ 2024-11-22 16:46   ` Lorenzo Stoakes
  2024-11-22 17:47     ` Suren Baghdasaryan
  1 sibling, 1 reply; 27+ messages in thread
From: Lorenzo Stoakes @ 2024-11-22 16:46 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, willy, liam.howlett, mhocko, vbabka, hannes, mjguzik,
	oliver.sang, mgorman, david, peterx, oleg, dave, paulmck,
	brauner, dhowells, hdanton, hughd, minchan, jannh, shakeel.butt,
	souravpanda, pasha.tatashin, corbet, linux-doc, linux-mm,
	linux-kernel, kernel-team

On Tue, Nov 19, 2024 at 04:08:24PM -0800, Suren Baghdasaryan wrote:
> Current implementation does not set detached flag when a VMA is first
> allocated. This does not represent the real state of the VMA, which is
> detached until it is added into mm's VMA tree. Fix this by marking new
> VMAs as detached and resetting detached flag only after VMA is added
> into a tree.
> Introduce vma_mark_attached() to make the API more readable and to
> simplify possible future cleanup when vma->vm_mm might be used to
> indicate detached vma and vma_mark_attached() will need an additional
> mm parameter.
>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>

I tested this (whole series) locally and on real hardware and did a kernel
compile on real hardware just to be sure :)) and all looks good.

The code looks sensible, so:

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> ---
>  include/linux/mm.h               | 27 ++++++++++++++++++++-------
>  kernel/fork.c                    |  4 ++++
>  mm/memory.c                      |  2 +-
>  mm/vma.c                         |  6 +++---
>  mm/vma.h                         |  2 ++
>  tools/testing/vma/vma_internal.h | 17 ++++++++++++-----
>  6 files changed, 42 insertions(+), 16 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 737c003b0a1e..dd1b6190df28 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -808,12 +808,21 @@ static inline void vma_assert_locked(struct vm_area_struct *vma)
>  		vma_assert_write_locked(vma);
>  }
>
> -static inline void vma_mark_detached(struct vm_area_struct *vma, bool detached)
> +static inline void vma_mark_attached(struct vm_area_struct *vma)
> +{
> +	vma->detached = false;
> +}
> +
> +static inline void vma_mark_detached(struct vm_area_struct *vma)
>  {
>  	/* When detaching vma should be write-locked */
> -	if (detached)
> -		vma_assert_write_locked(vma);
> -	vma->detached = detached;
> +	vma_assert_write_locked(vma);
> +	vma->detached = true;
> +}
> +
> +static inline bool is_vma_detached(struct vm_area_struct *vma)
> +{
> +	return vma->detached;
>  }
>
>  static inline void release_fault_lock(struct vm_fault *vmf)
> @@ -844,8 +853,8 @@ static inline void vma_end_read(struct vm_area_struct *vma) {}
>  static inline void vma_start_write(struct vm_area_struct *vma) {}
>  static inline void vma_assert_write_locked(struct vm_area_struct *vma)
>  		{ mmap_assert_write_locked(vma->vm_mm); }
> -static inline void vma_mark_detached(struct vm_area_struct *vma,
> -				     bool detached) {}
> +static inline void vma_mark_attached(struct vm_area_struct *vma) {}
> +static inline void vma_mark_detached(struct vm_area_struct *vma) {}
>
>  static inline struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
>  		unsigned long address)
> @@ -878,7 +887,10 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
>  	vma->vm_mm = mm;
>  	vma->vm_ops = &vma_dummy_vm_ops;
>  	INIT_LIST_HEAD(&vma->anon_vma_chain);
> -	vma_mark_detached(vma, false);
> +#ifdef CONFIG_PER_VMA_LOCK
> +	/* vma is not locked, can't use vma_mark_detached() */
> +	vma->detached = true;
> +#endif
>  	vma_numab_state_init(vma);
>  	vma_lock_init(vma);
>  }
> @@ -1073,6 +1085,7 @@ static inline int vma_iter_bulk_store(struct vma_iterator *vmi,
>  	if (unlikely(mas_is_err(&vmi->mas)))
>  		return -ENOMEM;
>
> +	vma_mark_attached(vma);
>  	return 0;
>  }
>
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 7823797e31d2..f0cec673583c 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -465,6 +465,10 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
>  	data_race(memcpy(new, orig, sizeof(*new)));
>  	vma_lock_init(new);
>  	INIT_LIST_HEAD(&new->anon_vma_chain);
> +#ifdef CONFIG_PER_VMA_LOCK
> +	/* vma is not locked, can't use vma_mark_detached() */
> +	new->detached = true;
> +#endif
>  	vma_numab_state_init(new);
>  	dup_anon_vma_name(orig, new);
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 209885a4134f..d0197a0c0996 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -6279,7 +6279,7 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
>  		goto inval;
>
>  	/* Check if the VMA got isolated after we found it */
> -	if (vma->detached) {
> +	if (is_vma_detached(vma)) {
>  		vma_end_read(vma);
>  		count_vm_vma_lock_event(VMA_LOCK_MISS);
>  		/* The area was replaced with another one */
> diff --git a/mm/vma.c b/mm/vma.c
> index 8a454a7bbc80..73104d434567 100644
> --- a/mm/vma.c
> +++ b/mm/vma.c
> @@ -295,7 +295,7 @@ static void vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi,
>
>  	if (vp->remove) {
>  again:
> -		vma_mark_detached(vp->remove, true);
> +		vma_mark_detached(vp->remove);
>  		if (vp->file) {
>  			uprobe_munmap(vp->remove, vp->remove->vm_start,
>  				      vp->remove->vm_end);
> @@ -1220,7 +1220,7 @@ static void reattach_vmas(struct ma_state *mas_detach)
>
>  	mas_set(mas_detach, 0);
>  	mas_for_each(mas_detach, vma, ULONG_MAX)
> -		vma_mark_detached(vma, false);
> +		vma_mark_attached(vma);
>
>  	__mt_destroy(mas_detach->tree);
>  }
> @@ -1295,7 +1295,7 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
>  		if (error)
>  			goto munmap_gather_failed;
>
> -		vma_mark_detached(next, true);
> +		vma_mark_detached(next);
>  		nrpages = vma_pages(next);
>
>  		vms->nr_pages += nrpages;
> diff --git a/mm/vma.h b/mm/vma.h
> index 388d34748674..2e680f357ace 100644
> --- a/mm/vma.h
> +++ b/mm/vma.h
> @@ -162,6 +162,7 @@ static inline int vma_iter_store_gfp(struct vma_iterator *vmi,
>  	if (unlikely(mas_is_err(&vmi->mas)))
>  		return -ENOMEM;
>
> +	vma_mark_attached(vma);
>  	return 0;
>  }
>
> @@ -385,6 +386,7 @@ static inline void vma_iter_store(struct vma_iterator *vmi,
>
>  	__mas_set_range(&vmi->mas, vma->vm_start, vma->vm_end - 1);
>  	mas_store_prealloc(&vmi->mas, vma);
> +	vma_mark_attached(vma);
>  }
>
>  static inline unsigned long vma_iter_addr(struct vma_iterator *vmi)
> diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
> index 11c2c38ca4e8..2fed366d20ef 100644
> --- a/tools/testing/vma/vma_internal.h
> +++ b/tools/testing/vma/vma_internal.h
> @@ -414,13 +414,17 @@ static inline void vma_lock_init(struct vm_area_struct *vma)
>  	vma->vm_lock_seq = UINT_MAX;
>  }
>
> +static inline void vma_mark_attached(struct vm_area_struct *vma)
> +{
> +	vma->detached = false;
> +}

Yeah, I think it's sensible to just accept that sometimes we are already
attached when we mark attached.

> +
>  static inline void vma_assert_write_locked(struct vm_area_struct *);
> -static inline void vma_mark_detached(struct vm_area_struct *vma, bool detached)
> +static inline void vma_mark_detached(struct vm_area_struct *vma)
>  {
>  	/* When detaching vma should be write-locked */
> -	if (detached)
> -		vma_assert_write_locked(vma);
> -	vma->detached = detached;
> +	vma_assert_write_locked(vma);
> +	vma->detached = true;
>  }
>
>  extern const struct vm_operations_struct vma_dummy_vm_ops;
> @@ -431,7 +435,8 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
>  	vma->vm_mm = mm;
>  	vma->vm_ops = &vma_dummy_vm_ops;
>  	INIT_LIST_HEAD(&vma->anon_vma_chain);
> -	vma_mark_detached(vma, false);
> +	/* vma is not locked, can't use vma_mark_detached() */
> +	vma->detached = true;
>  	vma_lock_init(vma);
>  }
>
> @@ -457,6 +462,8 @@ static inline struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
>  	memcpy(new, orig, sizeof(*new));
>  	vma_lock_init(new);
>  	INIT_LIST_HEAD(&new->anon_vma_chain);
> +	/* vma is not locked, can't use vma_mark_detached() */
> +	new->detached = true;
>
>  	return new;
>  }
> --
> 2.47.0.338.g60cca15819-goog
>


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 3/5] mm: mark vma as detached until it's added into vma tree
  2024-11-22 16:46   ` Lorenzo Stoakes
@ 2024-11-22 17:47     ` Suren Baghdasaryan
  0 siblings, 0 replies; 27+ messages in thread
From: Suren Baghdasaryan @ 2024-11-22 17:47 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, willy, liam.howlett, mhocko, vbabka, hannes, mjguzik,
	oliver.sang, mgorman, david, peterx, oleg, dave, paulmck,
	brauner, dhowells, hdanton, hughd, minchan, jannh, shakeel.butt,
	souravpanda, pasha.tatashin, corbet, linux-doc, linux-mm,
	linux-kernel, kernel-team

On Fri, Nov 22, 2024 at 8:47 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Tue, Nov 19, 2024 at 04:08:24PM -0800, Suren Baghdasaryan wrote:
> > Current implementation does not set detached flag when a VMA is first
> > allocated. This does not represent the real state of the VMA, which is
> > detached until it is added into mm's VMA tree. Fix this by marking new
> > VMAs as detached and resetting detached flag only after VMA is added
> > into a tree.
> > Introduce vma_mark_attached() to make the API more readable and to
> > simplify possible future cleanup when vma->vm_mm might be used to
> > indicate detached vma and vma_mark_attached() will need an additional
> > mm parameter.
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
>
> I tested this (whole series) locally and on real hardware and did a kernel
> compile on real hardware just to be sure :)) and all looks good.
>
> The code looks sensible, so:
>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

Thanks Lorenzo! I'll post a new version today, addressing Matthew's
and others' comments, but overall the functionality should stay the same.

>
> > ---
> >  include/linux/mm.h               | 27 ++++++++++++++++++++-------
> >  kernel/fork.c                    |  4 ++++
> >  mm/memory.c                      |  2 +-
> >  mm/vma.c                         |  6 +++---
> >  mm/vma.h                         |  2 ++
> >  tools/testing/vma/vma_internal.h | 17 ++++++++++++-----
> >  6 files changed, 42 insertions(+), 16 deletions(-)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 737c003b0a1e..dd1b6190df28 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -808,12 +808,21 @@ static inline void vma_assert_locked(struct vm_area_struct *vma)
> >               vma_assert_write_locked(vma);
> >  }
> >
> > -static inline void vma_mark_detached(struct vm_area_struct *vma, bool detached)
> > +static inline void vma_mark_attached(struct vm_area_struct *vma)
> > +{
> > +     vma->detached = false;
> > +}
> > +
> > +static inline void vma_mark_detached(struct vm_area_struct *vma)
> >  {
> >       /* When detaching vma should be write-locked */
> > -     if (detached)
> > -             vma_assert_write_locked(vma);
> > -     vma->detached = detached;
> > +     vma_assert_write_locked(vma);
> > +     vma->detached = true;
> > +}
> > +
> > +static inline bool is_vma_detached(struct vm_area_struct *vma)
> > +{
> > +     return vma->detached;
> >  }
> >
> >  static inline void release_fault_lock(struct vm_fault *vmf)
> > @@ -844,8 +853,8 @@ static inline void vma_end_read(struct vm_area_struct *vma) {}
> >  static inline void vma_start_write(struct vm_area_struct *vma) {}
> >  static inline void vma_assert_write_locked(struct vm_area_struct *vma)
> >               { mmap_assert_write_locked(vma->vm_mm); }
> > -static inline void vma_mark_detached(struct vm_area_struct *vma,
> > -                                  bool detached) {}
> > +static inline void vma_mark_attached(struct vm_area_struct *vma) {}
> > +static inline void vma_mark_detached(struct vm_area_struct *vma) {}
> >
> >  static inline struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> >               unsigned long address)
> > @@ -878,7 +887,10 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
> >       vma->vm_mm = mm;
> >       vma->vm_ops = &vma_dummy_vm_ops;
> >       INIT_LIST_HEAD(&vma->anon_vma_chain);
> > -     vma_mark_detached(vma, false);
> > +#ifdef CONFIG_PER_VMA_LOCK
> > +     /* vma is not locked, can't use vma_mark_detached() */
> > +     vma->detached = true;
> > +#endif
> >       vma_numab_state_init(vma);
> >       vma_lock_init(vma);
> >  }
> > @@ -1073,6 +1085,7 @@ static inline int vma_iter_bulk_store(struct vma_iterator *vmi,
> >       if (unlikely(mas_is_err(&vmi->mas)))
> >               return -ENOMEM;
> >
> > +     vma_mark_attached(vma);
> >       return 0;
> >  }
> >
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index 7823797e31d2..f0cec673583c 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -465,6 +465,10 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
> >       data_race(memcpy(new, orig, sizeof(*new)));
> >       vma_lock_init(new);
> >       INIT_LIST_HEAD(&new->anon_vma_chain);
> > +#ifdef CONFIG_PER_VMA_LOCK
> > +     /* vma is not locked, can't use vma_mark_detached() */
> > +     new->detached = true;
> > +#endif
> >       vma_numab_state_init(new);
> >       dup_anon_vma_name(orig, new);
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 209885a4134f..d0197a0c0996 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -6279,7 +6279,7 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> >               goto inval;
> >
> >       /* Check if the VMA got isolated after we found it */
> > -     if (vma->detached) {
> > +     if (is_vma_detached(vma)) {
> >               vma_end_read(vma);
> >               count_vm_vma_lock_event(VMA_LOCK_MISS);
> >               /* The area was replaced with another one */
> > diff --git a/mm/vma.c b/mm/vma.c
> > index 8a454a7bbc80..73104d434567 100644
> > --- a/mm/vma.c
> > +++ b/mm/vma.c
> > @@ -295,7 +295,7 @@ static void vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi,
> >
> >       if (vp->remove) {
> >  again:
> > -             vma_mark_detached(vp->remove, true);
> > +             vma_mark_detached(vp->remove);
> >               if (vp->file) {
> >                       uprobe_munmap(vp->remove, vp->remove->vm_start,
> >                                     vp->remove->vm_end);
> > @@ -1220,7 +1220,7 @@ static void reattach_vmas(struct ma_state *mas_detach)
> >
> >       mas_set(mas_detach, 0);
> >       mas_for_each(mas_detach, vma, ULONG_MAX)
> > -             vma_mark_detached(vma, false);
> > +             vma_mark_attached(vma);
> >
> >       __mt_destroy(mas_detach->tree);
> >  }
> > @@ -1295,7 +1295,7 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
> >               if (error)
> >                       goto munmap_gather_failed;
> >
> > -             vma_mark_detached(next, true);
> > +             vma_mark_detached(next);
> >               nrpages = vma_pages(next);
> >
> >               vms->nr_pages += nrpages;
> > diff --git a/mm/vma.h b/mm/vma.h
> > index 388d34748674..2e680f357ace 100644
> > --- a/mm/vma.h
> > +++ b/mm/vma.h
> > @@ -162,6 +162,7 @@ static inline int vma_iter_store_gfp(struct vma_iterator *vmi,
> >       if (unlikely(mas_is_err(&vmi->mas)))
> >               return -ENOMEM;
> >
> > +     vma_mark_attached(vma);
> >       return 0;
> >  }
> >
> > @@ -385,6 +386,7 @@ static inline void vma_iter_store(struct vma_iterator *vmi,
> >
> >       __mas_set_range(&vmi->mas, vma->vm_start, vma->vm_end - 1);
> >       mas_store_prealloc(&vmi->mas, vma);
> > +     vma_mark_attached(vma);
> >  }
> >
> >  static inline unsigned long vma_iter_addr(struct vma_iterator *vmi)
> > diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
> > index 11c2c38ca4e8..2fed366d20ef 100644
> > --- a/tools/testing/vma/vma_internal.h
> > +++ b/tools/testing/vma/vma_internal.h
> > @@ -414,13 +414,17 @@ static inline void vma_lock_init(struct vm_area_struct *vma)
> >       vma->vm_lock_seq = UINT_MAX;
> >  }
> >
> > +static inline void vma_mark_attached(struct vm_area_struct *vma)
> > +{
> > +     vma->detached = false;
> > +}
>
> Yeah, I think it's sensible to just accept that sometimes we are already
> attached when we mark attached.
>
> > +
> >  static inline void vma_assert_write_locked(struct vm_area_struct *);
> > -static inline void vma_mark_detached(struct vm_area_struct *vma, bool detached)
> > +static inline void vma_mark_detached(struct vm_area_struct *vma)
> >  {
> >       /* When detaching vma should be write-locked */
> > -     if (detached)
> > -             vma_assert_write_locked(vma);
> > -     vma->detached = detached;
> > +     vma_assert_write_locked(vma);
> > +     vma->detached = true;
> >  }
> >
> >  extern const struct vm_operations_struct vma_dummy_vm_ops;
> > @@ -431,7 +435,8 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
> >       vma->vm_mm = mm;
> >       vma->vm_ops = &vma_dummy_vm_ops;
> >       INIT_LIST_HEAD(&vma->anon_vma_chain);
> > -     vma_mark_detached(vma, false);
> > +     /* vma is not locked, can't use vma_mark_detached() */
> > +     vma->detached = true;
> >       vma_lock_init(vma);
> >  }
> >
> > @@ -457,6 +462,8 @@ static inline struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
> >       memcpy(new, orig, sizeof(*new));
> >       vma_lock_init(new);
> >       INIT_LIST_HEAD(&new->anon_vma_chain);
> > +     /* vma is not locked, can't use vma_mark_detached() */
> > +     new->detached = true;
> >
> >       return new;
> >  }
> > --
> > 2.47.0.338.g60cca15819-goog
> >


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 4/5] mm: make vma cache SLAB_TYPESAFE_BY_RCU
  2024-11-20  6:37     ` Suren Baghdasaryan
@ 2024-11-22 22:43       ` Suren Baghdasaryan
  0 siblings, 0 replies; 27+ messages in thread
From: Suren Baghdasaryan @ 2024-11-22 22:43 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: akpm, liam.howlett, lorenzo.stoakes, mhocko, vbabka, hannes,
	mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, minchan, jannh,
	shakeel.butt, souravpanda, pasha.tatashin, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team

On Tue, Nov 19, 2024 at 10:37 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Tue, Nov 19, 2024 at 8:36 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Tue, Nov 19, 2024 at 04:08:25PM -0800, Suren Baghdasaryan wrote:
> > > +static inline void vma_clear(struct vm_area_struct *vma)
> > > +{
> > > +     /* Preserve vma->vm_lock */
> > > +     memset(vma, 0, VMA_BEFORE_LOCK);
> > > +     memset(VMA_LOCK_END(vma), 0, VMA_AFTER_LOCK);
> > > +}
> >
> > This isn't how you're supposed to handle constructors.  You've fixed
> > the immediate problem rather than writing the code in the intended style.
>
> Yeah, I don't like this myself but the only alternative I can think of
> is to set the struct members individually.
>
> >
> > > +static void vm_area_ctor(void *data)
> > > +{
> > > +     vma_lock_init(data);
> > > +}
> >
> > After the ctor has run, the object should be in the same state as
> > it is after it's freed.  If you want to memset the entire thing
> > then you can do it in the ctor.  But there should be no need to
> > do it in vma_init().
>
> IIUC, your suggestion is to memset() the vma and initialize vm_lock
> inside the ctor. Then when it's time to free the vma, we reset all
> members except vm_lock before freeing the vma. As you mention later,
> members like anon_vma_chain, which are already clear, also won't need
> to be reset at this point. Am I understanding your proposal correctly?
>
> BTW, if so, then vma_copy() will have to also copy vma members individually.
>
> >
> > And there's lots of things you can move from vma_init() to the ctor.
> > For example, at free time, anon_vma_chain should be an empty list.
> > So if you init it in the ctor, you can avoid doing it in vma_init().
>
> True.
>
> > I'd suggest that vma_numab_state_free() should be the place which
> > sets vma->numab_state to NULL and we can delete vma_numab_state_init()
> > entirely.
>
> Sounds good to me.

I took a stab at it and the result does not look pretty...
A couple of notes:
- vma_init() is also used elsewhere to initialize VMAs allocated on the
stack, so I left it alone for such cases. VMAs like that are not
allocated from vm_area_cachep and can't be reused anyway, so it's fine
to overwrite their vm_lock there.
- Since vma_init() has to stay, we can't retire vma_numab_state_init()
either, because vma_init() still uses it.
- I think resetting members before freeing might not be such a good
idea, because the object might never be reused after it's freed, in
which case that reset work is wasted.

Now, the main point:
I moved the initialization of several members into the ctor, but even with
that the code looks roughly like this:

static void vm_area_ctor(void *data)
{
    struct vm_area_struct *vma = (struct vm_area_struct *)data;

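    /*
     * Runs once when the slab object is created. Under
     * SLAB_TYPESAFE_BY_RCU the object must be back in this state by
     * the time it is freed, so only such "invariant" fields belong here.
     */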
    vma->detached = true;
    INIT_LIST_HEAD(&vma->anon_vma_chain);
    vma_lock_init(vma);
}

struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
{
    struct vm_area_struct *vma;

    vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
    if (!vma)
        return NULL;

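    /*
     * Reset everything the ctor does not own; vm_lock, detached and
     * anon_vma_chain are deliberately left untouched.
     */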
    vma->vm_mm = mm;
    vma->vm_ops = &vma_dummy_vm_ops;
    vma->vm_start = 0;
    vma->vm_end = 0;
    memset(&vma->vm_page_prot, 0, sizeof(vma->vm_page_prot));
    vm_flags_init(vma, 0);
    vma_numab_state_init(vma);
    memset(&vma->shared, 0, sizeof(vma->shared));
    vma->anon_vma = NULL;
    vma->vm_pgoff = 0;
    vma->vm_file = NULL;
    vma->vm_private_data = NULL;
    memset(&vma->vm_userfaultfd_ctx, 0, sizeof(vma->vm_userfaultfd_ctx));
#ifdef CONFIG_ANON_VMA_NAME
    vma->anon_name = NULL;
#endif
#ifdef CONFIG_SWAP
    atomic_long_set(&vma->swap_readahead_info, 0);
#endif
#ifndef CONFIG_MMU
    vma->vm_region = NULL;
#endif
#ifdef CONFIG_NUMA
    vma->vm_policy = NULL;
#endif
#ifdef CONFIG_NUMA_BALANCING
    vma->numab_state = NULL;
#endif
    return vma;
}

I can of course add helper functions and get rid of the #ifdef's, but still...
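
For illustration, the config-dependent resets could be folded into a
single helper, something like this (the helper name is made up, this is
just a sketch):

static inline void vma_clear_config_fields(struct vm_area_struct *vma)
{
#ifdef CONFIG_ANON_VMA_NAME
    vma->anon_name = NULL;
#endif
#ifdef CONFIG_SWAP
    atomic_long_set(&vma->swap_readahead_info, 0);
#endif
#ifndef CONFIG_MMU
    vma->vm_region = NULL;
#endif
#ifdef CONFIG_NUMA
    vma->vm_policy = NULL;
#endif
#ifdef CONFIG_NUMA_BALANCING
    vma->numab_state = NULL;
#endif
}

The tail of vm_area_alloc() then collapses to a single
vma_clear_config_fields(vma) call, but that only moves the #ifdef's out
of sight; the long list of per-field resets is still there.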

Matthew, I want to double-check: does this look like the solution you were
proposing, or am I completely off target?

>
> Please confirm if I correctly got your idea and I'll update this patch.
> Thanks for the feedback!
>
> >


^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2024-11-22 22:43 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-11-20  0:08 [PATCH v4 0/5] move per-vma lock into vm_area_struct Suren Baghdasaryan
2024-11-20  0:08 ` [PATCH v4 1/5] mm: introduce vma_start_read_locked{_nested} helpers Suren Baghdasaryan
2024-11-20 22:11   ` Shakeel Butt
2024-11-20  0:08 ` [PATCH v4 2/5] mm: move per-vma lock into vm_area_struct Suren Baghdasaryan
2024-11-20 23:32   ` Shakeel Butt
2024-11-20 23:44     ` Suren Baghdasaryan
2024-11-21  0:04       ` Shakeel Butt
2024-11-21  0:33         ` Suren Baghdasaryan
2024-11-21  7:01           ` Shakeel Butt
2024-11-21 17:05             ` Suren Baghdasaryan
2024-11-21 18:25               ` Shakeel Butt
2024-11-20  0:08 ` [PATCH v4 3/5] mm: mark vma as detached until it's added into vma tree Suren Baghdasaryan
2024-11-21  0:13   ` Shakeel Butt
2024-11-22 16:46   ` Lorenzo Stoakes
2024-11-22 17:47     ` Suren Baghdasaryan
2024-11-20  0:08 ` [PATCH v4 4/5] mm: make vma cache SLAB_TYPESAFE_BY_RCU Suren Baghdasaryan
2024-11-20  4:36   ` Matthew Wilcox
2024-11-20  6:37     ` Suren Baghdasaryan
2024-11-22 22:43       ` Suren Baghdasaryan
2024-11-20 10:16   ` Vlastimil Babka
2024-11-20 15:54     ` Suren Baghdasaryan
2024-11-20  0:08 ` [PATCH v4 5/5] docs/mm: document latest changes to vm_lock Suren Baghdasaryan
2024-11-20 22:10 ` [PATCH v4 0/5] move per-vma lock into vm_area_struct Shakeel Butt
2024-11-20 23:52   ` Suren Baghdasaryan
2024-11-21  2:00   ` Matthew Wilcox
2024-11-22 11:56     ` Lorenzo Stoakes
2024-11-22 15:06       ` Suren Baghdasaryan
