linux-mm.kvack.org archive mirror
* [PATCH v2 0/5] move per-vma lock into vm_area_struct
@ 2024-11-12 19:46 Suren Baghdasaryan
  2024-11-12 19:46 ` [PATCH v2 1/5] mm: introduce vma_start_read_locked{_nested} helpers Suren Baghdasaryan
                   ` (4 more replies)
  0 siblings, 5 replies; 39+ messages in thread
From: Suren Baghdasaryan @ 2024-11-12 19:46 UTC (permalink / raw)
  To: akpm
  Cc: willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka, hannes,
	mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, minchan, jannh,
	shakeel.butt, souravpanda, pasha.tatashin, linux-mm,
	linux-kernel, kernel-team, surenb

Back when per-vma locks were introduced, vm_lock was moved out of
vm_area_struct in [1] because of the performance regression caused by
false cacheline sharing. Recent investigation [2] revealed that the
regression is limited to a rather old Broadwell microarchitecture and
even there it can be mitigated by disabling adjacent cacheline
prefetching, see [3].
This patchset moves vm_lock back into vm_area_struct, aligning it at a
cacheline boundary and making the slab cache cacheline-aligned as well.
This causes VMA memory consumption to grow from 160 (vm_area_struct) + 40
(vm_lock) bytes to 256 bytes:

    slabinfo before:
     <name>           ... <objsize> <objperslab> <pagesperslab> : ...
     vma_lock         ...     40  102    1 : ...
     vm_area_struct   ...    160   51    2 : ...

    slabinfo after moving vm_lock:
     <name>           ... <objsize> <objperslab> <pagesperslab> : ...
     vm_area_struct   ...    256   32    2 : ...

Aggregate VMA memory consumption per 1000 VMAs grows from 50 to 64 pages,
which is about 5.5MB per 100000 VMAs. This overhead will be addressed in
a separate patchset by replacing the rw_semaphore in vma_lock's
implementation with a different type of lock.
Moving vm_lock into vm_area_struct lets us simplify the vm_area_free()
path, which in turn allows us to use SLAB_TYPESAFE_BY_RCU for the
vm_area_struct cache. This facilitates vm_area_struct reuse and minimizes
the number of call_rcu() calls.
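
For reference, a condensed sketch of the resulting layout; the full
change is in patch 2 of this series and the surrounding fields are
elided:

    struct vm_area_struct {
            /* ... existing fields ... */
            struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
    #ifdef CONFIG_PER_VMA_LOCK
            /* moved back in, starting on its own cacheline */
            struct vma_lock vm_lock ____cacheline_aligned_in_smp;
    #endif
    } __randomize_layout;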

Suren Baghdasaryan (5):
  mm: introduce vma_start_read_locked{_nested} helpers
  mm: move per-vma lock into vm_area_struct
  mm: mark vma as detached until it's added into vma tree
  mm: make vma cache SLAB_TYPESAFE_BY_RCU
  docs/mm: document latest changes to vm_lock

 Documentation/mm/process_addrs.rst | 10 +++--
 include/linux/mm.h                 | 54 +++++++++++++++++-----
 include/linux/mm_types.h           | 16 ++++---
 include/linux/slab.h               |  6 ---
 kernel/fork.c                      | 72 +++++++-----------------------
 mm/memory.c                        |  2 +-
 mm/mmap.c                          |  2 +
 mm/nommu.c                         |  2 +
 mm/userfaultfd.c                   | 14 +++---
 mm/vma.c                           |  3 ++
 tools/testing/vma/vma_internal.h   |  3 +-
 11 files changed, 92 insertions(+), 92 deletions(-)


base-commit: 931086f2a88086319afb57cd3925607e8cda0a9f
-- 
2.47.0.277.g8800431eea-goog



^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH v2 1/5] mm: introduce vma_start_read_locked{_nested} helpers
  2024-11-12 19:46 [PATCH v2 0/5] move per-vma lock into vm_area_struct Suren Baghdasaryan
@ 2024-11-12 19:46 ` Suren Baghdasaryan
  2024-11-13 14:10   ` Lorenzo Stoakes
  2024-11-12 19:46 ` [PATCH v2 2/5] mm: move per-vma lock into vm_area_struct Suren Baghdasaryan
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 39+ messages in thread
From: Suren Baghdasaryan @ 2024-11-12 19:46 UTC (permalink / raw)
  To: akpm
  Cc: willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka, hannes,
	mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, minchan, jannh,
	shakeel.butt, souravpanda, pasha.tatashin, linux-mm,
	linux-kernel, kernel-team, surenb

Introduce helper functions which can be used to read-lock a VMA when
holding mmap_lock for read. Replace direct accesses to vma->vm_lock
with these new helpers.
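
For illustration, a minimal sketch of the intended usage pattern; it
mirrors the uffd_lock_vma() change below and omits error handling:

    mmap_read_lock(mm);
    vma = find_vma(mm, address);
    if (vma)
            vma_start_read_locked(vma); /* cannot fail under mmap_read_lock */
    mmap_read_unlock(mm);

    if (vma) {
            /* ... operate on the vma under its per-VMA read lock ... */
            vma_end_read(vma);
    }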

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 include/linux/mm.h | 20 ++++++++++++++++++++
 mm/userfaultfd.c   | 14 ++++++--------
 2 files changed, 26 insertions(+), 8 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index fecd47239fa9..01ce619f3d17 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -722,6 +722,26 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
 	return true;
 }
 
+/*
+ * Use only while holding mmap_read_lock which guarantees that nobody can lock
+ * the vma for write (vma_start_write()) from under us.
+ */
+static inline void vma_start_read_locked_nested(struct vm_area_struct *vma, int subclass)
+{
+	mmap_assert_locked(vma->vm_mm);
+	down_read_nested(&vma->vm_lock->lock, subclass);
+}
+
+/*
+ * Use only while holding mmap_read_lock which guarantees that nobody can lock
+ * the vma for write (vma_start_write()) from under us.
+ */
+static inline void vma_start_read_locked(struct vm_area_struct *vma)
+{
+	mmap_assert_locked(vma->vm_mm);
+	down_read(&vma->vm_lock->lock);
+}
+
 static inline void vma_end_read(struct vm_area_struct *vma)
 {
 	rcu_read_lock(); /* keeps vma alive till the end of up_read */
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 60a0be33766f..55019c11b5a8 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -86,13 +86,11 @@ static struct vm_area_struct *uffd_lock_vma(struct mm_struct *mm,
 	vma = find_vma_and_prepare_anon(mm, address);
 	if (!IS_ERR(vma)) {
 		/*
+		 * While holding mmap_lock we can't fail
 		 * We cannot use vma_start_read() as it may fail due to
-		 * false locked (see comment in vma_start_read()). We
-		 * can avoid that by directly locking vm_lock under
-		 * mmap_lock, which guarantees that nobody can lock the
-		 * vma for write (vma_start_write()) under us.
+		 * false locked (see comment in vma_start_read()).
 		 */
-		down_read(&vma->vm_lock->lock);
+		vma_start_read_locked(vma);
 	}
 
 	mmap_read_unlock(mm);
@@ -1480,10 +1478,10 @@ static int uffd_move_lock(struct mm_struct *mm,
 		 * See comment in uffd_lock_vma() as to why not using
 		 * vma_start_read() here.
 		 */
-		down_read(&(*dst_vmap)->vm_lock->lock);
+		vma_start_read_locked(*dst_vmap);
 		if (*dst_vmap != *src_vmap)
-			down_read_nested(&(*src_vmap)->vm_lock->lock,
-					 SINGLE_DEPTH_NESTING);
+			vma_start_read_locked_nested(*src_vmap,
+						SINGLE_DEPTH_NESTING);
 	}
 	mmap_read_unlock(mm);
 	return err;
-- 
2.47.0.277.g8800431eea-goog



^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH v2 2/5] mm: move per-vma lock into vm_area_struct
  2024-11-12 19:46 [PATCH v2 0/5] move per-vma lock into vm_area_struct Suren Baghdasaryan
  2024-11-12 19:46 ` [PATCH v2 1/5] mm: introduce vma_start_read_locked{_nested} helpers Suren Baghdasaryan
@ 2024-11-12 19:46 ` Suren Baghdasaryan
  2024-11-13 14:28   ` Lorenzo Stoakes
  2024-11-12 19:46 ` [PATCH v2 3/5] mm: mark vma as detached until it's added into vma tree Suren Baghdasaryan
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 39+ messages in thread
From: Suren Baghdasaryan @ 2024-11-12 19:46 UTC (permalink / raw)
  To: akpm
  Cc: willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka, hannes,
	mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, minchan, jannh,
	shakeel.butt, souravpanda, pasha.tatashin, linux-mm,
	linux-kernel, kernel-team, surenb

Back when per-vma locks were introduced, vm_lock was moved out of
vm_area_struct in [1] because of the performance regression caused by
false cacheline sharing. Recent investigation [2] revealed that the
regression is limited to a rather old Broadwell microarchitecture and
even there it can be mitigated by disabling adjacent cacheline
prefetching, see [3].
This patchset moves vm_lock back into vm_area_struct, aligning it at a
cacheline boundary and making the slab cache cacheline-aligned as well.
This causes VMA memory consumption to grow from 160 (vm_area_struct) + 40
(vm_lock) bytes to 256 bytes:

    slabinfo before:
     <name>           ... <objsize> <objperslab> <pagesperslab> : ...
     vma_lock         ...     40  102    1 : ...
     vm_area_struct   ...    160   51    2 : ...

    slabinfo after moving vm_lock:
     <name>           ... <objsize> <objperslab> <pagesperslab> : ...
     vm_area_struct   ...    256   32    2 : ...

Aggregate VMA memory consumption per 1000 VMAs grows from 50 to 64 pages,
which is 5.5MB per 100000 VMAs. This memory consumption growth can be
addressed later by optimizing the vm_lock.
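
For reference, the page numbers above follow directly from the slabinfo
data, assuming 4KB pages and 1000 back-to-back allocations:

    before: 1000 / 51  objs/slab -> 20 slabs * 2 pages = 40 pages (vm_area_struct)
            1000 / 102 objs/slab -> 10 slabs * 1 page  = 10 pages (vma_lock)
                                                 total ~= 50 pages
    after:  1000 / 32  objs/slab -> 32 slabs * 2 pages = 64 pages
    delta:  14 pages per 1000 VMAs ~= 5.5MB per 100000 VMAs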

[1] https://lore.kernel.org/all/20230227173632.3292573-34-surenb@google.com/
[2] https://lore.kernel.org/all/ZsQyI%2F087V34JoIt@xsang-OptiPlex-9020/
[3] https://lore.kernel.org/all/CAJuCfpEisU8Lfe96AYJDZ+OM4NoPmnw9bP53cT_kbfP_pR+-2g@mail.gmail.com/

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 include/linux/mm.h       | 28 +++++++++++++----------
 include/linux/mm_types.h |  6 +++--
 kernel/fork.c            | 49 ++++------------------------------------
 3 files changed, 25 insertions(+), 58 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 01ce619f3d17..a5eb0be3e351 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -684,6 +684,12 @@ static inline void vma_numab_state_free(struct vm_area_struct *vma) {}
 #endif /* CONFIG_NUMA_BALANCING */
 
 #ifdef CONFIG_PER_VMA_LOCK
+static inline void vma_lock_init(struct vm_area_struct *vma)
+{
+	init_rwsem(&vma->vm_lock.lock);
+	vma->vm_lock_seq = UINT_MAX;
+}
+
 /*
  * Try to read-lock a vma. The function is allowed to occasionally yield false
  * locked result to avoid performance overhead, in which case we fall back to
@@ -701,7 +707,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
 	if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq.sequence))
 		return false;
 
-	if (unlikely(down_read_trylock(&vma->vm_lock->lock) == 0))
+	if (unlikely(down_read_trylock(&vma->vm_lock.lock) == 0))
 		return false;
 
 	/*
@@ -716,7 +722,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
 	 * This pairs with RELEASE semantics in vma_end_write_all().
 	 */
 	if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&vma->vm_mm->mm_lock_seq))) {
-		up_read(&vma->vm_lock->lock);
+		up_read(&vma->vm_lock.lock);
 		return false;
 	}
 	return true;
@@ -729,7 +735,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
 static inline void vma_start_read_locked_nested(struct vm_area_struct *vma, int subclass)
 {
 	mmap_assert_locked(vma->vm_mm);
-	down_read_nested(&vma->vm_lock->lock, subclass);
+	down_read_nested(&vma->vm_lock.lock, subclass);
 }
 
 /*
@@ -739,13 +745,13 @@ static inline void vma_start_read_locked_nested(struct vm_area_struct *vma, int
 static inline void vma_start_read_locked(struct vm_area_struct *vma)
 {
 	mmap_assert_locked(vma->vm_mm);
-	down_read(&vma->vm_lock->lock);
+	down_read(&vma->vm_lock.lock);
 }
 
 static inline void vma_end_read(struct vm_area_struct *vma)
 {
 	rcu_read_lock(); /* keeps vma alive till the end of up_read */
-	up_read(&vma->vm_lock->lock);
+	up_read(&vma->vm_lock.lock);
 	rcu_read_unlock();
 }
 
@@ -774,7 +780,7 @@ static inline void vma_start_write(struct vm_area_struct *vma)
 	if (__is_vma_write_locked(vma, &mm_lock_seq))
 		return;
 
-	down_write(&vma->vm_lock->lock);
+	down_write(&vma->vm_lock.lock);
 	/*
 	 * We should use WRITE_ONCE() here because we can have concurrent reads
 	 * from the early lockless pessimistic check in vma_start_read().
@@ -782,7 +788,7 @@ static inline void vma_start_write(struct vm_area_struct *vma)
 	 * we should use WRITE_ONCE() for cleanliness and to keep KCSAN happy.
 	 */
 	WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
-	up_write(&vma->vm_lock->lock);
+	up_write(&vma->vm_lock.lock);
 }
 
 static inline void vma_assert_write_locked(struct vm_area_struct *vma)
@@ -794,7 +800,7 @@ static inline void vma_assert_write_locked(struct vm_area_struct *vma)
 
 static inline void vma_assert_locked(struct vm_area_struct *vma)
 {
-	if (!rwsem_is_locked(&vma->vm_lock->lock))
+	if (!rwsem_is_locked(&vma->vm_lock.lock))
 		vma_assert_write_locked(vma);
 }
 
@@ -827,6 +833,7 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
 
 #else /* CONFIG_PER_VMA_LOCK */
 
+static inline void vma_lock_init(struct vm_area_struct *vma) {}
 static inline bool vma_start_read(struct vm_area_struct *vma)
 		{ return false; }
 static inline void vma_end_read(struct vm_area_struct *vma) {}
@@ -861,10 +868,6 @@ static inline void assert_fault_locked(struct vm_fault *vmf)
 
 extern const struct vm_operations_struct vma_dummy_vm_ops;
 
-/*
- * WARNING: vma_init does not initialize vma->vm_lock.
- * Use vm_area_alloc()/vm_area_free() if vma needs locking.
- */
 static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
 {
 	memset(vma, 0, sizeof(*vma));
@@ -873,6 +876,7 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
 	INIT_LIST_HEAD(&vma->anon_vma_chain);
 	vma_mark_detached(vma, false);
 	vma_numab_state_init(vma);
+	vma_lock_init(vma);
 }
 
 /* Use when VMA is not part of the VMA tree and needs no locking */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 80fef38d9d64..5c4bfdcfac72 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -716,8 +716,6 @@ struct vm_area_struct {
 	 * slowpath.
 	 */
 	unsigned int vm_lock_seq;
-	/* Unstable RCU readers are allowed to read this. */
-	struct vma_lock *vm_lock;
 #endif
 
 	/*
@@ -770,6 +768,10 @@ struct vm_area_struct {
 	struct vma_numab_state *numab_state;	/* NUMA Balancing state */
 #endif
 	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
+#ifdef CONFIG_PER_VMA_LOCK
+	/* Unstable RCU readers are allowed to read this. */
+	struct vma_lock vm_lock ____cacheline_aligned_in_smp;
+#endif
 } __randomize_layout;
 
 #ifdef CONFIG_NUMA
diff --git a/kernel/fork.c b/kernel/fork.c
index 0061cf2450ef..7823797e31d2 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -436,35 +436,6 @@ static struct kmem_cache *vm_area_cachep;
 /* SLAB cache for mm_struct structures (tsk->mm) */
 static struct kmem_cache *mm_cachep;
 
-#ifdef CONFIG_PER_VMA_LOCK
-
-/* SLAB cache for vm_area_struct.lock */
-static struct kmem_cache *vma_lock_cachep;
-
-static bool vma_lock_alloc(struct vm_area_struct *vma)
-{
-	vma->vm_lock = kmem_cache_alloc(vma_lock_cachep, GFP_KERNEL);
-	if (!vma->vm_lock)
-		return false;
-
-	init_rwsem(&vma->vm_lock->lock);
-	vma->vm_lock_seq = UINT_MAX;
-
-	return true;
-}
-
-static inline void vma_lock_free(struct vm_area_struct *vma)
-{
-	kmem_cache_free(vma_lock_cachep, vma->vm_lock);
-}
-
-#else /* CONFIG_PER_VMA_LOCK */
-
-static inline bool vma_lock_alloc(struct vm_area_struct *vma) { return true; }
-static inline void vma_lock_free(struct vm_area_struct *vma) {}
-
-#endif /* CONFIG_PER_VMA_LOCK */
-
 struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
 {
 	struct vm_area_struct *vma;
@@ -474,10 +445,6 @@ struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
 		return NULL;
 
 	vma_init(vma, mm);
-	if (!vma_lock_alloc(vma)) {
-		kmem_cache_free(vm_area_cachep, vma);
-		return NULL;
-	}
 
 	return vma;
 }
@@ -496,10 +463,7 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
 	 * will be reinitialized.
 	 */
 	data_race(memcpy(new, orig, sizeof(*new)));
-	if (!vma_lock_alloc(new)) {
-		kmem_cache_free(vm_area_cachep, new);
-		return NULL;
-	}
+	vma_lock_init(new);
 	INIT_LIST_HEAD(&new->anon_vma_chain);
 	vma_numab_state_init(new);
 	dup_anon_vma_name(orig, new);
@@ -511,7 +475,6 @@ void __vm_area_free(struct vm_area_struct *vma)
 {
 	vma_numab_state_free(vma);
 	free_anon_vma_name(vma);
-	vma_lock_free(vma);
 	kmem_cache_free(vm_area_cachep, vma);
 }
 
@@ -522,7 +485,7 @@ static void vm_area_free_rcu_cb(struct rcu_head *head)
 						  vm_rcu);
 
 	/* The vma should not be locked while being destroyed. */
-	VM_BUG_ON_VMA(rwsem_is_locked(&vma->vm_lock->lock), vma);
+	VM_BUG_ON_VMA(rwsem_is_locked(&vma->vm_lock.lock), vma);
 	__vm_area_free(vma);
 }
 #endif
@@ -3168,11 +3131,9 @@ void __init proc_caches_init(void)
 			sizeof(struct fs_struct), 0,
 			SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT,
 			NULL);
-
-	vm_area_cachep = KMEM_CACHE(vm_area_struct, SLAB_PANIC|SLAB_ACCOUNT);
-#ifdef CONFIG_PER_VMA_LOCK
-	vma_lock_cachep = KMEM_CACHE(vma_lock, SLAB_PANIC|SLAB_ACCOUNT);
-#endif
+	vm_area_cachep = KMEM_CACHE(vm_area_struct,
+			SLAB_HWCACHE_ALIGN|SLAB_NO_MERGE|SLAB_PANIC|
+			SLAB_ACCOUNT);
 	mmap_init();
 	nsproxy_cache_init();
 }
-- 
2.47.0.277.g8800431eea-goog



^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH v2 3/5] mm: mark vma as detached until it's added into vma tree
  2024-11-12 19:46 [PATCH v2 0/5] move per-vma lock into vm_area_struct Suren Baghdasaryan
  2024-11-12 19:46 ` [PATCH v2 1/5] mm: introduce vma_start_read_locked{_nested} helpers Suren Baghdasaryan
  2024-11-12 19:46 ` [PATCH v2 2/5] mm: move per-vma lock into vm_area_struct Suren Baghdasaryan
@ 2024-11-12 19:46 ` Suren Baghdasaryan
  2024-11-13 14:43   ` Lorenzo Stoakes
  2024-11-12 19:46 ` [PATCH v2 4/5] mm: make vma cache SLAB_TYPESAFE_BY_RCU Suren Baghdasaryan
  2024-11-12 19:46 ` [PATCH v2 5/5] docs/mm: document latest changes to vm_lock Suren Baghdasaryan
  4 siblings, 1 reply; 39+ messages in thread
From: Suren Baghdasaryan @ 2024-11-12 19:46 UTC (permalink / raw)
  To: akpm
  Cc: willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka, hannes,
	mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, minchan, jannh,
	shakeel.butt, souravpanda, pasha.tatashin, linux-mm,
	linux-kernel, kernel-team, surenb

The current implementation does not set the detached flag when a VMA is
first allocated. This does not represent the real state of the VMA, which
is detached until it is added into the mm's VMA tree. Fix this by marking
new VMAs as detached and clearing the detached flag only after the VMA is
added into a tree.
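
For illustration, the resulting lifecycle in condensed form, taken from
the vma_link() change below:

    vma = vm_area_alloc(mm);        /* vma->detached == true from vma_init() */
    ...
    vma_start_write(vma);
    vma_iter_store(&vmi, vma);      /* insert into the VMA tree */
    vma_mark_detached(vma, false);  /* only now marked as attached */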

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 include/linux/mm.h               | 10 +++++++++-
 mm/memory.c                      |  2 +-
 mm/mmap.c                        |  2 ++
 mm/nommu.c                       |  2 ++
 mm/vma.c                         |  3 +++
 tools/testing/vma/vma_internal.h |  3 ++-
 6 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index a5eb0be3e351..245a85caf4c3 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -812,6 +812,11 @@ static inline void vma_mark_detached(struct vm_area_struct *vma, bool detached)
 	vma->detached = detached;
 }
 
+static inline bool is_vma_detached(struct vm_area_struct *vma)
+{
+	return vma->detached;
+}
+
 static inline void release_fault_lock(struct vm_fault *vmf)
 {
 	if (vmf->flags & FAULT_FLAG_VMA_LOCK)
@@ -874,7 +879,10 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
 	vma->vm_mm = mm;
 	vma->vm_ops = &vma_dummy_vm_ops;
 	INIT_LIST_HEAD(&vma->anon_vma_chain);
-	vma_mark_detached(vma, false);
+#ifdef CONFIG_PER_VMA_LOCK
+	/* vma is not locked, can't use vma_mark_detached() */
+	vma->detached = true;
+#endif
 	vma_numab_state_init(vma);
 	vma_lock_init(vma);
 }
diff --git a/mm/memory.c b/mm/memory.c
index 209885a4134f..d0197a0c0996 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6279,7 +6279,7 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
 		goto inval;
 
 	/* Check if the VMA got isolated after we found it */
-	if (vma->detached) {
+	if (is_vma_detached(vma)) {
 		vma_end_read(vma);
 		count_vm_vma_lock_event(VMA_LOCK_MISS);
 		/* The area was replaced with another one */
diff --git a/mm/mmap.c b/mm/mmap.c
index 386429f7db5a..1295c4cedaf4 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1570,6 +1570,7 @@ static int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
 	if (vma_iter_store_gfp(vmi, vma, GFP_KERNEL))
 		goto mas_store_fail;
 
+	vma_mark_detached(vma, false);
 	mm->map_count++;
 	validate_mm(mm);
 	ksm_add_vma(vma);
@@ -1890,6 +1891,7 @@ static struct vm_area_struct *__install_special_mapping(
 	if (ret)
 		goto out;
 
+	vma_mark_detached(vma, false);
 	vm_stat_account(mm, vma->vm_flags, len >> PAGE_SHIFT);
 
 	perf_event_mmap(vma);
diff --git a/mm/nommu.c b/mm/nommu.c
index 9cb6e99215e2..6afd5c2bd97d 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1192,6 +1192,7 @@ unsigned long do_mmap(struct file *file,
 	current->mm->map_count++;
 	/* add the VMA to the tree */
 	vma_iter_store(&vmi, vma);
+	vma_mark_detached(vma, false);
 
 	/* we flush the region from the icache only when the first executable
 	 * mapping of it is made  */
@@ -1357,6 +1358,7 @@ static int split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
 	setup_vma_to_mm(vma, mm);
 	setup_vma_to_mm(new, mm);
 	vma_iter_store(vmi, new);
+	vma_mark_detached(new, false);
 	mm->map_count++;
 	return 0;
 
diff --git a/mm/vma.c b/mm/vma.c
index 8a454a7bbc80..1426871fa6e0 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -275,6 +275,7 @@ static void vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi,
 		 * (it may either follow vma or precede it).
 		 */
 		vma_iter_store(vmi, vp->insert);
+		vma_mark_detached(vp->insert, false);
 		mm->map_count++;
 	}
 
@@ -1690,6 +1691,7 @@ int vma_link(struct mm_struct *mm, struct vm_area_struct *vma)
 
 	vma_start_write(vma);
 	vma_iter_store(&vmi, vma);
+	vma_mark_detached(vma, false);
 	vma_link_file(vma);
 	mm->map_count++;
 	validate_mm(mm);
@@ -2369,6 +2371,7 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap)
 	/* Lock the VMA since it is modified after insertion into VMA tree */
 	vma_start_write(vma);
 	vma_iter_store(vmi, vma);
+	vma_mark_detached(vma, false);
 	map->mm->map_count++;
 	vma_link_file(vma);
 
diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
index 1d9fc97b8e80..fdb60978821f 100644
--- a/tools/testing/vma/vma_internal.h
+++ b/tools/testing/vma/vma_internal.h
@@ -438,7 +438,8 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
 	vma->vm_mm = mm;
 	vma->vm_ops = &vma_dummy_vm_ops;
 	INIT_LIST_HEAD(&vma->anon_vma_chain);
-	vma_mark_detached(vma, false);
+	/* vma is not locked, can't use vma_mark_detached() */
+	vma->detached = true;
 }
 
 static inline struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
-- 
2.47.0.277.g8800431eea-goog



^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH v2 4/5] mm: make vma cache SLAB_TYPESAFE_BY_RCU
  2024-11-12 19:46 [PATCH v2 0/5] move per-vma lock into vm_area_struct Suren Baghdasaryan
                   ` (2 preceding siblings ...)
  2024-11-12 19:46 ` [PATCH v2 3/5] mm: mark vma as detached until it's added into vma tree Suren Baghdasaryan
@ 2024-11-12 19:46 ` Suren Baghdasaryan
  2024-11-13  2:57   ` Suren Baghdasaryan
  2024-11-13  8:58   ` Vlastimil Babka
  2024-11-12 19:46 ` [PATCH v2 5/5] docs/mm: document latest changes to vm_lock Suren Baghdasaryan
  4 siblings, 2 replies; 39+ messages in thread
From: Suren Baghdasaryan @ 2024-11-12 19:46 UTC (permalink / raw)
  To: akpm
  Cc: willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka, hannes,
	mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, minchan, jannh,
	shakeel.butt, souravpanda, pasha.tatashin, linux-mm,
	linux-kernel, kernel-team, surenb

To enable SLAB_TYPESAFE_BY_RCU for the vma cache we need to ensure that
object reuse before the RCU grace period is over will be detected inside
lock_vma_under_rcu().
lock_vma_under_rcu() enters an RCU read section, finds the vma at the
given address, locks the vma and checks if it got detached or remapped
to cover a different address range. These last checks are there
to ensure that the vma was not modified after we found it but before
locking it. VMA reuse introduces the possibility that in between the
events of finding and locking the vma, it can get detached, reused,
added into a tree and marked as attached. The current checks help
detect cases where:
- vma was reused but not yet added into the tree (detached check)
- vma was reused at a different address range (address check)
If the vma covers a new address range which still includes the address
we were looking for, that is not a problem unless the reused vma was
added into a different address space. Therefore checking that vma->vm_mm
is still the same is the only missing check to detect vma reuse.
Add this missing check into lock_vma_under_rcu() and change the vma cache
to include SLAB_TYPESAFE_BY_RCU. This will facilitate vm_area_struct
reuse and minimize the number of call_rcu() calls.
Adding vm_freeptr into vm_area_struct avoids bloating that structure.
The lock_vma_under_rcu() check of the detached flag guarantees that the
vma is valid and attached to a tree, therefore unioning vm_freeptr with
vm_start/vm_end is not an issue even though lock_vma_under_rcu() uses
them.
As part of this change, the freeptr_t declaration is moved into
mm_types.h to avoid circular dependencies between mm_types.h and slab.h.
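
For illustration, an abridged view of the resulting validation sequence
in lock_vma_under_rcu(); this is simplified, see mm/memory.c for the
full code:

    rcu_read_lock();
    vma = mas_walk(&mas);
    if (!vma || !vma_start_read(vma))
            goto inval;
    /* the object may have been freed and reused while unlocked: */
    if (is_vma_detached(vma) || vma->vm_mm != mm)
            goto inval_end_read;    /* reused and not (yet) in this tree */
    if (unlikely(address < vma->vm_start || address >= vma->vm_end))
            goto inval_end_read;    /* reused/remapped to another range */
    rcu_read_unlock();
    return vma;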

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 include/linux/mm_types.h | 10 +++++++---
 include/linux/slab.h     |  6 ------
 kernel/fork.c            | 29 +++++++++++++----------------
 mm/memory.c              |  2 +-
 4 files changed, 21 insertions(+), 26 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 5c4bfdcfac72..37580cc7bec0 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -32,6 +32,12 @@
 struct address_space;
 struct mem_cgroup;
 
+/*
+ * freeptr_t represents a SLUB freelist pointer, which might be encoded
+ * and not dereferenceable if CONFIG_SLAB_FREELIST_HARDENED is enabled.
+ */
+typedef struct { unsigned long v; } freeptr_t;
+
 /*
  * Each physical page in the system has a struct page associated with
  * it to keep track of whatever it is we are using the page for at the
@@ -673,9 +679,7 @@ struct vm_area_struct {
 			unsigned long vm_start;
 			unsigned long vm_end;
 		};
-#ifdef CONFIG_PER_VMA_LOCK
-		struct rcu_head vm_rcu;	/* Used for deferred freeing. */
-#endif
+		freeptr_t vm_freeptr; /* Pointer used by SLAB_TYPESAFE_BY_RCU */
 	};
 
 	/*
diff --git a/include/linux/slab.h b/include/linux/slab.h
index b35e2db7eb0e..cb45db2402ac 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -212,12 +212,6 @@ enum _slab_flag_bits {
 #define SLAB_NO_OBJ_EXT		__SLAB_FLAG_UNUSED
 #endif
 
-/*
- * freeptr_t represents a SLUB freelist pointer, which might be encoded
- * and not dereferenceable if CONFIG_SLAB_FREELIST_HARDENED is enabled.
- */
-typedef struct { unsigned long v; } freeptr_t;
-
 /*
  * ZERO_SIZE_PTR will be returned for zero sized kmalloc requests.
  *
diff --git a/kernel/fork.c b/kernel/fork.c
index 7823797e31d2..946c3f9a9342 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -478,25 +478,15 @@ void __vm_area_free(struct vm_area_struct *vma)
 	kmem_cache_free(vm_area_cachep, vma);
 }
 
-#ifdef CONFIG_PER_VMA_LOCK
-static void vm_area_free_rcu_cb(struct rcu_head *head)
+void vm_area_free(struct vm_area_struct *vma)
 {
-	struct vm_area_struct *vma = container_of(head, struct vm_area_struct,
-						  vm_rcu);
-
+#ifdef CONFIG_PER_VMA_LOCK
+	/* The vma should be detached while being destroyed. */
+	VM_BUG_ON_VMA(!is_vma_detached(vma), vma);
 	/* The vma should not be locked while being destroyed. */
 	VM_BUG_ON_VMA(rwsem_is_locked(&vma->vm_lock.lock), vma);
-	__vm_area_free(vma);
-}
 #endif
-
-void vm_area_free(struct vm_area_struct *vma)
-{
-#ifdef CONFIG_PER_VMA_LOCK
-	call_rcu(&vma->vm_rcu, vm_area_free_rcu_cb);
-#else
 	__vm_area_free(vma);
-#endif
 }
 
 static void account_kernel_stack(struct task_struct *tsk, int account)
@@ -3115,6 +3105,11 @@ void __init mm_cache_init(void)
 
 void __init proc_caches_init(void)
 {
+	struct kmem_cache_args args = {
+		.use_freeptr_offset = true,
+		.freeptr_offset = offsetof(struct vm_area_struct, vm_freeptr),
+	};
+
 	sighand_cachep = kmem_cache_create("sighand_cache",
 			sizeof(struct sighand_struct), 0,
 			SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_TYPESAFE_BY_RCU|
@@ -3131,9 +3126,11 @@ void __init proc_caches_init(void)
 			sizeof(struct fs_struct), 0,
 			SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT,
 			NULL);
-	vm_area_cachep = KMEM_CACHE(vm_area_struct,
-			SLAB_HWCACHE_ALIGN|SLAB_NO_MERGE|SLAB_PANIC|
+	vm_area_cachep = kmem_cache_create("vm_area_struct",
+			sizeof(struct vm_area_struct), &args,
+			SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_TYPESAFE_BY_RCU|
 			SLAB_ACCOUNT);
+
 	mmap_init();
 	nsproxy_cache_init();
 }
diff --git a/mm/memory.c b/mm/memory.c
index d0197a0c0996..9c414c81f14a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6279,7 +6279,7 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
 		goto inval;
 
 	/* Check if the VMA got isolated after we found it */
-	if (is_vma_detached(vma)) {
+	if (is_vma_detached(vma) || vma->vm_mm != mm) {
 		vma_end_read(vma);
 		count_vm_vma_lock_event(VMA_LOCK_MISS);
 		/* The area was replaced with another one */
-- 
2.47.0.277.g8800431eea-goog



^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH v2 5/5] docs/mm: document latest changes to vm_lock
  2024-11-12 19:46 [PATCH v2 0/5] move per-vma lock into vm_area_struct Suren Baghdasaryan
                   ` (3 preceding siblings ...)
  2024-11-12 19:46 ` [PATCH v2 4/5] mm: make vma cache SLAB_TYPESAFE_BY_RCU Suren Baghdasaryan
@ 2024-11-12 19:46 ` Suren Baghdasaryan
  2024-11-12 19:51   ` Suren Baghdasaryan
  4 siblings, 1 reply; 39+ messages in thread
From: Suren Baghdasaryan @ 2024-11-12 19:46 UTC (permalink / raw)
  To: akpm
  Cc: willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka, hannes,
	mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, minchan, jannh,
	shakeel.butt, souravpanda, pasha.tatashin, linux-mm,
	linux-kernel, kernel-team, surenb

Change the documentation to reflect that vm_lock is integrated into vma.
Document newly introduced vma_start_read_locked{_nested} functions.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 Documentation/mm/process_addrs.rst | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/Documentation/mm/process_addrs.rst b/Documentation/mm/process_addrs.rst
index ed74685ffbf2..c8935509173e 100644
--- a/Documentation/mm/process_addrs.rst
+++ b/Documentation/mm/process_addrs.rst
@@ -675,7 +675,11 @@ RCU critical section, then attempts to VMA lock it via
 :c:func:`!vma_start_read`, before releasing the RCU lock via
 :c:func:`!rcu_read_unlock`.
 
-VMA read locks hold the read lock on the :c:member:`!vma->vm_lock` semaphore for
+In cases when the user already holds mmap read lock, :c:func:`!vma_start_read_locked`
+and :c:func:`!vma_start_read_locked_nested` can be used. These functions always
+succeed in acquiring VMA read lock.
+
+VMA read locks hold the read lock on the :c:member:`!vma.vm_lock` semaphore for
 their duration and the caller of :c:func:`!lock_vma_under_rcu` must release it
 via :c:func:`!vma_end_read`.
 
@@ -739,7 +743,7 @@ keep VMAs locked across entirely separate write operations. It also maintains
 correct lock ordering.
 
 Each time a VMA read lock is acquired, we acquire a read lock on the
-:c:member:`!vma->vm_lock` read/write semaphore and hold it, while checking that
+:c:member:`!vma.vm_lock` read/write semaphore and hold it, while checking that
 the sequence count of the VMA does not match that of the mm.
 
 If it does, the read lock fails. If it does not, we hold the lock, excluding
@@ -749,7 +753,7 @@ Importantly, maple tree operations performed in :c:func:`!lock_vma_under_rcu`
 are also RCU safe, so the whole read lock operation is guaranteed to function
 correctly.
 
-On the write side, we acquire a write lock on the :c:member:`!vma->vm_lock`
+On the write side, we acquire a write lock on the :c:member:`!vma.vm_lock`
 read/write semaphore, before setting the VMA's sequence number under this lock,
 also simultaneously holding the mmap write lock.
 
-- 
2.47.0.277.g8800431eea-goog



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 5/5] docs/mm: document latest changes to vm_lock
  2024-11-12 19:46 ` [PATCH v2 5/5] docs/mm: document latest changes to vm_lock Suren Baghdasaryan
@ 2024-11-12 19:51   ` Suren Baghdasaryan
  2024-11-13 14:46     ` Lorenzo Stoakes
  0 siblings, 1 reply; 39+ messages in thread
From: Suren Baghdasaryan @ 2024-11-12 19:51 UTC (permalink / raw)
  To: akpm
  Cc: willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka, hannes,
	mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, minchan, jannh,
	shakeel.butt, souravpanda, pasha.tatashin, linux-mm,
	linux-kernel, kernel-team, Jonathan Corbet,
	open list:DOCUMENTATION

On Tue, Nov 12, 2024 at 11:47 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> Change the documentation to reflect that vm_lock is integrated into vma.
> Document newly introduced vma_start_read_locked{_nested} functions.
>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>

Sorry, forgot to CC Mr. Corbet and linux-doc@vger. Added now.

> ---
>  Documentation/mm/process_addrs.rst | 10 +++++++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/mm/process_addrs.rst b/Documentation/mm/process_addrs.rst
> index ed74685ffbf2..c8935509173e 100644
> --- a/Documentation/mm/process_addrs.rst
> +++ b/Documentation/mm/process_addrs.rst
> @@ -675,7 +675,11 @@ RCU critical section, then attempts to VMA lock it via
>  :c:func:`!vma_start_read`, before releasing the RCU lock via
>  :c:func:`!rcu_read_unlock`.
>
> -VMA read locks hold the read lock on the :c:member:`!vma->vm_lock` semaphore for
> +In cases when the user already holds mmap read lock, :c:func:`!vma_start_read_locked`
> +and :c:func:`!vma_start_read_locked_nested` can be used. These functions always
> +succeed in acquiring VMA read lock.
> +
> +VMA read locks hold the read lock on the :c:member:`!vma.vm_lock` semaphore for
>  their duration and the caller of :c:func:`!lock_vma_under_rcu` must release it
>  via :c:func:`!vma_end_read`.
>
> @@ -739,7 +743,7 @@ keep VMAs locked across entirely separate write operations. It also maintains
>  correct lock ordering.
>
>  Each time a VMA read lock is acquired, we acquire a read lock on the
> -:c:member:`!vma->vm_lock` read/write semaphore and hold it, while checking that
> +:c:member:`!vma.vm_lock` read/write semaphore and hold it, while checking that
>  the sequence count of the VMA does not match that of the mm.
>
>  If it does, the read lock fails. If it does not, we hold the lock, excluding
> @@ -749,7 +753,7 @@ Importantly, maple tree operations performed in :c:func:`!lock_vma_under_rcu`
>  are also RCU safe, so the whole read lock operation is guaranteed to function
>  correctly.
>
> -On the write side, we acquire a write lock on the :c:member:`!vma->vm_lock`
> +On the write side, we acquire a write lock on the :c:member:`!vma.vm_lock`
>  read/write semaphore, before setting the VMA's sequence number under this lock,
>  also simultaneously holding the mmap write lock.
>
> --
> 2.47.0.277.g8800431eea-goog
>


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 4/5] mm: make vma cache SLAB_TYPESAFE_BY_RCU
  2024-11-12 19:46 ` [PATCH v2 4/5] mm: make vma cache SLAB_TYPESAFE_BY_RCU Suren Baghdasaryan
@ 2024-11-13  2:57   ` Suren Baghdasaryan
  2024-11-13  5:08     ` Hugh Dickins
  2024-11-13  8:58   ` Vlastimil Babka
  1 sibling, 1 reply; 39+ messages in thread
From: Suren Baghdasaryan @ 2024-11-13  2:57 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
	hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, minchan, jannh,
	shakeel.butt, souravpanda, pasha.tatashin, linux-mm,
	linux-kernel, kernel-team

On Tue, Nov 12, 2024 at 11:46 AM 'Suren Baghdasaryan' via kernel-team
<kernel-team@android.com> wrote:
>
> To enable SLAB_TYPESAFE_BY_RCU for vma cache we need to ensure that
> object reuse before RCU grace period is over will be detected inside
> lock_vma_under_rcu().
> lock_vma_under_rcu() enters RCU read section, finds the vma at the
> given address, locks the vma and checks if it got detached or remapped
> to cover a different address range. These last checks are there
> to ensure that the vma was not modified after we found it but before
> locking it. Vma reuse introduces a possibility that in between those
> events of finding and locking the vma, it can get detached, reused,
> added into a tree and be marked as attached. Current checks will help
> detecting cases when:
> - vma was reused but not yet added into the tree (detached check)
> - vma was reused at a different address range (address check)
> If vma is covering a new address range which still includes the address
> we were looking for, it's not a problem unless the reused vma was added
> into a different address space. Therefore checking that vma->vm_mm is
> still the same is the only missing check to detect vma reuse.

Thinking about this some more, I don't think this works. I'm relying
on vma_start_read() to stabilize the vma, however the lock I'm taking
is part of the vma which can be reused from under us. So, the lock I'm
taking might be reinitialized after I take the lock...
I need to figure out a way to stabilize the vma in some other manner
before taking this lock.

> Add this missing check into lock_vma_under_rcu() and change vma cache
> to include SLAB_TYPESAFE_BY_RCU. This will facilitate vm_area_struct
> reuse and will minimize the number of call_rcu() calls.
> Adding vm_freeptr into vm_area_struct avoids bloating that structure.
> lock_vma_under_rcu() checks of the detached flag guarantees that vma
> is valid and attached to a tree, therefore unioning vm_freeptr with
> vm_start/vm_end is not an issue even though lock_vma_under_rcu() is
> using them.
> As part of this change freeptr_t declaration is moved into mm_types.h
> to avoid circular dependencies between mm_types.h and slab.h.
>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
>  include/linux/mm_types.h | 10 +++++++---
>  include/linux/slab.h     |  6 ------
>  kernel/fork.c            | 29 +++++++++++++----------------
>  mm/memory.c              |  2 +-
>  4 files changed, 21 insertions(+), 26 deletions(-)
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 5c4bfdcfac72..37580cc7bec0 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -32,6 +32,12 @@
>  struct address_space;
>  struct mem_cgroup;
>
> +/*
> + * freeptr_t represents a SLUB freelist pointer, which might be encoded
> + * and not dereferenceable if CONFIG_SLAB_FREELIST_HARDENED is enabled.
> + */
> +typedef struct { unsigned long v; } freeptr_t;
> +
>  /*
>   * Each physical page in the system has a struct page associated with
>   * it to keep track of whatever it is we are using the page for at the
> @@ -673,9 +679,7 @@ struct vm_area_struct {
>                         unsigned long vm_start;
>                         unsigned long vm_end;
>                 };
> -#ifdef CONFIG_PER_VMA_LOCK
> -               struct rcu_head vm_rcu; /* Used for deferred freeing. */
> -#endif
> +               freeptr_t vm_freeptr; /* Pointer used by SLAB_TYPESAFE_BY_RCU */
>         };
>
>         /*
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index b35e2db7eb0e..cb45db2402ac 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -212,12 +212,6 @@ enum _slab_flag_bits {
>  #define SLAB_NO_OBJ_EXT                __SLAB_FLAG_UNUSED
>  #endif
>
> -/*
> - * freeptr_t represents a SLUB freelist pointer, which might be encoded
> - * and not dereferenceable if CONFIG_SLAB_FREELIST_HARDENED is enabled.
> - */
> -typedef struct { unsigned long v; } freeptr_t;
> -
>  /*
>   * ZERO_SIZE_PTR will be returned for zero sized kmalloc requests.
>   *
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 7823797e31d2..946c3f9a9342 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -478,25 +478,15 @@ void __vm_area_free(struct vm_area_struct *vma)
>         kmem_cache_free(vm_area_cachep, vma);
>  }
>
> -#ifdef CONFIG_PER_VMA_LOCK
> -static void vm_area_free_rcu_cb(struct rcu_head *head)
> +void vm_area_free(struct vm_area_struct *vma)
>  {
> -       struct vm_area_struct *vma = container_of(head, struct vm_area_struct,
> -                                                 vm_rcu);
> -
> +#ifdef CONFIG_PER_VMA_LOCK
> +       /* The vma should be detached while being destroyed. */
> +       VM_BUG_ON_VMA(!is_vma_detached(vma), vma);
>         /* The vma should not be locked while being destroyed. */
>         VM_BUG_ON_VMA(rwsem_is_locked(&vma->vm_lock.lock), vma);
> -       __vm_area_free(vma);
> -}
>  #endif
> -
> -void vm_area_free(struct vm_area_struct *vma)
> -{
> -#ifdef CONFIG_PER_VMA_LOCK
> -       call_rcu(&vma->vm_rcu, vm_area_free_rcu_cb);
> -#else
>         __vm_area_free(vma);
> -#endif
>  }
>
>  static void account_kernel_stack(struct task_struct *tsk, int account)
> @@ -3115,6 +3105,11 @@ void __init mm_cache_init(void)
>
>  void __init proc_caches_init(void)
>  {
> +       struct kmem_cache_args args = {
> +               .use_freeptr_offset = true,
> +               .freeptr_offset = offsetof(struct vm_area_struct, vm_freeptr),
> +       };
> +
>         sighand_cachep = kmem_cache_create("sighand_cache",
>                         sizeof(struct sighand_struct), 0,
>                         SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_TYPESAFE_BY_RCU|
> @@ -3131,9 +3126,11 @@ void __init proc_caches_init(void)
>                         sizeof(struct fs_struct), 0,
>                         SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT,
>                         NULL);
> -       vm_area_cachep = KMEM_CACHE(vm_area_struct,
> -                       SLAB_HWCACHE_ALIGN|SLAB_NO_MERGE|SLAB_PANIC|
> +       vm_area_cachep = kmem_cache_create("vm_area_struct",
> +                       sizeof(struct vm_area_struct), &args,
> +                       SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_TYPESAFE_BY_RCU|
>                         SLAB_ACCOUNT);
> +
>         mmap_init();
>         nsproxy_cache_init();
>  }
> diff --git a/mm/memory.c b/mm/memory.c
> index d0197a0c0996..9c414c81f14a 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -6279,7 +6279,7 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
>                 goto inval;
>
>         /* Check if the VMA got isolated after we found it */
> -       if (is_vma_detached(vma)) {
> +       if (is_vma_detached(vma) || vma->vm_mm != mm) {
>                 vma_end_read(vma);
>                 count_vm_vma_lock_event(VMA_LOCK_MISS);
>                 /* The area was replaced with another one */
> --
> 2.47.0.277.g8800431eea-goog
>
> To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com.
>


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 4/5] mm: make vma cache SLAB_TYPESAFE_BY_RCU
  2024-11-13  2:57   ` Suren Baghdasaryan
@ 2024-11-13  5:08     ` Hugh Dickins
  2024-11-13  6:03       ` Suren Baghdasaryan
  0 siblings, 1 reply; 39+ messages in thread
From: Hugh Dickins @ 2024-11-13  5:08 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
	hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, minchan, jannh,
	shakeel.butt, souravpanda, pasha.tatashin, linux-mm,
	linux-kernel, kernel-team

On Tue, 12 Nov 2024, Suren Baghdasaryan wrote:
> 
> Thinking about this some more, I don't think this works. I'm relying
> on vma_start_read() to stabilize the vma, however the lock I'm taking
> is part of the vma which can be reused from under us. So, the lock I'm
> taking might be reinitialized after I take the lock...
> I need to figure out a way to stabilize the vma in some other manner
> before taking this lock.

(I'm not paying attention and following the patches, I just happened
to notice this remark: forgive me if I'm out of context and have
misunderstood, but hope this might help:)

But this is exactly the problem SLAB_TYPESAFE_BY_RCU was invented for.
You just have to be careful that the locks are initialized only when the
slab is first created (allocated from buddy), not reinitialized whenever
a new object is allocated from that slab.
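
For illustration, a minimal sketch of that pattern, with a hypothetical
vma_ctor() and purely illustrative cache flags:

    static void vma_ctor(void *addr)
    {
            struct vm_area_struct *vma = addr;

            /* called when the slab page is set up, not on every reuse */
            init_rwsem(&vma->vm_lock.lock);
    }

    vm_area_cachep = kmem_cache_create("vm_area_struct",
                    sizeof(struct vm_area_struct), 0,
                    SLAB_TYPESAFE_BY_RCU|SLAB_ACCOUNT, vma_ctor);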

Hugh


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 4/5] mm: make vma cache SLAB_TYPESAFE_BY_RCU
  2024-11-13  5:08     ` Hugh Dickins
@ 2024-11-13  6:03       ` Suren Baghdasaryan
  2024-11-13  6:52         ` Hugh Dickins
  0 siblings, 1 reply; 39+ messages in thread
From: Suren Baghdasaryan @ 2024-11-13  6:03 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: akpm, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
	hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, minchan, jannh,
	shakeel.butt, souravpanda, pasha.tatashin, linux-mm,
	linux-kernel, kernel-team

On Tue, Nov 12, 2024 at 9:08 PM Hugh Dickins <hughd@google.com> wrote:
>
> On Tue, 12 Nov 2024, Suren Baghdasaryan wrote:
> >
> > Thinking about this some more, I don't think this works. I'm relying
> > on vma_start_read() to stabilize the vma, however the lock I'm taking
> > is part of the vma which can be reused from under us. So, the lock I'm
> > taking might be reinitialized after I take the lock...
> > I need to figure out a way to stabilize the vma in some other manner
> > before taking this lock.
>
> (I'm not paying attention and following the patches, I just happened
> to notice this remark: forgive me if I'm out of context and have
> misunderstood, but hope this might help:)
>
> But this is exactly the problem SLAB_TYPESAFE_BY_RCU was invented for.
> You just have to be careful that the locks are initialized only when the
> slab is first created (allocated from buddy), not reinitialized whenever
> a new object is allocated from that slab.

Hi Hugh!
I'm looking into SLAB_TYPESAFE_BY_RCU implementation and trying to
figure out if initializing the lock in the ctor() of the cache as
mentioned in the comment here:
https://elixir.bootlin.com/linux/v6.12-rc7/source/include/linux/slab.h#L127
would help my case. I assume that's what you are hinting here?

>
> Hugh


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 4/5] mm: make vma cache SLAB_TYPESAFE_BY_RCU
  2024-11-13  6:03       ` Suren Baghdasaryan
@ 2024-11-13  6:52         ` Hugh Dickins
  2024-11-13  8:19           ` Suren Baghdasaryan
  0 siblings, 1 reply; 39+ messages in thread
From: Hugh Dickins @ 2024-11-13  6:52 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Hugh Dickins, akpm, willy, liam.howlett, lorenzo.stoakes, mhocko,
	vbabka, hannes, mjguzik, oliver.sang, mgorman, david, peterx,
	oleg, dave, paulmck, brauner, dhowells, hdanton, minchan, jannh,
	shakeel.butt, souravpanda, pasha.tatashin, linux-mm,
	linux-kernel, kernel-team

On Tue, 12 Nov 2024, Suren Baghdasaryan wrote:
> On Tue, Nov 12, 2024 at 9:08 PM Hugh Dickins <hughd@google.com> wrote:
> > On Tue, 12 Nov 2024, Suren Baghdasaryan wrote:
> > >
> > > Thinking about this some more, I don't think this works. I'm relying
> > > on vma_start_read() to stabilize the vma, however the lock I'm taking
> > > is part of the vma which can be reused from under us. So, the lock I'm
> > > taking might be reinitialized after I take the lock...
> > > I need to figure out a way to stabilize the vma in some other manner
> > > before taking this lock.
> >
> > (I'm not paying attention and following the patches, I just happened
> > to notice this remark: forgive me if I'm out of context and have
> > misunderstood, but hope this might help:)
> >
> > But this is exactly the problem SLAB_TYPESAFE_BY_RCU was invented for.
> > You just have to be careful that the locks are initialized only when the
> > slab is first created (allocated from buddy), not reinitialized whenever
> > a new object is allocated from that slab.
> 
> Hi Hugh!
> I'm looking into SLAB_TYPESAFE_BY_RCU implementation and trying to
> figure out if initializing the lock in the ctor() of the cache as
> mentioned in the comment here:
> https://elixir.bootlin.com/linux/v6.12-rc7/source/include/linux/slab.h#L127
> would help my case. I assume that's what you are hinting here?

Yes, if I'm "hinting", it's because offhand I forget the right names:
"ctor", yes, that sounds right.

Just grep around for examples of how it is used: there must be plenty
now, but anon_vma is what it was first used for.

But given the title of this patch, I'm surprised it's new to you.

Hugh

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 4/5] mm: make vma cache SLAB_TYPESAFE_BY_RCU
  2024-11-13  6:52         ` Hugh Dickins
@ 2024-11-13  8:19           ` Suren Baghdasaryan
  0 siblings, 0 replies; 39+ messages in thread
From: Suren Baghdasaryan @ 2024-11-13  8:19 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: akpm, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
	hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, minchan, jannh,
	shakeel.butt, souravpanda, pasha.tatashin, linux-mm,
	linux-kernel, kernel-team

On Tue, Nov 12, 2024 at 10:52 PM Hugh Dickins <hughd@google.com> wrote:
>
> On Tue, 12 Nov 2024, Suren Baghdasaryan wrote:
> > On Tue, Nov 12, 2024 at 9:08 PM Hugh Dickins <hughd@google.com> wrote:
> > > On Tue, 12 Nov 2024, Suren Baghdasaryan wrote:
> > > >
> > > > Thinking about this some more, I don't think this works. I'm relying
> > > > on vma_start_read() to stabilize the vma, however the lock I'm taking
> > > > is part of the vma which can be reused from under us. So, the lock I'm
> > > > taking might be reinitialized after I take the lock...
> > > > I need to figure out a way to stabilize the vma in some other manner
> > > > before taking this lock.
> > >
> > > (I'm not paying attention and following the patches, I just happened
> > > to notice this remark: forgive me if I'm out of context and have
> > > misunderstood, but hope this might help:)
> > >
> > > But this is exactly the problem SLAB_TYPESAFE_BY_RCU was invented for.
> > > You just have to be careful that the locks are initialized only when the
> > > slab is first created (allocated from buddy), not reinitialized whenever
> > > a new object is allocated from that slab.
> >
> > Hi Hugh!
> > I'm looking into SLAB_TYPESAFE_BY_RCU implementation and trying to
> > figure out if initializing the lock in the ctor() of the cache as
> > mentioned in the comment here:
> > https://elixir.bootlin.com/linux/v6.12-rc7/source/include/linux/slab.h#L127
> > would help my case. I assume that's what you are hinting here?
>
> Yes, if I'm "hinting", it's because offhand I forget the right names:
> "ctor", yes, that sounds right.

Just wanted to make sure I understood you correctly. Thanks for confirmation.

>
> Just grep around for examples of how it is used: there must be plenty
> now. but anon_vma is what it was first used for.

Yeah, there are plenty of examples now.

>
> But given the title of this patch, I'm surprised it's new to you.

Thinking about issues arising from possible object reuse is indeed new
to me, that's why I missed the lock reinitialization issue. I think I
know how to fix that now.
Thanks,
Suren.

>
> Hugh


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 4/5] mm: make vma cache SLAB_TYPESAFE_BY_RCU
  2024-11-12 19:46 ` [PATCH v2 4/5] mm: make vma cache SLAB_TYPESAFE_BY_RCU Suren Baghdasaryan
  2024-11-13  2:57   ` Suren Baghdasaryan
@ 2024-11-13  8:58   ` Vlastimil Babka
  2024-11-13 12:38     ` Liam R. Howlett
  1 sibling, 1 reply; 39+ messages in thread
From: Vlastimil Babka @ 2024-11-13  8:58 UTC (permalink / raw)
  To: Suren Baghdasaryan, akpm
  Cc: willy, liam.howlett, lorenzo.stoakes, mhocko, hannes, mjguzik,
	oliver.sang, mgorman, david, peterx, oleg, dave, paulmck,
	brauner, dhowells, hdanton, hughd, minchan, jannh, shakeel.butt,
	souravpanda, pasha.tatashin, linux-mm, linux-kernel, kernel-team

On 11/12/24 20:46, Suren Baghdasaryan wrote:
> To enable SLAB_TYPESAFE_BY_RCU for vma cache we need to ensure that
> object reuse before RCU grace period is over will be detected inside
> lock_vma_under_rcu().
> lock_vma_under_rcu() enters RCU read section, finds the vma at the
> given address, locks the vma and checks if it got detached or remapped
> to cover a different address range. These last checks are there
> to ensure that the vma was not modified after we found it but before
> locking it. Vma reuse introduces a possibility that in between those
> events of finding and locking the vma, it can get detached, reused,
> added into a tree and be marked as attached. Current checks will help
> detecting cases when:
> - vma was reused but not yet added into the tree (detached check)
> - vma was reused at a different address range (address check)
> If vma is covering a new address range which still includes the address
> we were looking for, it's not a problem unless the reused vma was added
> into a different address space. Therefore checking that vma->vm_mm is
> still the same is the only missing check to detect vma reuse.

Hi, I was wondering if we actually need the detached flag. Couldn't
"detached" simply mean vma->vm_mm == NULL and we save 4 bytes? Do we ever
need a vma that's detached but still has a mm pointer? I'd hope the places
that set detached to false have the mm pointer around so it's not inconvenient.




^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 4/5] mm: make vma cache SLAB_TYPESAFE_BY_RCU
  2024-11-13  8:58   ` Vlastimil Babka
@ 2024-11-13 12:38     ` Liam R. Howlett
  2024-11-13 13:57       ` Matthew Wilcox
  0 siblings, 1 reply; 39+ messages in thread
From: Liam R. Howlett @ 2024-11-13 12:38 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Suren Baghdasaryan, akpm, willy, lorenzo.stoakes, mhocko, hannes,
	mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, minchan, jannh,
	shakeel.butt, souravpanda, pasha.tatashin, linux-mm,
	linux-kernel, kernel-team

* Vlastimil Babka <vbabka@suse.cz> [241113 03:58]:
> On 11/12/24 20:46, Suren Baghdasaryan wrote:
> > To enable SLAB_TYPESAFE_BY_RCU for vma cache we need to ensure that
> > object reuse before RCU grace period is over will be detected inside
> > lock_vma_under_rcu().
> > lock_vma_under_rcu() enters RCU read section, finds the vma at the
> > given address, locks the vma and checks if it got detached or remapped
> > to cover a different address range. These last checks are there
> > to ensure that the vma was not modified after we found it but before
> > locking it. Vma reuse introduces a possibility that in between those
> > events of finding and locking the vma, it can get detached, reused,
> > added into a tree and be marked as attached. Current checks will help
> > detecting cases when:
> > - vma was reused but not yet added into the tree (detached check)
> > - vma was reused at a different address range (address check)
> > If vma is covering a new address range which still includes the address
> > we were looking for, it's not a problem unless the reused vma was added
> > into a different address space. Therefore checking that vma->vm_mm is
> > still the same is the only missing check to detect vma reuse.
> 
> Hi, I was wondering if we actually need the detached flag. Couldn't
> "detached" simply mean vma->vm_mm == NULL and we save 4 bytes? Do we ever
> need a vma that's detached but still has a mm pointer? I'd hope the places
> that set detached to false have the mm pointer around so it's not inconvenient.

I think the gate vmas ruin this plan.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 4/5] mm: make vma cache SLAB_TYPESAFE_BY_RCU
  2024-11-13 12:38     ` Liam R. Howlett
@ 2024-11-13 13:57       ` Matthew Wilcox
  2024-11-13 15:22         ` Liam R. Howlett
  0 siblings, 1 reply; 39+ messages in thread
From: Matthew Wilcox @ 2024-11-13 13:57 UTC (permalink / raw)
  To: Liam R. Howlett, Vlastimil Babka, Suren Baghdasaryan, akpm,
	lorenzo.stoakes, mhocko, hannes, mjguzik, oliver.sang, mgorman,
	david, peterx, oleg, dave, paulmck, brauner, dhowells, hdanton,
	hughd, minchan, jannh, shakeel.butt, souravpanda, pasha.tatashin,
	linux-mm, linux-kernel, kernel-team

On Wed, Nov 13, 2024 at 07:38:02AM -0500, Liam R. Howlett wrote:
> > Hi, I was wondering if we actually need the detached flag. Couldn't
> > "detached" simply mean vma->vm_mm == NULL and we save 4 bytes? Do we ever
> > need a vma that's detached but still has a mm pointer? I'd hope the places
> > that set detached to false have the mm pointer around so it's not inconvenient.
> 
> I think the gate vmas ruin this plan.

But the gate VMAs aren't to be found in the VMA tree.  Used to be that
was because the VMA tree was the intrusive RB tree and so VMAs could
only be in one tree at a time.  We could change that now!

Anyway, we could use (void *)1 instead of NULL to indicate a "detached"
VMA if we need to distinguish between a detached VMA and a gate VMA.
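
A minimal sketch of what that sentinel scheme could look like - the
VMA_DETACHED_MM name and the helpers below are hypothetical, just to
illustrate the idea, not something in this series:

#define VMA_DETACHED_MM		((struct mm_struct *)1)

static inline void __vma_set_detached_mm(struct vm_area_struct *vma)
{
	vma->vm_mm = VMA_DETACHED_MM;
}

static inline bool __vma_detached_mm(struct vm_area_struct *vma)
{
	/* NULL stays reserved for gate VMAs, (void *)1 means "detached" */
	return vma->vm_mm == VMA_DETACHED_MM;
}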


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 1/5] mm: introduce vma_start_read_locked{_nested} helpers
  2024-11-12 19:46 ` [PATCH v2 1/5] mm: introduce vma_start_read_locked{_nested} helpers Suren Baghdasaryan
@ 2024-11-13 14:10   ` Lorenzo Stoakes
  2024-11-13 15:30     ` Suren Baghdasaryan
  0 siblings, 1 reply; 39+ messages in thread
From: Lorenzo Stoakes @ 2024-11-13 14:10 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, willy, liam.howlett, mhocko, vbabka, hannes, mjguzik,
	oliver.sang, mgorman, david, peterx, oleg, dave, paulmck,
	brauner, dhowells, hdanton, hughd, minchan, jannh, shakeel.butt,
	souravpanda, pasha.tatashin, linux-mm, linux-kernel, kernel-team

On Tue, Nov 12, 2024 at 11:46:31AM -0800, Suren Baghdasaryan wrote:
> Introduce helper functions which can be used to read-lock a VMA when
> holding mmap_lock for read. Replace direct accesses to vma->vm_lock
> with these new helpers.
>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
>  include/linux/mm.h | 20 ++++++++++++++++++++
>  mm/userfaultfd.c   | 14 ++++++--------
>  2 files changed, 26 insertions(+), 8 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index fecd47239fa9..01ce619f3d17 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -722,6 +722,26 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
>  	return true;
>  }
>
> +/*
> + * Use only while holding mmap_read_lock which guarantees that nobody can lock
> + * the vma for write (vma_start_write()) from under us.
> + */
> +static inline void vma_start_read_locked_nested(struct vm_area_struct *vma, int subclass)
> +{
> +	mmap_assert_locked(vma->vm_mm);
> +	down_read_nested(&vma->vm_lock->lock, subclass);
> +}
> +
> +/*
> + * Use only while holding mmap_read_lock which guarantees that nobody can lock
> + * the vma for write (vma_start_write()) from under us.
> + */
> +static inline void vma_start_read_locked(struct vm_area_struct *vma)
> +{
> +	mmap_assert_locked(vma->vm_mm);
> +	down_read(&vma->vm_lock->lock);
> +}

Hm, but why would we want to take a VMA read lock under the mmap read lock
again? The mmap read lock will exclude VMA writers, so it seems sort of
redundant to do this - is it because callers expect to then release the mmap
read lock afterwards or something?

It feels like a quite specialist case that maybe is worth adding an
additional comment to because to me it seems entirely redundant - the whole
point of the VMA locks is to be able to avoid having to take an mmap lock
on read.

Something like 'while the intent of VMA read locks is to avoid having to
take mmap locks at all, there exist certain circumstances in which we would
need to hold the mmap read to begin with, and these helpers provide that
functionality'.

Also, I guess naming is hard, but _locked here strikes me as confusing as
I'd read this as if 'locked' refers to the VMA lock rather than the mmap
lock.

It also speaks to the need for some additional comment so nobody walks away
thinking they _must_ take a VMA read lock _as well as_ an mmap read lock
before reading from a VMA.

Again, naming, hard :>)

> +
>  static inline void vma_end_read(struct vm_area_struct *vma)
>  {
>  	rcu_read_lock(); /* keeps vma alive till the end of up_read */
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index 60a0be33766f..55019c11b5a8 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -86,13 +86,11 @@ static struct vm_area_struct *uffd_lock_vma(struct mm_struct *mm,
>  	vma = find_vma_and_prepare_anon(mm, address);
>  	if (!IS_ERR(vma)) {
>  		/*
> +		 * While holding mmap_lock we can't fail
>  		 * We cannot use vma_start_read() as it may fail due to
> -		 * false locked (see comment in vma_start_read()). We
> -		 * can avoid that by directly locking vm_lock under
> -		 * mmap_lock, which guarantees that nobody can lock the
> -		 * vma for write (vma_start_write()) under us.
> +		 * false locked (see comment in vma_start_read()).

Nit but 'as it may fail due to false locked' reads strangely.

>  		 */
> -		down_read(&vma->vm_lock->lock);
> +		vma_start_read_locked(vma);

Do we even need this whole gross 'this is an exception to the rule' comment now
that we have the new helpers?

Isn't it somewhat self-documenting now and uffd is no longer a special
snowflake?


>  	}
>
>  	mmap_read_unlock(mm);
> @@ -1480,10 +1478,10 @@ static int uffd_move_lock(struct mm_struct *mm,
>  		 * See comment in uffd_lock_vma() as to why not using
>  		 * vma_start_read() here.
>  		 */
> -		down_read(&(*dst_vmap)->vm_lock->lock);
> +		vma_start_read_locked(*dst_vmap);
>  		if (*dst_vmap != *src_vmap)
> -			down_read_nested(&(*src_vmap)->vm_lock->lock,
> -					 SINGLE_DEPTH_NESTING);
> +			vma_start_read_locked_nested(*src_vmap,
> +						SINGLE_DEPTH_NESTING);
>  	}
>  	mmap_read_unlock(mm);
>  	return err;
> --
> 2.47.0.277.g8800431eea-goog
>


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 2/5] mm: move per-vma lock into vm_area_struct
  2024-11-12 19:46 ` [PATCH v2 2/5] mm: move per-vma lock into vm_area_struct Suren Baghdasaryan
@ 2024-11-13 14:28   ` Lorenzo Stoakes
  2024-11-13 14:45     ` Vlastimil Babka
                       ` (3 more replies)
  0 siblings, 4 replies; 39+ messages in thread
From: Lorenzo Stoakes @ 2024-11-13 14:28 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, willy, liam.howlett, mhocko, vbabka, hannes, mjguzik,
	oliver.sang, mgorman, david, peterx, oleg, dave, paulmck,
	brauner, dhowells, hdanton, hughd, minchan, jannh, shakeel.butt,
	souravpanda, pasha.tatashin, linux-mm, linux-kernel, kernel-team

On Tue, Nov 12, 2024 at 11:46:32AM -0800, Suren Baghdasaryan wrote:
> Back when per-vma locks were introduced, vm_lock was moved out of
> vm_area_struct in [1] because of the performance regression caused by
> false cacheline sharing. Recent investigation [2] revealed that the
> regression is limited to a rather old Broadwell microarchitecture and
> even there it can be mitigated by disabling adjacent cacheline
> prefetching, see [3].

I don't see a motivating reason as to why we want to do this? We increase
memory usage here which is not good, but later lock optimisation mitigates
it, but why wouldn't we just do the lock optimisations and use less memory
overall?

> This patchset moves vm_lock back into vm_area_struct, aligning it at the
> cacheline boundary and changing the cache to be cache-aligned as well.
> This causes VMA memory consumption to grow from 160 (vm_area_struct) + 40
> (vm_lock) bytes to 256 bytes:
>
>     slabinfo before:
>      <name>           ... <objsize> <objperslab> <pagesperslab> : ...
>      vma_lock         ...     40  102    1 : ...
>      vm_area_struct   ...    160   51    2 : ...

Pedantry, but it might be worth mentioning how much this can vary by config.

For instance, on my machine:

vm_area_struct    125238 138820    184   44

>
>     slabinfo after moving vm_lock:
>      <name>           ... <objsize> <objperslab> <pagesperslab> : ...
>      vm_area_struct   ...    256   32    2 : ...
>
> Aggregate VMA memory consumption per 1000 VMAs grows from 50 to 64 pages,
> which is 5.5MB per 100000 VMAs. This memory consumption growth can be
> addressed later by optimizing the vm_lock.

Yes grabbing this back is of critical importance I'd say! :)

Functionally it looks ok to me but would like to see a stronger
justification in the commit msg! :)

>
> [1] https://lore.kernel.org/all/20230227173632.3292573-34-surenb@google.com/
> [2] https://lore.kernel.org/all/ZsQyI%2F087V34JoIt@xsang-OptiPlex-9020/
> [3] https://lore.kernel.org/all/CAJuCfpEisU8Lfe96AYJDZ+OM4NoPmnw9bP53cT_kbfP_pR+-2g@mail.gmail.com/
>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
>  include/linux/mm.h       | 28 +++++++++++++----------
>  include/linux/mm_types.h |  6 +++--
>  kernel/fork.c            | 49 ++++------------------------------------
>  3 files changed, 25 insertions(+), 58 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 01ce619f3d17..a5eb0be3e351 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -684,6 +684,12 @@ static inline void vma_numab_state_free(struct vm_area_struct *vma) {}
>  #endif /* CONFIG_NUMA_BALANCING */
>
>  #ifdef CONFIG_PER_VMA_LOCK
> +static inline void vma_lock_init(struct vm_area_struct *vma)
> +{
> +	init_rwsem(&vma->vm_lock.lock);
> +	vma->vm_lock_seq = UINT_MAX;
> +}
> +
>  /*
>   * Try to read-lock a vma. The function is allowed to occasionally yield false
>   * locked result to avoid performance overhead, in which case we fall back to
> @@ -701,7 +707,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
>  	if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq.sequence))
>  		return false;
>
> -	if (unlikely(down_read_trylock(&vma->vm_lock->lock) == 0))
> +	if (unlikely(down_read_trylock(&vma->vm_lock.lock) == 0))
>  		return false;
>
>  	/*
> @@ -716,7 +722,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
>  	 * This pairs with RELEASE semantics in vma_end_write_all().
>  	 */
>  	if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&vma->vm_mm->mm_lock_seq))) {
> -		up_read(&vma->vm_lock->lock);
> +		up_read(&vma->vm_lock.lock);
>  		return false;
>  	}
>  	return true;
> @@ -729,7 +735,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
>  static inline void vma_start_read_locked_nested(struct vm_area_struct *vma, int subclass)
>  {
>  	mmap_assert_locked(vma->vm_mm);
> -	down_read_nested(&vma->vm_lock->lock, subclass);
> +	down_read_nested(&vma->vm_lock.lock, subclass);
>  }
>
>  /*
> @@ -739,13 +745,13 @@ static inline void vma_start_read_locked_nested(struct vm_area_struct *vma, int
>  static inline void vma_start_read_locked(struct vm_area_struct *vma)
>  {
>  	mmap_assert_locked(vma->vm_mm);
> -	down_read(&vma->vm_lock->lock);
> +	down_read(&vma->vm_lock.lock);
>  }
>
>  static inline void vma_end_read(struct vm_area_struct *vma)
>  {
>  	rcu_read_lock(); /* keeps vma alive till the end of up_read */
> -	up_read(&vma->vm_lock->lock);
> +	up_read(&vma->vm_lock.lock);
>  	rcu_read_unlock();
>  }
>
> @@ -774,7 +780,7 @@ static inline void vma_start_write(struct vm_area_struct *vma)
>  	if (__is_vma_write_locked(vma, &mm_lock_seq))
>  		return;
>
> -	down_write(&vma->vm_lock->lock);
> +	down_write(&vma->vm_lock.lock);
>  	/*
>  	 * We should use WRITE_ONCE() here because we can have concurrent reads
>  	 * from the early lockless pessimistic check in vma_start_read().
> @@ -782,7 +788,7 @@ static inline void vma_start_write(struct vm_area_struct *vma)
>  	 * we should use WRITE_ONCE() for cleanliness and to keep KCSAN happy.
>  	 */
>  	WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
> -	up_write(&vma->vm_lock->lock);
> +	up_write(&vma->vm_lock.lock);
>  }
>
>  static inline void vma_assert_write_locked(struct vm_area_struct *vma)
> @@ -794,7 +800,7 @@ static inline void vma_assert_write_locked(struct vm_area_struct *vma)
>
>  static inline void vma_assert_locked(struct vm_area_struct *vma)
>  {
> -	if (!rwsem_is_locked(&vma->vm_lock->lock))
> +	if (!rwsem_is_locked(&vma->vm_lock.lock))
>  		vma_assert_write_locked(vma);
>  }
>
> @@ -827,6 +833,7 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
>
>  #else /* CONFIG_PER_VMA_LOCK */
>
> +static inline void vma_lock_init(struct vm_area_struct *vma) {}
>  static inline bool vma_start_read(struct vm_area_struct *vma)
>  		{ return false; }
>  static inline void vma_end_read(struct vm_area_struct *vma) {}
> @@ -861,10 +868,6 @@ static inline void assert_fault_locked(struct vm_fault *vmf)
>
>  extern const struct vm_operations_struct vma_dummy_vm_ops;
>
> -/*
> - * WARNING: vma_init does not initialize vma->vm_lock.
> - * Use vm_area_alloc()/vm_area_free() if vma needs locking.
> - */
>  static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
>  {
>  	memset(vma, 0, sizeof(*vma));
> @@ -873,6 +876,7 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
>  	INIT_LIST_HEAD(&vma->anon_vma_chain);
>  	vma_mark_detached(vma, false);
>  	vma_numab_state_init(vma);
> +	vma_lock_init(vma);
>  }
>
>  /* Use when VMA is not part of the VMA tree and needs no locking */
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 80fef38d9d64..5c4bfdcfac72 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -716,8 +716,6 @@ struct vm_area_struct {
>  	 * slowpath.
>  	 */
>  	unsigned int vm_lock_seq;
> -	/* Unstable RCU readers are allowed to read this. */
> -	struct vma_lock *vm_lock;
>  #endif
>
>  	/*
> @@ -770,6 +768,10 @@ struct vm_area_struct {
>  	struct vma_numab_state *numab_state;	/* NUMA Balancing state */
>  #endif
>  	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
> +#ifdef CONFIG_PER_VMA_LOCK
> +	/* Unstable RCU readers are allowed to read this. */
> +	struct vma_lock vm_lock ____cacheline_aligned_in_smp;
> +#endif
>  } __randomize_layout;
>
>  #ifdef CONFIG_NUMA
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 0061cf2450ef..7823797e31d2 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -436,35 +436,6 @@ static struct kmem_cache *vm_area_cachep;
>  /* SLAB cache for mm_struct structures (tsk->mm) */
>  static struct kmem_cache *mm_cachep;
>
> -#ifdef CONFIG_PER_VMA_LOCK
> -
> -/* SLAB cache for vm_area_struct.lock */
> -static struct kmem_cache *vma_lock_cachep;
> -
> -static bool vma_lock_alloc(struct vm_area_struct *vma)
> -{
> -	vma->vm_lock = kmem_cache_alloc(vma_lock_cachep, GFP_KERNEL);
> -	if (!vma->vm_lock)
> -		return false;
> -
> -	init_rwsem(&vma->vm_lock->lock);
> -	vma->vm_lock_seq = UINT_MAX;
> -
> -	return true;
> -}
> -
> -static inline void vma_lock_free(struct vm_area_struct *vma)
> -{
> -	kmem_cache_free(vma_lock_cachep, vma->vm_lock);
> -}
> -
> -#else /* CONFIG_PER_VMA_LOCK */
> -
> -static inline bool vma_lock_alloc(struct vm_area_struct *vma) { return true; }
> -static inline void vma_lock_free(struct vm_area_struct *vma) {}
> -
> -#endif /* CONFIG_PER_VMA_LOCK */
> -
>  struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
>  {
>  	struct vm_area_struct *vma;
> @@ -474,10 +445,6 @@ struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
>  		return NULL;
>
>  	vma_init(vma, mm);
> -	if (!vma_lock_alloc(vma)) {
> -		kmem_cache_free(vm_area_cachep, vma);
> -		return NULL;
> -	}
>
>  	return vma;
>  }
> @@ -496,10 +463,7 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
>  	 * will be reinitialized.
>  	 */
>  	data_race(memcpy(new, orig, sizeof(*new)));
> -	if (!vma_lock_alloc(new)) {
> -		kmem_cache_free(vm_area_cachep, new);
> -		return NULL;
> -	}
> +	vma_lock_init(new);
>  	INIT_LIST_HEAD(&new->anon_vma_chain);
>  	vma_numab_state_init(new);
>  	dup_anon_vma_name(orig, new);
> @@ -511,7 +475,6 @@ void __vm_area_free(struct vm_area_struct *vma)
>  {
>  	vma_numab_state_free(vma);
>  	free_anon_vma_name(vma);
> -	vma_lock_free(vma);
>  	kmem_cache_free(vm_area_cachep, vma);
>  }
>
> @@ -522,7 +485,7 @@ static void vm_area_free_rcu_cb(struct rcu_head *head)
>  						  vm_rcu);
>
>  	/* The vma should not be locked while being destroyed. */
> -	VM_BUG_ON_VMA(rwsem_is_locked(&vma->vm_lock->lock), vma);
> +	VM_BUG_ON_VMA(rwsem_is_locked(&vma->vm_lock.lock), vma);
>  	__vm_area_free(vma);
>  }
>  #endif
> @@ -3168,11 +3131,9 @@ void __init proc_caches_init(void)
>  			sizeof(struct fs_struct), 0,
>  			SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT,
>  			NULL);
> -
> -	vm_area_cachep = KMEM_CACHE(vm_area_struct, SLAB_PANIC|SLAB_ACCOUNT);
> -#ifdef CONFIG_PER_VMA_LOCK
> -	vma_lock_cachep = KMEM_CACHE(vma_lock, SLAB_PANIC|SLAB_ACCOUNT);
> -#endif
> +	vm_area_cachep = KMEM_CACHE(vm_area_struct,
> +			SLAB_HWCACHE_ALIGN|SLAB_NO_MERGE|SLAB_PANIC|
> +			SLAB_ACCOUNT);

Why the SLAB_NO_MERGE?

>  	mmap_init();
>  	nsproxy_cache_init();
>  }
> --
> 2.47.0.277.g8800431eea-goog
>


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 3/5] mm: mark vma as detached until it's added into vma tree
  2024-11-12 19:46 ` [PATCH v2 3/5] mm: mark vma as detached until it's added into vma tree Suren Baghdasaryan
@ 2024-11-13 14:43   ` Lorenzo Stoakes
  2024-11-13 15:37     ` Suren Baghdasaryan
  0 siblings, 1 reply; 39+ messages in thread
From: Lorenzo Stoakes @ 2024-11-13 14:43 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, willy, liam.howlett, mhocko, vbabka, hannes, mjguzik,
	oliver.sang, mgorman, david, peterx, oleg, dave, paulmck,
	brauner, dhowells, hdanton, hughd, minchan, jannh, shakeel.butt,
	souravpanda, pasha.tatashin, linux-mm, linux-kernel, kernel-team

On Tue, Nov 12, 2024 at 11:46:33AM -0800, Suren Baghdasaryan wrote:
> Current implementation does not set detached flag when a VMA is first
> allocated. This does not represent the real state of the VMA, which is
> detached until it is added into mm's VMA tree. Fix this by marking new
> VMAs as detached and resetting detached flag only after VMA is added
> into a tree.
>

This seems very sensible.

> Signed-off-by: Suren Baghdasaryan <surenb@google.com>

Aside from nits/refactoring suggestions below:

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> ---
>  include/linux/mm.h               | 10 +++++++++-
>  mm/memory.c                      |  2 +-
>  mm/mmap.c                        |  2 ++
>  mm/nommu.c                       |  2 ++
>  mm/vma.c                         |  3 +++
>  tools/testing/vma/vma_internal.h |  3 ++-

Just want to say THANK YOU for taking the time to update the testing stubs :)
Much appreciated!

>  6 files changed, 19 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index a5eb0be3e351..245a85caf4c3 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -812,6 +812,11 @@ static inline void vma_mark_detached(struct vm_area_struct *vma, bool detached)
>  	vma->detached = detached;
>  }
>
> +static inline bool is_vma_detached(struct vm_area_struct *vma)
> +{
> +	return vma->detached;
> +}
> +
>  static inline void release_fault_lock(struct vm_fault *vmf)
>  {
>  	if (vmf->flags & FAULT_FLAG_VMA_LOCK)
> @@ -874,7 +879,10 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
>  	vma->vm_mm = mm;
>  	vma->vm_ops = &vma_dummy_vm_ops;
>  	INIT_LIST_HEAD(&vma->anon_vma_chain);
> -	vma_mark_detached(vma, false);

How did this work before? Oh I guess we initialised the VMA lock earlier right?

> +#ifdef CONFIG_PER_VMA_LOCK
> +	/* vma is not locked, can't use vma_mark_detached() */
> +	vma->detached = true;
> +#endif
>  	vma_numab_state_init(vma);
>  	vma_lock_init(vma);
>  }
> diff --git a/mm/memory.c b/mm/memory.c
> index 209885a4134f..d0197a0c0996 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -6279,7 +6279,7 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
>  		goto inval;
>
>  	/* Check if the VMA got isolated after we found it */
> -	if (vma->detached) {
> +	if (is_vma_detached(vma)) {
>  		vma_end_read(vma);
>  		count_vm_vma_lock_event(VMA_LOCK_MISS);
>  		/* The area was replaced with another one */
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 386429f7db5a..1295c4cedaf4 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1570,6 +1570,7 @@ static int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
>  	if (vma_iter_store_gfp(vmi, vma, GFP_KERNEL))
>  		goto mas_store_fail;
>
> +	vma_mark_detached(vma, false);
>  	mm->map_count++;
>  	validate_mm(mm);
>  	ksm_add_vma(vma);
> @@ -1890,6 +1891,7 @@ static struct vm_area_struct *__install_special_mapping(
>  	if (ret)
>  		goto out;
>
> +	vma_mark_detached(vma, false);

similar to the vma_iter_store() comment, maybe worth putting this in insert_vm_struct()?

>  	vm_stat_account(mm, vma->vm_flags, len >> PAGE_SHIFT);
>
>  	perf_event_mmap(vma);
> diff --git a/mm/nommu.c b/mm/nommu.c
> index 9cb6e99215e2..6afd5c2bd97d 100644
> --- a/mm/nommu.c
> +++ b/mm/nommu.c
> @@ -1192,6 +1192,7 @@ unsigned long do_mmap(struct file *file,
>  	current->mm->map_count++;
>  	/* add the VMA to the tree */
>  	vma_iter_store(&vmi, vma);
> +	vma_mark_detached(vma, false);

Since we seem to always do this with vma_iter_store(), do we want to put this
there? Or maybe make a wrapper around the two if that doesn't seem to separate
concerns well enough?

>
>  	/* we flush the region from the icache only when the first executable
>  	 * mapping of it is made  */
> @@ -1357,6 +1358,7 @@ static int split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
>  	setup_vma_to_mm(vma, mm);
>  	setup_vma_to_mm(new, mm);
>  	vma_iter_store(vmi, new);
> +	vma_mark_detached(new, false);
>  	mm->map_count++;
>  	return 0;
>
> diff --git a/mm/vma.c b/mm/vma.c
> index 8a454a7bbc80..1426871fa6e0 100644
> --- a/mm/vma.c
> +++ b/mm/vma.c
> @@ -275,6 +275,7 @@ static void vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi,
>  		 * (it may either follow vma or precede it).
>  		 */
>  		vma_iter_store(vmi, vp->insert);
> +		vma_mark_detached(vp->insert, false);
>  		mm->map_count++;
>  	}
>
> @@ -1690,6 +1691,7 @@ int vma_link(struct mm_struct *mm, struct vm_area_struct *vma)
>
>  	vma_start_write(vma);
>  	vma_iter_store(&vmi, vma);
> +	vma_mark_detached(vma, false);
>  	vma_link_file(vma);
>  	mm->map_count++;
>  	validate_mm(mm);
> @@ -2369,6 +2371,7 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap)
>  	/* Lock the VMA since it is modified after insertion into VMA tree */
>  	vma_start_write(vma);
>  	vma_iter_store(vmi, vma);
> +	vma_mark_detached(vma, false);
>  	map->mm->map_count++;
>  	vma_link_file(vma);
>
> diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
> index 1d9fc97b8e80..fdb60978821f 100644
> --- a/tools/testing/vma/vma_internal.h
> +++ b/tools/testing/vma/vma_internal.h
> @@ -438,7 +438,8 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
>  	vma->vm_mm = mm;
>  	vma->vm_ops = &vma_dummy_vm_ops;
>  	INIT_LIST_HEAD(&vma->anon_vma_chain);
> -	vma_mark_detached(vma, false);
> +	/* vma is not locked, can't use vma_mark_detached() */
> +	vma->detached = true;

You're the best :)

>  }
>
>  static inline struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
> --
> 2.47.0.277.g8800431eea-goog
>


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 2/5] mm: move per-vma lock into vm_area_struct
  2024-11-13 14:28   ` Lorenzo Stoakes
@ 2024-11-13 14:45     ` Vlastimil Babka
  2024-11-13 14:58       ` Lorenzo Stoakes
  2024-11-13 14:53     ` Mateusz Guzik
                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 39+ messages in thread
From: Vlastimil Babka @ 2024-11-13 14:45 UTC (permalink / raw)
  To: Lorenzo Stoakes, Suren Baghdasaryan
  Cc: akpm, willy, liam.howlett, mhocko, hannes, mjguzik, oliver.sang,
	mgorman, david, peterx, oleg, dave, paulmck, brauner, dhowells,
	hdanton, hughd, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, linux-mm, linux-kernel, kernel-team

On 11/13/24 15:28, Lorenzo Stoakes wrote:
> On Tue, Nov 12, 2024 at 11:46:32AM -0800, Suren Baghdasaryan wrote:
>> Back when per-vma locks were introduced, vm_lock was moved out of
>> vm_area_struct in [1] because of the performance regression caused by
>> false cacheline sharing. Recent investigation [2] revealed that the
>> regression is limited to a rather old Broadwell microarchitecture and
>> even there it can be mitigated by disabling adjacent cacheline
>> prefetching, see [3].
> 
> I don't see a motivating reason as to why we want to do this? We increase
> memory usage here which is not good, but later lock optimisation mitigates

I'd say we don't normally split logically single structures into multiple
structures just because they might pack better in multiple slabs vs single
slab. Because that means more complicated management, extra pointer
dereferences etc. And if that split away part is a lock, it even complicates
things further. So the only motivation for doing that split was that it
appeared to perform better, but that was found to be misleading.

But sure it can be described better, and include the new
SLAB_TYPESAFE_BY_RCU conversion part as the motivation - that would be
likely impossible to do with a split away lock.

> it, but why wouldn't we just do the lock optimisations and use less memory
> overall?

If the lock is made much smaller then the packing benefit by split might
disappear, as is the case here.

>> This patchset moves vm_lock back into vm_area_struct, aligning it at the
>> cacheline boundary and changing the cache to be cache-aligned as well.
>> This causes VMA memory consumption to grow from 160 (vm_area_struct) + 40
>> (vm_lock) bytes to 256 bytes:
>>
>>     slabinfo before:
>>      <name>           ... <objsize> <objperslab> <pagesperslab> : ...
>>      vma_lock         ...     40  102    1 : ...
>>      vm_area_struct   ...    160   51    2 : ...
> 
> Pedantry, but it might be worth mentioning how much this can vary by config.
> 
> For instance, on my machine:
> 
> vm_area_struct    125238 138820    184   44
> 
>>
>>     slabinfo after moving vm_lock:
>>      <name>           ... <objsize> <objperslab> <pagesperslab> : ...
>>      vm_area_struct   ...    256   32    2 : ...
>>
>> Aggregate VMA memory consumption per 1000 VMAs grows from 50 to 64 pages,
>> which is 5.5MB per 100000 VMAs. This memory consumption growth can be
>> addressed later by optimizing the vm_lock.
> 
> Yes grabbing this back is of critical importance I'd say! :)

I doubt it's that critical. We'll have to weigh that against introducing
another non-standard locking primitive.

> Functionally it looks ok to me but would like to see a stronger
> justification in the commit msg! :)
> 
>>
>> [1] https://lore.kernel.org/all/20230227173632.3292573-34-surenb@google.com/
>> [2] https://lore.kernel.org/all/ZsQyI%2F087V34JoIt@xsang-OptiPlex-9020/
>> [3] https://lore.kernel.org/all/CAJuCfpEisU8Lfe96AYJDZ+OM4NoPmnw9bP53cT_kbfP_pR+-2g@mail.gmail.com/
>>
>> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
>> ---
>>  include/linux/mm.h       | 28 +++++++++++++----------
>>  include/linux/mm_types.h |  6 +++--
>>  kernel/fork.c            | 49 ++++------------------------------------
>>  3 files changed, 25 insertions(+), 58 deletions(-)
>>


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 5/5] docs/mm: document latest changes to vm_lock
  2024-11-12 19:51   ` Suren Baghdasaryan
@ 2024-11-13 14:46     ` Lorenzo Stoakes
  0 siblings, 0 replies; 39+ messages in thread
From: Lorenzo Stoakes @ 2024-11-13 14:46 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, willy, liam.howlett, mhocko, vbabka, hannes, mjguzik,
	oliver.sang, mgorman, david, peterx, oleg, dave, paulmck,
	brauner, dhowells, hdanton, hughd, minchan, jannh, shakeel.butt,
	souravpanda, pasha.tatashin, linux-mm, linux-kernel, kernel-team,
	Jonathan Corbet, open list:DOCUMENTATION

On Tue, Nov 12, 2024 at 11:51:47AM -0800, Suren Baghdasaryan wrote:
> On Tue, Nov 12, 2024 at 11:47 AM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > Change the documentation to reflect that vm_lock is integrated into vma.
> > Document newly introduced vma_start_read_locked{_nested} functions.
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>

Thanks!

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

>
> Sorry, forgot to CC Mr. Corbet and linux-doc@vger. Added now.
>
> > ---
> >  Documentation/mm/process_addrs.rst | 10 +++++++---
> >  1 file changed, 7 insertions(+), 3 deletions(-)
> >
> > diff --git a/Documentation/mm/process_addrs.rst b/Documentation/mm/process_addrs.rst
> > index ed74685ffbf2..c8935509173e 100644
> > --- a/Documentation/mm/process_addrs.rst
> > +++ b/Documentation/mm/process_addrs.rst
> > @@ -675,7 +675,11 @@ RCU critical section, then attempts to VMA lock it via
> >  :c:func:`!vma_start_read`, before releasing the RCU lock via
> >  :c:func:`!rcu_read_unlock`.
> >
> > -VMA read locks hold the read lock on the :c:member:`!vma->vm_lock` semaphore for
> > +In cases when the user already holds mmap read lock, :c:func:`!vma_start_read_locked`
> > +and :c:func:`!vma_start_read_locked_nested` can be used. These functions always
> > +succeed in acquiring VMA read lock.
> > +
> > +VMA read locks hold the read lock on the :c:member:`!vma.vm_lock` semaphore for
> >  their duration and the caller of :c:func:`!lock_vma_under_rcu` must release it
> >  via :c:func:`!vma_end_read`.
> >
> > @@ -739,7 +743,7 @@ keep VMAs locked across entirely separate write operations. It also maintains
> >  correct lock ordering.
> >
> >  Each time a VMA read lock is acquired, we acquire a read lock on the
> > -:c:member:`!vma->vm_lock` read/write semaphore and hold it, while checking that
> > +:c:member:`!vma.vm_lock` read/write semaphore and hold it, while checking that
> >  the sequence count of the VMA does not match that of the mm.
> >
> >  If it does, the read lock fails. If it does not, we hold the lock, excluding
> > @@ -749,7 +753,7 @@ Importantly, maple tree operations performed in :c:func:`!lock_vma_under_rcu`
> >  are also RCU safe, so the whole read lock operation is guaranteed to function
> >  correctly.
> >
> > -On the write side, we acquire a write lock on the :c:member:`!vma->vm_lock`
> > +On the write side, we acquire a write lock on the :c:member:`!vma.vm_lock`
> >  read/write semaphore, before setting the VMA's sequence number under this lock,
> >  also simultaneously holding the mmap write lock.
> >
> > --
> > 2.47.0.277.g8800431eea-goog
> >


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 2/5] mm: move per-vma lock into vm_area_struct
  2024-11-13 14:28   ` Lorenzo Stoakes
  2024-11-13 14:45     ` Vlastimil Babka
@ 2024-11-13 14:53     ` Mateusz Guzik
  2024-11-13 14:59       ` Lorenzo Stoakes
  2024-11-13 15:01     ` Lorenzo Stoakes
  2024-11-13 15:42     ` Suren Baghdasaryan
  3 siblings, 1 reply; 39+ messages in thread
From: Mateusz Guzik @ 2024-11-13 14:53 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Suren Baghdasaryan, akpm, willy, liam.howlett, mhocko, vbabka,
	hannes, oliver.sang, mgorman, david, peterx, oleg, dave, paulmck,
	brauner, dhowells, hdanton, hughd, minchan, jannh, shakeel.butt,
	souravpanda, pasha.tatashin, linux-mm, linux-kernel, kernel-team

On Wed, Nov 13, 2024 at 02:28:16PM +0000, Lorenzo Stoakes wrote:
> On Tue, Nov 12, 2024 at 11:46:32AM -0800, Suren Baghdasaryan wrote:
> > Back when per-vma locks were introduced, vm_lock was moved out of
> > vm_area_struct in [1] because of the performance regression caused by
> > false cacheline sharing. Recent investigation [2] revealed that the
> > regression is limited to a rather old Broadwell microarchitecture and
> > even there it can be mitigated by disabling adjacent cacheline
> > prefetching, see [3].
> 
> I don't see a motivating reason as to why we want to do this? We increase
> memory usage here which is not good, but later lock optimisation mitigates
> it, but why wouldn't we just do the lock optimisations and use less memory
> overall?
> 

Where would you put the lock in that case though?

With the patchset it sticks with the affected vma, so no false-sharing
woes concerning other instances of the same struct.

If you make them separately allocated and packed, they false-share
between different vmas using them (in fact this is currently happening).
If you make sure to pad them, that's 64 bytes per obj, majority of which
is empty space.
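
For illustration only, the padded option dismissed above would look something
like this (a sketch, not code from the series); with the rw_semaphore at ~40
bytes per the slabinfo quoted earlier, roughly 24 of every 64 bytes would be
wasted padding:

struct vma_lock {
	struct rw_semaphore lock;	/* ~40 bytes per the slabinfo quoted in the thread */
} ____cacheline_aligned_in_smp;		/* pads each object out to a full 64-byte cacheline */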


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 2/5] mm: move per-vma lock into vm_area_struct
  2024-11-13 14:45     ` Vlastimil Babka
@ 2024-11-13 14:58       ` Lorenzo Stoakes
  2024-11-13 15:09         ` Vlastimil Babka
  0 siblings, 1 reply; 39+ messages in thread
From: Lorenzo Stoakes @ 2024-11-13 14:58 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Suren Baghdasaryan, akpm, willy, liam.howlett, mhocko, hannes,
	mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, minchan, jannh,
	shakeel.butt, souravpanda, pasha.tatashin, linux-mm,
	linux-kernel, kernel-team

On Wed, Nov 13, 2024 at 03:45:24PM +0100, Vlastimil Babka wrote:
> On 11/13/24 15:28, Lorenzo Stoakes wrote:
> > On Tue, Nov 12, 2024 at 11:46:32AM -0800, Suren Baghdasaryan wrote:
> >> Back when per-vma locks were introduced, vm_lock was moved out of
> >> vm_area_struct in [1] because of the performance regression caused by
> >> false cacheline sharing. Recent investigation [2] revealed that the
> >> regression is limited to a rather old Broadwell microarchitecture and
> >> even there it can be mitigated by disabling adjacent cacheline
> >> prefetching, see [3].
> >
> > I don't see a motivating reason as to why we want to do this? We increase
> > memory usage here which is not good, but later lock optimisation mitigates
>
> I'd say we don't normally split logically single structures into multiple
> structures just because they might pack better in multiple slabs vs single
> slab. Because that means more complicated management, extra pointer
> dereferences etc. And if that split away part is a lock, it even complicates
> things further. So the only motivation for doing that split was that it
> appeared to perform better, but that was found to be misleading.
>
> But sure it can be described better, and include the new
> SLAB_TYPESAFE_BY_RCU conversion part as the motivation - that would be
> likely impossible to do with a split away lock.

Right, my point is that there is no justification given here at all, and we
should give one. I understand the provenance of why we split the lock, but
there has to be a motivating reason if everything is working fine right
now.

The SLAB_TYPESAFE_BY_RCU one seems to be the key one, but also something
along the lines of complicated management, concern about ordering of
allocating/freeing things, etc.

Just needs some extra explanation here.

>
> > it, but why wouldn't we just do the lock optimisations and use less memory
> > overall?
>
> If the lock is made much smaller then the packing benefit by split might
> disappear, as is the case here.
>

Yeah, but the lock would be smaller so we'd save space still right?


> >> This patchset moves vm_lock back into vm_area_struct, aligning it at the
> >> cacheline boundary and changing the cache to be cache-aligned as well.
> >> This causes VMA memory consumption to grow from 160 (vm_area_struct) + 40
> >> (vm_lock) bytes to 256 bytes:
> >>
> >>     slabinfo before:
> >>      <name>           ... <objsize> <objperslab> <pagesperslab> : ...
> >>      vma_lock         ...     40  102    1 : ...
> >>      vm_area_struct   ...    160   51    2 : ...
> >
> > Pedantry, but it might be worth mentioning how much this can vary by config.
> >
> > For instance, on my machine:
> >
> > vm_area_struct    125238 138820    184   44
> >
> >>
> >>     slabinfo after moving vm_lock:
> >>      <name>           ... <objsize> <objperslab> <pagesperslab> : ...
> >>      vm_area_struct   ...    256   32    2 : ...
> >>
> >> Aggregate VMA memory consumption per 1000 VMAs grows from 50 to 64 pages,
> >> which is 5.5MB per 100000 VMAs. This memory consumption growth can be
> >> addressed later by optimizing the vm_lock.
> >
> > Yes grabbing this back is of critical importance I'd say! :)
>
> I doubt it's that critical. We'll have to weigh that against introducing
> another non-standard locking primitive.

Avoiding unnecessary VMA overhead is important, and a strong part of the
motivation for the guard pages series for instance, so I don't think we
should be unconcerned about unnecessary extra memory usage.

I'm guessing from what you say you're not in favour of the subsequent
'non-standard' locking changes...

>
> > Functionally it looks ok to me but would like to see a stronger
> > justification in the commit msg! :)
> >
> >>
> >> [1] https://lore.kernel.org/all/20230227173632.3292573-34-surenb@google.com/
> >> [2] https://lore.kernel.org/all/ZsQyI%2F087V34JoIt@xsang-OptiPlex-9020/
> >> [3] https://lore.kernel.org/all/CAJuCfpEisU8Lfe96AYJDZ+OM4NoPmnw9bP53cT_kbfP_pR+-2g@mail.gmail.com/
> >>
> >> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> >> ---
> >>  include/linux/mm.h       | 28 +++++++++++++----------
> >>  include/linux/mm_types.h |  6 +++--
> >>  kernel/fork.c            | 49 ++++------------------------------------
> >>  3 files changed, 25 insertions(+), 58 deletions(-)
> >>


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 2/5] mm: move per-vma lock into vm_area_struct
  2024-11-13 14:53     ` Mateusz Guzik
@ 2024-11-13 14:59       ` Lorenzo Stoakes
  0 siblings, 0 replies; 39+ messages in thread
From: Lorenzo Stoakes @ 2024-11-13 14:59 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: Suren Baghdasaryan, akpm, willy, liam.howlett, mhocko, vbabka,
	hannes, oliver.sang, mgorman, david, peterx, oleg, dave, paulmck,
	brauner, dhowells, hdanton, hughd, minchan, jannh, shakeel.butt,
	souravpanda, pasha.tatashin, linux-mm, linux-kernel, kernel-team

On Wed, Nov 13, 2024 at 03:53:54PM +0100, Mateusz Guzik wrote:
> On Wed, Nov 13, 2024 at 02:28:16PM +0000, Lorenzo Stoakes wrote:
> > On Tue, Nov 12, 2024 at 11:46:32AM -0800, Suren Baghdasaryan wrote:
> > > Back when per-vma locks were introduced, vm_lock was moved out of
> > > vm_area_struct in [1] because of the performance regression caused by
> > > false cacheline sharing. Recent investigation [2] revealed that the
> > > regression is limited to a rather old Broadwell microarchitecture and
> > > even there it can be mitigated by disabling adjacent cacheline
> > > prefetching, see [3].
> >
> > I don't see a motivating reason as to why we want to do this? We increase
> > memory usage here which is not good, but later lock optimisation mitigates
> > it, but why wouldn't we just do the lock optimisations and use less memory
> > overall?
> >
>
> Where would you put the lock in that case though?
>
> With the patchset it sticks with the affected vma, so no false-sharing
> woes concerning other instances of the same struct.
>
> If you make them separately allocated and packed, they false-share
> between different vmas using them (in fact this is currently happening).
> If you make sure to pad them, that's 64 bytes per obj, majority of which
> is empty space.

'I don't see a motivating reason' = I don't see it in the commit message.

I'm saying put motivating reasons, like the above, in the commit message.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 2/5] mm: move per-vma lock into vm_area_struct
  2024-11-13 14:28   ` Lorenzo Stoakes
  2024-11-13 14:45     ` Vlastimil Babka
  2024-11-13 14:53     ` Mateusz Guzik
@ 2024-11-13 15:01     ` Lorenzo Stoakes
  2024-11-13 15:45       ` Suren Baghdasaryan
  2024-11-13 15:42     ` Suren Baghdasaryan
  3 siblings, 1 reply; 39+ messages in thread
From: Lorenzo Stoakes @ 2024-11-13 15:01 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, willy, liam.howlett, mhocko, vbabka, hannes, mjguzik,
	oliver.sang, mgorman, david, peterx, oleg, dave, paulmck,
	brauner, dhowells, hdanton, hughd, minchan, jannh, shakeel.butt,
	souravpanda, pasha.tatashin, linux-mm, linux-kernel, kernel-team

On Wed, Nov 13, 2024 at 02:28:16PM +0000, Lorenzo Stoakes wrote:
> On Tue, Nov 12, 2024 at 11:46:32AM -0800, Suren Baghdasaryan wrote:
> > Back when per-vma locks were introduced, vm_lock was moved out of
> > vm_area_struct in [1] because of the performance regression caused by
> > false cacheline sharing. Recent investigation [2] revealed that the
> > regression is limited to a rather old Broadwell microarchitecture and
> > even there it can be mitigated by disabling adjacent cacheline
> > prefetching, see [3].
>
> I don't see a motivating reason as to why we want to do this? We increase
> memory usage here which is not good, but later lock optimisation mitigates
> it, but why wouldn't we just do the lock optimisations and use less memory
> overall?

I worded this badly. To clarify:

I don't see a motivating reason _in the commit message_ as to why we want
to do this.

I am certain there are, in fact Mateusz and Vlastimil have provided them.

So my review is - let's just put these there :)


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 2/5] mm: move per-vma lock into vm_area_struct
  2024-11-13 14:58       ` Lorenzo Stoakes
@ 2024-11-13 15:09         ` Vlastimil Babka
  0 siblings, 0 replies; 39+ messages in thread
From: Vlastimil Babka @ 2024-11-13 15:09 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Suren Baghdasaryan, akpm, willy, liam.howlett, mhocko, hannes,
	mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, minchan, jannh,
	shakeel.butt, souravpanda, pasha.tatashin, linux-mm,
	linux-kernel, kernel-team

On 11/13/24 15:58, Lorenzo Stoakes wrote:
> On Wed, Nov 13, 2024 at 03:45:24PM +0100, Vlastimil Babka wrote:
>> On 11/13/24 15:28, Lorenzo Stoakes wrote:
>> > On Tue, Nov 12, 2024 at 11:46:32AM -0800, Suren Baghdasaryan wrote:
>> >> Back when per-vma locks were introduced, vm_lock was moved out of
>> >> vm_area_struct in [1] because of the performance regression caused by
>> >> false cacheline sharing. Recent investigation [2] revealed that the
>> >> regression is limited to a rather old Broadwell microarchitecture and
>> >> even there it can be mitigated by disabling adjacent cacheline
>> >> prefetching, see [3].
>> >
>> > I don't see a motivating reason as to why we want to do this? We increase
>> > memory usage here which is not good, but later lock optimisation mitigates
>>
>> I'd say we don't normally split logically single structures into multiple
>> structures just because they might pack better in multiple slabs vs single
>> slab. Because that means more complicated management, extra pointer
>> dereferences etc. And if that split away part is a lock, it even complicates
>> things further. So the only motivation for doing that split was that it
>> appeared to perform better, but that was found to be misleading.
>>
>> But sure it can be described better, and include the new
>> SLAB_TYPESAFE_BY_RCU conversion part as the motivation - that would be
>> likely impossible to do with a split away lock.
> 
> Right, my point is that there is no justification given here at all, and we
> should give one. I understand the provenance of why we split the lock, but
> there has to be a motivating reason if everything is working fine right
> now.
> 
> The SLAB_TYPESAFE_BY_RCU one seems to be the key one, but also something
> along the lines of complicated management, concern about ordering of
> allocating/freeing things, etc.
> 
> Just needs some extra explanation here.

Right.

>>
>> > it, but why wouldn't we just do the lock optimisations and use less memory
>> > overall?
>>
>> If the lock is made much smaller then the packing benefit by split might
>> disappear, as is the case here.
>>
> 
> Yeah, but the lock would be smaller so we'd save space still right?

If it becomes small enough that the vma including the lock fits in 192 bytes,
then we're no longer saving space by keeping it in a separate cache. Yes, it
would make the motivation even more obvious to shrink it first and then say
"look, we can even save more space by not having it separate anymore". But
there's the downside of nonstandard locking, so I think the order of changes
that first unsplits the lock and only then considers the pros and cons of
shrinking makes more sense.
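
Rough numbers to back that up, using the slabinfo figures quoted in the
commit message and assuming 2-page slabs throughout (a back-of-the-envelope
estimate, not measured):

	1000 VMAs at 160 bytes + split 40-byte locks:    ~40 + ~10 = ~50 pages
	1000 VMAs at 256 bytes (lock embedded, aligned): ~64 pages
	1000 VMAs at 192 bytes (hypothetical shrunken lock embedded):
		8192 / 192 = ~42 objects per slab -> ~48 pages
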
>> >> This patchset moves vm_lock back into vm_area_struct, aligning it at the
>> >> cacheline boundary and changing the cache to be cache-aligned as well.
>> >> This causes VMA memory consumption to grow from 160 (vm_area_struct) + 40
>> >> (vm_lock) bytes to 256 bytes:
>> >>
>> >>     slabinfo before:
>> >>      <name>           ... <objsize> <objperslab> <pagesperslab> : ...
>> >>      vma_lock         ...     40  102    1 : ...
>> >>      vm_area_struct   ...    160   51    2 : ...
>> >
>> > Pedantry, but it might be worth mentioning how much this can vary by config.
>> >
>> > For instance, on my machine:
>> >
>> > vm_area_struct    125238 138820    184   44
>> >
>> >>
>> >>     slabinfo after moving vm_lock:
>> >>      <name>           ... <objsize> <objperslab> <pagesperslab> : ...
>> >>      vm_area_struct   ...    256   32    2 : ...
>> >>
>> >> Aggregate VMA memory consumption per 1000 VMAs grows from 50 to 64 pages,
>> >> which is 5.5MB per 100000 VMAs. This memory consumption growth can be
>> >> addressed later by optimizing the vm_lock.
>> >
>> > Yes grabbing this back is of critical importance I'd say! :)
>>
>> I doubt it's that critical. We'll have to weigh that against introducing
>> another non-standard locking primitive.
> 
> Avoiding unnecessary VMA overhead is important, and a strong part of the
> motivation for the guard pages series for instance, so I don't think we
> should be unconcerned about unnecessary extra memory usage.

Yeah guard pages eliminate whole VMAs from existence so they are great :)

> I'm guessing from what you say you're not in favour of the subsequent
> 'non-standard' locking changes...

True.

>>
>> > Functionally it looks ok to me but would like to see a stronger
>> > justification in the commit msg! :)
>> >
>> >>
>> >> [1] https://lore.kernel.org/all/20230227173632.3292573-34-surenb@google.com/
>> >> [2] https://lore.kernel.org/all/ZsQyI%2F087V34JoIt@xsang-OptiPlex-9020/
>> >> [3] https://lore.kernel.org/all/CAJuCfpEisU8Lfe96AYJDZ+OM4NoPmnw9bP53cT_kbfP_pR+-2g@mail.gmail.com/
>> >>
>> >> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
>> >> ---
>> >>  include/linux/mm.h       | 28 +++++++++++++----------
>> >>  include/linux/mm_types.h |  6 +++--
>> >>  kernel/fork.c            | 49 ++++------------------------------------
>> >>  3 files changed, 25 insertions(+), 58 deletions(-)
>> >>



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 4/5] mm: make vma cache SLAB_TYPESAFE_BY_RCU
  2024-11-13 13:57       ` Matthew Wilcox
@ 2024-11-13 15:22         ` Liam R. Howlett
  2024-11-13 15:25           ` Suren Baghdasaryan
  2024-11-13 16:44           ` Jann Horn
  0 siblings, 2 replies; 39+ messages in thread
From: Liam R. Howlett @ 2024-11-13 15:22 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Vlastimil Babka, Suren Baghdasaryan, akpm, lorenzo.stoakes,
	mhocko, hannes, mjguzik, oliver.sang, mgorman, david, peterx,
	oleg, dave, paulmck, brauner, dhowells, hdanton, hughd, minchan,
	jannh, shakeel.butt, souravpanda, pasha.tatashin, linux-mm,
	linux-kernel, kernel-team

* Matthew Wilcox <willy@infradead.org> [241113 08:57]:
> On Wed, Nov 13, 2024 at 07:38:02AM -0500, Liam R. Howlett wrote:
> > > Hi, I was wondering if we actually need the detached flag. Couldn't
> > > "detached" simply mean vma->vm_mm == NULL and we save 4 bytes? Do we ever
> > > need a vma that's detached but still has a mm pointer? I'd hope the places
> > > that set detached to false have the mm pointer around so it's not inconvenient.
> > 
> > I think the gate vmas ruin this plan.
> 
> But the gate VMAs aren't to be found in the VMA tree.  Used to be that
> was because the VMA tree was the intrusive RB tree and so VMAs could
> only be in one tree at a time.  We could change that now!

\o/

> 
> Anyway, we could use (void *)1 instead of NULL to indicate a "detached"
> VMA if we need to distinguish between a detached VMA and a gate VMA.

I was thinking a pointer to itself vma->vm_mm = vma, then a check for
this, instead of null like we do today.

Either way, we should make it a function so it's easier to reuse for
whatever we need in the future, wdyt?



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 4/5] mm: make vma cache SLAB_TYPESAFE_BY_RCU
  2024-11-13 15:22         ` Liam R. Howlett
@ 2024-11-13 15:25           ` Suren Baghdasaryan
  2024-11-13 15:29             ` Liam R. Howlett
  2024-11-13 16:44           ` Jann Horn
  1 sibling, 1 reply; 39+ messages in thread
From: Suren Baghdasaryan @ 2024-11-13 15:25 UTC (permalink / raw)
  To: Liam R. Howlett, Matthew Wilcox, Vlastimil Babka,
	Suren Baghdasaryan, akpm, lorenzo.stoakes, mhocko, hannes,
	mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, minchan, jannh,
	shakeel.butt, souravpanda, pasha.tatashin, linux-mm,
	linux-kernel, kernel-team

On Wed, Nov 13, 2024 at 7:23 AM 'Liam R. Howlett' via kernel-team
<kernel-team@android.com> wrote:
>
> * Matthew Wilcox <willy@infradead.org> [241113 08:57]:
> > On Wed, Nov 13, 2024 at 07:38:02AM -0500, Liam R. Howlett wrote:
> > > > Hi, I was wondering if we actually need the detached flag. Couldn't
> > > > "detached" simply mean vma->vm_mm == NULL and we save 4 bytes? Do we ever
> > > > need a vma that's detached but still has a mm pointer? I'd hope the places
> > > > that set detached to false have the mm pointer around so it's not inconvenient.
> > >
> > > I think the gate vmas ruin this plan.
> >
> > But the gate VMAs aren't to be found in the VMA tree.  Used to be that
> > was because the VMA tree was the intrusive RB tree and so VMAs could
> > only be in one tree at a time.  We could change that now!
>
> \o/
>
> >
> > Anyway, we could use (void *)1 instead of NULL to indicate a "detached"
> > VMA if we need to distinguish between a detached VMA and a gate VMA.
>
> I was thinking a pointer to itself vma->vm_mm = vma, then a check for
> this, instead of null like we do today.

The motivation for having a separate detached flag was that vma->vm_mm
is used when read/write locking the vma, so it has to stay valid even
when the vma gets detached. Maybe we can be more cautious in
vma_start_read()/vma_start_write() about it but I don't recall if
those were the only places where that was an issue.

>
> Either way, we should make it a function so it's easier to reuse for
> whatever we need in the future, wdyt?
>
> To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com.
>


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 4/5] mm: make vma cache SLAB_TYPESAFE_BY_RCU
  2024-11-13 15:25           ` Suren Baghdasaryan
@ 2024-11-13 15:29             ` Liam R. Howlett
  2024-11-13 15:47               ` Suren Baghdasaryan
  0 siblings, 1 reply; 39+ messages in thread
From: Liam R. Howlett @ 2024-11-13 15:29 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Matthew Wilcox, Vlastimil Babka, akpm, lorenzo.stoakes, mhocko,
	hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, minchan, jannh,
	shakeel.butt, souravpanda, pasha.tatashin, linux-mm,
	linux-kernel, kernel-team

* Suren Baghdasaryan <surenb@google.com> [241113 10:25]:
> On Wed, Nov 13, 2024 at 7:23 AM 'Liam R. Howlett' via kernel-team
> <kernel-team@android.com> wrote:
> >
> > * Matthew Wilcox <willy@infradead.org> [241113 08:57]:
> > > On Wed, Nov 13, 2024 at 07:38:02AM -0500, Liam R. Howlett wrote:
> > > > > Hi, I was wondering if we actually need the detached flag. Couldn't
> > > > > "detached" simply mean vma->vm_mm == NULL and we save 4 bytes? Do we ever
> > > > > need a vma that's detached but still has a mm pointer? I'd hope the places
> > > > > that set detached to false have the mm pointer around so it's not inconvenient.
> > > >
> > > > I think the gate vmas ruin this plan.
> > >
> > > But the gate VMAs aren't to be found in the VMA tree.  Used to be that
> > > was because the VMA tree was the intrusive RB tree and so VMAs could
> > > only be in one tree at a time.  We could change that now!
> >
> > \o/
> >
> > >
> > > Anyway, we could use (void *)1 instead of NULL to indicate a "detached"
> > > VMA if we need to distinguish between a detached VMA and a gate VMA.
> >
> > I was thinking a pointer to itself vma->vm_mm = vma, then a check for
> > this, instead of null like we do today.
> 
> The motivation for having a separate detached flag was that vma->vm_mm
> is used when read/write locking the vma, so it has to stay valid even
> when the vma gets detached. Maybe we can be more cautious in
> vma_start_read()/vma_start_write() about it but I don't recall if
> those were the only places where that was an issue.

We have the mm from the callers though, so it could be passed in?
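
i.e. something along the lines of the following - purely hypothetical, not
part of this series, just to show what "passed in" would mean here:

/* Hypothetical shape of the suggestion: callers that already know the mm
 * pass it in, so locking no longer dereferences vma->vm_mm and a detached
 * vma could get by without a valid mm pointer.
 */
static inline void vma_start_write(struct mm_struct *mm, struct vm_area_struct *vma)
{
	unsigned int mm_lock_seq;

	mmap_assert_write_locked(mm);
	mm_lock_seq = mm->mm_lock_seq.sequence;
	if (vma->vm_lock_seq == mm_lock_seq)
		return;

	down_write(&vma->vm_lock.lock);
	WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
	up_write(&vma->vm_lock.lock);
}
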

> 
> >
> > Either way, we should make it a function so it's easier to reuse for
> > whatever we need in the future, wdyt?
> >
> > To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com.
> >


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 1/5] mm: introduce vma_start_read_locked{_nested} helpers
  2024-11-13 14:10   ` Lorenzo Stoakes
@ 2024-11-13 15:30     ` Suren Baghdasaryan
  0 siblings, 0 replies; 39+ messages in thread
From: Suren Baghdasaryan @ 2024-11-13 15:30 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, willy, liam.howlett, mhocko, vbabka, hannes, mjguzik,
	oliver.sang, mgorman, david, peterx, oleg, dave, paulmck,
	brauner, dhowells, hdanton, hughd, minchan, jannh, shakeel.butt,
	souravpanda, pasha.tatashin, linux-mm, linux-kernel, kernel-team

On Wed, Nov 13, 2024 at 6:10 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Tue, Nov 12, 2024 at 11:46:31AM -0800, Suren Baghdasaryan wrote:
> > Introduce helper functions which can be used to read-lock a VMA when
> > holding mmap_lock for read. Replace direct accesses to vma->vm_lock
> > with these new helpers.
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > ---
> >  include/linux/mm.h | 20 ++++++++++++++++++++
> >  mm/userfaultfd.c   | 14 ++++++--------
> >  2 files changed, 26 insertions(+), 8 deletions(-)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index fecd47239fa9..01ce619f3d17 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -722,6 +722,26 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> >       return true;
> >  }
> >
> > +/*
> > + * Use only while holding mmap_read_lock which guarantees that nobody can lock
> > + * the vma for write (vma_start_write()) from under us.
> > + */
> > +static inline void vma_start_read_locked_nested(struct vm_area_struct *vma, int subclass)
> > +{
> > +     mmap_assert_locked(vma->vm_mm);
> > +     down_read_nested(&vma->vm_lock->lock, subclass);
> > +}
> > +
> > +/*
> > + * Use only while holding mmap_read_lock which guarantees that nobody can lock
> > + * the vma for write (vma_start_write()) from under us.
> > + */
> > +static inline void vma_start_read_locked(struct vm_area_struct *vma)
> > +{
> > +     mmap_assert_locked(vma->vm_mm);
> > +     down_read(&vma->vm_lock->lock);
> > +}
>
> Hm, but why would we want to take a VMA read lock under the mmap read lock
> again? The mmap read lock will exclude VMA writers, so it seems sort of
> redundant to do this - is it because callers expect to then release the mmap
> read lock afterwards or something?

Yes, that's the pattern used there. They kinda downgrade from mmap to
vma lock to make it more local.
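
For reference, a minimal sketch of that pattern (modeled on the
uffd_lock_vma() hunk in this patch; error handling omitted):

	mmap_read_lock(mm);
	vma = find_vma(mm, address);
	if (vma)
		vma_start_read_locked(vma);	/* cannot fail while mmap_lock is held */
	mmap_read_unlock(mm);

	if (vma) {
		/* work on the VMA under its per-VMA read lock only */
		vma_end_read(vma);
	}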

>
> It feels like a quite specialist case that maybe is worth adding an
> additional comment to because to me it seems entirely redundant - the whole
> point of the VMA locks is to be able to avoid having to take an mmap lock
> on read.
>
> Something like 'while the intent of VMA read locks is to avoid having to
> take mmap locks at all, there exist certain circumstances in which we would
> need to hold the mmap read to begin with, and these helpers provide that
> functionality'.

Ok, I'll try documenting these functions better.

>
> Also, I guess naming is hard, but _locked here strikes me as confusing as
> I'd read this as if 'locked' refers to the VMA lock rather than the mmap
> lock.
>
> It also speaks to the need for some additional comment so nobody walks away
> thinking they _must_ take a VMA read lock _as well as_ an mmap read lock
> before reading from a VMA.
>
> Again, naming, hard :>)

I'm open to suggestions.

>
> > +
> >  static inline void vma_end_read(struct vm_area_struct *vma)
> >  {
> >       rcu_read_lock(); /* keeps vma alive till the end of up_read */
> > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > index 60a0be33766f..55019c11b5a8 100644
> > --- a/mm/userfaultfd.c
> > +++ b/mm/userfaultfd.c
> > @@ -86,13 +86,11 @@ static struct vm_area_struct *uffd_lock_vma(struct mm_struct *mm,
> >       vma = find_vma_and_prepare_anon(mm, address);
> >       if (!IS_ERR(vma)) {
> >               /*
> > +              * While holding mmap_lock we can't fail
> >                * We cannot use vma_start_read() as it may fail due to
> > -              * false locked (see comment in vma_start_read()). We
> > -              * can avoid that by directly locking vm_lock under
> > -              * mmap_lock, which guarantees that nobody can lock the
> > -              * vma for write (vma_start_write()) under us.
> > +              * false locked (see comment in vma_start_read()).
>
> Nit but 'as it may fail due to false locked' reads strangely.

replace with "overflow"?

>
> >                */
> > -             down_read(&vma->vm_lock->lock);
> > +             vma_start_read_locked(vma);
>
> Do we even need this whole gross 'this is an exception to the rule' comment now
> we have new helpers?
>
> Isn't it somewhat self-documenting now and uffd is no longer a special
> snowflake?

Ack.

>
>
> >       }
> >
> >       mmap_read_unlock(mm);
> > @@ -1480,10 +1478,10 @@ static int uffd_move_lock(struct mm_struct *mm,
> >                * See comment in uffd_lock_vma() as to why not using
> >                * vma_start_read() here.
> >                */
> > -             down_read(&(*dst_vmap)->vm_lock->lock);
> > +             vma_start_read_locked(*dst_vmap);
> >               if (*dst_vmap != *src_vmap)
> > -                     down_read_nested(&(*src_vmap)->vm_lock->lock,
> > -                                      SINGLE_DEPTH_NESTING);
> > +                     vma_start_read_locked_nested(*src_vmap,
> > +                                             SINGLE_DEPTH_NESTING);
> >       }
> >       mmap_read_unlock(mm);
> >       return err;
> > --
> > 2.47.0.277.g8800431eea-goog
> >


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 3/5] mm: mark vma as detached until it's added into vma tree
  2024-11-13 14:43   ` Lorenzo Stoakes
@ 2024-11-13 15:37     ` Suren Baghdasaryan
  0 siblings, 0 replies; 39+ messages in thread
From: Suren Baghdasaryan @ 2024-11-13 15:37 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, willy, liam.howlett, mhocko, vbabka, hannes, mjguzik,
	oliver.sang, mgorman, david, peterx, oleg, dave, paulmck,
	brauner, dhowells, hdanton, hughd, minchan, jannh, shakeel.butt,
	souravpanda, pasha.tatashin, linux-mm, linux-kernel, kernel-team

On Wed, Nov 13, 2024 at 6:43 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Tue, Nov 12, 2024 at 11:46:33AM -0800, Suren Baghdasaryan wrote:
> > Current implementation does not set detached flag when a VMA is first
> > allocated. This does not represent the real state of the VMA, which is
> > detached until it is added into mm's VMA tree. Fix this by marking new
> > VMAs as detached and resetting detached flag only after VMA is added
> > into a tree.
> >
>
> This seems very sensible.
>
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
>
> Aside from nits/refactoring suggestions below:
>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>
> > ---
> >  include/linux/mm.h               | 10 +++++++++-
> >  mm/memory.c                      |  2 +-
> >  mm/mmap.c                        |  2 ++
> >  mm/nommu.c                       |  2 ++
> >  mm/vma.c                         |  3 +++
> >  tools/testing/vma/vma_internal.h |  3 ++-
>
> Just want to say THANK YOU for taking the time to update the testing stubs :)
> Much appreciated!
>
> >  6 files changed, 19 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index a5eb0be3e351..245a85caf4c3 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -812,6 +812,11 @@ static inline void vma_mark_detached(struct vm_area_struct *vma, bool detached)
> >       vma->detached = detached;
> >  }
> >
> > +static inline bool is_vma_detached(struct vm_area_struct *vma)
> > +{
> > +     return vma->detached;
> > +}
> > +
> >  static inline void release_fault_lock(struct vm_fault *vmf)
> >  {
> >       if (vmf->flags & FAULT_FLAG_VMA_LOCK)
> > @@ -874,7 +879,10 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
> >       vma->vm_mm = mm;
> >       vma->vm_ops = &vma_dummy_vm_ops;
> >       INIT_LIST_HEAD(&vma->anon_vma_chain);
> > -     vma_mark_detached(vma, false);
>
> How did this work before? Oh I guess we initialised the VMA lock earlier right?

Yes.

>
> > +#ifdef CONFIG_PER_VMA_LOCK
> > +     /* vma is not locked, can't use vma_mark_detached() */
> > +     vma->detached = true;
> > +#endif
> >       vma_numab_state_init(vma);
> >       vma_lock_init(vma);
> >  }
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 209885a4134f..d0197a0c0996 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -6279,7 +6279,7 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> >               goto inval;
> >
> >       /* Check if the VMA got isolated after we found it */
> > -     if (vma->detached) {
> > +     if (is_vma_detached(vma)) {
> >               vma_end_read(vma);
> >               count_vm_vma_lock_event(VMA_LOCK_MISS);
> >               /* The area was replaced with another one */
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 386429f7db5a..1295c4cedaf4 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -1570,6 +1570,7 @@ static int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
> >       if (vma_iter_store_gfp(vmi, vma, GFP_KERNEL))
> >               goto mas_store_fail;
> >
> > +     vma_mark_detached(vma, false);
> >       mm->map_count++;
> >       validate_mm(mm);
> >       ksm_add_vma(vma);
> > @@ -1890,6 +1891,7 @@ static struct vm_area_struct *__install_special_mapping(
> >       if (ret)
> >               goto out;
> >
> > +     vma_mark_detached(vma, false);
>
> similar to vma_iter_store() comment, maybe worth putting in insert_vm_struct()?

Ah, this one I think is not needed because we already have
insert_vm_struct() -> vma_link() -> vma_mark_detached(vma, false)

>
> >       vm_stat_account(mm, vma->vm_flags, len >> PAGE_SHIFT);
> >
> >       perf_event_mmap(vma);
> > diff --git a/mm/nommu.c b/mm/nommu.c
> > index 9cb6e99215e2..6afd5c2bd97d 100644
> > --- a/mm/nommu.c
> > +++ b/mm/nommu.c
> > @@ -1192,6 +1192,7 @@ unsigned long do_mmap(struct file *file,
> >       current->mm->map_count++;
> >       /* add the VMA to the tree */
> >       vma_iter_store(&vmi, vma);
> > +     vma_mark_detached(vma, false);
>
> > Since we seem to always do this with vma_iter_store() do we want to put this
> there? Or maybe make a wrapper around the two if that seems not to separate
> concerns enough?

I think wrapper would be helpful. I'll try that and see if that looks better.
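Roughly like this, I imagine (the helper name is made up here):

        static inline void vma_iter_store_attached(struct vma_iterator *vmi,
                                                   struct vm_area_struct *vma)
        {
                vma_iter_store(vmi, vma);
                vma_mark_detached(vma, false);
        }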

>
> >
> >       /* we flush the region from the icache only when the first executable
> >        * mapping of it is made  */
> > @@ -1357,6 +1358,7 @@ static int split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
> >       setup_vma_to_mm(vma, mm);
> >       setup_vma_to_mm(new, mm);
> >       vma_iter_store(vmi, new);
> > +     vma_mark_detached(new, false);
> >       mm->map_count++;
> >       return 0;
> >
> > diff --git a/mm/vma.c b/mm/vma.c
> > index 8a454a7bbc80..1426871fa6e0 100644
> > --- a/mm/vma.c
> > +++ b/mm/vma.c
> > @@ -275,6 +275,7 @@ static void vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi,
> >                * (it may either follow vma or precede it).
> >                */
> >               vma_iter_store(vmi, vp->insert);
> > +             vma_mark_detached(vp->insert, false);
> >               mm->map_count++;
> >       }
> >
> > @@ -1690,6 +1691,7 @@ int vma_link(struct mm_struct *mm, struct vm_area_struct *vma)
> >
> >       vma_start_write(vma);
> >       vma_iter_store(&vmi, vma);
> > +     vma_mark_detached(vma, false);
> >       vma_link_file(vma);
> >       mm->map_count++;
> >       validate_mm(mm);
> > @@ -2369,6 +2371,7 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap)
> >       /* Lock the VMA since it is modified after insertion into VMA tree */
> >       vma_start_write(vma);
> >       vma_iter_store(vmi, vma);
> > +     vma_mark_detached(vma, false);
> >       map->mm->map_count++;
> >       vma_link_file(vma);
> >
> > diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
> > index 1d9fc97b8e80..fdb60978821f 100644
> > --- a/tools/testing/vma/vma_internal.h
> > +++ b/tools/testing/vma/vma_internal.h
> > @@ -438,7 +438,8 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
> >       vma->vm_mm = mm;
> >       vma->vm_ops = &vma_dummy_vm_ops;
> >       INIT_LIST_HEAD(&vma->anon_vma_chain);
> > -     vma_mark_detached(vma, false);
> > +     /* vma is not locked, can't use vma_mark_detached() */
> > +     vma->detached = true;
>
> You're the best :)

Thanks!

>
> >  }
> >
> >  static inline struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
> > --
> > 2.47.0.277.g8800431eea-goog
> >


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 2/5] mm: move per-vma lock into vm_area_struct
  2024-11-13 14:28   ` Lorenzo Stoakes
                       ` (2 preceding siblings ...)
  2024-11-13 15:01     ` Lorenzo Stoakes
@ 2024-11-13 15:42     ` Suren Baghdasaryan
  3 siblings, 0 replies; 39+ messages in thread
From: Suren Baghdasaryan @ 2024-11-13 15:42 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, willy, liam.howlett, mhocko, vbabka, hannes, mjguzik,
	oliver.sang, mgorman, david, peterx, oleg, dave, paulmck,
	brauner, dhowells, hdanton, hughd, minchan, jannh, shakeel.butt,
	souravpanda, pasha.tatashin, linux-mm, linux-kernel, kernel-team

On Wed, Nov 13, 2024 at 6:28 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Tue, Nov 12, 2024 at 11:46:32AM -0800, Suren Baghdasaryan wrote:
> > Back when per-vma locks were introduced, vm_lock was moved out of
> > vm_area_struct in [1] because of the performance regression caused by
> > false cacheline sharing. Recent investigation [2] revealed that the
> > regression is limited to a rather old Broadwell microarchitecture and
> > even there it can be mitigated by disabling adjacent cacheline
> > prefetching, see [3].
>
> I don't see a motivating reason as to why we want to do this? We increase
> memory usage here which is not good, but later lock optimisation mitigates
> it, but why wouldn't we just do the lock optimisations and use less memory
> overall?
>
> > This patchset moves vm_lock back into vm_area_struct, aligning it at the
> > cacheline boundary and changing the cache to be cache-aligned as well.
> > This causes VMA memory consumption to grow from 160 (vm_area_struct) + 40
> > (vm_lock) bytes to 256 bytes:
> >
> >     slabinfo before:
> >      <name>           ... <objsize> <objperslab> <pagesperslab> : ...
> >      vma_lock         ...     40  102    1 : ...
> >      vm_area_struct   ...    160   51    2 : ...
>
> Pedantry, but it might be worth mentioning how much this can vary by config.
>
> For instance, on my machine:
>
> vm_area_struct    125238 138820    184   44

Ack.

>
> >
> >     slabinfo after moving vm_lock:
> >      <name>           ... <objsize> <objperslab> <pagesperslab> : ...
> >      vm_area_struct   ...    256   32    2 : ...
> >
> > Aggregate VMA memory consumption per 1000 VMAs grows from 50 to 64 pages,
> > which is 5.5MB per 100000 VMAs. This memory consumption growth can be
> > addressed later by optimizing the vm_lock.
>
> Yes grabbing this back is of critical importance I'd say! :)
>
> Functionally it looks ok to me but would like to see a stronger
> justification in the commit msg! :)
>
> >
> > [1] https://lore.kernel.org/all/20230227173632.3292573-34-surenb@google.com/
> > [2] https://lore.kernel.org/all/ZsQyI%2F087V34JoIt@xsang-OptiPlex-9020/
> > [3] https://lore.kernel.org/all/CAJuCfpEisU8Lfe96AYJDZ+OM4NoPmnw9bP53cT_kbfP_pR+-2g@mail.gmail.com/
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > ---
> >  include/linux/mm.h       | 28 +++++++++++++----------
> >  include/linux/mm_types.h |  6 +++--
> >  kernel/fork.c            | 49 ++++------------------------------------
> >  3 files changed, 25 insertions(+), 58 deletions(-)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 01ce619f3d17..a5eb0be3e351 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -684,6 +684,12 @@ static inline void vma_numab_state_free(struct vm_area_struct *vma) {}
> >  #endif /* CONFIG_NUMA_BALANCING */
> >
> >  #ifdef CONFIG_PER_VMA_LOCK
> > +static inline void vma_lock_init(struct vm_area_struct *vma)
> > +{
> > +     init_rwsem(&vma->vm_lock.lock);
> > +     vma->vm_lock_seq = UINT_MAX;
> > +}
> > +
> >  /*
> >   * Try to read-lock a vma. The function is allowed to occasionally yield false
> >   * locked result to avoid performance overhead, in which case we fall back to
> > @@ -701,7 +707,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> >       if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq.sequence))
> >               return false;
> >
> > -     if (unlikely(down_read_trylock(&vma->vm_lock->lock) == 0))
> > +     if (unlikely(down_read_trylock(&vma->vm_lock.lock) == 0))
> >               return false;
> >
> >       /*
> > @@ -716,7 +722,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> >        * This pairs with RELEASE semantics in vma_end_write_all().
> >        */
> >       if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&vma->vm_mm->mm_lock_seq))) {
> > -             up_read(&vma->vm_lock->lock);
> > +             up_read(&vma->vm_lock.lock);
> >               return false;
> >       }
> >       return true;
> > @@ -729,7 +735,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> >  static inline void vma_start_read_locked_nested(struct vm_area_struct *vma, int subclass)
> >  {
> >       mmap_assert_locked(vma->vm_mm);
> > -     down_read_nested(&vma->vm_lock->lock, subclass);
> > +     down_read_nested(&vma->vm_lock.lock, subclass);
> >  }
> >
> >  /*
> > @@ -739,13 +745,13 @@ static inline void vma_start_read_locked_nested(struct vm_area_struct *vma, int
> >  static inline void vma_start_read_locked(struct vm_area_struct *vma)
> >  {
> >       mmap_assert_locked(vma->vm_mm);
> > -     down_read(&vma->vm_lock->lock);
> > +     down_read(&vma->vm_lock.lock);
> >  }
> >
> >  static inline void vma_end_read(struct vm_area_struct *vma)
> >  {
> >       rcu_read_lock(); /* keeps vma alive till the end of up_read */
> > -     up_read(&vma->vm_lock->lock);
> > +     up_read(&vma->vm_lock.lock);
> >       rcu_read_unlock();
> >  }
> >
> > @@ -774,7 +780,7 @@ static inline void vma_start_write(struct vm_area_struct *vma)
> >       if (__is_vma_write_locked(vma, &mm_lock_seq))
> >               return;
> >
> > -     down_write(&vma->vm_lock->lock);
> > +     down_write(&vma->vm_lock.lock);
> >       /*
> >        * We should use WRITE_ONCE() here because we can have concurrent reads
> >        * from the early lockless pessimistic check in vma_start_read().
> > @@ -782,7 +788,7 @@ static inline void vma_start_write(struct vm_area_struct *vma)
> >        * we should use WRITE_ONCE() for cleanliness and to keep KCSAN happy.
> >        */
> >       WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
> > -     up_write(&vma->vm_lock->lock);
> > +     up_write(&vma->vm_lock.lock);
> >  }
> >
> >  static inline void vma_assert_write_locked(struct vm_area_struct *vma)
> > @@ -794,7 +800,7 @@ static inline void vma_assert_write_locked(struct vm_area_struct *vma)
> >
> >  static inline void vma_assert_locked(struct vm_area_struct *vma)
> >  {
> > -     if (!rwsem_is_locked(&vma->vm_lock->lock))
> > +     if (!rwsem_is_locked(&vma->vm_lock.lock))
> >               vma_assert_write_locked(vma);
> >  }
> >
> > @@ -827,6 +833,7 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> >
> >  #else /* CONFIG_PER_VMA_LOCK */
> >
> > +static inline void vma_lock_init(struct vm_area_struct *vma) {}
> >  static inline bool vma_start_read(struct vm_area_struct *vma)
> >               { return false; }
> >  static inline void vma_end_read(struct vm_area_struct *vma) {}
> > @@ -861,10 +868,6 @@ static inline void assert_fault_locked(struct vm_fault *vmf)
> >
> >  extern const struct vm_operations_struct vma_dummy_vm_ops;
> >
> > -/*
> > - * WARNING: vma_init does not initialize vma->vm_lock.
> > - * Use vm_area_alloc()/vm_area_free() if vma needs locking.
> > - */
> >  static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
> >  {
> >       memset(vma, 0, sizeof(*vma));
> > @@ -873,6 +876,7 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
> >       INIT_LIST_HEAD(&vma->anon_vma_chain);
> >       vma_mark_detached(vma, false);
> >       vma_numab_state_init(vma);
> > +     vma_lock_init(vma);
> >  }
> >
> >  /* Use when VMA is not part of the VMA tree and needs no locking */
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index 80fef38d9d64..5c4bfdcfac72 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -716,8 +716,6 @@ struct vm_area_struct {
> >        * slowpath.
> >        */
> >       unsigned int vm_lock_seq;
> > -     /* Unstable RCU readers are allowed to read this. */
> > -     struct vma_lock *vm_lock;
> >  #endif
> >
> >       /*
> > @@ -770,6 +768,10 @@ struct vm_area_struct {
> >       struct vma_numab_state *numab_state;    /* NUMA Balancing state */
> >  #endif
> >       struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
> > +#ifdef CONFIG_PER_VMA_LOCK
> > +     /* Unstable RCU readers are allowed to read this. */
> > +     struct vma_lock vm_lock ____cacheline_aligned_in_smp;
> > +#endif
> >  } __randomize_layout;
> >
> >  #ifdef CONFIG_NUMA
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index 0061cf2450ef..7823797e31d2 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -436,35 +436,6 @@ static struct kmem_cache *vm_area_cachep;
> >  /* SLAB cache for mm_struct structures (tsk->mm) */
> >  static struct kmem_cache *mm_cachep;
> >
> > -#ifdef CONFIG_PER_VMA_LOCK
> > -
> > -/* SLAB cache for vm_area_struct.lock */
> > -static struct kmem_cache *vma_lock_cachep;
> > -
> > -static bool vma_lock_alloc(struct vm_area_struct *vma)
> > -{
> > -     vma->vm_lock = kmem_cache_alloc(vma_lock_cachep, GFP_KERNEL);
> > -     if (!vma->vm_lock)
> > -             return false;
> > -
> > -     init_rwsem(&vma->vm_lock->lock);
> > -     vma->vm_lock_seq = UINT_MAX;
> > -
> > -     return true;
> > -}
> > -
> > -static inline void vma_lock_free(struct vm_area_struct *vma)
> > -{
> > -     kmem_cache_free(vma_lock_cachep, vma->vm_lock);
> > -}
> > -
> > -#else /* CONFIG_PER_VMA_LOCK */
> > -
> > -static inline bool vma_lock_alloc(struct vm_area_struct *vma) { return true; }
> > -static inline void vma_lock_free(struct vm_area_struct *vma) {}
> > -
> > -#endif /* CONFIG_PER_VMA_LOCK */
> > -
> >  struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
> >  {
> >       struct vm_area_struct *vma;
> > @@ -474,10 +445,6 @@ struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
> >               return NULL;
> >
> >       vma_init(vma, mm);
> > -     if (!vma_lock_alloc(vma)) {
> > -             kmem_cache_free(vm_area_cachep, vma);
> > -             return NULL;
> > -     }
> >
> >       return vma;
> >  }
> > @@ -496,10 +463,7 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
> >        * will be reinitialized.
> >        */
> >       data_race(memcpy(new, orig, sizeof(*new)));
> > -     if (!vma_lock_alloc(new)) {
> > -             kmem_cache_free(vm_area_cachep, new);
> > -             return NULL;
> > -     }
> > +     vma_lock_init(new);
> >       INIT_LIST_HEAD(&new->anon_vma_chain);
> >       vma_numab_state_init(new);
> >       dup_anon_vma_name(orig, new);
> > @@ -511,7 +475,6 @@ void __vm_area_free(struct vm_area_struct *vma)
> >  {
> >       vma_numab_state_free(vma);
> >       free_anon_vma_name(vma);
> > -     vma_lock_free(vma);
> >       kmem_cache_free(vm_area_cachep, vma);
> >  }
> >
> > @@ -522,7 +485,7 @@ static void vm_area_free_rcu_cb(struct rcu_head *head)
> >                                                 vm_rcu);
> >
> >       /* The vma should not be locked while being destroyed. */
> > -     VM_BUG_ON_VMA(rwsem_is_locked(&vma->vm_lock->lock), vma);
> > +     VM_BUG_ON_VMA(rwsem_is_locked(&vma->vm_lock.lock), vma);
> >       __vm_area_free(vma);
> >  }
> >  #endif
> > @@ -3168,11 +3131,9 @@ void __init proc_caches_init(void)
> >                       sizeof(struct fs_struct), 0,
> >                       SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT,
> >                       NULL);
> > -
> > -     vm_area_cachep = KMEM_CACHE(vm_area_struct, SLAB_PANIC|SLAB_ACCOUNT);
> > -#ifdef CONFIG_PER_VMA_LOCK
> > -     vma_lock_cachep = KMEM_CACHE(vma_lock, SLAB_PANIC|SLAB_ACCOUNT);
> > -#endif
> > +     vm_area_cachep = KMEM_CACHE(vm_area_struct,
> > +                     SLAB_HWCACHE_ALIGN|SLAB_NO_MERGE|SLAB_PANIC|
> > +                     SLAB_ACCOUNT);
>
> Why the SLAB_NO_MERGE?

Ah, I had it there for convenience to be able to see them separately
in /proc/slabinfo. With SLAB_HWCACHE_ALIGN I don't think it matters
for cacheline sharing...
Once we add SLAB_TYPESAFE_BY_RCU this flag won't matter anyway because
SLAB_TYPESAFE_BY_RCU itself prevents slab merging.
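IOW, after patch 4/5 the cache creation should end up roughly as below (the
exact flag combination is an assumption at this point), and a
SLAB_TYPESAFE_BY_RCU cache is never merged anyway:

        vm_area_cachep = KMEM_CACHE(vm_area_struct,
                        SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT|
                        SLAB_TYPESAFE_BY_RCU);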

>
> >       mmap_init();
> >       nsproxy_cache_init();
> >  }
> > --
> > 2.47.0.277.g8800431eea-goog
> >


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 2/5] mm: move per-vma lock into vm_area_struct
  2024-11-13 15:01     ` Lorenzo Stoakes
@ 2024-11-13 15:45       ` Suren Baghdasaryan
  0 siblings, 0 replies; 39+ messages in thread
From: Suren Baghdasaryan @ 2024-11-13 15:45 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, willy, liam.howlett, mhocko, vbabka, hannes, mjguzik,
	oliver.sang, mgorman, david, peterx, oleg, dave, paulmck,
	brauner, dhowells, hdanton, hughd, minchan, jannh, shakeel.butt,
	souravpanda, pasha.tatashin, linux-mm, linux-kernel, kernel-team

On Wed, Nov 13, 2024 at 7:02 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Wed, Nov 13, 2024 at 02:28:16PM +0000, Lorenzo Stoakes wrote:
> > On Tue, Nov 12, 2024 at 11:46:32AM -0800, Suren Baghdasaryan wrote:
> > > Back when per-vma locks were introduced, vm_lock was moved out of
> > > vm_area_struct in [1] because of the performance regression caused by
> > > false cacheline sharing. Recent investigation [2] revealed that the
> > > regression is limited to a rather old Broadwell microarchitecture and
> > > even there it can be mitigated by disabling adjacent cacheline
> > > prefetching, see [3].
> >
> > I don't see a motivating reason as to why we want to do this? We increase
> > memory usage here which is not good, but later lock optimisation mitigates
> > it, but why wouldn't we just do the lock optimisations and use less memory
> > overall?
>
> I worded this badly. To clarify:
>
> I don't see a motivating reason _in the commit message_ as to why we want
> to do this.
>
> I am certain there are, in fact Mateusz and Vlastimil have provided them.
>
> So my review is - let's just put these there :)

Yeah, I had trouble wording all the reasons because in my head it was
simply "the right thing to do". Now with all your input my job has
become much easier :) Thanks folks!


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 4/5] mm: make vma cache SLAB_TYPESAFE_BY_RCU
  2024-11-13 15:29             ` Liam R. Howlett
@ 2024-11-13 15:47               ` Suren Baghdasaryan
  2024-11-13 19:05                 ` Suren Baghdasaryan
  0 siblings, 1 reply; 39+ messages in thread
From: Suren Baghdasaryan @ 2024-11-13 15:47 UTC (permalink / raw)
  To: Liam R. Howlett, Suren Baghdasaryan, Matthew Wilcox,
	Vlastimil Babka, akpm, lorenzo.stoakes, mhocko, hannes, mjguzik,
	oliver.sang, mgorman, david, peterx, oleg, dave, paulmck,
	brauner, dhowells, hdanton, hughd, minchan, jannh, shakeel.butt,
	souravpanda, pasha.tatashin, linux-mm, linux-kernel, kernel-team

On Wed, Nov 13, 2024 at 7:29 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> * Suren Baghdasaryan <surenb@google.com> [241113 10:25]:
> > On Wed, Nov 13, 2024 at 7:23 AM 'Liam R. Howlett' via kernel-team
> > <kernel-team@android.com> wrote:
> > >
> > > * Matthew Wilcox <willy@infradead.org> [241113 08:57]:
> > > > On Wed, Nov 13, 2024 at 07:38:02AM -0500, Liam R. Howlett wrote:
> > > > > > Hi, I was wondering if we actually need the detached flag. Couldn't
> > > > > > "detached" simply mean vma->vm_mm == NULL and we save 4 bytes? Do we ever
> > > > > > need a vma that's detached but still has a mm pointer? I'd hope the places
> > > > > > that set detached to false have the mm pointer around so it's not inconvenient.
> > > > >
> > > > > I think the gate vmas ruin this plan.
> > > >
> > > > But the gate VMAs aren't to be found in the VMA tree.  Used to be that
> > > > was because the VMA tree was an intrusive RB tree and so VMAs could
> > > > only be in one tree at a time.  We could change that now!
> > >
> > > \o/
> > >
> > > >
> > > > Anyway, we could use (void *)1 instead of NULL to indicate a "detached"
> > > > VMA if we need to distinguish between a detached VMA and a gate VMA.
> > >
> > > I was thinking a pointer to itself vma->vm_mm = vma, then a check for
> > > this, instead of null like we do today.
> >
> > The motivation for having a separate detached flag was that vma->vm_mm
> > is used when read/write locking the vma, so it has to stay valid even
> > when vma gets detached. Maybe we can be more cautious in
> > vma_start_read()/vma_start_write() about it but I don't recall if
> > those were the only places where that was an issue.
>
> We have the mm from the callers though, so it could be passed in?

Let me try and see if something else blows up. When I was implementing
per-vma locks I thought about using vma->vm_mm to indicate detached
state but there were some issues that caused me to reconsider.
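For reference, the variant I'm trying is roughly the following (sketch only;
gate VMAs and the locking paths need separate handling):

        /* encode "detached" in vm_mm instead of a separate flag */
        static inline bool is_vma_detached(struct vm_area_struct *vma)
        {
                return vma->vm_mm == NULL;
        }

        static inline void vma_mark_detached(struct vm_area_struct *vma)
        {
                /* callers that later lock the vma must get the mm elsewhere */
                vma->vm_mm = NULL;
        }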

>
> >
> > >
> > > Either way, we should make it a function so it's easier to reuse for
> > > whatever we need in the future, wdyt?
> > >
> > > To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com.
> > >


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 4/5] mm: make vma cache SLAB_TYPESAFE_BY_RCU
  2024-11-13 15:22         ` Liam R. Howlett
  2024-11-13 15:25           ` Suren Baghdasaryan
@ 2024-11-13 16:44           ` Jann Horn
  2024-11-13 20:59             ` Matthew Wilcox
  1 sibling, 1 reply; 39+ messages in thread
From: Jann Horn @ 2024-11-13 16:44 UTC (permalink / raw)
  To: Liam R. Howlett, Matthew Wilcox, Vlastimil Babka,
	Suren Baghdasaryan, akpm, lorenzo.stoakes, mhocko, hannes,
	mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, minchan, jannh,
	shakeel.butt, souravpanda, pasha.tatashin, linux-mm,
	linux-kernel, kernel-team

On Wed, Nov 13, 2024 at 4:23 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> * Matthew Wilcox <willy@infradead.org> [241113 08:57]:
> > On Wed, Nov 13, 2024 at 07:38:02AM -0500, Liam R. Howlett wrote:
> > > > Hi, I was wondering if we actually need the detached flag. Couldn't
> > > > "detached" simply mean vma->vm_mm == NULL and we save 4 bytes? Do we ever
> > > > need a vma that's detached but still has a mm pointer? I'd hope the places
> > > > that set detached to false have the mm pointer around so it's not inconvenient.
> > >
> > > I think the gate vmas ruin this plan.
> >
> > But the gate VMAs aren't to be found in the VMA tree.  Used to be that
> > was because the VMA tree was an intrusive RB tree and so VMAs could
> > only be in one tree at a time.  We could change that now!
>
> \o/
>
> >
> > Anyway, we could use (void *)1 instead of NULL to indicate a "detached"
> > VMA if we need to distinguish between a detached VMA and a gate VMA.
>
> I was thinking a pointer to itself vma->vm_mm = vma, then a check for
> this, instead of null like we do today.

Sidenote:
Something like NULL or (void*)1 is fine with me but please don't do
pointer-to-itself - we shouldn't unnecessarily store a pointer to an
object of one type in a pointer field of an incompatible type, that
increases the risk of creating type confusion issues (both in the
memory corruption sense and in the Spectre sense). I know MM already
has several places where similar stuff can happen (in particular
page->mapping), but here it seems like unnecessary risk to me.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 4/5] mm: make vma cache SLAB_TYPESAFE_BY_RCU
  2024-11-13 15:47               ` Suren Baghdasaryan
@ 2024-11-13 19:05                 ` Suren Baghdasaryan
  2024-11-14 16:18                   ` Suren Baghdasaryan
  0 siblings, 1 reply; 39+ messages in thread
From: Suren Baghdasaryan @ 2024-11-13 19:05 UTC (permalink / raw)
  To: Liam R. Howlett, Suren Baghdasaryan, Matthew Wilcox,
	Vlastimil Babka, akpm, lorenzo.stoakes, mhocko, hannes, mjguzik,
	oliver.sang, mgorman, david, peterx, oleg, dave, paulmck,
	brauner, dhowells, hdanton, hughd, minchan, jannh, shakeel.butt,
	souravpanda, pasha.tatashin, linux-mm, linux-kernel, kernel-team

On Wed, Nov 13, 2024 at 7:47 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Wed, Nov 13, 2024 at 7:29 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> >
> > * Suren Baghdasaryan <surenb@google.com> [241113 10:25]:
> > > On Wed, Nov 13, 2024 at 7:23 AM 'Liam R. Howlett' via kernel-team
> > > <kernel-team@android.com> wrote:
> > > >
> > > > * Matthew Wilcox <willy@infradead.org> [241113 08:57]:
> > > > > On Wed, Nov 13, 2024 at 07:38:02AM -0500, Liam R. Howlett wrote:
> > > > > > > Hi, I was wondering if we actually need the detached flag. Couldn't
> > > > > > > "detached" simply mean vma->vm_mm == NULL and we save 4 bytes? Do we ever
> > > > > > > need a vma that's detached but still has a mm pointer? I'd hope the places
> > > > > > > that set detached to false have the mm pointer around so it's not inconvenient.
> > > > > >
> > > > > > I think the gate vmas ruin this plan.
> > > > >
> > > > > But the gate VMAs aren't to be found in the VMA tree.  Used to be that
> > > > > was because the VMA tree was an intrusive RB tree and so VMAs could
> > > > > only be in one tree at a time.  We could change that now!
> > > >
> > > > \o/
> > > >
> > > > >
> > > > > Anyway, we could use (void *)1 instead of NULL to indicate a "detached"
> > > > > VMA if we need to distinguish between a detached VMA and a gate VMA.
> > > >
> > > > I was thinking a pointer to itself vma->vm_mm = vma, then a check for
> > > > this, instead of null like we do today.
> > >
> > > The motivation for having a separate detached flag was that vma->vm_mm
> > > is used when read/write locking the vma, so it has to stay valid even
> > > when vma gets detached. Maybe we can be more cautious in
> > > vma_start_read()/vma_start_write() about it but I don't recall if
> > > those were the only places that was an issue.
> >
> > We have the mm from the callers though, so it could be passed in?
>
> Let me try and see if something else blows up. When I was implementing
> per-vma locks I thought about using vma->vm_mm to indicate detached
> state but there were some issues that caused me to reconsider.

Yeah, a quick change reveals the first mine explosion:

[    2.838900] BUG: kernel NULL pointer dereference, address: 0000000000000480
[    2.840671] #PF: supervisor read access in kernel mode
[    2.841958] #PF: error_code(0x0000) - not-present page
[    2.843248] PGD 800000010835a067 P4D 800000010835a067 PUD 10835b067 PMD 0
[    2.844920] Oops: Oops: 0000 [#1] PREEMPT SMP PTI
[    2.846078] CPU: 2 UID: 0 PID: 1 Comm: init Not tainted
6.12.0-rc6-00258-ga587fcd91b06-dirty #111
[    2.848277] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[    2.850673] RIP: 0010:unmap_vmas+0x84/0x190
[    2.851717] Code: 00 00 00 00 48 c7 44 24 48 00 00 00 00 48 c7 44
24 18 00 00 00 00 48 89 44 24 28 4c 89 44 24 38 e8 b1 c0 d1 00 48 8b
44 24 28 <48> 83 b8 80 04 00 00 00 0f 85 dd 00 00 00 45 0f b6 ed 49 83
ec 01
[    2.856287] RSP: 0000:ffffa298c0017a18 EFLAGS: 00010246
[    2.857599] RAX: 0000000000000000 RBX: 00007f48ccbb4000 RCX: 00007f48ccbb4000
[    2.859382] RDX: ffff8918c26401e0 RSI: ffffa298c0017b98 RDI: ffffa298c0017ab0
[    2.861156] RBP: 00007f48ccdb6000 R08: 00007f48ccdb6000 R09: 0000000000000001
[    2.862941] R10: 0000000000000040 R11: ffff8918c2637108 R12: 0000000000000001
[    2.864719] R13: 0000000000000001 R14: ffff8918c26401e0 R15: ffffa298c0017b98
[    2.866472] FS:  0000000000000000(0000) GS:ffff8927bf080000(0000)
knlGS:0000000000000000
[    2.868439] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    2.869877] CR2: 0000000000000480 CR3: 000000010263e000 CR4: 0000000000750ef0
[    2.871661] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    2.873419] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    2.875185] PKRU: 55555554
[    2.875871] Call Trace:
[    2.876503]  <TASK>
[    2.877047]  ? __die+0x1e/0x60
[    2.877824]  ? page_fault_oops+0x17b/0x4a0
[    2.878857]  ? exc_page_fault+0x6b/0x150
[    2.879841]  ? asm_exc_page_fault+0x26/0x30
[    2.880886]  ? unmap_vmas+0x84/0x190
[    2.881783]  ? unmap_vmas+0x7f/0x190
[    2.882680]  vms_clear_ptes+0x106/0x160
[    2.883621]  vms_complete_munmap_vmas+0x53/0x170
[    2.884762]  do_vmi_align_munmap+0x15e/0x1d0
[    2.885838]  do_vmi_munmap+0xcb/0x160
[    2.886760]  __vm_munmap+0xa4/0x150
[    2.887637]  elf_load+0x1c4/0x250
[    2.888473]  load_elf_binary+0xabb/0x1680
[    2.889476]  ? __kernel_read+0x111/0x320
[    2.890458]  ? load_misc_binary+0x1bc/0x2c0
[    2.891510]  bprm_execve+0x23e/0x5e0
[    2.892408]  kernel_execve+0xf3/0x140
[    2.893331]  ? __pfx_kernel_init+0x10/0x10
[    2.894356]  kernel_init+0xe5/0x1c0
[    2.895241]  ret_from_fork+0x2c/0x50
[    2.896141]  ? __pfx_kernel_init+0x10/0x10
[    2.897164]  ret_from_fork_asm+0x1a/0x30
[    2.898148]  </TASK>

Looks like we are detaching VMAs and then unmapping them, where
vms_clear_ptes() uses vms->vma->vm_mm. I'll try to clean up this and
other paths and will see how many changes are required to make this
work.
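The direction would be something like the sketch below - pass the mm in
explicitly instead of reading vma->vm_mm after the vma has been detached
(purely illustrative, not the actual vms_clear_ptes() signature):

        static void clear_ptes(struct mm_struct *mm, struct vm_area_struct *vma,
                               unsigned long start, unsigned long end)
        {
                struct mmu_gather tlb;

                tlb_gather_mmu(&tlb, mm);       /* was: vma->vm_mm */
                unmap_page_range(&tlb, vma, start, end, NULL);
                tlb_finish_mmu(&tlb);
        }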

>
> >
> > >
> > > >
> > > > Either way, we should make it a function so it's easier to reuse for
> > > > whatever we need in the future, wdyt?
> > > >
> > > > To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com.
> > > >


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 4/5] mm: make vma cache SLAB_TYPESAFE_BY_RCU
  2024-11-13 16:44           ` Jann Horn
@ 2024-11-13 20:59             ` Matthew Wilcox
  2024-11-13 21:23               ` Jann Horn
  0 siblings, 1 reply; 39+ messages in thread
From: Matthew Wilcox @ 2024-11-13 20:59 UTC (permalink / raw)
  To: Jann Horn
  Cc: Liam R. Howlett, Vlastimil Babka, Suren Baghdasaryan, akpm,
	lorenzo.stoakes, mhocko, hannes, mjguzik, oliver.sang, mgorman,
	david, peterx, oleg, dave, paulmck, brauner, dhowells, hdanton,
	hughd, minchan, shakeel.butt, souravpanda, pasha.tatashin,
	linux-mm, linux-kernel, kernel-team

On Wed, Nov 13, 2024 at 05:44:00PM +0100, Jann Horn wrote:
> Something like NULL or (void*)1 is fine with me but please don't do
> pointer-to-itself - we shouldn't unnecessarily store a pointer to an
> object of one type in a pointer field of an incompatible type, that
> increases the risk of creating type confusion issues (both in the
> memory corruption sense and in the Spectre sense). I know MM already
> has several places where similar stuff can happen (in particular
> page->mapping), but here it seems like unnecessary risk to me.

Hm?  I don't think page->mapping can ever point at page.  As far as I
know, we have four cases, discriminated by the bottom two bits:

0 - NULL or address_space
1 - anon_vma
2 - movable_ops
3 - ksm_stable_node

In fact, we're almost done eliminating page->mapping.  Just a few
filesystems and device drivers left to go.

Would it be helpful if we did:

-	struct address_space *mapping;
+	union {
+		struct address_space *mapping;
+		unsigned long raw_mapping;
+	};

and had non-filesystems use raw_mapping and do the masking?
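For the non-filesystem side that would look roughly like the following,
assuming the existing PAGE_MAPPING_* bit definitions stay as they are
(raw_mapping being the hypothetical union member from above):

        unsigned long raw = folio->raw_mapping;

        switch (raw & PAGE_MAPPING_FLAGS) {
        case 0:                         /* NULL or struct address_space * */
                break;
        case PAGE_MAPPING_ANON:         /* struct anon_vma * */
                break;
        case PAGE_MAPPING_MOVABLE:      /* movable_ops */
                break;
        case PAGE_MAPPING_KSM:          /* KSM stable node */
                break;
        }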


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 4/5] mm: make vma cache SLAB_TYPESAFE_BY_RCU
  2024-11-13 20:59             ` Matthew Wilcox
@ 2024-11-13 21:23               ` Jann Horn
  0 siblings, 0 replies; 39+ messages in thread
From: Jann Horn @ 2024-11-13 21:23 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Liam R. Howlett, Vlastimil Babka, Suren Baghdasaryan, akpm,
	lorenzo.stoakes, mhocko, hannes, mjguzik, oliver.sang, mgorman,
	david, peterx, oleg, dave, paulmck, brauner, dhowells, hdanton,
	hughd, minchan, shakeel.butt, souravpanda, pasha.tatashin,
	linux-mm, linux-kernel, kernel-team

On Wed, Nov 13, 2024 at 9:59 PM Matthew Wilcox <willy@infradead.org> wrote:
> On Wed, Nov 13, 2024 at 05:44:00PM +0100, Jann Horn wrote:
> > Something like NULL or (void*)1 is fine with me but please don't do
> > pointer-to-itself - we shouldn't unnecessarily store a pointer to an
> > object of one type in a pointer field of an incompatible type, that
> > increases the risk of creating type confusion issues (both in the
> > memory corruption sense and in the Spectre sense). I know MM already
> > has several places where similar stuff can happen (in particular
> > page->mapping), but here it seems like unnecessary risk to me.
>
> Hm?  I don't think page->mapping can ever point at page.  As far as I
> know, we have four cases, discriminated by the bottom two bits:
>
> 0 - NULL or address_space
> 1 - anon_vma
> 2 - movable_ops
> 3 - ksm_stable_node

Oh, I didn't even know about cases 2 and 3.

Ah, yes, I didn't mean it can point at itself, I meant the pattern of
having a pointer declared as "points to type A" ("struct address_space
*mapping") while it actually points at other types sometimes.

> In fact, we're almost done eliminating page->mapping.  Just a few
> filesystems and device drivers left to go.

Ah, you mean by replacing it with folio->mapping as in
https://lore.kernel.org/all/20241025190822.1319162-4-willy@infradead.org/
?

> Would it be halpful if we did:
>
> -       struct address_space *mapping;
> +       union {
> +               struct address_space *mapping;
> +               unsigned long raw_mapping;
> +       };
>
> and had non-filesystems use raw_mapping and do the masking?

While I think that would look a tiny bit tidier, I don't think it
would make a significant difference for page->mapping or
folio->mapping. In the context of the VMA field, the question is about
whether we make it possible for the same memory location to contain
values of different types, which I think increases the risk of
programmer mistakes or speculative confusions; while in the context of
the page->mapping field, the same memory location can contain values
of different types either way. So while aesthetically I think having a
union for the mapping field would look a little nicer, I don't
actually see much benefit in making such a change.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 4/5] mm: make vma cache SLAB_TYPESAFE_BY_RCU
  2024-11-13 19:05                 ` Suren Baghdasaryan
@ 2024-11-14 16:18                   ` Suren Baghdasaryan
  2024-11-14 16:21                     ` Vlastimil Babka
  0 siblings, 1 reply; 39+ messages in thread
From: Suren Baghdasaryan @ 2024-11-14 16:18 UTC (permalink / raw)
  To: Liam R. Howlett, Suren Baghdasaryan, Matthew Wilcox,
	Vlastimil Babka, akpm, lorenzo.stoakes, mhocko, hannes, mjguzik,
	oliver.sang, mgorman, david, peterx, oleg, dave, paulmck,
	brauner, dhowells, hdanton, hughd, minchan, jannh, shakeel.butt,
	souravpanda, pasha.tatashin, linux-mm, linux-kernel, kernel-team

On Wed, Nov 13, 2024 at 11:05 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Wed, Nov 13, 2024 at 7:47 AM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Wed, Nov 13, 2024 at 7:29 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> > >
> > > * Suren Baghdasaryan <surenb@google.com> [241113 10:25]:
> > > > On Wed, Nov 13, 2024 at 7:23 AM 'Liam R. Howlett' via kernel-team
> > > > <kernel-team@android.com> wrote:
> > > > >
> > > > > * Matthew Wilcox <willy@infradead.org> [241113 08:57]:
> > > > > > On Wed, Nov 13, 2024 at 07:38:02AM -0500, Liam R. Howlett wrote:
> > > > > > > > Hi, I was wondering if we actually need the detached flag. Couldn't
> > > > > > > > "detached" simply mean vma->vm_mm == NULL and we save 4 bytes? Do we ever
> > > > > > > > need a vma that's detached but still has a mm pointer? I'd hope the places
> > > > > > > > that set detached to false have the mm pointer around so it's not inconvenient.
> > > > > > >
> > > > > > > I think the gate vmas ruin this plan.
> > > > > >
> > > > > > But the gate VMAs aren't to be found in the VMA tree.  Used to be that
> > > > > > was because the VMA tree was an intrusive RB tree and so VMAs could
> > > > > > only be in one tree at a time.  We could change that now!
> > > > >
> > > > > \o/
> > > > >
> > > > > >
> > > > > > Anyway, we could use (void *)1 instead of NULL to indicate a "detached"
> > > > > > VMA if we need to distinguish between a detached VMA and a gate VMA.
> > > > >
> > > > > I was thinking a pointer to itself vma->vm_mm = vma, then a check for
> > > > > this, instead of null like we do today.
> > > >
> > > > The motivation for having a separate detached flag was that vma->vm_mm
> > > > is used when read/write locking the vma, so it has to stay valid even
> > > > when vma gets detached. Maybe we can be more cautious in
> > > > vma_start_read()/vma_start_write() about it but I don't recall if
> > > > those were the only places that was an issue.
> > >
> > > We have the mm from the callers though, so it could be passed in?
> >
> > Let me try and see if something else blows up. When I was implementing
> > per-vma locks I thought about using vma->vm_mm to indicate detached
> > state but there were some issues that caused me to reconsider.
>
> Yeah, a quick change reveals the first mine explosion:
>
> [    2.838900] BUG: kernel NULL pointer dereference, address: 0000000000000480
> [    2.840671] #PF: supervisor read access in kernel mode
> [    2.841958] #PF: error_code(0x0000) - not-present page
> [    2.843248] PGD 800000010835a067 P4D 800000010835a067 PUD 10835b067 PMD 0
> [    2.844920] Oops: Oops: 0000 [#1] PREEMPT SMP PTI
> [    2.846078] CPU: 2 UID: 0 PID: 1 Comm: init Not tainted
> 6.12.0-rc6-00258-ga587fcd91b06-dirty #111
> [    2.848277] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS 1.16.3-debian-1.16.3-2 04/01/2014
> [    2.850673] RIP: 0010:unmap_vmas+0x84/0x190
> [    2.851717] Code: 00 00 00 00 48 c7 44 24 48 00 00 00 00 48 c7 44
> 24 18 00 00 00 00 48 89 44 24 28 4c 89 44 24 38 e8 b1 c0 d1 00 48 8b
> 44 24 28 <48> 83 b8 80 04 00 00 00 0f 85 dd 00 00 00 45 0f b6 ed 49 83
> ec 01
> [    2.856287] RSP: 0000:ffffa298c0017a18 EFLAGS: 00010246
> [    2.857599] RAX: 0000000000000000 RBX: 00007f48ccbb4000 RCX: 00007f48ccbb4000
> [    2.859382] RDX: ffff8918c26401e0 RSI: ffffa298c0017b98 RDI: ffffa298c0017ab0
> [    2.861156] RBP: 00007f48ccdb6000 R08: 00007f48ccdb6000 R09: 0000000000000001
> [    2.862941] R10: 0000000000000040 R11: ffff8918c2637108 R12: 0000000000000001
> [    2.864719] R13: 0000000000000001 R14: ffff8918c26401e0 R15: ffffa298c0017b98
> [    2.866472] FS:  0000000000000000(0000) GS:ffff8927bf080000(0000)
> knlGS:0000000000000000
> [    2.868439] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    2.869877] CR2: 0000000000000480 CR3: 000000010263e000 CR4: 0000000000750ef0
> [    2.871661] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [    2.873419] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [    2.875185] PKRU: 55555554
> [    2.875871] Call Trace:
> [    2.876503]  <TASK>
> [    2.877047]  ? __die+0x1e/0x60
> [    2.877824]  ? page_fault_oops+0x17b/0x4a0
> [    2.878857]  ? exc_page_fault+0x6b/0x150
> [    2.879841]  ? asm_exc_page_fault+0x26/0x30
> [    2.880886]  ? unmap_vmas+0x84/0x190
> [    2.881783]  ? unmap_vmas+0x7f/0x190
> [    2.882680]  vms_clear_ptes+0x106/0x160
> [    2.883621]  vms_complete_munmap_vmas+0x53/0x170
> [    2.884762]  do_vmi_align_munmap+0x15e/0x1d0
> [    2.885838]  do_vmi_munmap+0xcb/0x160
> [    2.886760]  __vm_munmap+0xa4/0x150
> [    2.887637]  elf_load+0x1c4/0x250
> [    2.888473]  load_elf_binary+0xabb/0x1680
> [    2.889476]  ? __kernel_read+0x111/0x320
> [    2.890458]  ? load_misc_binary+0x1bc/0x2c0
> [    2.891510]  bprm_execve+0x23e/0x5e0
> [    2.892408]  kernel_execve+0xf3/0x140
> [    2.893331]  ? __pfx_kernel_init+0x10/0x10
> [    2.894356]  kernel_init+0xe5/0x1c0
> [    2.895241]  ret_from_fork+0x2c/0x50
> [    2.896141]  ? __pfx_kernel_init+0x10/0x10
> [    2.897164]  ret_from_fork_asm+0x1a/0x30
> [    2.898148]  </TASK>
>
> Looks like we are detaching VMAs and then unmapping them, where
> vms_clear_ptes() uses vms->vma->vm_mm. I'll try to clean up this and
> other paths and will see how many changes are required to make this
> work.

Ok, my vma->detached deprecation effort got to the point that my QEMU
boots. The change is not pretty and I'm quite sure I did not cover all
cases yet (like hugepages):

 arch/arm/kernel/process.c             |   2 +-
 arch/arm64/kernel/vdso.c              |   4 +-
 arch/loongarch/kernel/vdso.c          |   2 +-
 arch/powerpc/kernel/vdso.c            |   2 +-
 arch/powerpc/platforms/pseries/vas.c  |   2 +-
 arch/riscv/kernel/vdso.c              |   4 +-
 arch/s390/kernel/vdso.c               |   2 +-
 arch/s390/mm/gmap.c                   |   2 +-
 arch/x86/entry/vdso/vma.c             |   2 +-
 arch/x86/kernel/cpu/sgx/encl.c        |   2 +-
 arch/x86/um/mem_32.c                  |   2 +-
 drivers/android/binder_alloc.c        |   2 +-
 drivers/gpu/drm/i915/i915_mm.c        |   4 +-
 drivers/infiniband/core/uverbs_main.c |   4 +-
 drivers/misc/sgi-gru/grumain.c        |   2 +-
 fs/exec.c                             |   2 +-
 fs/hugetlbfs/inode.c                  |   3 +-
 include/linux/mm.h                    | 111 +++++++++++++++++---------
 include/linux/mm_types.h              |   6 --
 kernel/bpf/arena.c                    |   2 +-
 kernel/fork.c                         |   6 +-
 mm/debug_vm_pgtable.c                 |   2 +-
 mm/internal.h                         |   2 +-
 mm/madvise.c                          |   4 +-
 mm/memory.c                           |  39 ++++-----
 mm/mmap.c                             |   9 +--
 mm/nommu.c                            |   6 +-
 mm/oom_kill.c                         |   2 +-
 mm/vma.c                              |  62 +++++++-------
 mm/vma.h                              |   2 +-
 net/ipv4/tcp.c                        |   4 +-
 31 files changed, 164 insertions(+), 136 deletions(-)

Many of the unmap_* and zap_* functions should get an `mm` parameter
to make this work.
So, if we take this route, it should definitely be a separate patch,
which will likely cause some instability issues for some time until
all the edge cases are ironed out. I would like to proceed with this
patch series first before attempting to deprecate vma->detached. Let
me know if you have objections to this plan.

>
> >
> > >
> > > >
> > > > >
> > > > > Either way, we should make it a function so it's easier to reuse for
> > > > > whatever we need in the future, wdyt?
> > > > >
> > > > > To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com.
> > > > >


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 4/5] mm: make vma cache SLAB_TYPESAFE_BY_RCU
  2024-11-14 16:18                   ` Suren Baghdasaryan
@ 2024-11-14 16:21                     ` Vlastimil Babka
  0 siblings, 0 replies; 39+ messages in thread
From: Vlastimil Babka @ 2024-11-14 16:21 UTC (permalink / raw)
  To: Suren Baghdasaryan, Liam R. Howlett, Matthew Wilcox, akpm,
	lorenzo.stoakes, mhocko, hannes, mjguzik, oliver.sang, mgorman,
	david, peterx, oleg, dave, paulmck, brauner, dhowells, hdanton,
	hughd, minchan, jannh, shakeel.butt, souravpanda, pasha.tatashin,
	linux-mm, linux-kernel, kernel-team

On 11/14/24 17:18, Suren Baghdasaryan wrote:
> 
> Ok, my vma->detached deprecation effort got to the point that my QEMU
> boots. The change is not pretty and I'm quite sure I did not cover all
> cases yet (like hugepages):
> 
>  arch/arm/kernel/process.c             |   2 +-
>  arch/arm64/kernel/vdso.c              |   4 +-
>  arch/loongarch/kernel/vdso.c          |   2 +-
>  arch/powerpc/kernel/vdso.c            |   2 +-
>  arch/powerpc/platforms/pseries/vas.c  |   2 +-
>  arch/riscv/kernel/vdso.c              |   4 +-
>  arch/s390/kernel/vdso.c               |   2 +-
>  arch/s390/mm/gmap.c                   |   2 +-
>  arch/x86/entry/vdso/vma.c             |   2 +-
>  arch/x86/kernel/cpu/sgx/encl.c        |   2 +-
>  arch/x86/um/mem_32.c                  |   2 +-
>  drivers/android/binder_alloc.c        |   2 +-
>  drivers/gpu/drm/i915/i915_mm.c        |   4 +-
>  drivers/infiniband/core/uverbs_main.c |   4 +-
>  drivers/misc/sgi-gru/grumain.c        |   2 +-
>  fs/exec.c                             |   2 +-
>  fs/hugetlbfs/inode.c                  |   3 +-
>  include/linux/mm.h                    | 111 +++++++++++++++++---------
>  include/linux/mm_types.h              |   6 --
>  kernel/bpf/arena.c                    |   2 +-
>  kernel/fork.c                         |   6 +-
>  mm/debug_vm_pgtable.c                 |   2 +-
>  mm/internal.h                         |   2 +-
>  mm/madvise.c                          |   4 +-
>  mm/memory.c                           |  39 ++++-----
>  mm/mmap.c                             |   9 +--
>  mm/nommu.c                            |   6 +-
>  mm/oom_kill.c                         |   2 +-
>  mm/vma.c                              |  62 +++++++-------
>  mm/vma.h                              |   2 +-
>  net/ipv4/tcp.c                        |   4 +-
>  31 files changed, 164 insertions(+), 136 deletions(-)
> 
> Many of the unmap_* and zap_* functions should get an `mm` parameter
> to make this work.

Ouch, thanks for the attempt!

> So, if we take this route, it should definitely be a separate patch,
> which will likely cause some instability issues for some time until
> all the edge cases are ironed out. I would like to proceed with this
> patch series first before attempting to deprecate vma->detached. Let
> me know if you have objections to this plan.

No objections!

>>
>> >
>> > >
>> > > >
>> > > > >
>> > > > > Either way, we should make it a function so it's easier to reuse for
>> > > > > whatever we need in the future, wdyt?
>> > > > >
>> > > > > To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com.
>> > > > >



^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2024-12-05 15:23 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-11-12 19:46 [PATCH v2 0/5] move per-vma lock into vm_area_struct Suren Baghdasaryan
2024-11-12 19:46 ` [PATCH v2 1/5] mm: introduce vma_start_read_locked{_nested} helpers Suren Baghdasaryan
2024-11-13 14:10   ` Lorenzo Stoakes
2024-11-13 15:30     ` Suren Baghdasaryan
2024-11-12 19:46 ` [PATCH v2 2/5] mm: move per-vma lock into vm_area_struct Suren Baghdasaryan
2024-11-13 14:28   ` Lorenzo Stoakes
2024-11-13 14:45     ` Vlastimil Babka
2024-11-13 14:58       ` Lorenzo Stoakes
2024-11-13 15:09         ` Vlastimil Babka
2024-11-13 14:53     ` Mateusz Guzik
2024-11-13 14:59       ` Lorenzo Stoakes
2024-11-13 15:01     ` Lorenzo Stoakes
2024-11-13 15:45       ` Suren Baghdasaryan
2024-11-13 15:42     ` Suren Baghdasaryan
2024-11-12 19:46 ` [PATCH v2 3/5] mm: mark vma as detached until it's added into vma tree Suren Baghdasaryan
2024-11-13 14:43   ` Lorenzo Stoakes
2024-11-13 15:37     ` Suren Baghdasaryan
2024-11-12 19:46 ` [PATCH v2 4/5] mm: make vma cache SLAB_TYPESAFE_BY_RCU Suren Baghdasaryan
2024-11-13  2:57   ` Suren Baghdasaryan
2024-11-13  5:08     ` Hugh Dickins
2024-11-13  6:03       ` Suren Baghdasaryan
2024-11-13  6:52         ` Hugh Dickins
2024-11-13  8:19           ` Suren Baghdasaryan
2024-11-13  8:58   ` Vlastimil Babka
2024-11-13 12:38     ` Liam R. Howlett
2024-11-13 13:57       ` Matthew Wilcox
2024-11-13 15:22         ` Liam R. Howlett
2024-11-13 15:25           ` Suren Baghdasaryan
2024-11-13 15:29             ` Liam R. Howlett
2024-11-13 15:47               ` Suren Baghdasaryan
2024-11-13 19:05                 ` Suren Baghdasaryan
2024-11-14 16:18                   ` Suren Baghdasaryan
2024-11-14 16:21                     ` Vlastimil Babka
2024-11-13 16:44           ` Jann Horn
2024-11-13 20:59             ` Matthew Wilcox
2024-11-13 21:23               ` Jann Horn
2024-11-12 19:46 ` [PATCH v2 5/5] docs/mm: document latest changes to vm_lock Suren Baghdasaryan
2024-11-12 19:51   ` Suren Baghdasaryan
2024-11-13 14:46     ` Lorenzo Stoakes
