* [PATCH 0/4] move per-vma lock into vm_area_struct
@ 2024-11-11 20:55 Suren Baghdasaryan
2024-11-11 20:55 ` [PATCH 1/4] mm: introduce vma_start_read_locked{_nested} helpers Suren Baghdasaryan
` (7 more replies)
0 siblings, 8 replies; 49+ messages in thread
From: Suren Baghdasaryan @ 2024-11-11 20:55 UTC (permalink / raw)
To: akpm
Cc: willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka, hannes,
mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, minchan, jannh,
shakeel.butt, souravpanda, pasha.tatashin, linux-mm,
linux-kernel, kernel-team, surenb
Back when per-vma locks were introduced, vm_lock was moved out of
vm_area_struct in [1] because of the performance regression caused by
false cacheline sharing. Recent investigation [2] revealed that the
regression is limited to a rather old Broadwell microarchitecture and
even there it can be mitigated by disabling adjacent cacheline
prefetching, see [3].
This patchset moves vm_lock back into vm_area_struct, aligning it at the
cacheline boundary and making the vm_area_struct slab cache
cacheline-aligned as well.
This causes VMA memory consumption to grow from 160 (vm_area_struct) + 40
(vm_lock) bytes to 256 bytes:
slabinfo before:
<name> ... <objsize> <objperslab> <pagesperslab> : ...
vma_lock ... 40 102 1 : ...
vm_area_struct ... 160 51 2 : ...
slabinfo after moving vm_lock:
<name> ... <objsize> <objperslab> <pagesperslab> : ...
vm_area_struct ... 256 32 2 : ...
Aggregate VMA memory consumption per 1000 VMAs grows from 50 to 64 pages,
which is 5.5MB per 100000 VMAs.
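For reference, the arithmetic behind these numbers, assuming 4KB pages and
the slab geometry shown above:
  before: 1000 VMAs / 51 per slab   -> 20 slabs * 2 pages = 40 pages
          1000 locks / 102 per slab -> 10 slabs * 1 page  = 10 pages (50 total)
  after:  1000 VMAs / 32 per slab   -> 32 slabs * 2 pages = 64 pages
  delta:  14 pages per 1000 VMAs    -> ~1400 pages ~= 5.5MB per 100000 VMAs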
To minimize memory overhead, vm_lock implementation is changed from
using rw_semaphore (40 bytes) to an atomic (8 bytes) and several
vm_area_struct members are moved into the last cacheline, resulting
in a less fragmented structure:
struct vm_area_struct {
union {
struct {
long unsigned int vm_start; /* 0 8 */
long unsigned int vm_end; /* 8 8 */
}; /* 0 16 */
struct callback_head vm_rcu ; /* 0 16 */
} __attribute__((__aligned__(8))); /* 0 16 */
struct mm_struct * vm_mm; /* 16 8 */
pgprot_t vm_page_prot; /* 24 8 */
union {
const vm_flags_t vm_flags; /* 32 8 */
vm_flags_t __vm_flags; /* 32 8 */
}; /* 32 8 */
bool detached; /* 40 1 */
/* XXX 3 bytes hole, try to pack */
unsigned int vm_lock_seq; /* 44 4 */
struct list_head anon_vma_chain; /* 48 16 */
/* --- cacheline 1 boundary (64 bytes) --- */
struct anon_vma * anon_vma; /* 64 8 */
const struct vm_operations_struct * vm_ops; /* 72 8 */
long unsigned int vm_pgoff; /* 80 8 */
struct file * vm_file; /* 88 8 */
void * vm_private_data; /* 96 8 */
atomic_long_t swap_readahead_info; /* 104 8 */
struct mempolicy * vm_policy; /* 112 8 */
/* XXX 8 bytes hole, try to pack */
/* --- cacheline 2 boundary (128 bytes) --- */
struct vma_lock vm_lock (__aligned__(64)); /* 128 4 */
/* XXX 4 bytes hole, try to pack */
struct {
struct rb_node rb (__aligned__(8)); /* 136 24 */
long unsigned int rb_subtree_last; /* 160 8 */
} __attribute__((__aligned__(8))) shared; /* 136 32 */
struct vm_userfaultfd_ctx vm_userfaultfd_ctx; /* 168 0 */
/* size: 192, cachelines: 3, members: 17 */
/* sum members: 153, holes: 3, sum holes: 15 */
/* padding: 24 */
/* forced alignments: 3, forced holes: 2, sum forced holes: 12 */
} __attribute__((__aligned__(64)));
Memory consumption per 1000 VMAs becomes 48 pages, saving 2 pages compared
to the 50 pages in the baseline:
slabinfo after vm_area_struct changes:
<name> ... <objsize> <objperslab> <pagesperslab> : ...
vm_area_struct ... 192 42 2 : ...
Performance measurements using the pft test on x86 do not show a
considerable difference; on a Pixel 6 running Android the change results
in a 3-5% improvement in faults per second.
[1] https://lore.kernel.org/all/20230227173632.3292573-34-surenb@google.com/
[2] https://lore.kernel.org/all/ZsQyI%2F087V34JoIt@xsang-OptiPlex-9020/
[3] https://lore.kernel.org/all/CAJuCfpEisU8Lfe96AYJDZ+OM4NoPmnw9bP53cT_kbfP_pR+-2g@mail.gmail.com/
Suren Baghdasaryan (4):
mm: introduce vma_start_read_locked{_nested} helpers
mm: move per-vma lock into vm_area_struct
mm: replace rw_semaphore with atomic_t in vma_lock
mm: move lesser used vm_area_struct members into the last cacheline
include/linux/mm.h | 163 +++++++++++++++++++++++++++++++++++---
include/linux/mm_types.h | 59 +++++++++-----
include/linux/mmap_lock.h | 3 +
kernel/fork.c | 50 ++----------
mm/init-mm.c | 2 +
mm/userfaultfd.c | 14 ++--
6 files changed, 205 insertions(+), 86 deletions(-)
base-commit: 931086f2a88086319afb57cd3925607e8cda0a9f
--
2.47.0.277.g8800431eea-goog
* [PATCH 1/4] mm: introduce vma_start_read_locked{_nested} helpers
2024-11-11 20:55 [PATCH 0/4] move per-vma lock into vm_area_struct Suren Baghdasaryan
@ 2024-11-11 20:55 ` Suren Baghdasaryan
2024-11-11 20:55 ` [PATCH 2/4] mm: move per-vma lock into vm_area_struct Suren Baghdasaryan
` (6 subsequent siblings)
7 siblings, 0 replies; 49+ messages in thread
From: Suren Baghdasaryan @ 2024-11-11 20:55 UTC (permalink / raw)
To: akpm
Cc: willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka, hannes,
mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, minchan, jannh,
shakeel.butt, souravpanda, pasha.tatashin, linux-mm,
linux-kernel, kernel-team, surenb
Introduce helper functions which can be used to read-lock a VMA when
holding mmap_lock for read. Replace direct accesses to vma->vm_lock
with these new helpers.
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
include/linux/mm.h | 20 ++++++++++++++++++++
mm/userfaultfd.c | 14 ++++++--------
2 files changed, 26 insertions(+), 8 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index fecd47239fa9..01ce619f3d17 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -722,6 +722,26 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
return true;
}
+/*
+ * Use only while holding mmap_read_lock which guarantees that nobody can lock
+ * the vma for write (vma_start_write()) from under us.
+ */
+static inline void vma_start_read_locked_nested(struct vm_area_struct *vma, int subclass)
+{
+ mmap_assert_locked(vma->vm_mm);
+ down_read_nested(&vma->vm_lock->lock, subclass);
+}
+
+/*
+ * Use only while holding mmap_read_lock which guarantees that nobody can lock
+ * the vma for write (vma_start_write()) from under us.
+ */
+static inline void vma_start_read_locked(struct vm_area_struct *vma)
+{
+ mmap_assert_locked(vma->vm_mm);
+ down_read(&vma->vm_lock->lock);
+}
+
static inline void vma_end_read(struct vm_area_struct *vma)
{
rcu_read_lock(); /* keeps vma alive till the end of up_read */
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 60a0be33766f..55019c11b5a8 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -86,13 +86,11 @@ static struct vm_area_struct *uffd_lock_vma(struct mm_struct *mm,
vma = find_vma_and_prepare_anon(mm, address);
if (!IS_ERR(vma)) {
/*
+ * While holding mmap_lock we can't fail
* We cannot use vma_start_read() as it may fail due to
- * false locked (see comment in vma_start_read()). We
- * can avoid that by directly locking vm_lock under
- * mmap_lock, which guarantees that nobody can lock the
- * vma for write (vma_start_write()) under us.
+ * false locked (see comment in vma_start_read()).
*/
- down_read(&vma->vm_lock->lock);
+ vma_start_read_locked(vma);
}
mmap_read_unlock(mm);
@@ -1480,10 +1478,10 @@ static int uffd_move_lock(struct mm_struct *mm,
* See comment in uffd_lock_vma() as to why not using
* vma_start_read() here.
*/
- down_read(&(*dst_vmap)->vm_lock->lock);
+ vma_start_read_locked(*dst_vmap);
if (*dst_vmap != *src_vmap)
- down_read_nested(&(*src_vmap)->vm_lock->lock,
- SINGLE_DEPTH_NESTING);
+ vma_start_read_locked_nested(*src_vmap,
+ SINGLE_DEPTH_NESTING);
}
mmap_read_unlock(mm);
return err;
--
2.47.0.277.g8800431eea-goog
* [PATCH 2/4] mm: move per-vma lock into vm_area_struct
2024-11-11 20:55 [PATCH 0/4] move per-vma lock into vm_area_struct Suren Baghdasaryan
2024-11-11 20:55 ` [PATCH 1/4] mm: introduce vma_start_read_locked{_nested} helpers Suren Baghdasaryan
@ 2024-11-11 20:55 ` Suren Baghdasaryan
2024-11-12 9:36 ` kernel test robot
` (3 more replies)
2024-11-11 20:55 ` [PATCH 3/4] mm: replace rw_semaphore with atomic_t in vma_lock Suren Baghdasaryan
` (5 subsequent siblings)
7 siblings, 4 replies; 49+ messages in thread
From: Suren Baghdasaryan @ 2024-11-11 20:55 UTC (permalink / raw)
To: akpm
Cc: willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka, hannes,
mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, minchan, jannh,
shakeel.butt, souravpanda, pasha.tatashin, linux-mm,
linux-kernel, kernel-team, surenb
Back when per-vma locks were introduced, vm_lock was moved out of
vm_area_struct in [1] because of the performance regression caused by
false cacheline sharing. Recent investigation [2] revealed that the
regression is limited to a rather old Broadwell microarchitecture and
even there it can be mitigated by disabling adjacent cacheline
prefetching, see [3].
This patch moves vm_lock back into vm_area_struct, aligning it at the
cacheline boundary and making the vm_area_struct slab cache
cacheline-aligned as well.
This causes VMA memory consumption to grow from 160 (vm_area_struct) + 40
(vm_lock) bytes to 256 bytes:
slabinfo before:
<name> ... <objsize> <objperslab> <pagesperslab> : ...
vma_lock ... 40 102 1 : ...
vm_area_struct ... 160 51 2 : ...
slabinfo after moving vm_lock:
<name> ... <objsize> <objperslab> <pagesperslab> : ...
vm_area_struct ... 256 32 2 : ...
Aggregate VMA memory consumption per 1000 VMAs grows from 50 to 64 pages,
which is 5.5MB per 100000 VMAs. This memory consumption growth will be
addressed in the patches that follow.
[1] https://lore.kernel.org/all/20230227173632.3292573-34-surenb@google.com/T/#m861679f3fe0e22c945d6334b88dc996fef5ea6cc
[2] https://lore.kernel.org/all/ZsQyI%2F087V34JoIt@xsang-OptiPlex-9020/
[3] https://lore.kernel.org/all/CAJuCfpEisU8Lfe96AYJDZ+OM4NoPmnw9bP53cT_kbfP_pR+-2g@mail.gmail.com/
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
include/linux/mm.h | 27 ++++++++++++----------
include/linux/mm_types.h | 6 +++--
kernel/fork.c | 50 +++++-----------------------------------
3 files changed, 25 insertions(+), 58 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 01ce619f3d17..c1c2899464db 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -684,6 +684,11 @@ static inline void vma_numab_state_free(struct vm_area_struct *vma) {}
#endif /* CONFIG_NUMA_BALANCING */
#ifdef CONFIG_PER_VMA_LOCK
+static inline void vma_lock_init(struct vma_lock *vm_lock)
+{
+ init_rwsem(&vm_lock->lock);
+}
+
/*
* Try to read-lock a vma. The function is allowed to occasionally yield false
* locked result to avoid performance overhead, in which case we fall back to
@@ -701,7 +706,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq.sequence))
return false;
- if (unlikely(down_read_trylock(&vma->vm_lock->lock) == 0))
+ if (unlikely(down_read_trylock(&vma->vm_lock.lock) == 0))
return false;
/*
@@ -716,7 +721,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
* This pairs with RELEASE semantics in vma_end_write_all().
*/
if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&vma->vm_mm->mm_lock_seq))) {
- up_read(&vma->vm_lock->lock);
+ up_read(&vma->vm_lock.lock);
return false;
}
return true;
@@ -729,7 +734,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
static inline void vma_start_read_locked_nested(struct vm_area_struct *vma, int subclass)
{
mmap_assert_locked(vma->vm_mm);
- down_read_nested(&vma->vm_lock->lock, subclass);
+ down_read_nested(&vma->vm_lock.lock, subclass);
}
/*
@@ -739,13 +744,13 @@ static inline void vma_start_read_locked_nested(struct vm_area_struct *vma, int
static inline void vma_start_read_locked(struct vm_area_struct *vma)
{
mmap_assert_locked(vma->vm_mm);
- down_read(&vma->vm_lock->lock);
+ down_read(&vma->vm_lock.lock);
}
static inline void vma_end_read(struct vm_area_struct *vma)
{
rcu_read_lock(); /* keeps vma alive till the end of up_read */
- up_read(&vma->vm_lock->lock);
+ up_read(&vma->vm_lock.lock);
rcu_read_unlock();
}
@@ -774,7 +779,7 @@ static inline void vma_start_write(struct vm_area_struct *vma)
if (__is_vma_write_locked(vma, &mm_lock_seq))
return;
- down_write(&vma->vm_lock->lock);
+ down_write(&vma->vm_lock.lock);
/*
* We should use WRITE_ONCE() here because we can have concurrent reads
* from the early lockless pessimistic check in vma_start_read().
@@ -782,7 +787,7 @@ static inline void vma_start_write(struct vm_area_struct *vma)
* we should use WRITE_ONCE() for cleanliness and to keep KCSAN happy.
*/
WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
- up_write(&vma->vm_lock->lock);
+ up_write(&vma->vm_lock.lock);
}
static inline void vma_assert_write_locked(struct vm_area_struct *vma)
@@ -794,7 +799,7 @@ static inline void vma_assert_write_locked(struct vm_area_struct *vma)
static inline void vma_assert_locked(struct vm_area_struct *vma)
{
- if (!rwsem_is_locked(&vma->vm_lock->lock))
+ if (!rwsem_is_locked(&vma->vm_lock.lock))
vma_assert_write_locked(vma);
}
@@ -861,10 +866,6 @@ static inline void assert_fault_locked(struct vm_fault *vmf)
extern const struct vm_operations_struct vma_dummy_vm_ops;
-/*
- * WARNING: vma_init does not initialize vma->vm_lock.
- * Use vm_area_alloc()/vm_area_free() if vma needs locking.
- */
static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
{
memset(vma, 0, sizeof(*vma));
@@ -873,6 +874,8 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
INIT_LIST_HEAD(&vma->anon_vma_chain);
vma_mark_detached(vma, false);
vma_numab_state_init(vma);
+ vma_lock_init(&vma->vm_lock);
+ vma->vm_lock_seq = UINT_MAX;
}
/* Use when VMA is not part of the VMA tree and needs no locking */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 80fef38d9d64..5c4bfdcfac72 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -716,8 +716,6 @@ struct vm_area_struct {
* slowpath.
*/
unsigned int vm_lock_seq;
- /* Unstable RCU readers are allowed to read this. */
- struct vma_lock *vm_lock;
#endif
/*
@@ -770,6 +768,10 @@ struct vm_area_struct {
struct vma_numab_state *numab_state; /* NUMA Balancing state */
#endif
struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
+#ifdef CONFIG_PER_VMA_LOCK
+ /* Unstable RCU readers are allowed to read this. */
+ struct vma_lock vm_lock ____cacheline_aligned_in_smp;
+#endif
} __randomize_layout;
#ifdef CONFIG_NUMA
diff --git a/kernel/fork.c b/kernel/fork.c
index 0061cf2450ef..9e504105f24f 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -436,35 +436,6 @@ static struct kmem_cache *vm_area_cachep;
/* SLAB cache for mm_struct structures (tsk->mm) */
static struct kmem_cache *mm_cachep;
-#ifdef CONFIG_PER_VMA_LOCK
-
-/* SLAB cache for vm_area_struct.lock */
-static struct kmem_cache *vma_lock_cachep;
-
-static bool vma_lock_alloc(struct vm_area_struct *vma)
-{
- vma->vm_lock = kmem_cache_alloc(vma_lock_cachep, GFP_KERNEL);
- if (!vma->vm_lock)
- return false;
-
- init_rwsem(&vma->vm_lock->lock);
- vma->vm_lock_seq = UINT_MAX;
-
- return true;
-}
-
-static inline void vma_lock_free(struct vm_area_struct *vma)
-{
- kmem_cache_free(vma_lock_cachep, vma->vm_lock);
-}
-
-#else /* CONFIG_PER_VMA_LOCK */
-
-static inline bool vma_lock_alloc(struct vm_area_struct *vma) { return true; }
-static inline void vma_lock_free(struct vm_area_struct *vma) {}
-
-#endif /* CONFIG_PER_VMA_LOCK */
-
struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
{
struct vm_area_struct *vma;
@@ -474,10 +445,6 @@ struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
return NULL;
vma_init(vma, mm);
- if (!vma_lock_alloc(vma)) {
- kmem_cache_free(vm_area_cachep, vma);
- return NULL;
- }
return vma;
}
@@ -496,10 +463,8 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
* will be reinitialized.
*/
data_race(memcpy(new, orig, sizeof(*new)));
- if (!vma_lock_alloc(new)) {
- kmem_cache_free(vm_area_cachep, new);
- return NULL;
- }
+ vma_lock_init(&new->vm_lock);
+ new->vm_lock_seq = UINT_MAX;
INIT_LIST_HEAD(&new->anon_vma_chain);
vma_numab_state_init(new);
dup_anon_vma_name(orig, new);
@@ -511,7 +476,6 @@ void __vm_area_free(struct vm_area_struct *vma)
{
vma_numab_state_free(vma);
free_anon_vma_name(vma);
- vma_lock_free(vma);
kmem_cache_free(vm_area_cachep, vma);
}
@@ -522,7 +486,7 @@ static void vm_area_free_rcu_cb(struct rcu_head *head)
vm_rcu);
/* The vma should not be locked while being destroyed. */
- VM_BUG_ON_VMA(rwsem_is_locked(&vma->vm_lock->lock), vma);
+ VM_BUG_ON_VMA(rwsem_is_locked(&vma->vm_lock.lock), vma);
__vm_area_free(vma);
}
#endif
@@ -3168,11 +3132,9 @@ void __init proc_caches_init(void)
sizeof(struct fs_struct), 0,
SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT,
NULL);
-
- vm_area_cachep = KMEM_CACHE(vm_area_struct, SLAB_PANIC|SLAB_ACCOUNT);
-#ifdef CONFIG_PER_VMA_LOCK
- vma_lock_cachep = KMEM_CACHE(vma_lock, SLAB_PANIC|SLAB_ACCOUNT);
-#endif
+ vm_area_cachep = KMEM_CACHE(vm_area_struct,
+ SLAB_HWCACHE_ALIGN|SLAB_NO_MERGE|SLAB_PANIC|
+ SLAB_ACCOUNT);
mmap_init();
nsproxy_cache_init();
}
--
2.47.0.277.g8800431eea-goog
* [PATCH 3/4] mm: replace rw_semaphore with atomic_t in vma_lock
2024-11-11 20:55 [PATCH 0/4] move per-vma lock into vm_area_struct Suren Baghdasaryan
2024-11-11 20:55 ` [PATCH 1/4] mm: introduce vma_start_read_locked{_nested} helpers Suren Baghdasaryan
2024-11-11 20:55 ` [PATCH 2/4] mm: move per-vma lock into vm_area_struct Suren Baghdasaryan
@ 2024-11-11 20:55 ` Suren Baghdasaryan
2024-11-12 0:06 ` Andrew Morton
` (2 more replies)
2024-11-11 20:55 ` [PATCH 4/4] mm: move lesser used vm_area_struct members into the last cacheline Suren Baghdasaryan
` (4 subsequent siblings)
7 siblings, 3 replies; 49+ messages in thread
From: Suren Baghdasaryan @ 2024-11-11 20:55 UTC (permalink / raw)
To: akpm
Cc: willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka, hannes,
mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, minchan, jannh,
shakeel.butt, souravpanda, pasha.tatashin, linux-mm,
linux-kernel, kernel-team, surenb
rw_semaphore is a sizable structure of 40 bytes and consumes
considerable space for each vm_area_struct. However, vma_lock has
two important properties which allow replacing rw_semaphore
with a simpler structure:
1. Readers never wait. They try to take the vma_lock and fall back to
mmap_lock if that fails.
2. Only one writer at a time will ever try to write-lock a vma_lock
because writers first take mmap_lock in write mode.
Because of these properties, full rw_semaphore functionality is not
needed and we can replace rw_semaphore with an atomic variable.
When a reader takes the read lock, it increments the atomic, unless the
top two bits are set, indicating a writer is present.
When a writer takes the write lock, it sets the VMA_LOCK_WR_LOCKED bit if
there are no readers, or the VMA_LOCK_WR_WAIT bit if readers are holding
the lock, in which case it puts itself onto the newly introduced
mm.vma_writer_wait queue. Since all writers take mmap_lock in write mode
first, there can be only one writer at a time. The last reader to release
the lock will signal the writer to wake up.
atomic_t might overflow if there are many competing readers, therefore
vma_start_read() implements an overflow check and fails to take the lock
if an overflow would occur. vma_start_read_locked{_nested} may cause an
overflow, but that case is handled later by __vma_end_read().
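(With the count encoding introduced below, the reader count occupies bits
0-29 of the atomic, so reaching the overflow check requires on the order of
2^30 simultaneous read lockers.)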
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
include/linux/mm.h | 142 ++++++++++++++++++++++++++++++++++----
include/linux/mm_types.h | 18 ++++-
include/linux/mmap_lock.h | 3 +
kernel/fork.c | 2 +-
mm/init-mm.c | 2 +
5 files changed, 151 insertions(+), 16 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index c1c2899464db..27c0e9ba81c4 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -686,7 +686,41 @@ static inline void vma_numab_state_free(struct vm_area_struct *vma) {}
#ifdef CONFIG_PER_VMA_LOCK
static inline void vma_lock_init(struct vma_lock *vm_lock)
{
- init_rwsem(&vm_lock->lock);
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+ static struct lock_class_key lockdep_key;
+
+ lockdep_init_map(&vm_lock->dep_map, "vm_lock", &lockdep_key, 0);
+#endif
+ atomic_set(&vm_lock->count, VMA_LOCK_UNLOCKED);
+}
+
+static inline unsigned int vma_lock_reader_count(unsigned int counter)
+{
+ return counter & VMA_LOCK_RD_COUNT_MASK;
+}
+
+static inline void __vma_end_read(struct mm_struct *mm, struct vm_area_struct *vma)
+{
+ unsigned int count, prev, new;
+
+ count = (unsigned int)atomic_read(&vma->vm_lock.count);
+ for (;;) {
+ if (unlikely(vma_lock_reader_count(count) == 0)) {
+ /*
+ * Overflow was possible in vma_start_read_locked().
+ * When detected, wrap around preserving writer bits.
+ */
+ new = count | ~(VMA_LOCK_WR_LOCKED | VMA_LOCK_WR_WAIT);
+ } else
+ new = count - 1;
+ prev = atomic_cmpxchg(&vma->vm_lock.count, count, new);
+ if (prev == count)
+ break;
+ count = prev;
+ }
+ rwsem_release(&vma->vm_lock.dep_map, _RET_IP_);
+ if (vma_lock_reader_count(new) == 0 && (new & VMA_LOCK_WR_WAIT))
+ wake_up(&mm->vma_writer_wait);
}
/*
@@ -696,6 +730,9 @@ static inline void vma_lock_init(struct vma_lock *vm_lock)
*/
static inline bool vma_start_read(struct vm_area_struct *vma)
{
+ struct mm_struct *mm = vma->vm_mm;
+ unsigned int count, prev, new;
+
/*
* Check before locking. A race might cause false locked result.
* We can use READ_ONCE() for the mm_lock_seq here, and don't need
@@ -703,11 +740,35 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
* we don't rely on for anything - the mm_lock_seq read against which we
* need ordering is below.
*/
- if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq.sequence))
+ if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(mm->mm_lock_seq.sequence))
return false;
- if (unlikely(down_read_trylock(&vma->vm_lock.lock) == 0))
- return false;
+ rwsem_acquire_read(&vma->vm_lock.dep_map, 0, 0, _RET_IP_);
+ count = (unsigned int)atomic_read(&vma->vm_lock.count);
+ for (;;) {
+ /* Is the VMA write-locked or is a writer waiting? */
+ if (count & (VMA_LOCK_WR_LOCKED | VMA_LOCK_WR_WAIT)) {
+ rwsem_release(&vma->vm_lock.dep_map, _RET_IP_);
+ return false;
+ }
+
+ new = count + 1;
+ /* If atomic_t overflows, fail to lock. */
+ if (new & (VMA_LOCK_WR_LOCKED | VMA_LOCK_WR_WAIT)) {
+ rwsem_release(&vma->vm_lock.dep_map, _RET_IP_);
+ return false;
+ }
+
+ /*
+ * Atomic RMW will provide implicit mb on success to pair with smp_wmb in
+ * vma_start_write, on failure we retry.
+ */
+ prev = atomic_cmpxchg(&vma->vm_lock.count, count, new);
+ if (prev == count)
+ break;
+ count = prev;
+ }
+ lock_acquired(&vma->vm_lock.dep_map, _RET_IP_);
/*
* Overflow might produce false locked result.
@@ -720,8 +781,8 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
* after it has been unlocked.
* This pairs with RELEASE semantics in vma_end_write_all().
*/
- if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&vma->vm_mm->mm_lock_seq))) {
- up_read(&vma->vm_lock.lock);
+ if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&mm->mm_lock_seq))) {
+ __vma_end_read(mm, vma);
return false;
}
return true;
@@ -733,8 +794,30 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
*/
static inline void vma_start_read_locked_nested(struct vm_area_struct *vma, int subclass)
{
- mmap_assert_locked(vma->vm_mm);
- down_read_nested(&vma->vm_lock.lock, subclass);
+ struct mm_struct *mm = vma->vm_mm;
+ unsigned int count, prev, new;
+
+ mmap_assert_locked(mm);
+
+ rwsem_acquire_read(&vma->vm_lock.dep_map, subclass, 0, _RET_IP_);
+ count = (unsigned int)atomic_read(&vma->vm_lock.count);
+ for (;;) {
+ /* We are holding mmap_lock, no active or waiting writers are possible. */
+ VM_BUG_ON_VMA(count & (VMA_LOCK_WR_LOCKED | VMA_LOCK_WR_WAIT), vma);
+ new = count + 1;
+ /* Unlikely, but if the atomic_t overflows, wrap around to 0. */
+ if (WARN_ON(new & (VMA_LOCK_WR_LOCKED | VMA_LOCK_WR_WAIT)))
+ new = 0;
+ /*
+ * Atomic RMW will provide implicit mb on success to pair with smp_wmb in
+ * vma_start_write, on failure we retry.
+ */
+ prev = atomic_cmpxchg(&vma->vm_lock.count, count, new);
+ if (prev == count)
+ break;
+ count = prev;
+ }
+ lock_acquired(&vma->vm_lock.dep_map, _RET_IP_);
}
/*
@@ -743,14 +826,15 @@ static inline void vma_start_read_locked_nested(struct vm_area_struct *vma, int
*/
static inline void vma_start_read_locked(struct vm_area_struct *vma)
{
- mmap_assert_locked(vma->vm_mm);
- down_read(&vma->vm_lock.lock);
+ vma_start_read_locked_nested(vma, 0);
}
static inline void vma_end_read(struct vm_area_struct *vma)
{
+ struct mm_struct *mm = vma->vm_mm;
+
rcu_read_lock(); /* keeps vma alive till the end of up_read */
- up_read(&vma->vm_lock.lock);
+ __vma_end_read(mm, vma);
rcu_read_unlock();
}
@@ -774,12 +858,34 @@ static bool __is_vma_write_locked(struct vm_area_struct *vma, unsigned int *mm_l
*/
static inline void vma_start_write(struct vm_area_struct *vma)
{
+ unsigned int count, prev, new;
unsigned int mm_lock_seq;
+ might_sleep();
if (__is_vma_write_locked(vma, &mm_lock_seq))
return;
- down_write(&vma->vm_lock.lock);
+ rwsem_acquire(&vma->vm_lock.dep_map, 0, 0, _RET_IP_);
+ count = (unsigned int)atomic_read(&vma->vm_lock.count);
+ for (;;) {
+ if (vma_lock_reader_count(count) > 0)
+ new = count | VMA_LOCK_WR_WAIT;
+ else
+ new = count | VMA_LOCK_WR_LOCKED;
+ prev = atomic_cmpxchg(&vma->vm_lock.count, count, new);
+ if (prev == count)
+ break;
+ count = prev;
+ }
+ if (new & VMA_LOCK_WR_WAIT) {
+ lock_contended(&vma->vm_lock.dep_map, _RET_IP_);
+ wait_event(vma->vm_mm->vma_writer_wait,
+ atomic_cmpxchg(&vma->vm_lock.count, VMA_LOCK_WR_WAIT,
+ VMA_LOCK_WR_LOCKED) == VMA_LOCK_WR_WAIT);
+
+ }
+ lock_acquired(&vma->vm_lock.dep_map, _RET_IP_);
+
/*
* We should use WRITE_ONCE() here because we can have concurrent reads
* from the early lockless pessimistic check in vma_start_read().
@@ -787,7 +893,10 @@ static inline void vma_start_write(struct vm_area_struct *vma)
* we should use WRITE_ONCE() for cleanliness and to keep KCSAN happy.
*/
WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
- up_write(&vma->vm_lock.lock);
+ /* Write barrier to ensure vm_lock_seq change is visible before count */
+ smp_wmb();
+ rwsem_release(&vma->vm_lock.dep_map, _RET_IP_);
+ atomic_set(&vma->vm_lock.count, VMA_LOCK_UNLOCKED);
}
static inline void vma_assert_write_locked(struct vm_area_struct *vma)
@@ -797,9 +906,14 @@ static inline void vma_assert_write_locked(struct vm_area_struct *vma)
VM_BUG_ON_VMA(!__is_vma_write_locked(vma, &mm_lock_seq), vma);
}
+static inline bool is_vma_read_locked(struct vm_area_struct *vma)
+{
+ return vma_lock_reader_count((unsigned int)atomic_read(&vma->vm_lock.count)) > 0;
+}
+
static inline void vma_assert_locked(struct vm_area_struct *vma)
{
- if (!rwsem_is_locked(&vma->vm_lock.lock))
+ if (!is_vma_read_locked(vma))
vma_assert_write_locked(vma);
}
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 5c4bfdcfac72..789bccc05520 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -615,8 +615,23 @@ static inline struct anon_vma_name *anon_vma_name_alloc(const char *name)
}
#endif
+#define VMA_LOCK_UNLOCKED 0
+#define VMA_LOCK_WR_LOCKED (1 << 31)
+#define VMA_LOCK_WR_WAIT (1 << 30)
+
+#define VMA_LOCK_RD_COUNT_MASK (VMA_LOCK_WR_WAIT - 1)
+
struct vma_lock {
- struct rw_semaphore lock;
+ /*
+ * count & VMA_LOCK_RD_COUNT_MASK > 0 ==> read-locked with 'count' number of readers
+ * count & VMA_LOCK_WR_LOCKED != 0 ==> write-locked
+ * count & VMA_LOCK_WR_WAIT != 0 ==> writer is waiting
+ * count = 0 ==> unlocked
+ */
+ atomic_t count;
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+ struct lockdep_map dep_map;
+#endif
};
struct vma_numab_state {
@@ -883,6 +898,7 @@ struct mm_struct {
* by mmlist_lock
*/
#ifdef CONFIG_PER_VMA_LOCK
+ struct wait_queue_head vma_writer_wait;
/*
* This field has lock-like semantics, meaning it is sometimes
* accessed with ACQUIRE/RELEASE semantics.
diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
index 58dde2e35f7e..769ab97fff3e 100644
--- a/include/linux/mmap_lock.h
+++ b/include/linux/mmap_lock.h
@@ -121,6 +121,9 @@ static inline void mmap_init_lock(struct mm_struct *mm)
{
init_rwsem(&mm->mmap_lock);
mm_lock_seqcount_init(mm);
+#ifdef CONFIG_PER_VMA_LOCK
+ init_waitqueue_head(&mm->vma_writer_wait);
+#endif
}
static inline void mmap_write_lock(struct mm_struct *mm)
diff --git a/kernel/fork.c b/kernel/fork.c
index 9e504105f24f..726050c557e2 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -486,7 +486,7 @@ static void vm_area_free_rcu_cb(struct rcu_head *head)
vm_rcu);
/* The vma should not be locked while being destroyed. */
- VM_BUG_ON_VMA(rwsem_is_locked(&vma->vm_lock.lock), vma);
+ VM_BUG_ON_VMA(is_vma_read_locked(vma), vma);
__vm_area_free(vma);
}
#endif
diff --git a/mm/init-mm.c b/mm/init-mm.c
index 6af3ad675930..db058873ba18 100644
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -40,6 +40,8 @@ struct mm_struct init_mm = {
.arg_lock = __SPIN_LOCK_UNLOCKED(init_mm.arg_lock),
.mmlist = LIST_HEAD_INIT(init_mm.mmlist),
#ifdef CONFIG_PER_VMA_LOCK
+ .vma_writer_wait =
+ __WAIT_QUEUE_HEAD_INITIALIZER(init_mm.vma_writer_wait),
.mm_lock_seq = SEQCNT_ZERO(init_mm.mm_lock_seq),
#endif
.user_ns = &init_user_ns,
--
2.47.0.277.g8800431eea-goog
* [PATCH 4/4] mm: move lesser used vm_area_struct members into the last cacheline
2024-11-11 20:55 [PATCH 0/4] move per-vma lock into vm_area_struct Suren Baghdasaryan
` (2 preceding siblings ...)
2024-11-11 20:55 ` [PATCH 3/4] mm: replace rw_semaphore with atomic_t in vma_lock Suren Baghdasaryan
@ 2024-11-11 20:55 ` Suren Baghdasaryan
2024-11-12 10:07 ` Vlastimil Babka
2024-11-11 21:41 ` [PATCH 0/4] move per-vma lock into vm_area_struct Suren Baghdasaryan
` (3 subsequent siblings)
7 siblings, 1 reply; 49+ messages in thread
From: Suren Baghdasaryan @ 2024-11-11 20:55 UTC (permalink / raw)
To: akpm
Cc: willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka, hannes,
mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, minchan, jannh,
shakeel.butt, souravpanda, pasha.tatashin, linux-mm,
linux-kernel, kernel-team, surenb
Move several vm_area_struct members which are rarely or never used
during page fault handling into the last cacheline to better pack
vm_area_struct. As a result, vm_area_struct will fit into 3 cachelines
as opposed to 4 cachelines before this change. New vm_area_struct layout:
struct vm_area_struct {
union {
struct {
long unsigned int vm_start; /* 0 8 */
long unsigned int vm_end; /* 8 8 */
}; /* 0 16 */
struct callback_head vm_rcu ; /* 0 16 */
} __attribute__((__aligned__(8))); /* 0 16 */
struct mm_struct * vm_mm; /* 16 8 */
pgprot_t vm_page_prot; /* 24 8 */
union {
const vm_flags_t vm_flags; /* 32 8 */
vm_flags_t __vm_flags; /* 32 8 */
}; /* 32 8 */
bool detached; /* 40 1 */
/* XXX 3 bytes hole, try to pack */
unsigned int vm_lock_seq; /* 44 4 */
struct list_head anon_vma_chain; /* 48 16 */
/* --- cacheline 1 boundary (64 bytes) --- */
struct anon_vma * anon_vma; /* 64 8 */
const struct vm_operations_struct * vm_ops; /* 72 8 */
long unsigned int vm_pgoff; /* 80 8 */
struct file * vm_file; /* 88 8 */
void * vm_private_data; /* 96 8 */
atomic_long_t swap_readahead_info; /* 104 8 */
struct mempolicy * vm_policy; /* 112 8 */
/* XXX 8 bytes hole, try to pack */
/* --- cacheline 2 boundary (128 bytes) --- */
struct vma_lock vm_lock (__aligned__(64)); /* 128 4 */
/* XXX 4 bytes hole, try to pack */
struct {
struct rb_node rb (__aligned__(8)); /* 136 24 */
long unsigned int rb_subtree_last; /* 160 8 */
} __attribute__((__aligned__(8))) shared; /* 136 32 */
struct vm_userfaultfd_ctx vm_userfaultfd_ctx; /* 168 0 */
/* size: 192, cachelines: 3, members: 17 */
/* sum members: 153, holes: 3, sum holes: 15 */
/* padding: 24 */
/* forced alignments: 3, forced holes: 2, sum forced holes: 12 */
} __attribute__((__aligned__(64)));
Memory consumption per 1000 VMAs becomes 48 pages:
slabinfo after vm_area_struct changes:
<name> ... <objsize> <objperslab> <pagesperslab> : ...
vm_area_struct ... 192 42 2 : ...
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
include/linux/mm_types.h | 37 ++++++++++++++++++-------------------
1 file changed, 18 insertions(+), 19 deletions(-)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 789bccc05520..c3755b680911 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -733,16 +733,6 @@ struct vm_area_struct {
unsigned int vm_lock_seq;
#endif
- /*
- * For areas with an address space and backing store,
- * linkage into the address_space->i_mmap interval tree.
- *
- */
- struct {
- struct rb_node rb;
- unsigned long rb_subtree_last;
- } shared;
-
/*
* A file's MAP_PRIVATE vma can be in both i_mmap tree and anon_vma
* list, after a COW of one of the file pages. A MAP_SHARED vma
@@ -762,14 +752,6 @@ struct vm_area_struct {
struct file * vm_file; /* File we map to (can be NULL). */
void * vm_private_data; /* was vm_pte (shared mem) */
-#ifdef CONFIG_ANON_VMA_NAME
- /*
- * For private and shared anonymous mappings, a pointer to a null
- * terminated string containing the name given to the vma, or NULL if
- * unnamed. Serialized by mmap_lock. Use anon_vma_name to access.
- */
- struct anon_vma_name *anon_name;
-#endif
#ifdef CONFIG_SWAP
atomic_long_t swap_readahead_info;
#endif
@@ -782,11 +764,28 @@ struct vm_area_struct {
#ifdef CONFIG_NUMA_BALANCING
struct vma_numab_state *numab_state; /* NUMA Balancing state */
#endif
- struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
#ifdef CONFIG_PER_VMA_LOCK
/* Unstable RCU readers are allowed to read this. */
struct vma_lock vm_lock ____cacheline_aligned_in_smp;
#endif
+ /*
+ * For areas with an address space and backing store,
+ * linkage into the address_space->i_mmap interval tree.
+ *
+ */
+ struct {
+ struct rb_node rb;
+ unsigned long rb_subtree_last;
+ } shared;
+#ifdef CONFIG_ANON_VMA_NAME
+ /*
+ * For private and shared anonymous mappings, a pointer to a null
+ * terminated string containing the name given to the vma, or NULL if
+ * unnamed. Serialized by mmap_lock. Use anon_vma_name to access.
+ */
+ struct anon_vma_name *anon_name;
+#endif
+ struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
} __randomize_layout;
#ifdef CONFIG_NUMA
--
2.47.0.277.g8800431eea-goog
* Re: [PATCH 0/4] move per-vma lock into vm_area_struct
2024-11-11 20:55 [PATCH 0/4] move per-vma lock into vm_area_struct Suren Baghdasaryan
` (3 preceding siblings ...)
2024-11-11 20:55 ` [PATCH 4/4] mm: move lesser used vm_area_struct members into the last cacheline Suren Baghdasaryan
@ 2024-11-11 21:41 ` Suren Baghdasaryan
2024-11-12 2:48 ` Liam R. Howlett
2024-11-11 22:18 ` Davidlohr Bueso
` (2 subsequent siblings)
7 siblings, 1 reply; 49+ messages in thread
From: Suren Baghdasaryan @ 2024-11-11 21:41 UTC (permalink / raw)
To: akpm
Cc: willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka, hannes,
mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, minchan, jannh,
shakeel.butt, souravpanda, pasha.tatashin, linux-mm,
linux-kernel, kernel-team
On Mon, Nov 11, 2024 at 12:55 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> Back when per-vma locks were introduced, vm_lock was moved out of
> vm_area_struct in [1] because of the performance regression caused by
> false cacheline sharing. Recent investigation [2] revealed that the
> regression is limited to a rather old Broadwell microarchitecture and
> even there it can be mitigated by disabling adjacent cacheline
> prefetching, see [3].
> This patchset moves vm_lock back into vm_area_struct, aligning it at the
> cacheline boundary and making the vm_area_struct slab cache
> cacheline-aligned as well.
> This causes VMA memory consumption to grow from 160 (vm_area_struct) + 40
> (vm_lock) bytes to 256 bytes:
>
> slabinfo before:
> <name> ... <objsize> <objperslab> <pagesperslab> : ...
> vma_lock ... 40 102 1 : ...
> vm_area_struct ... 160 51 2 : ...
>
> slabinfo after moving vm_lock:
> <name> ... <objsize> <objperslab> <pagesperslab> : ...
> vm_area_struct ... 256 32 2 : ...
>
> Aggregate VMA memory consumption per 1000 VMAs grows from 50 to 64 pages,
> which is 5.5MB per 100000 VMAs.
> To minimize memory overhead, vm_lock implementation is changed from
> using rw_semaphore (40 bytes) to an atomic (8 bytes) and several
> vm_area_struct members are moved into the last cacheline, resulting
> in a less fragmented structure:
>
> struct vm_area_struct {
> union {
> struct {
> long unsigned int vm_start; /* 0 8 */
> long unsigned int vm_end; /* 8 8 */
> }; /* 0 16 */
> struct callback_head vm_rcu ; /* 0 16 */
> } __attribute__((__aligned__(8))); /* 0 16 */
> struct mm_struct * vm_mm; /* 16 8 */
> pgprot_t vm_page_prot; /* 24 8 */
> union {
> const vm_flags_t vm_flags; /* 32 8 */
> vm_flags_t __vm_flags; /* 32 8 */
> }; /* 32 8 */
> bool detached; /* 40 1 */
>
> /* XXX 3 bytes hole, try to pack */
>
> unsigned int vm_lock_seq; /* 44 4 */
> struct list_head anon_vma_chain; /* 48 16 */
> /* --- cacheline 1 boundary (64 bytes) --- */
> struct anon_vma * anon_vma; /* 64 8 */
> const struct vm_operations_struct * vm_ops; /* 72 8 */
> long unsigned int vm_pgoff; /* 80 8 */
> struct file * vm_file; /* 88 8 */
> void * vm_private_data; /* 96 8 */
> atomic_long_t swap_readahead_info; /* 104 8 */
> struct mempolicy * vm_policy; /* 112 8 */
>
> /* XXX 8 bytes hole, try to pack */
>
> /* --- cacheline 2 boundary (128 bytes) --- */
> struct vma_lock vm_lock (__aligned__(64)); /* 128 4 */
>
> /* XXX 4 bytes hole, try to pack */
>
> struct {
> struct rb_node rb (__aligned__(8)); /* 136 24 */
> long unsigned int rb_subtree_last; /* 160 8 */
> } __attribute__((__aligned__(8))) shared; /* 136 32 */
> struct vm_userfaultfd_ctx vm_userfaultfd_ctx; /* 168 0 */
>
> /* size: 192, cachelines: 3, members: 17 */
> /* sum members: 153, holes: 3, sum holes: 15 */
> /* padding: 24 */
> /* forced alignments: 3, forced holes: 2, sum forced holes: 12 */
> } __attribute__((__aligned__(64)));
>
> Memory consumption per 1000 VMAs becomes 48 pages, saving 2 pages compared
> to the 50 pages in the baseline:
>
> slabinfo after vm_area_struct changes:
> <name> ... <objsize> <objperslab> <pagesperslab> : ...
> vm_area_struct ... 192 42 2 : ...
>
> Performance measurements using the pft test on x86 do not show a
> considerable difference; on a Pixel 6 running Android the change results
> in a 3-5% improvement in faults per second.
>
> [1] https://lore.kernel.org/all/20230227173632.3292573-34-surenb@google.com/
> [2] https://lore.kernel.org/all/ZsQyI%2F087V34JoIt@xsang-OptiPlex-9020/
> [3] https://lore.kernel.org/all/CAJuCfpEisU8Lfe96AYJDZ+OM4NoPmnw9bP53cT_kbfP_pR+-2g@mail.gmail.com/
And of course I forgot to update Lorenzo's new locking documentation :/
Will add that in the next version.
>
> Suren Baghdasaryan (4):
> mm: introduce vma_start_read_locked{_nested} helpers
> mm: move per-vma lock into vm_area_struct
> mm: replace rw_semaphore with atomic_t in vma_lock
> mm: move lesser used vm_area_struct members into the last cacheline
>
> include/linux/mm.h | 163 +++++++++++++++++++++++++++++++++++---
> include/linux/mm_types.h | 59 +++++++++-----
> include/linux/mmap_lock.h | 3 +
> kernel/fork.c | 50 ++----------
> mm/init-mm.c | 2 +
> mm/userfaultfd.c | 14 ++--
> 6 files changed, 205 insertions(+), 86 deletions(-)
>
>
> base-commit: 931086f2a88086319afb57cd3925607e8cda0a9f
> --
> 2.47.0.277.g8800431eea-goog
>
* Re: [PATCH 0/4] move per-vma lock into vm_area_struct
2024-11-11 20:55 [PATCH 0/4] move per-vma lock into vm_area_struct Suren Baghdasaryan
` (4 preceding siblings ...)
2024-11-11 21:41 ` [PATCH 0/4] move per-vma lock into vm_area_struct Suren Baghdasaryan
@ 2024-11-11 22:18 ` Davidlohr Bueso
2024-11-11 23:19 ` Suren Baghdasaryan
2024-11-12 2:35 ` Liam R. Howlett
2024-11-12 9:39 ` Lorenzo Stoakes
7 siblings, 1 reply; 49+ messages in thread
From: Davidlohr Bueso @ 2024-11-11 22:18 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: akpm, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg,
paulmck, brauner, dhowells, hdanton, hughd, minchan, jannh,
shakeel.butt, souravpanda, pasha.tatashin, linux-mm,
linux-kernel, kernel-team
On Mon, 11 Nov 2024, Suren Baghdasaryan wrote:
>To minimize memory overhead, vm_lock implementation is changed from
>using rw_semaphore (40 bytes) to an atomic (8 bytes) and several
>vm_area_struct members are moved into the last cacheline, resulting
>in a less fragmented structure:
I am not a fan of building a custom lock, replacing a standard one.
How much do we really care about this? rwsems are quite optimized
and are known to heavily affect mm performance altogether.
...
>Performance measurements using the pft test on x86 do not show a
>considerable difference; on a Pixel 6 running Android the change results
>in a 3-5% improvement in faults per second.
pft is a very micro benchmark, these results do not justify this change, imo.
Thanks,
Davidlohr
* Re: [PATCH 0/4] move per-vma lock into vm_area_struct
2024-11-11 22:18 ` Davidlohr Bueso
@ 2024-11-11 23:19 ` Suren Baghdasaryan
2024-11-12 0:03 ` Davidlohr Bueso
2024-11-12 0:11 ` Andrew Morton
0 siblings, 2 replies; 49+ messages in thread
From: Suren Baghdasaryan @ 2024-11-11 23:19 UTC (permalink / raw)
To: Suren Baghdasaryan, akpm, willy, liam.howlett, lorenzo.stoakes,
mhocko, vbabka, hannes, mjguzik, oliver.sang, mgorman, david,
peterx, oleg, paulmck, brauner, dhowells, hdanton, hughd,
minchan, jannh, shakeel.butt, souravpanda, pasha.tatashin,
linux-mm, linux-kernel, kernel-team
On Mon, Nov 11, 2024 at 2:18 PM Davidlohr Bueso <dave@stgolabs.net> wrote:
>
> On Mon, 11 Nov 2024, Suren Baghdasaryan wrote:
>
> >To minimize memory overhead, vm_lock implementation is changed from
> >using rw_semaphore (40 bytes) to an atomic (8 bytes) and several
> >vm_area_struct members are moved into the last cacheline, resulting
> >in a less fragmented structure:
>
> I am not a fan of building a custom lock, replacing a standard one.
Understandable.
> How much do we really care about this?
In the Android world I got complaints after introducing per-vma locks.
More specifically, when moving from the 5.15 to the 6.1 kernel, where we
first started using these locks, memory usage increased by 10MB on
average. Not a huge deal, but if we can trim it without too much
complexity, that would definitely be appreciated.
> rwsems are quite optimized and are known to heavily affect mm performance altogether.
I know, that's why I spent an additional week profiling the new
implementation. I asked Oliver (CC'ed) to rerun the tests he used in
https://lore.kernel.org/all/ZsQyI%2F087V34JoIt@xsang-OptiPlex-9020/ to
confirm no regressions.
>
> ...
>
> >Performance measurements using the pft test on x86 do not show a
> >considerable difference; on a Pixel 6 running Android the change results
> >in a 3-5% improvement in faults per second.
>
> pft is a very micro benchmark, these results do not justify this change, imo.
I'm not really trying to claim performance gains here. I just want to
make sure there are no regressions.
Thanks,
Suren.
>
> Thanks,
> Davidlohr
* Re: [PATCH 0/4] move per-vma lock into vm_area_struct
2024-11-11 23:19 ` Suren Baghdasaryan
@ 2024-11-12 0:03 ` Davidlohr Bueso
2024-11-12 0:43 ` Suren Baghdasaryan
2024-11-12 0:11 ` Andrew Morton
1 sibling, 1 reply; 49+ messages in thread
From: Davidlohr Bueso @ 2024-11-12 0:03 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: akpm, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg,
paulmck, brauner, dhowells, hdanton, hughd, minchan, jannh,
shakeel.butt, souravpanda, pasha.tatashin, linux-mm,
linux-kernel, kernel-team
On Mon, 11 Nov 2024, Suren Baghdasaryan wrote:
>I'm not really trying to claim performance gains here. I just want to
>make sure there are no regressions.
You might also fine tune the atomics with acquire/release standard locking
semantics, you will probably see better numbers in Android than what you
currently have in patch 3 with full barriers - and not particularly risky
as callers expect that behaviour already.
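A rough sketch of what that could look like against the vma_lock fields
from patch 3 (illustrative only, not a drop-in replacement):

	/* read-lock fast path: acquire ordering is sufficient on success */
	prev = atomic_cmpxchg_acquire(&vma->vm_lock.count, count, new);

	/* write-unlock: the release store orders the vm_lock_seq update
	 * before the count reset, so the explicit smp_wmb() goes away */
	WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
	atomic_set_release(&vma->vm_lock.count, VMA_LOCK_UNLOCKED);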
Thanks,
Davidlohr
* Re: [PATCH 3/4] mm: replace rw_semaphore with atomic_t in vma_lock
2024-11-11 20:55 ` [PATCH 3/4] mm: replace rw_semaphore with atomic_t in vma_lock Suren Baghdasaryan
@ 2024-11-12 0:06 ` Andrew Morton
2024-11-12 0:52 ` Suren Baghdasaryan
2024-11-12 0:35 ` Davidlohr Bueso
2024-11-12 4:58 ` Matthew Wilcox
2 siblings, 1 reply; 49+ messages in thread
From: Andrew Morton @ 2024-11-12 0:06 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka, hannes,
mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, minchan, jannh,
shakeel.butt, souravpanda, pasha.tatashin, linux-mm,
linux-kernel, kernel-team
On Mon, 11 Nov 2024 12:55:05 -0800 Suren Baghdasaryan <surenb@google.com> wrote:
> include/linux/mm.h | 142 ++++++++++++++++++++++++++++++++++----
There's soooo much inlining happening here. Perhaps we should have a
"default to uninlined, unless a benefit is demonstrated" guideline?
* Re: [PATCH 0/4] move per-vma lock into vm_area_struct
2024-11-11 23:19 ` Suren Baghdasaryan
2024-11-12 0:03 ` Davidlohr Bueso
@ 2024-11-12 0:11 ` Andrew Morton
2024-11-12 0:48 ` Suren Baghdasaryan
1 sibling, 1 reply; 49+ messages in thread
From: Andrew Morton @ 2024-11-12 0:11 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka, hannes,
mjguzik, oliver.sang, mgorman, david, peterx, oleg, paulmck,
brauner, dhowells, hdanton, hughd, minchan, jannh, shakeel.butt,
souravpanda, pasha.tatashin, linux-mm, linux-kernel, kernel-team
On Mon, 11 Nov 2024 15:19:22 -0800 Suren Baghdasaryan <surenb@google.com> wrote:
> On Mon, Nov 11, 2024 at 2:18 PM Davidlohr Bueso <dave@stgolabs.net> wrote:
> >
> > On Mon, 11 Nov 2024, Suren Baghdasaryan wrote:
> >
> > >To minimize memory overhead, vm_lock implementation is changed from
> > >using rw_semaphore (40 bytes) to an atomic (8 bytes) and several
> > >vm_area_struct members are moved into the last cacheline, resulting
> > >in a less fragmented structure:
> >
> > I am not a fan of building a custom lock, replacing a standard one.
>
> Understandable.
If we're going to invent a new lock type, I'm thinking we should do
that - make it a standalone thing, add full lockdep support, etc.
I wonder if we could remove the lock from the vma altogether and use an
old-fashioned hashed lock. An array of locks indexed by the vma
address. It might work well enough, although sizing the array would be
difficult.
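For illustration only, such a table could look roughly like this (names
and sizing made up, just to show the shape of the idea):

	#define VMA_LOCK_HASH_BITS	10
	static struct rw_semaphore vma_lock_table[1 << VMA_LOCK_HASH_BITS];

	static struct rw_semaphore *vma_hashed_lock(struct vm_area_struct *vma)
	{
		/* hash the vma address into a fixed-size array of locks */
		return &vma_lock_table[hash_ptr(vma, VMA_LOCK_HASH_BITS)];
	}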
* Re: [PATCH 3/4] mm: replace rw_semaphore with atomic_t in vma_lock
2024-11-11 20:55 ` [PATCH 3/4] mm: replace rw_semaphore with atomic_t in vma_lock Suren Baghdasaryan
2024-11-12 0:06 ` Andrew Morton
@ 2024-11-12 0:35 ` Davidlohr Bueso
2024-11-12 0:59 ` Suren Baghdasaryan
2024-11-12 4:58 ` Matthew Wilcox
2 siblings, 1 reply; 49+ messages in thread
From: Davidlohr Bueso @ 2024-11-12 0:35 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: akpm, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg,
paulmck, brauner, dhowells, hdanton, hughd, minchan, jannh,
shakeel.butt, souravpanda, pasha.tatashin, linux-mm,
linux-kernel, kernel-team
On Mon, 11 Nov 2024, Suren Baghdasaryan wrote:
>@@ -787,7 +893,10 @@ static inline void vma_start_write(struct vm_area_struct *vma)
> * we should use WRITE_ONCE() for cleanliness and to keep KCSAN happy.
> */
> WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
>- up_write(&vma->vm_lock.lock);
>+ /* Write barrier to ensure vm_lock_seq change is visible before count */
>+ smp_wmb();
>+ rwsem_release(&vma->vm_lock.dep_map, _RET_IP_);
>+ atomic_set(&vma->vm_lock.count, VMA_LOCK_UNLOCKED);
Too many barriers here. Just do atomic_set_release and remove that
smp_wmb(). And what you are doing is really ensuring nothing leaks out
of the critical region, so that comment should also be more generic.
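Roughly (sketch of the suggested change):

	WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
	rwsem_release(&vma->vm_lock.dep_map, _RET_IP_);
	/* the release store makes everything above visible before the
	 * count is reset, no separate smp_wmb() needed */
	atomic_set_release(&vma->vm_lock.count, VMA_LOCK_UNLOCKED);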
I would expect regression testing to catch this sort of thing.
...
> #ifdef CONFIG_PER_VMA_LOCK
>+ struct wait_queue_head vma_writer_wait;
You might want to use rcuwait here instead, which is much more
optimized for the single waiter requirement vmas have.
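A minimal sketch of the rcuwait variant, assuming the count encoding from
this patch stays as is:

	/* in struct mm_struct, replacing the wait_queue_head
	 * (initialized with rcuwait_init() in mmap_init_lock()): */
	struct rcuwait vma_writer_wait;

	/* writer side, instead of wait_event(): */
	rcuwait_wait_event(&vma->vm_mm->vma_writer_wait,
			   atomic_cmpxchg(&vma->vm_lock.count, VMA_LOCK_WR_WAIT,
					  VMA_LOCK_WR_LOCKED) == VMA_LOCK_WR_WAIT,
			   TASK_UNINTERRUPTIBLE);

	/* last reader to drop the lock, instead of wake_up(): */
	rcuwait_wake_up(&mm->vma_writer_wait);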
Thanks,
Davidlohr
* Re: [PATCH 0/4] move per-vma lock into vm_area_struct
2024-11-12 0:03 ` Davidlohr Bueso
@ 2024-11-12 0:43 ` Suren Baghdasaryan
0 siblings, 0 replies; 49+ messages in thread
From: Suren Baghdasaryan @ 2024-11-12 0:43 UTC (permalink / raw)
To: Suren Baghdasaryan, akpm, willy, liam.howlett, lorenzo.stoakes,
mhocko, vbabka, hannes, mjguzik, oliver.sang, mgorman, david,
peterx, oleg, paulmck, brauner, dhowells, hdanton, hughd,
minchan, jannh, shakeel.butt, souravpanda, pasha.tatashin,
linux-mm, linux-kernel, kernel-team
On Mon, Nov 11, 2024 at 4:03 PM Davidlohr Bueso <dave@stgolabs.net> wrote:
>
> On Mon, 11 Nov 2024, Suren Baghdasaryan wrote:
>
> >I'm not really trying to claim performance gains here. I just want to
> >make sure there are no regressions.
>
> You might also fine tune the atomics with acquire/release standard locking
> semantics, you will probably see better numbers in Android than what you
> currently have in patch 3 with full barriers - and not particularly risky
> as callers expect that behaviour already.
Ack. Will try that. Thanks!
>
> Thanks,
> Davidlohr
* Re: [PATCH 0/4] move per-vma lock into vm_area_struct
2024-11-12 0:11 ` Andrew Morton
@ 2024-11-12 0:48 ` Suren Baghdasaryan
0 siblings, 0 replies; 49+ messages in thread
From: Suren Baghdasaryan @ 2024-11-12 0:48 UTC (permalink / raw)
To: Andrew Morton
Cc: willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka, hannes,
mjguzik, oliver.sang, mgorman, david, peterx, oleg, paulmck,
brauner, dhowells, hdanton, hughd, minchan, jannh, shakeel.butt,
souravpanda, pasha.tatashin, linux-mm, linux-kernel, kernel-team
On Mon, Nov 11, 2024 at 4:11 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Mon, 11 Nov 2024 15:19:22 -0800 Suren Baghdasaryan <surenb@google.com> wrote:
>
> > On Mon, Nov 11, 2024 at 2:18 PM Davidlohr Bueso <dave@stgolabs.net> wrote:
> > >
> > > On Mon, 11 Nov 2024, Suren Baghdasaryan wrote:
> > >
> > > >To minimize memory overhead, vm_lock implementation is changed from
> > > >using rw_semaphore (40 bytes) to an atomic (8 bytes) and several
> > > >vm_area_struct members are moved into the last cacheline, resulting
> > > >in a less fragmented structure:
> > >
> > > I am not a fan of building a custom lock, replacing a standard one.
> >
> > Understandable.
>
> If we're going to invent a new lock type, I'm thinking we should do
> that - make it a standalone thing, add full lockdep support, etc.
Yeah, that will make it easy to experiment and replace it with a
different lock type if needed.
>
> I wonder if we could remove the lock from the vma altogether and use an
> old-fashioned hashed lock. An array of locks indexed by the vma
> address. It might work well enough, although sizing the array would be
> difficult.
Ok, sounds like I'll need to experiment a bit with different lock
implementations.
I'll post a new version without the last two patches, keeping
rw_semaphore for now.
Thanks!
>
* Re: [PATCH 3/4] mm: replace rw_semaphore with atomic_t in vma_lock
2024-11-12 0:06 ` Andrew Morton
@ 2024-11-12 0:52 ` Suren Baghdasaryan
0 siblings, 0 replies; 49+ messages in thread
From: Suren Baghdasaryan @ 2024-11-12 0:52 UTC (permalink / raw)
To: Andrew Morton
Cc: willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka, hannes,
mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, minchan, jannh,
shakeel.butt, souravpanda, pasha.tatashin, linux-mm,
linux-kernel, kernel-team
On Mon, Nov 11, 2024 at 4:07 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Mon, 11 Nov 2024 12:55:05 -0800 Suren Baghdasaryan <surenb@google.com> wrote:
>
> > include/linux/mm.h | 142 ++++++++++++++++++++++++++++++++++----
>
> There's soooo much inlining happening here. Perhaps we should have a
> "default to uninlined, unless a benefit is demonstrated" guideline?
Implementing this lock as a separate type as you suggested should help
with the amount of inlining here. I'll try to keep it sane once I
finalize that.
* Re: [PATCH 3/4] mm: replace rw_semaphore with atomic_t in vma_lock
2024-11-12 0:35 ` Davidlohr Bueso
@ 2024-11-12 0:59 ` Suren Baghdasaryan
0 siblings, 0 replies; 49+ messages in thread
From: Suren Baghdasaryan @ 2024-11-12 0:59 UTC (permalink / raw)
To: Suren Baghdasaryan, akpm, willy, liam.howlett, lorenzo.stoakes,
mhocko, vbabka, hannes, mjguzik, oliver.sang, mgorman, david,
peterx, oleg, paulmck, brauner, dhowells, hdanton, hughd,
minchan, jannh, shakeel.butt, souravpanda, pasha.tatashin,
linux-mm, linux-kernel, kernel-team
On Mon, Nov 11, 2024 at 4:35 PM Davidlohr Bueso <dave@stgolabs.net> wrote:
>
> On Mon, 11 Nov 2024, Suren Baghdasaryan wrote:
>
> >@@ -787,7 +893,10 @@ static inline void vma_start_write(struct vm_area_struct *vma)
> > * we should use WRITE_ONCE() for cleanliness and to keep KCSAN happy.
> > */
> > WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
> >- up_write(&vma->vm_lock.lock);
> >+ /* Write barrier to ensure vm_lock_seq change is visible before count */
> >+ smp_wmb();
> >+ rwsem_release(&vma->vm_lock.dep_map, _RET_IP_);
> >+ atomic_set(&vma->vm_lock.count, VMA_LOCK_UNLOCKED);
>
> Too many barriers here. Just do atomic_set_release and remove that
> smp_wmb(). And what you are doing is really ensuring nothing leaks out
> of the critical region, so that comment should also be more generic.
Uh, yes. I missed that.
>
> I would expect regression testing to catch this sort of thing.
Well, it's in vma_start_write(), which is in the write-locking path.
Maybe that's why it's not very visible.
>
> ...
>
> > #ifdef CONFIG_PER_VMA_LOCK
> >+ struct wait_queue_head vma_writer_wait;
>
> You might want to use rcuwait here instead, which is much more
> optimized for the single waiter requirement vmas have.
Thanks for the suggestion! I'll give it a try.
>
> Thanks,
> Davidlohr
* Re: [PATCH 0/4] move per-vma lock into vm_area_struct
2024-11-11 20:55 [PATCH 0/4] move per-vma lock into vm_area_struct Suren Baghdasaryan
` (5 preceding siblings ...)
2024-11-11 22:18 ` Davidlohr Bueso
@ 2024-11-12 2:35 ` Liam R. Howlett
2024-11-12 3:23 ` Suren Baghdasaryan
2024-11-12 9:39 ` Lorenzo Stoakes
7 siblings, 1 reply; 49+ messages in thread
From: Liam R. Howlett @ 2024-11-12 2:35 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: akpm, willy, lorenzo.stoakes, mhocko, vbabka, hannes, mjguzik,
oliver.sang, mgorman, david, peterx, oleg, dave, paulmck,
brauner, dhowells, hdanton, hughd, minchan, jannh, shakeel.butt,
souravpanda, pasha.tatashin, linux-mm, linux-kernel, kernel-team
* Suren Baghdasaryan <surenb@google.com> [241111 15:55]:
> Back when per-vma locks were introduced, vm_lock was moved out of
> vm_area_struct in [1] because of the performance regression caused by
> false cacheline sharing. Recent investigation [2] revealed that the
> regression is limited to a rather old Broadwell microarchitecture and
> even there it can be mitigated by disabling adjacent cacheline
> prefetching, see [3].
> This patchset moves vm_lock back into vm_area_struct, aligning it at the
> cacheline boundary and making the vm_area_struct slab cache
> cacheline-aligned as well.
> This causes VMA memory consumption to grow from 160 (vm_area_struct) + 40
> (vm_lock) bytes to 256 bytes:
>
> slabinfo before:
> <name> ... <objsize> <objperslab> <pagesperslab> : ...
> vma_lock ... 40 102 1 : ...
> vm_area_struct ... 160 51 2 : ...
>
> slabinfo after moving vm_lock:
> <name> ... <objsize> <objperslab> <pagesperslab> : ...
> vm_area_struct ... 256 32 2 : ...
>
> Aggregate VMA memory consumption per 1000 VMAs grows from 50 to 64 pages,
> which is 5.5MB per 100000 VMAs.
> To minimize memory overhead, vm_lock implementation is changed from
> using rw_semaphore (40 bytes) to an atomic (8 bytes) and several
> vm_area_struct members are moved into the last cacheline, resulting
> in a less fragmented structure:
>
> struct vm_area_struct {
> union {
> struct {
> long unsigned int vm_start; /* 0 8 */
> long unsigned int vm_end; /* 8 8 */
> }; /* 0 16 */
> struct callback_head vm_rcu ; /* 0 16 */
> } __attribute__((__aligned__(8))); /* 0 16 */
> struct mm_struct * vm_mm; /* 16 8 */
> pgprot_t vm_page_prot; /* 24 8 */
> union {
> const vm_flags_t vm_flags; /* 32 8 */
> vm_flags_t __vm_flags; /* 32 8 */
> }; /* 32 8 */
> bool detached; /* 40 1 */
>
> /* XXX 3 bytes hole, try to pack */
>
> unsigned int vm_lock_seq; /* 44 4 */
> struct list_head anon_vma_chain; /* 48 16 */ 40 + 16
> /* --- cacheline 1 boundary (64 bytes) --- */
> struct anon_vma * anon_vma; /* 64 8 */ 56 + 8
> const struct vm_operations_struct * vm_ops; /* 72 8 */
> long unsigned int vm_pgoff; /* 80 8 */
> struct file * vm_file; /* 88 8 */
> void * vm_private_data; /* 96 8 */
> atomic_long_t swap_readahead_info; /* 104 8 */
> struct mempolicy * vm_policy; /* 112 8 */
>
> /* XXX 8 bytes hole, try to pack */
>
> /* --- cacheline 2 boundary (128 bytes) --- */
> struct vma_lock vm_lock (__aligned__(64)); /* 128 4 */
>
> /* XXX 4 bytes hole, try to pack */
>
> struct {
> struct rb_node rb (__aligned__(8)); /* 136 24 */
> long unsigned int rb_subtree_last; /* 160 8 */
> } __attribute__((__aligned__(8))) shared; /* 136 32 */
> struct vm_userfaultfd_ctx vm_userfaultfd_ctx; /* 168 0 */
This is 8 bytes on my compile, I guess you have userfaultfd disabled?
Considering this will end up being 256 anyways these changes may not
matter, but we can pack this better.
1. move vm_lock to after anon_vma (ends up at 64B in the end)
2. move vm_lock_seq to after vm_lock
3. move struct to the new 112 offset (which is 8B aligned at 112)
4. move detached to the end of the structure
This should limit the holes and maintain the alignments.
The downside is that detached is now in the last used cache line and away
from what would probably be used with it, but maybe it doesn't matter
considering prefetch.
But maybe you have other reasons for the placement?
>
> /* size: 192, cachelines: 3, members: 17 */
> /* sum members: 153, holes: 3, sum holes: 15 */
> /* padding: 24 */
> /* forced alignments: 3, forced holes: 2, sum forced holes: 12 */
> } __attribute__((__aligned__(64)));
>
> Memory consumption per 1000 VMAs becomes 48 pages, saving 2 pages compared
> to the 50 pages in the baseline:
>
> slabinfo after vm_area_struct changes:
> <name> ... <objsize> <objperslab> <pagesperslab> : ...
> vm_area_struct ... 192 42 2 : ...
>
> Performance measurements using pft test on x86 do not show considerable
> difference, on Pixel 6 running Android it results in 3-5% improvement in
> faults per second.
>
> [1] https://lore.kernel.org/all/20230227173632.3292573-34-surenb@google.com/
> [2] https://lore.kernel.org/all/ZsQyI%2F087V34JoIt@xsang-OptiPlex-9020/
> [3] https://lore.kernel.org/all/CAJuCfpEisU8Lfe96AYJDZ+OM4NoPmnw9bP53cT_kbfP_pR+-2g@mail.gmail.com/
>
> Suren Baghdasaryan (4):
> mm: introduce vma_start_read_locked{_nested} helpers
> mm: move per-vma lock into vm_area_struct
> mm: replace rw_semaphore with atomic_t in vma_lock
> mm: move lesser used vma_area_struct members into the last cacheline
>
> include/linux/mm.h | 163 +++++++++++++++++++++++++++++++++++---
> include/linux/mm_types.h | 59 +++++++++-----
> include/linux/mmap_lock.h | 3 +
> kernel/fork.c | 50 ++----------
> mm/init-mm.c | 2 +
> mm/userfaultfd.c | 14 ++--
> 6 files changed, 205 insertions(+), 86 deletions(-)
>
>
> base-commit: 931086f2a88086319afb57cd3925607e8cda0a9f
> --
> 2.47.0.277.g8800431eea-goog
>
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH 0/4] move per-vma lock into vm_area_struct
2024-11-11 21:41 ` [PATCH 0/4] move per-vma lock into vm_area_struct Suren Baghdasaryan
@ 2024-11-12 2:48 ` Liam R. Howlett
2024-11-12 3:27 ` Suren Baghdasaryan
0 siblings, 1 reply; 49+ messages in thread
From: Liam R. Howlett @ 2024-11-12 2:48 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: akpm, willy, lorenzo.stoakes, mhocko, vbabka, hannes, mjguzik,
oliver.sang, mgorman, david, peterx, oleg, dave, paulmck,
brauner, dhowells, hdanton, hughd, minchan, jannh, shakeel.butt,
souravpanda, pasha.tatashin, linux-mm, linux-kernel, kernel-team
* Suren Baghdasaryan <surenb@google.com> [241111 16:41]:
> On Mon, Nov 11, 2024 at 12:55 PM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > Back when per-vma locks were introduced, vm_lock was moved out of
> > vm_area_struct in [1] because of the performance regression caused by
> > false cacheline sharing. Recent investigation [2] revealed that the
> > regression is limited to a rather old Broadwell microarchitecture and
> > even there it can be mitigated by disabling adjacent cacheline
> > prefetching, see [3].
> > This patchset moves vm_lock back into vm_area_struct, aligning it at the
> > cacheline boundary and changing the cache to be cache-aligned as well.
> > This causes VMA memory consumption to grow from 160 (vm_area_struct) + 40
> > (vm_lock) bytes to 256 bytes:
> >
> > slabinfo before:
> > <name> ... <objsize> <objperslab> <pagesperslab> : ...
> > vma_lock ... 40 102 1 : ...
> > vm_area_struct ... 160 51 2 : ...
> >
> > slabinfo after moving vm_lock:
> > <name> ... <objsize> <objperslab> <pagesperslab> : ...
> > vm_area_struct ... 256 32 2 : ...
> >
> > Aggregate VMA memory consumption per 1000 VMAs grows from 50 to 64 pages,
> > which is 5.5MB per 100000 VMAs.
> > To minimize memory overhead, vm_lock implementation is changed from
> > using rw_semaphore (40 bytes) to an atomic (8 bytes) and several
> > vm_area_struct members are moved into the last cacheline, resulting
> > in a less fragmented structure:
Wait a second, this is taking 40B down to 8B, but the alignment of the
vma will surely absorb that 32B difference? The struct sum is 153B
according to what you have below so we won't go over 192B. What am I
missing?
> >
> > struct vm_area_struct {
> > union {
> > struct {
> > long unsigned int vm_start; /* 0 8 */
> > long unsigned int vm_end; /* 8 8 */
> > }; /* 0 16 */
> > struct callback_head vm_rcu ; /* 0 16 */
> > } __attribute__((__aligned__(8))); /* 0 16 */
> > struct mm_struct * vm_mm; /* 16 8 */
> > pgprot_t vm_page_prot; /* 24 8 */
> > union {
> > const vm_flags_t vm_flags; /* 32 8 */
> > vm_flags_t __vm_flags; /* 32 8 */
> > }; /* 32 8 */
> > bool detached; /* 40 1 */
> >
> > /* XXX 3 bytes hole, try to pack */
> >
> > unsigned int vm_lock_seq; /* 44 4 */
> > struct list_head anon_vma_chain; /* 48 16 */
> > /* --- cacheline 1 boundary (64 bytes) --- */
> > struct anon_vma * anon_vma; /* 64 8 */
> > const struct vm_operations_struct * vm_ops; /* 72 8 */
> > long unsigned int vm_pgoff; /* 80 8 */
> > struct file * vm_file; /* 88 8 */
> > void * vm_private_data; /* 96 8 */
> > atomic_long_t swap_readahead_info; /* 104 8 */
> > struct mempolicy * vm_policy; /* 112 8 */
> >
> > /* XXX 8 bytes hole, try to pack */
> >
> > /* --- cacheline 2 boundary (128 bytes) --- */
> > struct vma_lock vm_lock (__aligned__(64)); /* 128 4 */
> >
> > /* XXX 4 bytes hole, try to pack */
> >
> > struct {
> > struct rb_node rb (__aligned__(8)); /* 136 24 */
> > long unsigned int rb_subtree_last; /* 160 8 */
> > } __attribute__((__aligned__(8))) shared; /* 136 32 */
> > struct vm_userfaultfd_ctx vm_userfaultfd_ctx; /* 168 0 */
> >
> > /* size: 192, cachelines: 3, members: 17 */
> > /* sum members: 153, holes: 3, sum holes: 15 */
> > /* padding: 24 */
> > /* forced alignments: 3, forced holes: 2, sum forced holes: 12 */
> > } __attribute__((__aligned__(64)));
> >
> > Memory consumption per 1000 VMAs becomes 48 pages, saving 2 pages compared
> > to the 50 pages in the baseline:
> >
> > slabinfo after vm_area_struct changes:
> > <name> ... <objsize> <objperslab> <pagesperslab> : ...
> > vm_area_struct ... 192 42 2 : ...
> >
> > Performance measurements using pft test on x86 do not show considerable
> > difference, on Pixel 6 running Android it results in 3-5% improvement in
> > faults per second.
> >
> > [1] https://lore.kernel.org/all/20230227173632.3292573-34-surenb@google.com/
> > [2] https://lore.kernel.org/all/ZsQyI%2F087V34JoIt@xsang-OptiPlex-9020/
> > [3] https://lore.kernel.org/all/CAJuCfpEisU8Lfe96AYJDZ+OM4NoPmnw9bP53cT_kbfP_pR+-2g@mail.gmail.com/
>
> And of course I forgot to update Lorenzo's new locking documentation :/
> Will add that in the next version.
>
> >
> > Suren Baghdasaryan (4):
> > mm: introduce vma_start_read_locked{_nested} helpers
> > mm: move per-vma lock into vm_area_struct
> > mm: replace rw_semaphore with atomic_t in vma_lock
> > mm: move lesser used vma_area_struct members into the last cacheline
> >
> > include/linux/mm.h | 163 +++++++++++++++++++++++++++++++++++---
> > include/linux/mm_types.h | 59 +++++++++-----
> > include/linux/mmap_lock.h | 3 +
> > kernel/fork.c | 50 ++----------
> > mm/init-mm.c | 2 +
> > mm/userfaultfd.c | 14 ++--
> > 6 files changed, 205 insertions(+), 86 deletions(-)
> >
> >
> > base-commit: 931086f2a88086319afb57cd3925607e8cda0a9f
> > --
> > 2.47.0.277.g8800431eea-goog
> >
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH 0/4] move per-vma lock into vm_area_struct
2024-11-12 2:35 ` Liam R. Howlett
@ 2024-11-12 3:23 ` Suren Baghdasaryan
0 siblings, 0 replies; 49+ messages in thread
From: Suren Baghdasaryan @ 2024-11-12 3:23 UTC (permalink / raw)
To: Liam R. Howlett, Suren Baghdasaryan, akpm, willy,
lorenzo.stoakes, mhocko, vbabka, hannes, mjguzik, oliver.sang,
mgorman, david, peterx, oleg, dave, paulmck, brauner, dhowells,
hdanton, hughd, minchan, jannh, shakeel.butt, souravpanda,
pasha.tatashin, linux-mm, linux-kernel, kernel-team
On Mon, Nov 11, 2024 at 6:35 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> * Suren Baghdasaryan <surenb@google.com> [241111 15:55]:
> > Back when per-vma locks were introduced, vm_lock was moved out of
> > vm_area_struct in [1] because of the performance regression caused by
> > false cacheline sharing. Recent investigation [2] revealed that the
> > regression is limited to a rather old Broadwell microarchitecture and
> > even there it can be mitigated by disabling adjacent cacheline
> > prefetching, see [3].
> > This patchset moves vm_lock back into vm_area_struct, aligning it at the
> > cacheline boundary and changing the cache to be cache-aligned as well.
> > This causes VMA memory consumption to grow from 160 (vm_area_struct) + 40
> > (vm_lock) bytes to 256 bytes:
> >
> > slabinfo before:
> > <name> ... <objsize> <objperslab> <pagesperslab> : ...
> > vma_lock ... 40 102 1 : ...
> > vm_area_struct ... 160 51 2 : ...
> >
> > slabinfo after moving vm_lock:
> > <name> ... <objsize> <objperslab> <pagesperslab> : ...
> > vm_area_struct ... 256 32 2 : ...
> >
> > Aggregate VMA memory consumption per 1000 VMAs grows from 50 to 64 pages,
> > which is 5.5MB per 100000 VMAs.
> > To minimize memory overhead, vm_lock implementation is changed from
> > using rw_semaphore (40 bytes) to an atomic (8 bytes) and several
> > vm_area_struct members are moved into the last cacheline, resulting
> > in a less fragmented structure:
> >
> > struct vm_area_struct {
> > union {
> > struct {
> > long unsigned int vm_start; /* 0 8 */
> > long unsigned int vm_end; /* 8 8 */
> > }; /* 0 16 */
> > struct callback_head vm_rcu ; /* 0 16 */
> > } __attribute__((__aligned__(8))); /* 0 16 */
> > struct mm_struct * vm_mm; /* 16 8 */
> > pgprot_t vm_page_prot; /* 24 8 */
> > union {
> > const vm_flags_t vm_flags; /* 32 8 */
> > vm_flags_t __vm_flags; /* 32 8 */
> > }; /* 32 8 */
> > bool detached; /* 40 1 */
> >
> > /* XXX 3 bytes hole, try to pack */
> >
> > unsigned int vm_lock_seq; /* 44 4 */
> > struct list_head anon_vma_chain; /* 48 16 */ 40 + 16
> > /* --- cacheline 1 boundary (64 bytes) --- */
> > struct anon_vma * anon_vma; /* 64 8 */ 56 + 8
> > const struct vm_operations_struct * vm_ops; /* 72 8 */
> > long unsigned int vm_pgoff; /* 80 8 */
> > struct file * vm_file; /* 88 8 */
> > void * vm_private_data; /* 96 8 */
> > atomic_long_t swap_readahead_info; /* 104 8 */
> > struct mempolicy * vm_policy; /* 112 8 */
> >
> > /* XXX 8 bytes hole, try to pack */
> >
> > /* --- cacheline 2 boundary (128 bytes) --- */
> > struct vma_lock vm_lock (__aligned__(64)); /* 128 4 */
> >
> > /* XXX 4 bytes hole, try to pack */
> >
> > struct {
> > struct rb_node rb (__aligned__(8)); /* 136 24 */
> > long unsigned int rb_subtree_last; /* 160 8 */
> > } __attribute__((__aligned__(8))) shared; /* 136 32 */
> > struct vm_userfaultfd_ctx vm_userfaultfd_ctx; /* 168 0 */
>
> This is 8 bytes on my compile, I guess you have userfaultfd disabled?
Yeah, I'm showing the defconfig results here. After I move things
around at the end we will have space to keep everything under 3
cachelines.
>
> Considering this will end up being 256 anyways these changes may not
> matter, but we can pack this better.
> 1. move vm_lock to after anon_vma (ends up at 64B in the end)
> 2. move vm_lock_seq to after vm_lock
Nope, for performance reasons it's important to keep vm_lock_seq in
the first cacheline. It's used very extensively when read-locking the
VMA.
> 3. move struct to the new 112 offset (which is 8B aligned at 112)
> 4. move detached to the end of the structure
detached should also stay in the first cacheline, otherwise we will
get a performance regression. I spent a week experimenting with what we
can and can't move :)
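For context, a simplified sketch of the pre-series read-lock fast path; this
is why vm_lock_seq (and, via lock_vma_under_rcu(), detached) is touched on
every page-fault attempt. Helper names here are illustrative, not the actual
code:

        static bool vma_start_read_sketch(struct vm_area_struct *vma)
        {
                /* A writer already changed the VMA since mm_lock_seq was sampled. */
                if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq))
                        return false;

                if (!vma_lock_read_trylock(vma))        /* illustrative helper */
                        return false;

                /* Re-check: a writer may have slipped in before we got the lock. */
                if (unlikely(vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
                        vma_lock_read_unlock(vma);      /* illustrative helper */
                        return false;
                }
                return true;
        }

In the current lock_vma_under_rcu(), vma->detached is checked right after this
succeeds, so both fields are hot on the read-lock path.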
>
> This should limit the holes and maintain the alignments.
>
> The downside is that detached is now in the last used cache line and away
> from what would probably be used with it, but maybe it doesn't matter
> considering prefetch.
It very much does matter :)
>
> But maybe you have other reasons for the placement?
>
> >
> > /* size: 192, cachelines: 3, members: 17 */
> > /* sum members: 153, holes: 3, sum holes: 15 */
> > /* padding: 24 */
> > /* forced alignments: 3, forced holes: 2, sum forced holes: 12 */
> > } __attribute__((__aligned__(64)));
> >
> > Memory consumption per 1000 VMAs becomes 48 pages, saving 2 pages compared
> > to the 50 pages in the baseline:
> >
> > slabinfo after vm_area_struct changes:
> > <name> ... <objsize> <objperslab> <pagesperslab> : ...
> > vm_area_struct ... 192 42 2 : ...
> >
> > Performance measurements using pft test on x86 do not show considerable
> > difference, on Pixel 6 running Android it results in 3-5% improvement in
> > faults per second.
> >
> > [1] https://lore.kernel.org/all/20230227173632.3292573-34-surenb@google.com/
> > [2] https://lore.kernel.org/all/ZsQyI%2F087V34JoIt@xsang-OptiPlex-9020/
> > [3] https://lore.kernel.org/all/CAJuCfpEisU8Lfe96AYJDZ+OM4NoPmnw9bP53cT_kbfP_pR+-2g@mail.gmail.com/
> >
> > Suren Baghdasaryan (4):
> > mm: introduce vma_start_read_locked{_nested} helpers
> > mm: move per-vma lock into vm_area_struct
> > mm: replace rw_semaphore with atomic_t in vma_lock
> > mm: move lesser used vma_area_struct members into the last cacheline
> >
> > include/linux/mm.h | 163 +++++++++++++++++++++++++++++++++++---
> > include/linux/mm_types.h | 59 +++++++++-----
> > include/linux/mmap_lock.h | 3 +
> > kernel/fork.c | 50 ++----------
> > mm/init-mm.c | 2 +
> > mm/userfaultfd.c | 14 ++--
> > 6 files changed, 205 insertions(+), 86 deletions(-)
> >
> >
> > base-commit: 931086f2a88086319afb57cd3925607e8cda0a9f
> > --
> > 2.47.0.277.g8800431eea-goog
> >
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH 0/4] move per-vma lock into vm_area_struct
2024-11-12 2:48 ` Liam R. Howlett
@ 2024-11-12 3:27 ` Suren Baghdasaryan
0 siblings, 0 replies; 49+ messages in thread
From: Suren Baghdasaryan @ 2024-11-12 3:27 UTC (permalink / raw)
To: Liam R. Howlett, Suren Baghdasaryan, akpm, willy,
lorenzo.stoakes, mhocko, vbabka, hannes, mjguzik, oliver.sang,
mgorman, david, peterx, oleg, dave, paulmck, brauner, dhowells,
hdanton, hughd, minchan, jannh, shakeel.butt, souravpanda,
pasha.tatashin, linux-mm, linux-kernel, kernel-team
On Mon, Nov 11, 2024 at 6:48 PM 'Liam R. Howlett' via kernel-team
<kernel-team@android.com> wrote:
>
> * Suren Baghdasaryan <surenb@google.com> [241111 16:41]:
> > On Mon, Nov 11, 2024 at 12:55 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > >
> > > Back when per-vma locks were introduced, vm_lock was moved out of
> > > vm_area_struct in [1] because of the performance regression caused by
> > > false cacheline sharing. Recent investigation [2] revealed that the
> > > regression is limited to a rather old Broadwell microarchitecture and
> > > even there it can be mitigated by disabling adjacent cacheline
> > > prefetching, see [3].
> > > This patchset moves vm_lock back into vm_area_struct, aligning it at the
> > > cacheline boundary and changing the cache to be cache-aligned as well.
> > > This causes VMA memory consumption to grow from 160 (vm_area_struct) + 40
> > > (vm_lock) bytes to 256 bytes:
> > >
> > > slabinfo before:
> > > <name> ... <objsize> <objperslab> <pagesperslab> : ...
> > > vma_lock ... 40 102 1 : ...
> > > vm_area_struct ... 160 51 2 : ...
> > >
> > > slabinfo after moving vm_lock:
> > > <name> ... <objsize> <objperslab> <pagesperslab> : ...
> > > vm_area_struct ... 256 32 2 : ...
> > >
> > > Aggregate VMA memory consumption per 1000 VMAs grows from 50 to 64 pages,
> > > which is 5.5MB per 100000 VMAs.
> > > To minimize memory overhead, vm_lock implementation is changed from
> > > using rw_semaphore (40 bytes) to an atomic (8 bytes) and several
> > > vm_area_struct members are moved into the last cacheline, resulting
> > > in a less fragmented structure:
>
> Wait a second, this is taking 40B down to 8B, but the alignment of the
> vma will surely absorb that 32B difference? The struct sum is 153B
> according to what you have below so we won't go over 192B. What am I
> missing?
Take a look at the last patch in the series called "[PATCH 4/4] mm:
move lesser used vma_area_struct members into the last cacheline". I
move some struct members from the earlier cachelines into cacheline #4
where the vm_lock is staying.
>
> > >
> > > struct vm_area_struct {
> > > union {
> > > struct {
> > > long unsigned int vm_start; /* 0 8 */
> > > long unsigned int vm_end; /* 8 8 */
> > > }; /* 0 16 */
> > > struct callback_head vm_rcu ; /* 0 16 */
> > > } __attribute__((__aligned__(8))); /* 0 16 */
> > > struct mm_struct * vm_mm; /* 16 8 */
> > > pgprot_t vm_page_prot; /* 24 8 */
> > > union {
> > > const vm_flags_t vm_flags; /* 32 8 */
> > > vm_flags_t __vm_flags; /* 32 8 */
> > > }; /* 32 8 */
> > > bool detached; /* 40 1 */
> > >
> > > /* XXX 3 bytes hole, try to pack */
> > >
> > > unsigned int vm_lock_seq; /* 44 4 */
> > > struct list_head anon_vma_chain; /* 48 16 */
> > > /* --- cacheline 1 boundary (64 bytes) --- */
> > > struct anon_vma * anon_vma; /* 64 8 */
> > > const struct vm_operations_struct * vm_ops; /* 72 8 */
> > > long unsigned int vm_pgoff; /* 80 8 */
> > > struct file * vm_file; /* 88 8 */
> > > void * vm_private_data; /* 96 8 */
> > > atomic_long_t swap_readahead_info; /* 104 8 */
> > > struct mempolicy * vm_policy; /* 112 8 */
> > >
> > > /* XXX 8 bytes hole, try to pack */
> > >
> > > /* --- cacheline 2 boundary (128 bytes) --- */
> > > struct vma_lock vm_lock (__aligned__(64)); /* 128 4 */
> > >
> > > /* XXX 4 bytes hole, try to pack */
> > >
> > > struct {
> > > struct rb_node rb (__aligned__(8)); /* 136 24 */
> > > long unsigned int rb_subtree_last; /* 160 8 */
> > > } __attribute__((__aligned__(8))) shared; /* 136 32 */
> > > struct vm_userfaultfd_ctx vm_userfaultfd_ctx; /* 168 0 */
> > >
> > > /* size: 192, cachelines: 3, members: 17 */
> > > /* sum members: 153, holes: 3, sum holes: 15 */
> > > /* padding: 24 */
> > > /* forced alignments: 3, forced holes: 2, sum forced holes: 12 */
> > > } __attribute__((__aligned__(64)));
> > >
> > > Memory consumption per 1000 VMAs becomes 48 pages, saving 2 pages compared
> > > to the 50 pages in the baseline:
> > >
> > > slabinfo after vm_area_struct changes:
> > > <name> ... <objsize> <objperslab> <pagesperslab> : ...
> > > vm_area_struct ... 192 42 2 : ...
> > >
> > > Performance measurements using pft test on x86 do not show considerable
> > > difference, on Pixel 6 running Android it results in 3-5% improvement in
> > > faults per second.
> > >
> > > [1] https://lore.kernel.org/all/20230227173632.3292573-34-surenb@google.com/
> > > [2] https://lore.kernel.org/all/ZsQyI%2F087V34JoIt@xsang-OptiPlex-9020/
> > > [3] https://lore.kernel.org/all/CAJuCfpEisU8Lfe96AYJDZ+OM4NoPmnw9bP53cT_kbfP_pR+-2g@mail.gmail.com/
> >
> > And of course I forgot to update Lorenzo's new locking documentation :/
> > Will add that in the next version.
> >
> > >
> > > Suren Baghdasaryan (4):
> > > mm: introduce vma_start_read_locked{_nested} helpers
> > > mm: move per-vma lock into vm_area_struct
> > > mm: replace rw_semaphore with atomic_t in vma_lock
> > > mm: move lesser used vma_area_struct members into the last cacheline
> > >
> > > include/linux/mm.h | 163 +++++++++++++++++++++++++++++++++++---
> > > include/linux/mm_types.h | 59 +++++++++-----
> > > include/linux/mmap_lock.h | 3 +
> > > kernel/fork.c | 50 ++----------
> > > mm/init-mm.c | 2 +
> > > mm/userfaultfd.c | 14 ++--
> > > 6 files changed, 205 insertions(+), 86 deletions(-)
> > >
> > >
> > > base-commit: 931086f2a88086319afb57cd3925607e8cda0a9f
> > > --
> > > 2.47.0.277.g8800431eea-goog
> > >
>
>
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH 3/4] mm: replace rw_semaphore with atomic_t in vma_lock
2024-11-11 20:55 ` [PATCH 3/4] mm: replace rw_semaphore with atomic_t in vma_lock Suren Baghdasaryan
2024-11-12 0:06 ` Andrew Morton
2024-11-12 0:35 ` Davidlohr Bueso
@ 2024-11-12 4:58 ` Matthew Wilcox
2024-11-12 15:18 ` Suren Baghdasaryan
2 siblings, 1 reply; 49+ messages in thread
From: Matthew Wilcox @ 2024-11-12 4:58 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: akpm, liam.howlett, lorenzo.stoakes, mhocko, vbabka, hannes,
mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, minchan, jannh,
shakeel.butt, souravpanda, pasha.tatashin, linux-mm,
linux-kernel, kernel-team
On Mon, Nov 11, 2024 at 12:55:05PM -0800, Suren Baghdasaryan wrote:
> When a reader takes the read lock, it increments the atomic, unless the
> top two bits are set, indicating a writer is present.
> When a writer takes the write lock, it sets the VMA_LOCK_WR_LOCKED bit if
> there are no readers, or the VMA_LOCK_WR_WAIT bit if readers are holding the
> lock, and puts itself onto the newly introduced mm.vma_writer_wait. Since all
> writers take mmap_lock in write mode first, there can be only one writer
> at a time. The last reader to release the lock will signal the writer
> to wake up.
I don't think you need two bits. You can do it this way:
0x8000'0000 - No readers, no writers
0x1-7fff'ffff - Some number of readers
0x0 - Writer held
0x8000'0001-0xffff'ffff - Reader held, writer waiting
A prospective writer subtracts 0x8000'0000. If the result is 0, it got
the lock, otherwise it sleeps until it is 0.
A writer unlocks by adding 0x8000'0000 (not by setting the value to
0x8000'0000).
A reader unlocks by adding 1. If the result is 0, it wakes the writer.
A prospective reader subtracts 1. If the result is positive, it got the
lock, otherwise it does the unlock above (this might be the one which
wakes the writer).
And ... that's it. See how we use the CPU arithmetic flags to tell us
everything we need to know without doing arithmetic separately?
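Spelled out as code, that scheme looks roughly like the following (a sketch
only: the count and vma_writer_wait fields follow this series, the rcuwait is
Davidlohr's earlier suggestion rather than the posted wait_queue_head, and
constant handling is simplified):

        #define VMA_LOCK_UNLOCKED       0x80000000      /* initial value: no readers, no writer */
        #define VMA_LOCK_WR_BIAS        0x80000000      /* removed/restored by the writer */

        static bool vma_read_trylock(struct vm_area_struct *vma)
        {
                /* Positive result: we are one of up to 0x7fff'ffff readers. */
                if (atomic_sub_return(1, &vma->vm_lock.count) > 0)
                        return true;
                /* A writer holds or waits: back out, which doubles as the unlock path. */
                if (atomic_add_return(1, &vma->vm_lock.count) == 0)
                        rcuwait_wake_up(&vma->vm_mm->vma_writer_wait);
                return false;
        }

        static void vma_read_unlock(struct vm_area_struct *vma)
        {
                /* Reaching 0 means a writer was waiting and we were the last reader. */
                if (atomic_add_return(1, &vma->vm_lock.count) == 0)
                        rcuwait_wake_up(&vma->vm_mm->vma_writer_wait);
        }

        static void vma_write_lock(struct vm_area_struct *vma)
        {
                /* 0 after removing the bias: no readers, the lock is ours. */
                if (atomic_sub_return(VMA_LOCK_WR_BIAS, &vma->vm_lock.count) == 0)
                        return;
                /* Otherwise sleep until the last reader brings the count to 0. */
                rcuwait_wait_event(&vma->vm_mm->vma_writer_wait,
                                   atomic_read(&vma->vm_lock.count) == 0,
                                   TASK_UNINTERRUPTIBLE);
        }

        static void vma_write_unlock(struct vm_area_struct *vma)
        {
                /* Add the bias back rather than setting the value, so transient
                 * failed-reader decrements are not lost. */
                atomic_add(VMA_LOCK_WR_BIAS, &vma->vm_lock.count);
        }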
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH 2/4] mm: move per-vma lock into vm_area_struct
2024-11-11 20:55 ` [PATCH 2/4] mm: move per-vma lock into vm_area_struct Suren Baghdasaryan
@ 2024-11-12 9:36 ` kernel test robot
2024-11-12 9:47 ` kernel test robot
` (2 subsequent siblings)
3 siblings, 0 replies; 49+ messages in thread
From: kernel test robot @ 2024-11-12 9:36 UTC (permalink / raw)
To: Suren Baghdasaryan, akpm
Cc: llvm, oe-kbuild-all, willy, liam.howlett, lorenzo.stoakes,
mhocko, vbabka, hannes, mjguzik, oliver.sang, mgorman, david,
peterx, oleg, dave, paulmck, brauner, dhowells, hdanton, hughd,
minchan, jannh, shakeel.butt, souravpanda, pasha.tatashin,
linux-mm, linux-kernel, kernel-team, surenb
Hi Suren,
kernel test robot noticed the following build warnings:
[auto build test WARNING on 931086f2a88086319afb57cd3925607e8cda0a9f]
url: https://github.com/intel-lab-lkp/linux/commits/Suren-Baghdasaryan/mm-introduce-vma_start_read_locked-_nested-helpers/20241112-050531
base: 931086f2a88086319afb57cd3925607e8cda0a9f
patch link: https://lore.kernel.org/r/20241111205506.3404479-3-surenb%40google.com
patch subject: [PATCH 2/4] mm: move per-vma lock into vm_area_struct
config: um-defconfig (https://download.01.org/0day-ci/archive/20241112/202411121742.VK3YF84e-lkp@intel.com/config)
compiler: clang version 20.0.0git (https://github.com/llvm/llvm-project 592c0fe55f6d9a811028b5f3507be91458ab2713)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241112/202411121742.VK3YF84e-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202411121742.VK3YF84e-lkp@intel.com/
All warnings (new ones prefixed by >>):
| ^
In file included from ipc/msg.c:30:
In file included from include/linux/mm.h:1143:
In file included from include/linux/huge_mm.h:7:
In file included from include/linux/fs.h:33:
In file included from include/linux/percpu-rwsem.h:7:
In file included from include/linux/rcuwait.h:6:
In file included from include/linux/sched/signal.h:6:
include/linux/signal.h:255:10: warning: array index 1 is past the end of the array (that has type 'unsigned long[1]') [-Warray-bounds]
255 | case 2: set->sig[1] = -1;
| ^ ~
arch/x86/include/asm/signal.h:24:2: note: array 'sig' declared here
24 | unsigned long sig[_NSIG_WORDS];
| ^
In file included from ipc/msg.c:30:
In file included from include/linux/mm.h:2234:
include/linux/vmstat.h:518:36: warning: arithmetic between different enumeration types ('enum node_stat_item' and 'enum lru_list') [-Wenum-enum-conversion]
518 | return node_stat_name(NR_LRU_BASE + lru) + 3; // skip "nr_"
| ~~~~~~~~~~~ ^ ~~~
In file included from ipc/msg.c:33:
In file included from include/linux/security.h:35:
In file included from include/linux/bpf.h:31:
In file included from include/linux/memcontrol.h:13:
In file included from include/linux/cgroup.h:26:
In file included from include/linux/kernel_stat.h:8:
In file included from include/linux/interrupt.h:11:
In file included from include/linux/hardirq.h:11:
In file included from arch/um/include/asm/hardirq.h:5:
In file included from include/asm-generic/hardirq.h:17:
In file included from include/linux/irq.h:20:
In file included from include/linux/io.h:14:
In file included from arch/um/include/asm/io.h:24:
include/asm-generic/io.h:548:31: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
548 | val = __raw_readb(PCI_IOBASE + addr);
| ~~~~~~~~~~ ^
include/asm-generic/io.h:561:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
561 | val = __le16_to_cpu((__le16 __force)__raw_readw(PCI_IOBASE + addr));
| ~~~~~~~~~~ ^
include/uapi/linux/byteorder/little_endian.h:37:51: note: expanded from macro '__le16_to_cpu'
37 | #define __le16_to_cpu(x) ((__force __u16)(__le16)(x))
| ^
In file included from ipc/msg.c:33:
In file included from include/linux/security.h:35:
In file included from include/linux/bpf.h:31:
In file included from include/linux/memcontrol.h:13:
In file included from include/linux/cgroup.h:26:
In file included from include/linux/kernel_stat.h:8:
In file included from include/linux/interrupt.h:11:
In file included from include/linux/hardirq.h:11:
In file included from arch/um/include/asm/hardirq.h:5:
In file included from include/asm-generic/hardirq.h:17:
In file included from include/linux/irq.h:20:
In file included from include/linux/io.h:14:
In file included from arch/um/include/asm/io.h:24:
include/asm-generic/io.h:574:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
574 | val = __le32_to_cpu((__le32 __force)__raw_readl(PCI_IOBASE + addr));
| ~~~~~~~~~~ ^
include/uapi/linux/byteorder/little_endian.h:35:51: note: expanded from macro '__le32_to_cpu'
35 | #define __le32_to_cpu(x) ((__force __u32)(__le32)(x))
| ^
In file included from ipc/msg.c:33:
In file included from include/linux/security.h:35:
In file included from include/linux/bpf.h:31:
In file included from include/linux/memcontrol.h:13:
In file included from include/linux/cgroup.h:26:
In file included from include/linux/kernel_stat.h:8:
In file included from include/linux/interrupt.h:11:
In file included from include/linux/hardirq.h:11:
In file included from arch/um/include/asm/hardirq.h:5:
In file included from include/asm-generic/hardirq.h:17:
In file included from include/linux/irq.h:20:
In file included from include/linux/io.h:14:
In file included from arch/um/include/asm/io.h:24:
include/asm-generic/io.h:585:33: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
585 | __raw_writeb(value, PCI_IOBASE + addr);
| ~~~~~~~~~~ ^
include/asm-generic/io.h:595:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
595 | __raw_writew((u16 __force)cpu_to_le16(value), PCI_IOBASE + addr);
| ~~~~~~~~~~ ^
include/asm-generic/io.h:605:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
605 | __raw_writel((u32 __force)cpu_to_le32(value), PCI_IOBASE + addr);
| ~~~~~~~~~~ ^
include/asm-generic/io.h:693:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
693 | readsb(PCI_IOBASE + addr, buffer, count);
| ~~~~~~~~~~ ^
include/asm-generic/io.h:701:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
701 | readsw(PCI_IOBASE + addr, buffer, count);
| ~~~~~~~~~~ ^
include/asm-generic/io.h:709:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
709 | readsl(PCI_IOBASE + addr, buffer, count);
| ~~~~~~~~~~ ^
include/asm-generic/io.h:718:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
718 | writesb(PCI_IOBASE + addr, buffer, count);
| ~~~~~~~~~~ ^
include/asm-generic/io.h:727:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
727 | writesw(PCI_IOBASE + addr, buffer, count);
| ~~~~~~~~~~ ^
include/asm-generic/io.h:736:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
736 | writesl(PCI_IOBASE + addr, buffer, count);
| ~~~~~~~~~~ ^
>> ipc/msg.c:497:20: warning: implicit conversion from 'int' to 'unsigned short' changes value from 32768000 to 0 [-Wconstant-conversion]
497 | msginfo->msgseg = MSGSEG;
| ~ ^~~~~~
include/uapi/linux/msg.h:87:38: note: expanded from macro 'MSGSEG'
87 | #define MSGSEG (__MSGSEG <= 0xffff ? __MSGSEG : 0xffff)
| ^~~~~~~~
include/uapi/linux/msg.h:86:36: note: expanded from macro '__MSGSEG'
86 | #define __MSGSEG ((MSGPOOL * 1024) / MSGSSZ) /* max no. of segments */
| ~~~~~~~~~~~~~~~~~^~~~~~~~
63 warnings and 3 errors generated.
--
In file included from drivers/tty/tty_port.c:8:
In file included from include/linux/tty.h:11:
In file included from include/linux/tty_port.h:5:
In file included from include/linux/kfifo.h:40:
In file included from include/linux/dma-mapping.h:11:
In file included from include/linux/scatterlist.h:8:
include/linux/mm.h:877:2: error: call to undeclared function 'vma_lock_init'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
877 | vma_lock_init(&vma->vm_lock);
| ^
include/linux/mm.h:877:2: note: did you mean 'osq_lock_init'?
include/linux/osq_lock.h:23:20: note: 'osq_lock_init' declared here
23 | static inline void osq_lock_init(struct optimistic_spin_queue *lock)
| ^
In file included from drivers/tty/tty_port.c:8:
In file included from include/linux/tty.h:11:
In file included from include/linux/tty_port.h:5:
In file included from include/linux/kfifo.h:40:
In file included from include/linux/dma-mapping.h:11:
In file included from include/linux/scatterlist.h:8:
include/linux/mm.h:877:22: error: no member named 'vm_lock' in 'struct vm_area_struct'
877 | vma_lock_init(&vma->vm_lock);
| ~~~ ^
include/linux/mm.h:878:7: error: no member named 'vm_lock_seq' in 'struct vm_area_struct'
878 | vma->vm_lock_seq = UINT_MAX;
| ~~~ ^
In file included from drivers/tty/tty_port.c:8:
In file included from include/linux/tty.h:11:
In file included from include/linux/tty_port.h:5:
In file included from include/linux/kfifo.h:40:
In file included from include/linux/dma-mapping.h:11:
In file included from include/linux/scatterlist.h:8:
In file included from include/linux/mm.h:2234:
include/linux/vmstat.h:518:36: warning: arithmetic between different enumeration types ('enum node_stat_item' and 'enum lru_list') [-Wenum-enum-conversion]
518 | return node_stat_name(NR_LRU_BASE + lru) + 3; // skip "nr_"
| ~~~~~~~~~~~ ^ ~~~
In file included from drivers/tty/tty_port.c:8:
In file included from include/linux/tty.h:11:
In file included from include/linux/tty_port.h:5:
In file included from include/linux/kfifo.h:40:
In file included from include/linux/dma-mapping.h:11:
In file included from include/linux/scatterlist.h:9:
In file included from arch/um/include/asm/io.h:24:
include/asm-generic/io.h:548:31: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
548 | val = __raw_readb(PCI_IOBASE + addr);
| ~~~~~~~~~~ ^
include/asm-generic/io.h:561:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
561 | val = __le16_to_cpu((__le16 __force)__raw_readw(PCI_IOBASE + addr));
| ~~~~~~~~~~ ^
include/uapi/linux/byteorder/little_endian.h:37:51: note: expanded from macro '__le16_to_cpu'
37 | #define __le16_to_cpu(x) ((__force __u16)(__le16)(x))
| ^
In file included from drivers/tty/tty_port.c:8:
In file included from include/linux/tty.h:11:
In file included from include/linux/tty_port.h:5:
In file included from include/linux/kfifo.h:40:
In file included from include/linux/dma-mapping.h:11:
In file included from include/linux/scatterlist.h:9:
In file included from arch/um/include/asm/io.h:24:
include/asm-generic/io.h:574:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
574 | val = __le32_to_cpu((__le32 __force)__raw_readl(PCI_IOBASE + addr));
| ~~~~~~~~~~ ^
include/uapi/linux/byteorder/little_endian.h:35:51: note: expanded from macro '__le32_to_cpu'
35 | #define __le32_to_cpu(x) ((__force __u32)(__le32)(x))
| ^
In file included from drivers/tty/tty_port.c:8:
In file included from include/linux/tty.h:11:
In file included from include/linux/tty_port.h:5:
In file included from include/linux/kfifo.h:40:
In file included from include/linux/dma-mapping.h:11:
In file included from include/linux/scatterlist.h:9:
In file included from arch/um/include/asm/io.h:24:
include/asm-generic/io.h:585:33: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
585 | __raw_writeb(value, PCI_IOBASE + addr);
| ~~~~~~~~~~ ^
include/asm-generic/io.h:595:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
595 | __raw_writew((u16 __force)cpu_to_le16(value), PCI_IOBASE + addr);
| ~~~~~~~~~~ ^
include/asm-generic/io.h:605:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
605 | __raw_writel((u32 __force)cpu_to_le32(value), PCI_IOBASE + addr);
| ~~~~~~~~~~ ^
include/asm-generic/io.h:693:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
693 | readsb(PCI_IOBASE + addr, buffer, count);
| ~~~~~~~~~~ ^
include/asm-generic/io.h:701:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
701 | readsw(PCI_IOBASE + addr, buffer, count);
| ~~~~~~~~~~ ^
include/asm-generic/io.h:709:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
709 | readsl(PCI_IOBASE + addr, buffer, count);
| ~~~~~~~~~~ ^
include/asm-generic/io.h:718:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
718 | writesb(PCI_IOBASE + addr, buffer, count);
| ~~~~~~~~~~ ^
include/asm-generic/io.h:727:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
727 | writesw(PCI_IOBASE + addr, buffer, count);
| ~~~~~~~~~~ ^
include/asm-generic/io.h:736:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
736 | writesl(PCI_IOBASE + addr, buffer, count);
| ~~~~~~~~~~ ^
>> drivers/tty/tty_port.c:266:2: warning: implicit conversion from 'unsigned long' to 'unsigned int' changes value from 18446744073709551615 to 4294967295 [-Wconstant-conversion]
266 | INIT_KFIFO(port->xmit_fifo);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~
include/linux/kfifo.h:136:69: note: expanded from macro 'INIT_KFIFO'
136 | __kfifo->mask = __is_kfifo_ptr(__tmp) ? 0 : ARRAY_SIZE(__tmp->buf) - 1;\
| ~ ~~~~~~~~~~~~~~~~~~~~~~~^~~
14 warnings and 3 errors generated.
vim +497 ipc/msg.c
a0d092fc2df845 Pierre Peiffer 2008-04-29 476
156d9ed1260ee5 Al Viro 2017-07-09 477 static int msgctl_info(struct ipc_namespace *ns, int msqid,
156d9ed1260ee5 Al Viro 2017-07-09 478 int cmd, struct msginfo *msginfo)
a0d092fc2df845 Pierre Peiffer 2008-04-29 479 {
2cafed30f150f7 Davidlohr Bueso 2013-07-08 480 int err;
27c331a1746142 Manfred Spraul 2018-08-21 481 int max_idx;
5a06a363ef4844 Ingo Molnar 2006-07-30 482
5a06a363ef4844 Ingo Molnar 2006-07-30 483 /*
5a06a363ef4844 Ingo Molnar 2006-07-30 484 * We must not return kernel stack data.
^1da177e4c3f41 Linus Torvalds 2005-04-16 485 * due to padding, it's not enough
^1da177e4c3f41 Linus Torvalds 2005-04-16 486 * to set all member fields.
^1da177e4c3f41 Linus Torvalds 2005-04-16 487 */
^1da177e4c3f41 Linus Torvalds 2005-04-16 488 err = security_msg_queue_msgctl(NULL, cmd);
^1da177e4c3f41 Linus Torvalds 2005-04-16 489 if (err)
^1da177e4c3f41 Linus Torvalds 2005-04-16 490 return err;
^1da177e4c3f41 Linus Torvalds 2005-04-16 491
156d9ed1260ee5 Al Viro 2017-07-09 492 memset(msginfo, 0, sizeof(*msginfo));
156d9ed1260ee5 Al Viro 2017-07-09 493 msginfo->msgmni = ns->msg_ctlmni;
156d9ed1260ee5 Al Viro 2017-07-09 494 msginfo->msgmax = ns->msg_ctlmax;
156d9ed1260ee5 Al Viro 2017-07-09 495 msginfo->msgmnb = ns->msg_ctlmnb;
156d9ed1260ee5 Al Viro 2017-07-09 496 msginfo->msgssz = MSGSSZ;
156d9ed1260ee5 Al Viro 2017-07-09 @497 msginfo->msgseg = MSGSEG;
d9a605e40b1376 Davidlohr Bueso 2013-09-11 498 down_read(&msg_ids(ns).rwsem);
72d1e611082eda Jiebin Sun 2022-09-14 499 if (cmd == MSG_INFO)
156d9ed1260ee5 Al Viro 2017-07-09 500 msginfo->msgpool = msg_ids(ns).in_use;
72d1e611082eda Jiebin Sun 2022-09-14 501 max_idx = ipc_get_maxidx(&msg_ids(ns));
72d1e611082eda Jiebin Sun 2022-09-14 502 up_read(&msg_ids(ns).rwsem);
72d1e611082eda Jiebin Sun 2022-09-14 503 if (cmd == MSG_INFO) {
72d1e611082eda Jiebin Sun 2022-09-14 504 msginfo->msgmap = min_t(int,
72d1e611082eda Jiebin Sun 2022-09-14 505 percpu_counter_sum(&ns->percpu_msg_hdrs),
72d1e611082eda Jiebin Sun 2022-09-14 506 INT_MAX);
72d1e611082eda Jiebin Sun 2022-09-14 507 msginfo->msgtql = min_t(int,
72d1e611082eda Jiebin Sun 2022-09-14 508 percpu_counter_sum(&ns->percpu_msg_bytes),
72d1e611082eda Jiebin Sun 2022-09-14 509 INT_MAX);
^1da177e4c3f41 Linus Torvalds 2005-04-16 510 } else {
156d9ed1260ee5 Al Viro 2017-07-09 511 msginfo->msgmap = MSGMAP;
156d9ed1260ee5 Al Viro 2017-07-09 512 msginfo->msgpool = MSGPOOL;
156d9ed1260ee5 Al Viro 2017-07-09 513 msginfo->msgtql = MSGTQL;
^1da177e4c3f41 Linus Torvalds 2005-04-16 514 }
27c331a1746142 Manfred Spraul 2018-08-21 515 return (max_idx < 0) ? 0 : max_idx;
^1da177e4c3f41 Linus Torvalds 2005-04-16 516 }
2cafed30f150f7 Davidlohr Bueso 2013-07-08 517
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH 0/4] move per-vma lock into vm_area_struct
2024-11-11 20:55 [PATCH 0/4] move per-vma lock into vm_area_struct Suren Baghdasaryan
` (6 preceding siblings ...)
2024-11-12 2:35 ` Liam R. Howlett
@ 2024-11-12 9:39 ` Lorenzo Stoakes
2024-11-12 15:38 ` Suren Baghdasaryan
7 siblings, 1 reply; 49+ messages in thread
From: Lorenzo Stoakes @ 2024-11-12 9:39 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: akpm, willy, liam.howlett, mhocko, vbabka, hannes, mjguzik,
oliver.sang, mgorman, david, peterx, oleg, dave, paulmck,
brauner, dhowells, hdanton, hughd, minchan, jannh, shakeel.butt,
souravpanda, pasha.tatashin, linux-mm, linux-kernel, kernel-team
On Mon, Nov 11, 2024 at 12:55:02PM -0800, Suren Baghdasaryan wrote:
> Back when per-vma locks were introduced, vm_lock was moved out of
> vm_area_struct in [1] because of the performance regression caused by
> false cacheline sharing. Recent investigation [2] revealed that the
> regression is limited to a rather old Broadwell microarchitecture and
> even there it can be mitigated by disabling adjacent cacheline
> prefetching, see [3].
Sorry to nag + add workload, but could you also add changes to the
Documentation/mm/process_addrs.rst file to reflect this as part of the
series? It'd be really cool to have the change as part of your series, so
when it lands we update the documentation with it.
This change fundamentally changes internals that are documented there so it
makes sense to :)
Thanks!
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH 2/4] mm: move per-vma lock into vm_area_struct
2024-11-11 20:55 ` [PATCH 2/4] mm: move per-vma lock into vm_area_struct Suren Baghdasaryan
2024-11-12 9:36 ` kernel test robot
@ 2024-11-12 9:47 ` kernel test robot
2024-11-12 10:18 ` kernel test robot
2024-11-12 15:57 ` Vlastimil Babka
3 siblings, 0 replies; 49+ messages in thread
From: kernel test robot @ 2024-11-12 9:47 UTC (permalink / raw)
To: Suren Baghdasaryan, akpm
Cc: llvm, oe-kbuild-all, willy, liam.howlett, lorenzo.stoakes,
mhocko, vbabka, hannes, mjguzik, oliver.sang, mgorman, david,
peterx, oleg, dave, paulmck, brauner, dhowells, hdanton, hughd,
minchan, jannh, shakeel.butt, souravpanda, pasha.tatashin,
linux-mm, linux-kernel, kernel-team, surenb
Hi Suren,
kernel test robot noticed the following build errors:
[auto build test ERROR on 931086f2a88086319afb57cd3925607e8cda0a9f]
url: https://github.com/intel-lab-lkp/linux/commits/Suren-Baghdasaryan/mm-introduce-vma_start_read_locked-_nested-helpers/20241112-050531
base: 931086f2a88086319afb57cd3925607e8cda0a9f
patch link: https://lore.kernel.org/r/20241111205506.3404479-3-surenb%40google.com
patch subject: [PATCH 2/4] mm: move per-vma lock into vm_area_struct
config: um-x86_64_defconfig (https://download.01.org/0day-ci/archive/20241112/202411121745.VmaDStMg-lkp@intel.com/config)
compiler: clang version 15.0.7 (https://github.com/llvm/llvm-project 8dfdcc7b7bf66834a761bd8de445840ef68e4d1a)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241112/202411121745.VmaDStMg-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202411121745.VmaDStMg-lkp@intel.com/
All errors (new ones prefixed by >>):
In file included from ipc/util.c:47:
>> include/linux/mm.h:877:2: error: call to undeclared function 'vma_lock_init'; ISO C99 and later do not support implicit function declarations [-Werror,-Wimplicit-function-declaration]
vma_lock_init(&vma->vm_lock);
^
include/linux/mm.h:877:2: note: did you mean 'osq_lock_init'?
include/linux/osq_lock.h:23:20: note: 'osq_lock_init' declared here
static inline void osq_lock_init(struct optimistic_spin_queue *lock)
^
In file included from ipc/util.c:47:
include/linux/mm.h:877:22: error: no member named 'vm_lock' in 'struct vm_area_struct'
vma_lock_init(&vma->vm_lock);
~~~ ^
include/linux/mm.h:878:7: error: no member named 'vm_lock_seq' in 'struct vm_area_struct'
vma->vm_lock_seq = UINT_MAX;
~~~ ^
In file included from ipc/util.c:47:
In file included from include/linux/mm.h:1143:
In file included from include/linux/huge_mm.h:7:
In file included from include/linux/fs.h:33:
In file included from include/linux/percpu-rwsem.h:7:
In file included from include/linux/rcuwait.h:6:
In file included from include/linux/sched/signal.h:6:
include/linux/signal.h:98:11: warning: array index 3 is past the end of the array (which contains 1 element) [-Warray-bounds]
return (set->sig[3] | set->sig[2] |
^ ~
arch/x86/include/asm/signal.h:24:2: note: array 'sig' declared here
unsigned long sig[_NSIG_WORDS];
^
In file included from ipc/util.c:47:
In file included from include/linux/mm.h:1143:
In file included from include/linux/huge_mm.h:7:
In file included from include/linux/fs.h:33:
In file included from include/linux/percpu-rwsem.h:7:
In file included from include/linux/rcuwait.h:6:
In file included from include/linux/sched/signal.h:6:
include/linux/signal.h:98:25: warning: array index 2 is past the end of the array (which contains 1 element) [-Warray-bounds]
return (set->sig[3] | set->sig[2] |
^ ~
arch/x86/include/asm/signal.h:24:2: note: array 'sig' declared here
unsigned long sig[_NSIG_WORDS];
^
In file included from ipc/util.c:47:
In file included from include/linux/mm.h:1143:
In file included from include/linux/huge_mm.h:7:
In file included from include/linux/fs.h:33:
In file included from include/linux/percpu-rwsem.h:7:
In file included from include/linux/rcuwait.h:6:
In file included from include/linux/sched/signal.h:6:
include/linux/signal.h:99:4: warning: array index 1 is past the end of the array (which contains 1 element) [-Warray-bounds]
set->sig[1] | set->sig[0]) == 0;
^ ~
arch/x86/include/asm/signal.h:24:2: note: array 'sig' declared here
unsigned long sig[_NSIG_WORDS];
^
In file included from ipc/util.c:47:
In file included from include/linux/mm.h:1143:
In file included from include/linux/huge_mm.h:7:
In file included from include/linux/fs.h:33:
In file included from include/linux/percpu-rwsem.h:7:
In file included from include/linux/rcuwait.h:6:
In file included from include/linux/sched/signal.h:6:
include/linux/signal.h:101:11: warning: array index 1 is past the end of the array (which contains 1 element) [-Warray-bounds]
return (set->sig[1] | set->sig[0]) == 0;
^ ~
arch/x86/include/asm/signal.h:24:2: note: array 'sig' declared here
unsigned long sig[_NSIG_WORDS];
^
In file included from ipc/util.c:47:
In file included from include/linux/mm.h:1143:
In file included from include/linux/huge_mm.h:7:
In file included from include/linux/fs.h:33:
In file included from include/linux/percpu-rwsem.h:7:
In file included from include/linux/rcuwait.h:6:
In file included from include/linux/sched/signal.h:6:
include/linux/signal.h:114:11: warning: array index 3 is past the end of the array (which contains 1 element) [-Warray-bounds]
return (set1->sig[3] == set2->sig[3]) &&
^ ~
arch/x86/include/asm/signal.h:24:2: note: array 'sig' declared here
unsigned long sig[_NSIG_WORDS];
^
In file included from ipc/util.c:47:
In file included from include/linux/mm.h:1143:
In file included from include/linux/huge_mm.h:7:
In file included from include/linux/fs.h:33:
In file included from include/linux/percpu-rwsem.h:7:
In file included from include/linux/rcuwait.h:6:
In file included from include/linux/sched/signal.h:6:
include/linux/signal.h:114:27: warning: array index 3 is past the end of the array (which contains 1 element) [-Warray-bounds]
return (set1->sig[3] == set2->sig[3]) &&
^ ~
arch/x86/include/asm/signal.h:24:2: note: array 'sig' declared here
unsigned long sig[_NSIG_WORDS];
^
In file included from ipc/util.c:47:
In file included from include/linux/mm.h:1143:
In file included from include/linux/huge_mm.h:7:
In file included from include/linux/fs.h:33:
In file included from include/linux/percpu-rwsem.h:7:
In file included from include/linux/rcuwait.h:6:
In file included from include/linux/sched/signal.h:6:
include/linux/signal.h:115:5: warning: array index 2 is past the end of the array (which contains 1 element) [-Warray-bounds]
(set1->sig[2] == set2->sig[2]) &&
--
In file included from ipc/msgutil.c:9:
In file included from include/linux/security.h:33:
>> include/linux/mm.h:877:2: error: call to undeclared function 'vma_lock_init'; ISO C99 and later do not support implicit function declarations [-Werror,-Wimplicit-function-declaration]
vma_lock_init(&vma->vm_lock);
^
include/linux/mm.h:877:2: note: did you mean 'osq_lock_init'?
include/linux/osq_lock.h:23:20: note: 'osq_lock_init' declared here
static inline void osq_lock_init(struct optimistic_spin_queue *lock)
^
In file included from ipc/msgutil.c:9:
In file included from include/linux/security.h:33:
include/linux/mm.h:877:22: error: no member named 'vm_lock' in 'struct vm_area_struct'
vma_lock_init(&vma->vm_lock);
~~~ ^
include/linux/mm.h:878:7: error: no member named 'vm_lock_seq' in 'struct vm_area_struct'
vma->vm_lock_seq = UINT_MAX;
~~~ ^
In file included from ipc/msgutil.c:9:
In file included from include/linux/security.h:35:
In file included from include/linux/bpf.h:31:
In file included from include/linux/memcontrol.h:13:
In file included from include/linux/cgroup.h:26:
In file included from include/linux/kernel_stat.h:8:
In file included from include/linux/interrupt.h:11:
In file included from include/linux/hardirq.h:11:
In file included from arch/um/include/asm/hardirq.h:5:
In file included from include/asm-generic/hardirq.h:17:
In file included from include/linux/irq.h:20:
In file included from include/linux/io.h:14:
In file included from arch/um/include/asm/io.h:24:
include/asm-generic/io.h:548:31: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
val = __raw_readb(PCI_IOBASE + addr);
~~~~~~~~~~ ^
include/asm-generic/io.h:561:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
val = __le16_to_cpu((__le16 __force)__raw_readw(PCI_IOBASE + addr));
~~~~~~~~~~ ^
include/uapi/linux/byteorder/little_endian.h:37:51: note: expanded from macro '__le16_to_cpu'
#define __le16_to_cpu(x) ((__force __u16)(__le16)(x))
^
In file included from ipc/msgutil.c:9:
In file included from include/linux/security.h:35:
In file included from include/linux/bpf.h:31:
In file included from include/linux/memcontrol.h:13:
In file included from include/linux/cgroup.h:26:
In file included from include/linux/kernel_stat.h:8:
In file included from include/linux/interrupt.h:11:
In file included from include/linux/hardirq.h:11:
In file included from arch/um/include/asm/hardirq.h:5:
In file included from include/asm-generic/hardirq.h:17:
In file included from include/linux/irq.h:20:
In file included from include/linux/io.h:14:
In file included from arch/um/include/asm/io.h:24:
include/asm-generic/io.h:574:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
val = __le32_to_cpu((__le32 __force)__raw_readl(PCI_IOBASE + addr));
~~~~~~~~~~ ^
include/uapi/linux/byteorder/little_endian.h:35:51: note: expanded from macro '__le32_to_cpu'
#define __le32_to_cpu(x) ((__force __u32)(__le32)(x))
^
In file included from ipc/msgutil.c:9:
In file included from include/linux/security.h:35:
In file included from include/linux/bpf.h:31:
In file included from include/linux/memcontrol.h:13:
In file included from include/linux/cgroup.h:26:
In file included from include/linux/kernel_stat.h:8:
In file included from include/linux/interrupt.h:11:
In file included from include/linux/hardirq.h:11:
In file included from arch/um/include/asm/hardirq.h:5:
In file included from include/asm-generic/hardirq.h:17:
In file included from include/linux/irq.h:20:
In file included from include/linux/io.h:14:
In file included from arch/um/include/asm/io.h:24:
include/asm-generic/io.h:585:33: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
__raw_writeb(value, PCI_IOBASE + addr);
~~~~~~~~~~ ^
include/asm-generic/io.h:595:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
__raw_writew((u16 __force)cpu_to_le16(value), PCI_IOBASE + addr);
~~~~~~~~~~ ^
include/asm-generic/io.h:605:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
__raw_writel((u32 __force)cpu_to_le32(value), PCI_IOBASE + addr);
~~~~~~~~~~ ^
include/asm-generic/io.h:693:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
readsb(PCI_IOBASE + addr, buffer, count);
~~~~~~~~~~ ^
include/asm-generic/io.h:701:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
readsw(PCI_IOBASE + addr, buffer, count);
~~~~~~~~~~ ^
include/asm-generic/io.h:709:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
readsl(PCI_IOBASE + addr, buffer, count);
~~~~~~~~~~ ^
include/asm-generic/io.h:718:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
writesb(PCI_IOBASE + addr, buffer, count);
~~~~~~~~~~ ^
include/asm-generic/io.h:727:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
writesw(PCI_IOBASE + addr, buffer, count);
~~~~~~~~~~ ^
include/asm-generic/io.h:736:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
writesl(PCI_IOBASE + addr, buffer, count);
~~~~~~~~~~ ^
12 warnings and 3 errors generated.
--
In file included from ipc/msg.c:30:
>> include/linux/mm.h:877:2: error: call to undeclared function 'vma_lock_init'; ISO C99 and later do not support implicit function declarations [-Werror,-Wimplicit-function-declaration]
vma_lock_init(&vma->vm_lock);
^
include/linux/mm.h:877:2: note: did you mean 'osq_lock_init'?
include/linux/osq_lock.h:23:20: note: 'osq_lock_init' declared here
static inline void osq_lock_init(struct optimistic_spin_queue *lock)
^
In file included from ipc/msg.c:30:
include/linux/mm.h:877:22: error: no member named 'vm_lock' in 'struct vm_area_struct'
vma_lock_init(&vma->vm_lock);
~~~ ^
include/linux/mm.h:878:7: error: no member named 'vm_lock_seq' in 'struct vm_area_struct'
vma->vm_lock_seq = UINT_MAX;
~~~ ^
In file included from ipc/msg.c:30:
In file included from include/linux/mm.h:1143:
In file included from include/linux/huge_mm.h:7:
In file included from include/linux/fs.h:33:
In file included from include/linux/percpu-rwsem.h:7:
In file included from include/linux/rcuwait.h:6:
In file included from include/linux/sched/signal.h:6:
include/linux/signal.h:98:11: warning: array index 3 is past the end of the array (which contains 1 element) [-Warray-bounds]
return (set->sig[3] | set->sig[2] |
^ ~
arch/x86/include/asm/signal.h:24:2: note: array 'sig' declared here
unsigned long sig[_NSIG_WORDS];
^
In file included from ipc/msg.c:30:
In file included from include/linux/mm.h:1143:
In file included from include/linux/huge_mm.h:7:
In file included from include/linux/fs.h:33:
In file included from include/linux/percpu-rwsem.h:7:
In file included from include/linux/rcuwait.h:6:
In file included from include/linux/sched/signal.h:6:
include/linux/signal.h:98:25: warning: array index 2 is past the end of the array (which contains 1 element) [-Warray-bounds]
return (set->sig[3] | set->sig[2] |
^ ~
arch/x86/include/asm/signal.h:24:2: note: array 'sig' declared here
unsigned long sig[_NSIG_WORDS];
^
In file included from ipc/msg.c:30:
In file included from include/linux/mm.h:1143:
In file included from include/linux/huge_mm.h:7:
In file included from include/linux/fs.h:33:
In file included from include/linux/percpu-rwsem.h:7:
In file included from include/linux/rcuwait.h:6:
In file included from include/linux/sched/signal.h:6:
include/linux/signal.h:99:4: warning: array index 1 is past the end of the array (which contains 1 element) [-Warray-bounds]
set->sig[1] | set->sig[0]) == 0;
^ ~
arch/x86/include/asm/signal.h:24:2: note: array 'sig' declared here
unsigned long sig[_NSIG_WORDS];
^
In file included from ipc/msg.c:30:
In file included from include/linux/mm.h:1143:
In file included from include/linux/huge_mm.h:7:
In file included from include/linux/fs.h:33:
In file included from include/linux/percpu-rwsem.h:7:
In file included from include/linux/rcuwait.h:6:
In file included from include/linux/sched/signal.h:6:
include/linux/signal.h:101:11: warning: array index 1 is past the end of the array (which contains 1 element) [-Warray-bounds]
return (set->sig[1] | set->sig[0]) == 0;
^ ~
arch/x86/include/asm/signal.h:24:2: note: array 'sig' declared here
unsigned long sig[_NSIG_WORDS];
^
In file included from ipc/msg.c:30:
In file included from include/linux/mm.h:1143:
In file included from include/linux/huge_mm.h:7:
In file included from include/linux/fs.h:33:
In file included from include/linux/percpu-rwsem.h:7:
In file included from include/linux/rcuwait.h:6:
In file included from include/linux/sched/signal.h:6:
include/linux/signal.h:114:11: warning: array index 3 is past the end of the array (which contains 1 element) [-Warray-bounds]
return (set1->sig[3] == set2->sig[3]) &&
^ ~
arch/x86/include/asm/signal.h:24:2: note: array 'sig' declared here
unsigned long sig[_NSIG_WORDS];
^
In file included from ipc/msg.c:30:
In file included from include/linux/mm.h:1143:
In file included from include/linux/huge_mm.h:7:
In file included from include/linux/fs.h:33:
In file included from include/linux/percpu-rwsem.h:7:
In file included from include/linux/rcuwait.h:6:
In file included from include/linux/sched/signal.h:6:
include/linux/signal.h:114:27: warning: array index 3 is past the end of the array (which contains 1 element) [-Warray-bounds]
return (set1->sig[3] == set2->sig[3]) &&
^ ~
arch/x86/include/asm/signal.h:24:2: note: array 'sig' declared here
unsigned long sig[_NSIG_WORDS];
^
In file included from ipc/msg.c:30:
In file included from include/linux/mm.h:1143:
In file included from include/linux/huge_mm.h:7:
In file included from include/linux/fs.h:33:
In file included from include/linux/percpu-rwsem.h:7:
In file included from include/linux/rcuwait.h:6:
In file included from include/linux/sched/signal.h:6:
include/linux/signal.h:115:5: warning: array index 2 is past the end of the array (which contains 1 element) [-Warray-bounds]
(set1->sig[2] == set2->sig[2]) &&
..
vim +/vma_lock_init +877 include/linux/mm.h
868
869 static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
870 {
871 memset(vma, 0, sizeof(*vma));
872 vma->vm_mm = mm;
873 vma->vm_ops = &vma_dummy_vm_ops;
874 INIT_LIST_HEAD(&vma->anon_vma_chain);
875 vma_mark_detached(vma, false);
876 vma_numab_state_init(vma);
> 877 vma_lock_init(&vma->vm_lock);
878 vma->vm_lock_seq = UINT_MAX;
879 }
880
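
For context, these errors presumably come from a config with CONFIG_PER_VMA_LOCK
disabled, where vm_area_struct has neither vm_lock nor vm_lock_seq and
vma_lock_init() is not declared. One plausible fix -- shown here only as a sketch,
not necessarily what the next revision of the series does -- is to guard the new
initialization in vma_init():

static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
{
        memset(vma, 0, sizeof(*vma));
        vma->vm_mm = mm;
        vma->vm_ops = &vma_dummy_vm_ops;
        INIT_LIST_HEAD(&vma->anon_vma_chain);
        vma_mark_detached(vma, false);
        vma_numab_state_init(vma);
#ifdef CONFIG_PER_VMA_LOCK
        /* these members only exist when per-VMA locking is configured in */
        vma_lock_init(&vma->vm_lock);
        vma->vm_lock_seq = UINT_MAX;
#endif
}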
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH 4/4] mm: move lesser used vma_area_struct members into the last cacheline
2024-11-11 20:55 ` [PATCH 4/4] mm: move lesser used vma_area_struct members into the last cacheline Suren Baghdasaryan
@ 2024-11-12 10:07 ` Vlastimil Babka
2024-11-12 15:45 ` Suren Baghdasaryan
0 siblings, 1 reply; 49+ messages in thread
From: Vlastimil Babka @ 2024-11-12 10:07 UTC (permalink / raw)
To: Suren Baghdasaryan, akpm
Cc: willy, liam.howlett, lorenzo.stoakes, mhocko, hannes, mjguzik,
oliver.sang, mgorman, david, peterx, oleg, dave, paulmck,
brauner, dhowells, hdanton, hughd, minchan, jannh, shakeel.butt,
souravpanda, pasha.tatashin, linux-mm, linux-kernel, kernel-team
On 11/11/24 21:55, Suren Baghdasaryan wrote:
> Move several vma_area_struct members which are rarely or never used
> during page fault handling into the last cacheline to better pack
> vm_area_struct. As a result vm_area_struct will fit into 3 cachelines
> as opposed to 4 cachelines before this change. New vm_area_struct layout:
>
> struct vm_area_struct {
> union {
> struct {
> long unsigned int vm_start; /* 0 8 */
> long unsigned int vm_end; /* 8 8 */
> }; /* 0 16 */
> struct callback_head vm_rcu ; /* 0 16 */
> } __attribute__((__aligned__(8))); /* 0 16 */
> struct mm_struct * vm_mm; /* 16 8 */
> pgprot_t vm_page_prot; /* 24 8 */
> union {
> const vm_flags_t vm_flags; /* 32 8 */
> vm_flags_t __vm_flags; /* 32 8 */
> }; /* 32 8 */
> bool detached; /* 40 1 */
>
> /* XXX 3 bytes hole, try to pack */
>
> unsigned int vm_lock_seq; /* 44 4 */
> struct list_head anon_vma_chain; /* 48 16 */
> /* --- cacheline 1 boundary (64 bytes) --- */
> struct anon_vma * anon_vma; /* 64 8 */
> const struct vm_operations_struct * vm_ops; /* 72 8 */
> long unsigned int vm_pgoff; /* 80 8 */
> struct file * vm_file; /* 88 8 */
> void * vm_private_data; /* 96 8 */
> atomic_long_t swap_readahead_info; /* 104 8 */
> struct mempolicy * vm_policy; /* 112 8 */
>
> /* XXX 8 bytes hole, try to pack */
>
> /* --- cacheline 2 boundary (128 bytes) --- */
> struct vma_lock vm_lock (__aligned__(64)); /* 128 4 */
>
> /* XXX 4 bytes hole, try to pack */
>
> struct {
> struct rb_node rb (__aligned__(8)); /* 136 24 */
> long unsigned int rb_subtree_last; /* 160 8 */
> } __attribute__((__aligned__(8))) shared; /* 136 32 */
> struct vm_userfaultfd_ctx vm_userfaultfd_ctx; /* 168 0 */
I don't see anon_name in the output, I thought it was added for Android? :)
>
> /* size: 192, cachelines: 3, members: 17 */
> /* sum members: 153, holes: 3, sum holes: 15 */
> /* padding: 24 */
Instead you seem to have padding, so an attempt to use SLAB_TYPESAFE_BY_RCU
should use that and not grow the object further, up to 256 bytes.
Perhaps this pahole output wasn't generated with a fully representative config?
> /* forced alignments: 3, forced holes: 2, sum forced holes: 12 */
> } __attribute__((__aligned__(64)));
>
>
> Memory consumption per 1000 VMAs becomes 48 pages:
>
> slabinfo after vm_area_struct changes:
> <name> ... <objsize> <objperslab> <pagesperslab> : ...
> vm_area_struct ... 192 42 2 : ...
>
>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
> include/linux/mm_types.h | 37 ++++++++++++++++++-------------------
> 1 file changed, 18 insertions(+), 19 deletions(-)
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 789bccc05520..c3755b680911 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -733,16 +733,6 @@ struct vm_area_struct {
> unsigned int vm_lock_seq;
> #endif
>
> - /*
> - * For areas with an address space and backing store,
> - * linkage into the address_space->i_mmap interval tree.
> - *
> - */
> - struct {
> - struct rb_node rb;
> - unsigned long rb_subtree_last;
> - } shared;
> -
> /*
> * A file's MAP_PRIVATE vma can be in both i_mmap tree and anon_vma
> * list, after a COW of one of the file pages. A MAP_SHARED vma
> @@ -762,14 +752,6 @@ struct vm_area_struct {
> struct file * vm_file; /* File we map to (can be NULL). */
> void * vm_private_data; /* was vm_pte (shared mem) */
>
> -#ifdef CONFIG_ANON_VMA_NAME
> - /*
> - * For private and shared anonymous mappings, a pointer to a null
> - * terminated string containing the name given to the vma, or NULL if
> - * unnamed. Serialized by mmap_lock. Use anon_vma_name to access.
> - */
> - struct anon_vma_name *anon_name;
> -#endif
> #ifdef CONFIG_SWAP
> atomic_long_t swap_readahead_info;
> #endif
> @@ -782,11 +764,28 @@ struct vm_area_struct {
> #ifdef CONFIG_NUMA_BALANCING
> struct vma_numab_state *numab_state; /* NUMA Balancing state */
> #endif
> - struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
> #ifdef CONFIG_PER_VMA_LOCK
> /* Unstable RCU readers are allowed to read this. */
> struct vma_lock vm_lock ____cacheline_aligned_in_smp;
> #endif
> + /*
> + * For areas with an address space and backing store,
> + * linkage into the address_space->i_mmap interval tree.
> + *
> + */
> + struct {
> + struct rb_node rb;
> + unsigned long rb_subtree_last;
> + } shared;
> +#ifdef CONFIG_ANON_VMA_NAME
> + /*
> + * For private and shared anonymous mappings, a pointer to a null
> + * terminated string containing the name given to the vma, or NULL if
> + * unnamed. Serialized by mmap_lock. Use anon_vma_name to access.
> + */
> + struct anon_vma_name *anon_name;
> +#endif
> + struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
> } __randomize_layout;
>
> #ifdef CONFIG_NUMA
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH 2/4] mm: move per-vma lock into vm_area_struct
2024-11-11 20:55 ` [PATCH 2/4] mm: move per-vma lock into vm_area_struct Suren Baghdasaryan
2024-11-12 9:36 ` kernel test robot
2024-11-12 9:47 ` kernel test robot
@ 2024-11-12 10:18 ` kernel test robot
2024-11-12 15:57 ` Vlastimil Babka
3 siblings, 0 replies; 49+ messages in thread
From: kernel test robot @ 2024-11-12 10:18 UTC (permalink / raw)
To: Suren Baghdasaryan, akpm
Cc: llvm, oe-kbuild-all, willy, liam.howlett, lorenzo.stoakes,
mhocko, vbabka, hannes, mjguzik, oliver.sang, mgorman, david,
peterx, oleg, dave, paulmck, brauner, dhowells, hdanton, hughd,
minchan, jannh, shakeel.butt, souravpanda, pasha.tatashin,
linux-mm, linux-kernel, kernel-team, surenb
Hi Suren,
kernel test robot noticed the following build warnings:
[auto build test WARNING on 931086f2a88086319afb57cd3925607e8cda0a9f]
url: https://github.com/intel-lab-lkp/linux/commits/Suren-Baghdasaryan/mm-introduce-vma_start_read_locked-_nested-helpers/20241112-050531
base: 931086f2a88086319afb57cd3925607e8cda0a9f
patch link: https://lore.kernel.org/r/20241111205506.3404479-3-surenb%40google.com
patch subject: [PATCH 2/4] mm: move per-vma lock into vm_area_struct
config: hexagon-randconfig-002-20241112 (https://download.01.org/0day-ci/archive/20241112/202411121840.hE2wZKgE-lkp@intel.com/config)
compiler: clang version 20.0.0git (https://github.com/llvm/llvm-project 592c0fe55f6d9a811028b5f3507be91458ab2713)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241112/202411121840.hE2wZKgE-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202411121840.hE2wZKgE-lkp@intel.com/
All warnings (new ones prefixed by >>):
| ^
include/uapi/asm-generic/signal.h:62:2: note: array 'sig' declared here
62 | unsigned long sig[_NSIG_WORDS];
| ^
In file included from kernel/dma/direct.c:7:
In file included from include/linux/memblock.h:12:
In file included from include/linux/mm.h:1143:
In file included from include/linux/huge_mm.h:7:
In file included from include/linux/fs.h:33:
In file included from include/linux/percpu-rwsem.h:7:
In file included from include/linux/rcuwait.h:6:
In file included from include/linux/sched/signal.h:6:
include/linux/signal.h:187:1: warning: array index 3 is past the end of the array (that has type 'unsigned long[2]') [-Warray-bounds]
187 | _SIG_SET_OP(signotset, _sig_not)
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
include/linux/signal.h:174:10: note: expanded from macro '_SIG_SET_OP'
174 | case 4: set->sig[3] = op(set->sig[3]); \
| ^ ~
include/uapi/asm-generic/signal.h:62:2: note: array 'sig' declared here
62 | unsigned long sig[_NSIG_WORDS];
| ^
In file included from kernel/dma/direct.c:7:
In file included from include/linux/memblock.h:12:
In file included from include/linux/mm.h:1143:
In file included from include/linux/huge_mm.h:7:
In file included from include/linux/fs.h:33:
In file included from include/linux/percpu-rwsem.h:7:
In file included from include/linux/rcuwait.h:6:
In file included from include/linux/sched/signal.h:6:
include/linux/signal.h:187:1: warning: array index 2 is past the end of the array (that has type 'unsigned long[2]') [-Warray-bounds]
187 | _SIG_SET_OP(signotset, _sig_not)
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
include/linux/signal.h:175:20: note: expanded from macro '_SIG_SET_OP'
175 | set->sig[2] = op(set->sig[2]); \
| ^ ~
include/linux/signal.h:186:24: note: expanded from macro '_sig_not'
186 | #define _sig_not(x) (~(x))
| ^
include/uapi/asm-generic/signal.h:62:2: note: array 'sig' declared here
62 | unsigned long sig[_NSIG_WORDS];
| ^
In file included from kernel/dma/direct.c:7:
In file included from include/linux/memblock.h:12:
In file included from include/linux/mm.h:1143:
In file included from include/linux/huge_mm.h:7:
In file included from include/linux/fs.h:33:
In file included from include/linux/percpu-rwsem.h:7:
In file included from include/linux/rcuwait.h:6:
In file included from include/linux/sched/signal.h:6:
include/linux/signal.h:187:1: warning: array index 2 is past the end of the array (that has type 'unsigned long[2]') [-Warray-bounds]
187 | _SIG_SET_OP(signotset, _sig_not)
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
include/linux/signal.h:175:3: note: expanded from macro '_SIG_SET_OP'
175 | set->sig[2] = op(set->sig[2]); \
| ^ ~
include/uapi/asm-generic/signal.h:62:2: note: array 'sig' declared here
62 | unsigned long sig[_NSIG_WORDS];
| ^
In file included from kernel/dma/direct.c:7:
In file included from include/linux/memblock.h:12:
In file included from include/linux/mm.h:2234:
include/linux/vmstat.h:518:36: warning: arithmetic between different enumeration types ('enum node_stat_item' and 'enum lru_list') [-Wenum-enum-conversion]
518 | return node_stat_name(NR_LRU_BASE + lru) + 3; // skip "nr_"
| ~~~~~~~~~~~ ^ ~~~
In file included from kernel/dma/direct.c:7:
In file included from include/linux/memblock.h:13:
In file included from arch/hexagon/include/asm/dma.h:9:
In file included from arch/hexagon/include/asm/io.h:328:
include/asm-generic/io.h:548:31: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
548 | val = __raw_readb(PCI_IOBASE + addr);
| ~~~~~~~~~~ ^
include/asm-generic/io.h:561:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
561 | val = __le16_to_cpu((__le16 __force)__raw_readw(PCI_IOBASE + addr));
| ~~~~~~~~~~ ^
include/uapi/linux/byteorder/little_endian.h:37:51: note: expanded from macro '__le16_to_cpu'
37 | #define __le16_to_cpu(x) ((__force __u16)(__le16)(x))
| ^
In file included from kernel/dma/direct.c:7:
In file included from include/linux/memblock.h:13:
In file included from arch/hexagon/include/asm/dma.h:9:
In file included from arch/hexagon/include/asm/io.h:328:
include/asm-generic/io.h:574:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
574 | val = __le32_to_cpu((__le32 __force)__raw_readl(PCI_IOBASE + addr));
| ~~~~~~~~~~ ^
include/uapi/linux/byteorder/little_endian.h:35:51: note: expanded from macro '__le32_to_cpu'
35 | #define __le32_to_cpu(x) ((__force __u32)(__le32)(x))
| ^
In file included from kernel/dma/direct.c:7:
In file included from include/linux/memblock.h:13:
In file included from arch/hexagon/include/asm/dma.h:9:
In file included from arch/hexagon/include/asm/io.h:328:
include/asm-generic/io.h:585:33: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
585 | __raw_writeb(value, PCI_IOBASE + addr);
| ~~~~~~~~~~ ^
include/asm-generic/io.h:595:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
595 | __raw_writew((u16 __force)cpu_to_le16(value), PCI_IOBASE + addr);
| ~~~~~~~~~~ ^
include/asm-generic/io.h:605:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
605 | __raw_writel((u32 __force)cpu_to_le32(value), PCI_IOBASE + addr);
| ~~~~~~~~~~ ^
>> kernel/dma/direct.c:147:20: warning: shift count >= width of type [-Wshift-count-overflow]
146 | if (IS_ENABLED(CONFIG_ZONE_DMA32) &&
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
147 | phys_limit < DMA_BIT_MASK(64) &&
| ~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~
148 | !(gfp & (GFP_DMA32 | GFP_DMA))) {
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
include/linux/dma-mapping.h:77:54: note: expanded from macro 'DMA_BIT_MASK'
77 | #define DMA_BIT_MASK(n) (((n) == 64) ? ~0ULL : ((1ULL<<(n))-1))
| ^
include/linux/compiler.h:55:47: note: expanded from macro 'if'
55 | #define if(cond, ...) if ( __trace_if_var( !!(cond , ## __VA_ARGS__) ) )
| ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~
include/linux/compiler.h:57:52: note: expanded from macro '__trace_if_var'
57 | #define __trace_if_var(cond) (__builtin_constant_p(cond) ? (cond) : __trace_if_value(cond))
| ^~~~
>> kernel/dma/direct.c:147:20: warning: shift count >= width of type [-Wshift-count-overflow]
146 | if (IS_ENABLED(CONFIG_ZONE_DMA32) &&
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
147 | phys_limit < DMA_BIT_MASK(64) &&
| ~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~
148 | !(gfp & (GFP_DMA32 | GFP_DMA))) {
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
include/linux/dma-mapping.h:77:54: note: expanded from macro 'DMA_BIT_MASK'
77 | #define DMA_BIT_MASK(n) (((n) == 64) ? ~0ULL : ((1ULL<<(n))-1))
| ^
include/linux/compiler.h:55:47: note: expanded from macro 'if'
55 | #define if(cond, ...) if ( __trace_if_var( !!(cond , ## __VA_ARGS__) ) )
| ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~
include/linux/compiler.h:57:61: note: expanded from macro '__trace_if_var'
57 | #define __trace_if_var(cond) (__builtin_constant_p(cond) ? (cond) : __trace_if_value(cond))
| ^~~~
>> kernel/dma/direct.c:147:20: warning: shift count >= width of type [-Wshift-count-overflow]
146 | if (IS_ENABLED(CONFIG_ZONE_DMA32) &&
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
147 | phys_limit < DMA_BIT_MASK(64) &&
| ~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~
148 | !(gfp & (GFP_DMA32 | GFP_DMA))) {
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
include/linux/dma-mapping.h:77:54: note: expanded from macro 'DMA_BIT_MASK'
77 | #define DMA_BIT_MASK(n) (((n) == 64) ? ~0ULL : ((1ULL<<(n))-1))
| ^
include/linux/compiler.h:55:47: note: expanded from macro 'if'
55 | #define if(cond, ...) if ( __trace_if_var( !!(cond , ## __VA_ARGS__) ) )
| ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~
include/linux/compiler.h:57:86: note: expanded from macro '__trace_if_var'
57 | #define __trace_if_var(cond) (__builtin_constant_p(cond) ? (cond) : __trace_if_value(cond))
| ~~~~~~~~~~~~~~~~~^~~~~
include/linux/compiler.h:68:3: note: expanded from macro '__trace_if_value'
68 | (cond) ? \
| ^~~~
38 warnings and 3 errors generated.
--
In file included from drivers/iio/adc/fsl-imx25-gcq.c:12:
In file included from include/linux/interrupt.h:11:
In file included from include/linux/hardirq.h:11:
In file included from ./arch/hexagon/include/generated/asm/hardirq.h:1:
In file included from include/asm-generic/hardirq.h:17:
In file included from include/linux/irq.h:20:
In file included from include/linux/io.h:14:
In file included from arch/hexagon/include/asm/io.h:328:
include/asm-generic/io.h:548:31: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
548 | val = __raw_readb(PCI_IOBASE + addr);
| ~~~~~~~~~~ ^
include/asm-generic/io.h:561:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
561 | val = __le16_to_cpu((__le16 __force)__raw_readw(PCI_IOBASE + addr));
| ~~~~~~~~~~ ^
include/uapi/linux/byteorder/little_endian.h:37:51: note: expanded from macro '__le16_to_cpu'
37 | #define __le16_to_cpu(x) ((__force __u16)(__le16)(x))
| ^
In file included from drivers/iio/adc/fsl-imx25-gcq.c:12:
In file included from include/linux/interrupt.h:11:
In file included from include/linux/hardirq.h:11:
In file included from ./arch/hexagon/include/generated/asm/hardirq.h:1:
In file included from include/asm-generic/hardirq.h:17:
In file included from include/linux/irq.h:20:
In file included from include/linux/io.h:14:
In file included from arch/hexagon/include/asm/io.h:328:
include/asm-generic/io.h:574:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
574 | val = __le32_to_cpu((__le32 __force)__raw_readl(PCI_IOBASE + addr));
| ~~~~~~~~~~ ^
include/uapi/linux/byteorder/little_endian.h:35:51: note: expanded from macro '__le32_to_cpu'
35 | #define __le32_to_cpu(x) ((__force __u32)(__le32)(x))
| ^
In file included from drivers/iio/adc/fsl-imx25-gcq.c:12:
In file included from include/linux/interrupt.h:11:
In file included from include/linux/hardirq.h:11:
In file included from ./arch/hexagon/include/generated/asm/hardirq.h:1:
In file included from include/asm-generic/hardirq.h:17:
In file included from include/linux/irq.h:20:
In file included from include/linux/io.h:14:
In file included from arch/hexagon/include/asm/io.h:328:
include/asm-generic/io.h:585:33: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
585 | __raw_writeb(value, PCI_IOBASE + addr);
| ~~~~~~~~~~ ^
include/asm-generic/io.h:595:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
595 | __raw_writew((u16 __force)cpu_to_le16(value), PCI_IOBASE + addr);
| ~~~~~~~~~~ ^
include/asm-generic/io.h:605:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
605 | __raw_writel((u32 __force)cpu_to_le32(value), PCI_IOBASE + addr);
| ~~~~~~~~~~ ^
In file included from drivers/iio/adc/fsl-imx25-gcq.c:19:
In file included from include/linux/regulator/consumer.h:35:
In file included from include/linux/suspend.h:5:
In file included from include/linux/swap.h:9:
In file included from include/linux/memcontrol.h:21:
include/linux/mm.h:877:2: error: call to undeclared function 'vma_lock_init'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
877 | vma_lock_init(&vma->vm_lock);
| ^
include/linux/mm.h:877:2: note: did you mean 'osq_lock_init'?
include/linux/osq_lock.h:23:20: note: 'osq_lock_init' declared here
23 | static inline void osq_lock_init(struct optimistic_spin_queue *lock)
| ^
In file included from drivers/iio/adc/fsl-imx25-gcq.c:19:
In file included from include/linux/regulator/consumer.h:35:
In file included from include/linux/suspend.h:5:
In file included from include/linux/swap.h:9:
In file included from include/linux/memcontrol.h:21:
include/linux/mm.h:877:22: error: no member named 'vm_lock' in 'struct vm_area_struct'
877 | vma_lock_init(&vma->vm_lock);
| ~~~ ^
include/linux/mm.h:878:7: error: no member named 'vm_lock_seq' in 'struct vm_area_struct'
878 | vma->vm_lock_seq = UINT_MAX;
| ~~~ ^
In file included from drivers/iio/adc/fsl-imx25-gcq.c:19:
In file included from include/linux/regulator/consumer.h:35:
In file included from include/linux/suspend.h:5:
In file included from include/linux/swap.h:9:
In file included from include/linux/memcontrol.h:21:
In file included from include/linux/mm.h:2234:
include/linux/vmstat.h:518:36: warning: arithmetic between different enumeration types ('enum node_stat_item' and 'enum lru_list') [-Wenum-enum-conversion]
518 | return node_stat_name(NR_LRU_BASE + lru) + 3; // skip "nr_"
| ~~~~~~~~~~~ ^ ~~~
>> drivers/iio/adc/fsl-imx25-gcq.c:116:8: warning: shift count is negative [-Wshift-count-negative]
116 | MX25_ADCQ_ITEM(0, chan->channel));
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
include/linux/mfd/imx25-tsadc.h:54:3: note: expanded from macro 'MX25_ADCQ_ITEM'
54 | _MX25_ADCQ_ITEM((item) - 8, (x)) : _MX25_ADCQ_ITEM((item), (x)))
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
include/linux/mfd/imx25-tsadc.h:52:39: note: expanded from macro '_MX25_ADCQ_ITEM'
52 | #define _MX25_ADCQ_ITEM(item, x) ((x) << ((item) * 4))
| ^ ~~~~~~~~~~~~
8 warnings and 3 errors generated.
vim +147 kernel/dma/direct.c
aea7e2a86a94b25 kernel/dma/direct.c Christoph Hellwig 2021-10-21 117
26749b3201ab05e kernel/dma/direct.c Christoph Hellwig 2020-06-15 118 static struct page *__dma_direct_alloc_pages(struct device *dev, size_t size,
92826e967535db2 kernel/dma/direct.c Christoph Hellwig 2022-04-23 119 gfp_t gfp, bool allow_highmem)
a8463d4b0e47d1f lib/dma-noop.c Christian Borntraeger 2016-02-02 120 {
90ae409f9eb3bca kernel/dma/direct.c Christoph Hellwig 2019-08-20 121 int node = dev_to_node(dev);
080321d3b3139b3 lib/dma-direct.c Christoph Hellwig 2017-12-22 122 struct page *page = NULL;
a7ba70f1787f977 kernel/dma/direct.c Nicolas Saenz Julienne 2019-11-21 123 u64 phys_limit;
a8463d4b0e47d1f lib/dma-noop.c Christian Borntraeger 2016-02-02 124
633d5fce78a61e8 kernel/dma/direct.c David Rientjes 2020-06-11 125 WARN_ON_ONCE(!PAGE_ALIGNED(size));
633d5fce78a61e8 kernel/dma/direct.c David Rientjes 2020-06-11 126
aea7e2a86a94b25 kernel/dma/direct.c Christoph Hellwig 2021-10-21 127 if (is_swiotlb_for_alloc(dev))
aea7e2a86a94b25 kernel/dma/direct.c Christoph Hellwig 2021-10-21 128 return dma_direct_alloc_swiotlb(dev, size);
aea7e2a86a94b25 kernel/dma/direct.c Christoph Hellwig 2021-10-21 129
25a4ce564921db0 kernel/dma/direct.c Petr Tesarik 2023-02-20 130 gfp |= dma_direct_optimal_gfp_mask(dev, &phys_limit);
633d5fce78a61e8 kernel/dma/direct.c David Rientjes 2020-06-11 131 page = dma_alloc_contiguous(dev, size, gfp);
92826e967535db2 kernel/dma/direct.c Christoph Hellwig 2022-04-23 132 if (page) {
92826e967535db2 kernel/dma/direct.c Christoph Hellwig 2022-04-23 133 if (!dma_coherent_ok(dev, page_to_phys(page), size) ||
92826e967535db2 kernel/dma/direct.c Christoph Hellwig 2022-04-23 134 (!allow_highmem && PageHighMem(page))) {
633d5fce78a61e8 kernel/dma/direct.c David Rientjes 2020-06-11 135 dma_free_contiguous(dev, page, size);
90ae409f9eb3bca kernel/dma/direct.c Christoph Hellwig 2019-08-20 136 page = NULL;
90ae409f9eb3bca kernel/dma/direct.c Christoph Hellwig 2019-08-20 137 }
92826e967535db2 kernel/dma/direct.c Christoph Hellwig 2022-04-23 138 }
95f183916d4b0bc lib/dma-direct.c Christoph Hellwig 2018-01-09 139 again:
90ae409f9eb3bca kernel/dma/direct.c Christoph Hellwig 2019-08-20 140 if (!page)
633d5fce78a61e8 kernel/dma/direct.c David Rientjes 2020-06-11 141 page = alloc_pages_node(node, gfp, get_order(size));
95f183916d4b0bc lib/dma-direct.c Christoph Hellwig 2018-01-09 142 if (page && !dma_coherent_ok(dev, page_to_phys(page), size)) {
f689a3ab7b8ece9 kernel/dma/direct.c Chen Yu 2024-08-31 143 __free_pages(page, get_order(size));
95f183916d4b0bc lib/dma-direct.c Christoph Hellwig 2018-01-09 144 page = NULL;
95f183916d4b0bc lib/dma-direct.c Christoph Hellwig 2018-01-09 145
de7eab301de7886 lib/dma-direct.c Takashi Iwai 2018-04-16 146 if (IS_ENABLED(CONFIG_ZONE_DMA32) &&
a7ba70f1787f977 kernel/dma/direct.c Nicolas Saenz Julienne 2019-11-21 @147 phys_limit < DMA_BIT_MASK(64) &&
de7eab301de7886 lib/dma-direct.c Takashi Iwai 2018-04-16 148 !(gfp & (GFP_DMA32 | GFP_DMA))) {
de7eab301de7886 lib/dma-direct.c Takashi Iwai 2018-04-16 149 gfp |= GFP_DMA32;
de7eab301de7886 lib/dma-direct.c Takashi Iwai 2018-04-16 150 goto again;
de7eab301de7886 lib/dma-direct.c Takashi Iwai 2018-04-16 151 }
de7eab301de7886 lib/dma-direct.c Takashi Iwai 2018-04-16 152
fbce251baa6e357 kernel/dma/direct.c Christoph Hellwig 2019-02-13 153 if (IS_ENABLED(CONFIG_ZONE_DMA) && !(gfp & GFP_DMA)) {
95f183916d4b0bc lib/dma-direct.c Christoph Hellwig 2018-01-09 154 gfp = (gfp & ~GFP_DMA32) | GFP_DMA;
95f183916d4b0bc lib/dma-direct.c Christoph Hellwig 2018-01-09 155 goto again;
95f183916d4b0bc lib/dma-direct.c Christoph Hellwig 2018-01-09 156 }
95f183916d4b0bc lib/dma-direct.c Christoph Hellwig 2018-01-09 157 }
95f183916d4b0bc lib/dma-direct.c Christoph Hellwig 2018-01-09 158
b18814e767a4455 kernel/dma/direct.c Christoph Hellwig 2018-11-04 159 return page;
b18814e767a4455 kernel/dma/direct.c Christoph Hellwig 2018-11-04 160 }
b18814e767a4455 kernel/dma/direct.c Christoph Hellwig 2018-11-04 161
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH 3/4] mm: replace rw_semaphore with atomic_t in vma_lock
2024-11-12 4:58 ` Matthew Wilcox
@ 2024-11-12 15:18 ` Suren Baghdasaryan
2024-12-10 22:38 ` Peter Zijlstra
0 siblings, 1 reply; 49+ messages in thread
From: Suren Baghdasaryan @ 2024-11-12 15:18 UTC (permalink / raw)
To: Matthew Wilcox
Cc: akpm, liam.howlett, lorenzo.stoakes, mhocko, vbabka, hannes,
mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, minchan, jannh,
shakeel.butt, souravpanda, pasha.tatashin, linux-mm,
linux-kernel, kernel-team
On Mon, Nov 11, 2024 at 8:58 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Nov 11, 2024 at 12:55:05PM -0800, Suren Baghdasaryan wrote:
> > When a reader takes read lock, it increments the atomic, unless the
> > top two bits are set indicating a writer is present.
> > When writer takes write lock, it sets VMA_LOCK_WR_LOCKED bit if there
> > are no readers or VMA_LOCK_WR_WAIT bit if readers are holding the lock
> > and puts itself onto newly introduced mm.vma_writer_wait. Since all
> > writers take mmap_lock in write mode first, there can be only one writer
> > at a time. The last reader to release the lock will signal the writer
> > to wake up.
>
> I don't think you need two bits. You can do it this way:
>
> 0x8000'0000 - No readers, no writers
> 0x1-7fff'ffff - Some number of readers
> 0x0 - Writer held
> 0x8000'0001-0xffff'ffff - Reader held, writer waiting
>
> A prospective writer subtracts 0x8000'0000. If the result is 0, it got
> the lock, otherwise it sleeps until it is 0.
>
> A writer unlocks by adding 0x8000'0000 (not by setting the value to
> 0x8000'0000).
>
> A reader unlocks by adding 1. If the result is 0, it wakes the writer.
>
> A prospective reader subtracts 1. If the result is positive, it got the
> lock, otherwise it does the unlock above (this might be the one which
> wakes the writer).
>
> And ... that's it. See how we use the CPU arithmetic flags to tell us
> everything we need to know without doing arithmetic separately?
Yes, this is neat! You are using the fact that write-locked == no
readers to eliminate unnecessary state. I'll give that a try. Thanks!
>
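For readers following the thread, below is a minimal, self-contained sketch of the
single-counter scheme Willy describes above. It is userspace C11 with made-up names,
and it spins where the kernel version would sleep on a waitqueue; it is not the code
that ended up in the series.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define NO_READERS_NO_WRITERS 0x80000000u

struct vma_count_lock { atomic_uint cnt; };

static void vcl_init(struct vma_count_lock *l)
{
        atomic_init(&l->cnt, NO_READERS_NO_WRITERS);
}

/* Reader unlock: add 1; reaching 0 means a writer is waiting and must be woken. */
static void vcl_read_unlock(struct vma_count_lock *l)
{
        if (atomic_fetch_add(&l->cnt, 1) + 1 == 0) {
                /* last reader with a writer waiting: wake the writer here */
        }
}

/* Prospective reader: subtract 1; a "positive" result means some number of readers. */
static bool vcl_read_trylock(struct vma_count_lock *l)
{
        uint32_t newv = atomic_fetch_sub(&l->cnt, 1) - 1;

        if (newv != 0 && newv < NO_READERS_NO_WRITERS)
                return true;
        vcl_read_unlock(l);     /* undo; this may be the unlock that wakes the writer */
        return false;
}

/* Writer: subtract 0x8000'0000, then wait for the remaining readers to drain. */
static void vcl_write_lock(struct vma_count_lock *l)
{
        if (atomic_fetch_sub(&l->cnt, NO_READERS_NO_WRITERS) != NO_READERS_NO_WRITERS)
                while (atomic_load(&l->cnt) != 0)
                        ;       /* the kernel version would sleep here */
}

/* Writer unlock: add the bias back rather than storing it, as described above. */
static void vcl_write_unlock(struct vma_count_lock *l)
{
        atomic_fetch_add(&l->cnt, NO_READERS_NO_WRITERS);
}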
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH 0/4] move per-vma lock into vm_area_struct
2024-11-12 9:39 ` Lorenzo Stoakes
@ 2024-11-12 15:38 ` Suren Baghdasaryan
0 siblings, 0 replies; 49+ messages in thread
From: Suren Baghdasaryan @ 2024-11-12 15:38 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: akpm, willy, liam.howlett, mhocko, vbabka, hannes, mjguzik,
oliver.sang, mgorman, david, peterx, oleg, dave, paulmck,
brauner, dhowells, hdanton, hughd, minchan, jannh, shakeel.butt,
souravpanda, pasha.tatashin, linux-mm, linux-kernel, kernel-team
On Tue, Nov 12, 2024 at 1:39 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Mon, Nov 11, 2024 at 12:55:02PM -0800, Suren Baghdasaryan wrote:
> > Back when per-vma locks were introduces, vm_lock was moved out of
> > vm_area_struct in [1] because of the performance regression caused by
> > false cacheline sharing. Recent investigation [2] revealed that the
> > regressions is limited to a rather old Broadwell microarchitecture and
> > even there it can be mitigated by disabling adjacent cacheline
> > prefetching, see [3].
>
> Sorry to nag + add workload, but could you also add changes to the
> Documentation/mm/process_addrs.rst file to reflect this as part of the
> series? It'd be really cool to have the change as part of your series, so
> when it lands we update the documentation with it.
>
> This change fundamentally changes internals that are documented there so it
> makes sense to :)
Yep, I already prepared the documentation patch to be included in v2.
Sorry for missing it the first time around and thanks for the
reminder.
>
> Thanks!
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH 4/4] mm: move lesser used vma_area_struct members into the last cacheline
2024-11-12 10:07 ` Vlastimil Babka
@ 2024-11-12 15:45 ` Suren Baghdasaryan
0 siblings, 0 replies; 49+ messages in thread
From: Suren Baghdasaryan @ 2024-11-12 15:45 UTC (permalink / raw)
To: Vlastimil Babka
Cc: akpm, willy, liam.howlett, lorenzo.stoakes, mhocko, hannes,
mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, minchan, jannh,
shakeel.butt, souravpanda, pasha.tatashin, linux-mm,
linux-kernel, kernel-team
On Tue, Nov 12, 2024 at 2:07 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 11/11/24 21:55, Suren Baghdasaryan wrote:
> > Move several vma_area_struct members which are rarely or never used
> > during page fault handling into the last cacheline to better pack
> > vm_area_struct. As a result vm_area_struct will fit into 3 cachelines
> > as opposed to 4 cachelines before this change. New vm_area_struct layout:
> >
> > struct vm_area_struct {
> > union {
> > struct {
> > long unsigned int vm_start; /* 0 8 */
> > long unsigned int vm_end; /* 8 8 */
> > }; /* 0 16 */
> > struct callback_head vm_rcu ; /* 0 16 */
> > } __attribute__((__aligned__(8))); /* 0 16 */
> > struct mm_struct * vm_mm; /* 16 8 */
> > pgprot_t vm_page_prot; /* 24 8 */
> > union {
> > const vm_flags_t vm_flags; /* 32 8 */
> > vm_flags_t __vm_flags; /* 32 8 */
> > }; /* 32 8 */
> > bool detached; /* 40 1 */
> >
> > /* XXX 3 bytes hole, try to pack */
> >
> > unsigned int vm_lock_seq; /* 44 4 */
> > struct list_head anon_vma_chain; /* 48 16 */
> > /* --- cacheline 1 boundary (64 bytes) --- */
> > struct anon_vma * anon_vma; /* 64 8 */
> > const struct vm_operations_struct * vm_ops; /* 72 8 */
> > long unsigned int vm_pgoff; /* 80 8 */
> > struct file * vm_file; /* 88 8 */
> > void * vm_private_data; /* 96 8 */
> > atomic_long_t swap_readahead_info; /* 104 8 */
> > struct mempolicy * vm_policy; /* 112 8 */
> >
> > /* XXX 8 bytes hole, try to pack */
> >
> > /* --- cacheline 2 boundary (128 bytes) --- */
> > struct vma_lock vm_lock (__aligned__(64)); /* 128 4 */
> >
> > /* XXX 4 bytes hole, try to pack */
> >
> > struct {
> > struct rb_node rb (__aligned__(8)); /* 136 24 */
> > long unsigned int rb_subtree_last; /* 160 8 */
> > } __attribute__((__aligned__(8))) shared; /* 136 32 */
> > struct vm_userfaultfd_ctx vm_userfaultfd_ctx; /* 168 0 */
>
> I don't see anon_name in the output, I thought it was added for Android? :)
Yes, this output is generated with defconfig. That's why you see some
holes in this structure. On my x86 machine I have non-zero
vm_userfaultfd_ctx and numab_state, on Android I have
vm_userfaultfd_ctx and anon_name.
>
> >
> > /* size: 192, cachelines: 3, members: 17 */
> > /* sum members: 153, holes: 3, sum holes: 15 */
> > /* padding: 24 */
>
> Instead you seem to have padding, so an attempt to use SLAB_TYPESAFE_BY_RCU
> should use that and not grow the object further, up to 256 bytes.
Yes, thanks for the tip about SLAB_TYPESAFE_BY_RCU freelist. In actual
configurations where I saw SLAB_TYPESAFE_BY_RCU causing this structure
to grow I had less padding at the end.
> Perhaps this pahole output wasn't generated with a fully representative config?
You are right. I'll replace it with the actual output from my x86
setup (Android probably has a smaller interested audience).
>
> > /* forced alignments: 3, forced holes: 2, sum forced holes: 12 */
> > } __attribute__((__aligned__(64)));
> >
> >
> > Memory consumption per 1000 VMAs becomes 48 pages:
> >
> > slabinfo after vm_area_struct changes:
> > <name> ... <objsize> <objperslab> <pagesperslab> : ...
> > vm_area_struct ... 192 42 2 : ...
> >
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > ---
> > include/linux/mm_types.h | 37 ++++++++++++++++++-------------------
> > 1 file changed, 18 insertions(+), 19 deletions(-)
> >
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index 789bccc05520..c3755b680911 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -733,16 +733,6 @@ struct vm_area_struct {
> > unsigned int vm_lock_seq;
> > #endif
> >
> > - /*
> > - * For areas with an address space and backing store,
> > - * linkage into the address_space->i_mmap interval tree.
> > - *
> > - */
> > - struct {
> > - struct rb_node rb;
> > - unsigned long rb_subtree_last;
> > - } shared;
> > -
> > /*
> > * A file's MAP_PRIVATE vma can be in both i_mmap tree and anon_vma
> > * list, after a COW of one of the file pages. A MAP_SHARED vma
> > @@ -762,14 +752,6 @@ struct vm_area_struct {
> > struct file * vm_file; /* File we map to (can be NULL). */
> > void * vm_private_data; /* was vm_pte (shared mem) */
> >
> > -#ifdef CONFIG_ANON_VMA_NAME
> > - /*
> > - * For private and shared anonymous mappings, a pointer to a null
> > - * terminated string containing the name given to the vma, or NULL if
> > - * unnamed. Serialized by mmap_lock. Use anon_vma_name to access.
> > - */
> > - struct anon_vma_name *anon_name;
> > -#endif
> > #ifdef CONFIG_SWAP
> > atomic_long_t swap_readahead_info;
> > #endif
> > @@ -782,11 +764,28 @@ struct vm_area_struct {
> > #ifdef CONFIG_NUMA_BALANCING
> > struct vma_numab_state *numab_state; /* NUMA Balancing state */
> > #endif
> > - struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
> > #ifdef CONFIG_PER_VMA_LOCK
> > /* Unstable RCU readers are allowed to read this. */
> > struct vma_lock vm_lock ____cacheline_aligned_in_smp;
> > #endif
> > + /*
> > + * For areas with an address space and backing store,
> > + * linkage into the address_space->i_mmap interval tree.
> > + *
> > + */
> > + struct {
> > + struct rb_node rb;
> > + unsigned long rb_subtree_last;
> > + } shared;
> > +#ifdef CONFIG_ANON_VMA_NAME
> > + /*
> > + * For private and shared anonymous mappings, a pointer to a null
> > + * terminated string containing the name given to the vma, or NULL if
> > + * unnamed. Serialized by mmap_lock. Use anon_vma_name to access.
> > + */
> > + struct anon_vma_name *anon_name;
> > +#endif
> > + struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
> > } __randomize_layout;
> >
> > #ifdef CONFIG_NUMA
>
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH 2/4] mm: move per-vma lock into vm_area_struct
2024-11-11 20:55 ` [PATCH 2/4] mm: move per-vma lock into vm_area_struct Suren Baghdasaryan
` (2 preceding siblings ...)
2024-11-12 10:18 ` kernel test robot
@ 2024-11-12 15:57 ` Vlastimil Babka
2024-11-12 16:08 ` Suren Baghdasaryan
3 siblings, 1 reply; 49+ messages in thread
From: Vlastimil Babka @ 2024-11-12 15:57 UTC (permalink / raw)
To: Suren Baghdasaryan, akpm
Cc: willy, liam.howlett, lorenzo.stoakes, mhocko, hannes, mjguzik,
oliver.sang, mgorman, david, peterx, oleg, dave, paulmck,
brauner, dhowells, hdanton, hughd, minchan, jannh, shakeel.butt,
souravpanda, pasha.tatashin, linux-mm, linux-kernel, kernel-team
On 11/11/24 21:55, Suren Baghdasaryan wrote:
> @@ -511,7 +476,6 @@ void __vm_area_free(struct vm_area_struct *vma)
> {
> vma_numab_state_free(vma);
> free_anon_vma_name(vma);
> - vma_lock_free(vma);
> kmem_cache_free(vm_area_cachep, vma);
> }
Have you investigated whether this allows performing vma_numab_state_free() and
free_anon_vma_name() immediately, and only kfree_rcu() the vma itself,
instead of performing all this in a call_rcu() callback?
Of course if we succeed in converting vmas to SLAB_TYPESAFE_RCU this immediate
freeing of numab state and anon_vma_name would be implied, but maybe it's a
useful intermediate step on its own.
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH 2/4] mm: move per-vma lock into vm_area_struct
2024-11-12 15:57 ` Vlastimil Babka
@ 2024-11-12 16:08 ` Suren Baghdasaryan
2024-11-12 16:58 ` Suren Baghdasaryan
0 siblings, 1 reply; 49+ messages in thread
From: Suren Baghdasaryan @ 2024-11-12 16:08 UTC (permalink / raw)
To: Vlastimil Babka
Cc: akpm, willy, liam.howlett, lorenzo.stoakes, mhocko, hannes,
mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, minchan, jannh,
shakeel.butt, souravpanda, pasha.tatashin, linux-mm,
linux-kernel, kernel-team
On Tue, Nov 12, 2024 at 7:57 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 11/11/24 21:55, Suren Baghdasaryan wrote:
> > @@ -511,7 +476,6 @@ void __vm_area_free(struct vm_area_struct *vma)
> > {
> > vma_numab_state_free(vma);
> > free_anon_vma_name(vma);
> > - vma_lock_free(vma);
> > kmem_cache_free(vm_area_cachep, vma);
> > }
>
> Have you investigated whether this allows performing vma_numab_state_free() and
> free_anon_vma_name() immediately, and only kfree_rcu() the vma itself,
> instead of performing all this in a call_rcu() callback?
Yes, it should be fine to free them immediately. lock_vma_under_rcu()
uses neither vma->numab_state nor vma->anon_name.
>
> Of course if we succeed in converting vmas to SLAB_TYPESAFE_RCU this immediate
> freeing of numab state and anon_vma_name would be implied, but maybe it's a
> useful intermediate step on its own.
I'm thinking maybe I should post SLAB_TYPESAFE_RCU conversion before
anything else. It's simple and quite uncontroversial. I will probably
do that today.
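
A sketch of that intermediate step, hypothetical and only for illustration, assuming
vm_area_free() keeps using an RCU callback for the struct itself:

static void vm_area_free_rcu_cb(struct rcu_head *head)
{
        struct vm_area_struct *vma = container_of(head, struct vm_area_struct,
                                                  vm_rcu);

        /* only the structure itself needs to wait out the grace period */
        kmem_cache_free(vm_area_cachep, vma);
}

void vm_area_free(struct vm_area_struct *vma)
{
        /* not referenced by lock_vma_under_rcu(), safe to free right away */
        vma_numab_state_free(vma);
        free_anon_vma_name(vma);
        call_rcu(&vma->vm_rcu, vm_area_free_rcu_cb);
}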
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH 2/4] mm: move per-vma lock into vm_area_struct
2024-11-12 16:08 ` Suren Baghdasaryan
@ 2024-11-12 16:58 ` Suren Baghdasaryan
0 siblings, 0 replies; 49+ messages in thread
From: Suren Baghdasaryan @ 2024-11-12 16:58 UTC (permalink / raw)
To: Vlastimil Babka
Cc: akpm, willy, liam.howlett, lorenzo.stoakes, mhocko, hannes,
mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, minchan, jannh,
shakeel.butt, souravpanda, pasha.tatashin, linux-mm,
linux-kernel, kernel-team
On Tue, Nov 12, 2024 at 8:08 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Tue, Nov 12, 2024 at 7:57 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> >
> > On 11/11/24 21:55, Suren Baghdasaryan wrote:
> > > @@ -511,7 +476,6 @@ void __vm_area_free(struct vm_area_struct *vma)
> > > {
> > > vma_numab_state_free(vma);
> > > free_anon_vma_name(vma);
> > > - vma_lock_free(vma);
> > > kmem_cache_free(vm_area_cachep, vma);
> > > }
> >
> > Have you investigated whether this allows performing vma_numab_state_free() and
> > free_anon_vma_name() immediately, and only kfree_rcu() the vma itself,
> > instead of performing all this in a call_rcu() callback?
>
> Yes, it should be fine to free them immediately. lock_vma_under_rcu()
> uses neither vma->numab_state nor vma->anon_name.
>
> >
> > Of course if we succeed in converting vmas to SLAB_TYPESAFE_RCU this immediate
> > freeing of numab state and anon_vma_name would be implied, but maybe it's a
> > useful intermediate step on its own.
>
> I'm thinking maybe I should post SLAB_TYPESAFE_RCU conversion before
> anything else. It's simple and quite uncontroversial. I will probably
> do that today.
Uh, I forgot that I can't post SLAB_TYPESAFE_RCU until I eliminate
this vma_lock_free() call inside __vm_area_free(). So, I have to
bundle moving vm_lock into vm_area_struct with SLAB_TYPESAFE_RCU.
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH 3/4] mm: replace rw_semaphore with atomic_t in vma_lock
2024-11-12 15:18 ` Suren Baghdasaryan
@ 2024-12-10 22:38 ` Peter Zijlstra
2024-12-10 23:37 ` Suren Baghdasaryan
0 siblings, 1 reply; 49+ messages in thread
From: Peter Zijlstra @ 2024-12-10 22:38 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: Matthew Wilcox, akpm, liam.howlett, lorenzo.stoakes, mhocko,
vbabka, hannes, mjguzik, oliver.sang, mgorman, david, peterx,
oleg, dave, paulmck, brauner, dhowells, hdanton, hughd, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, linux-mm,
linux-kernel, kernel-team
On Tue, Nov 12, 2024 at 07:18:45AM -0800, Suren Baghdasaryan wrote:
> On Mon, Nov 11, 2024 at 8:58 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Mon, Nov 11, 2024 at 12:55:05PM -0800, Suren Baghdasaryan wrote:
> > > When a reader takes read lock, it increments the atomic, unless the
> > > top two bits are set indicating a writer is present.
> > > When writer takes write lock, it sets VMA_LOCK_WR_LOCKED bit if there
> > > are no readers or VMA_LOCK_WR_WAIT bit if readers are holding the lock
> > > and puts itself onto newly introduced mm.vma_writer_wait. Since all
> > > writers take mmap_lock in write mode first, there can be only one writer
> > > at a time. The last reader to release the lock will signal the writer
> > > to wake up.
> >
> > I don't think you need two bits. You can do it this way:
> >
> > 0x8000'0000 - No readers, no writers
> > 0x1-7fff'ffff - Some number of readers
> > 0x0 - Writer held
> > 0x8000'0001-0xffff'ffff - Reader held, writer waiting
> >
> > A prospective writer subtracts 0x8000'0000. If the result is 0, it got
> > the lock, otherwise it sleeps until it is 0.
> >
> > A writer unlocks by adding 0x8000'0000 (not by setting the value to
> > 0x8000'0000).
> >
> > A reader unlocks by adding 1. If the result is 0, it wakes the writer.
> >
> > A prospective reader subtracts 1. If the result is positive, it got the
> > lock, otherwise it does the unlock above (this might be the one which
> > wakes the writer).
> >
> > And ... that's it. See how we use the CPU arithmetic flags to tell us
> > everything we need to know without doing arithmetic separately?
>
> Yes, this is neat! You are using the fact that write-locked == no
> readers to eliminate unnecessary state. I'll give that a try. Thanks!
The reason I got here is that Vlastimil poked me about the whole
TYPESAFE_BY_RCU thing.
So the normal way those things work is with a refcount, if the refcount
is non-zero, the identifying fields should be stable and you can
determine if you have the right object, otherwise tough luck.
And I was thinking that since you abuse this rwsem you have, you might
as well turn that into a refcount with some extra.
So I would propose a slightly different solution.
Replace vm_lock with vm_refcnt. Replace vm_detached with vm_refcnt == 0
-- that is, attach sets refcount to 1 to indicate it is part of the mas,
detached is the final 'put'.
RCU lookup does the inc_not_zero thing, when increment succeeds, compare
mm/addr to validate.
vma_start_write() already relies on mmap_lock being held for writing,
and thus does not have to worry about writer-vs-writer contention, that
is fully resolved by mmap_sem. This means we only need to wait for
readers to drop out.
vma_start_write()
add(0x8000'0001); // could fetch_add and double check the high
// bit wasn't already set.
wait-until(refcnt == 0x8000'0002); // mas + writer ref
WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
sub(0x8000'0000);
vma_end_write()
put();
vma_start_read() then becomes something like:
if (vm_lock_seq == mm_lock_seq)
return false;
cnt = fetch_inc(1);
if (cnt & msb || vm_lock_seq == mm_lock_seq) {
put();
return false;
}
return true;
vma_end_read() then becomes:
put();
and the down_read() from userfaultfd requires mmap_read_lock() and thus
does not have to worry about writers, it can simply be inc() and put(),
no?
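
To make the above a bit more concrete, here is a rough sketch of the proposed scheme
with an atomic_t vm_refcnt. All names and the vm_refcnt field itself are assumptions
taken from this proposal, not code from the posted series; waiting is shown as a spin,
and signed-overflow details of the 0x8000'0000 bias are glossed over.

#define VMA_WRITER_BIAS 0x80000000

/* attach: the tree holds the initial reference; detached means vm_refcnt == 0 */
static inline void vma_mark_attached(struct vm_area_struct *vma)
{
        atomic_set(&vma->vm_refcnt, 1);
}

static inline void vma_end_read(struct vm_area_struct *vma)
{
        atomic_dec(&vma->vm_refcnt);            /* the reader's put() */
}

/* the RCU lookup path would do inc_not_zero first and validate mm/addr */
static inline bool vma_start_read(struct vm_area_struct *vma, unsigned int mm_lock_seq)
{
        if (READ_ONCE(vma->vm_lock_seq) == mm_lock_seq)
                return false;                   /* already write-locked */

        if ((atomic_fetch_inc(&vma->vm_refcnt) & VMA_WRITER_BIAS) ||
            READ_ONCE(vma->vm_lock_seq) == mm_lock_seq) {
                vma_end_read(vma);              /* writer present, back off */
                return false;
        }
        return true;
}

/* mmap_lock is held for write, so there is never more than one writer here */
static inline void vma_start_write(struct vm_area_struct *vma, unsigned int mm_lock_seq)
{
        atomic_add(VMA_WRITER_BIAS + 1, &vma->vm_refcnt);
        while ((unsigned int)atomic_read(&vma->vm_refcnt) != VMA_WRITER_BIAS + 2)
                cpu_relax();                    /* wait for readers to drain */
        WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
        atomic_sub(VMA_WRITER_BIAS, &vma->vm_refcnt);
}

static inline void vma_end_write(struct vm_area_struct *vma)
{
        atomic_dec(&vma->vm_refcnt);            /* the writer's put() */
}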
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH 3/4] mm: replace rw_semaphore with atomic_t in vma_lock
2024-12-10 22:38 ` Peter Zijlstra
@ 2024-12-10 23:37 ` Suren Baghdasaryan
2024-12-11 8:25 ` Peter Zijlstra
0 siblings, 1 reply; 49+ messages in thread
From: Suren Baghdasaryan @ 2024-12-10 23:37 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Matthew Wilcox, akpm, liam.howlett, lorenzo.stoakes, mhocko,
vbabka, hannes, mjguzik, oliver.sang, mgorman, david, peterx,
oleg, dave, paulmck, brauner, dhowells, hdanton, hughd, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, linux-mm,
linux-kernel, kernel-team
On Tue, Dec 10, 2024 at 2:39 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Nov 12, 2024 at 07:18:45AM -0800, Suren Baghdasaryan wrote:
> > On Mon, Nov 11, 2024 at 8:58 PM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Mon, Nov 11, 2024 at 12:55:05PM -0800, Suren Baghdasaryan wrote:
> > > > When a reader takes read lock, it increments the atomic, unless the
> > > > top two bits are set indicating a writer is present.
> > > > When writer takes write lock, it sets VMA_LOCK_WR_LOCKED bit if there
> > > > are no readers or VMA_LOCK_WR_WAIT bit if readers are holding the lock
> > > > and puts itself onto newly introduced mm.vma_writer_wait. Since all
> > > > writers take mmap_lock in write mode first, there can be only one writer
> > > > at a time. The last reader to release the lock will signal the writer
> > > > to wake up.
> > >
> > > I don't think you need two bits. You can do it this way:
> > >
> > > 0x8000'0000 - No readers, no writers
> > > 0x1-7fff'ffff - Some number of readers
> > > 0x0 - Writer held
> > > 0x8000'0001-0xffff'ffff - Reader held, writer waiting
> > >
> > > A prospective writer subtracts 0x8000'0000. If the result is 0, it got
> > > the lock, otherwise it sleeps until it is 0.
> > >
> > > A writer unlocks by adding 0x8000'0000 (not by setting the value to
> > > 0x8000'0000).
> > >
> > > A reader unlocks by adding 1. If the result is 0, it wakes the writer.
> > >
> > > A prospective reader subtracts 1. If the result is positive, it got the
> > > lock, otherwise it does the unlock above (this might be the one which
> > > wakes the writer).
> > >
> > > And ... that's it. See how we use the CPU arithmetic flags to tell us
> > > everything we need to know without doing arithmetic separately?
> >
> > Yes, this is neat! You are using the fact that write-locked == no
> > readers to eliminate unnecessary state. I'll give that a try. Thanks!
>
> The reason I got here is that Vlastimil poked me about the whole
> TYPESAFE_BY_RCU thing.
>
> So the normal way those things work is with a refcount, if the refcount
> is non-zero, the identifying fields should be stable and you can
> determine if you have the right object, otherwise tough luck.
>
> And I was thinking that since you abuse this rwsem you have, you might
> as well turn that into a refcount with some extra.
>
> So I would propose a slightly different solution.
>
> Replace vm_lock with vm_refcnt. Replace vm_detached with vm_refcnt == 0
> -- that is, attach sets refcount to 1 to indicate it is part of the mas,
> detached is the final 'put'.
I need to double-check if we ever write-lock a detached vma. I don't
think we do but better be safe. If we do then that wait-until() should
accept 0x8000'0001 as well.
>
> RCU lookup does the inc_not_zero thing, when increment succeeds, compare
> mm/addr to validate.
>
> vma_start_write() already relies on mmap_lock being held for writing,
> and thus does not have to worry about writer-vs-writer contention, that
> is fully resolved by mmap_sem. This means we only need to wait for
> readers to drop out.
>
> vma_start_write()
> add(0x8000'0001); // could fetch_add and double check the high
> // bit wasn't already set.
> wait-until(refcnt == 0x8000'0002); // mas + writer ref
> WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
> sub(0x8000'0000);
>
> vma_end_write()
> put();
We don't really have vma_end_write(). Instead it's vma_end_write_all()
which increments mm_lock_seq unlocking all write-locked VMAs.
Therefore in vma_start_write() I think we can sub(0x8000'0001) at the
end.
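(For context, vma_end_write_all() today is roughly the following; the exact code may
differ:

static inline void vma_end_write_all(struct mm_struct *mm)
{
        mmap_assert_write_locked(mm);
        /* bumping mm_lock_seq makes every VMA's vm_lock_seq stale, i.e. unlocked */
        WRITE_ONCE(mm->mm_lock_seq, mm->mm_lock_seq + 1);
}
)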
>
> vma_start_read() then becomes something like:
>
> if (vm_lock_seq == mm_lock_seq)
> return false;
>
> cnt = fetch_inc(1);
> if (cnt & msb || vm_lock_seq == mm_lock_seq) {
> put();
> return false;
> }
>
> return true;
>
> vma_end_read() then becomes:
> put();
>
>
> and the down_read() from userfaultfd requires mmap_read_lock() and thus
> does not have to worry about writers, it can simply be inc() and put(),
> no?
I think your proposal should work. Let me try to code it and see if
something breaks.
Thanks Peter!
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH 3/4] mm: replace rw_semaphore with atomic_t in vma_lock
2024-12-10 23:37 ` Suren Baghdasaryan
@ 2024-12-11 8:25 ` Peter Zijlstra
2024-12-11 15:20 ` Suren Baghdasaryan
0 siblings, 1 reply; 49+ messages in thread
From: Peter Zijlstra @ 2024-12-11 8:25 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: Matthew Wilcox, akpm, liam.howlett, lorenzo.stoakes, mhocko,
vbabka, hannes, mjguzik, oliver.sang, mgorman, david, peterx,
oleg, dave, paulmck, brauner, dhowells, hdanton, hughd, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, linux-mm,
linux-kernel, kernel-team
On Tue, Dec 10, 2024 at 03:37:50PM -0800, Suren Baghdasaryan wrote:
> > Replace vm_lock with vm_refcnt. Replace vm_detached with vm_refcnt == 0
> > -- that is, attach sets refcount to 1 to indicate it is part of the mas,
> > detached is the final 'put'.
>
> I need to double-check if we ever write-lock a detached vma. I don't
> think we do but better be safe. If we do then that wait-until() should
> accept 0x8000'0001 as well.
vma_start_write()
__is_vma_write_locked()
mmap_assert_write_locked(vma->vm_mm);
So this really should hold afaict.
> > RCU lookup does the inc_not_zero thing, when increment succeeds, compare
> > mm/addr to validate.
> >
> > vma_start_write() already relies on mmap_lock being held for writing,
> > and thus does not have to worry about writer-vs-writer contention, that
> > is fully resolved by mmap_sem. This means we only need to wait for
> > readers to drop out.
> >
> > vma_start_write()
> > add(0x8000'0001); // could fetch_add and double check the high
> > // bit wasn't already set.
> > wait-until(refcnt == 0x8000'0002); // mas + writer ref
> > WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
> > sub(0x8000'0000);
> >
> > vma_end_write()
> > put();
>
> We don't really have vma_end_write(). Instead it's vma_end_write_all()
> which increments mm_lock_seq unlocking all write-locked VMAs.
> Therefore in vma_start_write() I think we can sub(0x8000'0001) at the
> end.
Right, I know you don't, but you should :-), I've suggested adding this
before.
> > vma_start_read() then becomes something like:
> >
> > if (vm_lock_seq == mm_lock_seq)
> > return false;
> >
> > cnt = fetch_inc(1);
> > if (cnt & msb || vm_lock_seq == mm_lock_seq) {
> > put();
> > return false;
> > }
> >
> > return true;
> >
> > vma_end_read() then becomes:
> > put();
> >
> >
> > and the down_read() from userfaultfd requires mmap_read_lock() and thus
> > does not have to worry about writers, it can simply be inc() and put(),
> > no?
>
> I think your proposal should work. Let me try to code it and see if
> something breaks.
Btw, for the wait-until() and put() you can use rcuwait; that is the
simplest wait form we have. It's suitable because we only ever have the
one waiter.
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH 3/4] mm: replace rw_semaphore with atomic_t in vma_lock
2024-12-11 8:25 ` Peter Zijlstra
@ 2024-12-11 15:20 ` Suren Baghdasaryan
2024-12-12 3:01 ` Suren Baghdasaryan
0 siblings, 1 reply; 49+ messages in thread
From: Suren Baghdasaryan @ 2024-12-11 15:20 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Matthew Wilcox, akpm, liam.howlett, lorenzo.stoakes, mhocko,
vbabka, hannes, mjguzik, oliver.sang, mgorman, david, peterx,
oleg, dave, paulmck, brauner, dhowells, hdanton, hughd, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, linux-mm,
linux-kernel, kernel-team
On Wed, Dec 11, 2024 at 12:25 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Dec 10, 2024 at 03:37:50PM -0800, Suren Baghdasaryan wrote:
>
> > > Replace vm_lock with vm_refcnt. Replace vm_detached with vm_refcnt == 0
> > > -- that is, attach sets refcount to 1 to indicate it is part of the mas,
> > > detached is the final 'put'.
> >
> > I need to double-check if we ever write-lock a detached vma. I don't
> > think we do but better be safe. If we do then that wait-until() should
> > accept 0x8000'0001 as well.
>
> vma_start_write()
> __is_vma_write_locked()
> mmap_assert_write_locked(vma->vm_mm);
>
> So this really should hold afaict.
>
> > > RCU lookup does the inc_not_zero thing, when increment succeeds, compare
> > > mm/addr to validate.
> > >
> > > vma_start_write() already relies on mmap_lock being held for writing,
> > > and thus does not have to worry about writer-vs-writer contention, that
> > > is fully resolved by mmap_sem. This means we only need to wait for
> > > readers to drop out.
> > >
> > > vma_start_write()
> > > add(0x8000'0001); // could fetch_add and double check the high
> > > // bit wasn't already set.
> > > wait-until(refcnt == 0x8000'0002); // mas + writer ref
> > > WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
> > > sub(0x8000'0000);
> > >
> > > vma_end_write()
> > > put();
> >
> > We don't really have vma_end_write(). Instead it's vma_end_write_all()
> > which increments mm_lock_seq unlocking all write-locked VMAs.
> > Therefore in vma_start_write() I think we can sub(0x8000'0001) at the
> > end.
>
> Right, I know you don't, but you should :-), I've suggested adding this
> before.
I'll look into adding it. IIRC there was some issue with that but I
can't recall...
>
> > > vma_start_read() then becomes something like:
> > >
> > > if (vm_lock_seq == mm_lock_seq)
> > > return false;
> > >
> > > cnt = fetch_inc(1);
> > > if (cnt & msb || vm_lock_seq == mm_lock_seq) {
> > > put();
> > > return false;
> > > }
> > >
> > > return true;
> > >
> > > vma_end_read() then becomes:
> > > put();
> > >
> > >
> > > and the down_read() from userfaultfd requires mmap_read_lock() and thus
> > > does not have to worry about writers, it can simply be inc() and put(),
> > > no?
> >
> > I think your proposal should work. Let me try to code it and see if
> > something breaks.
>
> Btw, for the wait-until() and put() you can use rcuwait; that is the
> simplest wait form we have. It's suitable because we only ever have the
> one waiter.
Yes, Davidlohr mentioned that before. I'll do that. Thanks!
* Re: [PATCH 3/4] mm: replace rw_semaphore with atomic_t in vma_lock
2024-12-11 15:20 ` Suren Baghdasaryan
@ 2024-12-12 3:01 ` Suren Baghdasaryan
2024-12-12 9:16 ` Peter Zijlstra
0 siblings, 1 reply; 49+ messages in thread
From: Suren Baghdasaryan @ 2024-12-12 3:01 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Matthew Wilcox, akpm, liam.howlett, lorenzo.stoakes, mhocko,
vbabka, hannes, mjguzik, oliver.sang, mgorman, david, peterx,
oleg, dave, paulmck, brauner, dhowells, hdanton, hughd, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, linux-mm,
linux-kernel, kernel-team
On Wed, Dec 11, 2024 at 7:20 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Wed, Dec 11, 2024 at 12:25 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Tue, Dec 10, 2024 at 03:37:50PM -0800, Suren Baghdasaryan wrote:
> >
> > > > Replace vm_lock with vm_refcnt. Replace vm_detached with vm_refcnt == 0
> > > > -- that is, attach sets refcount to 1 to indicate it is part of the mas,
> > > > detached is the final 'put'.
> > >
> > > I need to double-check if we ever write-lock a detached vma. I don't
> > > think we do but better be safe. If we do then that wait-until() should
> > > accept 0x8000'0001 as well.
> >
> > vma_start_write()
> > __is_vma_write_locked()
> > mmap_assert_write_locked(vma->vm_mm);
> >
> > So this really should hold afaict.
> >
> > > > RCU lookup does the inc_not_zero thing, when increment succeeds, compare
> > > > mm/addr to validate.
> > > >
> > > > vma_start_write() already relies on mmap_lock being held for writing,
> > > > and thus does not have to worry about writer-vs-writer contention, that
> > > > is fully resolved by mmap_sem. This means we only need to wait for
> > > > readers to drop out.
> > > >
> > > > vma_start_write()
> > > > add(0x8000'0001); // could fetch_add and double check the high
> > > > // bit wasn't already set.
> > > > wait-until(refcnt == 0x8000'0002); // mas + writer ref
> > > > WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
> > > > sub(0x8000'0000);
> > > >
> > > > vma_end_write()
> > > > put();
> > >
> > > We don't really have vma_end_write(). Instead it's vma_end_write_all()
> > > which increments mm_lock_seq unlocking all write-locked VMAs.
> > > Therefore in vma_start_write() I think we can sub(0x8000'0001) at the
> > > end.
> >
> > Right, I know you don't, but you should :-), I've suggested adding this
> > before.
>
> I'll look into adding it. IIRC there was some issue with that but I
> can't recall...
>
> >
> > > > vma_start_read() then becomes something like:
> > > >
> > > > if (vm_lock_seq == mm_lock_seq)
> > > > return false;
> > > >
> > > > cnt = fetch_inc(1);
> > > > if (cnt & msb || vm_lock_seq == mm_lock_seq) {
> > > > put();
> > > > return false;
> > > > }
> > > >
> > > > return true;
> > > >
> > > > vma_end_read() then becomes:
> > > > put();
> > > >
> > > >
> > > > and the down_read() from userfaultfd requires mmap_read_lock() and thus
> > > > does not have to worry about writers, it can simply be inc() and put(),
> > > > no?
> > >
> > > I think your proposal should work. Let me try to code it and see if
> > > something breaks.
Ok, I tried it out and things are a bit more complex:
1. We should allow write-locking a detached VMA, IOW vma_start_write()
can be called when vm_refcnt is 0. In that situation add(0x8000'0001)
leads to "addition on 0; use-after-free". Maybe I can introduce a new
refcnt function which does not complain when adding to 0?
2. Adding 0x80000000 saturates refcnt, so I have to use a lower bit
0x40000000 to denote writers (rough sketch after point 3).
3. Currently vma_mark_attached() can be called on an already attached
VMA. With vma->detached being a separate attribute that works fine but
when we combine it with the vm_lock things break (extra attach would
leak into lock count). I'll see if I can catch all the cases when we
do this and clean them up (not call vma_mark_attached() when not
necessary).
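A rough illustration of the constraint in point 2 (the constant name is
only illustrative, not taken from the patch):

	/* refcount_t treats values with the sign bit set as saturated, so
	 * the writer mark has to stay below 0x80000000 */
	#define VMA_LOCK_OFFSET	0x40000000

	refcount_add(VMA_LOCK_OFFSET, &vma->vm_refcnt);	/* mark a writer */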
> >
> > Btw, for the wait-until() and put() you can use rcuwait; that is the
> > simplest wait form we have. It's suitable because we only ever have the
> > one waiter.
>
> Yes, Davidlohr mentioned that before. I'll do that. Thanks!
* Re: [PATCH 3/4] mm: replace rw_semaphore with atomic_t in vma_lock
2024-12-12 3:01 ` Suren Baghdasaryan
@ 2024-12-12 9:16 ` Peter Zijlstra
2024-12-12 14:17 ` Suren Baghdasaryan
0 siblings, 1 reply; 49+ messages in thread
From: Peter Zijlstra @ 2024-12-12 9:16 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: Matthew Wilcox, akpm, liam.howlett, lorenzo.stoakes, mhocko,
vbabka, hannes, mjguzik, oliver.sang, mgorman, david, peterx,
oleg, dave, paulmck, brauner, dhowells, hdanton, hughd, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, linux-mm,
linux-kernel, kernel-team
On Wed, Dec 11, 2024 at 07:01:16PM -0800, Suren Baghdasaryan wrote:
> > > > I think your proposal should work. Let me try to code it and see if
> > > > something breaks.
>
> Ok, I tried it out and things are a bit more complex:
> 1. We should allow write-locking a detached VMA, IOW vma_start_write()
> can be called when vm_refcnt is 0.
This sounds dodgy, refcnt being zero basically means the object is dead
and you shouldn't be touching it no more. Where does this happen and
why?
Notably, it being 0 means it is no longer in the mas tree and can't be
found anymore.
> 2. Adding 0x80000000 saturates refcnt, so I have to use a lower bit
> 0x40000000 to denote writers.
I'm confused, what? We're talking about atomic_t, right?
> 3. Currently vma_mark_attached() can be called on an already attached
> VMA. With vma->detached being a separate attribute that works fine but
> when we combine it with the vm_lock things break (extra attach would
> leak into lock count). I'll see if I can catch all the cases when we
> do this and clean them up (not call vma_mark_attached() when not
> necessary).
Right, I hadn't looked at that thing in detail, that sounds like it
needs a wee cleanup like you suggest.
* Re: [PATCH 3/4] mm: replace rw_semaphore with atomic_t in vma_lock
2024-12-12 9:16 ` Peter Zijlstra
@ 2024-12-12 14:17 ` Suren Baghdasaryan
2024-12-12 14:19 ` Suren Baghdasaryan
2024-12-13 9:22 ` Peter Zijlstra
0 siblings, 2 replies; 49+ messages in thread
From: Suren Baghdasaryan @ 2024-12-12 14:17 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Matthew Wilcox, akpm, liam.howlett, lorenzo.stoakes, mhocko,
vbabka, hannes, mjguzik, oliver.sang, mgorman, david, peterx,
oleg, dave, paulmck, brauner, dhowells, hdanton, hughd, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, linux-mm,
linux-kernel, kernel-team
On Thu, Dec 12, 2024 at 1:17 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Dec 11, 2024 at 07:01:16PM -0800, Suren Baghdasaryan wrote:
>
> > > > > I think your proposal should work. Let me try to code it and see if
> > > > > something breaks.
> >
> > Ok, I tried it out and things are a bit more complex:
> > 1. We should allow write-locking a detached VMA, IOW vma_start_write()
> > can be called when vm_refcnt is 0.
>
> This sounds dodgy, refcnt being zero basically means the object is dead
> and you shouldn't be touching it no more. Where does this happen and
> why?
>
> Notably, it being 0 means it is no longer in the mas tree and can't be
> found anymore.
It happens when a newly created vma that was not yet attached
(vma->vm_refcnt = 0) is write-locked before being added into the vma
tree. For example:
mmap()
mmap_write_lock()
vma = vm_area_alloc() // vma->vm_refcnt = 0 (detached)
//vma attributes are initialized
vma_start_write() // write 0x8000 0001 into vma->vm_refcnt
mas_store_gfp()
vma_mark_attached()
mmap_write_lock() // vma_end_write_all()
In this scenario, we write-lock the VMA before adding it into the tree
to prevent readers (pagefaults) from using it until we drop the
mmap_write_lock(). In your proposal, the first thing vma_start_write()
does is add(0x8000'0001) and that will trigger a warning.
For now instead of add(0x8000'0001) I can play this game to avoid the warning:
if (refcount_inc_not_zero(&vma->vm_refcnt))
refcount_add(0x80000000, &vma->vm_refcnt);
else
refcount_set(&vma->vm_refcnt, 0x80000001);
this refcount_set() works because vma with vm_refcnt==0 could not be
found by readers. I'm not sure this will still work when we add
TYPESAFE_BY_RCU and introduce vma reuse possibility.
>
> > 2. Adding 0x80000000 saturates refcnt, so I have to use a lower bit
> > 0x40000000 to denote writers.
>
> I'm confused, what? We're talking about atomic_t, right?
I thought you suggested using refcount_t. According to
https://elixir.bootlin.com/linux/v6.13-rc2/source/include/linux/refcount.h#L22
valid values would be [0..0x7fff_ffff] and 0x80000000 is outside of
that range. What am I missing?
>
> > 3. Currently vma_mark_attached() can be called on an already attached
> > VMA. With vma->detached being a separate attribute that works fine but
> > when we combine it with the vm_lock things break (extra attach would
> > leak into lock count). I'll see if I can catch all the cases when we
> > do this and clean them up (not call vma_mark_attached() when not
> > necessary).
>
> Right, I hadn't looked at that thing in detail, that sounds like it
> needs a wee cleanup like you suggest.
Yes, I'll embark on that today. Will see how much of a problem that is.
* Re: [PATCH 3/4] mm: replace rw_semaphore with atomic_t in vma_lock
2024-12-12 14:17 ` Suren Baghdasaryan
@ 2024-12-12 14:19 ` Suren Baghdasaryan
2024-12-13 4:48 ` Suren Baghdasaryan
2024-12-13 9:22 ` Peter Zijlstra
1 sibling, 1 reply; 49+ messages in thread
From: Suren Baghdasaryan @ 2024-12-12 14:19 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Matthew Wilcox, akpm, liam.howlett, lorenzo.stoakes, mhocko,
vbabka, hannes, mjguzik, oliver.sang, mgorman, david, peterx,
oleg, dave, paulmck, brauner, dhowells, hdanton, hughd, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, linux-mm,
linux-kernel, kernel-team
On Thu, Dec 12, 2024 at 6:17 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Thu, Dec 12, 2024 at 1:17 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Wed, Dec 11, 2024 at 07:01:16PM -0800, Suren Baghdasaryan wrote:
> >
> > > > > > I think your proposal should work. Let me try to code it and see if
> > > > > > something breaks.
> > >
> > > Ok, I tried it out and things are a bit more complex:
> > > 1. We should allow write-locking a detached VMA, IOW vma_start_write()
> > > can be called when vm_refcnt is 0.
> >
> > This sounds dodgy, refcnt being zero basically means the object is dead
> > and you shouldn't be touching it no more. Where does this happen and
> > why?
> >
> > Notably, it being 0 means it is no longer in the mas tree and can't be
> > found anymore.
>
> It happens when a newly created vma that was not yet attached
> (vma->vm_refcnt = 0) is write-locked before being added into the vma
> tree. For example:
> mmap()
> mmap_write_lock()
> vma = vm_area_alloc() // vma->vm_refcnt = 0 (detached)
> //vma attributes are initialized
> vma_start_write() // write 0x8000 0001 into vma->vm_refcnt
> mas_store_gfp()
> vma_mark_attached()
> mmap_write_lock() // vma_end_write_all()
s/mmap_write_lock()/mmap_write_unlock()
>
> In this scenario, we write-lock the VMA before adding it into the tree
> to prevent readers (pagefaults) from using it until we drop the
> mmap_write_lock(). In your proposal, the first thing vma_start_write()
> does is add(0x8000'0001) and that will trigger a warning.
> For now instead of add(0x8000'0001) I can play this game to avoid the warning:
>
> if (refcount_inc_not_zero(&vma->vm_refcnt))
> refcount_add(0x80000000, &vma->vm_refcnt);
> else
> refcount_set(&vma->vm_refcnt, 0x80000001);
>
> this refcount_set() works because vma with vm_refcnt==0 could not be
> found by readers. I'm not sure this will still work when we add
> TYPESAFE_BY_RCU and introduce vma reuse possibility.
>
> >
> > > 2. Adding 0x80000000 saturates refcnt, so I have to use a lower bit
> > > 0x40000000 to denote writers.
> >
> > I'm confused, what? We're talking about atomic_t, right?
>
> I thought you suggested using refcount_t. According to
> https://elixir.bootlin.com/linux/v6.13-rc2/source/include/linux/refcount.h#L22
> valid values would be [0..0x7fff_ffff] and 0x80000000 is outside of
> that range. What am I missing?
>
> >
> > > 3. Currently vma_mark_attached() can be called on an already attached
> > > VMA. With vma->detached being a separate attribute that works fine but
> > > when we combine it with the vm_lock things break (extra attach would
> > > leak into lock count). I'll see if I can catch all the cases when we
> > > do this and clean them up (not call vma_mark_attached() when not
> > > necessary).
> >
> > Right, I hadn't looked at that thing in detail, that sounds like it
> > needs a wee cleanup like you suggest.
>
> Yes, I'll embark on that today. Will see how much of a problem that is.
* Re: [PATCH 3/4] mm: replace rw_semaphore with atomic_t in vma_lock
2024-12-12 14:19 ` Suren Baghdasaryan
@ 2024-12-13 4:48 ` Suren Baghdasaryan
2024-12-13 9:57 ` Peter Zijlstra
0 siblings, 1 reply; 49+ messages in thread
From: Suren Baghdasaryan @ 2024-12-13 4:48 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Matthew Wilcox, akpm, liam.howlett, lorenzo.stoakes, mhocko,
vbabka, hannes, mjguzik, oliver.sang, mgorman, david, peterx,
oleg, dave, paulmck, brauner, dhowells, hdanton, hughd, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, linux-mm,
linux-kernel, kernel-team
On Thu, Dec 12, 2024 at 6:19 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Thu, Dec 12, 2024 at 6:17 AM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Thu, Dec 12, 2024 at 1:17 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > On Wed, Dec 11, 2024 at 07:01:16PM -0800, Suren Baghdasaryan wrote:
> > >
> > > > > > > I think your proposal should work. Let me try to code it and see if
> > > > > > > something breaks.
> > > >
> > > > Ok, I tried it out and things are a bit more complex:
> > > > 1. We should allow write-locking a detached VMA, IOW vma_start_write()
> > > > can be called when vm_refcnt is 0.
> > >
> > > This sounds dodgy, refcnt being zero basically means the object is dead
> > > and you shouldn't be touching it no more. Where does this happen and
> > > why?
> > >
> > > Notably, it being 0 means it is no longer in the mas tree and can't be
> > > found anymore.
> >
> > It happens when a newly created vma that was not yet attached
> > (vma->vm_refcnt = 0) is write-locked before being added into the vma
> > tree. For example:
> > mmap()
> > mmap_write_lock()
> > vma = vm_area_alloc() // vma->vm_refcnt = 0 (detached)
> > //vma attributes are initialized
> > vma_start_write() // write 0x8000 0001 into vma->vm_refcnt
> > mas_store_gfp()
> > vma_mark_attached()
> > mmap_write_lock() // vma_end_write_all()
>
> s/mmap_write_lock()/mmap_write_unlock()
> >
> > In this scenario, we write-lock the VMA before adding it into the tree
> > to prevent readers (pagefaults) from using it until we drop the
> > mmap_write_lock(). In your proposal, the first thing vma_start_write()
> > does is add(0x8000'0001) and that will trigger a warning.
> > For now instead of add(0x8000'0001) I can play this game to avoid the warning:
> >
> > if (refcount_inc_not_zero(&vma->vm_refcnt))
> > refcount_add(0x80000000, &vma->vm_refcnt);
> > else
> > refcount_set(&vma->vm_refcnt, 0x80000001);
> >
> > this refcount_set() works because vma with vm_refcnt==0 could not be
> > found by readers. I'm not sure this will still work when we add
> > TYPESAFE_BY_RCU and introduce vma reuse possibility.
> >
> > >
> > > > 2. Adding 0x80000000 saturates refcnt, so I have to use a lower bit
> > > > 0x40000000 to denote writers.
> > >
> > > I'm confused, what? We're talking about atomic_t, right?
> >
> > I thought you suggested using refcount_t. According to
> > https://elixir.bootlin.com/linux/v6.13-rc2/source/include/linux/refcount.h#L22
> > valid values would be [0..0x7fff_ffff] and 0x80000000 is outside of
> > that range. What am I missing?
> >
> > >
> > > > 3. Currently vma_mark_attached() can be called on an already attached
> > > > VMA. With vma->detached being a separate attribute that works fine but
> > > > when we combine it with the vm_lock things break (extra attach would
> > > > leak into lock count). I'll see if I can catch all the cases when we
> > > > do this and clean them up (not call vma_mark_attached() when not
> > > > necessary).
> > >
> > > Right, I hadn't looked at that thing in detail, that sounds like it
> > > needs a wee cleanup like you suggest.
> >
> > Yes, I'll embark on that today. Will see how much of a problem that is.
Ok, I think I was able to implement this in a way that ignores
duplicate attach/detach calls. One issue that I hit and don't know a
good way to fix is a circular dependency in the header files once I
try adding rcuwait into mm_struct. Including rcuwait.h in mm_types.h
leads to the following cycle:
In file included from ./arch/x86/include/asm/uaccess.h:12,
from ./include/linux/uaccess.h:12,
from ./include/linux/sched/task.h:13,
from ./include/linux/sched/signal.h:9,
from ./include/linux/rcuwait.h:6,
from ./include/linux/mm_types.h:22,
./arch/x86/include/asm/uaccess.h includes mm_types.h. But in fact
there is a shorter cycle:
rcuwait.h needs signal.h since it uses the inlined signal_pending_state()
signal.h needs mm_types.h since it uses vm_fault_t
The way I worked around it for now is by removing signal.h include
from rcuwait.h and wrapping signal_pending_state() into a non-inlined
function so I can forward-declare it. That requires adding
linux/sched/signal.h or linux/sched/task.h into many other places:
arch/x86/coco/sev/core.c | 1 +
arch/x86/kernel/fpu/xstate.c | 1 +
block/blk-lib.c | 1 +
block/ioctl.c | 1 +
drivers/accel/qaic/qaic_control.c | 1 +
drivers/base/firmware_loader/main.c | 1 +
drivers/block/ublk_drv.c | 1 +
drivers/crypto/ccp/sev-dev.c | 1 +
drivers/dma-buf/heaps/cma_heap.c | 1 +
drivers/dma-buf/heaps/system_heap.c | 1 +
drivers/gpio/gpio-sloppy-logic-analyzer.c | 1 +
drivers/gpu/drm/etnaviv/etnaviv_gpu.c | 1 +
drivers/iommu/iommufd/ioas.c | 1 +
drivers/iommu/iommufd/pages.c | 1 +
.../staging/vc04_services/interface/vchiq_arm/vchiq_dev.c | 1 +
drivers/tty/n_tty.c | 1 +
fs/bcachefs/fs-io-buffered.c | 1 +
fs/bcachefs/journal_reclaim.c | 1 +
fs/bcachefs/thread_with_file.c | 1 +
fs/bcachefs/util.c | 1 +
fs/btrfs/defrag.h | 1 +
fs/btrfs/fiemap.c | 2 ++
fs/btrfs/free-space-cache.h | 1 +
fs/btrfs/reflink.c | 1 +
fs/exfat/balloc.c | 1 +
fs/gfs2/ops_fstype.c | 1 +
fs/kernel_read_file.c | 1 +
fs/netfs/buffered_write.c | 1 +
fs/zonefs/file.c | 1 +
include/linux/fs.h | 2 +-
include/linux/rcuwait.h | 4 ++--
io_uring/io_uring.h | 1 +
kernel/dma/map_benchmark.c | 1 +
kernel/futex/core.c | 1 +
kernel/futex/pi.c | 1 +
kernel/rcu/update.c | 5 +++++
kernel/task_work.c | 1 +
lib/kunit/user_alloc.c | 1 +
mm/damon/vaddr.c | 1 +
mm/memcontrol-v1.c | 1 +
mm/shrinker_debug.c | 1 +
net/dns_resolver/dns_key.c | 1 +
42 files changed, 48 insertions(+), 3 deletions(-)
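Roughly, the rcuwait.h side of that workaround looks like this (sketch
only; the wrapper name is made up):

	/* include/linux/rcuwait.h: forward declarations instead of
	 * pulling in linux/sched/signal.h */
	struct task_struct;
	int rcuwait_signal_pending_state(unsigned int state, struct task_struct *tsk);

	/* defined out of line, e.g. in kernel/rcu/update.c */
	int rcuwait_signal_pending_state(unsigned int state, struct task_struct *tsk)
	{
		return signal_pending_state(state, tsk);
	}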
I'm not sure if this is the best way to deal with this circular
dependency. Any other ideas?
* Re: [PATCH 3/4] mm: replace rw_semaphore with atomic_t in vma_lock
2024-12-12 14:17 ` Suren Baghdasaryan
2024-12-12 14:19 ` Suren Baghdasaryan
@ 2024-12-13 9:22 ` Peter Zijlstra
2024-12-13 17:41 ` Suren Baghdasaryan
1 sibling, 1 reply; 49+ messages in thread
From: Peter Zijlstra @ 2024-12-13 9:22 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: Matthew Wilcox, akpm, liam.howlett, lorenzo.stoakes, mhocko,
vbabka, hannes, mjguzik, oliver.sang, mgorman, david, peterx,
oleg, dave, paulmck, brauner, dhowells, hdanton, hughd, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, linux-mm,
linux-kernel, kernel-team
On Thu, Dec 12, 2024 at 06:17:44AM -0800, Suren Baghdasaryan wrote:
> On Thu, Dec 12, 2024 at 1:17 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Wed, Dec 11, 2024 at 07:01:16PM -0800, Suren Baghdasaryan wrote:
> >
> > > > > > I think your proposal should work. Let me try to code it and see if
> > > > > > something breaks.
> > >
> > > Ok, I tried it out and things are a bit more complex:
> > > 1. We should allow write-locking a detached VMA, IOW vma_start_write()
> > > can be called when vm_refcnt is 0.
> >
> > This sounds dodgy, refcnt being zero basically means the object is dead
> > and you shouldn't be touching it no more. Where does this happen and
> > why?
> >
> > Notably, it being 0 means it is no longer in the mas tree and can't be
> > found anymore.
>
> It happens when a newly created vma that was not yet attached
> (vma->vm_refcnt = 0) is write-locked before being added into the vma
> tree. For example:
> mmap()
> mmap_write_lock()
> vma = vm_area_alloc() // vma->vm_refcnt = 0 (detached)
> //vma attributes are initialized
> vma_start_write() // write 0x8000 0001 into vma->vm_refcnt
> mas_store_gfp()
> vma_mark_attached()
> mmap_write_lock() // vma_end_write_all()
>
> In this scenario, we write-lock the VMA before adding it into the tree
> to prevent readers (pagefaults) from using it until we drop the
> mmap_write_lock().
Ah, but you can do that by setting vma->vm_lock_seq and setting the ref
to 1 before adding it (it's not visible before adding anyway, so nobody
cares).
You'll note that the read thing checks both the msb (or other high bit
depending on the actual type you're going with) *and* the seq. That is
needed because we must not set the sequence number before all existing
readers are drained, but since this is pre-add that is not a concern.
> > > 2. Adding 0x80000000 saturates refcnt, so I have to use a lower bit
> > > 0x40000000 to denote writers.
> >
> > I'm confused, what? We're talking about atomic_t, right?
>
> I thought you suggested using refcount_t. According to
> https://elixir.bootlin.com/linux/v6.13-rc2/source/include/linux/refcount.h#L22
> valid values would be [0..0x7fff_ffff] and 0x80000000 is outside of
> that range. What am I missing?
I was talking about atomic_t :-), but yeah, maybe we can use refcount_t,
but I hadn't initially considered that.
* Re: [PATCH 3/4] mm: replace rw_semaphore with atomic_t in vma_lock
2024-12-13 4:48 ` Suren Baghdasaryan
@ 2024-12-13 9:57 ` Peter Zijlstra
2024-12-13 17:45 ` Suren Baghdasaryan
0 siblings, 1 reply; 49+ messages in thread
From: Peter Zijlstra @ 2024-12-13 9:57 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: Matthew Wilcox, akpm, liam.howlett, lorenzo.stoakes, mhocko,
vbabka, hannes, mjguzik, oliver.sang, mgorman, david, peterx,
oleg, dave, paulmck, brauner, dhowells, hdanton, hughd, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, linux-mm,
linux-kernel, kernel-team
On Thu, Dec 12, 2024 at 08:48:52PM -0800, Suren Baghdasaryan wrote:
> I'm not sure if this is the best way to deal with this circular
> dependency. Any other ideas?
Move the waiting into an out-of-line slow-path?
if (atomic_read(&vma->refcnt) != 2)
__vma_write_start_wait(mm, vma);
or somesuch..
* Re: [PATCH 3/4] mm: replace rw_semaphore with atomic_t in vma_lock
2024-12-13 9:22 ` Peter Zijlstra
@ 2024-12-13 17:41 ` Suren Baghdasaryan
0 siblings, 0 replies; 49+ messages in thread
From: Suren Baghdasaryan @ 2024-12-13 17:41 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Matthew Wilcox, akpm, liam.howlett, lorenzo.stoakes, mhocko,
vbabka, hannes, mjguzik, oliver.sang, mgorman, david, peterx,
oleg, dave, paulmck, brauner, dhowells, hdanton, hughd, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, linux-mm,
linux-kernel, kernel-team
On Fri, Dec 13, 2024 at 1:22 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Dec 12, 2024 at 06:17:44AM -0800, Suren Baghdasaryan wrote:
> > On Thu, Dec 12, 2024 at 1:17 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > On Wed, Dec 11, 2024 at 07:01:16PM -0800, Suren Baghdasaryan wrote:
> > >
> > > > > > > I think your proposal should work. Let me try to code it and see if
> > > > > > > something breaks.
> > > >
> > > > Ok, I tried it out and things are a bit more complex:
> > > > 1. We should allow write-locking a detached VMA, IOW vma_start_write()
> > > > can be called when vm_refcnt is 0.
> > >
> > > This sounds dodgy, refcnt being zero basically means the object is dead
> > > and you shouldn't be touching it no more. Where does this happen and
> > > why?
> > >
> > > Notably, it being 0 means it is no longer in the mas tree and can't be
> > > found anymore.
> >
> > It happens when a newly created vma that was not yet attached
> > (vma->vm_refcnt = 0) is write-locked before being added into the vma
> > tree. For example:
> > mmap()
> > mmap_write_lock()
> > vma = vm_area_alloc() // vma->vm_refcnt = 0 (detached)
> > //vma attributes are initialized
> > vma_start_write() // write 0x8000 0001 into vma->vm_refcnt
> > mas_store_gfp()
> > vma_mark_attached()
> > mmap_write_lock() // vma_end_write_all()
> >
> > In this scenario, we write-lock the VMA before adding it into the tree
> > to prevent readers (pagefaults) from using it until we drop the
> > mmap_write_lock().
>
> Ah, but you can do that by setting vma->vm_lock_seq and setting the ref
> to 1 before adding it (it's not visible before adding anyway, so nobody
> cares).
>
> You'll note that the read thing checks both the msb (or other high bit
> depending on the actual type you're going with) *and* the seq. That is
> needed because we must not set the sequence number before all existing
> readers are drained, but since this is pre-add that is not a concern.
Yes, I realized that there is an interesting rule that helps in this
case: vma_mark_attached() is called either on a newly created vma which
can't be found by anyone else yet, or on a vma that is already
write-locked, therefore vma_start_write() can never race with
vma_mark_attached(). Considering that vma_mark_attached() is the only
one that can raise vma->vm_refcnt from 0, vma_start_write() can set
vma->vm_lock_seq without modifying vma->vm_refcnt at all if
vma->vm_refcnt==0.
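So the detached case can become a trivial fast path, roughly like this
(sketch only; how mm_lock_seq is obtained is left out):

	if (!refcount_read(&vma->vm_refcnt)) {
		/* detached: not reachable by readers, nothing to drain */
		WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
		return;
	}
	/* attached: mark the writer and wait for readers as before */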
>
> > > > 2. Adding 0x80000000 saturates refcnt, so I have to use a lower bit
> > > > 0x40000000 to denote writers.
> > >
> > > I'm confused, what? We're talking about atomic_t, right?
> >
> > I thought you suggested using refcount_t. According to
> > https://elixir.bootlin.com/linux/v6.13-rc2/source/include/linux/refcount.h#L22
> > valid values would be [0..0x7fff_ffff] and 0x80000000 is outside of
> > that range. What am I missing?
>
> I was talking about atomic_t :-), but yeah, maybe we can use refcount_t,
> but I hadn't initially considered that.
My current implementation is using refcount_t but I might have to
change that to atomic_t to handle vm_refcnt overflows in
vma_start_read().
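If it comes to that, the reader path with a raw atomic_t would have to
reject both the writer mark and a runaway count by hand, something like
this (the names are only illustrative):

	int cnt = atomic_add_return(1, &vma->vm_refcnt);

	if ((cnt & VMA_LOCK_OFFSET) || cnt > VMA_REF_LIMIT) {
		/* write-locked or suspiciously high count: back off */
		atomic_dec(&vma->vm_refcnt);
		return false;
	}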
>
* Re: [PATCH 3/4] mm: replace rw_semaphore with atomic_t in vma_lock
2024-12-13 9:57 ` Peter Zijlstra
@ 2024-12-13 17:45 ` Suren Baghdasaryan
2024-12-13 18:19 ` Matthew Wilcox
2024-12-13 18:35 ` Peter Zijlstra
0 siblings, 2 replies; 49+ messages in thread
From: Suren Baghdasaryan @ 2024-12-13 17:45 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Matthew Wilcox, akpm, liam.howlett, lorenzo.stoakes, mhocko,
vbabka, hannes, mjguzik, oliver.sang, mgorman, david, peterx,
oleg, dave, paulmck, brauner, dhowells, hdanton, hughd, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, linux-mm,
linux-kernel, kernel-team
On Fri, Dec 13, 2024 at 1:57 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Dec 12, 2024 at 08:48:52PM -0800, Suren Baghdasaryan wrote:
>
> > I'm not sure if this is the best way to deal with this circular
> > dependency. Any other ideas?
>
> Move the waiting into an out-of-line slow-path?
>
> if (atomic_read(&vma->refcnt) != 2)
> __vma_write_start_wait(mm, vma);
The problem is not the function but the addition of struct rcuwait into
struct mm_struct. The full type has to be visible there, and I don't want
to add it via a pointer to avoid extra dereferences. After thinking it over, I
think the simplest way would be to move the definition of struct
rcuwait into a separate header and include that header in mm_types.h.
That, with some uninlining like you suggested, should do the trick.
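FWIW struct rcuwait itself is tiny, so a types-only header (the name
below is just a guess) would be all that mm_types.h has to see:

	/* include/linux/rcuwait_types.h -- hypothetical */
	#ifndef _LINUX_RCUWAIT_TYPES_H
	#define _LINUX_RCUWAIT_TYPES_H

	struct task_struct;

	struct rcuwait {
		struct task_struct __rcu *task;
	};

	#endif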
>
> or somesuch..
* Re: [PATCH 3/4] mm: replace rw_semaphore with atomic_t in vma_lock
2024-12-13 17:45 ` Suren Baghdasaryan
@ 2024-12-13 18:19 ` Matthew Wilcox
2024-12-13 18:35 ` Peter Zijlstra
1 sibling, 0 replies; 49+ messages in thread
From: Matthew Wilcox @ 2024-12-13 18:19 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: Peter Zijlstra, akpm, liam.howlett, lorenzo.stoakes, mhocko,
vbabka, hannes, mjguzik, oliver.sang, mgorman, david, peterx,
oleg, dave, paulmck, brauner, dhowells, hdanton, hughd, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, linux-mm,
linux-kernel, kernel-team
On Fri, Dec 13, 2024 at 09:45:33AM -0800, Suren Baghdasaryan wrote:
> think the simplest way would be to move the definition of struct
> rcuwait into a separate header and include that header in mm_types.h.
> That, with some uninlining like you suggested, should do the trick.
I'd suggest struct rcuwait should live in types.h alongside list_head,
rcuref and so on.
* Re: [PATCH 3/4] mm: replace rw_semaphore with atomic_t in vma_lock
2024-12-13 17:45 ` Suren Baghdasaryan
2024-12-13 18:19 ` Matthew Wilcox
@ 2024-12-13 18:35 ` Peter Zijlstra
2024-12-13 18:37 ` Suren Baghdasaryan
1 sibling, 1 reply; 49+ messages in thread
From: Peter Zijlstra @ 2024-12-13 18:35 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: Matthew Wilcox, akpm, liam.howlett, lorenzo.stoakes, mhocko,
vbabka, hannes, mjguzik, oliver.sang, mgorman, david, peterx,
oleg, dave, paulmck, brauner, dhowells, hdanton, hughd, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, linux-mm,
linux-kernel, kernel-team
On Fri, Dec 13, 2024 at 09:45:33AM -0800, Suren Baghdasaryan wrote:
> On Fri, Dec 13, 2024 at 1:57 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Thu, Dec 12, 2024 at 08:48:52PM -0800, Suren Baghdasaryan wrote:
> >
> > > I'm not sure if this is the best way to deal with this circular
> > > dependency. Any other ideas?
> >
> > Move the waiting into an out-of-line slow-path?
> >
> > if (atomic_read(&vma->refcnt) != 2)
> > __vma_write_start_wait(mm, vma);
>
> The problem is not a function but the addition of struct rcuwait into
Durr, in my brain that was a struct task_struct pointer, totally forgot
we had a type there. Yeah, as Willy says, move it to compiler_types.h or
somesuch.
* Re: [PATCH 3/4] mm: replace rw_semaphore with atomic_t in vma_lock
2024-12-13 18:35 ` Peter Zijlstra
@ 2024-12-13 18:37 ` Suren Baghdasaryan
2024-12-16 19:32 ` Suren Baghdasaryan
0 siblings, 1 reply; 49+ messages in thread
From: Suren Baghdasaryan @ 2024-12-13 18:37 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Matthew Wilcox, akpm, liam.howlett, lorenzo.stoakes, mhocko,
vbabka, hannes, mjguzik, oliver.sang, mgorman, david, peterx,
oleg, dave, paulmck, brauner, dhowells, hdanton, hughd, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, linux-mm,
linux-kernel, kernel-team
On Fri, Dec 13, 2024 at 10:35 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Dec 13, 2024 at 09:45:33AM -0800, Suren Baghdasaryan wrote:
> > On Fri, Dec 13, 2024 at 1:57 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > On Thu, Dec 12, 2024 at 08:48:52PM -0800, Suren Baghdasaryan wrote:
> > >
> > > > I'm not sure if this is the best way to deal with this circular
> > > > dependency. Any other ideas?
> > >
> > > Move the waiting into an out-of-line slow-path?
> > >
> > > if (atomic_read(&vma->refcnt) != 2)
> > > __vma_write_start_wait(mm, vma);
> >
> > The problem is not a function but the addition of struct rcuwait into
>
> Durr, in my brain that was a struct task_struct pointer, totally forgot
> we had a type there. Yeah, as Willy says, move it to compiler_types.h or
> somesuch.
Got it. Thank you both!
* Re: [PATCH 3/4] mm: replace rw_semaphore with atomic_t in vma_lock
2024-12-13 18:37 ` Suren Baghdasaryan
@ 2024-12-16 19:32 ` Suren Baghdasaryan
0 siblings, 0 replies; 49+ messages in thread
From: Suren Baghdasaryan @ 2024-12-16 19:32 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Matthew Wilcox, akpm, liam.howlett, lorenzo.stoakes, mhocko,
vbabka, hannes, mjguzik, oliver.sang, mgorman, david, peterx,
oleg, dave, paulmck, brauner, dhowells, hdanton, hughd, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, linux-mm,
linux-kernel, kernel-team
On Fri, Dec 13, 2024 at 10:37 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Fri, Dec 13, 2024 at 10:35 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Fri, Dec 13, 2024 at 09:45:33AM -0800, Suren Baghdasaryan wrote:
> > > On Fri, Dec 13, 2024 at 1:57 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > > >
> > > > On Thu, Dec 12, 2024 at 08:48:52PM -0800, Suren Baghdasaryan wrote:
> > > >
> > > > > I'm not sure if this is the best way to deal with this circular
> > > > > dependency. Any other ideas?
> > > >
> > > > Move the waiting into an out-of-line slow-path?
> > > >
> > > > if (atomic_read(&vma->refcnt) != 2)
> > > > __vma_write_start_wait(mm, vma);
> > >
> > > The problem is not a function but the addition of struct rcuwait into
> >
> > Durr, in my brain that was a struct task_struct pointer, totally forgot
> > we had a type there. Yeah, as Willy says, move it to compiler_types.h or
> > somesuch.
>
> Got it. Thank you both!
I posted the implementation of the mechanism we discussed here at:
https://lore.kernel.org/all/20241216192419.2970941-11-surenb@google.com/
It ended up becoming a large patchset but most patches are very small
logical units.
Feedback is greatly appreciated!
end of thread, other threads:[~2024-12-16 19:33 UTC | newest]
Thread overview: 49+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-11-11 20:55 [PATCH 0/4] move per-vma lock into vm_area_struct Suren Baghdasaryan
2024-11-11 20:55 ` [PATCH 1/4] mm: introduce vma_start_read_locked{_nested} helpers Suren Baghdasaryan
2024-11-11 20:55 ` [PATCH 2/4] mm: move per-vma lock into vm_area_struct Suren Baghdasaryan
2024-11-12 9:36 ` kernel test robot
2024-11-12 9:47 ` kernel test robot
2024-11-12 10:18 ` kernel test robot
2024-11-12 15:57 ` Vlastimil Babka
2024-11-12 16:08 ` Suren Baghdasaryan
2024-11-12 16:58 ` Suren Baghdasaryan
2024-11-11 20:55 ` [PATCH 3/4] mm: replace rw_semaphore with atomic_t in vma_lock Suren Baghdasaryan
2024-11-12 0:06 ` Andrew Morton
2024-11-12 0:52 ` Suren Baghdasaryan
2024-11-12 0:35 ` Davidlohr Bueso
2024-11-12 0:59 ` Suren Baghdasaryan
2024-11-12 4:58 ` Matthew Wilcox
2024-11-12 15:18 ` Suren Baghdasaryan
2024-12-10 22:38 ` Peter Zijlstra
2024-12-10 23:37 ` Suren Baghdasaryan
2024-12-11 8:25 ` Peter Zijlstra
2024-12-11 15:20 ` Suren Baghdasaryan
2024-12-12 3:01 ` Suren Baghdasaryan
2024-12-12 9:16 ` Peter Zijlstra
2024-12-12 14:17 ` Suren Baghdasaryan
2024-12-12 14:19 ` Suren Baghdasaryan
2024-12-13 4:48 ` Suren Baghdasaryan
2024-12-13 9:57 ` Peter Zijlstra
2024-12-13 17:45 ` Suren Baghdasaryan
2024-12-13 18:19 ` Matthew Wilcox
2024-12-13 18:35 ` Peter Zijlstra
2024-12-13 18:37 ` Suren Baghdasaryan
2024-12-16 19:32 ` Suren Baghdasaryan
2024-12-13 9:22 ` Peter Zijlstra
2024-12-13 17:41 ` Suren Baghdasaryan
2024-11-11 20:55 ` [PATCH 4/4] mm: move lesser used vma_area_struct members into the last cacheline Suren Baghdasaryan
2024-11-12 10:07 ` Vlastimil Babka
2024-11-12 15:45 ` Suren Baghdasaryan
2024-11-11 21:41 ` [PATCH 0/4] move per-vma lock into vm_area_struct Suren Baghdasaryan
2024-11-12 2:48 ` Liam R. Howlett
2024-11-12 3:27 ` Suren Baghdasaryan
2024-11-11 22:18 ` Davidlohr Bueso
2024-11-11 23:19 ` Suren Baghdasaryan
2024-11-12 0:03 ` Davidlohr Bueso
2024-11-12 0:43 ` Suren Baghdasaryan
2024-11-12 0:11 ` Andrew Morton
2024-11-12 0:48 ` Suren Baghdasaryan
2024-11-12 2:35 ` Liam R. Howlett
2024-11-12 3:23 ` Suren Baghdasaryan
2024-11-12 9:39 ` Lorenzo Stoakes
2024-11-12 15:38 ` Suren Baghdasaryan