* [PATCH v5 0/6] move per-vma lock into vm_area_struct
@ 2024-12-06 22:51 Suren Baghdasaryan
2024-12-06 22:51 ` [PATCH v5 1/6] mm: introduce vma_start_read_locked{_nested} helpers Suren Baghdasaryan
` (6 more replies)
0 siblings, 7 replies; 32+ messages in thread
From: Suren Baghdasaryan @ 2024-12-06 22:51 UTC (permalink / raw)
To: akpm
Cc: willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka, hannes,
mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, minchan, jannh,
shakeel.butt, souravpanda, pasha.tatashin, corbet, linux-doc,
linux-mm, linux-kernel, kernel-team, surenb
Back when per-vma locks were introduced, vm_lock was moved out of
vm_area_struct in [1] because of a performance regression caused by
false cacheline sharing. Recent investigation [2] revealed that the
regression is limited to the rather old Broadwell microarchitecture and
even there it can be mitigated by disabling adjacent cacheline
prefetching, see [3].
Splitting a single logical structure into multiple ones leads to more
complicated management, extra pointer dereferences and overall less
maintainable code. When the split-away part is a lock, it complicates
things even further. With no performance benefits, there is no reason
for this split. Merging vm_lock back into vm_area_struct also allows
vm_area_struct to use SLAB_TYPESAFE_BY_RCU later in this patchset.
This patchset:
1. moves vm_lock back into vm_area_struct, aligning it at the cacheline
boundary and changing the cache to be cacheline-aligned to minimize
cacheline sharing;
2. changes vm_area_struct initialization to mark a new vma as detached
until it is inserted into the vma tree;
3. changes the vm_area_struct cache to SLAB_TYPESAFE_BY_RCU to allow for
vma reuse and to minimize call_rcu() calls. To avoid bloating
vm_area_struct, we introduce a unioned freeptr_t field and, in a separate
patch, allow freeptr_offset to be used with a ctor.
Pagefault microbenchmarks do not show noticeable performance change.
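For orientation, here is an abridged sketch of the resulting
struct vm_area_struct; field order is simplified and most members are
elided, see patches 2 and 4 for the actual diffs:

	struct vm_area_struct {
		union {
			struct {
				unsigned long vm_start;
				unsigned long vm_end;
			};
			/* used by SLAB_TYPESAFE_BY_RCU */
			freeptr_t vm_freeptr;
		};
		/* ... other members elided ... */
	#ifdef CONFIG_PER_VMA_LOCK
		/* Unstable RCU readers are allowed to read this. */
		struct vma_lock vm_lock ____cacheline_aligned_in_smp;
	#endif
	} __randomize_layout;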
Changes since v4 [4]:
- Added SOBs, per Lorenzo Stoakes and Shakeel Butt;
- Changed vma_clear() and vma_copy() to set required vma members
individually, per Matthew Wilcox;
- Added comments in vma_start_read() about new false locked result
possibilities, per Vlastimil Babka;
- Added back freeptr_t into vm_area_struct, as it's harmless and can be
used in the future to shrink the structure, per Vlastimil Babka;
- Fixed the race in vm_area_alloc() when vma->detached was temporarily
reset by memcpy() before setting it back, per Vlastimil Babka;
- Added a patch allowing freeptr_offset to be used with ctor,
per Vlastimil Babka.
The patchset applies over linux-next (because the vm_lock change [5] is
not in the mm tree).
[1] https://lore.kernel.org/all/20230227173632.3292573-34-surenb@google.com/
[2] https://lore.kernel.org/all/ZsQyI%2F087V34JoIt@xsang-OptiPlex-9020/
[3] https://lore.kernel.org/all/CAJuCfpEisU8Lfe96AYJDZ+OM4NoPmnw9bP53cT_kbfP_pR+-2g@mail.gmail.com/
[4] https://lore.kernel.org/all/20241120000826.335387-1-surenb@google.com/
[5] https://lore.kernel.org/all/20241122174416.1367052-2-surenb@google.com/
Suren Baghdasaryan (6):
mm: introduce vma_start_read_locked{_nested} helpers
mm: move per-vma lock into vm_area_struct
mm: mark vma as detached until it's added into vma tree
mm: make vma cache SLAB_TYPESAFE_BY_RCU
mm/slab: allow freeptr_offset to be used with ctor
docs/mm: document latest changes to vm_lock
Documentation/mm/process_addrs.rst | 10 +-
include/linux/mm.h | 107 ++++++++++++++----
include/linux/mm_types.h | 16 ++-
include/linux/slab.h | 11 +-
kernel/fork.c | 168 ++++++++++++++++++++---------
mm/memory.c | 17 ++-
mm/slub.c | 2 +-
mm/userfaultfd.c | 22 +---
mm/vma.c | 8 +-
mm/vma.h | 2 +
tools/testing/vma/vma_internal.h | 55 ++++------
11 files changed, 267 insertions(+), 151 deletions(-)
base-commit: ebe1b11614e079c5e366ce9bd3c8f44ca0fbcc1b
--
2.47.0.338.g60cca15819-goog
^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH v5 1/6] mm: introduce vma_start_read_locked{_nested} helpers
2024-12-06 22:51 [PATCH v5 0/6] move per-vma lock into vm_area_struct Suren Baghdasaryan
@ 2024-12-06 22:51 ` Suren Baghdasaryan
2024-12-10 9:03 ` Vlastimil Babka
2024-12-06 22:51 ` [PATCH v5 2/6] mm: move per-vma lock into vm_area_struct Suren Baghdasaryan
` (5 subsequent siblings)
6 siblings, 1 reply; 32+ messages in thread
From: Suren Baghdasaryan @ 2024-12-06 22:51 UTC (permalink / raw)
To: akpm
Cc: willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka, hannes,
mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, minchan, jannh,
shakeel.butt, souravpanda, pasha.tatashin, corbet, linux-doc,
linux-mm, linux-kernel, kernel-team, surenb
Introduce helper functions which can be used to read-lock a VMA when
holding mmap_lock for read. Replace direct accesses to vma->vm_lock
with these new helpers.
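The intended calling pattern, sketched from the uffd_lock_vma()
conversion below (error handling abridged):

	mmap_read_lock(mm);
	vma = find_vma_and_prepare_anon(mm, address);
	if (!IS_ERR(vma))
		/* cannot fail: mmap read lock excludes concurrent write-lockers */
		vma_start_read_locked(vma);
	mmap_read_unlock(mm);
	/* ... operate on the read-locked vma, then ... */
	if (!IS_ERR(vma))
		vma_end_read(vma);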
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
---
include/linux/mm.h | 24 ++++++++++++++++++++++++
mm/userfaultfd.c | 22 +++++-----------------
2 files changed, 29 insertions(+), 17 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 187e42339d8e..c4a001972223 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -734,6 +734,30 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
return true;
}
+/*
+ * Use only while holding mmap read lock which guarantees that locking will not
+ * fail (nobody can concurrently write-lock the vma). vma_start_read() should
+ * not be used in such cases because it might fail due to mm_lock_seq overflow.
+ * This functionality is used to obtain vma read lock and drop the mmap read lock.
+ */
+static inline void vma_start_read_locked_nested(struct vm_area_struct *vma, int subclass)
+{
+ mmap_assert_locked(vma->vm_mm);
+ down_read_nested(&vma->vm_lock->lock, subclass);
+}
+
+/*
+ * Use only while holding mmap read lock which guarantees that locking will not
+ * fail (nobody can concurrently write-lock the vma). vma_start_read() should
+ * not be used in such cases because it might fail due to mm_lock_seq overflow.
+ * This functionality is used to obtain vma read lock and drop the mmap read lock.
+ */
+static inline void vma_start_read_locked(struct vm_area_struct *vma)
+{
+ mmap_assert_locked(vma->vm_mm);
+ down_read(&vma->vm_lock->lock);
+}
+
static inline void vma_end_read(struct vm_area_struct *vma)
{
rcu_read_lock(); /* keeps vma alive till the end of up_read */
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 8e16dc290ddf..bc9a66ec6a6e 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -84,16 +84,8 @@ static struct vm_area_struct *uffd_lock_vma(struct mm_struct *mm,
mmap_read_lock(mm);
vma = find_vma_and_prepare_anon(mm, address);
- if (!IS_ERR(vma)) {
- /*
- * We cannot use vma_start_read() as it may fail due to
- * false locked (see comment in vma_start_read()). We
- * can avoid that by directly locking vm_lock under
- * mmap_lock, which guarantees that nobody can lock the
- * vma for write (vma_start_write()) under us.
- */
- down_read(&vma->vm_lock->lock);
- }
+ if (!IS_ERR(vma))
+ vma_start_read_locked(vma);
mmap_read_unlock(mm);
return vma;
@@ -1491,14 +1483,10 @@ static int uffd_move_lock(struct mm_struct *mm,
mmap_read_lock(mm);
err = find_vmas_mm_locked(mm, dst_start, src_start, dst_vmap, src_vmap);
if (!err) {
- /*
- * See comment in uffd_lock_vma() as to why not using
- * vma_start_read() here.
- */
- down_read(&(*dst_vmap)->vm_lock->lock);
+ vma_start_read_locked(*dst_vmap);
if (*dst_vmap != *src_vmap)
- down_read_nested(&(*src_vmap)->vm_lock->lock,
- SINGLE_DEPTH_NESTING);
+ vma_start_read_locked_nested(*src_vmap,
+ SINGLE_DEPTH_NESTING);
}
mmap_read_unlock(mm);
return err;
--
2.47.0.338.g60cca15819-goog
^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH v5 2/6] mm: move per-vma lock into vm_area_struct
2024-12-06 22:51 [PATCH v5 0/6] move per-vma lock into vm_area_struct Suren Baghdasaryan
2024-12-06 22:51 ` [PATCH v5 1/6] mm: introduce vma_start_read_locked{_nested} helpers Suren Baghdasaryan
@ 2024-12-06 22:51 ` Suren Baghdasaryan
2024-12-10 9:15 ` Vlastimil Babka
2024-12-06 22:52 ` [PATCH v5 3/6] mm: mark vma as detached until it's added into vma tree Suren Baghdasaryan
` (4 subsequent siblings)
6 siblings, 1 reply; 32+ messages in thread
From: Suren Baghdasaryan @ 2024-12-06 22:51 UTC (permalink / raw)
To: akpm
Cc: willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka, hannes,
mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, minchan, jannh,
shakeel.butt, souravpanda, pasha.tatashin, corbet, linux-doc,
linux-mm, linux-kernel, kernel-team, surenb
Back when per-vma locks were introduced, vm_lock was moved out of
vm_area_struct in [1] because of a performance regression caused by
false cacheline sharing. Recent investigation [2] revealed that the
regression is limited to the rather old Broadwell microarchitecture and
even there it can be mitigated by disabling adjacent cacheline
prefetching, see [3].
Splitting a single logical structure into multiple ones leads to more
complicated management, extra pointer dereferences and overall less
maintainable code. When the split-away part is a lock, it complicates
things even further. With no performance benefits, there is no reason
for this split. Merging vm_lock back into vm_area_struct also allows
vm_area_struct to use SLAB_TYPESAFE_BY_RCU later in this patchset.
Move vm_lock back into vm_area_struct, aligning it at the cacheline
boundary and changing the cache to be cacheline-aligned as well.
With a kernel compiled using defconfig, this causes VMA memory consumption
to grow from 160 (vm_area_struct) + 40 (vm_lock) bytes to 256 bytes:
slabinfo before:
<name> ... <objsize> <objperslab> <pagesperslab> : ...
vma_lock ... 40 102 1 : ...
vm_area_struct ... 160 51 2 : ...
slabinfo after moving vm_lock:
<name> ... <objsize> <objperslab> <pagesperslab> : ...
vm_area_struct ... 256 32 2 : ...
Aggregate VMA memory consumption per 1000 VMAs grows from 50 to 64 pages,
an increase of 5.5MB per 100000 VMAs. Note that the size of this structure
depends on the kernel configuration, and the original size is typically
higher than 160 bytes, so these calculations are close to the worst-case
scenario. A more realistic vm_area_struct size before this
change is:
<name> ... <objsize> <objperslab> <pagesperslab> : ...
vma_lock ... 40 102 1 : ...
vm_area_struct ... 176 46 2 : ...
Aggregate VMA memory consumption per 1000 VMAs grows from 54 to 64 pages,
an increase of 3.9MB per 100000 VMAs.
This memory consumption growth can be addressed later by optimizing the
vm_lock.
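For reference, the page counts above follow from the slabinfo geometry
(4KB pages, rounding up to whole slabs):

	worst case before: 1000/51  -> 20 slabs * 2 pages = 40 (vm_area_struct)
	                   1000/102 -> 10 slabs * 1 page  = 10 (vma_lock), total 50
	worst case after:  1000/32  -> 32 slabs * 2 pages = 64
	realistic before:  1000/46  -> 22 slabs * 2 pages = 44, plus 10 = 54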
[1] https://lore.kernel.org/all/20230227173632.3292573-34-surenb@google.com/
[2] https://lore.kernel.org/all/ZsQyI%2F087V34JoIt@xsang-OptiPlex-9020/
[3] https://lore.kernel.org/all/CAJuCfpEisU8Lfe96AYJDZ+OM4NoPmnw9bP53cT_kbfP_pR+-2g@mail.gmail.com/
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
---
include/linux/mm.h | 28 ++++++++++--------
include/linux/mm_types.h | 6 ++--
kernel/fork.c | 49 ++++----------------------------
tools/testing/vma/vma_internal.h | 33 +++++----------------
4 files changed, 32 insertions(+), 84 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index c4a001972223..ee71a504ef88 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -696,6 +696,12 @@ static inline void vma_numab_state_free(struct vm_area_struct *vma) {}
#endif /* CONFIG_NUMA_BALANCING */
#ifdef CONFIG_PER_VMA_LOCK
+static inline void vma_lock_init(struct vm_area_struct *vma)
+{
+ init_rwsem(&vma->vm_lock.lock);
+ vma->vm_lock_seq = UINT_MAX;
+}
+
/*
* Try to read-lock a vma. The function is allowed to occasionally yield false
* locked result to avoid performance overhead, in which case we fall back to
@@ -713,7 +719,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq.sequence))
return false;
- if (unlikely(down_read_trylock(&vma->vm_lock->lock) == 0))
+ if (unlikely(down_read_trylock(&vma->vm_lock.lock) == 0))
return false;
/*
@@ -728,7 +734,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
* This pairs with RELEASE semantics in vma_end_write_all().
*/
if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&vma->vm_mm->mm_lock_seq))) {
- up_read(&vma->vm_lock->lock);
+ up_read(&vma->vm_lock.lock);
return false;
}
return true;
@@ -743,7 +749,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
static inline void vma_start_read_locked_nested(struct vm_area_struct *vma, int subclass)
{
mmap_assert_locked(vma->vm_mm);
- down_read_nested(&vma->vm_lock->lock, subclass);
+ down_read_nested(&vma->vm_lock.lock, subclass);
}
/*
@@ -755,13 +761,13 @@ static inline void vma_start_read_locked_nested(struct vm_area_struct *vma, int
static inline void vma_start_read_locked(struct vm_area_struct *vma)
{
mmap_assert_locked(vma->vm_mm);
- down_read(&vma->vm_lock->lock);
+ down_read(&vma->vm_lock.lock);
}
static inline void vma_end_read(struct vm_area_struct *vma)
{
rcu_read_lock(); /* keeps vma alive till the end of up_read */
- up_read(&vma->vm_lock->lock);
+ up_read(&vma->vm_lock.lock);
rcu_read_unlock();
}
@@ -790,7 +796,7 @@ static inline void vma_start_write(struct vm_area_struct *vma)
if (__is_vma_write_locked(vma, &mm_lock_seq))
return;
- down_write(&vma->vm_lock->lock);
+ down_write(&vma->vm_lock.lock);
/*
* We should use WRITE_ONCE() here because we can have concurrent reads
* from the early lockless pessimistic check in vma_start_read().
@@ -798,7 +804,7 @@ static inline void vma_start_write(struct vm_area_struct *vma)
* we should use WRITE_ONCE() for cleanliness and to keep KCSAN happy.
*/
WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
- up_write(&vma->vm_lock->lock);
+ up_write(&vma->vm_lock.lock);
}
static inline void vma_assert_write_locked(struct vm_area_struct *vma)
@@ -810,7 +816,7 @@ static inline void vma_assert_write_locked(struct vm_area_struct *vma)
static inline void vma_assert_locked(struct vm_area_struct *vma)
{
- if (!rwsem_is_locked(&vma->vm_lock->lock))
+ if (!rwsem_is_locked(&vma->vm_lock.lock))
vma_assert_write_locked(vma);
}
@@ -843,6 +849,7 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
#else /* CONFIG_PER_VMA_LOCK */
+static inline void vma_lock_init(struct vm_area_struct *vma) {}
static inline bool vma_start_read(struct vm_area_struct *vma)
{ return false; }
static inline void vma_end_read(struct vm_area_struct *vma) {}
@@ -877,10 +884,6 @@ static inline void assert_fault_locked(struct vm_fault *vmf)
extern const struct vm_operations_struct vma_dummy_vm_ops;
-/*
- * WARNING: vma_init does not initialize vma->vm_lock.
- * Use vm_area_alloc()/vm_area_free() if vma needs locking.
- */
static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
{
memset(vma, 0, sizeof(*vma));
@@ -889,6 +892,7 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
INIT_LIST_HEAD(&vma->anon_vma_chain);
vma_mark_detached(vma, false);
vma_numab_state_init(vma);
+ vma_lock_init(vma);
}
/* Use when VMA is not part of the VMA tree and needs no locking */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 706b3c926a08..be3551654325 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -700,8 +700,6 @@ struct vm_area_struct {
* slowpath.
*/
unsigned int vm_lock_seq;
- /* Unstable RCU readers are allowed to read this. */
- struct vma_lock *vm_lock;
#endif
/*
@@ -754,6 +752,10 @@ struct vm_area_struct {
struct vma_numab_state *numab_state; /* NUMA Balancing state */
#endif
struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
+#ifdef CONFIG_PER_VMA_LOCK
+ /* Unstable RCU readers are allowed to read this. */
+ struct vma_lock vm_lock ____cacheline_aligned_in_smp;
+#endif
} __randomize_layout;
#ifdef CONFIG_NUMA
diff --git a/kernel/fork.c b/kernel/fork.c
index 37055b4c30fb..21660a9ad97a 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -436,35 +436,6 @@ static struct kmem_cache *vm_area_cachep;
/* SLAB cache for mm_struct structures (tsk->mm) */
static struct kmem_cache *mm_cachep;
-#ifdef CONFIG_PER_VMA_LOCK
-
-/* SLAB cache for vm_area_struct.lock */
-static struct kmem_cache *vma_lock_cachep;
-
-static bool vma_lock_alloc(struct vm_area_struct *vma)
-{
- vma->vm_lock = kmem_cache_alloc(vma_lock_cachep, GFP_KERNEL);
- if (!vma->vm_lock)
- return false;
-
- init_rwsem(&vma->vm_lock->lock);
- vma->vm_lock_seq = UINT_MAX;
-
- return true;
-}
-
-static inline void vma_lock_free(struct vm_area_struct *vma)
-{
- kmem_cache_free(vma_lock_cachep, vma->vm_lock);
-}
-
-#else /* CONFIG_PER_VMA_LOCK */
-
-static inline bool vma_lock_alloc(struct vm_area_struct *vma) { return true; }
-static inline void vma_lock_free(struct vm_area_struct *vma) {}
-
-#endif /* CONFIG_PER_VMA_LOCK */
-
struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
{
struct vm_area_struct *vma;
@@ -474,10 +445,6 @@ struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
return NULL;
vma_init(vma, mm);
- if (!vma_lock_alloc(vma)) {
- kmem_cache_free(vm_area_cachep, vma);
- return NULL;
- }
return vma;
}
@@ -496,10 +463,7 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
* will be reinitialized.
*/
data_race(memcpy(new, orig, sizeof(*new)));
- if (!vma_lock_alloc(new)) {
- kmem_cache_free(vm_area_cachep, new);
- return NULL;
- }
+ vma_lock_init(new);
INIT_LIST_HEAD(&new->anon_vma_chain);
vma_numab_state_init(new);
dup_anon_vma_name(orig, new);
@@ -511,7 +475,6 @@ void __vm_area_free(struct vm_area_struct *vma)
{
vma_numab_state_free(vma);
free_anon_vma_name(vma);
- vma_lock_free(vma);
kmem_cache_free(vm_area_cachep, vma);
}
@@ -522,7 +485,7 @@ static void vm_area_free_rcu_cb(struct rcu_head *head)
vm_rcu);
/* The vma should not be locked while being destroyed. */
- VM_BUG_ON_VMA(rwsem_is_locked(&vma->vm_lock->lock), vma);
+ VM_BUG_ON_VMA(rwsem_is_locked(&vma->vm_lock.lock), vma);
__vm_area_free(vma);
}
#endif
@@ -3190,11 +3153,9 @@ void __init proc_caches_init(void)
sizeof(struct fs_struct), 0,
SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT,
NULL);
-
- vm_area_cachep = KMEM_CACHE(vm_area_struct, SLAB_PANIC|SLAB_ACCOUNT);
-#ifdef CONFIG_PER_VMA_LOCK
- vma_lock_cachep = KMEM_CACHE(vma_lock, SLAB_PANIC|SLAB_ACCOUNT);
-#endif
+ vm_area_cachep = KMEM_CACHE(vm_area_struct,
+ SLAB_HWCACHE_ALIGN|SLAB_NO_MERGE|SLAB_PANIC|
+ SLAB_ACCOUNT);
mmap_init();
nsproxy_cache_init();
}
diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
index b973b3e41c83..568c18d24d53 100644
--- a/tools/testing/vma/vma_internal.h
+++ b/tools/testing/vma/vma_internal.h
@@ -270,10 +270,10 @@ struct vm_area_struct {
/*
* Can only be written (using WRITE_ONCE()) while holding both:
* - mmap_lock (in write mode)
- * - vm_lock->lock (in write mode)
+ * - vm_lock.lock (in write mode)
* Can be read reliably while holding one of:
* - mmap_lock (in read or write mode)
- * - vm_lock->lock (in read or write mode)
+ * - vm_lock.lock (in read or write mode)
* Can be read unreliably (using READ_ONCE()) for pessimistic bailout
* while holding nothing (except RCU to keep the VMA struct allocated).
*
@@ -282,7 +282,7 @@ struct vm_area_struct {
* slowpath.
*/
unsigned int vm_lock_seq;
- struct vma_lock *vm_lock;
+ struct vma_lock vm_lock;
#endif
/*
@@ -459,17 +459,10 @@ static inline struct vm_area_struct *vma_next(struct vma_iterator *vmi)
return mas_find(&vmi->mas, ULONG_MAX);
}
-static inline bool vma_lock_alloc(struct vm_area_struct *vma)
+static inline void vma_lock_init(struct vm_area_struct *vma)
{
- vma->vm_lock = calloc(1, sizeof(struct vma_lock));
-
- if (!vma->vm_lock)
- return false;
-
- init_rwsem(&vma->vm_lock->lock);
+ init_rwsem(&vma->vm_lock.lock);
vma->vm_lock_seq = UINT_MAX;
-
- return true;
}
static inline void vma_assert_write_locked(struct vm_area_struct *);
@@ -492,6 +485,7 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
vma->vm_ops = &vma_dummy_vm_ops;
INIT_LIST_HEAD(&vma->anon_vma_chain);
vma_mark_detached(vma, false);
+ vma_lock_init(vma);
}
static inline struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
@@ -502,10 +496,6 @@ static inline struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
return NULL;
vma_init(vma, mm);
- if (!vma_lock_alloc(vma)) {
- free(vma);
- return NULL;
- }
return vma;
}
@@ -518,10 +508,7 @@ static inline struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
return NULL;
memcpy(new, orig, sizeof(*new));
- if (!vma_lock_alloc(new)) {
- free(new);
- return NULL;
- }
+ vma_lock_init(new);
INIT_LIST_HEAD(&new->anon_vma_chain);
return new;
@@ -691,14 +678,8 @@ static inline void mpol_put(struct mempolicy *)
{
}
-static inline void vma_lock_free(struct vm_area_struct *vma)
-{
- free(vma->vm_lock);
-}
-
static inline void __vm_area_free(struct vm_area_struct *vma)
{
- vma_lock_free(vma);
free(vma);
}
--
2.47.0.338.g60cca15819-goog
^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH v5 3/6] mm: mark vma as detached until it's added into vma tree
2024-12-06 22:51 [PATCH v5 0/6] move per-vma lock into vm_area_struct Suren Baghdasaryan
2024-12-06 22:51 ` [PATCH v5 1/6] mm: introduce vma_start_read_locked{_nested} helpers Suren Baghdasaryan
2024-12-06 22:51 ` [PATCH v5 2/6] mm: move per-vma lock into vm_area_struct Suren Baghdasaryan
@ 2024-12-06 22:52 ` Suren Baghdasaryan
2024-12-10 9:35 ` Vlastimil Babka
2024-12-10 11:36 ` Vlastimil Babka
2024-12-06 22:52 ` [PATCH v5 4/6] mm: make vma cache SLAB_TYPESAFE_BY_RCU Suren Baghdasaryan
` (3 subsequent siblings)
6 siblings, 2 replies; 32+ messages in thread
From: Suren Baghdasaryan @ 2024-12-06 22:52 UTC (permalink / raw)
To: akpm
Cc: willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka, hannes,
mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, minchan, jannh,
shakeel.butt, souravpanda, pasha.tatashin, corbet, linux-doc,
linux-mm, linux-kernel, kernel-team, surenb
The current implementation does not set the detached flag when a VMA is
first allocated. This does not represent the real state of the VMA, which
is detached until it is added into the mm's VMA tree. Fix this by marking
new VMAs as detached and clearing the detached flag only after the VMA is
added into the tree.
Introduce vma_mark_attached() to make the API more readable and to
simplify a possible future cleanup where vma->vm_mm might be used to
indicate a detached vma, at which point vma_mark_attached() will need an
additional mm parameter.
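A minimal sketch of the resulting lifecycle (helpers as introduced in the
diff below):

	vma = vm_area_alloc(mm);	/* starts with vma->detached == true */
	/* ... populate the vma ... */
	vma_iter_store(&vmi, vma);	/* inserts into the tree and calls
					 * vma_mark_attached(vma) */
	/* ... */
	vma_start_write(vma);		/* detaching requires the vma write lock */
	vma_mark_detached(vma);		/* e.g. from vms_gather_munmap_vmas() */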
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
include/linux/mm.h | 27 ++++++++++++++++++++-------
kernel/fork.c | 4 ++++
mm/memory.c | 2 +-
mm/vma.c | 6 +++---
mm/vma.h | 2 ++
tools/testing/vma/vma_internal.h | 17 ++++++++++++-----
6 files changed, 42 insertions(+), 16 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ee71a504ef88..2bf38c1e9cca 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -820,12 +820,21 @@ static inline void vma_assert_locked(struct vm_area_struct *vma)
vma_assert_write_locked(vma);
}
-static inline void vma_mark_detached(struct vm_area_struct *vma, bool detached)
+static inline void vma_mark_attached(struct vm_area_struct *vma)
+{
+ vma->detached = false;
+}
+
+static inline void vma_mark_detached(struct vm_area_struct *vma)
{
/* When detaching vma should be write-locked */
- if (detached)
- vma_assert_write_locked(vma);
- vma->detached = detached;
+ vma_assert_write_locked(vma);
+ vma->detached = true;
+}
+
+static inline bool is_vma_detached(struct vm_area_struct *vma)
+{
+ return vma->detached;
}
static inline void release_fault_lock(struct vm_fault *vmf)
@@ -856,8 +865,8 @@ static inline void vma_end_read(struct vm_area_struct *vma) {}
static inline void vma_start_write(struct vm_area_struct *vma) {}
static inline void vma_assert_write_locked(struct vm_area_struct *vma)
{ mmap_assert_write_locked(vma->vm_mm); }
-static inline void vma_mark_detached(struct vm_area_struct *vma,
- bool detached) {}
+static inline void vma_mark_attached(struct vm_area_struct *vma) {}
+static inline void vma_mark_detached(struct vm_area_struct *vma) {}
static inline struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
unsigned long address)
@@ -890,7 +899,10 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
vma->vm_mm = mm;
vma->vm_ops = &vma_dummy_vm_ops;
INIT_LIST_HEAD(&vma->anon_vma_chain);
- vma_mark_detached(vma, false);
+#ifdef CONFIG_PER_VMA_LOCK
+ /* vma is not locked, can't use vma_mark_detached() */
+ vma->detached = true;
+#endif
vma_numab_state_init(vma);
vma_lock_init(vma);
}
@@ -1085,6 +1097,7 @@ static inline int vma_iter_bulk_store(struct vma_iterator *vmi,
if (unlikely(mas_is_err(&vmi->mas)))
return -ENOMEM;
+ vma_mark_attached(vma);
return 0;
}
diff --git a/kernel/fork.c b/kernel/fork.c
index 21660a9ad97a..71990f46aa4e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -465,6 +465,10 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
data_race(memcpy(new, orig, sizeof(*new)));
vma_lock_init(new);
INIT_LIST_HEAD(&new->anon_vma_chain);
+#ifdef CONFIG_PER_VMA_LOCK
+ /* vma is not locked, can't use vma_mark_detached() */
+ new->detached = true;
+#endif
vma_numab_state_init(new);
dup_anon_vma_name(orig, new);
diff --git a/mm/memory.c b/mm/memory.c
index f823906a4a0f..b252f19b28c9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6372,7 +6372,7 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
goto inval;
/* Check if the VMA got isolated after we found it */
- if (vma->detached) {
+ if (is_vma_detached(vma)) {
vma_end_read(vma);
count_vm_vma_lock_event(VMA_LOCK_MISS);
/* The area was replaced with another one */
diff --git a/mm/vma.c b/mm/vma.c
index a06747845cac..cdc63728f47f 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -327,7 +327,7 @@ static void vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi,
if (vp->remove) {
again:
- vma_mark_detached(vp->remove, true);
+ vma_mark_detached(vp->remove);
if (vp->file) {
uprobe_munmap(vp->remove, vp->remove->vm_start,
vp->remove->vm_end);
@@ -1220,7 +1220,7 @@ static void reattach_vmas(struct ma_state *mas_detach)
mas_set(mas_detach, 0);
mas_for_each(mas_detach, vma, ULONG_MAX)
- vma_mark_detached(vma, false);
+ vma_mark_attached(vma);
__mt_destroy(mas_detach->tree);
}
@@ -1295,7 +1295,7 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
if (error)
goto munmap_gather_failed;
- vma_mark_detached(next, true);
+ vma_mark_detached(next);
nrpages = vma_pages(next);
vms->nr_pages += nrpages;
diff --git a/mm/vma.h b/mm/vma.h
index 295d44ea54db..32d99b2963df 100644
--- a/mm/vma.h
+++ b/mm/vma.h
@@ -156,6 +156,7 @@ static inline int vma_iter_store_gfp(struct vma_iterator *vmi,
if (unlikely(mas_is_err(&vmi->mas)))
return -ENOMEM;
+ vma_mark_attached(vma);
return 0;
}
@@ -385,6 +386,7 @@ static inline void vma_iter_store(struct vma_iterator *vmi,
__mas_set_range(&vmi->mas, vma->vm_start, vma->vm_end - 1);
mas_store_prealloc(&vmi->mas, vma);
+ vma_mark_attached(vma);
}
static inline unsigned long vma_iter_addr(struct vma_iterator *vmi)
diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
index 568c18d24d53..0cdc5f8c3d60 100644
--- a/tools/testing/vma/vma_internal.h
+++ b/tools/testing/vma/vma_internal.h
@@ -465,13 +465,17 @@ static inline void vma_lock_init(struct vm_area_struct *vma)
vma->vm_lock_seq = UINT_MAX;
}
+static inline void vma_mark_attached(struct vm_area_struct *vma)
+{
+ vma->detached = false;
+}
+
static inline void vma_assert_write_locked(struct vm_area_struct *);
-static inline void vma_mark_detached(struct vm_area_struct *vma, bool detached)
+static inline void vma_mark_detached(struct vm_area_struct *vma)
{
/* When detaching vma should be write-locked */
- if (detached)
- vma_assert_write_locked(vma);
- vma->detached = detached;
+ vma_assert_write_locked(vma);
+ vma->detached = true;
}
extern const struct vm_operations_struct vma_dummy_vm_ops;
@@ -484,7 +488,8 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
vma->vm_mm = mm;
vma->vm_ops = &vma_dummy_vm_ops;
INIT_LIST_HEAD(&vma->anon_vma_chain);
- vma_mark_detached(vma, false);
+ /* vma is not locked, can't use vma_mark_detached() */
+ vma->detached = true;
vma_lock_init(vma);
}
@@ -510,6 +515,8 @@ static inline struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
memcpy(new, orig, sizeof(*new));
vma_lock_init(new);
INIT_LIST_HEAD(&new->anon_vma_chain);
+ /* vma is not locked, can't use vma_mark_detached() */
+ new->detached = true;
return new;
}
--
2.47.0.338.g60cca15819-goog
^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH v5 4/6] mm: make vma cache SLAB_TYPESAFE_BY_RCU
2024-12-06 22:51 [PATCH v5 0/6] move per-vma lock into vm_area_struct Suren Baghdasaryan
` (2 preceding siblings ...)
2024-12-06 22:52 ` [PATCH v5 3/6] mm: mark vma as detached until it's added into vma tree Suren Baghdasaryan
@ 2024-12-06 22:52 ` Suren Baghdasaryan
2024-12-09 17:35 ` Klara Modin
` (2 more replies)
2024-12-06 22:52 ` [PATCH v5 5/6] mm/slab: allow freeptr_offset to be used with ctor Suren Baghdasaryan
` (2 subsequent siblings)
6 siblings, 3 replies; 32+ messages in thread
From: Suren Baghdasaryan @ 2024-12-06 22:52 UTC (permalink / raw)
To: akpm
Cc: willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka, hannes,
mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, minchan, jannh,
shakeel.butt, souravpanda, pasha.tatashin, corbet, linux-doc,
linux-mm, linux-kernel, kernel-team, surenb
To enable SLAB_TYPESAFE_BY_RCU for vma cache we need to ensure that
object reuse before RCU grace period is over will be detected inside
lock_vma_under_rcu().
lock_vma_under_rcu() enters RCU read section, finds the vma at the
given address, locks the vma and checks if it got detached or remapped
to cover a different address range. These last checks are there
to ensure that the vma was not modified after we found it but before
locking it.
vma reuse introduces several new possibilities:
1. vma can be reused after it was found but before it is locked;
2. vma can be reused and reinitialized (including changing its vm_mm)
while being locked in vma_start_read();
3. vma can be reused and reinitialized after it was found but before
it is locked, then attached at a new address or to a new mm while
read-locked;
For case #1, the current checks already help detect cases where:
- the vma was reused but not yet added into the tree (detached check);
- the vma was reused at a different address range (address check).
We are missing a check on vm_mm to ensure the reused vma was not
attached to a different mm. This patch adds the missing check.
For case #2, we pass mm to vma_start_read() to prevent access to
unstable vma->vm_mm. This might lead to vma_start_read() returning
a false locked result but that's not critical if it's rare because
it will only lead to a retry under mmap_lock.
For case #3, we ensure the order in which the vma->detached flag and
vm_start/vm_end/vm_mm are set and checked. The vma gets attached after
vm_start/vm_end/vm_mm are set, and lock_vma_under_rcu() checks
vma->detached before checking vm_start/vm_end/vm_mm. This is required
because attaching a vma happens without the vma write-lock, as opposed
to vma detaching, which requires the vma write-lock. This patch adds the
memory barriers inside is_vma_detached() and vma_mark_attached() needed
to order reads and writes of vma->detached against vm_start/vm_end/vm_mm.
After these provisions, SLAB_TYPESAFE_BY_RCU is added to vm_area_cachep.
This will facilitate vm_area_struct reuse and will minimize the number
of call_rcu() calls.
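A condensed sketch of the reader-side ordering this relies on, following
lock_vma_under_rcu() as changed below (labels simplified):

	rcu_read_lock();
	vma = mas_walk(&mas);			/* might be a reused object */
	if (!vma || !vma_start_read(mm, vma))
		goto fallback;	/* mm is passed in: vma->vm_mm may be unstable */
	/* pairs with smp_wmb() in vma_mark_attached() */
	if (is_vma_detached(vma))
		goto retry;
	/* only now do vm_mm/vm_start/vm_end describe this (attached) vma */
	if (vma->vm_mm != mm || address < vma->vm_start || address >= vma->vm_end)
		goto retry;
	rcu_read_unlock();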
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
include/linux/mm.h | 36 +++++--
include/linux/mm_types.h | 10 +-
include/linux/slab.h | 6 --
kernel/fork.c | 157 +++++++++++++++++++++++++------
mm/memory.c | 15 ++-
mm/vma.c | 2 +-
tools/testing/vma/vma_internal.h | 7 +-
7 files changed, 179 insertions(+), 54 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2bf38c1e9cca..3568bcbc7c81 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -257,7 +257,7 @@ struct vm_area_struct *vm_area_alloc(struct mm_struct *);
struct vm_area_struct *vm_area_dup(struct vm_area_struct *);
void vm_area_free(struct vm_area_struct *);
/* Use only if VMA has no other users */
-void __vm_area_free(struct vm_area_struct *vma);
+void vm_area_free_unreachable(struct vm_area_struct *vma);
#ifndef CONFIG_MMU
extern struct rb_root nommu_region_tree;
@@ -706,8 +706,10 @@ static inline void vma_lock_init(struct vm_area_struct *vma)
* Try to read-lock a vma. The function is allowed to occasionally yield false
* locked result to avoid performance overhead, in which case we fall back to
* using mmap_lock. The function should never yield false unlocked result.
+ * False locked result is possible if mm_lock_seq overflows or if vma gets
+ * reused and attached to a different mm before we lock it.
*/
-static inline bool vma_start_read(struct vm_area_struct *vma)
+static inline bool vma_start_read(struct mm_struct *mm, struct vm_area_struct *vma)
{
/*
* Check before locking. A race might cause false locked result.
@@ -716,7 +718,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
* we don't rely on for anything - the mm_lock_seq read against which we
* need ordering is below.
*/
- if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq.sequence))
+ if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(mm->mm_lock_seq.sequence))
return false;
if (unlikely(down_read_trylock(&vma->vm_lock.lock) == 0))
@@ -733,7 +735,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
* after it has been unlocked.
* This pairs with RELEASE semantics in vma_end_write_all().
*/
- if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&vma->vm_mm->mm_lock_seq))) {
+ if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&mm->mm_lock_seq))) {
up_read(&vma->vm_lock.lock);
return false;
}
@@ -822,7 +824,15 @@ static inline void vma_assert_locked(struct vm_area_struct *vma)
static inline void vma_mark_attached(struct vm_area_struct *vma)
{
- vma->detached = false;
+ /*
+ * This pairs with smp_rmb() inside is_vma_detached().
+ * vma is marked attached after all vma modifications are done and it
+ * got added into the vma tree. All prior vma modifications should be
+ * made visible before marking the vma attached.
+ */
+ smp_wmb();
+ /* This pairs with READ_ONCE() in is_vma_detached(). */
+ WRITE_ONCE(vma->detached, false);
}
static inline void vma_mark_detached(struct vm_area_struct *vma)
@@ -834,7 +844,18 @@ static inline void vma_mark_detached(struct vm_area_struct *vma)
static inline bool is_vma_detached(struct vm_area_struct *vma)
{
- return vma->detached;
+ bool detached;
+
+ /* This pairs with WRITE_ONCE() in vma_mark_attached(). */
+ detached = READ_ONCE(vma->detached);
+ /*
+ * This pairs with smp_wmb() inside vma_mark_attached() to ensure
+ * vma->detached is read before vma attributes read later inside
+ * lock_vma_under_rcu().
+ */
+ smp_rmb();
+
+ return detached;
}
static inline void release_fault_lock(struct vm_fault *vmf)
@@ -859,7 +880,7 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
#else /* CONFIG_PER_VMA_LOCK */
static inline void vma_lock_init(struct vm_area_struct *vma) {}
-static inline bool vma_start_read(struct vm_area_struct *vma)
+static inline bool vma_start_read(struct mm_struct *mm, struct vm_area_struct *vma)
{ return false; }
static inline void vma_end_read(struct vm_area_struct *vma) {}
static inline void vma_start_write(struct vm_area_struct *vma) {}
@@ -893,6 +914,7 @@ static inline void assert_fault_locked(struct vm_fault *vmf)
extern const struct vm_operations_struct vma_dummy_vm_ops;
+/* Use on VMAs not created using vm_area_alloc() */
static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
{
memset(vma, 0, sizeof(*vma));
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index be3551654325..5d8779997266 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -543,6 +543,12 @@ static inline void *folio_get_private(struct folio *folio)
typedef unsigned long vm_flags_t;
+/*
+ * freeptr_t represents a SLUB freelist pointer, which might be encoded
+ * and not dereferenceable if CONFIG_SLAB_FREELIST_HARDENED is enabled.
+ */
+typedef struct { unsigned long v; } freeptr_t;
+
/*
* A region containing a mapping of a non-memory backed file under NOMMU
* conditions. These are held in a global tree and are pinned by the VMAs that
@@ -657,9 +663,7 @@ struct vm_area_struct {
unsigned long vm_start;
unsigned long vm_end;
};
-#ifdef CONFIG_PER_VMA_LOCK
- struct rcu_head vm_rcu; /* Used for deferred freeing. */
-#endif
+ freeptr_t vm_freeptr; /* Pointer used by SLAB_TYPESAFE_BY_RCU */
};
/*
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 10a971c2bde3..681b685b6c4e 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -234,12 +234,6 @@ enum _slab_flag_bits {
#define SLAB_NO_OBJ_EXT __SLAB_FLAG_UNUSED
#endif
-/*
- * freeptr_t represents a SLUB freelist pointer, which might be encoded
- * and not dereferenceable if CONFIG_SLAB_FREELIST_HARDENED is enabled.
- */
-typedef struct { unsigned long v; } freeptr_t;
-
/*
* ZERO_SIZE_PTR will be returned for zero sized kmalloc requests.
*
diff --git a/kernel/fork.c b/kernel/fork.c
index 71990f46aa4e..e7e76a660e4c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -436,6 +436,98 @@ static struct kmem_cache *vm_area_cachep;
/* SLAB cache for mm_struct structures (tsk->mm) */
static struct kmem_cache *mm_cachep;
+static void vm_area_ctor(void *data)
+{
+ struct vm_area_struct *vma = (struct vm_area_struct *)data;
+
+#ifdef CONFIG_PER_VMA_LOCK
+ /* vma is not locked, can't use vma_mark_detached() */
+ vma->detached = true;
+#endif
+ INIT_LIST_HEAD(&vma->anon_vma_chain);
+ vma_lock_init(vma);
+}
+
+#ifdef CONFIG_PER_VMA_LOCK
+
+static void vma_clear(struct vm_area_struct *vma, struct mm_struct *mm)
+{
+ vma->vm_mm = mm;
+ vma->vm_ops = &vma_dummy_vm_ops;
+ vma->vm_start = 0;
+ vma->vm_end = 0;
+ vma->anon_vma = NULL;
+ vma->vm_pgoff = 0;
+ vma->vm_file = NULL;
+ vma->vm_private_data = NULL;
+ vm_flags_init(vma, 0);
+ memset(&vma->vm_page_prot, 0, sizeof(vma->vm_page_prot));
+ memset(&vma->shared, 0, sizeof(vma->shared));
+ memset(&vma->vm_userfaultfd_ctx, 0, sizeof(vma->vm_userfaultfd_ctx));
+ vma_numab_state_init(vma);
+#ifdef CONFIG_ANON_VMA_NAME
+ vma->anon_name = NULL;
+#endif
+#ifdef CONFIG_SWAP
+ memset(&vma->swap_readahead_info, 0, sizeof(vma->swap_readahead_info));
+#endif
+#ifndef CONFIG_MMU
+ vma->vm_region = NULL;
+#endif
+#ifdef CONFIG_NUMA
+ vma->vm_policy = NULL;
+#endif
+}
+
+static void vma_copy(const struct vm_area_struct *src, struct vm_area_struct *dest)
+{
+ dest->vm_mm = src->vm_mm;
+ dest->vm_ops = src->vm_ops;
+ dest->vm_start = src->vm_start;
+ dest->vm_end = src->vm_end;
+ dest->anon_vma = src->anon_vma;
+ dest->vm_pgoff = src->vm_pgoff;
+ dest->vm_file = src->vm_file;
+ dest->vm_private_data = src->vm_private_data;
+ vm_flags_init(dest, src->vm_flags);
+ memcpy(&dest->vm_page_prot, &src->vm_page_prot,
+ sizeof(dest->vm_page_prot));
+ memcpy(&dest->shared, &src->shared, sizeof(dest->shared));
+ memcpy(&dest->vm_userfaultfd_ctx, &src->vm_userfaultfd_ctx,
+ sizeof(dest->vm_userfaultfd_ctx));
+#ifdef CONFIG_ANON_VMA_NAME
+ dest->anon_name = src->anon_name;
+#endif
+#ifdef CONFIG_SWAP
+ memcpy(&dest->swap_readahead_info, &src->swap_readahead_info,
+ sizeof(dest->swap_readahead_info));
+#endif
+#ifndef CONFIG_MMU
+ dest->vm_region = src->vm_region;
+#endif
+#ifdef CONFIG_NUMA
+ dest->vm_policy = src->vm_policy;
+#endif
+}
+
+#else /* CONFIG_PER_VMA_LOCK */
+
+static void vma_clear(struct vm_area_struct *vma, struct mm_struct *mm)
+{
+ vma_init(vma, mm);
+}
+
+static void vma_copy(const struct vm_area_struct *src, struct vm_area_struct *dest)
+{
+ /*
+ * orig->shared.rb may be modified concurrently, but the clone
+ * will be reinitialized.
+ */
+ data_race(memcpy(dest, src, sizeof(*dest)));
+}
+
+#endif /* CONFIG_PER_VMA_LOCK */
+
struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
{
struct vm_area_struct *vma;
@@ -444,7 +536,7 @@ struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
if (!vma)
return NULL;
- vma_init(vma, mm);
+ vma_clear(vma, mm);
return vma;
}
@@ -458,49 +550,46 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
ASSERT_EXCLUSIVE_WRITER(orig->vm_flags);
ASSERT_EXCLUSIVE_WRITER(orig->vm_file);
- /*
- * orig->shared.rb may be modified concurrently, but the clone
- * will be reinitialized.
- */
- data_race(memcpy(new, orig, sizeof(*new)));
- vma_lock_init(new);
- INIT_LIST_HEAD(&new->anon_vma_chain);
-#ifdef CONFIG_PER_VMA_LOCK
- /* vma is not locked, can't use vma_mark_detached() */
- new->detached = true;
-#endif
+ vma_copy(orig, new);
vma_numab_state_init(new);
dup_anon_vma_name(orig, new);
return new;
}
-void __vm_area_free(struct vm_area_struct *vma)
+static void __vm_area_free(struct vm_area_struct *vma, bool unreachable)
{
+#ifdef CONFIG_PER_VMA_LOCK
+ /*
+ * With SLAB_TYPESAFE_BY_RCU, vma can be reused and we need
+ * vma->detached to be set before vma is returned into the cache.
+ * This way reused object won't be used by readers until it's
+ * initialized and reattached.
+ * If vma is unreachable, there can be no other users and we
+ * can set vma->detached directly with no risk of a race.
+ * If vma is reachable, then it should have been already detached
+ * under vma write-lock or it was never attached.
+ */
+ if (unreachable)
+ vma->detached = true;
+ else
+ VM_BUG_ON_VMA(!is_vma_detached(vma), vma);
+ vma->vm_lock_seq = UINT_MAX;
+#endif
+ VM_BUG_ON_VMA(!list_empty(&vma->anon_vma_chain), vma);
vma_numab_state_free(vma);
free_anon_vma_name(vma);
kmem_cache_free(vm_area_cachep, vma);
}
-#ifdef CONFIG_PER_VMA_LOCK
-static void vm_area_free_rcu_cb(struct rcu_head *head)
+void vm_area_free(struct vm_area_struct *vma)
{
- struct vm_area_struct *vma = container_of(head, struct vm_area_struct,
- vm_rcu);
-
- /* The vma should not be locked while being destroyed. */
- VM_BUG_ON_VMA(rwsem_is_locked(&vma->vm_lock.lock), vma);
- __vm_area_free(vma);
+ __vm_area_free(vma, false);
}
-#endif
-void vm_area_free(struct vm_area_struct *vma)
+void vm_area_free_unreachable(struct vm_area_struct *vma)
{
-#ifdef CONFIG_PER_VMA_LOCK
- call_rcu(&vma->vm_rcu, vm_area_free_rcu_cb);
-#else
- __vm_area_free(vma);
-#endif
+ __vm_area_free(vma, true);
}
static void account_kernel_stack(struct task_struct *tsk, int account)
@@ -3141,6 +3230,12 @@ void __init mm_cache_init(void)
void __init proc_caches_init(void)
{
+ struct kmem_cache_args args = {
+ .use_freeptr_offset = true,
+ .freeptr_offset = offsetof(struct vm_area_struct, vm_freeptr),
+ .ctor = vm_area_ctor,
+ };
+
sighand_cachep = kmem_cache_create("sighand_cache",
sizeof(struct sighand_struct), 0,
SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_TYPESAFE_BY_RCU|
@@ -3157,9 +3252,11 @@ void __init proc_caches_init(void)
sizeof(struct fs_struct), 0,
SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT,
NULL);
- vm_area_cachep = KMEM_CACHE(vm_area_struct,
- SLAB_HWCACHE_ALIGN|SLAB_NO_MERGE|SLAB_PANIC|
+ vm_area_cachep = kmem_cache_create("vm_area_struct",
+ sizeof(struct vm_area_struct), &args,
+ SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_TYPESAFE_BY_RCU|
SLAB_ACCOUNT);
+
mmap_init();
nsproxy_cache_init();
}
diff --git a/mm/memory.c b/mm/memory.c
index b252f19b28c9..6f4d4d423835 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6368,10 +6368,16 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
if (!vma)
goto inval;
- if (!vma_start_read(vma))
+ if (!vma_start_read(mm, vma))
goto inval;
- /* Check if the VMA got isolated after we found it */
+ /*
+ * Check if the VMA got isolated after we found it.
+ * Note: vma we found could have been recycled and is being reattached.
+ * It's possible to attach a vma while it is read-locked, however a
+ * read-locked vma can't be detached (detaching requires write-locking).
+ * Therefore if this check passes, we have an attached and stable vma.
+ */
if (is_vma_detached(vma)) {
vma_end_read(vma);
count_vm_vma_lock_event(VMA_LOCK_MISS);
@@ -6385,8 +6391,9 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
* fields are accessible for RCU readers.
*/
- /* Check since vm_start/vm_end might change before we lock the VMA */
- if (unlikely(address < vma->vm_start || address >= vma->vm_end))
+ /* Check if the vma we locked is the right one. */
+ if (unlikely(vma->vm_mm != mm ||
+ address < vma->vm_start || address >= vma->vm_end))
goto inval_end_read;
rcu_read_unlock();
diff --git a/mm/vma.c b/mm/vma.c
index cdc63728f47f..648784416833 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -414,7 +414,7 @@ void remove_vma(struct vm_area_struct *vma, bool unreachable)
fput(vma->vm_file);
mpol_put(vma_policy(vma));
if (unreachable)
- __vm_area_free(vma);
+ vm_area_free_unreachable(vma);
else
vm_area_free(vma);
}
diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
index 0cdc5f8c3d60..3eeb1317cc69 100644
--- a/tools/testing/vma/vma_internal.h
+++ b/tools/testing/vma/vma_internal.h
@@ -685,14 +685,15 @@ static inline void mpol_put(struct mempolicy *)
{
}
-static inline void __vm_area_free(struct vm_area_struct *vma)
+static inline void vm_area_free(struct vm_area_struct *vma)
{
free(vma);
}
-static inline void vm_area_free(struct vm_area_struct *vma)
+static inline void vm_area_free_unreachable(struct vm_area_struct *vma)
{
- __vm_area_free(vma);
+ vma->detached = true;
+ vm_area_free(vma);
}
static inline void lru_add_drain(void)
--
2.47.0.338.g60cca15819-goog
^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH v5 5/6] mm/slab: allow freeptr_offset to be used with ctor
2024-12-06 22:51 [PATCH v5 0/6] move per-vma lock into vm_area_struct Suren Baghdasaryan
` (3 preceding siblings ...)
2024-12-06 22:52 ` [PATCH v5 4/6] mm: make vma cache SLAB_TYPESAFE_BY_RCU Suren Baghdasaryan
@ 2024-12-06 22:52 ` Suren Baghdasaryan
2024-12-10 11:01 ` Vlastimil Babka
2024-12-06 22:52 ` [PATCH v5 6/6] docs/mm: document latest changes to vm_lock Suren Baghdasaryan
2024-12-07 4:29 ` [PATCH v5 0/6] move per-vma lock into vm_area_struct Andrew Morton
6 siblings, 1 reply; 32+ messages in thread
From: Suren Baghdasaryan @ 2024-12-06 22:52 UTC (permalink / raw)
To: akpm
Cc: willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka, hannes,
mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, minchan, jannh,
shakeel.butt, souravpanda, pasha.tatashin, corbet, linux-doc,
linux-mm, linux-kernel, kernel-team, surenb
There is no real reason to prevent freeptr_offset usage when a slab
cache has a ctor. The only real limitation is that any field unioned
with the free pointer and initialized by the ctor will be overwritten,
since the free pointer is set after the ctor is invoked. Document this
limitation and enable usage of freeptr_offset with a ctor.
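For example, this allows a cache to be set up the way patch 4 of this
series does for vm_area_struct (a sketch; the ctor must not initialize
the field unioned with the free pointer):

	static void vm_area_ctor(void *data)
	{
		struct vm_area_struct *vma = data;

		/* one-time init; must not touch the field unioned with vm_freeptr */
		INIT_LIST_HEAD(&vma->anon_vma_chain);
		vma_lock_init(vma);
	}

	struct kmem_cache_args args = {
		.use_freeptr_offset = true,
		.freeptr_offset = offsetof(struct vm_area_struct, vm_freeptr),
		.ctor = vm_area_ctor,
	};

	vm_area_cachep = kmem_cache_create("vm_area_struct",
			sizeof(struct vm_area_struct), &args,
			SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_TYPESAFE_BY_RCU|
			SLAB_ACCOUNT);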
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
include/linux/slab.h | 5 +++--
mm/slub.c | 2 +-
2 files changed, 4 insertions(+), 3 deletions(-)
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 681b685b6c4e..6bad744bef5e 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -305,8 +305,9 @@ struct kmem_cache_args {
* Using %0 as a value for @freeptr_offset is valid. If @freeptr_offset
* is specified, %use_freeptr_offset must be set %true.
*
- * Note that @ctor currently isn't supported with custom free pointers
- * as a @ctor requires an external free pointer.
+ * Note that fields unioned with free pointer cannot be initialized by
+ * @ctor since free pointer is set after @ctor invocation, so those
+ * values will be overwritten.
*/
unsigned int freeptr_offset;
/**
diff --git a/mm/slub.c b/mm/slub.c
index 870a1d95521d..f62c829b7b6b 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -5462,7 +5462,7 @@ static int calculate_sizes(struct kmem_cache_args *args, struct kmem_cache *s)
s->inuse = size;
if (((flags & SLAB_TYPESAFE_BY_RCU) && !args->use_freeptr_offset) ||
- (flags & SLAB_POISON) || s->ctor ||
+ (flags & SLAB_POISON) || (s->ctor && !args->use_freeptr_offset) ||
((flags & SLAB_RED_ZONE) &&
(s->object_size < sizeof(void *) || slub_debug_orig_size(s)))) {
/*
--
2.47.0.338.g60cca15819-goog
^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH v5 6/6] docs/mm: document latest changes to vm_lock
2024-12-06 22:51 [PATCH v5 0/6] move per-vma lock into vm_area_struct Suren Baghdasaryan
` (4 preceding siblings ...)
2024-12-06 22:52 ` [PATCH v5 5/6] mm/slab: allow freeptr_offset to be used with ctor Suren Baghdasaryan
@ 2024-12-06 22:52 ` Suren Baghdasaryan
2024-12-07 3:23 ` Randy Dunlap
2024-12-07 4:29 ` [PATCH v5 0/6] move per-vma lock into vm_area_struct Andrew Morton
6 siblings, 1 reply; 32+ messages in thread
From: Suren Baghdasaryan @ 2024-12-06 22:52 UTC (permalink / raw)
To: akpm
Cc: willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka, hannes,
mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, minchan, jannh,
shakeel.butt, souravpanda, pasha.tatashin, corbet, linux-doc,
linux-mm, linux-kernel, kernel-team, surenb
Change the documentation to reflect that vm_lock is now integrated into
the vma. Document the newly introduced vma_start_read_locked{_nested}
functions.
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
Documentation/mm/process_addrs.rst | 10 +++++++---
1 file changed, 7 insertions(+), 3 deletions(-)
diff --git a/Documentation/mm/process_addrs.rst b/Documentation/mm/process_addrs.rst
index 81417fa2ed20..92cf497a9e3c 100644
--- a/Documentation/mm/process_addrs.rst
+++ b/Documentation/mm/process_addrs.rst
@@ -716,7 +716,11 @@ calls :c:func:`!rcu_read_lock` to ensure that the VMA is looked up in an RCU
critical section, then attempts to VMA lock it via :c:func:`!vma_start_read`,
before releasing the RCU lock via :c:func:`!rcu_read_unlock`.
-VMA read locks hold the read lock on the :c:member:`!vma->vm_lock` semaphore for
+In cases when the user already holds mmap read lock, :c:func:`!vma_start_read_locked`
+and :c:func:`!vma_start_read_locked_nested` can be used. These functions always
+succeed in acquiring VMA read lock.
+
+VMA read locks hold the read lock on the :c:member:`!vma.vm_lock` semaphore for
their duration and the caller of :c:func:`!lock_vma_under_rcu` must release it
via :c:func:`!vma_end_read`.
@@ -780,7 +784,7 @@ keep VMAs locked across entirely separate write operations. It also maintains
correct lock ordering.
Each time a VMA read lock is acquired, we acquire a read lock on the
-:c:member:`!vma->vm_lock` read/write semaphore and hold it, while checking that
+:c:member:`!vma.vm_lock` read/write semaphore and hold it, while checking that
the sequence count of the VMA does not match that of the mm.
If it does, the read lock fails. If it does not, we hold the lock, excluding
@@ -790,7 +794,7 @@ Importantly, maple tree operations performed in :c:func:`!lock_vma_under_rcu`
are also RCU safe, so the whole read lock operation is guaranteed to function
correctly.
-On the write side, we acquire a write lock on the :c:member:`!vma->vm_lock`
+On the write side, we acquire a write lock on the :c:member:`!vma.vm_lock`
read/write semaphore, before setting the VMA's sequence number under this lock,
also simultaneously holding the mmap write lock.
--
2.47.0.338.g60cca15819-goog
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v5 6/6] docs/mm: document latest changes to vm_lock
2024-12-06 22:52 ` [PATCH v5 6/6] docs/mm: document latest changes to vm_lock Suren Baghdasaryan
@ 2024-12-07 3:23 ` Randy Dunlap
2024-12-07 4:24 ` Akira Yokosawa
0 siblings, 1 reply; 32+ messages in thread
From: Randy Dunlap @ 2024-12-07 3:23 UTC (permalink / raw)
To: Suren Baghdasaryan, akpm
Cc: willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka, hannes,
mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, minchan, jannh,
shakeel.butt, souravpanda, pasha.tatashin, corbet, linux-doc,
linux-mm, linux-kernel, kernel-team
Hi,
Can someone explain what the (consistent) usage of '!' does in this file?
This is the only file in Documentation/ that uses this syntax.
E.g.:
> diff --git a/Documentation/mm/process_addrs.rst b/Documentation/mm/process_addrs.rst
> index 81417fa2ed20..92cf497a9e3c 100644
> --- a/Documentation/mm/process_addrs.rst
> +++ b/Documentation/mm/process_addrs.rst
> @@ -716,7 +716,11 @@ calls :c:func:`!rcu_read_lock` to ensure that the VMA is looked up in an RCU
> critical section, then attempts to VMA lock it via :c:func:`!vma_start_read`,
> before releasing the RCU lock via :c:func:`!rcu_read_unlock`.
>
> -VMA read locks hold the read lock on the :c:member:`!vma->vm_lock` semaphore for
> +In cases when the user already holds mmap read lock, :c:func:`!vma_start_read_locked`
> +and :c:func:`!vma_start_read_locked_nested` can be used. These functions always
> +succeed in acquiring VMA read lock.
thanks.
--
~Randy
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v5 6/6] docs/mm: document latest changes to vm_lock
2024-12-07 3:23 ` Randy Dunlap
@ 2024-12-07 4:24 ` Akira Yokosawa
2024-12-07 17:33 ` Suren Baghdasaryan
0 siblings, 1 reply; 32+ messages in thread
From: Akira Yokosawa @ 2024-12-07 4:24 UTC (permalink / raw)
To: rdunlap, surenb
Cc: akpm, brauner, corbet, dave, david, dhowells, hannes, hdanton,
hughd, jannh, kernel-team, liam.howlett, linux-doc, linux-kernel,
linux-mm, lorenzo.stoakes, mgorman, mhocko, minchan, mjguzik,
oleg, oliver.sang, pasha.tatashin, paulmck, peterx, shakeel.butt,
souravpanda, vbabka, willy
On Fri, 6 Dec 2024 19:23:59 -0800, Randy Dunlap wrote:
> Hi,
>
> Can someone explain what the (consistent) usage of '!' does in this file?
> This is the only file in Documentation/ that uses this syntax.
>
>
> E.g.:
>
>> diff --git a/Documentation/mm/process_addrs.rst b/Documentation/mm/process_addrs.rst
>> index 81417fa2ed20..92cf497a9e3c 100644
>> --- a/Documentation/mm/process_addrs.rst
>> +++ b/Documentation/mm/process_addrs.rst
>> @@ -716,7 +716,11 @@ calls :c:func:`!rcu_read_lock` to ensure that the VMA is looked up in an RCU
>> critical section, then attempts to VMA lock it via :c:func:`!vma_start_read`,
>> before releasing the RCU lock via :c:func:`!rcu_read_unlock`.
>>
>> -VMA read locks hold the read lock on the :c:member:`!vma->vm_lock` semaphore for
>> +In cases when the user already holds mmap read lock, :c:func:`!vma_start_read_locked`
>> +and :c:func:`!vma_start_read_locked_nested` can be used. These functions always
>> +succeed in acquiring VMA read lock.
>
Quoting from https://www.sphinx-doc.org/en/master/usage/referencing.html#syntax
* Suppressed link: Prefixing with an exclamation mark (!) prevents the
creation of a link but otherwise keeps the visual output of the role.
For example, writing :py:func:`!target` displays target(), with no link
generated.
This is helpful for cases in which the link target does not exist; e.g.
changelog entries that describe removed functionality, or third-party
libraries that don’t support intersphinx. Suppressing the link prevents
warnings in nitpicky mode.
But in kernel documentation, there is a preferred alternative.
Referencing by function names is the way to go. For example:
calls rcu_read_lock() to ensure that the VMA is looked up in an RCU
critical section, then attempts to VMA lock it via vma_start_read(),
before releasing the RCU lock via rcu_read_unlock().
In cases when the user already holds mmap read lock, vma_start_read_locked()
and vma_start_read_locked_nested() can be used. These functions always
succeed in acquiring VMA read lock.
They work regardless of the link target's existence.
The kernel-specific Sphinx extension named "automarkup" does the
conversion for you.
HTH, Akira
> thanks.
> --
> ~Randy
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v5 0/6] move per-vma lock into vm_area_struct
2024-12-06 22:51 [PATCH v5 0/6] move per-vma lock into vm_area_struct Suren Baghdasaryan
` (5 preceding siblings ...)
2024-12-06 22:52 ` [PATCH v5 6/6] docs/mm: document latest changes to vm_lock Suren Baghdasaryan
@ 2024-12-07 4:29 ` Andrew Morton
6 siblings, 0 replies; 32+ messages in thread
From: Andrew Morton @ 2024-12-07 4:29 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka, hannes,
mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, minchan, jannh,
shakeel.butt, souravpanda, pasha.tatashin, corbet, linux-doc,
linux-mm, linux-kernel, kernel-team, Peter Zijlstra
On Fri, 6 Dec 2024 14:51:57 -0800 Suren Baghdasaryan <surenb@google.com> wrote:
> Back when per-vma locks were introduces, vm_lock was moved out of
> vm_area_struct in [1] because of the performance regression caused by
> false cacheline sharing. Recent investigation [2] revealed that the
> regressions is limited to a rather old Broadwell microarchitecture and
> even there it can be mitigated by disabling adjacent cacheline
> prefetching, see [3].
>
> ...
>
> Patch applies over linux-next (due to vm_lock change [5] not in mm tree).
>
> ...
>
> [5] https://lore.kernel.org/all/20241122174416.1367052-2-surenb@google.com/
Well that's awkward. I added the "seqlock: add raw_seqcount_try_begin"
series to mm.git. Peter, please drop your copy from linux-next?
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v5 6/6] docs/mm: document latest changes to vm_lock
2024-12-07 4:24 ` Akira Yokosawa
@ 2024-12-07 17:33 ` Suren Baghdasaryan
0 siblings, 0 replies; 32+ messages in thread
From: Suren Baghdasaryan @ 2024-12-07 17:33 UTC (permalink / raw)
To: Akira Yokosawa, lorenzo.stoakes
Cc: rdunlap, akpm, brauner, corbet, dave, david, dhowells, hannes,
hdanton, hughd, jannh, kernel-team, liam.howlett, linux-doc,
linux-kernel, linux-mm, mgorman, mhocko, minchan, mjguzik, oleg,
oliver.sang, pasha.tatashin, paulmck, peterx, shakeel.butt,
souravpanda, vbabka, willy
On Fri, Dec 6, 2024 at 8:24 PM Akira Yokosawa <akiyks@gmail.com> wrote:
>
> On Fri, 6 Dec 2024 19:23:59 -0800, Randy Dunlap wrote:
> > Hi,
> >
> > Can someone explain what the (consistent) usage of '!' does in this file?
> > This is the only file in Documentation/ that uses this syntax.
> >
> >
> > E.g.:
> >
> >> diff --git a/Documentation/mm/process_addrs.rst b/Documentation/mm/process_addrs.rst
> >> index 81417fa2ed20..92cf497a9e3c 100644
> >> --- a/Documentation/mm/process_addrs.rst
> >> +++ b/Documentation/mm/process_addrs.rst
> >> @@ -716,7 +716,11 @@ calls :c:func:`!rcu_read_lock` to ensure that the VMA is looked up in an RCU
> >> critical section, then attempts to VMA lock it via :c:func:`!vma_start_read`,
> >> before releasing the RCU lock via :c:func:`!rcu_read_unlock`.
> >>
> >> -VMA read locks hold the read lock on the :c:member:`!vma->vm_lock` semaphore for
> >> +In cases when the user already holds mmap read lock, :c:func:`!vma_start_read_locked`
> >> +and :c:func:`!vma_start_read_locked_nested` can be used. These functions always
> >> +succeed in acquiring VMA read lock.
> >
>
> Quoting from https://www.sphinx-doc.org/en/master/usage/referencing.html#syntax
>
> * Suppressed link: Prefixing with an exclamation mark (!) prevents the
> creation of a link but otherwise keeps the visual output of the role.
>
> For example, writing :py:func:`!target` displays target(), with no link
> generated.
>
> This is helpful for cases in which the link target does not exist; e.g.
> changelog entries that describe removed functionality, or third-party
> libraries that don’t support intersphinx. Suppressing the link prevents
> warnings in nitpicky mode.
>
> But in kernel documentation, there is a preferred alternative.
> Referencing by function names is the way to go. For example:
>
> calls rcu_read_lock() to ensure that the VMA is looked up in an RCU
> critical section, then attempts to VMA lock it via vma_start_read(),
> before releasing the RCU lock via rcu_read_unlock().
>
> In cases when the user already holds mmap read lock, vma_start_read_locked()
> and vma_start_read_locked_nested() can be used. These functions always
> succeed in acquiring VMA read lock.
>
> They work regardless of whether the link target exists. The
> kernel-specific Sphinx extension named "automarkup" does the conversion
> for you.
Thanks for the information. I was simply following the same style the
document was written in. Sounds like converting it to the preferred
alternative in a separate patch would be best. Lorenzo, WDYT?
>
> HTH, Akira
>
> > thanks.
> > --
> > ~Randy
>
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v5 4/6] mm: make vma cache SLAB_TYPESAFE_BY_RCU
2024-12-06 22:52 ` [PATCH v5 4/6] mm: make vma cache SLAB_TYPESAFE_BY_RCU Suren Baghdasaryan
@ 2024-12-09 17:35 ` Klara Modin
2024-12-09 20:28 ` Suren Baghdasaryan
2024-12-10 12:06 ` Vlastimil Babka
2024-12-10 14:21 ` Vlastimil Babka
2 siblings, 1 reply; 32+ messages in thread
From: Klara Modin @ 2024-12-09 17:35 UTC (permalink / raw)
To: Suren Baghdasaryan, akpm
Cc: willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka, hannes,
mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, minchan, jannh,
shakeel.butt, souravpanda, pasha.tatashin, corbet, linux-doc,
linux-mm, linux-kernel, kernel-team
[-- Attachment #1: Type: text/plain, Size: 18644 bytes --]
Hi,
On 2024-12-06 23:52, Suren Baghdasaryan wrote:
> To enable SLAB_TYPESAFE_BY_RCU for vma cache we need to ensure that
> object reuse before RCU grace period is over will be detected inside
> lock_vma_under_rcu().
> lock_vma_under_rcu() enters RCU read section, finds the vma at the
> given address, locks the vma and checks if it got detached or remapped
> to cover a different address range. These last checks are there
> to ensure that the vma was not modified after we found it but before
> locking it.
> vma reuse introduces several new possibilities:
> 1. vma can be reused after it was found but before it is locked;
> 2. vma can be reused and reinitialized (including changing its vm_mm)
> while being locked in vma_start_read();
> 3. vma can be reused and reinitialized after it was found but before
> it is locked, then attached at a new address or to a new mm while
> read-locked;
> For case #1, current checks will help detect cases when:
> - vma was reused but not yet added into the tree (detached check)
> - vma was reused at a different address range (address check);
> We are missing the check for vm_mm to ensure the reused vma was not
> attached to a different mm. This patch adds the missing check.
> For case #2, we pass mm to vma_start_read() to prevent access to
> unstable vma->vm_mm. This might lead to vma_start_read() returning
> a false locked result but that's not critical if it's rare because
> it will only lead to a retry under mmap_lock.
> For case #3, we ensure the order in which vma->detached flag and
> vm_start/vm_end/vm_mm are set and checked. vma gets attached after
> vm_start/vm_end/vm_mm were set and lock_vma_under_rcu() should check
> vma->detached before checking vm_start/vm_end/vm_mm. This is required
> because attaching vma happens without vma write-lock, as opposed to
> vma detaching, which requires vma write-lock. This patch adds memory
> barriers inside is_vma_detached() and vma_mark_attached() needed to
> order reads and writes to vma->detached vs vm_start/vm_end/vm_mm.
> After these provisions, SLAB_TYPESAFE_BY_RCU is added to vm_area_cachep.
> This will facilitate vm_area_struct reuse and will minimize the number
> of call_rcu() calls.
This patch (85ad413389aec04cfaaba043caa8128b76c6e491 in next-20241209)
seems to cause an oops on a MIPS board of mine (Cavium Octeon III)
(abbreviated, full attached):
CPU 2 Unable to handle kernel paging request at virtual address
0000000000000000, epc == ffffffff813a85a0, ra == ffffffff81390438
Oops[#1]:
CPU: 2 UID: 0 PID: 1 Comm: init Not tainted
6.13.0-rc1-00162-g85ad413389ae #156
Call Trace:
unlink_anon_vmas (mm/rmap.c:408)
free_pgtables (mm/memory.c:393)
vms_clear_ptes (mm/vma.c:1143)
vms_complete_munmap_vmas (include/linux/mm.h:2737 mm/vma.c:1187)
do_vmi_align_munmap (mm/vma.c:1452)
__vm_munmap (mm/vma.c:2892)
sys_munmap (mm/mmap.c:1053)
syscall_common (arch/mips/kernel/scall64-n64.S:62)
I saw that there's already a report, but maybe another arch can be
useful for tracking this down.
Please let me know if there's anything else you need.
Regards,
Klara Modin
Link: https://lore.kernel.org/all/202412082208.db1fb2c9-lkp@intel.com
>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
> include/linux/mm.h | 36 +++++--
> include/linux/mm_types.h | 10 +-
> include/linux/slab.h | 6 --
> kernel/fork.c | 157 +++++++++++++++++++++++++------
> mm/memory.c | 15 ++-
> mm/vma.c | 2 +-
> tools/testing/vma/vma_internal.h | 7 +-
> 7 files changed, 179 insertions(+), 54 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 2bf38c1e9cca..3568bcbc7c81 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -257,7 +257,7 @@ struct vm_area_struct *vm_area_alloc(struct mm_struct *);
> struct vm_area_struct *vm_area_dup(struct vm_area_struct *);
> void vm_area_free(struct vm_area_struct *);
> /* Use only if VMA has no other users */
> -void __vm_area_free(struct vm_area_struct *vma);
> +void vm_area_free_unreachable(struct vm_area_struct *vma);
>
> #ifndef CONFIG_MMU
> extern struct rb_root nommu_region_tree;
> @@ -706,8 +706,10 @@ static inline void vma_lock_init(struct vm_area_struct *vma)
> * Try to read-lock a vma. The function is allowed to occasionally yield false
> * locked result to avoid performance overhead, in which case we fall back to
> * using mmap_lock. The function should never yield false unlocked result.
> + * False locked result is possible if mm_lock_seq overflows or if vma gets
> + * reused and attached to a different mm before we lock it.
> */
> -static inline bool vma_start_read(struct vm_area_struct *vma)
> +static inline bool vma_start_read(struct mm_struct *mm, struct vm_area_struct *vma)
> {
> /*
> * Check before locking. A race might cause false locked result.
> @@ -716,7 +718,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> * we don't rely on for anything - the mm_lock_seq read against which we
> * need ordering is below.
> */
> - if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq.sequence))
> + if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(mm->mm_lock_seq.sequence))
> return false;
>
> if (unlikely(down_read_trylock(&vma->vm_lock.lock) == 0))
> @@ -733,7 +735,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> * after it has been unlocked.
> * This pairs with RELEASE semantics in vma_end_write_all().
> */
> - if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&vma->vm_mm->mm_lock_seq))) {
> + if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&mm->mm_lock_seq))) {
> up_read(&vma->vm_lock.lock);
> return false;
> }
> @@ -822,7 +824,15 @@ static inline void vma_assert_locked(struct vm_area_struct *vma)
>
> static inline void vma_mark_attached(struct vm_area_struct *vma)
> {
> - vma->detached = false;
> + /*
> + * This pairs with smp_rmb() inside is_vma_detached().
> + * vma is marked attached after all vma modifications are done and it
> + * got added into the vma tree. All prior vma modifications should be
> + * made visible before marking the vma attached.
> + */
> + smp_wmb();
> + /* This pairs with READ_ONCE() in is_vma_detached(). */
> + WRITE_ONCE(vma->detached, false);
> }
>
> static inline void vma_mark_detached(struct vm_area_struct *vma)
> @@ -834,7 +844,18 @@ static inline void vma_mark_detached(struct vm_area_struct *vma)
>
> static inline bool is_vma_detached(struct vm_area_struct *vma)
> {
> - return vma->detached;
> + bool detached;
> +
> + /* This pairs with WRITE_ONCE() in vma_mark_attached(). */
> + detached = READ_ONCE(vma->detached);
> + /*
> + * This pairs with smp_wmb() inside vma_mark_attached() to ensure
> + * vma->detached is read before vma attributes read later inside
> + * lock_vma_under_rcu().
> + */
> + smp_rmb();
> +
> + return detached;
> }
>
> static inline void release_fault_lock(struct vm_fault *vmf)
> @@ -859,7 +880,7 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> #else /* CONFIG_PER_VMA_LOCK */
>
> static inline void vma_lock_init(struct vm_area_struct *vma) {}
> -static inline bool vma_start_read(struct vm_area_struct *vma)
> +static inline bool vma_start_read(struct mm_struct *mm, struct vm_area_struct *vma)
> { return false; }
> static inline void vma_end_read(struct vm_area_struct *vma) {}
> static inline void vma_start_write(struct vm_area_struct *vma) {}
> @@ -893,6 +914,7 @@ static inline void assert_fault_locked(struct vm_fault *vmf)
>
> extern const struct vm_operations_struct vma_dummy_vm_ops;
>
> +/* Use on VMAs not created using vm_area_alloc() */
> static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
> {
> memset(vma, 0, sizeof(*vma));
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index be3551654325..5d8779997266 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -543,6 +543,12 @@ static inline void *folio_get_private(struct folio *folio)
>
> typedef unsigned long vm_flags_t;
>
> +/*
> + * freeptr_t represents a SLUB freelist pointer, which might be encoded
> + * and not dereferenceable if CONFIG_SLAB_FREELIST_HARDENED is enabled.
> + */
> +typedef struct { unsigned long v; } freeptr_t;
> +
> /*
> * A region containing a mapping of a non-memory backed file under NOMMU
> * conditions. These are held in a global tree and are pinned by the VMAs that
> @@ -657,9 +663,7 @@ struct vm_area_struct {
> unsigned long vm_start;
> unsigned long vm_end;
> };
> -#ifdef CONFIG_PER_VMA_LOCK
> - struct rcu_head vm_rcu; /* Used for deferred freeing. */
> -#endif
> + freeptr_t vm_freeptr; /* Pointer used by SLAB_TYPESAFE_BY_RCU */
> };
>
> /*
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index 10a971c2bde3..681b685b6c4e 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -234,12 +234,6 @@ enum _slab_flag_bits {
> #define SLAB_NO_OBJ_EXT __SLAB_FLAG_UNUSED
> #endif
>
> -/*
> - * freeptr_t represents a SLUB freelist pointer, which might be encoded
> - * and not dereferenceable if CONFIG_SLAB_FREELIST_HARDENED is enabled.
> - */
> -typedef struct { unsigned long v; } freeptr_t;
> -
> /*
> * ZERO_SIZE_PTR will be returned for zero sized kmalloc requests.
> *
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 71990f46aa4e..e7e76a660e4c 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -436,6 +436,98 @@ static struct kmem_cache *vm_area_cachep;
> /* SLAB cache for mm_struct structures (tsk->mm) */
> static struct kmem_cache *mm_cachep;
>
> +static void vm_area_ctor(void *data)
> +{
> + struct vm_area_struct *vma = (struct vm_area_struct *)data;
> +
> +#ifdef CONFIG_PER_VMA_LOCK
> + /* vma is not locked, can't use vma_mark_detached() */
> + vma->detached = true;
> +#endif
> + INIT_LIST_HEAD(&vma->anon_vma_chain);
> + vma_lock_init(vma);
> +}
> +
> +#ifdef CONFIG_PER_VMA_LOCK
> +
> +static void vma_clear(struct vm_area_struct *vma, struct mm_struct *mm)
> +{
> + vma->vm_mm = mm;
> + vma->vm_ops = &vma_dummy_vm_ops;
> + vma->vm_start = 0;
> + vma->vm_end = 0;
> + vma->anon_vma = NULL;
> + vma->vm_pgoff = 0;
> + vma->vm_file = NULL;
> + vma->vm_private_data = NULL;
> + vm_flags_init(vma, 0);
> + memset(&vma->vm_page_prot, 0, sizeof(vma->vm_page_prot));
> + memset(&vma->shared, 0, sizeof(vma->shared));
> + memset(&vma->vm_userfaultfd_ctx, 0, sizeof(vma->vm_userfaultfd_ctx));
> + vma_numab_state_init(vma);
> +#ifdef CONFIG_ANON_VMA_NAME
> + vma->anon_name = NULL;
> +#endif
> +#ifdef CONFIG_SWAP
> + memset(&vma->swap_readahead_info, 0, sizeof(vma->swap_readahead_info));
> +#endif
> +#ifndef CONFIG_MMU
> + vma->vm_region = NULL;
> +#endif
> +#ifdef CONFIG_NUMA
> + vma->vm_policy = NULL;
> +#endif
> +}
> +
> +static void vma_copy(const struct vm_area_struct *src, struct vm_area_struct *dest)
> +{
> + dest->vm_mm = src->vm_mm;
> + dest->vm_ops = src->vm_ops;
> + dest->vm_start = src->vm_start;
> + dest->vm_end = src->vm_end;
> + dest->anon_vma = src->anon_vma;
> + dest->vm_pgoff = src->vm_pgoff;
> + dest->vm_file = src->vm_file;
> + dest->vm_private_data = src->vm_private_data;
> + vm_flags_init(dest, src->vm_flags);
> + memcpy(&dest->vm_page_prot, &src->vm_page_prot,
> + sizeof(dest->vm_page_prot));
> + memcpy(&dest->shared, &src->shared, sizeof(dest->shared));
> + memcpy(&dest->vm_userfaultfd_ctx, &src->vm_userfaultfd_ctx,
> + sizeof(dest->vm_userfaultfd_ctx));
> +#ifdef CONFIG_ANON_VMA_NAME
> + dest->anon_name = src->anon_name;
> +#endif
> +#ifdef CONFIG_SWAP
> + memcpy(&dest->swap_readahead_info, &src->swap_readahead_info,
> + sizeof(dest->swap_readahead_info));
> +#endif
> +#ifndef CONFIG_MMU
> + dest->vm_region = src->vm_region;
> +#endif
> +#ifdef CONFIG_NUMA
> + dest->vm_policy = src->vm_policy;
> +#endif
> +}
> +
> +#else /* CONFIG_PER_VMA_LOCK */
> +
> +static void vma_clear(struct vm_area_struct *vma, struct mm_struct *mm)
> +{
> + vma_init(vma, mm);
> +}
> +
> +static void vma_copy(const struct vm_area_struct *src, struct vm_area_struct *dest)
> +{
> + /*
> + * orig->shared.rb may be modified concurrently, but the clone
> + * will be reinitialized.
> + */
> + data_race(memcpy(dest, src, sizeof(*dest)));
> +}
> +
> +#endif /* CONFIG_PER_VMA_LOCK */
> +
> struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
> {
> struct vm_area_struct *vma;
> @@ -444,7 +536,7 @@ struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
> if (!vma)
> return NULL;
>
> - vma_init(vma, mm);
> + vma_clear(vma, mm);
>
> return vma;
> }
> @@ -458,49 +550,46 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
>
> ASSERT_EXCLUSIVE_WRITER(orig->vm_flags);
> ASSERT_EXCLUSIVE_WRITER(orig->vm_file);
> - /*
> - * orig->shared.rb may be modified concurrently, but the clone
> - * will be reinitialized.
> - */
> - data_race(memcpy(new, orig, sizeof(*new)));
> - vma_lock_init(new);
> - INIT_LIST_HEAD(&new->anon_vma_chain);
> -#ifdef CONFIG_PER_VMA_LOCK
> - /* vma is not locked, can't use vma_mark_detached() */
> - new->detached = true;
> -#endif
> + vma_copy(orig, new);
> vma_numab_state_init(new);
> dup_anon_vma_name(orig, new);
>
> return new;
> }
>
> -void __vm_area_free(struct vm_area_struct *vma)
> +static void __vm_area_free(struct vm_area_struct *vma, bool unreachable)
> {
> +#ifdef CONFIG_PER_VMA_LOCK
> + /*
> + * With SLAB_TYPESAFE_BY_RCU, vma can be reused and we need
> + * vma->detached to be set before vma is returned into the cache.
> + * This way reused object won't be used by readers until it's
> + * initialized and reattached.
> + * If vma is unreachable, there can be no other users and we
> + * can set vma->detached directly with no risk of a race.
> + * If vma is reachable, then it should have been already detached
> + * under vma write-lock or it was never attached.
> + */
> + if (unreachable)
> + vma->detached = true;
> + else
> + VM_BUG_ON_VMA(!is_vma_detached(vma), vma);
> + vma->vm_lock_seq = UINT_MAX;
> +#endif
> + VM_BUG_ON_VMA(!list_empty(&vma->anon_vma_chain), vma);
> vma_numab_state_free(vma);
> free_anon_vma_name(vma);
> kmem_cache_free(vm_area_cachep, vma);
> }
>
> -#ifdef CONFIG_PER_VMA_LOCK
> -static void vm_area_free_rcu_cb(struct rcu_head *head)
> +void vm_area_free(struct vm_area_struct *vma)
> {
> - struct vm_area_struct *vma = container_of(head, struct vm_area_struct,
> - vm_rcu);
> -
> - /* The vma should not be locked while being destroyed. */
> - VM_BUG_ON_VMA(rwsem_is_locked(&vma->vm_lock.lock), vma);
> - __vm_area_free(vma);
> + __vm_area_free(vma, false);
> }
> -#endif
>
> -void vm_area_free(struct vm_area_struct *vma)
> +void vm_area_free_unreachable(struct vm_area_struct *vma)
> {
> -#ifdef CONFIG_PER_VMA_LOCK
> - call_rcu(&vma->vm_rcu, vm_area_free_rcu_cb);
> -#else
> - __vm_area_free(vma);
> -#endif
> + __vm_area_free(vma, true);
> }
>
> static void account_kernel_stack(struct task_struct *tsk, int account)
> @@ -3141,6 +3230,12 @@ void __init mm_cache_init(void)
>
> void __init proc_caches_init(void)
> {
> + struct kmem_cache_args args = {
> + .use_freeptr_offset = true,
> + .freeptr_offset = offsetof(struct vm_area_struct, vm_freeptr),
> + .ctor = vm_area_ctor,
> + };
> +
> sighand_cachep = kmem_cache_create("sighand_cache",
> sizeof(struct sighand_struct), 0,
> SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_TYPESAFE_BY_RCU|
> @@ -3157,9 +3252,11 @@ void __init proc_caches_init(void)
> sizeof(struct fs_struct), 0,
> SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT,
> NULL);
> - vm_area_cachep = KMEM_CACHE(vm_area_struct,
> - SLAB_HWCACHE_ALIGN|SLAB_NO_MERGE|SLAB_PANIC|
> + vm_area_cachep = kmem_cache_create("vm_area_struct",
> + sizeof(struct vm_area_struct), &args,
> + SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_TYPESAFE_BY_RCU|
> SLAB_ACCOUNT);
> +
> mmap_init();
> nsproxy_cache_init();
> }
> diff --git a/mm/memory.c b/mm/memory.c
> index b252f19b28c9..6f4d4d423835 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -6368,10 +6368,16 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> if (!vma)
> goto inval;
>
> - if (!vma_start_read(vma))
> + if (!vma_start_read(mm, vma))
> goto inval;
>
> - /* Check if the VMA got isolated after we found it */
> + /*
> + * Check if the VMA got isolated after we found it.
> + * Note: vma we found could have been recycled and is being reattached.
> + * It's possible to attach a vma while it is read-locked, however a
> + * read-locked vma can't be detached (detaching requires write-locking).
> + * Therefore if this check passes, we have an attached and stable vma.
> + */
> if (is_vma_detached(vma)) {
> vma_end_read(vma);
> count_vm_vma_lock_event(VMA_LOCK_MISS);
> @@ -6385,8 +6391,9 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> * fields are accessible for RCU readers.
> */
>
> - /* Check since vm_start/vm_end might change before we lock the VMA */
> - if (unlikely(address < vma->vm_start || address >= vma->vm_end))
> + /* Check if the vma we locked is the right one. */
> + if (unlikely(vma->vm_mm != mm ||
> + address < vma->vm_start || address >= vma->vm_end))
> goto inval_end_read;
>
> rcu_read_unlock();
> diff --git a/mm/vma.c b/mm/vma.c
> index cdc63728f47f..648784416833 100644
> --- a/mm/vma.c
> +++ b/mm/vma.c
> @@ -414,7 +414,7 @@ void remove_vma(struct vm_area_struct *vma, bool unreachable)
> fput(vma->vm_file);
> mpol_put(vma_policy(vma));
> if (unreachable)
> - __vm_area_free(vma);
> + vm_area_free_unreachable(vma);
> else
> vm_area_free(vma);
> }
> diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
> index 0cdc5f8c3d60..3eeb1317cc69 100644
> --- a/tools/testing/vma/vma_internal.h
> +++ b/tools/testing/vma/vma_internal.h
> @@ -685,14 +685,15 @@ static inline void mpol_put(struct mempolicy *)
> {
> }
>
> -static inline void __vm_area_free(struct vm_area_struct *vma)
> +static inline void vm_area_free(struct vm_area_struct *vma)
> {
> free(vma);
> }
>
> -static inline void vm_area_free(struct vm_area_struct *vma)
> +static inline void vm_area_free_unreachable(struct vm_area_struct *vma)
> {
> - __vm_area_free(vma);
> + vma->detached = true;
> + vm_area_free(vma);
> }
>
> static inline void lru_add_drain(void)
[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 24042 bytes --]
[-- Attachment #3: oops-bisect --]
[-- Type: text/plain, Size: 3245 bytes --]
# bad: [7e0a79eb1674cbe8916bbc7d7f9f6bfb409dfb4c] commit 0790303ec869 leads to cpu stall without CONFIG_FANOTIFY_ACCESS_PERMISSIONS=y
git bisect start 'HEAD'
# status: waiting for good commit(s), bad commit known
# good: [fac04efc5c793dccbd07e2d59af9f90b7fc0dca4] Linux 6.13-rc2
git bisect good fac04efc5c793dccbd07e2d59af9f90b7fc0dca4
# bad: [06036a53ccaf22767bbd5caa491c52bb673604e6] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git
git bisect bad 06036a53ccaf22767bbd5caa491c52bb673604e6
# bad: [ddaabea9ec4322b7d5cdd2017c4cc59b28adb8fb] Merge branch 'linux-next' of git://github.com/c-sky/csky-linux.git
git bisect bad ddaabea9ec4322b7d5cdd2017c4cc59b28adb8fb
# good: [5de9573848aa512bb06e8842faf5da6926e693bd] Merge branch 'tip/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git
git bisect good 5de9573848aa512bb06e8842faf5da6926e693bd
# bad: [8bb435707c07ff7650fe8ef690700845c772d3cf] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/amlogic/linux.git
git bisect bad 8bb435707c07ff7650fe8ef690700845c772d3cf
# good: [7e274b34309bb7884cade55b7cb0b0b8722dcde2] mm: do_zap_pte_range: return any_skipped information to the caller
git bisect good 7e274b34309bb7884cade55b7cb0b0b8722dcde2
# skip: [6e165f54437931f329d09dca6c19d99af08a36e1] mm/page_isolation: fixup isolate_single_pageblock() comment regarding splitting free pages
git bisect skip 6e165f54437931f329d09dca6c19d99af08a36e1
# good: [c81cad99b0b13b0bb339a8e30aa13ebc1e5dbff3] minmax.h: update some comments
git bisect good c81cad99b0b13b0bb339a8e30aa13ebc1e5dbff3
# good: [191812969871e9ae447cf3e7426fb72b351f1336] scripts/spelling.txt: add more spellings to spelling.txt
git bisect good 191812969871e9ae447cf3e7426fb72b351f1336
# good: [f635cc53910d4ba7726f925e49ce57c5332cd6a1] ucounts: move kfree() out of critical zone protected by ucounts_lock
git bisect good f635cc53910d4ba7726f925e49ce57c5332cd6a1
# skip: [4f3ce240fd4bd7681335231b8cf21fdc37c4e848] mm/page_alloc: conditionally split > pageblock_order pages in free_one_page() and move_freepages_block_isolate()
git bisect skip 4f3ce240fd4bd7681335231b8cf21fdc37c4e848
# bad: [53b2a428e1036e481af0c7fc39cb92128344b689] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild.git
git bisect bad 53b2a428e1036e481af0c7fc39cb92128344b689
# good: [1062a24d0a17862738df0e6202d445ba0a260088] mm: introduce vma_start_read_locked{_nested} helpers
git bisect good 1062a24d0a17862738df0e6202d445ba0a260088
# bad: [c6fcd83658200c74c837c4a2e3f95f7eb487dd7d] mseal: remove can_do_mseal()
git bisect bad c6fcd83658200c74c837c4a2e3f95f7eb487dd7d
# bad: [2684a4244bfeac0c246a2610997dad78b2dc7877] mm/slab: allow freeptr_offset to be used with ctor
git bisect bad 2684a4244bfeac0c246a2610997dad78b2dc7877
# good: [98d5eefb975e46c182fc2de6a543e375481aedaf] mm: mark vma as detached until it's added into vma tree
git bisect good 98d5eefb975e46c182fc2de6a543e375481aedaf
# bad: [85ad413389aec04cfaaba043caa8128b76c6e491] mm: make vma cache SLAB_TYPESAFE_BY_RCU
git bisect bad 85ad413389aec04cfaaba043caa8128b76c6e491
# first bad commit: [85ad413389aec04cfaaba043caa8128b76c6e491] mm: make vma cache SLAB_TYPESAFE_BY_RCU
[-- Attachment #4: oops-decoded.gz --]
[-- Type: application/gzip, Size: 7128 bytes --]
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v5 4/6] mm: make vma cache SLAB_TYPESAFE_BY_RCU
2024-12-09 17:35 ` Klara Modin
@ 2024-12-09 20:28 ` Suren Baghdasaryan
2024-12-09 22:19 ` Suren Baghdasaryan
0 siblings, 1 reply; 32+ messages in thread
From: Suren Baghdasaryan @ 2024-12-09 20:28 UTC (permalink / raw)
To: Klara Modin
Cc: akpm, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, minchan, jannh,
shakeel.butt, souravpanda, pasha.tatashin, corbet, linux-doc,
linux-mm, linux-kernel, kernel-team
On Mon, Dec 9, 2024 at 9:35 AM Klara Modin <klarasmodin@gmail.com> wrote:
>
> Hi,
>
> On 2024-12-06 23:52, Suren Baghdasaryan wrote:
> > To enable SLAB_TYPESAFE_BY_RCU for vma cache we need to ensure that
> > object reuse before RCU grace period is over will be detected inside
> > lock_vma_under_rcu().
> > lock_vma_under_rcu() enters RCU read section, finds the vma at the
> > given address, locks the vma and checks if it got detached or remapped
> > to cover a different address range. These last checks are there
> > to ensure that the vma was not modified after we found it but before
> > locking it.
> > vma reuse introduces several new possibilities:
> > 1. vma can be reused after it was found but before it is locked;
> > 2. vma can be reused and reinitialized (including changing its vm_mm)
> > while being locked in vma_start_read();
> > 3. vma can be reused and reinitialized after it was found but before
> > it is locked, then attached at a new address or to a new mm while
> > read-locked;
> > For case #1, current checks will help detect cases when:
> > - vma was reused but not yet added into the tree (detached check)
> > - vma was reused at a different address range (address check);
> > We are missing the check for vm_mm to ensure the reused vma was not
> > attached to a different mm. This patch adds the missing check.
> > For case #2, we pass mm to vma_start_read() to prevent access to
> > unstable vma->vm_mm. This might lead to vma_start_read() returning
> > a false locked result but that's not critical if it's rare because
> > it will only lead to a retry under mmap_lock.
> > For case #3, we ensure the order in which vma->detached flag and
> > vm_start/vm_end/vm_mm are set and checked. vma gets attached after
> > vm_start/vm_end/vm_mm were set and lock_vma_under_rcu() should check
> > vma->detached before checking vm_start/vm_end/vm_mm. This is required
> > because attaching vma happens without vma write-lock, as opposed to
> > vma detaching, which requires vma write-lock. This patch adds memory
> > barriers inside is_vma_detached() and vma_mark_attached() needed to
> > order reads and writes to vma->detached vs vm_start/vm_end/vm_mm.
> > After these provisions, SLAB_TYPESAFE_BY_RCU is added to vm_area_cachep.
> > This will facilitate vm_area_struct reuse and will minimize the number
> > of call_rcu() calls.
>
> This patch (85ad413389aec04cfaaba043caa8128b76c6e491 in next-20241209)
> seems to cause an oops on a MIPS board of mine (Cavium Octeon III)
> (abbreviated, full attached):
Thanks! This is really helpful because both errors are reported on
architectures which don't set CONFIG_PER_VMA_LOCK. This is a great
clue.
I'm investigating it and will post a fix asap.
Thanks,
Suren.
>
> CPU 2 Unable to handle kernel paging request at virtual address
> 0000000000000000, epc == ffffffff813a85a0, ra == ffffffff81390438
> Oops[#1]:
> CPU: 2 UID: 0 PID: 1 Comm: init Not tainted
> 6.13.0-rc1-00162-g85ad413389ae #156
> Call Trace:
> unlink_anon_vmas (mm/rmap.c:408)
> free_pgtables (mm/memory.c:393)
> vms_clear_ptes (mm/vma.c:1143)
> vms_complete_munmap_vmas (include/linux/mm.h:2737 mm/vma.c:1187)
> do_vmi_align_munmap (mm/vma.c:1452)
> __vm_munmap (mm/vma.c:2892)
> sys_munmap (mm/mmap.c:1053)
> syscall_common (arch/mips/kernel/scall64-n64.S:62)
>
> I saw that there's already a report, but maybe another arch can be
> useful for tracking this down.
>
> Please let me know if there's anything else you need.
>
> Regards,
> Klara Modin
>
> Link: https://lore.kernel.org/all/202412082208.db1fb2c9-lkp@intel.com
>
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > ---
> > include/linux/mm.h | 36 +++++--
> > include/linux/mm_types.h | 10 +-
> > include/linux/slab.h | 6 --
> > kernel/fork.c | 157 +++++++++++++++++++++++++------
> > mm/memory.c | 15 ++-
> > mm/vma.c | 2 +-
> > tools/testing/vma/vma_internal.h | 7 +-
> > 7 files changed, 179 insertions(+), 54 deletions(-)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 2bf38c1e9cca..3568bcbc7c81 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -257,7 +257,7 @@ struct vm_area_struct *vm_area_alloc(struct mm_struct *);
> > struct vm_area_struct *vm_area_dup(struct vm_area_struct *);
> > void vm_area_free(struct vm_area_struct *);
> > /* Use only if VMA has no other users */
> > -void __vm_area_free(struct vm_area_struct *vma);
> > +void vm_area_free_unreachable(struct vm_area_struct *vma);
> >
> > #ifndef CONFIG_MMU
> > extern struct rb_root nommu_region_tree;
> > @@ -706,8 +706,10 @@ static inline void vma_lock_init(struct vm_area_struct *vma)
> > * Try to read-lock a vma. The function is allowed to occasionally yield false
> > * locked result to avoid performance overhead, in which case we fall back to
> > * using mmap_lock. The function should never yield false unlocked result.
> > + * False locked result is possible if mm_lock_seq overflows or if vma gets
> > + * reused and attached to a different mm before we lock it.
> > */
> > -static inline bool vma_start_read(struct vm_area_struct *vma)
> > +static inline bool vma_start_read(struct mm_struct *mm, struct vm_area_struct *vma)
> > {
> > /*
> > * Check before locking. A race might cause false locked result.
> > @@ -716,7 +718,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> > * we don't rely on for anything - the mm_lock_seq read against which we
> > * need ordering is below.
> > */
> > - if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq.sequence))
> > + if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(mm->mm_lock_seq.sequence))
> > return false;
> >
> > if (unlikely(down_read_trylock(&vma->vm_lock.lock) == 0))
> > @@ -733,7 +735,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> > * after it has been unlocked.
> > * This pairs with RELEASE semantics in vma_end_write_all().
> > */
> > - if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&vma->vm_mm->mm_lock_seq))) {
> > + if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&mm->mm_lock_seq))) {
> > up_read(&vma->vm_lock.lock);
> > return false;
> > }
> > @@ -822,7 +824,15 @@ static inline void vma_assert_locked(struct vm_area_struct *vma)
> >
> > static inline void vma_mark_attached(struct vm_area_struct *vma)
> > {
> > - vma->detached = false;
> > + /*
> > + * This pairs with smp_rmb() inside is_vma_detached().
> > + * vma is marked attached after all vma modifications are done and it
> > + * got added into the vma tree. All prior vma modifications should be
> > + * made visible before marking the vma attached.
> > + */
> > + smp_wmb();
> > + /* This pairs with READ_ONCE() in is_vma_detached(). */
> > + WRITE_ONCE(vma->detached, false);
> > }
> >
> > static inline void vma_mark_detached(struct vm_area_struct *vma)
> > @@ -834,7 +844,18 @@ static inline void vma_mark_detached(struct vm_area_struct *vma)
> >
> > static inline bool is_vma_detached(struct vm_area_struct *vma)
> > {
> > - return vma->detached;
> > + bool detached;
> > +
> > + /* This pairs with WRITE_ONCE() in vma_mark_attached(). */
> > + detached = READ_ONCE(vma->detached);
> > + /*
> > + * This pairs with smp_wmb() inside vma_mark_attached() to ensure
> > + * vma->detached is read before vma attributes read later inside
> > + * lock_vma_under_rcu().
> > + */
> > + smp_rmb();
> > +
> > + return detached;
> > }
> >
> > static inline void release_fault_lock(struct vm_fault *vmf)
> > @@ -859,7 +880,7 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> > #else /* CONFIG_PER_VMA_LOCK */
> >
> > static inline void vma_lock_init(struct vm_area_struct *vma) {}
> > -static inline bool vma_start_read(struct vm_area_struct *vma)
> > +static inline bool vma_start_read(struct mm_struct *mm, struct vm_area_struct *vma)
> > { return false; }
> > static inline void vma_end_read(struct vm_area_struct *vma) {}
> > static inline void vma_start_write(struct vm_area_struct *vma) {}
> > @@ -893,6 +914,7 @@ static inline void assert_fault_locked(struct vm_fault *vmf)
> >
> > extern const struct vm_operations_struct vma_dummy_vm_ops;
> >
> > +/* Use on VMAs not created using vm_area_alloc() */
> > static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
> > {
> > memset(vma, 0, sizeof(*vma));
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index be3551654325..5d8779997266 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -543,6 +543,12 @@ static inline void *folio_get_private(struct folio *folio)
> >
> > typedef unsigned long vm_flags_t;
> >
> > +/*
> > + * freeptr_t represents a SLUB freelist pointer, which might be encoded
> > + * and not dereferenceable if CONFIG_SLAB_FREELIST_HARDENED is enabled.
> > + */
> > +typedef struct { unsigned long v; } freeptr_t;
> > +
> > /*
> > * A region containing a mapping of a non-memory backed file under NOMMU
> > * conditions. These are held in a global tree and are pinned by the VMAs that
> > @@ -657,9 +663,7 @@ struct vm_area_struct {
> > unsigned long vm_start;
> > unsigned long vm_end;
> > };
> > -#ifdef CONFIG_PER_VMA_LOCK
> > - struct rcu_head vm_rcu; /* Used for deferred freeing. */
> > -#endif
> > + freeptr_t vm_freeptr; /* Pointer used by SLAB_TYPESAFE_BY_RCU */
> > };
> >
> > /*
> > diff --git a/include/linux/slab.h b/include/linux/slab.h
> > index 10a971c2bde3..681b685b6c4e 100644
> > --- a/include/linux/slab.h
> > +++ b/include/linux/slab.h
> > @@ -234,12 +234,6 @@ enum _slab_flag_bits {
> > #define SLAB_NO_OBJ_EXT __SLAB_FLAG_UNUSED
> > #endif
> >
> > -/*
> > - * freeptr_t represents a SLUB freelist pointer, which might be encoded
> > - * and not dereferenceable if CONFIG_SLAB_FREELIST_HARDENED is enabled.
> > - */
> > -typedef struct { unsigned long v; } freeptr_t;
> > -
> > /*
> > * ZERO_SIZE_PTR will be returned for zero sized kmalloc requests.
> > *
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index 71990f46aa4e..e7e76a660e4c 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -436,6 +436,98 @@ static struct kmem_cache *vm_area_cachep;
> > /* SLAB cache for mm_struct structures (tsk->mm) */
> > static struct kmem_cache *mm_cachep;
> >
> > +static void vm_area_ctor(void *data)
> > +{
> > + struct vm_area_struct *vma = (struct vm_area_struct *)data;
> > +
> > +#ifdef CONFIG_PER_VMA_LOCK
> > + /* vma is not locked, can't use vma_mark_detached() */
> > + vma->detached = true;
> > +#endif
> > + INIT_LIST_HEAD(&vma->anon_vma_chain);
> > + vma_lock_init(vma);
> > +}
> > +
> > +#ifdef CONFIG_PER_VMA_LOCK
> > +
> > +static void vma_clear(struct vm_area_struct *vma, struct mm_struct *mm)
> > +{
> > + vma->vm_mm = mm;
> > + vma->vm_ops = &vma_dummy_vm_ops;
> > + vma->vm_start = 0;
> > + vma->vm_end = 0;
> > + vma->anon_vma = NULL;
> > + vma->vm_pgoff = 0;
> > + vma->vm_file = NULL;
> > + vma->vm_private_data = NULL;
> > + vm_flags_init(vma, 0);
> > + memset(&vma->vm_page_prot, 0, sizeof(vma->vm_page_prot));
> > + memset(&vma->shared, 0, sizeof(vma->shared));
> > + memset(&vma->vm_userfaultfd_ctx, 0, sizeof(vma->vm_userfaultfd_ctx));
> > + vma_numab_state_init(vma);
> > +#ifdef CONFIG_ANON_VMA_NAME
> > + vma->anon_name = NULL;
> > +#endif
> > +#ifdef CONFIG_SWAP
> > + memset(&vma->swap_readahead_info, 0, sizeof(vma->swap_readahead_info));
> > +#endif
> > +#ifndef CONFIG_MMU
> > + vma->vm_region = NULL;
> > +#endif
> > +#ifdef CONFIG_NUMA
> > + vma->vm_policy = NULL;
> > +#endif
> > +}
> > +
> > +static void vma_copy(const struct vm_area_struct *src, struct vm_area_struct *dest)
> > +{
> > + dest->vm_mm = src->vm_mm;
> > + dest->vm_ops = src->vm_ops;
> > + dest->vm_start = src->vm_start;
> > + dest->vm_end = src->vm_end;
> > + dest->anon_vma = src->anon_vma;
> > + dest->vm_pgoff = src->vm_pgoff;
> > + dest->vm_file = src->vm_file;
> > + dest->vm_private_data = src->vm_private_data;
> > + vm_flags_init(dest, src->vm_flags);
> > + memcpy(&dest->vm_page_prot, &src->vm_page_prot,
> > + sizeof(dest->vm_page_prot));
> > + memcpy(&dest->shared, &src->shared, sizeof(dest->shared));
> > + memcpy(&dest->vm_userfaultfd_ctx, &src->vm_userfaultfd_ctx,
> > + sizeof(dest->vm_userfaultfd_ctx));
> > +#ifdef CONFIG_ANON_VMA_NAME
> > + dest->anon_name = src->anon_name;
> > +#endif
> > +#ifdef CONFIG_SWAP
> > + memcpy(&dest->swap_readahead_info, &src->swap_readahead_info,
> > + sizeof(dest->swap_readahead_info));
> > +#endif
> > +#ifndef CONFIG_MMU
> > + dest->vm_region = src->vm_region;
> > +#endif
> > +#ifdef CONFIG_NUMA
> > + dest->vm_policy = src->vm_policy;
> > +#endif
> > +}
> > +
> > +#else /* CONFIG_PER_VMA_LOCK */
> > +
> > +static void vma_clear(struct vm_area_struct *vma, struct mm_struct *mm)
> > +{
> > + vma_init(vma, mm);
> > +}
> > +
> > +static void vma_copy(const struct vm_area_struct *src, struct vm_area_struct *dest)
> > +{
> > + /*
> > + * orig->shared.rb may be modified concurrently, but the clone
> > + * will be reinitialized.
> > + */
> > + data_race(memcpy(dest, src, sizeof(*dest)));
> > +}
> > +
> > +#endif /* CONFIG_PER_VMA_LOCK */
> > +
> > struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
> > {
> > struct vm_area_struct *vma;
> > @@ -444,7 +536,7 @@ struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
> > if (!vma)
> > return NULL;
> >
> > - vma_init(vma, mm);
> > + vma_clear(vma, mm);
> >
> > return vma;
> > }
> > @@ -458,49 +550,46 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
> >
> > ASSERT_EXCLUSIVE_WRITER(orig->vm_flags);
> > ASSERT_EXCLUSIVE_WRITER(orig->vm_file);
> > - /*
> > - * orig->shared.rb may be modified concurrently, but the clone
> > - * will be reinitialized.
> > - */
> > - data_race(memcpy(new, orig, sizeof(*new)));
> > - vma_lock_init(new);
> > - INIT_LIST_HEAD(&new->anon_vma_chain);
> > -#ifdef CONFIG_PER_VMA_LOCK
> > - /* vma is not locked, can't use vma_mark_detached() */
> > - new->detached = true;
> > -#endif
> > + vma_copy(orig, new);
> > vma_numab_state_init(new);
> > dup_anon_vma_name(orig, new);
> >
> > return new;
> > }
> >
> > -void __vm_area_free(struct vm_area_struct *vma)
> > +static void __vm_area_free(struct vm_area_struct *vma, bool unreachable)
> > {
> > +#ifdef CONFIG_PER_VMA_LOCK
> > + /*
> > + * With SLAB_TYPESAFE_BY_RCU, vma can be reused and we need
> > + * vma->detached to be set before vma is returned into the cache.
> > + * This way reused object won't be used by readers until it's
> > + * initialized and reattached.
> > + * If vma is unreachable, there can be no other users and we
> > + * can set vma->detached directly with no risk of a race.
> > + * If vma is reachable, then it should have been already detached
> > + * under vma write-lock or it was never attached.
> > + */
> > + if (unreachable)
> > + vma->detached = true;
> > + else
> > + VM_BUG_ON_VMA(!is_vma_detached(vma), vma);
> > + vma->vm_lock_seq = UINT_MAX;
> > +#endif
> > + VM_BUG_ON_VMA(!list_empty(&vma->anon_vma_chain), vma);
> > vma_numab_state_free(vma);
> > free_anon_vma_name(vma);
> > kmem_cache_free(vm_area_cachep, vma);
> > }
> >
> > -#ifdef CONFIG_PER_VMA_LOCK
> > -static void vm_area_free_rcu_cb(struct rcu_head *head)
> > +void vm_area_free(struct vm_area_struct *vma)
> > {
> > - struct vm_area_struct *vma = container_of(head, struct vm_area_struct,
> > - vm_rcu);
> > -
> > - /* The vma should not be locked while being destroyed. */
> > - VM_BUG_ON_VMA(rwsem_is_locked(&vma->vm_lock.lock), vma);
> > - __vm_area_free(vma);
> > + __vm_area_free(vma, false);
> > }
> > -#endif
> >
> > -void vm_area_free(struct vm_area_struct *vma)
> > +void vm_area_free_unreachable(struct vm_area_struct *vma)
> > {
> > -#ifdef CONFIG_PER_VMA_LOCK
> > - call_rcu(&vma->vm_rcu, vm_area_free_rcu_cb);
> > -#else
> > - __vm_area_free(vma);
> > -#endif
> > + __vm_area_free(vma, true);
> > }
> >
> > static void account_kernel_stack(struct task_struct *tsk, int account)
> > @@ -3141,6 +3230,12 @@ void __init mm_cache_init(void)
> >
> > void __init proc_caches_init(void)
> > {
> > + struct kmem_cache_args args = {
> > + .use_freeptr_offset = true,
> > + .freeptr_offset = offsetof(struct vm_area_struct, vm_freeptr),
> > + .ctor = vm_area_ctor,
> > + };
> > +
> > sighand_cachep = kmem_cache_create("sighand_cache",
> > sizeof(struct sighand_struct), 0,
> > SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_TYPESAFE_BY_RCU|
> > @@ -3157,9 +3252,11 @@ void __init proc_caches_init(void)
> > sizeof(struct fs_struct), 0,
> > SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT,
> > NULL);
> > - vm_area_cachep = KMEM_CACHE(vm_area_struct,
> > - SLAB_HWCACHE_ALIGN|SLAB_NO_MERGE|SLAB_PANIC|
> > + vm_area_cachep = kmem_cache_create("vm_area_struct",
> > + sizeof(struct vm_area_struct), &args,
> > + SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_TYPESAFE_BY_RCU|
> > SLAB_ACCOUNT);
> > +
> > mmap_init();
> > nsproxy_cache_init();
> > }
> > diff --git a/mm/memory.c b/mm/memory.c
> > index b252f19b28c9..6f4d4d423835 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -6368,10 +6368,16 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> > if (!vma)
> > goto inval;
> >
> > - if (!vma_start_read(vma))
> > + if (!vma_start_read(mm, vma))
> > goto inval;
> >
> > - /* Check if the VMA got isolated after we found it */
> > + /*
> > + * Check if the VMA got isolated after we found it.
> > + * Note: vma we found could have been recycled and is being reattached.
> > + * It's possible to attach a vma while it is read-locked, however a
> > + * read-locked vma can't be detached (detaching requires write-locking).
> > + * Therefore if this check passes, we have an attached and stable vma.
> > + */
> > if (is_vma_detached(vma)) {
> > vma_end_read(vma);
> > count_vm_vma_lock_event(VMA_LOCK_MISS);
> > @@ -6385,8 +6391,9 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> > * fields are accessible for RCU readers.
> > */
> >
> > - /* Check since vm_start/vm_end might change before we lock the VMA */
> > - if (unlikely(address < vma->vm_start || address >= vma->vm_end))
> > + /* Check if the vma we locked is the right one. */
> > + if (unlikely(vma->vm_mm != mm ||
> > + address < vma->vm_start || address >= vma->vm_end))
> > goto inval_end_read;
> >
> > rcu_read_unlock();
> > diff --git a/mm/vma.c b/mm/vma.c
> > index cdc63728f47f..648784416833 100644
> > --- a/mm/vma.c
> > +++ b/mm/vma.c
> > @@ -414,7 +414,7 @@ void remove_vma(struct vm_area_struct *vma, bool unreachable)
> > fput(vma->vm_file);
> > mpol_put(vma_policy(vma));
> > if (unreachable)
> > - __vm_area_free(vma);
> > + vm_area_free_unreachable(vma);
> > else
> > vm_area_free(vma);
> > }
> > diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
> > index 0cdc5f8c3d60..3eeb1317cc69 100644
> > --- a/tools/testing/vma/vma_internal.h
> > +++ b/tools/testing/vma/vma_internal.h
> > @@ -685,14 +685,15 @@ static inline void mpol_put(struct mempolicy *)
> > {
> > }
> >
> > -static inline void __vm_area_free(struct vm_area_struct *vma)
> > +static inline void vm_area_free(struct vm_area_struct *vma)
> > {
> > free(vma);
> > }
> >
> > -static inline void vm_area_free(struct vm_area_struct *vma)
> > +static inline void vm_area_free_unreachable(struct vm_area_struct *vma)
> > {
> > - __vm_area_free(vma);
> > + vma->detached = true;
> > + vm_area_free(vma);
> > }
> >
> > static inline void lru_add_drain(void)
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v5 4/6] mm: make vma cache SLAB_TYPESAFE_BY_RCU
2024-12-09 20:28 ` Suren Baghdasaryan
@ 2024-12-09 22:19 ` Suren Baghdasaryan
0 siblings, 0 replies; 32+ messages in thread
From: Suren Baghdasaryan @ 2024-12-09 22:19 UTC (permalink / raw)
To: Klara Modin
Cc: akpm, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, minchan, jannh,
shakeel.butt, souravpanda, pasha.tatashin, corbet, linux-doc,
linux-mm, linux-kernel, kernel-team
On Mon, Dec 9, 2024 at 12:28 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Mon, Dec 9, 2024 at 9:35 AM Klara Modin <klarasmodin@gmail.com> wrote:
> >
> > Hi,
> >
> > On 2024-12-06 23:52, Suren Baghdasaryan wrote:
> > > To enable SLAB_TYPESAFE_BY_RCU for vma cache we need to ensure that
> > > object reuse before RCU grace period is over will be detected inside
> > > lock_vma_under_rcu().
> > > lock_vma_under_rcu() enters RCU read section, finds the vma at the
> > > given address, locks the vma and checks if it got detached or remapped
> > > to cover a different address range. These last checks are there
> > > to ensure that the vma was not modified after we found it but before
> > > locking it.
> > > vma reuse introduces several new possibilities:
> > > 1. vma can be reused after it was found but before it is locked;
> > > 2. vma can be reused and reinitialized (including changing its vm_mm)
> > > while being locked in vma_start_read();
> > > 3. vma can be reused and reinitialized after it was found but before
> > > it is locked, then attached at a new address or to a new mm while
> > > read-locked;
> > > For case #1, current checks will help detect cases when:
> > > - vma was reused but not yet added into the tree (detached check)
> > > - vma was reused at a different address range (address check);
> > > We are missing the check for vm_mm to ensure the reused vma was not
> > > attached to a different mm. This patch adds the missing check.
> > > For case #2, we pass mm to vma_start_read() to prevent access to
> > > unstable vma->vm_mm. This might lead to vma_start_read() returning
> > > a false locked result but that's not critical if it's rare because
> > > it will only lead to a retry under mmap_lock.
> > > For case #3, we ensure the order in which vma->detached flag and
> > > vm_start/vm_end/vm_mm are set and checked. vma gets attached after
> > > vm_start/vm_end/vm_mm were set and lock_vma_under_rcu() should check
> > > vma->detached before checking vm_start/vm_end/vm_mm. This is required
> > > because attaching vma happens without vma write-lock, as opposed to
> > > vma detaching, which requires vma write-lock. This patch adds memory
> > > barriers inside is_vma_detached() and vma_mark_attached() needed to
> > > order reads and writes to vma->detached vs vm_start/vm_end/vm_mm.
> > > After these provisions, SLAB_TYPESAFE_BY_RCU is added to vm_area_cachep.
> > > This will facilitate vm_area_struct reuse and will minimize the number
> > > of call_rcu() calls.
> >
> > This patch (85ad413389aec04cfaaba043caa8128b76c6e491 in next-20241209)
> > seems to cause an oops on a MIPS board of mine (Cavium Octeon III)
> > (abbreviated, full attached):
>
> Thanks! This is really helpful because both errors are reported on
> architectures which don't set CONFIG_PER_VMA_LOCK. This is a great
> clue.
> I'm investigating it and will post a fix asap.
I think https://lore.kernel.org/all/20241209221028.1644210-1-surenb@google.com/
should fix this.
> Thanks,
> Suren.
>
> >
> > CPU 2 Unable to handle kernel paging request at virtual address
> > 0000000000000000, epc == ffffffff813a85a0, ra == ffffffff81390438
> > Oops[#1]:
> > CPU: 2 UID: 0 PID: 1 Comm: init Not tainted
> > 6.13.0-rc1-00162-g85ad413389ae #156
> > Call Trace:
> > unlink_anon_vmas (mm/rmap.c:408)
> > free_pgtables (mm/memory.c:393)
> > vms_clear_ptes (mm/vma.c:1143)
> > vms_complete_munmap_vmas (include/linux/mm.h:2737 mm/vma.c:1187)
> > do_vmi_align_munmap (mm/vma.c:1452)
> > __vm_munmap (mm/vma.c:2892)
> > sys_munmap (mm/mmap.c:1053)
> > syscall_common (arch/mips/kernel/scall64-n64.S:62)
> >
> > I saw that there's already a report, but maybe another arch can be
> > useful for tracking this down.
> >
> > Please let me know if there's anything else you need.
> >
> > Regards,
> > Klara Modin
> >
> > Link: https://lore.kernel.org/all/202412082208.db1fb2c9-lkp@intel.com
> >
> > >
> > > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > > ---
> > > include/linux/mm.h | 36 +++++--
> > > include/linux/mm_types.h | 10 +-
> > > include/linux/slab.h | 6 --
> > > kernel/fork.c | 157 +++++++++++++++++++++++++------
> > > mm/memory.c | 15 ++-
> > > mm/vma.c | 2 +-
> > > tools/testing/vma/vma_internal.h | 7 +-
> > > 7 files changed, 179 insertions(+), 54 deletions(-)
> > >
> > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > index 2bf38c1e9cca..3568bcbc7c81 100644
> > > --- a/include/linux/mm.h
> > > +++ b/include/linux/mm.h
> > > @@ -257,7 +257,7 @@ struct vm_area_struct *vm_area_alloc(struct mm_struct *);
> > > struct vm_area_struct *vm_area_dup(struct vm_area_struct *);
> > > void vm_area_free(struct vm_area_struct *);
> > > /* Use only if VMA has no other users */
> > > -void __vm_area_free(struct vm_area_struct *vma);
> > > +void vm_area_free_unreachable(struct vm_area_struct *vma);
> > >
> > > #ifndef CONFIG_MMU
> > > extern struct rb_root nommu_region_tree;
> > > @@ -706,8 +706,10 @@ static inline void vma_lock_init(struct vm_area_struct *vma)
> > > * Try to read-lock a vma. The function is allowed to occasionally yield false
> > > * locked result to avoid performance overhead, in which case we fall back to
> > > * using mmap_lock. The function should never yield false unlocked result.
> > > + * False locked result is possible if mm_lock_seq overflows or if vma gets
> > > + * reused and attached to a different mm before we lock it.
> > > */
> > > -static inline bool vma_start_read(struct vm_area_struct *vma)
> > > +static inline bool vma_start_read(struct mm_struct *mm, struct vm_area_struct *vma)
> > > {
> > > /*
> > > * Check before locking. A race might cause false locked result.
> > > @@ -716,7 +718,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> > > * we don't rely on for anything - the mm_lock_seq read against which we
> > > * need ordering is below.
> > > */
> > > - if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq.sequence))
> > > + if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(mm->mm_lock_seq.sequence))
> > > return false;
> > >
> > > if (unlikely(down_read_trylock(&vma->vm_lock.lock) == 0))
> > > @@ -733,7 +735,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> > > * after it has been unlocked.
> > > * This pairs with RELEASE semantics in vma_end_write_all().
> > > */
> > > - if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&vma->vm_mm->mm_lock_seq))) {
> > > + if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&mm->mm_lock_seq))) {
> > > up_read(&vma->vm_lock.lock);
> > > return false;
> > > }
> > > @@ -822,7 +824,15 @@ static inline void vma_assert_locked(struct vm_area_struct *vma)
> > >
> > > static inline void vma_mark_attached(struct vm_area_struct *vma)
> > > {
> > > - vma->detached = false;
> > > + /*
> > > + * This pairs with smp_rmb() inside is_vma_detached().
> > > + * vma is marked attached after all vma modifications are done and it
> > > + * got added into the vma tree. All prior vma modifications should be
> > > + * made visible before marking the vma attached.
> > > + */
> > > + smp_wmb();
> > > + /* This pairs with READ_ONCE() in is_vma_detached(). */
> > > + WRITE_ONCE(vma->detached, false);
> > > }
> > >
> > > static inline void vma_mark_detached(struct vm_area_struct *vma)
> > > @@ -834,7 +844,18 @@ static inline void vma_mark_detached(struct vm_area_struct *vma)
> > >
> > > static inline bool is_vma_detached(struct vm_area_struct *vma)
> > > {
> > > - return vma->detached;
> > > + bool detached;
> > > +
> > > + /* This pairs with WRITE_ONCE() in vma_mark_attached(). */
> > > + detached = READ_ONCE(vma->detached);
> > > + /*
> > > + * This pairs with smp_wmb() inside vma_mark_attached() to ensure
> > > + * vma->detached is read before vma attributes read later inside
> > > + * lock_vma_under_rcu().
> > > + */
> > > + smp_rmb();
> > > +
> > > + return detached;
> > > }
> > >
> > > static inline void release_fault_lock(struct vm_fault *vmf)
> > > @@ -859,7 +880,7 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> > > #else /* CONFIG_PER_VMA_LOCK */
> > >
> > > static inline void vma_lock_init(struct vm_area_struct *vma) {}
> > > -static inline bool vma_start_read(struct vm_area_struct *vma)
> > > +static inline bool vma_start_read(struct mm_struct *mm, struct vm_area_struct *vma)
> > > { return false; }
> > > static inline void vma_end_read(struct vm_area_struct *vma) {}
> > > static inline void vma_start_write(struct vm_area_struct *vma) {}
> > > @@ -893,6 +914,7 @@ static inline void assert_fault_locked(struct vm_fault *vmf)
> > >
> > > extern const struct vm_operations_struct vma_dummy_vm_ops;
> > >
> > > +/* Use on VMAs not created using vm_area_alloc() */
> > > static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
> > > {
> > > memset(vma, 0, sizeof(*vma));
> > > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > > index be3551654325..5d8779997266 100644
> > > --- a/include/linux/mm_types.h
> > > +++ b/include/linux/mm_types.h
> > > @@ -543,6 +543,12 @@ static inline void *folio_get_private(struct folio *folio)
> > >
> > > typedef unsigned long vm_flags_t;
> > >
> > > +/*
> > > + * freeptr_t represents a SLUB freelist pointer, which might be encoded
> > > + * and not dereferenceable if CONFIG_SLAB_FREELIST_HARDENED is enabled.
> > > + */
> > > +typedef struct { unsigned long v; } freeptr_t;
> > > +
> > > /*
> > > * A region containing a mapping of a non-memory backed file under NOMMU
> > > * conditions. These are held in a global tree and are pinned by the VMAs that
> > > @@ -657,9 +663,7 @@ struct vm_area_struct {
> > > unsigned long vm_start;
> > > unsigned long vm_end;
> > > };
> > > -#ifdef CONFIG_PER_VMA_LOCK
> > > - struct rcu_head vm_rcu; /* Used for deferred freeing. */
> > > -#endif
> > > + freeptr_t vm_freeptr; /* Pointer used by SLAB_TYPESAFE_BY_RCU */
> > > };
> > >
> > > /*
> > > diff --git a/include/linux/slab.h b/include/linux/slab.h
> > > index 10a971c2bde3..681b685b6c4e 100644
> > > --- a/include/linux/slab.h
> > > +++ b/include/linux/slab.h
> > > @@ -234,12 +234,6 @@ enum _slab_flag_bits {
> > > #define SLAB_NO_OBJ_EXT __SLAB_FLAG_UNUSED
> > > #endif
> > >
> > > -/*
> > > - * freeptr_t represents a SLUB freelist pointer, which might be encoded
> > > - * and not dereferenceable if CONFIG_SLAB_FREELIST_HARDENED is enabled.
> > > - */
> > > -typedef struct { unsigned long v; } freeptr_t;
> > > -
> > > /*
> > > * ZERO_SIZE_PTR will be returned for zero sized kmalloc requests.
> > > *
> > > diff --git a/kernel/fork.c b/kernel/fork.c
> > > index 71990f46aa4e..e7e76a660e4c 100644
> > > --- a/kernel/fork.c
> > > +++ b/kernel/fork.c
> > > @@ -436,6 +436,98 @@ static struct kmem_cache *vm_area_cachep;
> > > /* SLAB cache for mm_struct structures (tsk->mm) */
> > > static struct kmem_cache *mm_cachep;
> > >
> > > +static void vm_area_ctor(void *data)
> > > +{
> > > + struct vm_area_struct *vma = (struct vm_area_struct *)data;
> > > +
> > > +#ifdef CONFIG_PER_VMA_LOCK
> > > + /* vma is not locked, can't use vma_mark_detached() */
> > > + vma->detached = true;
> > > +#endif
> > > + INIT_LIST_HEAD(&vma->anon_vma_chain);
> > > + vma_lock_init(vma);
> > > +}
> > > +
> > > +#ifdef CONFIG_PER_VMA_LOCK
> > > +
> > > +static void vma_clear(struct vm_area_struct *vma, struct mm_struct *mm)
> > > +{
> > > + vma->vm_mm = mm;
> > > + vma->vm_ops = &vma_dummy_vm_ops;
> > > + vma->vm_start = 0;
> > > + vma->vm_end = 0;
> > > + vma->anon_vma = NULL;
> > > + vma->vm_pgoff = 0;
> > > + vma->vm_file = NULL;
> > > + vma->vm_private_data = NULL;
> > > + vm_flags_init(vma, 0);
> > > + memset(&vma->vm_page_prot, 0, sizeof(vma->vm_page_prot));
> > > + memset(&vma->shared, 0, sizeof(vma->shared));
> > > + memset(&vma->vm_userfaultfd_ctx, 0, sizeof(vma->vm_userfaultfd_ctx));
> > > + vma_numab_state_init(vma);
> > > +#ifdef CONFIG_ANON_VMA_NAME
> > > + vma->anon_name = NULL;
> > > +#endif
> > > +#ifdef CONFIG_SWAP
> > > + memset(&vma->swap_readahead_info, 0, sizeof(vma->swap_readahead_info));
> > > +#endif
> > > +#ifndef CONFIG_MMU
> > > + vma->vm_region = NULL;
> > > +#endif
> > > +#ifdef CONFIG_NUMA
> > > + vma->vm_policy = NULL;
> > > +#endif
> > > +}
> > > +
> > > +static void vma_copy(const struct vm_area_struct *src, struct vm_area_struct *dest)
> > > +{
> > > + dest->vm_mm = src->vm_mm;
> > > + dest->vm_ops = src->vm_ops;
> > > + dest->vm_start = src->vm_start;
> > > + dest->vm_end = src->vm_end;
> > > + dest->anon_vma = src->anon_vma;
> > > + dest->vm_pgoff = src->vm_pgoff;
> > > + dest->vm_file = src->vm_file;
> > > + dest->vm_private_data = src->vm_private_data;
> > > + vm_flags_init(dest, src->vm_flags);
> > > + memcpy(&dest->vm_page_prot, &src->vm_page_prot,
> > > + sizeof(dest->vm_page_prot));
> > > + memcpy(&dest->shared, &src->shared, sizeof(dest->shared));
> > > + memcpy(&dest->vm_userfaultfd_ctx, &src->vm_userfaultfd_ctx,
> > > + sizeof(dest->vm_userfaultfd_ctx));
> > > +#ifdef CONFIG_ANON_VMA_NAME
> > > + dest->anon_name = src->anon_name;
> > > +#endif
> > > +#ifdef CONFIG_SWAP
> > > + memcpy(&dest->swap_readahead_info, &src->swap_readahead_info,
> > > + sizeof(dest->swap_readahead_info));
> > > +#endif
> > > +#ifndef CONFIG_MMU
> > > + dest->vm_region = src->vm_region;
> > > +#endif
> > > +#ifdef CONFIG_NUMA
> > > + dest->vm_policy = src->vm_policy;
> > > +#endif
> > > +}
> > > +
> > > +#else /* CONFIG_PER_VMA_LOCK */
> > > +
> > > +static void vma_clear(struct vm_area_struct *vma, struct mm_struct *mm)
> > > +{
> > > + vma_init(vma, mm);
> > > +}
> > > +
> > > +static void vma_copy(const struct vm_area_struct *src, struct vm_area_struct *dest)
> > > +{
> > > + /*
> > > + * orig->shared.rb may be modified concurrently, but the clone
> > > + * will be reinitialized.
> > > + */
> > > + data_race(memcpy(dest, src, sizeof(*dest)));
> > > +}
> > > +
> > > +#endif /* CONFIG_PER_VMA_LOCK */
> > > +
> > > struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
> > > {
> > > struct vm_area_struct *vma;
> > > @@ -444,7 +536,7 @@ struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
> > > if (!vma)
> > > return NULL;
> > >
> > > - vma_init(vma, mm);
> > > + vma_clear(vma, mm);
> > >
> > > return vma;
> > > }
> > > @@ -458,49 +550,46 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
> > >
> > > ASSERT_EXCLUSIVE_WRITER(orig->vm_flags);
> > > ASSERT_EXCLUSIVE_WRITER(orig->vm_file);
> > > - /*
> > > - * orig->shared.rb may be modified concurrently, but the clone
> > > - * will be reinitialized.
> > > - */
> > > - data_race(memcpy(new, orig, sizeof(*new)));
> > > - vma_lock_init(new);
> > > - INIT_LIST_HEAD(&new->anon_vma_chain);
> > > -#ifdef CONFIG_PER_VMA_LOCK
> > > - /* vma is not locked, can't use vma_mark_detached() */
> > > - new->detached = true;
> > > -#endif
> > > + vma_copy(orig, new);
> > > vma_numab_state_init(new);
> > > dup_anon_vma_name(orig, new);
> > >
> > > return new;
> > > }
> > >
> > > -void __vm_area_free(struct vm_area_struct *vma)
> > > +static void __vm_area_free(struct vm_area_struct *vma, bool unreachable)
> > > {
> > > +#ifdef CONFIG_PER_VMA_LOCK
> > > + /*
> > > + * With SLAB_TYPESAFE_BY_RCU, vma can be reused and we need
> > > + * vma->detached to be set before vma is returned into the cache.
> > > + * This way reused object won't be used by readers until it's
> > > + * initialized and reattached.
> > > + * If vma is unreachable, there can be no other users and we
> > > + * can set vma->detached directly with no risk of a race.
> > > + * If vma is reachable, then it should have been already detached
> > > + * under vma write-lock or it was never attached.
> > > + */
> > > + if (unreachable)
> > > + vma->detached = true;
> > > + else
> > > + VM_BUG_ON_VMA(!is_vma_detached(vma), vma);
> > > + vma->vm_lock_seq = UINT_MAX;
> > > +#endif
> > > + VM_BUG_ON_VMA(!list_empty(&vma->anon_vma_chain), vma);
> > > vma_numab_state_free(vma);
> > > free_anon_vma_name(vma);
> > > kmem_cache_free(vm_area_cachep, vma);
> > > }
> > >
> > > -#ifdef CONFIG_PER_VMA_LOCK
> > > -static void vm_area_free_rcu_cb(struct rcu_head *head)
> > > +void vm_area_free(struct vm_area_struct *vma)
> > > {
> > > - struct vm_area_struct *vma = container_of(head, struct vm_area_struct,
> > > - vm_rcu);
> > > -
> > > - /* The vma should not be locked while being destroyed. */
> > > - VM_BUG_ON_VMA(rwsem_is_locked(&vma->vm_lock.lock), vma);
> > > - __vm_area_free(vma);
> > > + __vm_area_free(vma, false);
> > > }
> > > -#endif
> > >
> > > -void vm_area_free(struct vm_area_struct *vma)
> > > +void vm_area_free_unreachable(struct vm_area_struct *vma)
> > > {
> > > -#ifdef CONFIG_PER_VMA_LOCK
> > > - call_rcu(&vma->vm_rcu, vm_area_free_rcu_cb);
> > > -#else
> > > - __vm_area_free(vma);
> > > -#endif
> > > + __vm_area_free(vma, true);
> > > }
> > >
> > > static void account_kernel_stack(struct task_struct *tsk, int account)
> > > @@ -3141,6 +3230,12 @@ void __init mm_cache_init(void)
> > >
> > > void __init proc_caches_init(void)
> > > {
> > > + struct kmem_cache_args args = {
> > > + .use_freeptr_offset = true,
> > > + .freeptr_offset = offsetof(struct vm_area_struct, vm_freeptr),
> > > + .ctor = vm_area_ctor,
> > > + };
> > > +
> > > sighand_cachep = kmem_cache_create("sighand_cache",
> > > sizeof(struct sighand_struct), 0,
> > > SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_TYPESAFE_BY_RCU|
> > > @@ -3157,9 +3252,11 @@ void __init proc_caches_init(void)
> > > sizeof(struct fs_struct), 0,
> > > SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT,
> > > NULL);
> > > - vm_area_cachep = KMEM_CACHE(vm_area_struct,
> > > - SLAB_HWCACHE_ALIGN|SLAB_NO_MERGE|SLAB_PANIC|
> > > + vm_area_cachep = kmem_cache_create("vm_area_struct",
> > > + sizeof(struct vm_area_struct), &args,
> > > + SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_TYPESAFE_BY_RCU|
> > > SLAB_ACCOUNT);
> > > +
> > > mmap_init();
> > > nsproxy_cache_init();
> > > }
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index b252f19b28c9..6f4d4d423835 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -6368,10 +6368,16 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> > > if (!vma)
> > > goto inval;
> > >
> > > - if (!vma_start_read(vma))
> > > + if (!vma_start_read(mm, vma))
> > > goto inval;
> > >
> > > - /* Check if the VMA got isolated after we found it */
> > > + /*
> > > + * Check if the VMA got isolated after we found it.
> > > + * Note: vma we found could have been recycled and is being reattached.
> > > + * It's possible to attach a vma while it is read-locked, however a
> > > + * read-locked vma can't be detached (detaching requires write-locking).
> > > + * Therefore if this check passes, we have an attached and stable vma.
> > > + */
> > > if (is_vma_detached(vma)) {
> > > vma_end_read(vma);
> > > count_vm_vma_lock_event(VMA_LOCK_MISS);
> > > @@ -6385,8 +6391,9 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> > > * fields are accessible for RCU readers.
> > > */
> > >
> > > - /* Check since vm_start/vm_end might change before we lock the VMA */
> > > - if (unlikely(address < vma->vm_start || address >= vma->vm_end))
> > > + /* Check if the vma we locked is the right one. */
> > > + if (unlikely(vma->vm_mm != mm ||
> > > + address < vma->vm_start || address >= vma->vm_end))
> > > goto inval_end_read;
> > >
> > > rcu_read_unlock();
> > > diff --git a/mm/vma.c b/mm/vma.c
> > > index cdc63728f47f..648784416833 100644
> > > --- a/mm/vma.c
> > > +++ b/mm/vma.c
> > > @@ -414,7 +414,7 @@ void remove_vma(struct vm_area_struct *vma, bool unreachable)
> > > fput(vma->vm_file);
> > > mpol_put(vma_policy(vma));
> > > if (unreachable)
> > > - __vm_area_free(vma);
> > > + vm_area_free_unreachable(vma);
> > > else
> > > vm_area_free(vma);
> > > }
> > > diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
> > > index 0cdc5f8c3d60..3eeb1317cc69 100644
> > > --- a/tools/testing/vma/vma_internal.h
> > > +++ b/tools/testing/vma/vma_internal.h
> > > @@ -685,14 +685,15 @@ static inline void mpol_put(struct mempolicy *)
> > > {
> > > }
> > >
> > > -static inline void __vm_area_free(struct vm_area_struct *vma)
> > > +static inline void vm_area_free(struct vm_area_struct *vma)
> > > {
> > > free(vma);
> > > }
> > >
> > > -static inline void vm_area_free(struct vm_area_struct *vma)
> > > +static inline void vm_area_free_unreachable(struct vm_area_struct *vma)
> > > {
> > > - __vm_area_free(vma);
> > > + vma->detached = true;
> > > + vm_area_free(vma);
> > > }
> > >
> > > static inline void lru_add_drain(void)
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v5 1/6] mm: introduce vma_start_read_locked{_nested} helpers
2024-12-06 22:51 ` [PATCH v5 1/6] mm: introduce vma_start_read_locked{_nested} helpers Suren Baghdasaryan
@ 2024-12-10 9:03 ` Vlastimil Babka
0 siblings, 0 replies; 32+ messages in thread
From: Vlastimil Babka @ 2024-12-10 9:03 UTC (permalink / raw)
To: Suren Baghdasaryan, akpm
Cc: willy, liam.howlett, lorenzo.stoakes, mhocko, hannes, mjguzik,
oliver.sang, mgorman, david, peterx, oleg, dave, paulmck,
brauner, dhowells, hdanton, hughd, minchan, jannh, shakeel.butt,
souravpanda, pasha.tatashin, corbet, linux-doc, linux-mm,
linux-kernel, kernel-team
On 12/6/24 23:51, Suren Baghdasaryan wrote:
> Introduce helper functions which can be used to read-lock a VMA when
> holding mmap_lock for read. Replace direct accesses to vma->vm_lock
> with these new helpers.
>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
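For reference, the helpers boil down to something like the sketch below (written against the vma->vm_lock.lock placement used later in this series; since mmap_lock is held for read, no writer can be write-locking the vma, so the helpers cannot fail and return void):

static inline void vma_start_read_locked_nested(struct vm_area_struct *vma,
                                                int subclass)
{
        mmap_assert_locked(vma->vm_mm);
        down_read_nested(&vma->vm_lock.lock, subclass);
}

static inline void vma_start_read_locked(struct vm_area_struct *vma)
{
        vma_start_read_locked_nested(vma, 0);
}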
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v5 2/6] mm: move per-vma lock into vm_area_struct
2024-12-06 22:51 ` [PATCH v5 2/6] mm: move per-vma lock into vm_area_struct Suren Baghdasaryan
@ 2024-12-10 9:15 ` Vlastimil Babka
0 siblings, 0 replies; 32+ messages in thread
From: Vlastimil Babka @ 2024-12-10 9:15 UTC (permalink / raw)
To: Suren Baghdasaryan, akpm
Cc: willy, liam.howlett, lorenzo.stoakes, mhocko, hannes, mjguzik,
oliver.sang, mgorman, david, peterx, oleg, dave, paulmck,
brauner, dhowells, hdanton, hughd, minchan, jannh, shakeel.butt,
souravpanda, pasha.tatashin, corbet, linux-doc, linux-mm,
linux-kernel, kernel-team
On 12/6/24 23:51, Suren Baghdasaryan wrote:
> Back when per-vma locks were introduces, vm_lock was moved out of
> vm_area_struct in [1] because of the performance regression caused by
> false cacheline sharing. Recent investigation [2] revealed that the
> regressions is limited to a rather old Broadwell microarchitecture and
> even there it can be mitigated by disabling adjacent cacheline
> prefetching, see [3].
> Splitting single logical structure into multiple ones leads to more
> complicated management, extra pointer dereferences and overall less
> maintainable code. When that split-away part is a lock, it complicates
> things even further. With no performance benefits, there are no reasons
> for this split. Merging the vm_lock back into vm_area_struct also allows
> vm_area_struct to use SLAB_TYPESAFE_BY_RCU later in this patchset.
> Move vm_lock back into vm_area_struct, aligning it at the cacheline
> boundary and changing the cache to be cacheline-aligned as well.
> With kernel compiled using defconfig, this causes VMA memory consumption
> to grow from 160 (vm_area_struct) + 40 (vm_lock) bytes to 256 bytes:
>
> slabinfo before:
> <name> ... <objsize> <objperslab> <pagesperslab> : ...
> vma_lock ... 40 102 1 : ...
> vm_area_struct ... 160 51 2 : ...
>
> slabinfo after moving vm_lock:
> <name> ... <objsize> <objperslab> <pagesperslab> : ...
> vm_area_struct ... 256 32 2 : ...
>
> Aggregate VMA memory consumption per 1000 VMAs grows from 50 to 64 pages,
> which is 5.5MB per 100000 VMAs. Note that the size of this structure is
> dependent on the kernel configuration and typically the original size is
> higher than 160 bytes. Therefore these calculations are close to the
> worst case scenario. A more realistic vm_area_struct usage before this
> change is:
>
> <name> ... <objsize> <objperslab> <pagesperslab> : ...
> vma_lock ... 40 102 1 : ...
> vm_area_struct ... 176 46 2 : ...
>
> Aggregate VMA memory consumption per 1000 VMAs grows from 54 to 64 pages,
> which is 3.9MB per 100000 VMAs.
> This memory consumption growth can be addressed later by optimizing the
> vm_lock.
>
> [1] https://lore.kernel.org/all/20230227173632.3292573-34-surenb@google.com/
> [2] https://lore.kernel.org/all/ZsQyI%2F087V34JoIt@xsang-OptiPlex-9020/
> [3] https://lore.kernel.org/all/CAJuCfpEisU8Lfe96AYJDZ+OM4NoPmnw9bP53cT_kbfP_pR+-2g@mail.gmail.com/
>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
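For reference, the layout side of the change amounts to roughly the following sketch (member placement only; the exact annotations are in the patch itself):

struct vm_area_struct {
        /* ... existing members ... */
#ifdef CONFIG_PER_VMA_LOCK
        /*
         * Keep the lock on its own cacheline so write-lock traffic does not
         * share a line with the hot read-mostly fields above.
         */
        struct vma_lock vm_lock ____cacheline_aligned_in_smp;
#endif
} __randomize_layout;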
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v5 3/6] mm: mark vma as detached until it's added into vma tree
2024-12-06 22:52 ` [PATCH v5 3/6] mm: mark vma as detached until it's added into vma tree Suren Baghdasaryan
@ 2024-12-10 9:35 ` Vlastimil Babka
2024-12-10 11:36 ` Vlastimil Babka
1 sibling, 0 replies; 32+ messages in thread
From: Vlastimil Babka @ 2024-12-10 9:35 UTC (permalink / raw)
To: Suren Baghdasaryan, akpm
Cc: willy, liam.howlett, lorenzo.stoakes, mhocko, hannes, mjguzik,
oliver.sang, mgorman, david, peterx, oleg, dave, paulmck,
brauner, dhowells, hdanton, hughd, minchan, jannh, shakeel.butt,
souravpanda, pasha.tatashin, corbet, linux-doc, linux-mm,
linux-kernel, kernel-team
On 12/6/24 23:52, Suren Baghdasaryan wrote:
> Current implementation does not set detached flag when a VMA is first
> allocated. This does not represent the real state of the VMA, which is
> detached until it is added into mm's VMA tree. Fix this by marking new
> VMAs as detached and resetting detached flag only after VMA is added
> into a tree.
> Introduce vma_mark_attached() to make the API more readable and to
> simplify possible future cleanup when vma->vm_mm might be used to
> indicate detached vma and vma_mark_attached() will need an additional
> mm parameter.
>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v5 5/6] mm/slab: allow freeptr_offset to be used with ctor
2024-12-06 22:52 ` [PATCH v5 5/6] mm/slab: allow freeptr_offset to be used with ctor Suren Baghdasaryan
@ 2024-12-10 11:01 ` Vlastimil Babka
0 siblings, 0 replies; 32+ messages in thread
From: Vlastimil Babka @ 2024-12-10 11:01 UTC (permalink / raw)
To: Suren Baghdasaryan, akpm
Cc: willy, liam.howlett, lorenzo.stoakes, mhocko, hannes, mjguzik,
oliver.sang, mgorman, david, peterx, oleg, dave, paulmck,
brauner, dhowells, hdanton, hughd, minchan, jannh, shakeel.butt,
souravpanda, pasha.tatashin, corbet, linux-doc, linux-mm,
linux-kernel, kernel-team
On 12/6/24 23:52, Suren Baghdasaryan wrote:
> There is no real reason to prevent freeptr_offset usage when a slab
> cache has a ctor. The only real limitation is that any field unioned
> with the free pointer and initialized by ctor will be overwritten since
> free pointer is set after @ctor invocation. Document this limitation
> and enable usage of freeptr_offset with ctor.
>
> Suggested-by: Vlastimil Babka <vbabka@suse.cz>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> ---
> include/linux/slab.h | 5 +++--
> mm/slub.c | 2 +-
> 2 files changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index 681b685b6c4e..6bad744bef5e 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -305,8 +305,9 @@ struct kmem_cache_args {
> * Using %0 as a value for @freeptr_offset is valid. If @freeptr_offset
> * is specified, %use_freeptr_offset must be set %true.
> *
> - * Note that @ctor currently isn't supported with custom free pointers
> - * as a @ctor requires an external free pointer.
> + * Note that fields unioned with free pointer cannot be initialized by
> + * @ctor since free pointer is set after @ctor invocation, so those
> + * values will be overwritten.
> */
> unsigned int freeptr_offset;
> /**
> diff --git a/mm/slub.c b/mm/slub.c
> index 870a1d95521d..f62c829b7b6b 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -5462,7 +5462,7 @@ static int calculate_sizes(struct kmem_cache_args *args, struct kmem_cache *s)
> s->inuse = size;
>
> if (((flags & SLAB_TYPESAFE_BY_RCU) && !args->use_freeptr_offset) ||
> - (flags & SLAB_POISON) || s->ctor ||
> + (flags & SLAB_POISON) || (s->ctor && !args->use_freeptr_offset) ||
> ((flags & SLAB_RED_ZONE) &&
> (s->object_size < sizeof(void *) || slub_debug_orig_size(s)))) {
> /*
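For illustration, a hedged example of a cache that combines a ctor with a custom free pointer under this rule (struct and names are invented; the point is that nothing the ctor initializes is unioned with the free pointer):

struct foo {
        struct list_head list;          /* set up by the ctor, must survive reuse */
        union {
                unsigned long scratch;  /* only meaningful while allocated */
                freeptr_t free;         /* used by SLUB once the object is freed */
        };
};

static struct kmem_cache *foo_cachep;

static void foo_ctor(void *obj)
{
        struct foo *f = obj;

        /* don't touch f->scratch/f->free here - the freelist pointer wins */
        INIT_LIST_HEAD(&f->list);
}

void __init foo_cache_init(void)
{
        struct kmem_cache_args args = {
                .use_freeptr_offset = true,
                .freeptr_offset = offsetof(struct foo, free),
                .ctor = foo_ctor,
        };

        foo_cachep = kmem_cache_create("foo", sizeof(struct foo), &args,
                                       SLAB_TYPESAFE_BY_RCU | SLAB_ACCOUNT);
}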
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v5 3/6] mm: mark vma as detached until it's added into vma tree
2024-12-06 22:52 ` [PATCH v5 3/6] mm: mark vma as detached until it's added into vma tree Suren Baghdasaryan
2024-12-10 9:35 ` Vlastimil Babka
@ 2024-12-10 11:36 ` Vlastimil Babka
2024-12-10 16:28 ` Suren Baghdasaryan
1 sibling, 1 reply; 32+ messages in thread
From: Vlastimil Babka @ 2024-12-10 11:36 UTC (permalink / raw)
To: Suren Baghdasaryan, akpm
Cc: willy, liam.howlett, lorenzo.stoakes, mhocko, hannes, mjguzik,
oliver.sang, mgorman, david, peterx, oleg, dave, paulmck,
brauner, dhowells, hdanton, hughd, minchan, jannh, shakeel.butt,
souravpanda, pasha.tatashin, corbet, linux-doc, linux-mm,
linux-kernel, kernel-team
On 12/6/24 23:52, Suren Baghdasaryan wrote:
> Current implementation does not set detached flag when a VMA is first
> allocated. This does not represent the real state of the VMA, which is
> detached until it is added into mm's VMA tree. Fix this by marking new
> VMAs as detached and resetting detached flag only after VMA is added
> into a tree.
> Introduce vma_mark_attached() to make the API more readable and to
> simplify possible future cleanup when vma->vm_mm might be used to
> indicate detached vma and vma_mark_attached() will need an additional
> mm parameter.
>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 21660a9ad97a..71990f46aa4e 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -465,6 +465,10 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
> data_race(memcpy(new, orig, sizeof(*new)));
> vma_lock_init(new);
> INIT_LIST_HEAD(&new->anon_vma_chain);
> +#ifdef CONFIG_PER_VMA_LOCK
> + /* vma is not locked, can't use vma_mark_detached() */
> + new->detached = true;
> +#endif
It occurred to me that we could probably move the detached setting to vma_lock_init()
to avoid the #ifdef (also in the ctor in next patch)?
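I.e. roughly the following, assuming vma_lock_init() today only initializes the rwsem and vm_lock_seq (a sketch):

#ifdef CONFIG_PER_VMA_LOCK
static inline void vma_lock_init(struct vm_area_struct *vma)
{
        init_rwsem(&vma->vm_lock.lock);
        vma->vm_lock_seq = UINT_MAX;
        /* vma is not locked yet, so can't use vma_mark_detached() */
        vma->detached = true;
}
#else
static inline void vma_lock_init(struct vm_area_struct *vma) {}
#endif

Then vm_area_dup() (and the ctor in the next patch) would not need the #ifdef at all.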
> vma_numab_state_init(new);
> dup_anon_vma_name(orig, new);
>
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v5 4/6] mm: make vma cache SLAB_TYPESAFE_BY_RCU
2024-12-06 22:52 ` [PATCH v5 4/6] mm: make vma cache SLAB_TYPESAFE_BY_RCU Suren Baghdasaryan
2024-12-09 17:35 ` Klara Modin
@ 2024-12-10 12:06 ` Vlastimil Babka
2024-12-10 16:23 ` Suren Baghdasaryan
2024-12-10 14:21 ` Vlastimil Babka
2 siblings, 1 reply; 32+ messages in thread
From: Vlastimil Babka @ 2024-12-10 12:06 UTC (permalink / raw)
To: Suren Baghdasaryan, akpm
Cc: willy, liam.howlett, lorenzo.stoakes, mhocko, hannes, mjguzik,
oliver.sang, mgorman, david, peterx, oleg, dave, paulmck,
brauner, dhowells, hdanton, hughd, minchan, jannh, shakeel.butt,
souravpanda, pasha.tatashin, corbet, linux-doc, linux-mm,
linux-kernel, kernel-team
On 12/6/24 23:52, Suren Baghdasaryan wrote:
> To enable SLAB_TYPESAFE_BY_RCU for vma cache we need to ensure that
> object reuse before RCU grace period is over will be detected inside
> lock_vma_under_rcu().
> lock_vma_under_rcu() enters RCU read section, finds the vma at the
> given address, locks the vma and checks if it got detached or remapped
> to cover a different address range. These last checks are there
> to ensure that the vma was not modified after we found it but before
> locking it.
> vma reuse introduces several new possibilities:
> 1. vma can be reused after it was found but before it is locked;
> 2. vma can be reused and reinitialized (including changing its vm_mm)
> while being locked in vma_start_read();
> 3. vma can be reused and reinitialized after it was found but before
> it is locked, then attached at a new address or to a new mm while
> read-locked;
> For case #1 current checks will help detecting cases when:
> - vma was reused but not yet added into the tree (detached check)
> - vma was reused at a different address range (address check);
> We are missing the check for vm_mm to ensure the reused vma was not
> attached to a different mm. This patch adds the missing check.
> For case #2, we pass mm to vma_start_read() to prevent access to
> unstable vma->vm_mm. This might lead to vma_start_read() returning
> a false locked result but that's not critical if it's rare because
> it will only lead to a retry under mmap_lock.
> For case #3, we ensure the order in which vma->detached flag and
> vm_start/vm_end/vm_mm are set and checked. vma gets attached after
> vm_start/vm_end/vm_mm were set and lock_vma_under_rcu() should check
> vma->detached before checking vm_start/vm_end/vm_mm. This is required
> because attaching vma happens without vma write-lock, as opposed to
> vma detaching, which requires vma write-lock. This patch adds memory
> barriers inside is_vma_detached() and vma_mark_attached() needed to
> order reads and writes to vma->detached vs vm_start/vm_end/vm_mm.
> After these provisions, SLAB_TYPESAFE_BY_RCU is added to vm_area_cachep.
> This will facilitate vm_area_struct reuse and will minimize the number
> of call_rcu() calls.
>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
> include/linux/mm.h | 36 +++++--
> include/linux/mm_types.h | 10 +-
> include/linux/slab.h | 6 --
> kernel/fork.c | 157 +++++++++++++++++++++++++------
> mm/memory.c | 15 ++-
> mm/vma.c | 2 +-
> tools/testing/vma/vma_internal.h | 7 +-
> 7 files changed, 179 insertions(+), 54 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 2bf38c1e9cca..3568bcbc7c81 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -257,7 +257,7 @@ struct vm_area_struct *vm_area_alloc(struct mm_struct *);
> struct vm_area_struct *vm_area_dup(struct vm_area_struct *);
> void vm_area_free(struct vm_area_struct *);
> /* Use only if VMA has no other users */
> -void __vm_area_free(struct vm_area_struct *vma);
> +void vm_area_free_unreachable(struct vm_area_struct *vma);
>
> #ifndef CONFIG_MMU
> extern struct rb_root nommu_region_tree;
> @@ -706,8 +706,10 @@ static inline void vma_lock_init(struct vm_area_struct *vma)
> * Try to read-lock a vma. The function is allowed to occasionally yield false
> * locked result to avoid performance overhead, in which case we fall back to
> * using mmap_lock. The function should never yield false unlocked result.
> + * False locked result is possible if mm_lock_seq overflows or if vma gets
> + * reused and attached to a different mm before we lock it.
> */
> -static inline bool vma_start_read(struct vm_area_struct *vma)
> +static inline bool vma_start_read(struct mm_struct *mm, struct vm_area_struct *vma)
> {
> /*
> * Check before locking. A race might cause false locked result.
> @@ -716,7 +718,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> * we don't rely on for anything - the mm_lock_seq read against which we
> * need ordering is below.
> */
> - if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq.sequence))
> + if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(mm->mm_lock_seq.sequence))
> return false;
>
> if (unlikely(down_read_trylock(&vma->vm_lock.lock) == 0))
> @@ -733,7 +735,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> * after it has been unlocked.
> * This pairs with RELEASE semantics in vma_end_write_all().
> */
> - if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&vma->vm_mm->mm_lock_seq))) {
> + if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&mm->mm_lock_seq))) {
> up_read(&vma->vm_lock.lock);
> return false;
> }
This could have been perhaps another preparatory patch to make this one smaller?
>
> +static void vm_area_ctor(void *data)
> +{
> + struct vm_area_struct *vma = (struct vm_area_struct *)data;
> +
> +#ifdef CONFIG_PER_VMA_LOCK
> + /* vma is not locked, can't use vma_mark_detached() */
> + vma->detached = true;
> +#endif
> + INIT_LIST_HEAD(&vma->anon_vma_chain);
> + vma_lock_init(vma);
> +}
> +
> +#ifdef CONFIG_PER_VMA_LOCK
> +
> +static void vma_clear(struct vm_area_struct *vma, struct mm_struct *mm)
> +{
> + vma->vm_mm = mm;
> + vma->vm_ops = &vma_dummy_vm_ops;
> + vma->vm_start = 0;
> + vma->vm_end = 0;
> + vma->anon_vma = NULL;
> + vma->vm_pgoff = 0;
> + vma->vm_file = NULL;
> + vma->vm_private_data = NULL;
> + vm_flags_init(vma, 0);
> + memset(&vma->vm_page_prot, 0, sizeof(vma->vm_page_prot));
> + memset(&vma->shared, 0, sizeof(vma->shared));
> + memset(&vma->vm_userfaultfd_ctx, 0, sizeof(vma->vm_userfaultfd_ctx));
> + vma_numab_state_init(vma);
> +#ifdef CONFIG_ANON_VMA_NAME
> + vma->anon_name = NULL;
> +#endif
> +#ifdef CONFIG_SWAP
> + memset(&vma->swap_readahead_info, 0, sizeof(vma->swap_readahead_info));
> +#endif
> +#ifndef CONFIG_MMU
> + vma->vm_region = NULL;
> +#endif
> +#ifdef CONFIG_NUMA
> + vma->vm_policy = NULL;
> +#endif
This isn't the ideal pattern I think, now that we have a ctor. Ideally the
ctor would do all this (except setting the vm_mm), and then we need to make
sure it's also done when freeing the vma, to make sure the freed object is
in the same state as a new object after the constructor.
On freeing, things like numab_state and anon_name could be NULL'd (by the
respective destructors) only when they are non-NULL and thus freeing the
objects pointed to. vm_policy and vm_file could perhaps be handled same way
after some refactoring (see remove_vma()), vma_dummy_vm_ops are possibly
already reset by vma_close(), etc.
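Roughly this shape: the ctor establishes the invariant state once, and the free side restores it before the object goes back to the cache (a sketch only; the restore helper name is invented):

static void vm_area_ctor(void *data)
{
        struct vm_area_struct *vma = data;

        /* state every free (and therefore reusable) object must have */
        vma->detached = true;
        vma->vm_ops = &vma_dummy_vm_ops;
        vma->anon_vma = NULL;
        vma->vm_file = NULL;
        vma->vm_private_data = NULL;
        INIT_LIST_HEAD(&vma->anon_vma_chain);
        vma_lock_init(vma);
}

/* called right before kmem_cache_free() */
static void vm_area_restore_ctor_state(struct vm_area_struct *vma)
{
        /* these would need to also NULL the pointers they free */
        vma_numab_state_free(vma);
        free_anon_vma_name(vma);
        vma->vm_ops = &vma_dummy_vm_ops;
        vma->anon_vma = NULL;
        vma->vm_file = NULL;            /* reference already dropped by the caller */
        vma->vm_private_data = NULL;
}

so that, for these fields at least, a freed object is already back in ctor state when it gets reused.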
> +}
> +
> +static void vma_copy(const struct vm_area_struct *src, struct vm_area_struct *dest)
> +{
> + dest->vm_mm = src->vm_mm;
> + dest->vm_ops = src->vm_ops;
> + dest->vm_start = src->vm_start;
> + dest->vm_end = src->vm_end;
> + dest->anon_vma = src->anon_vma;
> + dest->vm_pgoff = src->vm_pgoff;
> + dest->vm_file = src->vm_file;
> + dest->vm_private_data = src->vm_private_data;
> + vm_flags_init(dest, src->vm_flags);
> + memcpy(&dest->vm_page_prot, &src->vm_page_prot,
> + sizeof(dest->vm_page_prot));
> + memcpy(&dest->shared, &src->shared, sizeof(dest->shared));
> + memcpy(&dest->vm_userfaultfd_ctx, &src->vm_userfaultfd_ctx,
> + sizeof(dest->vm_userfaultfd_ctx));
> +#ifdef CONFIG_ANON_VMA_NAME
> + dest->anon_name = src->anon_name;
> +#endif
> +#ifdef CONFIG_SWAP
> + memcpy(&dest->swap_readahead_info, &src->swap_readahead_info,
> + sizeof(dest->swap_readahead_info));
> +#endif
> +#ifndef CONFIG_MMU
> + dest->vm_region = src->vm_region;
> +#endif
> +#ifdef CONFIG_NUMA
> + dest->vm_policy = src->vm_policy;
> +#endif
> +}
> +
> +#else /* CONFIG_PER_VMA_LOCK */
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v5 4/6] mm: make vma cache SLAB_TYPESAFE_BY_RCU
2024-12-06 22:52 ` [PATCH v5 4/6] mm: make vma cache SLAB_TYPESAFE_BY_RCU Suren Baghdasaryan
2024-12-09 17:35 ` Klara Modin
2024-12-10 12:06 ` Vlastimil Babka
@ 2024-12-10 14:21 ` Vlastimil Babka
2024-12-10 16:20 ` Suren Baghdasaryan
2 siblings, 1 reply; 32+ messages in thread
From: Vlastimil Babka @ 2024-12-10 14:21 UTC (permalink / raw)
To: Suren Baghdasaryan, akpm
Cc: willy, liam.howlett, lorenzo.stoakes, mhocko, hannes, mjguzik,
oliver.sang, mgorman, david, peterx, oleg, dave, paulmck,
brauner, dhowells, hdanton, hughd, minchan, jannh, shakeel.butt,
souravpanda, pasha.tatashin, corbet, linux-doc, linux-mm,
linux-kernel, kernel-team
On 12/6/24 23:52, Suren Baghdasaryan wrote:
> To enable SLAB_TYPESAFE_BY_RCU for vma cache we need to ensure that
> object reuse before RCU grace period is over will be detected inside
> lock_vma_under_rcu().
> lock_vma_under_rcu() enters RCU read section, finds the vma at the
> given address, locks the vma and checks if it got detached or remapped
> to cover a different address range. These last checks are there
> to ensure that the vma was not modified after we found it but before
> locking it.
> vma reuse introduces several new possibilities:
> 1. vma can be reused after it was found but before it is locked;
> 2. vma can be reused and reinitialized (including changing its vm_mm)
> while being locked in vma_start_read();
> 3. vma can be reused and reinitialized after it was found but before
> it is locked, then attached at a new address or to a new mm while
> read-locked;
> For case #1 current checks will help detecting cases when:
> - vma was reused but not yet added into the tree (detached check)
> - vma was reused at a different address range (address check);
> We are missing the check for vm_mm to ensure the reused vma was not
> attached to a different mm. This patch adds the missing check.
> For case #2, we pass mm to vma_start_read() to prevent access to
> unstable vma->vm_mm. This might lead to vma_start_read() returning
> a false locked result but that's not critical if it's rare because
> it will only lead to a retry under mmap_lock.
> For case #3, we ensure the order in which vma->detached flag and
> vm_start/vm_end/vm_mm are set and checked. vma gets attached after
> vm_start/vm_end/vm_mm were set and lock_vma_under_rcu() should check
> vma->detached before checking vm_start/vm_end/vm_mm. This is required
> because attaching vma happens without vma write-lock, as opposed to
> vma detaching, which requires vma write-lock. This patch adds memory
> barriers inside is_vma_detached() and vma_mark_attached() needed to
> order reads and writes to vma->detached vs vm_start/vm_end/vm_mm.
> After these provisions, SLAB_TYPESAFE_BY_RCU is added to vm_area_cachep.
> This will facilitate vm_area_struct reuse and will minimize the number
> of call_rcu() calls.
>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
I'm wondering about the vma freeing path. Consider vma_complete():
vma_mark_detached(vp->remove);
vma->detached = true; - plain write
vm_area_free(vp->remove);
vma->vm_lock_seq = UINT_MAX; - plain write
kmem_cache_free(vm_area_cachep)
...
potential reallocation
against:
lock_vma_under_rcu()
- mas_walk finds a stale vma due to race
vma_start_read()
if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(mm->mm_lock_seq.sequence))
- can be false, the vma was not being locked on the freeing side?
down_read_trylock(&vma->vm_lock.lock) - succeeds, wasn't locked
this is acquire, but was there any release?
is_vma_detached() - false negative as the write above didn't propagate
here yet; a read barrier but where is the write barrier?
checks for vma->vm_mm, vm_start, vm_end - nobody reset them yet so false
positive, or they got reset on reallocation but writes didn't propagate
Am I missing something that would prevent lock_vma_under_rcu() falsely
succeeding here?
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v5 4/6] mm: make vma cache SLAB_TYPESAFE_BY_RCU
2024-12-10 14:21 ` Vlastimil Babka
@ 2024-12-10 16:20 ` Suren Baghdasaryan
2024-12-10 16:32 ` Vlastimil Babka
0 siblings, 1 reply; 32+ messages in thread
From: Suren Baghdasaryan @ 2024-12-10 16:20 UTC (permalink / raw)
To: Vlastimil Babka
Cc: akpm, willy, liam.howlett, lorenzo.stoakes, mhocko, hannes,
mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, minchan, jannh,
shakeel.butt, souravpanda, pasha.tatashin, corbet, linux-doc,
linux-mm, linux-kernel, kernel-team
On Tue, Dec 10, 2024 at 6:21 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 12/6/24 23:52, Suren Baghdasaryan wrote:
> > To enable SLAB_TYPESAFE_BY_RCU for vma cache we need to ensure that
> > object reuse before RCU grace period is over will be detected inside
> > lock_vma_under_rcu().
> > lock_vma_under_rcu() enters RCU read section, finds the vma at the
> > given address, locks the vma and checks if it got detached or remapped
> > to cover a different address range. These last checks are there
> > to ensure that the vma was not modified after we found it but before
> > locking it.
> > vma reuse introduces several new possibilities:
> > 1. vma can be reused after it was found but before it is locked;
> > 2. vma can be reused and reinitialized (including changing its vm_mm)
> > while being locked in vma_start_read();
> > 3. vma can be reused and reinitialized after it was found but before
> > it is locked, then attached at a new address or to a new mm while
> > read-locked;
> > For case #1 current checks will help detecting cases when:
> > - vma was reused but not yet added into the tree (detached check)
> > - vma was reused at a different address range (address check);
> > We are missing the check for vm_mm to ensure the reused vma was not
> > attached to a different mm. This patch adds the missing check.
> > For case #2, we pass mm to vma_start_read() to prevent access to
> > unstable vma->vm_mm. This might lead to vma_start_read() returning
> > a false locked result but that's not critical if it's rare because
> > it will only lead to a retry under mmap_lock.
> > For case #3, we ensure the order in which vma->detached flag and
> > vm_start/vm_end/vm_mm are set and checked. vma gets attached after
> > vm_start/vm_end/vm_mm were set and lock_vma_under_rcu() should check
> > vma->detached before checking vm_start/vm_end/vm_mm. This is required
> > because attaching vma happens without vma write-lock, as opposed to
> > vma detaching, which requires vma write-lock. This patch adds memory
> > barriers inside is_vma_detached() and vma_mark_attached() needed to
> > order reads and writes to vma->detached vs vm_start/vm_end/vm_mm.
> > After these provisions, SLAB_TYPESAFE_BY_RCU is added to vm_area_cachep.
> > This will facilitate vm_area_struct reuse and will minimize the number
> > of call_rcu() calls.
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
>
> I'm wondering about the vma freeing path. Consider vma_complete():
>
> vma_mark_detached(vp->remove);
> vma->detached = true; - plain write
> vm_area_free(vp->remove);
> vma->vm_lock_seq = UINT_MAX; - plain write
> kmem_cache_free(vm_area_cachep)
> ...
> potential reallocation
>
> against:
>
> lock_vma_under_rcu()
> - mas_walk finds a stale vma due to race
> vma_start_read()
> if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(mm->mm_lock_seq.sequence))
> - can be false, the vma was not being locked on the freeing side?
> down_read_trylock(&vma->vm_lock.lock) - succeeds, wasn't locked
> this is acquire, but was there any release?
Yes, there was a release. I think what you missed is that
vma_mark_detached() that is called from vma_complete() requires VMA to
be write-locked (see vma_assert_write_locked() in
vma_mark_detached()). The rule is that a VMA can be attached without
write-locking but only a write-locked VMA can be detached. So, after
vma_mark_detached() and before down_read_trylock(&vma->vm_lock.lock)
in vma_start_read() the VMA write-lock should have been released by
mmap_write_unlock() and therefore vma->detached=true should be
visible to the reader when it executes lock_vma_under_rcu().
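For reference, the detach side is roughly (the assert being the important part):

static inline void vma_mark_detached(struct vm_area_struct *vma)
{
        /* detaching is only allowed under the vma write-lock */
        vma_assert_write_locked(vma);
        vma->detached = true;
}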
> is_vma_detached() - false negative as the write above didn't propagate
> here yet; a read barrier but where is the write barrier?
> checks for vma->vm_mm, vm_start, vm_end - nobody reset them yet so false
> positive, or they got reset on reallocation but writes didn't propagate
>
> Am I missing something that would prevent lock_vma_under_rcu() falsely
> succeeding here?
>
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v5 4/6] mm: make vma cache SLAB_TYPESAFE_BY_RCU
2024-12-10 12:06 ` Vlastimil Babka
@ 2024-12-10 16:23 ` Suren Baghdasaryan
0 siblings, 0 replies; 32+ messages in thread
From: Suren Baghdasaryan @ 2024-12-10 16:23 UTC (permalink / raw)
To: Vlastimil Babka
Cc: akpm, willy, liam.howlett, lorenzo.stoakes, mhocko, hannes,
mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, minchan, jannh,
shakeel.butt, souravpanda, pasha.tatashin, corbet, linux-doc,
linux-mm, linux-kernel, kernel-team
On Tue, Dec 10, 2024 at 4:06 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 12/6/24 23:52, Suren Baghdasaryan wrote:
> > To enable SLAB_TYPESAFE_BY_RCU for vma cache we need to ensure that
> > object reuse before RCU grace period is over will be detected inside
> > lock_vma_under_rcu().
> > lock_vma_under_rcu() enters RCU read section, finds the vma at the
> > given address, locks the vma and checks if it got detached or remapped
> > to cover a different address range. These last checks are there
> > to ensure that the vma was not modified after we found it but before
> > locking it.
> > vma reuse introduces several new possibilities:
> > 1. vma can be reused after it was found but before it is locked;
> > 2. vma can be reused and reinitialized (including changing its vm_mm)
> > while being locked in vma_start_read();
> > 3. vma can be reused and reinitialized after it was found but before
> > it is locked, then attached at a new address or to a new mm while
> > read-locked;
> > For case #1 current checks will help detecting cases when:
> > - vma was reused but not yet added into the tree (detached check)
> > - vma was reused at a different address range (address check);
> > We are missing the check for vm_mm to ensure the reused vma was not
> > attached to a different mm. This patch adds the missing check.
> > For case #2, we pass mm to vma_start_read() to prevent access to
> > unstable vma->vm_mm. This might lead to vma_start_read() returning
> > a false locked result but that's not critical if it's rare because
> > it will only lead to a retry under mmap_lock.
> > For case #3, we ensure the order in which vma->detached flag and
> > vm_start/vm_end/vm_mm are set and checked. vma gets attached after
> > vm_start/vm_end/vm_mm were set and lock_vma_under_rcu() should check
> > vma->detached before checking vm_start/vm_end/vm_mm. This is required
> > because attaching vma happens without vma write-lock, as opposed to
> > vma detaching, which requires vma write-lock. This patch adds memory
> > barriers inside is_vma_detached() and vma_mark_attached() needed to
> > order reads and writes to vma->detached vs vm_start/vm_end/vm_mm.
> > After these provisions, SLAB_TYPESAFE_BY_RCU is added to vm_area_cachep.
> > This will facilitate vm_area_struct reuse and will minimize the number
> > of call_rcu() calls.
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > ---
> > include/linux/mm.h | 36 +++++--
> > include/linux/mm_types.h | 10 +-
> > include/linux/slab.h | 6 --
> > kernel/fork.c | 157 +++++++++++++++++++++++++------
> > mm/memory.c | 15 ++-
> > mm/vma.c | 2 +-
> > tools/testing/vma/vma_internal.h | 7 +-
> > 7 files changed, 179 insertions(+), 54 deletions(-)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 2bf38c1e9cca..3568bcbc7c81 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -257,7 +257,7 @@ struct vm_area_struct *vm_area_alloc(struct mm_struct *);
> > struct vm_area_struct *vm_area_dup(struct vm_area_struct *);
> > void vm_area_free(struct vm_area_struct *);
> > /* Use only if VMA has no other users */
> > -void __vm_area_free(struct vm_area_struct *vma);
> > +void vm_area_free_unreachable(struct vm_area_struct *vma);
> >
> > #ifndef CONFIG_MMU
> > extern struct rb_root nommu_region_tree;
> > @@ -706,8 +706,10 @@ static inline void vma_lock_init(struct vm_area_struct *vma)
> > * Try to read-lock a vma. The function is allowed to occasionally yield false
> > * locked result to avoid performance overhead, in which case we fall back to
> > * using mmap_lock. The function should never yield false unlocked result.
> > + * False locked result is possible if mm_lock_seq overflows or if vma gets
> > + * reused and attached to a different mm before we lock it.
> > */
> > -static inline bool vma_start_read(struct vm_area_struct *vma)
> > +static inline bool vma_start_read(struct mm_struct *mm, struct vm_area_struct *vma)
> > {
> > /*
> > * Check before locking. A race might cause false locked result.
> > @@ -716,7 +718,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> > * we don't rely on for anything - the mm_lock_seq read against which we
> > * need ordering is below.
> > */
> > - if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq.sequence))
> > + if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(mm->mm_lock_seq.sequence))
> > return false;
> >
> > if (unlikely(down_read_trylock(&vma->vm_lock.lock) == 0))
> > @@ -733,7 +735,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> > * after it has been unlocked.
> > * This pairs with RELEASE semantics in vma_end_write_all().
> > */
> > - if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&vma->vm_mm->mm_lock_seq))) {
> > + if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&mm->mm_lock_seq))) {
> > up_read(&vma->vm_lock.lock);
> > return false;
> > }
>
> This could have been perhaps another preparatory patch to make this one smaller?
>
> >
> > +static void vm_area_ctor(void *data)
> > +{
> > + struct vm_area_struct *vma = (struct vm_area_struct *)data;
> > +
> > +#ifdef CONFIG_PER_VMA_LOCK
> > + /* vma is not locked, can't use vma_mark_detached() */
> > + vma->detached = true;
> > +#endif
> > + INIT_LIST_HEAD(&vma->anon_vma_chain);
> > + vma_lock_init(vma);
> > +}
> > +
> > +#ifdef CONFIG_PER_VMA_LOCK
> > +
> > +static void vma_clear(struct vm_area_struct *vma, struct mm_struct *mm)
> > +{
> > + vma->vm_mm = mm;
> > + vma->vm_ops = &vma_dummy_vm_ops;
> > + vma->vm_start = 0;
> > + vma->vm_end = 0;
> > + vma->anon_vma = NULL;
> > + vma->vm_pgoff = 0;
> > + vma->vm_file = NULL;
> > + vma->vm_private_data = NULL;
> > + vm_flags_init(vma, 0);
> > + memset(&vma->vm_page_prot, 0, sizeof(vma->vm_page_prot));
> > + memset(&vma->shared, 0, sizeof(vma->shared));
> > + memset(&vma->vm_userfaultfd_ctx, 0, sizeof(vma->vm_userfaultfd_ctx));
> > + vma_numab_state_init(vma);
> > +#ifdef CONFIG_ANON_VMA_NAME
> > + vma->anon_name = NULL;
> > +#endif
> > +#ifdef CONFIG_SWAP
> > + memset(&vma->swap_readahead_info, 0, sizeof(vma->swap_readahead_info));
> > +#endif
> > +#ifndef CONFIG_MMU
> > + vma->vm_region = NULL;
> > +#endif
> > +#ifdef CONFIG_NUMA
> > + vma->vm_policy = NULL;
> > +#endif
>
> This isn't the ideal pattern I think, now that we have a ctor. Ideally the
> ctor would do all this (except setting the vm_mm), and then we need to make
> sure it's also done when freeing the vma, to make sure the freed object is
> in the same state as a new object after the constructor.
>
> On freeing, things like numab_state and anon_name could be NULL'd (by the
> respective destructors) only when they are non-NULL and thus freeing the
> objects pointed to. vm_policy and vm_file could perhaps be handled same way
> after some refactoring (see remove_vma()), vma_dummy_vm_ops are possibly
> already reset by vma_close(), etc.
Ok, let me look some more into it and see if I can improve and
simplify the initialization/freeing logic. Thanks!
>
> > +}
> > +
> > +static void vma_copy(const struct vm_area_struct *src, struct vm_area_struct *dest)
> > +{
> > + dest->vm_mm = src->vm_mm;
> > + dest->vm_ops = src->vm_ops;
> > + dest->vm_start = src->vm_start;
> > + dest->vm_end = src->vm_end;
> > + dest->anon_vma = src->anon_vma;
> > + dest->vm_pgoff = src->vm_pgoff;
> > + dest->vm_file = src->vm_file;
> > + dest->vm_private_data = src->vm_private_data;
> > + vm_flags_init(dest, src->vm_flags);
> > + memcpy(&dest->vm_page_prot, &src->vm_page_prot,
> > + sizeof(dest->vm_page_prot));
> > + memcpy(&dest->shared, &src->shared, sizeof(dest->shared));
> > + memcpy(&dest->vm_userfaultfd_ctx, &src->vm_userfaultfd_ctx,
> > + sizeof(dest->vm_userfaultfd_ctx));
> > +#ifdef CONFIG_ANON_VMA_NAME
> > + dest->anon_name = src->anon_name;
> > +#endif
> > +#ifdef CONFIG_SWAP
> > + memcpy(&dest->swap_readahead_info, &src->swap_readahead_info,
> > + sizeof(dest->swap_readahead_info));
> > +#endif
> > +#ifndef CONFIG_MMU
> > + dest->vm_region = src->vm_region;
> > +#endif
> > +#ifdef CONFIG_NUMA
> > + dest->vm_policy = src->vm_policy;
> > +#endif
> > +}
> > +
> > +#else /* CONFIG_PER_VMA_LOCK */
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v5 3/6] mm: mark vma as detached until it's added into vma tree
2024-12-10 11:36 ` Vlastimil Babka
@ 2024-12-10 16:28 ` Suren Baghdasaryan
0 siblings, 0 replies; 32+ messages in thread
From: Suren Baghdasaryan @ 2024-12-10 16:28 UTC (permalink / raw)
To: Vlastimil Babka
Cc: akpm, willy, liam.howlett, lorenzo.stoakes, mhocko, hannes,
mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, minchan, jannh,
shakeel.butt, souravpanda, pasha.tatashin, corbet, linux-doc,
linux-mm, linux-kernel, kernel-team
On Tue, Dec 10, 2024 at 3:36 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 12/6/24 23:52, Suren Baghdasaryan wrote:
> > Current implementation does not set detached flag when a VMA is first
> > allocated. This does not represent the real state of the VMA, which is
> > detached until it is added into mm's VMA tree. Fix this by marking new
> > VMAs as detached and resetting detached flag only after VMA is added
> > into a tree.
> > Introduce vma_mark_attached() to make the API more readable and to
> > simplify possible future cleanup when vma->vm_mm might be used to
> > indicate detached vma and vma_mark_attached() will need an additional
> > mm parameter.
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
> > Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index 21660a9ad97a..71990f46aa4e 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -465,6 +465,10 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
> > data_race(memcpy(new, orig, sizeof(*new)));
> > vma_lock_init(new);
> > INIT_LIST_HEAD(&new->anon_vma_chain);
> > +#ifdef CONFIG_PER_VMA_LOCK
> > + /* vma is not locked, can't use vma_mark_detached() */
> > + new->detached = true;
> > +#endif
>
> It occurred to me that we could probably move the detached setting to vma_lock_init()
> to avoid the #ifdef (also in the ctor in next patch)?
Yes, but setting vma->detached inside vma_lock_init() does not seem
logical. I went back and forth on this and in the end decided to keep
it this way. If the #ifdef-ery is too ugly I can introduce
vma_init_detached() or something like that.
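Something like this, purely for illustration (name hypothetical):

#ifdef CONFIG_PER_VMA_LOCK
static inline void vma_init_detached(struct vm_area_struct *vma)
{
        /* vma is not locked, can't use vma_mark_detached() */
        vma->detached = true;
}
#else
static inline void vma_init_detached(struct vm_area_struct *vma) {}
#endif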
>
> > vma_numab_state_init(new);
> > dup_anon_vma_name(orig, new);
> >
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v5 4/6] mm: make vma cache SLAB_TYPESAFE_BY_RCU
2024-12-10 16:20 ` Suren Baghdasaryan
@ 2024-12-10 16:32 ` Vlastimil Babka
2024-12-10 17:16 ` Suren Baghdasaryan
0 siblings, 1 reply; 32+ messages in thread
From: Vlastimil Babka @ 2024-12-10 16:32 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: akpm, willy, liam.howlett, lorenzo.stoakes, mhocko, hannes,
mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, minchan, jannh,
shakeel.butt, souravpanda, pasha.tatashin, corbet, linux-doc,
linux-mm, linux-kernel, kernel-team
On 12/10/24 17:20, Suren Baghdasaryan wrote:
> On Tue, Dec 10, 2024 at 6:21 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>>
>> On 12/6/24 23:52, Suren Baghdasaryan wrote:
>> > To enable SLAB_TYPESAFE_BY_RCU for vma cache we need to ensure that
>> > object reuse before RCU grace period is over will be detected inside
>> > lock_vma_under_rcu().
>> > lock_vma_under_rcu() enters RCU read section, finds the vma at the
>> > given address, locks the vma and checks if it got detached or remapped
>> > to cover a different address range. These last checks are there
>> > to ensure that the vma was not modified after we found it but before
>> > locking it.
>> > vma reuse introduces several new possibilities:
>> > 1. vma can be reused after it was found but before it is locked;
>> > 2. vma can be reused and reinitialized (including changing its vm_mm)
>> > while being locked in vma_start_read();
>> > 3. vma can be reused and reinitialized after it was found but before
>> > it is locked, then attached at a new address or to a new mm while
>> > read-locked;
>> > For case #1 current checks will help detecting cases when:
>> > - vma was reused but not yet added into the tree (detached check)
>> > - vma was reused at a different address range (address check);
>> > We are missing the check for vm_mm to ensure the reused vma was not
>> > attached to a different mm. This patch adds the missing check.
>> > For case #2, we pass mm to vma_start_read() to prevent access to
>> > unstable vma->vm_mm. This might lead to vma_start_read() returning
>> > a false locked result but that's not critical if it's rare because
>> > it will only lead to a retry under mmap_lock.
>> > For case #3, we ensure the order in which vma->detached flag and
>> > vm_start/vm_end/vm_mm are set and checked. vma gets attached after
>> > vm_start/vm_end/vm_mm were set and lock_vma_under_rcu() should check
>> > vma->detached before checking vm_start/vm_end/vm_mm. This is required
>> > because attaching vma happens without vma write-lock, as opposed to
>> > vma detaching, which requires vma write-lock. This patch adds memory
>> > barriers inside is_vma_detached() and vma_mark_attached() needed to
>> > order reads and writes to vma->detached vs vm_start/vm_end/vm_mm.
>> > After these provisions, SLAB_TYPESAFE_BY_RCU is added to vm_area_cachep.
>> > This will facilitate vm_area_struct reuse and will minimize the number
>> > of call_rcu() calls.
>> >
>> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
>>
>> I'm wondering about the vma freeing path. Consider vma_complete():
>>
>> vma_mark_detached(vp->remove);
>> vma->detached = true; - plain write
>> vm_area_free(vp->remove);
>> vma->vm_lock_seq = UINT_MAX; - plain write
>> kmem_cache_free(vm_area_cachep)
>> ...
>> potential reallocation
>>
>> against:
>>
>> lock_vma_under_rcu()
>> - mas_walk finds a stale vma due to race
>> vma_start_read()
>> if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(mm->mm_lock_seq.sequence))
>> - can be false, the vma was not being locked on the freeing side?
>> down_read_trylock(&vma->vm_lock.lock) - succeeds, wasn't locked
>> this is acquire, but was there any release?
>
> Yes, there was a release. I think what you missed is that
> vma_mark_detached() that is called from vma_complete() requires VMA to
> be write-locked (see vma_assert_write_locked() in
> vma_mark_detached()). The rule is that a VMA can be attached without
> write-locking but only a write-locked VMA can be detached. So, after
OK but write unlocking means the mm's seqcount is bumped and becomes
non-equal with vma's vma->vm_lock_seq, right?
Yet in the example above we happily set it to UINT_MAX and thus effectively
false unlock it for vma_start_read()?
And this is all done before the vma_complete() side would actually reach
mmap_write_unlock(), AFAICS.
> vma_mark_detached() and before down_read_trylock(&vma->vm_lock.lock)
> in vma_start_read() the VMA write-lock should have been released by
> mmap_write_unlock() and therefore vma->detached=true should be
> visible to the reader when it executes lock_vma_under_rcu().
>
>> is_vma_detached() - false negative as the write above didn't propagate
>> here yet; a read barrier but where is the write barrier?
>> checks for vma->vm_mm, vm_start, vm_end - nobody reset them yet so false
>> positive, or they got reset on reallocation but writes didn't propagate
>>
>> Am I missing something that would prevent lock_vma_under_rcu() falsely
>> succeeding here?
>>
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v5 4/6] mm: make vma cache SLAB_TYPESAFE_BY_RCU
2024-12-10 16:32 ` Vlastimil Babka
@ 2024-12-10 17:16 ` Suren Baghdasaryan
2024-12-10 17:25 ` Vlastimil Babka
0 siblings, 1 reply; 32+ messages in thread
From: Suren Baghdasaryan @ 2024-12-10 17:16 UTC (permalink / raw)
To: Vlastimil Babka
Cc: akpm, willy, liam.howlett, lorenzo.stoakes, mhocko, hannes,
mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, minchan, jannh,
shakeel.butt, souravpanda, pasha.tatashin, corbet, linux-doc,
linux-mm, linux-kernel, kernel-team
On Tue, Dec 10, 2024 at 8:32 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 12/10/24 17:20, Suren Baghdasaryan wrote:
> > On Tue, Dec 10, 2024 at 6:21 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> >>
> >> On 12/6/24 23:52, Suren Baghdasaryan wrote:
> >> > To enable SLAB_TYPESAFE_BY_RCU for vma cache we need to ensure that
> >> > object reuse before RCU grace period is over will be detected inside
> >> > lock_vma_under_rcu().
> >> > lock_vma_under_rcu() enters RCU read section, finds the vma at the
> >> > given address, locks the vma and checks if it got detached or remapped
> >> > to cover a different address range. These last checks are there
> >> > to ensure that the vma was not modified after we found it but before
> >> > locking it.
> >> > vma reuse introduces several new possibilities:
> >> > 1. vma can be reused after it was found but before it is locked;
> >> > 2. vma can be reused and reinitialized (including changing its vm_mm)
> >> > while being locked in vma_start_read();
> >> > 3. vma can be reused and reinitialized after it was found but before
> >> > it is locked, then attached at a new address or to a new mm while
> >> > read-locked;
> >> > For case #1 current checks will help detecting cases when:
> >> > - vma was reused but not yet added into the tree (detached check)
> >> > - vma was reused at a different address range (address check);
> >> > We are missing the check for vm_mm to ensure the reused vma was not
> >> > attached to a different mm. This patch adds the missing check.
> >> > For case #2, we pass mm to vma_start_read() to prevent access to
> >> > unstable vma->vm_mm. This might lead to vma_start_read() returning
> >> > a false locked result but that's not critical if it's rare because
> >> > it will only lead to a retry under mmap_lock.
> >> > For case #3, we ensure the order in which vma->detached flag and
> >> > vm_start/vm_end/vm_mm are set and checked. vma gets attached after
> >> > vm_start/vm_end/vm_mm were set and lock_vma_under_rcu() should check
> >> > vma->detached before checking vm_start/vm_end/vm_mm. This is required
> >> > because attaching vma happens without vma write-lock, as opposed to
> >> > vma detaching, which requires vma write-lock. This patch adds memory
> >> > barriers inside is_vma_detached() and vma_mark_attached() needed to
> >> > order reads and writes to vma->detached vs vm_start/vm_end/vm_mm.
> >> > After these provisions, SLAB_TYPESAFE_BY_RCU is added to vm_area_cachep.
> >> > This will facilitate vm_area_struct reuse and will minimize the number
> >> > of call_rcu() calls.
> >> >
> >> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> >>
> >> I'm wondering about the vma freeing path. Consider vma_complete():
> >>
> >> vma_mark_detached(vp->remove);
> >> vma->detached = true; - plain write
> >> vm_area_free(vp->remove);
> >> vma->vm_lock_seq = UINT_MAX; - plain write
> >> kmem_cache_free(vm_area_cachep)
> >> ...
> >> potential reallocation
> >>
> >> against:
> >>
> >> lock_vma_under_rcu()
> >> - mas_walk finds a stale vma due to race
> >> vma_start_read()
> >> if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(mm->mm_lock_seq.sequence))
> >> - can be false, the vma was not being locked on the freeing side?
> >> down_read_trylock(&vma->vm_lock.lock) - suceeds, wasn't locked
> >> this is acquire, but was there any release?
> >
> > Yes, there was a release. I think what you missed is that
> > vma_mark_detached() that is called from vma_complete() requires VMA to
> > be write-locked (see vma_assert_write_locked() in
> > vma_mark_detached()). The rule is that a VMA can be attached without
> > write-locking but only a write-locked VMA can be detached. So, after
>
> OK but write unlocking means the mm's seqcount is bumped and becomes
> non-equal with vma's vma->vm_lock_seq, right?
>
> Yet in the example above we happily set it to UINT_MAX and thus effectively
> false unlock it for vma_start_read()?
>
> And this is all done before the vma_complete() side would actually reach
> mmap_write_unlock(), AFAICS.
Ah, you are right. With the possibility of reuse, even a freed VMA
should be kept write-locked until it is unlocked by
mmap_write_unlock(). I think the fix for this is simply to not reset
vma->vm_lock_seq inside vm_area_free(). I'll also need to add a
comment for vm_lock_seq explaining these requirements.
Do you agree that such a change would resolve the issue?
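A minimal sketch of that proposal, based only on the freeing steps traced
earlier in this thread (not the full vm_area_free() implementation); the
follow-up replies below examine whether this alone is enough:

void vm_area_free(struct vm_area_struct *vma)
{
	/*
	 * Proposed change: do NOT reset vma->vm_lock_seq to UINT_MAX here.
	 * Keeping the old value means a stale reader still sees the vma as
	 * write-locked until mmap_write_unlock() bumps mm->mm_lock_seq.
	 */
	kmem_cache_free(vm_area_cachep, vma);
}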
>
> > vma_mark_detached() and before down_read_trylock(&vma->vm_lock.lock)
> > in vma_start_read() the VMA write-lock should have been released by
> > mmap_write_unlock() and therefore vma->detached=false should be
> > visible to the reader when it executed lock_vma_under_rcu().
> >
> >> is_vma_detached() - false negative as the write above didn't propagate
> >> here yet; a read barrier but where is the write barrier?
> >> checks for vma->vm_mm, vm_start, vm_end - nobody reset them yet so false
> >> positive, or they got reset on reallocation but writes didn't propagate
> >>
> >> Am I missing something that would prevent lock_vma_under_rcu() falsely
> >> succeeding here?
> >>
>
* Re: [PATCH v5 4/6] mm: make vma cache SLAB_TYPESAFE_BY_RCU
2024-12-10 17:16 ` Suren Baghdasaryan
@ 2024-12-10 17:25 ` Vlastimil Babka
2024-12-10 18:53 ` Suren Baghdasaryan
2024-12-10 23:01 ` Suren Baghdasaryan
0 siblings, 2 replies; 32+ messages in thread
From: Vlastimil Babka @ 2024-12-10 17:25 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: akpm, willy, liam.howlett, lorenzo.stoakes, mhocko, hannes,
mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, minchan, jannh,
shakeel.butt, souravpanda, pasha.tatashin, corbet, linux-doc,
linux-mm, linux-kernel, kernel-team
On 12/10/24 18:16, Suren Baghdasaryan wrote:
> On Tue, Dec 10, 2024 at 8:32 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>>
>> On 12/10/24 17:20, Suren Baghdasaryan wrote:
>> > On Tue, Dec 10, 2024 at 6:21 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>> >>
>> >> On 12/6/24 23:52, Suren Baghdasaryan wrote:
>> >> > To enable SLAB_TYPESAFE_BY_RCU for vma cache we need to ensure that
>> >> > object reuse before RCU grace period is over will be detected inside
>> >> > lock_vma_under_rcu().
>> >> > lock_vma_under_rcu() enters RCU read section, finds the vma at the
>> >> > given address, locks the vma and checks if it got detached or remapped
>> >> > to cover a different address range. These last checks are there
>> >> > to ensure that the vma was not modified after we found it but before
>> >> > locking it.
>> >> > vma reuse introduces several new possibilities:
>> >> > 1. vma can be reused after it was found but before it is locked;
>> >> > 2. vma can be reused and reinitialized (including changing its vm_mm)
>> >> > while being locked in vma_start_read();
>> >> > 3. vma can be reused and reinitialized after it was found but before
>> >> > it is locked, then attached at a new address or to a new mm while
>> >> > read-locked;
>> >> > For case #1 current checks will help detecting cases when:
>> >> > - vma was reused but not yet added into the tree (detached check)
>> >> > - vma was reused at a different address range (address check);
>> >> > We are missing the check for vm_mm to ensure the reused vma was not
>> >> > attached to a different mm. This patch adds the missing check.
>> >> > For case #2, we pass mm to vma_start_read() to prevent access to
>> >> > unstable vma->vm_mm. This might lead to vma_start_read() returning
>> >> > a false locked result but that's not critical if it's rare because
>> >> > it will only lead to a retry under mmap_lock.
>> >> > For case #3, we ensure the order in which vma->detached flag and
>> >> > vm_start/vm_end/vm_mm are set and checked. vma gets attached after
>> >> > vm_start/vm_end/vm_mm were set and lock_vma_under_rcu() should check
>> >> > vma->detached before checking vm_start/vm_end/vm_mm. This is required
>> >> > because attaching vma happens without vma write-lock, as opposed to
>> >> > vma detaching, which requires vma write-lock. This patch adds memory
>> >> > barriers inside is_vma_detached() and vma_mark_attached() needed to
>> >> > order reads and writes to vma->detached vs vm_start/vm_end/vm_mm.
>> >> > After these provisions, SLAB_TYPESAFE_BY_RCU is added to vm_area_cachep.
>> >> > This will facilitate vm_area_struct reuse and will minimize the number
>> >> > of call_rcu() calls.
>> >> >
>> >> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
>> >>
>> >> I'm wondering about the vma freeing path. Consider vma_complete():
>> >>
>> >> vma_mark_detached(vp->remove);
>> >> vma->detached = true; - plain write
>> >> vm_area_free(vp->remove);
>> >> vma->vm_lock_seq = UINT_MAX; - plain write
>> >> kmem_cache_free(vm_area_cachep)
>> >> ...
>> >> potential reallocation
>> >>
>> >> against:
>> >>
>> >> lock_vma_under_rcu()
>> >> - mas_walk finds a stale vma due to race
>> >> vma_start_read()
>> >> if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(mm->mm_lock_seq.sequence))
>> >> - can be false, the vma was not being locked on the freeing side?
>> >> down_read_trylock(&vma->vm_lock.lock) - suceeds, wasn't locked
>> >> this is acquire, but was there any release?
>> >
>> > Yes, there was a release. I think what you missed is that
>> > vma_mark_detached() that is called from vma_complete() requires VMA to
>> > be write-locked (see vma_assert_write_locked() in
>> > vma_mark_detached()). The rule is that a VMA can be attached without
>> > write-locking but only a write-locked VMA can be detached. So, after
>>
>> OK but write unlocking means the mm's seqcount is bumped and becomes
>> non-equal with vma's vma->vm_lock_seq, right?
>>
>> Yet in the example above we happily set it to UINT_MAX and thus effectively
>> false unlock it for vma_start_read()?
>>
>> And this is all done before the vma_complete() side would actually reach
>> mmap_write_unlock(), AFAICS.
>
> Ah, you are right. With the possibility of reuse, even a freed VMA
> should be kept write-locked until it is unlocked by
> mmap_write_unlock(). I think the fix for this is simply to not reset
> vma->vm_lock_seq inside vm_area_free(). I'll also need to add a
But even if we don't reset vm_lock_seq to UINT_MAX, then whoever reallocated
it can proceed and end up doing a vma_start_write() and rewriting it there
anyway, no?
> comment for vm_lock_seq explaining these requirements.
> Do you agree that such a change would resolve the issue?
>
>>
>> > vma_mark_detached() and before down_read_trylock(&vma->vm_lock.lock)
>> > in vma_start_read() the VMA write-lock should have been released by
>> > mmap_write_unlock() and therefore vma->detached=false should be
>> > visible to the reader when it executed lock_vma_under_rcu().
>> >
>> >> is_vma_detached() - false negative as the write above didn't propagate
>> >> here yet; a read barrier but where is the write barrier?
>> >> checks for vma->vm_mm, vm_start, vm_end - nobody reset them yet so false
>> >> positive, or they got reset on reallocation but writes didn't propagate
>> >>
>> >> Am I missing something that would prevent lock_vma_under_rcu() falsely
>> >> succeeding here?
>> >>
>>
* Re: [PATCH v5 4/6] mm: make vma cache SLAB_TYPESAFE_BY_RCU
2024-12-10 17:25 ` Vlastimil Babka
@ 2024-12-10 18:53 ` Suren Baghdasaryan
2024-12-10 23:01 ` Suren Baghdasaryan
1 sibling, 0 replies; 32+ messages in thread
From: Suren Baghdasaryan @ 2024-12-10 18:53 UTC (permalink / raw)
To: Vlastimil Babka
Cc: akpm, willy, liam.howlett, lorenzo.stoakes, mhocko, hannes,
mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, minchan, jannh,
shakeel.butt, souravpanda, pasha.tatashin, corbet, linux-doc,
linux-mm, linux-kernel, kernel-team
On Tue, Dec 10, 2024 at 9:25 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 12/10/24 18:16, Suren Baghdasaryan wrote:
> > On Tue, Dec 10, 2024 at 8:32 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> >>
> >> On 12/10/24 17:20, Suren Baghdasaryan wrote:
> >> > On Tue, Dec 10, 2024 at 6:21 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> >> >>
> >> >> On 12/6/24 23:52, Suren Baghdasaryan wrote:
> >> >> > To enable SLAB_TYPESAFE_BY_RCU for vma cache we need to ensure that
> >> >> > object reuse before RCU grace period is over will be detected inside
> >> >> > lock_vma_under_rcu().
> >> >> > lock_vma_under_rcu() enters RCU read section, finds the vma at the
> >> >> > given address, locks the vma and checks if it got detached or remapped
> >> >> > to cover a different address range. These last checks are there
> >> >> > to ensure that the vma was not modified after we found it but before
> >> >> > locking it.
> >> >> > vma reuse introduces several new possibilities:
> >> >> > 1. vma can be reused after it was found but before it is locked;
> >> >> > 2. vma can be reused and reinitialized (including changing its vm_mm)
> >> >> > while being locked in vma_start_read();
> >> >> > 3. vma can be reused and reinitialized after it was found but before
> >> >> > it is locked, then attached at a new address or to a new mm while
> >> >> > read-locked;
> >> >> > For case #1 current checks will help detecting cases when:
> >> >> > - vma was reused but not yet added into the tree (detached check)
> >> >> > - vma was reused at a different address range (address check);
> >> >> > We are missing the check for vm_mm to ensure the reused vma was not
> >> >> > attached to a different mm. This patch adds the missing check.
> >> >> > For case #2, we pass mm to vma_start_read() to prevent access to
> >> >> > unstable vma->vm_mm. This might lead to vma_start_read() returning
> >> >> > a false locked result but that's not critical if it's rare because
> >> >> > it will only lead to a retry under mmap_lock.
> >> >> > For case #3, we ensure the order in which vma->detached flag and
> >> >> > vm_start/vm_end/vm_mm are set and checked. vma gets attached after
> >> >> > vm_start/vm_end/vm_mm were set and lock_vma_under_rcu() should check
> >> >> > vma->detached before checking vm_start/vm_end/vm_mm. This is required
> >> >> > because attaching vma happens without vma write-lock, as opposed to
> >> >> > vma detaching, which requires vma write-lock. This patch adds memory
> >> >> > barriers inside is_vma_detached() and vma_mark_attached() needed to
> >> >> > order reads and writes to vma->detached vs vm_start/vm_end/vm_mm.
> >> >> > After these provisions, SLAB_TYPESAFE_BY_RCU is added to vm_area_cachep.
> >> >> > This will facilitate vm_area_struct reuse and will minimize the number
> >> >> > of call_rcu() calls.
> >> >> >
> >> >> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> >> >>
> >> >> I'm wondering about the vma freeing path. Consider vma_complete():
> >> >>
> >> >> vma_mark_detached(vp->remove);
> >> >> vma->detached = true; - plain write
> >> >> vm_area_free(vp->remove);
> >> >> vma->vm_lock_seq = UINT_MAX; - plain write
> >> >> kmem_cache_free(vm_area_cachep)
> >> >> ...
> >> >> potential reallocation
> >> >>
> >> >> against:
> >> >>
> >> >> lock_vma_under_rcu()
> >> >> - mas_walk finds a stale vma due to race
> >> >> vma_start_read()
> >> >> if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(mm->mm_lock_seq.sequence))
> >> >> - can be false, the vma was not being locked on the freeing side?
> >> >> down_read_trylock(&vma->vm_lock.lock) - suceeds, wasn't locked
> >> >> this is acquire, but was there any release?
> >> >
> >> > Yes, there was a release. I think what you missed is that
> >> > vma_mark_detached() that is called from vma_complete() requires VMA to
> >> > be write-locked (see vma_assert_write_locked() in
> >> > vma_mark_detached()). The rule is that a VMA can be attached without
> >> > write-locking but only a write-locked VMA can be detached. So, after
> >>
> >> OK but write unlocking means the mm's seqcount is bumped and becomes
> >> non-equal with vma's vma->vm_lock_seq, right?
> >>
> >> Yet in the example above we happily set it to UINT_MAX and thus effectively
> >> false unlock it for vma_start_read()?
> >>
> >> And this is all done before the vma_complete() side would actually reach
> >> mmap_write_unlock(), AFAICS.
> >
> > Ah, you are right. With the possibility of reuse, even a freed VMA
> > should be kept write-locked until it is unlocked by
> > mmap_write_unlock(). I think the fix for this is simply to not reset
> > vma->vm_lock_seq inside vm_area_free(). I'll also need to add a
>
> But even if we don't reset vm_lock_seq to UINT_MAX, then whover reallocated
> it can proceed and end up doing a vma_start_write() and rewrite it there
> anyway, no?
Ugh, yes. It looks like we will need a write memory barrier in
vma_mark_detached() to make it immediately visible.
Case 1:

lock_vma_under_rcu()
                                    vma_complete()
                                      vma_mark_detached(vp->remove);
                                        vma->detached = true;
                                        smp_wmb();
                                      vm_area_free(vp->remove);
                                        vma->vm_lock_seq = UINT_MAX;
  vma_start_read()
  is_vma_detached() <<- abort
                                      kmem_cache_free(vm_area_cachep);
                                    mmap_write_unlock()
Case 2:

lock_vma_under_rcu()
                                    vma_complete()
                                      vma_mark_detached(vp->remove);
                                        vma->detached = true;
                                        smp_wmb();
                                      vm_area_free(vp->remove);
                                        vma->vm_lock_seq = UINT_MAX;
  vma_start_read()
                                      kmem_cache_free(vm_area_cachep);
                                    mmap_write_unlock() // changes mm->mm_lock_seq but does not matter
                                                        // reader holds vma->vm_lock, so new writers have to wait
                                    ...
                                    vm_area_alloc();
                                    // sets all vma attributes
                                    vma_mark_attached();
                                      smp_wmb();
                                      // if is_vma_detached() is called
                                      // before this point, we will
                                      // abort like in Case 1
                                      vma->detached = false;
  is_vma_detached()
  // check vm_mm, vm_start, vm_end
  // if all checks pass, this is a
  // new attached vma and it's
  // read-locked (can't be modified)
Did I miss any other race?
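Spelled out as code, a minimal sketch of the helpers with the barriers placed
as in the two cases above (the bodies are simplified and illustrative, not the
exact patch):

static inline void vma_mark_attached(struct vm_area_struct *vma)
{
	/* make vm_mm/vm_start/vm_end visible before the vma reads as attached */
	smp_wmb();
	WRITE_ONCE(vma->detached, false);
}

static inline void vma_mark_detached(struct vm_area_struct *vma)
{
	vma_assert_write_locked(vma);
	vma->detached = true;
	/* proposed: publish the detached store before the vma is freed and reused */
	smp_wmb();
}

static inline bool is_vma_detached(struct vm_area_struct *vma)
{
	bool detached = READ_ONCE(vma->detached);

	/*
	 * Pairs with the smp_wmb()s above: order this read before the
	 * vm_mm/vm_start/vm_end checks done by lock_vma_under_rcu().
	 */
	smp_rmb();
	return detached;
}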
>
> > comment for vm_lock_seq explaining these requirements.
> > Do you agree that such a change would resolve the issue?
> >
> >>
> >> > vma_mark_detached() and before down_read_trylock(&vma->vm_lock.lock)
> >> > in vma_start_read() the VMA write-lock should have been released by
> >> > mmap_write_unlock() and therefore vma->detached=false should be
> >> > visible to the reader when it executed lock_vma_under_rcu().
> >> >
> >> >> is_vma_detached() - false negative as the write above didn't propagate
> >> >> here yet; a read barrier but where is the write barrier?
> >> >> checks for vma->vm_mm, vm_start, vm_end - nobody reset them yet so false
> >> >> positive, or they got reset on reallocation but writes didn't propagate
> >> >>
> >> >> Am I missing something that would prevent lock_vma_under_rcu() falsely
> >> >> succeeding here?
> >> >>
> >>
>
* Re: [PATCH v5 4/6] mm: make vma cache SLAB_TYPESAFE_BY_RCU
2024-12-10 17:25 ` Vlastimil Babka
2024-12-10 18:53 ` Suren Baghdasaryan
@ 2024-12-10 23:01 ` Suren Baghdasaryan
2024-12-11 15:30 ` Suren Baghdasaryan
1 sibling, 1 reply; 32+ messages in thread
From: Suren Baghdasaryan @ 2024-12-10 23:01 UTC (permalink / raw)
To: Vlastimil Babka
Cc: akpm, willy, liam.howlett, lorenzo.stoakes, mhocko, hannes,
mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, minchan, jannh,
shakeel.butt, souravpanda, pasha.tatashin, corbet, linux-doc,
linux-mm, linux-kernel, kernel-team
On Tue, Dec 10, 2024 at 9:25 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 12/10/24 18:16, Suren Baghdasaryan wrote:
> > On Tue, Dec 10, 2024 at 8:32 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> >>
> >> On 12/10/24 17:20, Suren Baghdasaryan wrote:
> >> > On Tue, Dec 10, 2024 at 6:21 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> >> >>
> >> >> On 12/6/24 23:52, Suren Baghdasaryan wrote:
> >> >> > To enable SLAB_TYPESAFE_BY_RCU for vma cache we need to ensure that
> >> >> > object reuse before RCU grace period is over will be detected inside
> >> >> > lock_vma_under_rcu().
> >> >> > lock_vma_under_rcu() enters RCU read section, finds the vma at the
> >> >> > given address, locks the vma and checks if it got detached or remapped
> >> >> > to cover a different address range. These last checks are there
> >> >> > to ensure that the vma was not modified after we found it but before
> >> >> > locking it.
> >> >> > vma reuse introduces several new possibilities:
> >> >> > 1. vma can be reused after it was found but before it is locked;
> >> >> > 2. vma can be reused and reinitialized (including changing its vm_mm)
> >> >> > while being locked in vma_start_read();
> >> >> > 3. vma can be reused and reinitialized after it was found but before
> >> >> > it is locked, then attached at a new address or to a new mm while
> >> >> > read-locked;
> >> >> > For case #1 current checks will help detecting cases when:
> >> >> > - vma was reused but not yet added into the tree (detached check)
> >> >> > - vma was reused at a different address range (address check);
> >> >> > We are missing the check for vm_mm to ensure the reused vma was not
> >> >> > attached to a different mm. This patch adds the missing check.
> >> >> > For case #2, we pass mm to vma_start_read() to prevent access to
> >> >> > unstable vma->vm_mm. This might lead to vma_start_read() returning
> >> >> > a false locked result but that's not critical if it's rare because
> >> >> > it will only lead to a retry under mmap_lock.
> >> >> > For case #3, we ensure the order in which vma->detached flag and
> >> >> > vm_start/vm_end/vm_mm are set and checked. vma gets attached after
> >> >> > vm_start/vm_end/vm_mm were set and lock_vma_under_rcu() should check
> >> >> > vma->detached before checking vm_start/vm_end/vm_mm. This is required
> >> >> > because attaching vma happens without vma write-lock, as opposed to
> >> >> > vma detaching, which requires vma write-lock. This patch adds memory
> >> >> > barriers inside is_vma_detached() and vma_mark_attached() needed to
> >> >> > order reads and writes to vma->detached vs vm_start/vm_end/vm_mm.
> >> >> > After these provisions, SLAB_TYPESAFE_BY_RCU is added to vm_area_cachep.
> >> >> > This will facilitate vm_area_struct reuse and will minimize the number
> >> >> > of call_rcu() calls.
> >> >> >
> >> >> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> >> >>
> >> >> I'm wondering about the vma freeing path. Consider vma_complete():
> >> >>
> >> >> vma_mark_detached(vp->remove);
> >> >> vma->detached = true; - plain write
> >> >> vm_area_free(vp->remove);
> >> >> vma->vm_lock_seq = UINT_MAX; - plain write
> >> >> kmem_cache_free(vm_area_cachep)
> >> >> ...
> >> >> potential reallocation
> >> >>
> >> >> against:
> >> >>
> >> >> lock_vma_under_rcu()
> >> >> - mas_walk finds a stale vma due to race
> >> >> vma_start_read()
> >> >> if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(mm->mm_lock_seq.sequence))
> >> >> - can be false, the vma was not being locked on the freeing side?
> >> >> down_read_trylock(&vma->vm_lock.lock) - suceeds, wasn't locked
> >> >> this is acquire, but was there any release?
> >> >
> >> > Yes, there was a release. I think what you missed is that
> >> > vma_mark_detached() that is called from vma_complete() requires VMA to
> >> > be write-locked (see vma_assert_write_locked() in
> >> > vma_mark_detached()). The rule is that a VMA can be attached without
> >> > write-locking but only a write-locked VMA can be detached. So, after
> >>
> >> OK but write unlocking means the mm's seqcount is bumped and becomes
> >> non-equal with vma's vma->vm_lock_seq, right?
> >>
> >> Yet in the example above we happily set it to UINT_MAX and thus effectively
> >> false unlock it for vma_start_read()?
> >>
> >> And this is all done before the vma_complete() side would actually reach
> >> mmap_write_unlock(), AFAICS.
> >
> > Ah, you are right. With the possibility of reuse, even a freed VMA
> > should be kept write-locked until it is unlocked by
> > mmap_write_unlock(). I think the fix for this is simply to not reset
> > vma->vm_lock_seq inside vm_area_free(). I'll also need to add a
>
> But even if we don't reset vm_lock_seq to UINT_MAX, then whover reallocated
> it can proceed and end up doing a vma_start_write() and rewrite it there
> anyway, no?
Actually, I think with a small change we can simplify these locking rules:
static inline void vma_start_write(struct vm_area_struct *vma)
{
	int mm_lock_seq;

-	if (__is_vma_write_locked(vma, &mm_lock_seq))
-		return;
+	mmap_assert_write_locked(vma->vm_mm);
+	mm_lock_seq = vma->vm_mm->mm_lock_seq;

	down_write(&vma->vm_lock->lock);
	/*
	 * We should use WRITE_ONCE() here because we can have concurrent reads
	 * from the early lockless pessimistic check in vma_start_read().
	 * We don't really care about the correctness of that early check, but
	 * we should use WRITE_ONCE() for cleanliness and to keep KCSAN happy.
	 */
	WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
	up_write(&vma->vm_lock->lock);
}
This will force vma_start_write() to always write-lock vma->vm_lock
before changing vma->vm_lock_seq. Since vma->vm_lock survives reuse,
the other readers/writers will synchronize on it even if vma got
reused.
>
> > comment for vm_lock_seq explaining these requirements.
> > Do you agree that such a change would resolve the issue?
> >
> >>
> >> > vma_mark_detached() and before down_read_trylock(&vma->vm_lock.lock)
> >> > in vma_start_read() the VMA write-lock should have been released by
> >> > mmap_write_unlock() and therefore vma->detached=false should be
> >> > visible to the reader when it executed lock_vma_under_rcu().
> >> >
> >> >> is_vma_detached() - false negative as the write above didn't propagate
> >> >> here yet; a read barrier but where is the write barrier?
> >> >> checks for vma->vm_mm, vm_start, vm_end - nobody reset them yet so false
> >> >> positive, or they got reset on reallocation but writes didn't propagate
> >> >>
> >> >> Am I missing something that would prevent lock_vma_under_rcu() falsely
> >> >> succeeding here?
> >> >>
> >>
>
* Re: [PATCH v5 4/6] mm: make vma cache SLAB_TYPESAFE_BY_RCU
2024-12-10 23:01 ` Suren Baghdasaryan
@ 2024-12-11 15:30 ` Suren Baghdasaryan
2024-12-11 16:05 ` Vlastimil Babka
0 siblings, 1 reply; 32+ messages in thread
From: Suren Baghdasaryan @ 2024-12-11 15:30 UTC (permalink / raw)
To: Vlastimil Babka
Cc: akpm, willy, liam.howlett, lorenzo.stoakes, mhocko, hannes,
mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, minchan, jannh,
shakeel.butt, souravpanda, pasha.tatashin, corbet, linux-doc,
linux-mm, linux-kernel, kernel-team
On Tue, Dec 10, 2024 at 3:01 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Tue, Dec 10, 2024 at 9:25 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> >
> > On 12/10/24 18:16, Suren Baghdasaryan wrote:
> > > On Tue, Dec 10, 2024 at 8:32 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> > >>
> > >> On 12/10/24 17:20, Suren Baghdasaryan wrote:
> > >> > On Tue, Dec 10, 2024 at 6:21 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> > >> >>
> > >> >> On 12/6/24 23:52, Suren Baghdasaryan wrote:
> > >> >> > To enable SLAB_TYPESAFE_BY_RCU for vma cache we need to ensure that
> > >> >> > object reuse before RCU grace period is over will be detected inside
> > >> >> > lock_vma_under_rcu().
> > >> >> > lock_vma_under_rcu() enters RCU read section, finds the vma at the
> > >> >> > given address, locks the vma and checks if it got detached or remapped
> > >> >> > to cover a different address range. These last checks are there
> > >> >> > to ensure that the vma was not modified after we found it but before
> > >> >> > locking it.
> > >> >> > vma reuse introduces several new possibilities:
> > >> >> > 1. vma can be reused after it was found but before it is locked;
> > >> >> > 2. vma can be reused and reinitialized (including changing its vm_mm)
> > >> >> > while being locked in vma_start_read();
> > >> >> > 3. vma can be reused and reinitialized after it was found but before
> > >> >> > it is locked, then attached at a new address or to a new mm while
> > >> >> > read-locked;
> > >> >> > For case #1 current checks will help detecting cases when:
> > >> >> > - vma was reused but not yet added into the tree (detached check)
> > >> >> > - vma was reused at a different address range (address check);
> > >> >> > We are missing the check for vm_mm to ensure the reused vma was not
> > >> >> > attached to a different mm. This patch adds the missing check.
> > >> >> > For case #2, we pass mm to vma_start_read() to prevent access to
> > >> >> > unstable vma->vm_mm. This might lead to vma_start_read() returning
> > >> >> > a false locked result but that's not critical if it's rare because
> > >> >> > it will only lead to a retry under mmap_lock.
> > >> >> > For case #3, we ensure the order in which vma->detached flag and
> > >> >> > vm_start/vm_end/vm_mm are set and checked. vma gets attached after
> > >> >> > vm_start/vm_end/vm_mm were set and lock_vma_under_rcu() should check
> > >> >> > vma->detached before checking vm_start/vm_end/vm_mm. This is required
> > >> >> > because attaching vma happens without vma write-lock, as opposed to
> > >> >> > vma detaching, which requires vma write-lock. This patch adds memory
> > >> >> > barriers inside is_vma_detached() and vma_mark_attached() needed to
> > >> >> > order reads and writes to vma->detached vs vm_start/vm_end/vm_mm.
> > >> >> > After these provisions, SLAB_TYPESAFE_BY_RCU is added to vm_area_cachep.
> > >> >> > This will facilitate vm_area_struct reuse and will minimize the number
> > >> >> > of call_rcu() calls.
> > >> >> >
> > >> >> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > >> >>
> > >> >> I'm wondering about the vma freeing path. Consider vma_complete():
> > >> >>
> > >> >> vma_mark_detached(vp->remove);
> > >> >> vma->detached = true; - plain write
> > >> >> vm_area_free(vp->remove);
> > >> >> vma->vm_lock_seq = UINT_MAX; - plain write
> > >> >> kmem_cache_free(vm_area_cachep)
> > >> >> ...
> > >> >> potential reallocation
> > >> >>
> > >> >> against:
> > >> >>
> > >> >> lock_vma_under_rcu()
> > >> >> - mas_walk finds a stale vma due to race
> > >> >> vma_start_read()
> > >> >> if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(mm->mm_lock_seq.sequence))
> > >> >> - can be false, the vma was not being locked on the freeing side?
> > >> >> down_read_trylock(&vma->vm_lock.lock) - suceeds, wasn't locked
> > >> >> this is acquire, but was there any release?
> > >> >
> > >> > Yes, there was a release. I think what you missed is that
> > >> > vma_mark_detached() that is called from vma_complete() requires VMA to
> > >> > be write-locked (see vma_assert_write_locked() in
> > >> > vma_mark_detached()). The rule is that a VMA can be attached without
> > >> > write-locking but only a write-locked VMA can be detached. So, after
> > >>
> > >> OK but write unlocking means the mm's seqcount is bumped and becomes
> > >> non-equal with vma's vma->vm_lock_seq, right?
> > >>
> > >> Yet in the example above we happily set it to UINT_MAX and thus effectively
> > >> false unlock it for vma_start_read()?
> > >>
> > >> And this is all done before the vma_complete() side would actually reach
> > >> mmap_write_unlock(), AFAICS.
> > >
> > > Ah, you are right. With the possibility of reuse, even a freed VMA
> > > should be kept write-locked until it is unlocked by
> > > mmap_write_unlock(). I think the fix for this is simply to not reset
> > > vma->vm_lock_seq inside vm_area_free(). I'll also need to add a
> >
> > But even if we don't reset vm_lock_seq to UINT_MAX, then whover reallocated
> > it can proceed and end up doing a vma_start_write() and rewrite it there
> > anyway, no?
>
> Actually, I think with a small change we can simplify these locking rules:
>
> static inline void vma_start_write(struct vm_area_struct *vma)
> {
> int mm_lock_seq;
>
> - if (__is_vma_write_locked(vma, &mm_lock_seq))
> - return;
> + mmap_assert_write_locked(vma->vm_mm);
> + mm_lock_seq = vma->vm_mm->mm_lock_seq;
>
> down_write(&vma->vm_lock->lock);
> /*
> * We should use WRITE_ONCE() here because we can have concurrent reads
> * from the early lockless pessimistic check in vma_start_read().
> * We don't really care about the correctness of that early check, but
> * we should use WRITE_ONCE() for cleanliness and to keep KCSAN happy.
> */
> WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
> up_write(&vma->vm_lock->lock);
> }
>
> This will force vma_start_write() to always write-lock vma->vm_lock
> before changing vma->vm_lock_seq. Since vma->vm_lock survives reuse,
> the other readers/writers will synchronize on it even if vma got
> reused.
After thinking of all the alternatives, I think the cleanest way to
handle vma detaching would be to follow the same pattern as for vma
attaching. To attach a vma we do:
  vma->vm_mm = xxx;
  ...
  vma_mark_attached()
    smp_wmb();
    WRITE_ONCE(vma->detached, false);

lock_vma_under_rcu() ensures that a vma is attached and still
unchanged like this:

lock_vma_under_rcu()
  vma_start_read();
  is_vma_detached()
    detached = READ_ONCE(vma->detached);
    smp_rmb();
  if (vma->vm_mm != mm)

So, vm_area_free() can follow the same pattern to ensure vma reuse
gets detected even if lock_vma_under_rcu() succeeds in locking the
vma:

vm_area_free()
  vma->vm_mm = NULL;
  smp_wmb();
  WRITE_ONCE(vma->detached, true);
Vlastimil, I think that should address the race you described. WDYT?
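A minimal sketch of that vm_area_free() ordering (illustrative only; the real
function does more before freeing):

void vm_area_free(struct vm_area_struct *vma)
{
	vma->vm_mm = NULL;
	/*
	 * Order the vm_mm store before the detached store: a reader that
	 * observes vma->detached == true is then guaranteed to also observe
	 * vma->vm_mm == NULL (pairs with the smp_rmb() in is_vma_detached()).
	 */
	smp_wmb();
	WRITE_ONCE(vma->detached, true);
	kmem_cache_free(vm_area_cachep, vma);
}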
>
> >
> > > comment for vm_lock_seq explaining these requirements.
> > > Do you agree that such a change would resolve the issue?
> > >
> > >>
> > >> > vma_mark_detached() and before down_read_trylock(&vma->vm_lock.lock)
> > >> > in vma_start_read() the VMA write-lock should have been released by
> > >> > mmap_write_unlock() and therefore vma->detached=false should be
> > >> > visible to the reader when it executed lock_vma_under_rcu().
> > >> >
> > >> >> is_vma_detached() - false negative as the write above didn't propagate
> > >> >> here yet; a read barrier but where is the write barrier?
> > >> >> checks for vma->vm_mm, vm_start, vm_end - nobody reset them yet so false
> > >> >> positive, or they got reset on reallocation but writes didn't propagate
> > >> >>
> > >> >> Am I missing something that would prevent lock_vma_under_rcu() falsely
> > >> >> succeeding here?
> > >> >>
> > >>
> >
* Re: [PATCH v5 4/6] mm: make vma cache SLAB_TYPESAFE_BY_RCU
2024-12-11 15:30 ` Suren Baghdasaryan
@ 2024-12-11 16:05 ` Vlastimil Babka
2024-12-11 16:14 ` Suren Baghdasaryan
0 siblings, 1 reply; 32+ messages in thread
From: Vlastimil Babka @ 2024-12-11 16:05 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: akpm, willy, liam.howlett, lorenzo.stoakes, mhocko, hannes,
mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, minchan, jannh,
shakeel.butt, souravpanda, pasha.tatashin, corbet, linux-doc,
linux-mm, linux-kernel, kernel-team, Peter Zijlstra
On 12/11/24 16:30, Suren Baghdasaryan wrote:
> On Tue, Dec 10, 2024 at 3:01 PM Suren Baghdasaryan <surenb@google.com> wrote:
>>
>> On Tue, Dec 10, 2024 at 9:25 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>> >
>> > On 12/10/24 18:16, Suren Baghdasaryan wrote:
>> > > On Tue, Dec 10, 2024 at 8:32 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>> > >>
>> > >> On 12/10/24 17:20, Suren Baghdasaryan wrote:
>> > >> > On Tue, Dec 10, 2024 at 6:21 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>> > >> >>
>> > >> >> On 12/6/24 23:52, Suren Baghdasaryan wrote:
>> > >> >> > To enable SLAB_TYPESAFE_BY_RCU for vma cache we need to ensure that
>> > >> >> > object reuse before RCU grace period is over will be detected inside
>> > >> >> > lock_vma_under_rcu().
>> > >> >> > lock_vma_under_rcu() enters RCU read section, finds the vma at the
>> > >> >> > given address, locks the vma and checks if it got detached or remapped
>> > >> >> > to cover a different address range. These last checks are there
>> > >> >> > to ensure that the vma was not modified after we found it but before
>> > >> >> > locking it.
>> > >> >> > vma reuse introduces several new possibilities:
>> > >> >> > 1. vma can be reused after it was found but before it is locked;
>> > >> >> > 2. vma can be reused and reinitialized (including changing its vm_mm)
>> > >> >> > while being locked in vma_start_read();
>> > >> >> > 3. vma can be reused and reinitialized after it was found but before
>> > >> >> > it is locked, then attached at a new address or to a new mm while
>> > >> >> > read-locked;
>> > >> >> > For case #1 current checks will help detecting cases when:
>> > >> >> > - vma was reused but not yet added into the tree (detached check)
>> > >> >> > - vma was reused at a different address range (address check);
>> > >> >> > We are missing the check for vm_mm to ensure the reused vma was not
>> > >> >> > attached to a different mm. This patch adds the missing check.
>> > >> >> > For case #2, we pass mm to vma_start_read() to prevent access to
>> > >> >> > unstable vma->vm_mm. This might lead to vma_start_read() returning
>> > >> >> > a false locked result but that's not critical if it's rare because
>> > >> >> > it will only lead to a retry under mmap_lock.
>> > >> >> > For case #3, we ensure the order in which vma->detached flag and
>> > >> >> > vm_start/vm_end/vm_mm are set and checked. vma gets attached after
>> > >> >> > vm_start/vm_end/vm_mm were set and lock_vma_under_rcu() should check
>> > >> >> > vma->detached before checking vm_start/vm_end/vm_mm. This is required
>> > >> >> > because attaching vma happens without vma write-lock, as opposed to
>> > >> >> > vma detaching, which requires vma write-lock. This patch adds memory
>> > >> >> > barriers inside is_vma_detached() and vma_mark_attached() needed to
>> > >> >> > order reads and writes to vma->detached vs vm_start/vm_end/vm_mm.
>> > >> >> > After these provisions, SLAB_TYPESAFE_BY_RCU is added to vm_area_cachep.
>> > >> >> > This will facilitate vm_area_struct reuse and will minimize the number
>> > >> >> > of call_rcu() calls.
>> > >> >> >
>> > >> >> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
>> > >> >>
>> > >> >> I'm wondering about the vma freeing path. Consider vma_complete():
>> > >> >>
>> > >> >> vma_mark_detached(vp->remove);
>> > >> >> vma->detached = true; - plain write
>> > >> >> vm_area_free(vp->remove);
>> > >> >> vma->vm_lock_seq = UINT_MAX; - plain write
>> > >> >> kmem_cache_free(vm_area_cachep)
>> > >> >> ...
>> > >> >> potential reallocation
>> > >> >>
>> > >> >> against:
>> > >> >>
>> > >> >> lock_vma_under_rcu()
>> > >> >> - mas_walk finds a stale vma due to race
>> > >> >> vma_start_read()
>> > >> >> if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(mm->mm_lock_seq.sequence))
>> > >> >> - can be false, the vma was not being locked on the freeing side?
>> > >> >> down_read_trylock(&vma->vm_lock.lock) - suceeds, wasn't locked
>> > >> >> this is acquire, but was there any release?
>> > >> >
>> > >> > Yes, there was a release. I think what you missed is that
>> > >> > vma_mark_detached() that is called from vma_complete() requires VMA to
>> > >> > be write-locked (see vma_assert_write_locked() in
>> > >> > vma_mark_detached()). The rule is that a VMA can be attached without
>> > >> > write-locking but only a write-locked VMA can be detached. So, after
>> > >>
>> > >> OK but write unlocking means the mm's seqcount is bumped and becomes
>> > >> non-equal with vma's vma->vm_lock_seq, right?
>> > >>
>> > >> Yet in the example above we happily set it to UINT_MAX and thus effectively
>> > >> false unlock it for vma_start_read()?
>> > >>
>> > >> And this is all done before the vma_complete() side would actually reach
>> > >> mmap_write_unlock(), AFAICS.
>> > >
>> > > Ah, you are right. With the possibility of reuse, even a freed VMA
>> > > should be kept write-locked until it is unlocked by
>> > > mmap_write_unlock(). I think the fix for this is simply to not reset
>> > > vma->vm_lock_seq inside vm_area_free(). I'll also need to add a
>> >
>> > But even if we don't reset vm_lock_seq to UINT_MAX, then whover reallocated
>> > it can proceed and end up doing a vma_start_write() and rewrite it there
>> > anyway, no?
>>
>> Actually, I think with a small change we can simplify these locking rules:
>>
>> static inline void vma_start_write(struct vm_area_struct *vma)
>> {
>> int mm_lock_seq;
>>
>> - if (__is_vma_write_locked(vma, &mm_lock_seq))
>> - return;
>> + mmap_assert_write_locked(vma->vm_mm);
>> + mm_lock_seq = vma->vm_mm->mm_lock_seq;
>>
>> down_write(&vma->vm_lock->lock);
>> /*
>> * We should use WRITE_ONCE() here because we can have concurrent reads
>> * from the early lockless pessimistic check in vma_start_read().
>> * We don't really care about the correctness of that early check, but
>> * we should use WRITE_ONCE() for cleanliness and to keep KCSAN happy.
>> */
>> WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
>> up_write(&vma->vm_lock->lock);
>> }
>>
>> This will force vma_start_write() to always write-lock vma->vm_lock
>> before changing vma->vm_lock_seq. Since vma->vm_lock survives reuse,
>> the other readers/writers will synchronize on it even if vma got
>> reused.
>
> After thinking of all the alternatives, I think the cleanest way to
> handle vma detaching would be to follow the same pattern as for vma
> attaching. To attach a vma we do:
>
> vma->vm_mm = xxx;
> ...
> vma_mark_attached()
> smp_wmb();
> WRITE_ONCE(vma->detached, false);
>
>
> lock_vma_under_rcu() ensures that a vma is attached and still
> unchanged like this:
>
> lock_vma_under_rcu()
> vma_start_read();
> is_vma_detached()
> detached = READ_ONCE(vma->detached);
> smp_rmb();
> if (vma->vm_mm != mm)
>
> So, vm_area_free() can follow the same pattern to ensure vma reuse
> gets detected even if lock_vma_under_rcu() succeeds in locking the
> vma:
>
> vm_area_free()
> vma->vm_mm = NULL;
> smp_wmb();
> WRITE_ONCE(vma->detached, true);
>
> Vlastimil, I think that should address the race you described. WDYT?
I'm not sure. AFAIU the barriers would ensure that if lock_vma_under_rcu()
sees the vma as detached, it also sees vm_mm as NULL. But since it doesn't
ensure that it will see the vma as detached, it also doesn't ensure we will
see vm_mm as NULL.
I think the main problem is that we unlock the vma by writing to the mm, not
to the vma, which makes it hard to apply the necessary SLAB_TYPESAFE_BY_RCU
validation patterns to it. I thought the direction you were discussing with
PeterZ in the other thread would solve this (in addition to getting rid of
the rwsem, which we were considering anyway, but enabling
SLAB_TYPESAFE_BY_RCU that way would be a stronger argument).
Perhaps a solution that would work with the current rwsem would be to set
detached and vm_mm to NULL under the down_write() of the rwsem. That would
make sure that if lock_vma_under_rcu() succeeds in the down_read_trylock(),
it is guaranteed to see those assignments?
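A sketch of that suggestion; placing the stores in vma_mark_detached() and the
exact helper shape are assumptions here, not part of the suggestion itself:

static void vma_mark_detached(struct vm_area_struct *vma)
{
	vma_assert_write_locked(vma);
	/*
	 * Do both stores under the vma rwsem: a reader that subsequently wins
	 * down_read_trylock(&vma->vm_lock.lock) in vma_start_read() is then
	 * guaranteed to observe them; a reader already holding the read lock
	 * makes the down_write() below wait instead.
	 */
	down_write(&vma->vm_lock.lock);
	vma->detached = true;
	vma->vm_mm = NULL;
	up_write(&vma->vm_lock.lock);
}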
>>
>> >
>> > > comment for vm_lock_seq explaining these requirements.
>> > > Do you agree that such a change would resolve the issue?
>> > >
>> > >>
>> > >> > vma_mark_detached() and before down_read_trylock(&vma->vm_lock.lock)
>> > >> > in vma_start_read() the VMA write-lock should have been released by
>> > >> > mmap_write_unlock() and therefore vma->detached=false should be
>> > >> > visible to the reader when it executed lock_vma_under_rcu().
>> > >> >
>> > >> >> is_vma_detached() - false negative as the write above didn't propagate
>> > >> >> here yet; a read barrier but where is the write barrier?
>> > >> >> checks for vma->vm_mm, vm_start, vm_end - nobody reset them yet so false
>> > >> >> positive, or they got reset on reallocation but writes didn't propagate
>> > >> >>
>> > >> >> Am I missing something that would prevent lock_vma_under_rcu() falsely
>> > >> >> succeeding here?
>> > >> >>
>> > >>
>> >
* Re: [PATCH v5 4/6] mm: make vma cache SLAB_TYPESAFE_BY_RCU
2024-12-11 16:05 ` Vlastimil Babka
@ 2024-12-11 16:14 ` Suren Baghdasaryan
0 siblings, 0 replies; 32+ messages in thread
From: Suren Baghdasaryan @ 2024-12-11 16:14 UTC (permalink / raw)
To: Vlastimil Babka
Cc: akpm, willy, liam.howlett, lorenzo.stoakes, mhocko, hannes,
mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, minchan, jannh,
shakeel.butt, souravpanda, pasha.tatashin, corbet, linux-doc,
linux-mm, linux-kernel, kernel-team, Peter Zijlstra
On Wed, Dec 11, 2024 at 8:05 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 12/11/24 16:30, Suren Baghdasaryan wrote:
> > On Tue, Dec 10, 2024 at 3:01 PM Suren Baghdasaryan <surenb@google.com> wrote:
> >>
> >> On Tue, Dec 10, 2024 at 9:25 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> >> >
> >> > On 12/10/24 18:16, Suren Baghdasaryan wrote:
> >> > > On Tue, Dec 10, 2024 at 8:32 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> >> > >>
> >> > >> On 12/10/24 17:20, Suren Baghdasaryan wrote:
> >> > >> > On Tue, Dec 10, 2024 at 6:21 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> >> > >> >>
> >> > >> >> On 12/6/24 23:52, Suren Baghdasaryan wrote:
> >> > >> >> > To enable SLAB_TYPESAFE_BY_RCU for vma cache we need to ensure that
> >> > >> >> > object reuse before RCU grace period is over will be detected inside
> >> > >> >> > lock_vma_under_rcu().
> >> > >> >> > lock_vma_under_rcu() enters RCU read section, finds the vma at the
> >> > >> >> > given address, locks the vma and checks if it got detached or remapped
> >> > >> >> > to cover a different address range. These last checks are there
> >> > >> >> > to ensure that the vma was not modified after we found it but before
> >> > >> >> > locking it.
> >> > >> >> > vma reuse introduces several new possibilities:
> >> > >> >> > 1. vma can be reused after it was found but before it is locked;
> >> > >> >> > 2. vma can be reused and reinitialized (including changing its vm_mm)
> >> > >> >> > while being locked in vma_start_read();
> >> > >> >> > 3. vma can be reused and reinitialized after it was found but before
> >> > >> >> > it is locked, then attached at a new address or to a new mm while
> >> > >> >> > read-locked;
> >> > >> >> > For case #1 current checks will help detecting cases when:
> >> > >> >> > - vma was reused but not yet added into the tree (detached check)
> >> > >> >> > - vma was reused at a different address range (address check);
> >> > >> >> > We are missing the check for vm_mm to ensure the reused vma was not
> >> > >> >> > attached to a different mm. This patch adds the missing check.
> >> > >> >> > For case #2, we pass mm to vma_start_read() to prevent access to
> >> > >> >> > unstable vma->vm_mm. This might lead to vma_start_read() returning
> >> > >> >> > a false locked result but that's not critical if it's rare because
> >> > >> >> > it will only lead to a retry under mmap_lock.
> >> > >> >> > For case #3, we ensure the order in which vma->detached flag and
> >> > >> >> > vm_start/vm_end/vm_mm are set and checked. vma gets attached after
> >> > >> >> > vm_start/vm_end/vm_mm were set and lock_vma_under_rcu() should check
> >> > >> >> > vma->detached before checking vm_start/vm_end/vm_mm. This is required
> >> > >> >> > because attaching vma happens without vma write-lock, as opposed to
> >> > >> >> > vma detaching, which requires vma write-lock. This patch adds memory
> >> > >> >> > barriers inside is_vma_detached() and vma_mark_attached() needed to
> >> > >> >> > order reads and writes to vma->detached vs vm_start/vm_end/vm_mm.
> >> > >> >> > After these provisions, SLAB_TYPESAFE_BY_RCU is added to vm_area_cachep.
> >> > >> >> > This will facilitate vm_area_struct reuse and will minimize the number
> >> > >> >> > of call_rcu() calls.
> >> > >> >> >
> >> > >> >> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> >> > >> >>
> >> > >> >> I'm wondering about the vma freeing path. Consider vma_complete():
> >> > >> >>
> >> > >> >> vma_mark_detached(vp->remove);
> >> > >> >> vma->detached = true; - plain write
> >> > >> >> vm_area_free(vp->remove);
> >> > >> >> vma->vm_lock_seq = UINT_MAX; - plain write
> >> > >> >> kmem_cache_free(vm_area_cachep)
> >> > >> >> ...
> >> > >> >> potential reallocation
> >> > >> >>
> >> > >> >> against:
> >> > >> >>
> >> > >> >> lock_vma_under_rcu()
> >> > >> >> - mas_walk finds a stale vma due to race
> >> > >> >> vma_start_read()
> >> > >> >> if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(mm->mm_lock_seq.sequence))
> >> > >> >> - can be false, the vma was not being locked on the freeing side?
> >> > >> >> down_read_trylock(&vma->vm_lock.lock) - suceeds, wasn't locked
> >> > >> >> this is acquire, but was there any release?
> >> > >> >
> >> > >> > Yes, there was a release. I think what you missed is that
> >> > >> > vma_mark_detached() that is called from vma_complete() requires VMA to
> >> > >> > be write-locked (see vma_assert_write_locked() in
> >> > >> > vma_mark_detached()). The rule is that a VMA can be attached without
> >> > >> > write-locking but only a write-locked VMA can be detached. So, after
> >> > >>
> >> > >> OK but write unlocking means the mm's seqcount is bumped and becomes
> >> > >> non-equal with vma's vma->vm_lock_seq, right?
> >> > >>
> >> > >> Yet in the example above we happily set it to UINT_MAX and thus effectively
> >> > >> false unlock it for vma_start_read()?
> >> > >>
> >> > >> And this is all done before the vma_complete() side would actually reach
> >> > >> mmap_write_unlock(), AFAICS.
> >> > >
> >> > > Ah, you are right. With the possibility of reuse, even a freed VMA
> >> > > should be kept write-locked until it is unlocked by
> >> > > mmap_write_unlock(). I think the fix for this is simply to not reset
> >> > > vma->vm_lock_seq inside vm_area_free(). I'll also need to add a
> >> >
> >> > But even if we don't reset vm_lock_seq to UINT_MAX, then whover reallocated
> >> > it can proceed and end up doing a vma_start_write() and rewrite it there
> >> > anyway, no?
> >>
> >> Actually, I think with a small change we can simplify these locking rules:
> >>
> >> static inline void vma_start_write(struct vm_area_struct *vma)
> >> {
> >> int mm_lock_seq;
> >>
> >> - if (__is_vma_write_locked(vma, &mm_lock_seq))
> >> - return;
> >> + mmap_assert_write_locked(vma->vm_mm);
> >> + mm_lock_seq = vma->vm_mm->mm_lock_seq;
> >>
> >> down_write(&vma->vm_lock->lock);
> >> /*
> >> * We should use WRITE_ONCE() here because we can have concurrent reads
> >> * from the early lockless pessimistic check in vma_start_read().
> >> * We don't really care about the correctness of that early check, but
> >> * we should use WRITE_ONCE() for cleanliness and to keep KCSAN happy.
> >> */
> >> WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
> >> up_write(&vma->vm_lock->lock);
> >> }
> >>
> >> This will force vma_start_write() to always write-lock vma->vm_lock
> >> before changing vma->vm_lock_seq. Since vma->vm_lock survives reuse,
> >> the other readers/writers will synchronize on it even if vma got
> >> reused.
> >
> > After thinking of all the alternatives, I think the cleanest way to
> > handle vma detaching would be to follow the same pattern as for vma
> > attaching. To attach a vma we do:
> >
> > vma->vm_mm = xxx;
> > ...
> > vma_mark_attached()
> > smp_wmb();
> > WRITE_ONCE(vma->detached, false);
> >
> >
> > lock_vma_under_rcu() ensures that a vma is attached and still
> > unchanged like this:
> >
> > lock_vma_under_rcu()
> > vma_start_read();
> > is_vma_detached()
> > detached = READ_ONCE(vma->detached);
> > smp_rmb();
> > if (vma->vm_mm != mm)
> >
> > So, vm_area_free() can follow the same pattern to ensure vma reuse
> > gets detected even if lock_vma_under_rcu() succeeds in locking the
> > vma:
> >
> > vm_area_free()
> > vma->vm_mm = NULL;
> > smp_wmb();
> > WRITE_ONCE(vma->detached, true);
> >
> > Vlastimil, I think that should address the race you described. WDYT?
>
> I'm not sure. AFAIU the barriers would ensure that if lock_vma_under_rcu()
> sees detached, it also sees vm_mm is NULL. But as it doesn't ensure that it
> will see it detached, so it also doesn't ensure we will see vm_mm as NULL.
>
> I think the main problem is that we unlock the vma by writing to a mm, not
> the vma, which makes it hard to apply the necessary SLAB_TYPESAFE_BY_RCU
> validation patterns to it. I thought the direction you were discussing with
> PeterZ in the other thread would solve this (in addition of getting rid of
> the rwsem, which we were considering it anyway, but enabling
> SLAB_TYPESAFE_BY_RCU by that would be a stronger argument).
I was hoping to implement SLAB_TYPESAFE_BY_RCU independently from the
vm_lock change but you are probably right. Incorporating the vma->detached
flag into the lock itself (which survives reuse) would make things way
easier. Let me pivot towards making that change first and see if
SLAB_TYPESAFE_BY_RCU becomes simpler.
>
> Perhaps a solution to this that would work with the current rwsem would be
> that setting detached and vm_mm to NULL would be set under the down_write()
> of the rwsem. That would make sure that if lock_vma_under_rcu() succeeds the
> down_read_trylock(), it would be guaranteed to see those assignments?
Yeah, that would definitely work. I was trying to avoid extra locking,
but it looks like it's unavoidable.
Anyway, let me try replacing vm_lock first and we'll see where we end up.
Thanks for the input!
>
> >>
> >> >
> >> > > comment for vm_lock_seq explaining these requirements.
> >> > > Do you agree that such a change would resolve the issue?
> >> > >
> >> > >>
> >> > >> > vma_mark_detached() and before down_read_trylock(&vma->vm_lock.lock)
> >> > >> > in vma_start_read() the VMA write-lock should have been released by
> >> > >> > mmap_write_unlock() and therefore vma->detached=false should be
> >> > >> > visible to the reader when it executed lock_vma_under_rcu().
> >> > >> >
> >> > >> >> is_vma_detached() - false negative as the write above didn't propagate
> >> > >> >> here yet; a read barrier but where is the write barrier?
> >> > >> >> checks for vma->vm_mm, vm_start, vm_end - nobody reset them yet so false
> >> > >> >> positive, or they got reset on reallocation but writes didn't propagate
> >> > >> >>
> >> > >> >> Am I missing something that would prevent lock_vma_under_rcu() falsely
> >> > >> >> succeeding here?
> >> > >> >>
> >> > >>
> >> >
>
end of thread
Thread overview: 32+ messages
2024-12-06 22:51 [PATCH v5 0/6] move per-vma lock into vm_area_struct Suren Baghdasaryan
2024-12-06 22:51 ` [PATCH v5 1/6] mm: introduce vma_start_read_locked{_nested} helpers Suren Baghdasaryan
2024-12-10 9:03 ` Vlastimil Babka
2024-12-06 22:51 ` [PATCH v5 2/6] mm: move per-vma lock into vm_area_struct Suren Baghdasaryan
2024-12-10 9:15 ` Vlastimil Babka
2024-12-06 22:52 ` [PATCH v5 3/6] mm: mark vma as detached until it's added into vma tree Suren Baghdasaryan
2024-12-10 9:35 ` Vlastimil Babka
2024-12-10 11:36 ` Vlastimil Babka
2024-12-10 16:28 ` Suren Baghdasaryan
2024-12-06 22:52 ` [PATCH v5 4/6] mm: make vma cache SLAB_TYPESAFE_BY_RCU Suren Baghdasaryan
2024-12-09 17:35 ` Klara Modin
2024-12-09 20:28 ` Suren Baghdasaryan
2024-12-09 22:19 ` Suren Baghdasaryan
2024-12-10 12:06 ` Vlastimil Babka
2024-12-10 16:23 ` Suren Baghdasaryan
2024-12-10 14:21 ` Vlastimil Babka
2024-12-10 16:20 ` Suren Baghdasaryan
2024-12-10 16:32 ` Vlastimil Babka
2024-12-10 17:16 ` Suren Baghdasaryan
2024-12-10 17:25 ` Vlastimil Babka
2024-12-10 18:53 ` Suren Baghdasaryan
2024-12-10 23:01 ` Suren Baghdasaryan
2024-12-11 15:30 ` Suren Baghdasaryan
2024-12-11 16:05 ` Vlastimil Babka
2024-12-11 16:14 ` Suren Baghdasaryan
2024-12-06 22:52 ` [PATCH v5 5/6] mm/slab: allow freeptr_offset to be used with ctor Suren Baghdasaryan
2024-12-10 11:01 ` Vlastimil Babka
2024-12-06 22:52 ` [PATCH v5 6/6] docs/mm: document latest changes to vm_lock Suren Baghdasaryan
2024-12-07 3:23 ` Randy Dunlap
2024-12-07 4:24 ` Akira Yokosawa
2024-12-07 17:33 ` Suren Baghdasaryan
2024-12-07 4:29 ` [PATCH v5 0/6] move per-vma lock into vm_area_struct Andrew Morton