From: Suren Baghdasaryan <surenb@google.com>
To: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: akpm@linux-foundation.org, willy@infradead.org,
liam.howlett@oracle.com, mhocko@suse.com, vbabka@suse.cz,
hannes@cmpxchg.org, mjguzik@gmail.com, oliver.sang@intel.com,
mgorman@techsingularity.net, david@redhat.com,
peterx@redhat.com, oleg@redhat.com, dave@stgolabs.net,
paulmck@kernel.org, brauner@kernel.org, dhowells@redhat.com,
hdanton@sina.com, hughd@google.com, minchan@google.com,
jannh@google.com, shakeel.butt@linux.dev,
souravpanda@google.com, pasha.tatashin@soleen.com,
linux-mm@kvack.org, linux-kernel@vger.kernel.org,
kernel-team@android.com
Subject: Re: [PATCH v2 2/5] mm: move per-vma lock into vm_area_struct
Date: Wed, 13 Nov 2024 07:42:49 -0800
Message-ID: <CAJuCfpHz-u9w1A8d9LEZ6=mmxXN=JLQr9N00dKpALWUT6GVsfA@mail.gmail.com>
In-Reply-To: <637370b8-3e3e-4457-81d6-5913a3ff1d4e@lucifer.local>

On Wed, Nov 13, 2024 at 6:28 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Tue, Nov 12, 2024 at 11:46:32AM -0800, Suren Baghdasaryan wrote:
> > Back when per-vma locks were introduced, vm_lock was moved out of
> > vm_area_struct in [1] because of the performance regression caused by
> > false cacheline sharing. Recent investigation [2] revealed that the
> > regression is limited to a rather old Broadwell microarchitecture and
> > even there it can be mitigated by disabling adjacent cacheline
> > prefetching, see [3].
>
> I don't see a motivating reason for why we want to do this. We increase
> memory usage here, which is not good; the later lock optimisation mitigates
> it, but why wouldn't we just do the lock optimisations and use less memory
> overall?
>
> > This patchset moves vm_lock back into vm_area_struct, aligning it at the
> > cacheline boundary and changing the cache to be cache-aligned as well.
> > This causes VMA memory consumption to grow from 160 (vm_area_struct) + 40
> > (vm_lock) bytes to 256 bytes:
> >
> > slabinfo before:
> > <name> ... <objsize> <objperslab> <pagesperslab> : ...
> > vma_lock ... 40 102 1 : ...
> > vm_area_struct ... 160 51 2 : ...
>
> Pedantry, but it might be worth mentioning how much this can vary by config.
>
> For instance, on my machine:
>
> vm_area_struct 125238 138820 184 44
Ack.
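For reference, most of the variance comes from config-gated members. A
rough, non-exhaustive sketch of the optional vm_area_struct fields as I
read mm_types.h (exact sizes will differ per config and arch):

#ifdef CONFIG_PER_VMA_LOCK
	unsigned int vm_lock_seq;
	struct vma_lock *vm_lock;	/* a pointer before this patch */
#endif
#ifdef CONFIG_ANON_VMA_NAME
	struct anon_vma_name *anon_name;
#endif
#ifdef CONFIG_SWAP
	atomic_long_t swap_readahead_info;
#endif
#ifdef CONFIG_NUMA
	struct mempolicy *vm_policy;
#endif
#ifdef CONFIG_NUMA_BALANCING
	struct vma_numab_state *numab_state;
#endif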
>
> >
> > slabinfo after moving vm_lock:
> > <name> ... <objsize> <objperslab> <pagesperslab> : ...
> > vm_area_struct ... 256 32 2 : ...
> >
> > Aggregate VMA memory consumption per 1000 VMAs grows from 50 to 64 pages,
> > an increase of roughly 5.5MB per 100000 VMAs. This growth can be
> > addressed later by optimizing the vm_lock.
>
> Yes, grabbing this back is of critical importance I'd say! :)
>
> Functionally it looks OK to me, but I'd like to see a stronger
> justification in the commit msg! :)
>
> >
> > [1] https://lore.kernel.org/all/20230227173632.3292573-34-surenb@google.com/
> > [2] https://lore.kernel.org/all/ZsQyI%2F087V34JoIt@xsang-OptiPlex-9020/
> > [3] https://lore.kernel.org/all/CAJuCfpEisU8Lfe96AYJDZ+OM4NoPmnw9bP53cT_kbfP_pR+-2g@mail.gmail.com/
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > ---
> > include/linux/mm.h | 28 +++++++++++++----------
> > include/linux/mm_types.h | 6 +++--
> > kernel/fork.c | 49 ++++------------------------------------
> > 3 files changed, 25 insertions(+), 58 deletions(-)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 01ce619f3d17..a5eb0be3e351 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -684,6 +684,12 @@ static inline void vma_numab_state_free(struct vm_area_struct *vma) {}
> > #endif /* CONFIG_NUMA_BALANCING */
> >
> > #ifdef CONFIG_PER_VMA_LOCK
> > +static inline void vma_lock_init(struct vm_area_struct *vma)
> > +{
> > + init_rwsem(&vma->vm_lock.lock);
> > + vma->vm_lock_seq = UINT_MAX;
> > +}
> > +
> > /*
> > * Try to read-lock a vma. The function is allowed to occasionally yield false
> > * locked result to avoid performance overhead, in which case we fall back to
> > @@ -701,7 +707,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> > if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq.sequence))
> > return false;
> >
> > - if (unlikely(down_read_trylock(&vma->vm_lock->lock) == 0))
> > + if (unlikely(down_read_trylock(&vma->vm_lock.lock) == 0))
> > return false;
> >
> > /*
> > @@ -716,7 +722,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> > * This pairs with RELEASE semantics in vma_end_write_all().
> > */
> > if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&vma->vm_mm->mm_lock_seq))) {
> > - up_read(&vma->vm_lock->lock);
> > + up_read(&vma->vm_lock.lock);
> > return false;
> > }
> > return true;
> > @@ -729,7 +735,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> > static inline void vma_start_read_locked_nested(struct vm_area_struct *vma, int subclass)
> > {
> > mmap_assert_locked(vma->vm_mm);
> > - down_read_nested(&vma->vm_lock->lock, subclass);
> > + down_read_nested(&vma->vm_lock.lock, subclass);
> > }
> >
> > /*
> > @@ -739,13 +745,13 @@ static inline void vma_start_read_locked_nested(struct vm_area_struct *vma, int
> > static inline void vma_start_read_locked(struct vm_area_struct *vma)
> > {
> > mmap_assert_locked(vma->vm_mm);
> > - down_read(&vma->vm_lock->lock);
> > + down_read(&vma->vm_lock.lock);
> > }
> >
> > static inline void vma_end_read(struct vm_area_struct *vma)
> > {
> > rcu_read_lock(); /* keeps vma alive till the end of up_read */
> > - up_read(&vma->vm_lock->lock);
> > + up_read(&vma->vm_lock.lock);
> > rcu_read_unlock();
> > }
> >
> > @@ -774,7 +780,7 @@ static inline void vma_start_write(struct vm_area_struct *vma)
> > if (__is_vma_write_locked(vma, &mm_lock_seq))
> > return;
> >
> > - down_write(&vma->vm_lock->lock);
> > + down_write(&vma->vm_lock.lock);
> > /*
> > * We should use WRITE_ONCE() here because we can have concurrent reads
> > * from the early lockless pessimistic check in vma_start_read().
> > @@ -782,7 +788,7 @@ static inline void vma_start_write(struct vm_area_struct *vma)
> > * we should use WRITE_ONCE() for cleanliness and to keep KCSAN happy.
> > */
> > WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
> > - up_write(&vma->vm_lock->lock);
> > + up_write(&vma->vm_lock.lock);
> > }
> >
> > static inline void vma_assert_write_locked(struct vm_area_struct *vma)
> > @@ -794,7 +800,7 @@ static inline void vma_assert_write_locked(struct vm_area_struct *vma)
> >
> > static inline void vma_assert_locked(struct vm_area_struct *vma)
> > {
> > - if (!rwsem_is_locked(&vma->vm_lock->lock))
> > + if (!rwsem_is_locked(&vma->vm_lock.lock))
> > vma_assert_write_locked(vma);
> > }
> >
> > @@ -827,6 +833,7 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> >
> > #else /* CONFIG_PER_VMA_LOCK */
> >
> > +static inline void vma_lock_init(struct vm_area_struct *vma) {}
> > static inline bool vma_start_read(struct vm_area_struct *vma)
> > { return false; }
> > static inline void vma_end_read(struct vm_area_struct *vma) {}
> > @@ -861,10 +868,6 @@ static inline void assert_fault_locked(struct vm_fault *vmf)
> >
> > extern const struct vm_operations_struct vma_dummy_vm_ops;
> >
> > -/*
> > - * WARNING: vma_init does not initialize vma->vm_lock.
> > - * Use vm_area_alloc()/vm_area_free() if vma needs locking.
> > - */
> > static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
> > {
> > memset(vma, 0, sizeof(*vma));
> > @@ -873,6 +876,7 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
> > INIT_LIST_HEAD(&vma->anon_vma_chain);
> > vma_mark_detached(vma, false);
> > vma_numab_state_init(vma);
> > + vma_lock_init(vma);
> > }
> >
> > /* Use when VMA is not part of the VMA tree and needs no locking */
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index 80fef38d9d64..5c4bfdcfac72 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -716,8 +716,6 @@ struct vm_area_struct {
> > * slowpath.
> > */
> > unsigned int vm_lock_seq;
> > - /* Unstable RCU readers are allowed to read this. */
> > - struct vma_lock *vm_lock;
> > #endif
> >
> > /*
> > @@ -770,6 +768,10 @@ struct vm_area_struct {
> > struct vma_numab_state *numab_state; /* NUMA Balancing state */
> > #endif
> > struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
> > +#ifdef CONFIG_PER_VMA_LOCK
> > + /* Unstable RCU readers are allowed to read this. */
> > + struct vma_lock vm_lock ____cacheline_aligned_in_smp;
> > +#endif
> > } __randomize_layout;
> >
> > #ifdef CONFIG_NUMA
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index 0061cf2450ef..7823797e31d2 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -436,35 +436,6 @@ static struct kmem_cache *vm_area_cachep;
> > /* SLAB cache for mm_struct structures (tsk->mm) */
> > static struct kmem_cache *mm_cachep;
> >
> > -#ifdef CONFIG_PER_VMA_LOCK
> > -
> > -/* SLAB cache for vm_area_struct.lock */
> > -static struct kmem_cache *vma_lock_cachep;
> > -
> > -static bool vma_lock_alloc(struct vm_area_struct *vma)
> > -{
> > - vma->vm_lock = kmem_cache_alloc(vma_lock_cachep, GFP_KERNEL);
> > - if (!vma->vm_lock)
> > - return false;
> > -
> > - init_rwsem(&vma->vm_lock->lock);
> > - vma->vm_lock_seq = UINT_MAX;
> > -
> > - return true;
> > -}
> > -
> > -static inline void vma_lock_free(struct vm_area_struct *vma)
> > -{
> > - kmem_cache_free(vma_lock_cachep, vma->vm_lock);
> > -}
> > -
> > -#else /* CONFIG_PER_VMA_LOCK */
> > -
> > -static inline bool vma_lock_alloc(struct vm_area_struct *vma) { return true; }
> > -static inline void vma_lock_free(struct vm_area_struct *vma) {}
> > -
> > -#endif /* CONFIG_PER_VMA_LOCK */
> > -
> > struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
> > {
> > struct vm_area_struct *vma;
> > @@ -474,10 +445,6 @@ struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
> > return NULL;
> >
> > vma_init(vma, mm);
> > - if (!vma_lock_alloc(vma)) {
> > - kmem_cache_free(vm_area_cachep, vma);
> > - return NULL;
> > - }
> >
> > return vma;
> > }
> > @@ -496,10 +463,7 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
> > * will be reinitialized.
> > */
> > data_race(memcpy(new, orig, sizeof(*new)));
> > - if (!vma_lock_alloc(new)) {
> > - kmem_cache_free(vm_area_cachep, new);
> > - return NULL;
> > - }
> > + vma_lock_init(new);
> > INIT_LIST_HEAD(&new->anon_vma_chain);
> > vma_numab_state_init(new);
> > dup_anon_vma_name(orig, new);
> > @@ -511,7 +475,6 @@ void __vm_area_free(struct vm_area_struct *vma)
> > {
> > vma_numab_state_free(vma);
> > free_anon_vma_name(vma);
> > - vma_lock_free(vma);
> > kmem_cache_free(vm_area_cachep, vma);
> > }
> >
> > @@ -522,7 +485,7 @@ static void vm_area_free_rcu_cb(struct rcu_head *head)
> > vm_rcu);
> >
> > /* The vma should not be locked while being destroyed. */
> > - VM_BUG_ON_VMA(rwsem_is_locked(&vma->vm_lock->lock), vma);
> > + VM_BUG_ON_VMA(rwsem_is_locked(&vma->vm_lock.lock), vma);
> > __vm_area_free(vma);
> > }
> > #endif
> > @@ -3168,11 +3131,9 @@ void __init proc_caches_init(void)
> > sizeof(struct fs_struct), 0,
> > SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT,
> > NULL);
> > -
> > - vm_area_cachep = KMEM_CACHE(vm_area_struct, SLAB_PANIC|SLAB_ACCOUNT);
> > -#ifdef CONFIG_PER_VMA_LOCK
> > - vma_lock_cachep = KMEM_CACHE(vma_lock, SLAB_PANIC|SLAB_ACCOUNT);
> > -#endif
> > + vm_area_cachep = KMEM_CACHE(vm_area_struct,
> > + SLAB_HWCACHE_ALIGN|SLAB_NO_MERGE|SLAB_PANIC|
> > + SLAB_ACCOUNT);
>
> Why the SLAB_NO_MERGE?
Ah, I had it there for convenience, to be able to see the caches separately
in /proc/slabinfo. With SLAB_HWCACHE_ALIGN I don't think it matters
for cacheline sharing...
Once we add SLAB_TYPESAFE_BY_RCU, SLAB_NO_MERGE won't matter anyway
because SLAB_TYPESAFE_BY_RCU already prevents slab merging.
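To illustrate, a sketch of roughly where I expect this to end up once the
SLAB_TYPESAFE_BY_RCU patch lands (not the final form of that patch):

	/*
	 * SLAB_TYPESAFE_BY_RCU is one of the flags that already prevents
	 * slab merging, so SLAB_NO_MERGE becomes redundant at that point.
	 */
	vm_area_cachep = KMEM_CACHE(vm_area_struct,
			SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT|
			SLAB_TYPESAFE_BY_RCU);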
>
> > mmap_init();
> > nsproxy_cache_init();
> > }
> > --
> > 2.47.0.277.g8800431eea-goog
> >