From: Suren Baghdasaryan <surenb@google.com>
Date: Tue, 17 Jan 2023 13:08:30 -0800
Subject: Re: [PATCH 12/41] mm: add per-VMA lock and helper functions to control it
To: Michal Hocko
Cc: akpm@linux-foundation.org, michel@lespinasse.org, jglisse@google.com,
	vbabka@suse.cz, hannes@cmpxchg.org, mgorman@techsingularity.net,
	dave@stgolabs.net, willy@infradead.org, liam.howlett@oracle.com,
	peterz@infradead.org, ldufour@linux.ibm.com, laurent.dufour@fr.ibm.com,
	paulmck@kernel.org, luto@kernel.org, songliubraving@fb.com,
	peterx@redhat.com, david@redhat.com, dhowells@redhat.com,
	hughd@google.com, bigeasy@linutronix.de, kent.overstreet@linux.dev,
	punit.agrawal@bytedance.com, lstoakes@gmail.com, peterjung1337@gmail.com,
	rientjes@google.com, axelrasmussen@google.com, joelaf@google.com,
	minchan@google.com, jannh@google.com, shakeelb@google.com,
	tatashin@google.com, edumazet@google.com, gthelen@google.com,
	gurua@google.com, arjunroy@google.com, soheil@google.com,
	hughlynch@google.com, leewalsh@google.com, posk@google.com,
	linux-mm@kvack.org, linux-arm-kernel@lists.infradead.org,
	linuxppc-dev@lists.ozlabs.org, x86@kernel.org,
	linux-kernel@vger.kernel.org, kernel-team@android.com
References: <20230109205336.3665937-1-surenb@google.com> <20230109205336.3665937-13-surenb@google.com>

On Tue, Jan 17, 2023 at 7:04 AM Michal Hocko wrote:
>
> On Mon 09-01-23 12:53:07, Suren Baghdasaryan wrote:
> > Introduce a per-VMA rw_semaphore to be used during page fault handling
> > instead of mmap_lock. Because there are cases when multiple VMAs need
> > to be exclusively locked during VMA tree modifications, instead of the
> > usual lock/unlock pattern we mark a VMA as locked by taking the per-VMA
> > lock exclusively and setting vma->lock_seq to the current mm->lock_seq.
> > When the mmap_write_lock holder is done with all modifications and
> > drops mmap_lock, it will increment mm->lock_seq, effectively unlocking
> > all VMAs marked as locked.
>
> I have to say I was struggling a bit with the above and only understood
> what you mean by reading the patch several times. I would phrase it like
> this (feel free to use it if you consider this an improvement).
>
> Introduce a per-VMA rw_semaphore. The lock implementation relies on
> per-vma and per-mm sequence counters to note exclusive locking:
>   - read lock - (implemented by vma_read_trylock) requires the
>     vma (vm_lock_seq) and mm (mm_lock_seq) sequence counters to
>     differ. If they match then there must be a vma exclusive lock
>     held somewhere.
>   - read unlock - (implemented by vma_read_unlock) is a trivial
>     vma->lock unlock.
>   - write lock - (vma_write_lock) requires the mmap_lock to be
>     held exclusively; the current mm counter is noted on the vma
>     side. This allows multiple vmas to be locked under a single
>     mmap_lock write lock (e.g. during vma merging). The vma counter
>     is modified under the exclusive vma lock.
>   - write unlock - (vma_write_unlock_mm) is a batch release of all
>     vma locks held. It doesn't pair with a specific
>     vma_write_lock! It is done before the exclusive mmap_lock is
>     released by incrementing the mm sequence counter (mm_lock_seq).
>   - write downgrade - if the mmap_lock is downgraded to the read
>     lock, all vma write locks are released as well (effectively the
>     same as write unlock).

Thanks for the suggestion, Michal. I'll definitely reuse your description.

> >
> > VMA lock is placed on the cache line boundary so that its 'count' field
> > falls into the first cache line while the rest of the fields fall into
> > the second cache line. This lets the 'count' field be cached with
> > other frequently accessed fields and used quickly in the uncontended
> > case, while 'owner' and the other fields used in the contended case
> > will not invalidate the first cache line while waiting on the lock.
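
To make the read-side protocol concrete before the diff: a fault handler
is expected to try the cheap per-VMA read lock first and take the
existing mmap_lock path whenever vma_read_trylock() reports failure,
including its occasional false "locked" results. A minimal sketch
(illustration only, not code from this patch; find_vma_nolock() is a
hypothetical RCU-safe lookup standing in for the helper introduced later
in the series):

	/* Illustration only; lock helper names are from the patch below. */
	static vm_fault_t fault_with_per_vma_lock(struct mm_struct *mm,
						  unsigned long addr,
						  struct pt_regs *regs)
	{
		struct vm_area_struct *vma;
		vm_fault_t ret;

		rcu_read_lock();
		vma = find_vma_nolock(mm, addr); /* hypothetical RCU-safe lookup */
		if (vma && vma_read_trylock(vma)) {
			rcu_read_unlock();
			ret = handle_mm_fault(vma, addr, 0, regs);
			vma_read_unlock(vma);
			return ret;
		}
		rcu_read_unlock();

		/* Fallback: the classic mmap_lock-protected fault path. */
		mmap_read_lock(mm);
		vma = find_vma(mm, addr);
		ret = vma ? handle_mm_fault(vma, addr, 0, regs) : VM_FAULT_SIGSEGV;
		mmap_read_unlock(mm);
		return ret;
	}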
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > ---
> >  include/linux/mm.h        | 80 +++++++++++++++++++++++++++++++++++++++
> >  include/linux/mm_types.h  |  8 ++++
> >  include/linux/mmap_lock.h | 13 +++++++
> >  kernel/fork.c             |  4 ++
> >  mm/init-mm.c              |  3 ++
> >  5 files changed, 108 insertions(+)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index f3f196e4d66d..ec2c4c227d51 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -612,6 +612,85 @@ struct vm_operations_struct {
> >  					unsigned long addr);
> >  };
> >
> > +#ifdef CONFIG_PER_VMA_LOCK
> > +static inline void vma_init_lock(struct vm_area_struct *vma)
> > +{
> > +	init_rwsem(&vma->lock);
> > +	vma->vm_lock_seq = -1;
> > +}
> > +
> > +static inline void vma_write_lock(struct vm_area_struct *vma)
> > +{
> > +	int mm_lock_seq;
> > +
> > +	mmap_assert_write_locked(vma->vm_mm);
> > +
> > +	/*
> > +	 * current task is holding mmap_write_lock, both vma->vm_lock_seq and
> > +	 * mm->mm_lock_seq can't be concurrently modified.
> > +	 */
> > +	mm_lock_seq = READ_ONCE(vma->vm_mm->mm_lock_seq);
> > +	if (vma->vm_lock_seq == mm_lock_seq)
> > +		return;
> > +
> > +	down_write(&vma->lock);
> > +	vma->vm_lock_seq = mm_lock_seq;
> > +	up_write(&vma->lock);
> > +}
> > +
> > +/*
> > + * Try to read-lock a vma. The function is allowed to occasionally yield false
> > + * locked result to avoid performance overhead, in which case we fall back to
> > + * using mmap_lock. The function should never yield false unlocked result.
> > + */
> > +static inline bool vma_read_trylock(struct vm_area_struct *vma)
> > +{
> > +	/* Check before locking. A race might cause false locked result. */
> > +	if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
> > +		return false;
> > +
> > +	if (unlikely(down_read_trylock(&vma->lock) == 0))
> > +		return false;
> > +
> > +	/*
> > +	 * Overflow might produce false locked result.
> > +	 * False unlocked result is impossible because we modify and check
> > +	 * vma->vm_lock_seq under vma->lock protection and mm->mm_lock_seq
> > +	 * modification invalidates all existing locks.
> > +	 */
> > +	if (unlikely(vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
> > +		up_read(&vma->lock);
> > +		return false;
> > +	}
> > +	return true;
> > +}
> > +
> > +static inline void vma_read_unlock(struct vm_area_struct *vma)
> > +{
> > +	up_read(&vma->lock);
> > +}
> > +
> > +static inline void vma_assert_write_locked(struct vm_area_struct *vma)
> > +{
> > +	mmap_assert_write_locked(vma->vm_mm);
> > +	/*
> > +	 * current task is holding mmap_write_lock, both vma->vm_lock_seq and
> > +	 * mm->mm_lock_seq can't be concurrently modified.
> > +	 */
> > +	VM_BUG_ON_VMA(vma->vm_lock_seq != READ_ONCE(vma->vm_mm->mm_lock_seq), vma);
> > +}
> > +
> > +#else /* CONFIG_PER_VMA_LOCK */
> > +
> > +static inline void vma_init_lock(struct vm_area_struct *vma) {}
> > +static inline void vma_write_lock(struct vm_area_struct *vma) {}
> > +static inline bool vma_read_trylock(struct vm_area_struct *vma)
> > +		{ return false; }
> > +static inline void vma_read_unlock(struct vm_area_struct *vma) {}
> > +static inline void vma_assert_write_locked(struct vm_area_struct *vma) {}
> > +
> > +#endif /* CONFIG_PER_VMA_LOCK */
> > +
> >  static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
> >  {
> >  	static const struct vm_operations_struct dummy_vm_ops = {};
> > @@ -620,6 +699,7 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
> >  	vma->vm_mm = mm;
> >  	vma->vm_ops = &dummy_vm_ops;
> >  	INIT_LIST_HEAD(&vma->anon_vma_chain);
> > +	vma_init_lock(vma);
> >  }
> >
> >  static inline void vma_set_anonymous(struct vm_area_struct *vma)
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index d5cdec1314fe..5f7c5ca89931 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -555,6 +555,11 @@ struct vm_area_struct {
> >  	pgprot_t vm_page_prot;
> >  	unsigned long vm_flags;		/* Flags, see mm.h. */
> >
> > +#ifdef CONFIG_PER_VMA_LOCK
> > +	int vm_lock_seq;
> > +	struct rw_semaphore lock;
> > +#endif
> > +
> >  	/*
> >  	 * For areas with an address space and backing store,
> >  	 * linkage into the address_space->i_mmap interval tree.
> > @@ -680,6 +685,9 @@ struct mm_struct {
> >  					  * init_mm.mmlist, and are protected
> >  					  * by mmlist_lock
> >  					  */
> > +#ifdef CONFIG_PER_VMA_LOCK
> > +		int mm_lock_seq;
> > +#endif
> >
> >
> >  		unsigned long hiwater_rss; /* High-watermark of RSS usage */
> > diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
> > index e49ba91bb1f0..40facd4c398b 100644
> > --- a/include/linux/mmap_lock.h
> > +++ b/include/linux/mmap_lock.h
> > @@ -72,6 +72,17 @@ static inline void mmap_assert_write_locked(struct mm_struct *mm)
> >  	VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
> >  }
> >
> > +#ifdef CONFIG_PER_VMA_LOCK
> > +static inline void vma_write_unlock_mm(struct mm_struct *mm)
> > +{
> > +	mmap_assert_write_locked(mm);
> > +	/* No races during update due to exclusive mmap_lock being held */
> > +	WRITE_ONCE(mm->mm_lock_seq, mm->mm_lock_seq + 1);
> > +}
> > +#else
> > +static inline void vma_write_unlock_mm(struct mm_struct *mm) {}
> > +#endif
> > +
> >  static inline void mmap_init_lock(struct mm_struct *mm)
> >  {
> >  	init_rwsem(&mm->mmap_lock);
> > @@ -114,12 +125,14 @@ static inline bool mmap_write_trylock(struct mm_struct *mm)
> >  static inline void mmap_write_unlock(struct mm_struct *mm)
> >  {
> >  	__mmap_lock_trace_released(mm, true);
> > +	vma_write_unlock_mm(mm);
> >  	up_write(&mm->mmap_lock);
> >  }
> >
> >  static inline void mmap_write_downgrade(struct mm_struct *mm)
> >  {
> >  	__mmap_lock_trace_acquire_returned(mm, false, true);
> > +	vma_write_unlock_mm(mm);
> >  	downgrade_write(&mm->mmap_lock);
> >  }
> >
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index 5986817f393c..c026d75108b3 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -474,6 +474,7 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
> >  		 */
> >  		*new = data_race(*orig);
> >  		INIT_LIST_HEAD(&new->anon_vma_chain);
> > +		vma_init_lock(new);
> >  		dup_anon_vma_name(orig, new);
> >  	}
> >  	return new;
> > @@ -1145,6 +1146,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
> >  	seqcount_init(&mm->write_protect_seq);
> >  	mmap_init_lock(mm);
> >  	INIT_LIST_HEAD(&mm->mmlist);
> > +#ifdef CONFIG_PER_VMA_LOCK
> > +	WRITE_ONCE(mm->mm_lock_seq, 0);
> > +#endif
> >  	mm_pgtables_bytes_init(mm);
> >  	mm->map_count = 0;
> >  	mm->locked_vm = 0;
> > diff --git a/mm/init-mm.c b/mm/init-mm.c
> > index c9327abb771c..33269314e060 100644
> > --- a/mm/init-mm.c
> > +++ b/mm/init-mm.c
> > @@ -37,6 +37,9 @@ struct mm_struct init_mm = {
> >  	.page_table_lock =  __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock),
> >  	.arg_lock	=  __SPIN_LOCK_UNLOCKED(init_mm.arg_lock),
> >  	.mmlist		= LIST_HEAD_INIT(init_mm.mmlist),
> > +#ifdef CONFIG_PER_VMA_LOCK
> > +	.mm_lock_seq	= 0,
> > +#endif
> >  	.user_ns	= &init_user_ns,
> >  	.cpu_bitmap	= CPU_BITS_NONE,
> >  #ifdef CONFIG_IOMMU_SVA
> > --
> > 2.39.0
>
> --
> Michal Hocko
> SUSE Labs
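
As a footnote on the write side (a sketch under the same caveat, not
code from the series): any number of VMAs can be write-locked under one
mmap_write_lock, and the patched mmap_write_unlock() releases them all
at once by bumping mm_lock_seq. Using only helpers from this patch plus
the existing VMA iterator:

	static void lock_all_vmas_for_update(struct mm_struct *mm)
	{
		VMA_ITERATOR(vmi, mm, 0);
		struct vm_area_struct *vma;

		mmap_write_lock(mm);
		for_each_vma(vmi, vma)
			vma_write_lock(vma);	/* stamps vma->vm_lock_seq */
		/* ... modify the VMA tree ... */
		mmap_write_unlock(mm);		/* vma_write_unlock_mm() bumps
						 * mm->mm_lock_seq, releasing all
						 * VMAs stamped above at once */
	}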