From: Suren Baghdasaryan
Date: Fri, 10 Jan 2025 08:50:40 -0800
Subject: Re: [PATCH v8 11/16] mm: replace vm_lock and detached flag with a reference count
To: Vlastimil Babka
Cc: akpm@linux-foundation.org, peterz@infradead.org, willy@infradead.org, liam.howlett@oracle.com, lorenzo.stoakes@oracle.com, mhocko@suse.com, hannes@cmpxchg.org, mjguzik@gmail.com, oliver.sang@intel.com, mgorman@techsingularity.net, david@redhat.com, peterx@redhat.com, oleg@redhat.com, dave@stgolabs.net, paulmck@kernel.org, brauner@kernel.org, dhowells@redhat.com, hdanton@sina.com, hughd@google.com, lokeshgidra@google.com, minchan@google.com, jannh@google.com, shakeel.butt@linux.dev, souravpanda@google.com, pasha.tatashin@soleen.com, klarasmodin@gmail.com, richard.weiyang@gmail.com, corbet@lwn.net, linux-doc@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, kernel-team@android.com
References: <20250109023025.2242447-1-surenb@google.com> <20250109023025.2242447-12-surenb@google.com> <95e9d80e-6c19-4a1f-9c21-307006858dff@suse.cz>

On Fri, Jan 10, 2025 at 8:47 AM Suren Baghdasaryan wrote:
>
> On Fri, Jan 10, 2025 at 7:56 AM Suren Baghdasaryan wrote:
> >
> > On Fri, Jan 10, 2025 at 6:33 AM Vlastimil Babka wrote:
> > >
> > > On 1/9/25 3:30 AM, Suren Baghdasaryan wrote:
> > > > rw_semaphore is a sizable structure of 40 bytes and consumes
> > > > considerable space for each vm_area_struct. However vma_lock has
> > > > two important specifics which can be used to replace rw_semaphore
> > > > with a simpler structure:
> > > > 1. Readers never wait. They try to take the vma_lock and fall back to
> > > > mmap_lock if that fails.
> > > > 2. Only one writer at a time will ever try to write-lock a vma_lock
> > > > because writers first take mmap_lock in write mode.
> > > > Because of these requirements, full rw_semaphore functionality is not
> > > > needed and we can replace rw_semaphore and the vma->detached flag with
> > > > a refcount (vm_refcnt).
> > > > When vma is in detached state, vm_refcnt is 0 and only a call to
> > > > vma_mark_attached() can take it out of this state. Note that unlike
> > > > before, now we enforce both vma_mark_attached() and vma_mark_detached()
> > > > to be done only after vma has been write-locked. vma_mark_attached()
> > > > changes vm_refcnt to 1 to indicate that it has been attached to the vma
> > > > tree. When a reader takes read lock, it increments vm_refcnt, unless the
> > > > top usable bit of vm_refcnt (0x40000000) is set, indicating presence of
> > > > a writer. When writer takes write lock, it sets the top usable bit to
> > > > indicate its presence. If there are readers, writer will wait using newly
> > > > introduced mm->vma_writer_wait. Since all writers take mmap_lock in write
> > > > mode first, there can be only one writer at a time. The last reader to
> > > > release the lock will signal the writer to wake up.
> > > > refcount might overflow if there are many competing readers, in which case
> > > > read-locking will fail. Readers are expected to handle such failures.
> > > > In summary:
> > > > 1. all readers increment the vm_refcnt;
> > > > 2. writer sets top usable (writer) bit of vm_refcnt;
> > > > 3. readers cannot increment the vm_refcnt if the writer bit is set;
> > > > 4. in the presence of readers, writer must wait for the vm_refcnt to drop
> > > > to 1 (ignoring the writer bit), indicating an attached vma with no readers;
> > > > 5. vm_refcnt overflow is handled by the readers.
> > > >
> > > > Suggested-by: Peter Zijlstra
> > > > Suggested-by: Matthew Wilcox
> > > > Signed-off-by: Suren Baghdasaryan
> > >
> > > Reviewed-by: Vlastimil Babka
> > >
> > > But I think there's a problem that will manifest after patch 15.
> > > Also I don't feel qualified enough about the lockdep parts though
> > > (although I think I spotted another issue with those, below) so best if
> > > PeterZ can review those.
> > > Some nits below too.
> > >
> > > > +
> > > > +static inline void vma_refcount_put(struct vm_area_struct *vma)
> > > > +{
> > > > +	int oldcnt;
> > > > +
> > > > +	if (!__refcount_dec_and_test(&vma->vm_refcnt, &oldcnt)) {
> > > > +		rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
> > >
> > > Shouldn't we rwsem_release always? And also shouldn't it precede the
> > > refcount operation itself?
> >
> > Yes. Hillf pointed to the same issue. It will be fixed in the next version.
> >
> > >
> > > > +		if (is_vma_writer_only(oldcnt - 1))
> > > > +			rcuwait_wake_up(&vma->vm_mm->vma_writer_wait);
> > >
> > > Hmm hmm we should maybe read the vm_mm pointer before dropping the
> > > refcount? In case this races in a way that is_vma_writer_only tests true
> > > but the writer meanwhile finishes and frees the vma. It's safe now but
> > > not after making the cache SLAB_TYPESAFE_BY_RCU ?
> >
> > Hmm.
> > But if is_vma_writer_only() is true that means the writer is
> > blocked and is waiting for the reader to drop the vm_refcnt. IOW, it
> > won't proceed and free the vma until the reader calls
> > rcuwait_wake_up(). Your suggested change is trivial and I can do it
> > but I want to make sure I'm not missing something. Am I?
>
> Ok, after thinking some more, I think the race you might be referring
> to is this:
>
> writer                                           reader
>
>     __vma_enter_locked
>         refcount_add_not_zero(VMA_LOCK_OFFSET, ...)
>                                                  vma_refcount_put
>                                                      __refcount_dec_and_test()
>                                                          if
> (is_vma_writer_only())
>         rcuwait_wait_event(&vma->vm_mm->vma_writer_wait, ...)
>     __vma_exit_locked
>         refcount_sub_and_test(VMA_LOCK_OFFSET, ...)
>     free the vma
>
> rcuwait_wake_up(&vma->vm_mm->vma_writer_wait);

Sorry, this should be more readable:

writer                                  reader

__vma_enter_locked
    refcount_add_not_zero(VMA_LOCK_OFFSET, ...)
                                        vma_refcount_put
                                            __refcount_dec_and_test()
                                            if (is_vma_writer_only())
    rcuwait_wait_event()
__vma_exit_locked
    refcount_sub_and_test(VMA_LOCK_OFFSET, ...)
free the vma
                                            rcuwait_wake_up(); <-- access to vma->vm_mm

>
> I think it's possible and your suggestion of storing the mm before
> doing __refcount_dec_and_test() should work. Thanks for pointing this
> out! I'll fix it in the next version.
>
> >
> > >
> > > > +	}
> > > > +}
> > > > +
> > >
> > > >  static inline void vma_end_read(struct vm_area_struct *vma)
> > > >  {
> > > >  	rcu_read_lock(); /* keeps vma alive till the end of up_read */
> > >
> > > This should refer to vma_refcount_put(). But after fixing it I think we
> > > could stop doing this altogether? It will no longer keep vma "alive"
> > > with SLAB_TYPESAFE_BY_RCU.
> >
> > Yeah, I think the comment along with rcu_read_lock()/rcu_read_unlock()
> > here can be safely removed.
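
For the vma_refcount_put() race above, here is a rough userspace sketch of "store the mm before dropping the refcount". It is only a model, not the kernel code: struct mm_model, struct vma_model, the wakeups counter (standing in for rcuwait_wake_up() on mm->vma_writer_wait) and the body of writer_only() are all assumptions made for illustration, and C11 atomics replace refcount_t.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define VMA_LOCK_OFFSET 0x40000000

/* Hypothetical stand-ins for mm_struct and vm_area_struct. */
struct mm_model {
	atomic_int wakeups;	/* counts what would be rcuwait_wake_up() calls */
};

struct vma_model {
	atomic_int vm_refcnt;
	struct mm_model *vm_mm;
};

/*
 * Assumed definition: writer bit set and nothing left besides the
 * "attached" reference, i.e. no readers remain.
 */
static bool writer_only(int refcnt)
{
	return (refcnt & VMA_LOCK_OFFSET) && refcnt <= VMA_LOCK_OFFSET + 1;
}

/* Returns true if the count dropped to zero. */
static bool vma_refcount_put_model(struct vma_model *vma)
{
	/* Read vm_mm while we still hold our reference. */
	struct mm_model *mm = vma->vm_mm;
	int oldcnt = atomic_fetch_sub(&vma->vm_refcnt, 1);

	assert(oldcnt > 0);	/* a put on a zero count would be a bug */

	/* From here on the writer may proceed; do not touch *vma again. */
	if (oldcnt == 1)
		return true;

	if (writer_only(oldcnt - 1))
		atomic_fetch_add(&mm->wakeups, 1);

	return false;
}
```

With vm_refcnt at VMA_LOCK_OFFSET + 2 (attached, one reader, writer waiting), the put brings it to VMA_LOCK_OFFSET + 1 and signals the writer through the cached mm pointer without dereferencing the vma again.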
> >
> > >
> > > > -	up_read(&vma->vm_lock.lock);
> > > > +	vma_refcount_put(vma);
> > > >  	rcu_read_unlock();
> > > >  }
> > > >
> > >
> > > > --- a/mm/memory.c
> > > > +++ b/mm/memory.c
> > > > @@ -6370,9 +6370,41 @@ struct vm_area_struct *lock_mm_and_find_vma(struct mm_struct *mm,
> > > >  #endif
> > > >
> > > >  #ifdef CONFIG_PER_VMA_LOCK
> > > > +static inline bool __vma_enter_locked(struct vm_area_struct *vma, unsigned int tgt_refcnt)
> > > > +{
> > > > +	/*
> > > > +	 * If vma is detached then only vma_mark_attached() can raise the
> > > > +	 * vm_refcnt. mmap_write_lock prevents racing with vma_mark_attached().
> > > > +	 */
> > > > +	if (!refcount_add_not_zero(VMA_LOCK_OFFSET, &vma->vm_refcnt))
> > > > +		return false;
> > > > +
> > > > +	rwsem_acquire(&vma->vmlock_dep_map, 0, 0, _RET_IP_);
> > > > +	rcuwait_wait_event(&vma->vm_mm->vma_writer_wait,
> > > > +		   refcount_read(&vma->vm_refcnt) == tgt_refcnt,
> > > > +		   TASK_UNINTERRUPTIBLE);
> > > > +	lock_acquired(&vma->vmlock_dep_map, _RET_IP_);
> > > > +
> > > > +	return true;
> > > > +}
> > > > +
> > > > +static inline void __vma_exit_locked(struct vm_area_struct *vma, bool *detached)
> > > > +{
> > > > +	*detached = refcount_sub_and_test(VMA_LOCK_OFFSET, &vma->vm_refcnt);
> > > > +	rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
> > > > +}
> > > > +
> > > >  void __vma_start_write(struct vm_area_struct *vma, unsigned int mm_lock_seq)
> > > >  {
> > > > -	down_write(&vma->vm_lock.lock);
> > > > +	bool locked;
> > > > +
> > > > +	/*
> > > > +	 * __vma_enter_locked() returns false immediately if the vma is not
> > > > +	 * attached, otherwise it waits until refcnt is (VMA_LOCK_OFFSET + 1)
> > > > +	 * indicating that vma is attached with no readers.
> > > > +	 */
> > > > +	locked = __vma_enter_locked(vma, VMA_LOCK_OFFSET + 1);
> > >
> > > Wonder if it would be slightly better if tgt_refcount was just 1 (or 0
> > > below in vma_mark_detached()) and the VMA_LOCK_OFFSET added to it in
> > > __vma_enter_locked() itself as it's the one adding it in the first place.
> >
> > Well, it won't be called tgt_refcount then. Maybe "bool vma_attached"
> > and inside __vma_enter_locked() we do:
> >
> > unsigned int tgt_refcnt = VMA_LOCK_OFFSET + (vma_attached ? 1 : 0);
> >
> > Is that better?
> >
> > >
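
One note on the tgt_refcnt expression: in C, `+` binds tighter than `?:`, so the conditional needs its own parentheses to produce VMA_LOCK_OFFSET or VMA_LOCK_OFFSET + 1. A small standalone sketch of the difference follows. The helper names are made up for illustration, and VMA_LOCK_OFFSET is taken as the 0x40000000 bit from the commit message.

```c
#include <assert.h>
#include <stdbool.h>

#define VMA_LOCK_OFFSET 0x40000000u

/*
 * How "VMA_LOCK_OFFSET + vma_attached ? 1 : 0" groups without extra
 * parentheses: the sum is the condition, so the result is always 1.
 */
static unsigned int tgt_refcnt_misgrouped(bool vma_attached)
{
	return (VMA_LOCK_OFFSET + vma_attached) ? 1 : 0;
}

/* Intended value: writer bit plus one reference if the vma is attached. */
static unsigned int tgt_refcnt(bool vma_attached)
{
	return VMA_LOCK_OFFSET + (vma_attached ? 1 : 0);
}

/* Usage example: the targets a writer would wait for in each case. */
static void check_tgt_refcnt(void)
{
	assert(tgt_refcnt(true) == VMA_LOCK_OFFSET + 1);	/* __vma_start_write() */
	assert(tgt_refcnt(false) == VMA_LOCK_OFFSET);		/* vma_mark_detached() */
	assert(tgt_refcnt_misgrouped(true) == 1);
	assert(tgt_refcnt_misgrouped(false) == 1);
}
```

An explicit `if (vma_attached) tgt_refcnt++;` would avoid the precedence question altogether, at the cost of one more line.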