From mboxrd@z Thu Jan 1 00:00:00 1970
From: Suren Baghdasaryan <surenb@google.com>
Date: Fri, 10 Jan 2025 07:56:49 -0800
Subject: Re: [PATCH v8 11/16] mm: replace vm_lock and detached flag with a reference count
To: Vlastimil Babka
Cc: akpm@linux-foundation.org, peterz@infradead.org, willy@infradead.org, liam.howlett@oracle.com, lorenzo.stoakes@oracle.com, mhocko@suse.com, hannes@cmpxchg.org, mjguzik@gmail.com, oliver.sang@intel.com, mgorman@techsingularity.net, david@redhat.com, peterx@redhat.com, oleg@redhat.com, dave@stgolabs.net, paulmck@kernel.org, brauner@kernel.org, dhowells@redhat.com, hdanton@sina.com, hughd@google.com, lokeshgidra@google.com, minchan@google.com, jannh@google.com, shakeel.butt@linux.dev, souravpanda@google.com, pasha.tatashin@soleen.com, klarasmodin@gmail.com, richard.weiyang@gmail.com, corbet@lwn.net, linux-doc@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, kernel-team@android.com
In-Reply-To: <95e9d80e-6c19-4a1f-9c21-307006858dff@suse.cz>
References: <20250109023025.2242447-1-surenb@google.com> <20250109023025.2242447-12-surenb@google.com> <95e9d80e-6c19-4a1f-9c21-307006858dff@suse.cz>
On Fri, Jan 10, 2025 at 6:33 AM Vlastimil Babka wrote:
>
> On 1/9/25 3:30 AM, Suren Baghdasaryan wrote:
> > rw_semaphore is a sizable structure of 40 bytes and consumes
> > considerable space for each vm_area_struct. However, vma_lock has
> > two important specifics which can be used to replace rw_semaphore
> > with a simpler structure:
> > 1. Readers never wait. They try to take the vma_lock and fall back to
> >    mmap_lock if that fails.
> > 2. Only one writer at a time will ever try to write-lock a vma_lock,
> >    because writers first take mmap_lock in write mode.
> > Because of these requirements, full rw_semaphore functionality is not
> > needed and we can replace rw_semaphore and the vma->detached flag with
> > a refcount (vm_refcnt).
> > When a vma is in the detached state, vm_refcnt is 0 and only a call to
> > vma_mark_attached() can take it out of this state. Note that unlike
> > before, we now enforce that both vma_mark_attached() and
> > vma_mark_detached() are done only after the vma has been write-locked.
> > vma_mark_attached() changes vm_refcnt to 1 to indicate that the vma has
> > been attached to the vma tree. When a reader takes the read lock, it
> > increments vm_refcnt, unless the top usable bit of vm_refcnt
> > (0x40000000) is set, indicating the presence of a writer. When a writer
> > takes the write lock, it sets the top usable bit to indicate its
> > presence. If there are readers, the writer will wait using the newly
> > introduced mm->vma_writer_wait. Since all writers take mmap_lock in
> > write mode first, there can be only one writer at a time. The last
> > reader to release the lock will signal the writer to wake up.
> > The refcount might overflow if there are many competing readers, in
> > which case read-locking will fail. Readers are expected to handle such
> > failures.
> > In summary:
> > 1. all readers increment the vm_refcnt;
> > 2. the writer sets the top usable (writer) bit of vm_refcnt;
> > 3. readers cannot increment the vm_refcnt if the writer bit is set;
> > 4. in the presence of readers, the writer must wait for the vm_refcnt
> >    to drop to 1 (ignoring the writer bit), indicating an attached vma
> >    with no readers;
> > 5. vm_refcnt overflow is handled by the readers.
> >
> > Suggested-by: Peter Zijlstra
> > Suggested-by: Matthew Wilcox
> > Signed-off-by: Suren Baghdasaryan
> > Reviewed-by: Vlastimil Babka
>
> But I think there's a problem that will manifest after patch 15.
> Also, I don't feel qualified enough about the lockdep parts
> (although I think I spotted another issue with those, below), so best if
> PeterZ can review those.
> Some nits below too.
>
> > +
> > +static inline void vma_refcount_put(struct vm_area_struct *vma)
> > +{
> > +	int oldcnt;
> > +
> > +	if (!__refcount_dec_and_test(&vma->vm_refcnt, &oldcnt)) {
> > +		rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
>
> Shouldn't we rwsem_release always? And also, shouldn't it precede the
> refcount operation itself?

Yes. Hillf pointed to the same issue. It will be fixed in the next version.

> > +		if (is_vma_writer_only(oldcnt - 1))
> > +			rcuwait_wake_up(&vma->vm_mm->vma_writer_wait);
>
> Hmm, should we maybe read the vm_mm pointer before dropping the
> refcount? In case this races in a way that is_vma_writer_only tests true
> but the writer meanwhile finishes and frees the vma. It's safe now, but
> not after making the cache SLAB_TYPESAFE_BY_RCU?

Hmm. But if is_vma_writer_only() is true, that means the writer is blocked
and is waiting for the reader to drop the vm_refcnt. IOW, it won't proceed
and free the vma until the reader calls rcuwait_wake_up(). Your suggested
change is trivial and I can do it, but I want to make sure I'm not missing
something. Am I?
> > +	}
> > +}
> > +
>
> >  static inline void vma_end_read(struct vm_area_struct *vma)
> >  {
> >  	rcu_read_lock(); /* keeps vma alive till the end of up_read */
>
> This should refer to vma_refcount_put(). But after fixing it, I think we
> could stop doing this altogether? It will no longer keep the vma "alive"
> with SLAB_TYPESAFE_BY_RCU.

Yeah, I think the comment, along with the rcu_read_lock()/rcu_read_unlock()
here, can be safely removed.

>
> > -	up_read(&vma->vm_lock.lock);
> > +	vma_refcount_put(vma);
> >  	rcu_read_unlock();
> >  }
> >
>
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -6370,9 +6370,41 @@ struct vm_area_struct *lock_mm_and_find_vma(struct mm_struct *mm,
> >  #endif
> >
> >  #ifdef CONFIG_PER_VMA_LOCK
> > +static inline bool __vma_enter_locked(struct vm_area_struct *vma, unsigned int tgt_refcnt)
> > +{
> > +	/*
> > +	 * If vma is detached then only vma_mark_attached() can raise the
> > +	 * vm_refcnt. mmap_write_lock prevents racing with vma_mark_attached().
> > +	 */
> > +	if (!refcount_add_not_zero(VMA_LOCK_OFFSET, &vma->vm_refcnt))
> > +		return false;
> > +
> > +	rwsem_acquire(&vma->vmlock_dep_map, 0, 0, _RET_IP_);
> > +	rcuwait_wait_event(&vma->vm_mm->vma_writer_wait,
> > +			   refcount_read(&vma->vm_refcnt) == tgt_refcnt,
> > +			   TASK_UNINTERRUPTIBLE);
> > +	lock_acquired(&vma->vmlock_dep_map, _RET_IP_);
> > +
> > +	return true;
> > +}
> > +
> > +static inline void __vma_exit_locked(struct vm_area_struct *vma, bool *detached)
> > +{
> > +	*detached = refcount_sub_and_test(VMA_LOCK_OFFSET, &vma->vm_refcnt);
> > +	rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
> > +}
> > +
> >  void __vma_start_write(struct vm_area_struct *vma, unsigned int mm_lock_seq)
> >  {
> > -	down_write(&vma->vm_lock.lock);
> > +	bool locked;
> > +
> > +	/*
> > +	 * __vma_enter_locked() returns false immediately if the vma is not
> > +	 * attached, otherwise it waits until refcnt is (VMA_LOCK_OFFSET + 1),
> > +	 * indicating that the vma is attached with no readers.
> > +	 */
> > +	locked = __vma_enter_locked(vma, VMA_LOCK_OFFSET + 1);
>
> Wonder if it would be slightly better if tgt_refcnt was just 1 (or 0
> below in vma_mark_detached()) and the VMA_LOCK_OFFSET added to it in
> __vma_enter_locked() itself, as it's the one adding it in the first place.

Well, it wouldn't be called tgt_refcnt then. Maybe "bool vma_attached", and
inside __vma_enter_locked() we do:

	unsigned int tgt_refcnt = VMA_LOCK_OFFSET + (vma_attached ? 1 : 0);

Is that better?

>