From mboxrd@z Thu Jan 1 00:00:00 1970
From: Suren Baghdasaryan
Date: Wed, 8 Jan 2025 09:53:37 -0800
Subject: Re: [PATCH v7 12/17] mm: replace vm_lock and detached flag with a reference count
To: Vlastimil Babka
Cc: akpm@linux-foundation.org, peterz@infradead.org, willy@infradead.org,
	liam.howlett@oracle.com, lorenzo.stoakes@oracle.com, mhocko@suse.com,
	hannes@cmpxchg.org, mjguzik@gmail.com, oliver.sang@intel.com,
	mgorman@techsingularity.net, david@redhat.com, peterx@redhat.com,
	oleg@redhat.com, dave@stgolabs.net, paulmck@kernel.org,
	brauner@kernel.org, dhowells@redhat.com, hdanton@sina.com,
	hughd@google.com, lokeshgidra@google.com, minchan@google.com,
	jannh@google.com, shakeel.butt@linux.dev, souravpanda@google.com,
	pasha.tatashin@soleen.com, klarasmodin@gmail.com, corbet@lwn.net,
	linux-doc@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, kernel-team@android.com
References: <20241226170710.1159679-1-surenb@google.com> <20241226170710.1159679-13-surenb@google.com>
Content-Type: text/plain; charset="UTF-8"
Sender: owner-linux-mm@kvack.org

On Wed, Jan 8, 2025 at 3:52 AM Vlastimil Babka wrote:
>
> On 12/26/24 18:07, Suren Baghdasaryan wrote:
> > rw_semaphore
is a sizable structure of 40 bytes and consumes
> > considerable space for each vm_area_struct. However vma_lock has
> > two important specifics which can be used to replace rw_semaphore
> > with a simpler structure:
> > 1. Readers never wait. They try to take the vma_lock and fall back to
> > mmap_lock if that fails.
> > 2. Only one writer at a time will ever try to write-lock a vma_lock
> > because writers first take mmap_lock in write mode.
> > Because of these requirements, full rw_semaphore functionality is not
> > needed and we can replace rw_semaphore and the vma->detached flag with
> > a refcount (vm_refcnt).
> > When vma is in detached state, vm_refcnt is 0 and only a call to
> > vma_mark_attached() can take it out of this state. Note that unlike
> > before, now we enforce both vma_mark_attached() and vma_mark_detached()
> > to be done only after vma has been write-locked. vma_mark_attached()
> > changes vm_refcnt to 1 to indicate that it has been attached to the vma
> > tree. When a reader takes read lock, it increments vm_refcnt, unless the
> > top usable bit of vm_refcnt (0x40000000) is set, indicating presence of
> > a writer. When writer takes write lock, it both increments vm_refcnt and
> > sets the top usable bit to indicate its presence. If there are readers,
> > writer will wait using newly introduced mm->vma_writer_wait. Since all
> > writers take mmap_lock in write mode first, there can be only one writer
> > at a time. The last reader to release the lock will signal the writer
> > to wake up.
> > refcount might overflow if there are many competing readers, in which case
> > read-locking will fail. Readers are expected to handle such failures.
> >
> > Suggested-by: Peter Zijlstra
> > Suggested-by: Matthew Wilcox
> > Signed-off-by: Suren Baghdasaryan
>
> >  	 */
> >  static inline bool vma_start_read(struct vm_area_struct *vma)
> >  {
> > +	int oldcnt;
> > +
> >  	/*
> >  	 * Check before locking. A race might cause false locked result.
> >  	 * We can use READ_ONCE() for the mm_lock_seq here, and don't need
> > @@ -720,13 +745,20 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> >  	if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq.sequence))
> >  		return false;
> >
> > -	if (unlikely(down_read_trylock(&vma->vm_lock.lock) == 0))
> > +
> > +	rwsem_acquire_read(&vma->vmlock_dep_map, 0, 0, _RET_IP_);
>
> I don't know much about lockdep, but I see that down_read() does
>
> rwsem_acquire_read(&sem->dep_map, 0, 0, _RET_IP_);
>
> down_read_trylock() does
>
> rwsem_acquire_read(&sem->dep_map, 0, 1, _RET_IP_);
>
> This is passing the down_read()-like variant but it behaves like a trylock, no?

Yes, you are correct, this should behave like a trylock. I'll fix it.

> > +	/* Limit at VMA_REF_LIMIT to leave one count for a writer */
>
> It's mainly to not increase so much that the VMA_LOCK_OFFSET bit could
> become false-positively set by readers, right?

Correct.

> The "leave one count" sounds
> like an implementation detail of VMA_REF_LIMIT and will change if Liam's
> suggestion is proven feasible?

Yes. I already tested Liam's suggestion and it seems to be working fine.
This comment will be gone in the next revision.

> > +	if (unlikely(!__refcount_inc_not_zero_limited(&vma->vm_refcnt, &oldcnt,
> > +						      VMA_REF_LIMIT))) {
> > +		rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
> >  		return false;
> > +	}
> > +	lock_acquired(&vma->vmlock_dep_map, _RET_IP_);
> >
> >  	/*
> > -	 * Overflow might produce false locked result.
> > +	 * Overflow of vm_lock_seq/mm_lock_seq might produce false locked result.
> >  	 * False unlocked result is impossible because we modify and check
> > -	 * vma->vm_lock_seq under vma->vm_lock protection and mm->mm_lock_seq
> > +	 * vma->vm_lock_seq under vma->vm_refcnt protection and mm->mm_lock_seq
> >  	 * modification invalidates all existing locks.
> >  	 *
> >  	 * We must use ACQUIRE semantics for the mm_lock_seq so that if we are
> > @@ -734,10 +766,12 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> >  	 * after it has been unlocked.
> >  	 * This pairs with RELEASE semantics in vma_end_write_all().
> >  	 */
> > -	if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&vma->vm_mm->mm_lock_seq))) {
> > -		up_read(&vma->vm_lock.lock);
> > +	if (unlikely(oldcnt & VMA_LOCK_OFFSET ||
> > +		     vma->vm_lock_seq == raw_read_seqcount(&vma->vm_mm->mm_lock_seq))) {
> > +		vma_refcount_put(vma);
> >  		return false;
> >  	}
> > +
> >  	return true;
> >  }
> >
> > @@ -749,8 +783,17 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> >  	 */
> >  static inline bool vma_start_read_locked_nested(struct vm_area_struct *vma, int subclass)
> >  {
> > +	int oldcnt;
> > +
> >  	mmap_assert_locked(vma->vm_mm);
> > -	down_read_nested(&vma->vm_lock.lock, subclass);
> > +	rwsem_acquire_read(&vma->vmlock_dep_map, subclass, 0, _RET_IP_);
>
> Same as above?

Ack.

> > +	/* Limit at VMA_REF_LIMIT to leave one count for a writer */
>
> Also

Ack.

> > +	if (unlikely(!__refcount_inc_not_zero_limited(&vma->vm_refcnt, &oldcnt,
> > +						      VMA_REF_LIMIT))) {
> > +		rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
> > +		return false;
> > +	}
> > +	lock_acquired(&vma->vmlock_dep_map, _RET_IP_);
> >  	return true;
> >  }
> >