From: Suren Baghdasaryan <surenb@google.com>
Date: Wed, 18 Dec 2024 07:57:17 -0800
Subject: Re: [PATCH v6 10/16] mm: replace vm_lock and detached flag with a reference count
To: "Liam R. 
Howlett" , Peter Zijlstra , Suren Baghdasaryan , akpm@linux-foundation.org, willy@infradead.org, lorenzo.stoakes@oracle.com, mhocko@suse.com, vbabka@suse.cz, hannes@cmpxchg.org, mjguzik@gmail.com, oliver.sang@intel.com, mgorman@techsingularity.net, david@redhat.com, peterx@redhat.com, oleg@redhat.com, dave@stgolabs.net, paulmck@kernel.org, brauner@kernel.org, dhowells@redhat.com, hdanton@sina.com, hughd@google.com, lokeshgidra@google.com, minchan@google.com, jannh@google.com, shakeel.butt@linux.dev, souravpanda@google.com, pasha.tatashin@soleen.com, klarasmodin@gmail.com, corbet@lwn.net, linux-doc@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, kernel-team@android.com Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspamd-Server: rspam05 X-Stat-Signature: xiruqsoccke3uuqby1wu3yp565s8dtcp X-Rspamd-Queue-Id: D0E654001F X-Rspam-User: X-HE-Tag: 1734537420-418117 X-HE-Meta: U2FsdGVkX191KNDOJdN8Q+7UIlLGg112lSESihRwLlNIuunU7t/iz/9u/Jk+Zdbtmb95JWxj57bfLbfYy9lDL/jPj0/ZzWkOyuQq0l8YEj9QU3i+LibF7S0KiNvi3r18jVdR0fcsiXqqV0U4tw7H3bweaxJ8XcuA3YNtp2sbhmO6qdPLMut3szZR8E8zZe0Zz9Z/Sq72+h9ncBc0JGe9B5Cj2UB18fYWt6PLKMYQYyeZ+cn20kAHp3wr1s/UNf+nGam72IA/5zkS36uK95vw8xx79JGouvjIWvqm9zah9UIzNKqD9QHWpBKxMmi0T8dVBgcSIeI6Lf2Q0N2cZ3UGFCT5xC9F/ArGSIb1W23XIJsBiPOBDIxnRCYC+W49jjyq9bdnmFbthLDmgt7//G1oz0cBM+VNGzRYDBT6e2Ubwr6VoBIVxPrScvT5jhFrK5HV9ncNgQyYSOzusrDU/IkAa3lsci/RNTZljcEX2aEuODRwXJVIAWHPZJp+wl6oM1+KuSUQ2NjF2fIQTHuj3KMEWx6zL05Nbv8hpgiqcaSIvNIzg1PCcyuo0a9ZlbS3Cbu+Wg1MxOiegOVIfP+JzO/RrRh2vauF5Sx9PFf6fEbJCvXgY0n7hvQTqd6Ul5Ppiv2+9d1bmw1If080x0l4R9ZL1Rla6uEXzncd+6jFGebnrNpu3Rx+6/cblL2m9mhhFk0if9+GTYBtCD2yP++yhOGmUX3ZGCRxEwiQPg89GOsbINKh9N8COQSw2a9vz2ETxZJDxEyZrtf04gGuEwAfh0kQSbdEO8xDO2PZ3ryBg3x5Opo2tLygI5RX7+4Zvs3gjH4XKTHX2iq+OtNbB3Fg6mejG8I4GeHI+ia+rnPrs1ZSwTsJhBeZHk/wUEARTQlYIX65D2vUANDZ0ir/xc4AMB4QMfSVX/ZfpXDRDX/iui5PNA8x0zTPpGGXWXcYW+M4z2PLCDik6vYsSZFM84zYxkS DpS/Qoeu G+laL//Xs20GIsSMZfwQKl6Asvi3kcm93MzxygyWP5LkLT3DE+Z6xalr+5omZ4c46mPL+lJBwkHiohmm+7TJ9YaCvhcSHCH73/94dDQ+ZgrufOAQaViXHO8iCs7QufT3oiigkKzfp8IrDqP7RTKi4zO6TZJsVUs66LsJoqVCArVig93ndFGzuJT1/sRpET4EL3/Eh3zxR6deaSz6enwK29kaK/dhT2fh2ajfH61A8BQpL0hFsgokagFTq6PCASUiFIWUiMdIk/Ry5PUTsO9EblfI0Q9zETW5eqQDT62ORYROZX8ceU1+3ITmOOwIfZg1Ydxs6lUaUJoQsdwflcp6MtM7Zcinfv/uivjw7rza2/39h9+kpeezkf8LUqlZwVX7xxPIa X-Bogosity: Ham, tests=bogofilter, spamicity=0.000227, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Wed, Dec 18, 2024 at 7:37=E2=80=AFAM Liam R. Howlett wrote: > > * Peter Zijlstra [241218 05:06]: > > On Wed, Dec 18, 2024 at 10:41:04AM +0100, Peter Zijlstra wrote: > > > On Tue, Dec 17, 2024 at 08:27:46AM -0800, Suren Baghdasaryan wrote: > > > > > > > > So I just replied there, and no, I don't think it makes sense. Ju= st put > > > > > the kmem_cache_free() in vma_refcount_put(), to be done on 0. > > > > > > > > That's very appealing indeed and makes things much simpler. The > > > > problem I see with that is the case when we detach a vma from the t= ree > > > > to isolate it, then do some cleanup and only then free it. That's d= one > > > > in vms_gather_munmap_vmas() here: > > > > https://elixir.bootlin.com/linux/v6.12.5/source/mm/vma.c#L1240 and = we > > > > even might reattach detached vmas back: > > > > https://elixir.bootlin.com/linux/v6.12.5/source/mm/vma.c#L1312. IOW= , > > > > detached state is not final and we can't destroy the object that > > > > reached this state. 
> > >
> > > Urgh, so that's the munmap() path, but arguably when that fails, the
> > > map stays in place.
> > >
> > > I think this means you're marking detached too soon; you should only
> > > mark detached once you reach the point of no return.
> > >
> > > That said, once you've reached the point of no return; and are about to
> > > go remove the page-tables, you very much want to ensure a lack of
> > > concurrency.
> > >
> > > So perhaps waiting for outstanding readers at this point isn't crazy.
> > >
> > > Also, I'm having a very hard time reading this maple tree stuff :/
> > > Afaict vms_gather_munmap_vmas() only adds the VMAs to be removed to a
> > > second tree, it does not in fact unlink them from the mm yet.
>
> Yes, that's correct. I tried to make this clear with a gather/complete
> naming like other areas of the mm. I hope that helped.
>
> Also, the comments for the function state that's what's going on:
>
> * vms_gather_munmap_vmas() - Put all VMAs within a range into a maple tree
> * for removal at a later date. Handles splitting first and last if necessary
> * and marking the vmas as isolated.
>
> ... might be worth updating with new information.
>
> >
> > > AFAICT it's vma_iter_clear_gfp() that actually wipes the vmas from the
> > > mm -- and that being able to fail is mind boggling and I suppose is what
> > > gives rise to much of this insanity :/
>
> This is also correct. The maple tree is a b-tree variant that has
> internal nodes. When you write to it, including nulls, they are tracked
> and may need to allocate. This is a cost for rcu lookups; we will use
> the same or less memory in the end but must maintain a consistent view
> of the ranges.
>
> But to put this into perspective, we get 16 nodes per 4k page, most
> writes will use 1 or 3 of these from a kmem_cache, so we are talking
> about a very unlikely possibility. Except when syzbot decides to fail
> random allocations.
>
> We could preallocate for the write, but this section of the code is
> GFP_KERNEL, so we don't. Preallocation is an option to simplify the
> failure path though... which is what you did below.
>
> >
> > > Anyway, I would expect remove_vma() to be the one that marks it detached
> > > (it's already unreachable through vma_lookup() at this point) and there
> > > you should wait for concurrent readers to bugger off.
> >
> > Also, I think vma_start_write() in that gather loop is too early, you're
> > not actually going to change the VMA yet -- with obvious exception of
> > the split cases.
>
> The split needs to start the write on the vma to avoid anyone reading it
> while it's being altered.
>
> >
> > That too should probably come after you've passed all the fail/unwind
> > spots.
>
> Do you mean the split? I'd like to move the split later as well..
> tracking that is a pain and may need an extra vma for when one vma is
> split twice before removing the middle part.
>
> Actually, I think we need to allocate two (or at least one) vmas in this
> case and just pass one through to unmap (written only to the mas_detach
> tree?). It would be nice to find a way to NOT need to do that even.. I
> had tried to use a vma on the stack years ago, which didn't work out.
>
> >
> > Something like so perhaps?
> > (yeah, I know, I wrecked a bunch)
> >
> > diff --git a/mm/vma.c b/mm/vma.c
> > index 8e31b7e25aeb..45d43adcbb36 100644
> > --- a/mm/vma.c
> > +++ b/mm/vma.c
> > @@ -1173,6 +1173,11 @@ static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
> >  	struct vm_area_struct *vma;
> >  	struct mm_struct *mm;
> >
> >  	mas_set(mas_detach, 0);
> > +	mas_for_each(mas_detach, vma, ULONG_MAX) {
> > +		vma_start_write(vma);
> > +		vma_mark_detached(vma, true);
> > +	}
> > +
> >  	mm = current->mm;
> >  	mm->map_count -= vms->vma_count;
> >  	mm->locked_vm -= vms->locked_vm;
> > @@ -1219,9 +1224,6 @@ static void reattach_vmas(struct ma_state *mas_detach)
> >  	struct vm_area_struct *vma;
> >
> >  	mas_set(mas_detach, 0);
>
> Drop the mas_set here.
>
> > -	mas_for_each(mas_detach, vma, ULONG_MAX)
> > -		vma_mark_detached(vma, false);
> > -
> >  	__mt_destroy(mas_detach->tree);
> >  }
> >
> > @@ -1289,13 +1291,11 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
> >  			if (error)
> >  				goto end_split_failed;
> >  		}
> > -		vma_start_write(next);
> >  		mas_set(mas_detach, vms->vma_count++);
> >  		error = mas_store_gfp(mas_detach, next, GFP_KERNEL);
> >  		if (error)
> >  			goto munmap_gather_failed;
> >
> > -		vma_mark_detached(next, true);
> >  		nrpages = vma_pages(next);
> >
> >  		vms->nr_pages += nrpages;
> > @@ -1431,14 +1431,17 @@ int do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
> >  	struct vma_munmap_struct vms;
> >  	int error;
> >
> The preallocation needs to know the range being stored to know what's
> going to happen.
>
> vma_iter_config(vmi, start, end);
>
> > +	error = mas_preallocate(vmi->mas);
>
> We haven't had a need to have a vma iterator preallocate for storing a
> null, but we can add one for this.
>
> > +	if (error)
> > +		goto gather_failed;
> > +
> >  	init_vma_munmap(&vms, vmi, vma, start, end, uf, unlock);
> >  	error = vms_gather_munmap_vmas(&vms, &mas_detach);
> >  	if (error)
> >  		goto gather_failed;
> >
>
> Drop this stuff.
>
> >  	error = vma_iter_clear_gfp(vmi, start, end, GFP_KERNEL);
> > -	if (error)
> > -		goto clear_tree_failed;
> > +	VM_WARN_ON(error);
>
> Do this instead
> vma_iter_config(vmi, start, end);
> vma_iter_clear(vmi);

Thanks for the input, Liam. Let me try to make a patch from these
suggestions and see where we end up and what might blow up.

> >
> >  	/* Point of no return */
> >  	vms_complete_munmap_vmas(&vms, &mas_detach);
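To make the preallocation ordering concrete, here is a rough sketch against
the generic maple tree API (the vma-iterator preallocation helper Liam
mentions does not exist yet); the wrapper function and its name are
hypothetical, and an error path taken before the store would need
mas_destroy() to release the reserved nodes:

/*
 * Hypothetical helper illustrating the ordering being discussed:
 * reserve maple tree nodes for the NULL store up front, so the
 * actual wipe at the point of no return cannot fail.
 */
static int clear_range_sketch(struct mm_struct *mm, unsigned long start,
			      unsigned long end)
{
	MA_STATE(mas, &mm->mm_mt, start, end - 1);
	int error;

	/* Reserve nodes while failure is still recoverable. */
	error = mas_preallocate(&mas, NULL, GFP_KERNEL);
	if (error)
		return error;

	/*
	 * ... gather/split work that may still fail would run here; a
	 * failure here would call mas_destroy(&mas) before unwinding ...
	 */

	/* Point of no return: the preallocated NULL store cannot fail. */
	mas_store_prealloc(&mas, NULL);
	return 0;
}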