From: Suren Baghdasaryan
Date: Wed, 18 Dec 2024 07:50:34 -0800
Subject: Re: [PATCH v6 10/16] mm: replace vm_lock and detached flag with a reference count
To: "Liam R. Howlett", Peter Zijlstra, Suren Baghdasaryan, akpm@linux-foundation.org, willy@infradead.org, lorenzo.stoakes@oracle.com, mhocko@suse.com, vbabka@suse.cz, hannes@cmpxchg.org, mjguzik@gmail.com, oliver.sang@intel.com, mgorman@techsingularity.net, david@redhat.com, peterx@redhat.com, oleg@redhat.com, dave@stgolabs.net, paulmck@kernel.org, brauner@kernel.org, dhowells@redhat.com, hdanton@sina.com, hughd@google.com, lokeshgidra@google.com, minchan@google.com, jannh@google.com, shakeel.butt@linux.dev, souravpanda@google.com, pasha.tatashin@soleen.com, klarasmodin@gmail.com, corbet@lwn.net, linux-doc@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, kernel-team@android.com
On Wed, Dec 18, 2024 at 7:37 AM Liam R. Howlett wrote:
>
> * Peter Zijlstra [241218 05:06]:
> > On Wed, Dec 18, 2024 at 10:41:04AM +0100, Peter Zijlstra wrote:
> > > On Tue, Dec 17, 2024 at 08:27:46AM -0800, Suren Baghdasaryan wrote:
> > > >
> > > > > So I just replied there, and no, I don't think it makes sense. Just put
> > > > > the kmem_cache_free() in vma_refcount_put(), to be done on 0.
> > > >
> > > > That's very appealing indeed and makes things much simpler. The
> > > > problem I see with that is the case when we detach a vma from the tree
> > > > to isolate it, then do some cleanup and only then free it. That's done
> > > > in vms_gather_munmap_vmas() here:
> > > > https://elixir.bootlin.com/linux/v6.12.5/source/mm/vma.c#L1240 and we
> > > > even might reattach detached vmas back:
> > > > https://elixir.bootlin.com/linux/v6.12.5/source/mm/vma.c#L1312. IOW,
> > > > detached state is not final and we can't destroy the object that
> > > > reached this state.
> > >
> > > Urgh, so that's the munmap() path, but arguably when that fails, the
> > > map stays in place.
> > >
> > > I think this means you're marking detached too soon; you should only
> > > mark detached once you reach the point of no return.
> > >
> > > That said, once you've reached the point of no return, and are about to
> > > go remove the page-tables, you very much want to ensure a lack of
> > > concurrency.
> > >
> > > So perhaps waiting for out-standing readers at this point isn't crazy.
> > >
> > > Also, I'm having a very hard time reading this maple tree stuff :/
> > > Afaict vms_gather_munmap_vmas() only adds the VMAs to be removed to a
> > > second tree, it does not in fact unlink them from the mm yet.
>
> Yes, that's correct. I tried to make this clear with a gather/complete
> naming like other areas of the mm. I hope that helped.
>
> Also, the comments for the function state that's what's going on:
>
>  * vms_gather_munmap_vmas() - Put all VMAs within a range into a maple tree
>  * for removal at a later date. Handles splitting first and last if necessary
>  * and marking the vmas as isolated.
>
> ... might be worth updating with new information.
>
> >
> > > AFAICT it's vma_iter_clear_gfp() that actually wipes the vmas from the
> > > mm -- and that being able to fail is mind boggling and I suppose is what
> > > gives rise to much of this insanity :/
>
> This is also correct. The maple tree is a b-tree variant that has
> internal nodes. When you write to it, including nulls, they are tracked
> and may need to allocate. This is a cost for rcu lookups; we will use
> the same or less memory in the end but must maintain a consistent view
> of the ranges.
>
> But to put this into perspective, we get 16 nodes per 4k page, most
> writes will use 1 or 3 of these from a kmem_cache, so we are talking
> about a very unlikely possibility. Except when syzbot decides to fail
> random allocations.
>
> We could preallocate for the write, but this section of the code is
> GFP_KERNEL, so we don't. Preallocation is an option to simplify the
> failure path though... which is what you did below.
> >
> > > Anyway, I would expect remove_vma() to be the one that marks it detached
> > > (it's already unreachable through vma_lookup() at this point) and there
> > > you should wait for concurrent readers to bugger off.
> >
> > Also, I think vma_start_write() in that gather loop is too early, you're
> > not actually going to change the VMA yet -- with obvious exception of
> > the split cases.
>
> The split needs to start the write on the vma to avoid anyone reading it
> while it's being altered.

I think vma_start_write() should be done inside vms_gather_munmap_vmas()
for __mmap_prepare() to work correctly:

__mmap_prepare
    vms_gather_munmap_vmas
    vms_clean_up_area // clears PTEs
    ...
__mmap_complete
    vms_complete_munmap_vmas

If we do not write-lock the vmas inside vms_gather_munmap_vmas(), we
will be clearing PTEs from under a discoverable vma. There might be
other places like this too, but I think we can move vma_mark_detached()
like you suggested without moving vma_start_write(), and that should be
enough.

> >
> > That too should probably come after you've passed all the fail/unwind
> > spots.
>
> Do you mean the split? I'd like to move the split later as well..
> tracking that is a pain and may need an extra vma for when one vma is
> split twice before removing the middle part.
>
> Actually, I think we need to allocate two (or at least one) vmas in this
> case and just pass one through to unmap (written only to the mas_detach
> tree?). It would be nice to find a way to NOT need to do that even.. I
> had tried to use a vma on the stack years ago, which didn't work out.
>
> >
> > Something like so perhaps?
> > (yeah, I know, I wrecked a bunch)
> >
> > diff --git a/mm/vma.c b/mm/vma.c
> > index 8e31b7e25aeb..45d43adcbb36 100644
> > --- a/mm/vma.c
> > +++ b/mm/vma.c
> > @@ -1173,6 +1173,11 @@ static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
> >  	struct vm_area_struct *vma;
> >  	struct mm_struct *mm;
> >
> >  	mas_set(mas_detach, 0);
> > +	mas_for_each(mas_detach, vma, ULONG_MAX) {
> > +		vma_start_write(next);
> > +		vma_mark_detached(next, true);
> > +	}
> > +
> >  	mm = current->mm;
> >  	mm->map_count -= vms->vma_count;
> >  	mm->locked_vm -= vms->locked_vm;
> > @@ -1219,9 +1224,6 @@ static void reattach_vmas(struct ma_state *mas_detach)
> >  	struct vm_area_struct *vma;
> >
> >  	mas_set(mas_detach, 0);
>
> Drop the mas_set here.
>
> > -	mas_for_each(mas_detach, vma, ULONG_MAX)
> > -		vma_mark_detached(vma, false);
> > -
> >  	__mt_destroy(mas_detach->tree);
> >  }
> >
> > @@ -1289,13 +1291,11 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
> >  			if (error)
> >  				goto end_split_failed;
> >  		}
> > -		vma_start_write(next);
> >  		mas_set(mas_detach, vms->vma_count++);
> >  		error = mas_store_gfp(mas_detach, next, GFP_KERNEL);
> >  		if (error)
> >  			goto munmap_gather_failed;
> >
> > -		vma_mark_detached(next, true);
> >  		nrpages = vma_pages(next);
> >
> >  		vms->nr_pages += nrpages;
> > @@ -1431,14 +1431,17 @@ int do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
> >  	struct vma_munmap_struct vms;
> >  	int error;
> >
>
> The preallocation needs to know the range being stored to know what's
> going to happen.
>
> vma_iter_config(vmi, start, end);
>
> > +	error = mas_preallocate(vmi->mas);
>
> We haven't had a need to have a vma iterator preallocate for storing a
> null, but we can add one for this.
> > +	if (error)
> > +		goto gather_failed;
> > +
> >  	init_vma_munmap(&vms, vmi, vma, start, end, uf, unlock);
> >  	error = vms_gather_munmap_vmas(&vms, &mas_detach);
> >  	if (error)
> >  		goto gather_failed;
> >
>
> Drop this stuff.
>
> >  	error = vma_iter_clear_gfp(vmi, start, end, GFP_KERNEL);
> > -	if (error)
> > -		goto clear_tree_failed;
> > +	VM_WARN_ON(error);
>
> Do this instead:
>
> vma_iter_config(vmi, start, end);
> vma_iter_clear(vmi);
>
> >
> >  	/* Point of no return */
> >  	vms_complete_munmap_vmas(&vms, &mas_detach);