From: Suren Baghdasaryan <surenb@google.com>
Date: Wed, 18 Dec 2024 13:55:56 -0800
Subject: Re: [PATCH v6 10/16] mm: replace vm_lock and detached flag with a reference count
To: "Liam R. Howlett", Suren Baghdasaryan, Peter Zijlstra, akpm@linux-foundation.org, willy@infradead.org, lorenzo.stoakes@oracle.com, mhocko@suse.com, vbabka@suse.cz, hannes@cmpxchg.org, mjguzik@gmail.com, oliver.sang@intel.com, mgorman@techsingularity.net, david@redhat.com, peterx@redhat.com, oleg@redhat.com, dave@stgolabs.net, paulmck@kernel.org, brauner@kernel.org, dhowells@redhat.com, hdanton@sina.com, hughd@google.com, lokeshgidra@google.com, minchan@google.com, jannh@google.com, shakeel.butt@linux.dev, souravpanda@google.com, pasha.tatashin@soleen.com, klarasmodin@gmail.com, corbet@lwn.net, linux-doc@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, kernel-team@android.com
References: <20241218161850.GG2354@noisy.programming.kicks-ass.net> <20241218174428.GQ2354@noisy.programming.kicks-ass.net>

On Wed, Dec 18, 2024 at 1:53 PM Suren Baghdasaryan wrote:
>
> On Wed, Dec 18, 2024 at 12:38 PM Liam R. Howlett wrote:
> >
> > * Suren Baghdasaryan [241218 15:01]:
> > > On Wed, Dec 18, 2024 at 11:38 AM 'Liam R. Howlett' via kernel-team wrote:
> > > >
> > > > * Suren Baghdasaryan [241218 14:29]:
> > > > > On Wed, Dec 18, 2024 at 11:07 AM Suren Baghdasaryan wrote:
> > > > > >
> > > > > > On Wed, Dec 18, 2024 at 11:00 AM 'Liam R. Howlett' via kernel-team wrote:
> > > > > > >
> > > > > > > * Suren Baghdasaryan [241218 12:58]:
> > > > > > > > On Wed, Dec 18, 2024 at 9:44 AM Peter Zijlstra wrote:
> > > > > > > > >
> > > > > > > > > On Wed, Dec 18, 2024 at 09:36:42AM -0800, Suren Baghdasaryan wrote:
> > > > > > > > >
> > > > > > > > > > > You will not. vms_complete_munmap_vmas() will call remove_vma() to
> > > > > > > > > > > remove PTEs IIRC, and if you do start_write() and detach() before
> > > > > > > > > > > dropping mmap_lock_write, you should be good.
> > > > > > > > > >
> > > > > > > > > > Ok, I think we will have to move mmap_write_downgrade() inside
> > > > > > > > > > vms_complete_munmap_vmas() to be called after remove_vma().
> > > > > > > > > > vms_clear_ptes() is using vmas, so we can't move remove_vma() before
> > > > > > > > > > mmap_write_downgrade().
> > > > > > > > >
> > > > > > > > > Why ?!
> > > > > > > > >
> > > > > > > > > vms_clear_ptes() and remove_vma() are fine where they are -- there is no
> > > > > > > > > concurrency left at this point.
> > > > > > > > >
> > > > > > > > > Note that by doing vma_start_write() inside vms_complete_munmap_vmas(),
> > > > > > > > > which is *after* the vmas have been unhooked from the mm, you wait for
> > > > > > > > > any concurrent user to go away.
> > > > > > > > >
> > > > > > > > > And since they're unhooked, there can't be any new users.
> > > > > > > > >
> > > > > > > > > So you're the one and only user left, and the code is fine the way it is.
> > > > > > > >
> > > > > > > > Ok, let me make sure I understand this part of your proposal. From
> > > > > > > > your earlier email:
> > > > > > > >
> > > > > > > > @@ -1173,6 +1173,11 @@ static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
> > > > > > > >         struct vm_area_struct *vma;
> > > > > > > >         struct mm_struct *mm;
> > > > > > > >
> > > > > > > > +       mas_for_each(mas_detach, vma, ULONG_MAX) {
> > > > > > > > +               vma_start_write(next);
> > > > > > > > +               vma_mark_detached(next, true);
> > > > > > > > +       }
> > > > > > > > +
> > > > > > > >         mm = current->mm;
> > > > > > > >         mm->map_count -= vms->vma_count;
> > > > > > > >         mm->locked_vm -= vms->locked_vm;
> > > > > > > >
> > > > > > > > This would mean:
> > > > > > > >
> > > > > > > > vms_complete_munmap_vmas
> > > > > > > >     vma_start_write
> > > > > > > >     vma_mark_detached
> > > > > > > >     mmap_write_downgrade
> > > > > > > >     vms_clear_ptes
> > > > > > > >     remove_vma
> > > > > > > >
> > > > > > > > And remove_vma() will be just freeing the vmas. Is that correct?
> > > > > > > > I'm a bit confused because the original thinking was that
> > > > > > > > vma_mark_detached() would drop the last refcnt and, if it's 0, we would
> > > > > > > > free the vma right there.
> > > > > > > > If that's still what we want to do, then I
> > > > > > > > think the above sequence should look like this:
> > > > > > > >
> > > > > > > > vms_complete_munmap_vmas
> > > > > > > >     vms_clear_ptes
> > > > > > > >     remove_vma
> > > > > > > >     vma_start_write
> > > > > > > >     vma_mark_detached
> > > > > > > >     mmap_write_downgrade
> > > > > > > >
> > > > > > > > because vma_start_write() + vma_mark_detached() should be done under mmap_write_lock.
> > > > > > > > Please let me know which way you want to move forward.
> > > > > > >
> > > > > > > Are we sure we're not causing issues with the MAP_FIXED path here?
> > > > > > >
> > > > > > > With the above change, we'd be freeing the PTEs before marking the vmas
> > > > > > > as detached or calling vma_start_write().
> > > > > >
> > > > > > IIUC, when we call vms_complete_munmap_vmas() all vmas inside
> > > > > > mas_detach have already been write-locked, no?
> > > >
> > > > That's the way it is today - but I thought you were moving the lock to
> > > > the complete stage, not adding a new one? (Why add a new one otherwise?)
> > >
> > > Is my understanding correct that mas_detach is populated by
> > > vms_gather_munmap_vmas() only with vmas that went through
> > > __split_vma() (and were write-locked there)? I don't see any path that
> > > would add any other vma into mas_detach, but maybe I'm missing
> > > something.
> >
> > No, that is not correct.
> >
> > vms_gather_munmap_vmas() calls split on the first vma, then adds all
> > vmas that are within the range of the munmap() call, potentially
> > splitting the last vma and adding that in the
> > "if (next->vm_end > vms->end)" block.
> >
> > Sometimes this is a single vma that gets split twice, sometimes no
> > splits happen and entire vmas are unmapped, and sometimes it's just one
> > vma that isn't split.
> >
> > My observation is that the common case is a single vma, but beyond that
> > we see 3, and sometimes 7, at a time; it could be any number of vmas,
> > and not all of them are split.
> >
> > There is a loop for_each_vma_range() that does:
> >
> >         vma_start_write(next);
> >         mas_set(mas_detach, vms->mas_count++);
> >         mas_store_gfp(mas_detach, next, GFP_KERNEL);
>
> Ah, ok, I see now. I completely misunderstood what for_each_vma_range()
> was doing.
>
> Then I think vma_start_write() should remain inside
> vms_gather_munmap_vmas(), and all vmas in mas_detach should be
> write-locked, even the ones we are not modifying. Otherwise, what would
> prevent the race I mentioned before?
>
> __mmap_region
>     __mmap_prepare
>         vms_gather_munmap_vmas  // adds vmas to be unmapped into mas_detach,
>                                 // some locked by __split_vma(), some not locked
>
>                                         lock_vma_under_rcu()
>                                             vma = mas_walk  // finds unlocked vma also in mas_detach
>                                             vma_start_read(vma)  // succeeds since vma is not locked
>                                             // vma->detached, vm_start, vm_end checks pass
>                                             // vma is successfully read-locked
>
>         vms_clean_up_area(mas_detach)
>             vms_clear_ptes
>                                             // steps on a cleared PTE
>     __mmap_new_vma
>         vma_set_range  // installs new vma in the range
> __mmap_complete
>     vms_complete_munmap_vmas  // vmas are write-locked and detached, but it's too late

Sorry about the formatting.
Without the comments it should look better:

__mmap_region
    __mmap_prepare
        vms_gather_munmap_vmas
                                        lock_vma_under_rcu()
                                            vma = mas_walk
                                            vma_start_read(vma)
                                            // vma is still valid and attached
        vms_clean_up_area(mas_detach)
            vms_clear_ptes
                                            // steps on a cleared PTE
    __mmap_new_vma
        vma_set_range
__mmap_complete
    vms_complete_munmap_vmas

> > > > >
> > > > > Yeah, I think we can simply do this:
> > > > >
> > > > > vms_complete_munmap_vmas
> > > > >     vms_clear_ptes
> > > > >     remove_vma
> > > > >     vma_mark_detached
> > > > >     mmap_write_downgrade
> > > > >
> > > > > If my assumption is incorrect, the assertion inside vma_mark_detached()
> > > > > should trigger. I tried a quick test and so far nothing has exploded.
> > > >
> > > > If they are write-locked, then the page faults are not a concern. There
> > > > is also the rmap race that Jann found in mmap_region() [1]. This is
> > > > probably also fine since you are keeping the write lock in place earlier
> > > > on in the gather stage. Note the ptes will already be cleared by the
> > > > time vms_complete_munmap_vmas() is called in this case.
> > > >
> > > > [1] https://lore.kernel.org/all/CAG48ez0ZpGzxi=-5O_uGQ0xKXOmbjeQ0LjZsRJ1Qtf2X5eOr1w@mail.gmail.com/
> > > >
> > > > Thanks,
> > > > Liam