From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 258F3E77187 for ; Wed, 18 Dec 2024 21:53:33 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 5A1B36B007B; Wed, 18 Dec 2024 16:53:32 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 550986B0082; Wed, 18 Dec 2024 16:53:32 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 418FC6B0083; Wed, 18 Dec 2024 16:53:32 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0015.hostedemail.com [216.40.44.15]) by kanga.kvack.org (Postfix) with ESMTP id 244B46B007B for ; Wed, 18 Dec 2024 16:53:32 -0500 (EST) Received: from smtpin15.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay05.hostedemail.com (Postfix) with ESMTP id 9DEC6451C9 for ; Wed, 18 Dec 2024 21:53:31 +0000 (UTC) X-FDA: 82909431066.15.44F3614 Received: from mail-qt1-f177.google.com (mail-qt1-f177.google.com [209.85.160.177]) by imf14.hostedemail.com (Postfix) with ESMTP id 2DA24100011 for ; Wed, 18 Dec 2024 21:52:54 +0000 (UTC) Authentication-Results: imf14.hostedemail.com; dkim=pass header.d=google.com header.s=20230601 header.b=xSEww4hn; dmarc=pass (policy=reject) header.from=google.com; spf=pass (imf14.hostedemail.com: domain of surenb@google.com designates 209.85.160.177 as permitted sender) smtp.mailfrom=surenb@google.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1734558779; a=rsa-sha256; cv=none; b=T55Q/rd/mKvXP4mcdsEtPezgl16zHkIXhYh/fw2WMtS8W7xkqKKZZhTDgdOc4LAZbsuJZd 7J5eRZ2Bsl0DWHqyCI2KDaV92+ycWKifKfKB6sjrY1OUmbjxZOGIE/t2/cWPP1iN3bZgAz dx3VQI52GSqNUYZMiYcX6YMuxALikVY= ARC-Authentication-Results: i=1; imf14.hostedemail.com; dkim=pass header.d=google.com header.s=20230601 header.b=xSEww4hn; dmarc=pass (policy=reject) header.from=google.com; spf=pass (imf14.hostedemail.com: domain of surenb@google.com designates 209.85.160.177 as permitted sender) smtp.mailfrom=surenb@google.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1734558779; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=r/JLu6T0oQIGjWC8Vgw4keUhQ07kt7/iqExEFuVaHgE=; b=umEqUUspEOCz+ysylTe8z6/HjMrUI4xZwwxrv1eF0TgyAEXB0lmH0rnGuVlmdBKS+/Esae VDm1EPbEJveM0EXsPPx3r65rCL1hBFJFbvNcwG4PY3HXOphkzBoCvlerqPbw2ZxtjSQ3Yh UuJxFVbhdZLk5by0cUP3FpuzK0ZVcI0= Received: by mail-qt1-f177.google.com with SMTP id d75a77b69052e-4678c9310afso10261cf.1 for ; Wed, 18 Dec 2024 13:53:29 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1734558809; x=1735163609; darn=kvack.org; h=content-transfer-encoding:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=r/JLu6T0oQIGjWC8Vgw4keUhQ07kt7/iqExEFuVaHgE=; b=xSEww4hngWGDNFieNkhwXATem5xt7k2Bg6sFIg4lVKDcbm5dFxJWzPD2aSzPJNg2GY vZCf4+vGTVpwlyDR7FnmlA1q39zdnlu2RBGCFh95TU9W6B1DERPZ6KAVz6EpqFBtf0uW XNt70cVlfM9gIlvqb2+oMaMUUGZPwF+MeBKFD8Jnzr4dF9y4RrjPfe22skwDhKqUz3bz 9pcqEV4/E9SutKPwj/3uHTYlREZ+HEbgNtwy1AD9QffuqsQzFTel7/rH7y/O8Jr/3n6r SuCpOVo1mM4e9nCXFY8FtrZn9UFuwJWxdGJamqvYm78c28Q5/7VLn/QkbtlaRfhyO32W hQeA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1734558809; x=1735163609; h=content-transfer-encoding:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=r/JLu6T0oQIGjWC8Vgw4keUhQ07kt7/iqExEFuVaHgE=; b=FGpuhFeEJBW37qoGKvcfyVQUj+kTYcUIJJ37Vy01lfspAPV3jDIoX85yILZpRD412M /FtBXB5BJoAHCJ68xjAHFaJWNaSPg1FFKAsmxV0NuQ1LSbhmVJGWjrYTupD4/75A1VZr pd3nz6yYR267GGm+fz31h9kydP/ULPiq10e3vo1B4IxxqUVJXA1pwB+wOk6HwfqOq93A t1s5MjEq1p6Fp4O9jWoxOTSQFlwJ3mgbMybiGH31MjGbLAfjQp8SIm433zMg4obJi8m7 zbsGVdttvPu0Ht/MBIJ7p/0WeFPhdQv5ApU1zI6EBIhKEfqpiAJi+y+PJturPzydTycR T0uA== X-Forwarded-Encrypted: i=1; AJvYcCUQmtIkC0d8QjbStADEdTXjC6q4TDPAk5+LJlMIDDdZ3hV31ChCxZGZxLlmOYkwgjb66zwlaKk6CQ==@kvack.org X-Gm-Message-State: AOJu0YxaIS3Aoc4guZvjW3chOkDRK0tzgea6I389s5ttYfBgAXicAshS Wikk1FuCfBPVI6m1wbfjakfq+tLNn0EP7Cj/NoXs5Sol5kZVBYHTat0KrA+nvYuFmwFA/rBw/iS T7KWRvA5MVoZEXKobzP3wswZKF0ULO5D5ZFqM X-Gm-Gg: ASbGncuPb6lY0SFJKTjvUSp38XNU1NWfJM9DjxDuxzrOrx978LJZ/tqWGqS9DXNxEe3 EQUOg5QfHTohhTT/D8WIM4Bs2QGcWkAwanimH6HKolPUtSxWUgFysrsu/jEk5XwdflD2B X-Google-Smtp-Source: AGHT+IGTvjtHfkSnsti1gp6AxGHe6GoyfbxsCNyIl+Zw83NzTtoMzAa2jWNZftX/4UQEDHCkuCHPwDVAFj/JXl9yarI= X-Received: by 2002:a05:622a:4cf:b0:461:4150:b835 with SMTP id d75a77b69052e-46a3b94fdc2mr817961cf.6.1734558808650; Wed, 18 Dec 2024 13:53:28 -0800 (PST) MIME-Version: 1.0 References: <20241218161850.GG2354@noisy.programming.kicks-ass.net> <20241218174428.GQ2354@noisy.programming.kicks-ass.net> In-Reply-To: From: Suren Baghdasaryan Date: Wed, 18 Dec 2024 13:53:17 -0800 Message-ID: Subject: Re: [PATCH v6 10/16] mm: replace vm_lock and detached flag with a reference count To: "Liam R. Howlett" , Suren Baghdasaryan , Peter Zijlstra , akpm@linux-foundation.org, willy@infradead.org, lorenzo.stoakes@oracle.com, mhocko@suse.com, vbabka@suse.cz, hannes@cmpxchg.org, mjguzik@gmail.com, oliver.sang@intel.com, mgorman@techsingularity.net, david@redhat.com, peterx@redhat.com, oleg@redhat.com, dave@stgolabs.net, paulmck@kernel.org, brauner@kernel.org, dhowells@redhat.com, hdanton@sina.com, hughd@google.com, lokeshgidra@google.com, minchan@google.com, jannh@google.com, shakeel.butt@linux.dev, souravpanda@google.com, pasha.tatashin@soleen.com, klarasmodin@gmail.com, corbet@lwn.net, linux-doc@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, kernel-team@android.com Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Stat-Signature: yyxw59ebt9n9m7j9koxhpqwha4xsz8ee X-Rspam-User: X-Rspamd-Queue-Id: 2DA24100011 X-Rspamd-Server: rspam08 X-HE-Tag: 1734558774-327388 X-HE-Meta: U2FsdGVkX1/G2vYGEU0wp7OSgyYQ0SRE7bOvSixx93uPcrBRhk09Mjod9n04G8PPV3ZvDF0sG+BY8awpU77uLt2NG+haPKotkdgWx35gRMf4c6OjTVOfV5KR91rPLYdm5ukcyc/oSqh7zxvySfy5iuzcXHjnk/PlbhdIp7qfd/dgZnsf/o5iJTklZbhoFO9VZm0w0/BWe12G5a/sGmYHuhE0U5UJmQoVQnKwS+8AxQYKoR9Os0OKdFoTiWjqcbXELzeGmEokugeMRPQIbxwWbeeEwJyNtcmN/xh/26KzgS1OAAhJe0gM2RAbzvRpE+Ao3f31K3dvBwbEjqdtG4sVztj1SuU+4kt8K7Hzt/wKp8YGZHjzPJo704vZrmjLO7KlG8Tuk2yF4jblqCbEBHFfUR788V1ngHTzcbdmurfIbh1IGidSA0rUSOSv7/mzWsJje+6e3QqT95IsbvIZ31zvRX2cTjwrgAr84nwlGXIvf934BxjTk2rruZyo1sfepRc+MTCDBKYMj1gIdEpnMFHb3xES6Yj0aiNZsOIRZ+0rXLwX9Hv0ia+yWpwjlkJZCtlHEZ4hCxJkSk2U2pBuEIW9ghBuFpZLu1/1ZwC8n0snISCdXdFlX72p13dB6pB/9FKMv0n/4nNvXmcSMZTEMeu1CuF5bHXOsRALE/UMeYm0y4CLhxE0WmDFiY6GWCY+cb4Ee9e5crpATK+y5EdqJdq/elVoYzK2MG2kw8oL7NbIyEteX0aXY5/jZUHzyD1P343HdA9S8yzEMsm1yQ8zgN+bO6Cm/tslOWbwkPx3NLZl5/Jm6s5E0jKuxDRLBODh169EAwpDIkmdi0sJPU1RpICdmL9N//2b6+2X/y9S7lpQbDWaP0Ea+fuw8vHim+t3NrZqw/tJijF7sm6CcsxP3dx5yPSXvSujBOR7sDASn1nDsdcZ0OAdNj+5WiJcXxpT1hN/9QDojMYWbyczHqM175O 9tQ8ZoF9 JsSCaqx9VnB4Qxz4HOhEEwosKwPgJ8BH+towFxxU2uJ/uwS707StTYQ3kRPrF2eNzSNUJH2HuSfuhNXUAcjzjUHt2gHAmEnI3g48NyrsuYjTHTcV+g8wuFYlmFuhp211tlkKZx7bVtxAukcmtxVRs7d6v6ICn9aIZw+83mkOX9Din2Q15oiBHqPsj3+xmwmlOhKsp6YPiK5k8tKfP7+R206VB+PYBIUUSK+Rd66FL0+Hx5/tI5xLw8yUXXzb3FponftmNTzH8lK1HpoijQ/CvbsQzRH7j8krfuUfNJ2F7d5XmSaatcqgMSCPLpFxWKJN9gITOzttIUn2a6/xSJi6Va3xxyeCUspYZK66cEQze9E0mPB5WVVvJAKJ4sAKZim9bb+7U6yrsI26krZcdQU31PpAkVAVDqueLXmiA X-Bogosity: Ham, tests=bogofilter, spamicity=0.001719, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Wed, Dec 18, 2024 at 12:38=E2=80=AFPM Liam R. Howlett wrote: > > * Suren Baghdasaryan [241218 15:01]: > > On Wed, Dec 18, 2024 at 11:38=E2=80=AFAM 'Liam R. Howlett' via kernel-t= eam > > wrote: > > > > > > * Suren Baghdasaryan [241218 14:29]: > > > > On Wed, Dec 18, 2024 at 11:07=E2=80=AFAM Suren Baghdasaryan wrote: > > > > > > > > > > On Wed, Dec 18, 2024 at 11:00=E2=80=AFAM 'Liam R. Howlett' via ke= rnel-team > > > > > wrote: > > > > > > > > > > > > * Suren Baghdasaryan [241218 12:58]: > > > > > > > On Wed, Dec 18, 2024 at 9:44=E2=80=AFAM Peter Zijlstra wrote: > > > > > > > > > > > > > > > > On Wed, Dec 18, 2024 at 09:36:42AM -0800, Suren Baghdasarya= n wrote: > > > > > > > > > > > > > > > > > > You will not. vms_complete_munmap_vmas() will call remo= ve_vma() to > > > > > > > > > > remove PTEs IIRC, and if you do start_write() and detac= h() before > > > > > > > > > > dropping mmap_lock_write, you should be good. > > > > > > > > > > > > > > > > > > Ok, I think we will have to move mmap_write_downgrade() i= nside > > > > > > > > > vms_complete_munmap_vmas() to be called after remove_vma(= ). > > > > > > > > > vms_clear_ptes() is using vmas, so we can't move remove_v= ma() before > > > > > > > > > mmap_write_downgrade(). > > > > > > > > > > > > > > > > Why ?! > > > > > > > > > > > > > > > > vms_clear_ptes() and remove_vma() are fine where they are -= - there is no > > > > > > > > concurrency left at this point. > > > > > > > > > > > > > > > > Note that by doing vma_start_write() inside vms_complete_mu= nmap_vmas(), > > > > > > > > which is *after* the vmas have been unhooked from the mm, y= ou wait for > > > > > > > > any concurrent user to go away. > > > > > > > > > > > > > > > > And since they're unhooked, there can't be any new users. > > > > > > > > > > > > > > > > So you're the one and only user left, and code is fine the = way it is. > > > > > > > > > > > > > > Ok, let me make sure I understand this part of your proposal.= From > > > > > > > your earlier email: > > > > > > > > > > > > > > @@ -1173,6 +1173,11 @@ static void vms_complete_munmap_vmas(s= truct > > > > > > > vma_munmap_struct *vms, > > > > > > > struct vm_area_struct *vma; > > > > > > > struct mm_struct *mm; > > > > > > > > > > > > > > + mas_for_each(mas_detach, vma, ULONG_MAX) { > > > > > > > + vma_start_write(next); > > > > > > > + vma_mark_detached(next, true); > > > > > > > + } > > > > > > > + > > > > > > > mm =3D current->mm; > > > > > > > mm->map_count -=3D vms->vma_count; > > > > > > > mm->locked_vm -=3D vms->locked_vm; > > > > > > > > > > > > > > This would mean: > > > > > > > > > > > > > > vms_complete_munmap_vmas > > > > > > > vma_start_write > > > > > > > vma_mark_detached > > > > > > > mmap_write_downgrade > > > > > > > vms_clear_ptes > > > > > > > remove_vma > > > > > > > > > > > > > > And remove_vma will be just freeing the vmas. Is that correct= ? > > > > > > > I'm a bit confused because the original thinking was that > > > > > > > vma_mark_detached() would drop the last refcnt and if it's 0 = we would > > > > > > > free the vma right there. If that's still what we want to do = then I > > > > > > > think the above sequence should look like this: > > > > > > > > > > > > > > vms_complete_munmap_vmas > > > > > > > vms_clear_ptes > > > > > > > remove_vma > > > > > > > vma_start_write > > > > > > > vma_mark_detached > > > > > > > mmap_write_downgrade > > > > > > > > > > > > > > because vma_start_write+vma_mark_detached should be done unde= r mmap_write_lock. > > > > > > > Please let me know which way you want to move forward. > > > > > > > > > > > > > > > > > > > Are we sure we're not causing issues with the MAP_FIXED path he= re? > > > > > > > > > > > > With the above change, we'd be freeing the PTEs before marking = the vmas > > > > > > as detached or vma_start_write(). > > > > > > > > > > IIUC when we call vms_complete_munmap_vmas() all vmas inside > > > > > mas_detach have been already write-locked, no? > > > > > > That's the way it is today - but I thought you were moving the lock t= o > > > the complete stage, not adding a new one? (why add a new one otherwis= e?) > > > > Is my understanding correct that mas_detach is populated by > > vms_gather_munmap_vmas() only with vmas that went through > > __split_vma() (and were write-locked there)? I don't see any path that > > would add any other vma into mas_detach but maybe I'm missing > > something? > > No, that is not correct. > > vms_gather_munmap_vmas() calls split on the first vma, then adds all > vmas that are within the range of the munmap() call. Potentially > splitting the last vma and adding that in the > "if (next->vm_end > vms->end)" block. > > Sometimes this is a single vma that gets split twice, sometimes no > splits happen and entire vmas are unmapped, sometimes it's just one vma > that isn't split. > > My observation is the common case is a single vma, but besides that we > see 3, and sometimes 7 at a time, but it could be any number of vmas and > not all of them are split. > > There is a loop for_each_vma_range() that does: > > vma_start_write(next); > mas_set(mas_detach, vms->mas_count++); > mas_store_gfp(mas_detach, next, GFP_KERNEL); Ah, ok I see now. I completely misunderstood what for_each_vma_range() was doing. Then I think vma_start_write() should remain inside vms_gather_munmap_vmas() and all vmas in mas_detach should be write-locked, even the ones we are not modifying. Otherwise what would prevent the race I mentioned before? __mmap_region __mmap_prepare vms_gather_munmap_vmas // adds vmas to be unmapped into mas_detach, // some locked by __split_vma(), some not locked lock_vma_under_rcu() vma =3D mas_walk // finds unlocked vma also in mas_detach vma_start_read(vma) // succeeds since vma is not locked // vma->detached, vm_start, vm_end checks pass // vma is successfully read-locked vms_clean_up_area(mas_detach) vms_clear_ptes // steps on a cleared PTE __mmap_new_vma vma_set_range // installs new vma in the range __mmap_complete vms_complete_munmap_vmas // vmas are write-locked and detached but it's too late > > > > > > > > > > > > > > > Yeah, I think we can simply do this: > > > > > > > > vms_complete_munmap_vmas > > > > vms_clear_ptes > > > > remove_vma > > > > vma_mark_detached > > > > mmap_write_downgrade > > > > > > > > If my assumption is incorrect, assertion inside vma_mark_detached() > > > > should trigger. I tried a quick test and so far nothing exploded. > > > > > > > > > > If they are write locked, then the page faults are not a concern. Th= ere > > > is also the rmap race that Jann found in mmap_region() [1]. This is > > > probably also fine since you are keeping the write lock in place earl= ier > > > on in the gather stage. Note the ptes will already be cleared by the > > > time vms_complete_munmap_vmas() is called in this case. > > > > > > [1] https://lore.kernel.org/all/CAG48ez0ZpGzxi=3D-5O_uGQ0xKXOmbjeQ0Lj= ZsRJ1Qtf2X5eOr1w@mail.gmail.com/ > > > > > > Thanks, > > > Liam > > > > > > To unsubscribe from this group and stop receiving emails from it, sen= d an email to kernel-team+unsubscribe@android.com. > > >