From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id B12BCE77187 for ; Wed, 18 Dec 2024 15:42:45 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 418686B0099; Wed, 18 Dec 2024 10:42:45 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 3C6266B00A2; Wed, 18 Dec 2024 10:42:45 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 21B046B00A4; Wed, 18 Dec 2024 10:42:45 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0011.hostedemail.com [216.40.44.11]) by kanga.kvack.org (Postfix) with ESMTP id F1EA06B00A2 for ; Wed, 18 Dec 2024 10:42:44 -0500 (EST) Received: from smtpin17.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay08.hostedemail.com (Postfix) with ESMTP id 7F75F140FB6 for ; Wed, 18 Dec 2024 15:42:44 +0000 (UTC) X-FDA: 82908495180.17.16903BB Received: from mail-qt1-f181.google.com (mail-qt1-f181.google.com [209.85.160.181]) by imf18.hostedemail.com (Postfix) with ESMTP id 364151C0004 for ; Wed, 18 Dec 2024 15:42:28 +0000 (UTC) Authentication-Results: imf18.hostedemail.com; dkim=pass header.d=google.com header.s=20230601 header.b="Mu/tMNMD"; spf=pass (imf18.hostedemail.com: domain of surenb@google.com designates 209.85.160.181 as permitted sender) smtp.mailfrom=surenb@google.com; dmarc=pass (policy=reject) header.from=google.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1734536532; a=rsa-sha256; cv=none; b=egslpd/rWQOtFmNm9KboZaaG3YkwP6w7Dy2BwlgTYnxlbUYwNqteHc5rtozAs0QgKLPXgr uHjC8UP7Ut9Bu125I2sX6jA6t2HSYPF5tdSS2qal/PfTGSuNZ8anMqS2CF7XCrWPoisRtJ NDJPDYSy7g4DvAD9tBPvIOlDhuyXWoM= ARC-Authentication-Results: i=1; imf18.hostedemail.com; dkim=pass header.d=google.com header.s=20230601 header.b="Mu/tMNMD"; spf=pass (imf18.hostedemail.com: domain of surenb@google.com designates 209.85.160.181 as permitted sender) smtp.mailfrom=surenb@google.com; dmarc=pass (policy=reject) header.from=google.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1734536532; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=RhLwZHFFLal8r/YFxj6rYNit8ZadJWy9LCabm24mSbs=; b=vEQbDbxk4QNFzq11NGfBt1fiYA5HftGJ0HLoNG0uNvDLLBhNlS1pt9dyqf2vRuIOHWe4r0 VcDP6jmEEFHyAzlwsWsPND0T/9QKo59oWKbmIH+BXYvsUAGtplNUEXLN8E/tIDjONsioMW hUzoMx5Ky0YwnTeprTSKNoQSOoS6Yjw= Received: by mail-qt1-f181.google.com with SMTP id d75a77b69052e-467abce2ef9so307241cf.0 for ; Wed, 18 Dec 2024 07:42:42 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1734536562; x=1735141362; darn=kvack.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=RhLwZHFFLal8r/YFxj6rYNit8ZadJWy9LCabm24mSbs=; b=Mu/tMNMDvF0zyoSh1dHXeme48EMoqbtdKqbVPdU5Afhatwg4bsjaQkRJcjx/cRD2u+ /83Gh2rJUiDe3X9asggVPrUuYnnK7uHq7jv0tkr5bEHu4mIZYgUM31bRIU/KiM4h/Yls UWdjPHjtwH3lFBI8UVHaFTIIpmbgkU4giMhRyOw173FtDrcZHKTaIqGNfUNh5wXgRUC9 AAIC19ESyxV2Z6fz7bpVg2dpDiyGSYBM9PBL6E8CxtUDJArlswnHyvGsgLaF126CUj91 f2A4oyK329E7qv+rmohNa1sogQUyRG5XaUonDawr3cX2Y/Ht3i/NNAB+NPzqQ7w8dd7h tBJg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1734536562; x=1735141362; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=RhLwZHFFLal8r/YFxj6rYNit8ZadJWy9LCabm24mSbs=; b=G1ActwOJEN56ZOnJvqxbU4/H8gwViS/q0rsxdwrxZ3IerGJavZhI+DeJuh68szJQwa 436XDPEgfIB2J4r13RQRUbSlhhHDAEkakV2IRWUMp1lZUW05iJS/wYBB9vyDp51iV4pv maqTP086UYn5nYkkVkwcxvRtb0DsPiUlv9C4e6nabe8JD64s/3AOeTH+4cN2pmR8v+EA xzAcpFRMYfOR67sXlIX08SV4l0T6x7urTcTi/cbRPshQpy47Zpl17l3eYhsfRdX1d91L AW34Yf1/2uY2IZ3cWMfRfNHcBjYYSazwn8BaNGOPARlRrZNFfBwb8rPQIV8R9O/+Laja 6PXA== X-Forwarded-Encrypted: i=1; AJvYcCV3g8747EgK9RUmRJ1G9BQQBqfEUruab05/N+LbDwe37SBCmvxZzp7b/kWF1Rowx85mDAHWjwxIQw==@kvack.org X-Gm-Message-State: AOJu0Yx1PMz+uAjXMCMkiitpO+VZIxImH/CHXDT4dmonw8FXKCwUZ1g0 Tk0Iszwcs40Pa1So1QLKP8qQPEL5zij7FIsyBqm0fr0aa4n+MtPNkXLMUilKSwWgP1FVU0KJtFc 7swHmAqOzJkZxu+DiQuNaNK4/6ij2vpzpbejy X-Gm-Gg: ASbGncv5+l39UkM2Os7Eycl3sO85e97BGlgV4CNOE/eRSDtVEXMPwTM9SYEbG2pbtBf Cp8erTtJSO+Y9Cfp71wKCPAszqnIcjp+JfpQ3UQ== X-Google-Smtp-Source: AGHT+IHIVrlAuAEmTItLgcIAZKH1cOIzgAhlTg75ribMwVg4yaOTqgzEJEFSBKfxanrmcExN/1AW/CMo/9X1UD4BkME= X-Received: by 2002:a05:622a:2298:b0:46a:312a:54bc with SMTP id d75a77b69052e-46a312a580emr2326541cf.24.1734536561349; Wed, 18 Dec 2024 07:42:41 -0800 (PST) MIME-Version: 1.0 References: <20241216192419.2970941-1-surenb@google.com> <20241216192419.2970941-11-surenb@google.com> <20241216213753.GD9803@noisy.programming.kicks-ass.net> <20241217103035.GD11133@noisy.programming.kicks-ass.net> <20241218094104.GC2354@noisy.programming.kicks-ass.net> In-Reply-To: <20241218094104.GC2354@noisy.programming.kicks-ass.net> From: Suren Baghdasaryan Date: Wed, 18 Dec 2024 07:42:30 -0800 Message-ID: Subject: Re: [PATCH v6 10/16] mm: replace vm_lock and detached flag with a reference count To: Peter Zijlstra Cc: akpm@linux-foundation.org, willy@infradead.org, liam.howlett@oracle.com, lorenzo.stoakes@oracle.com, mhocko@suse.com, vbabka@suse.cz, hannes@cmpxchg.org, mjguzik@gmail.com, oliver.sang@intel.com, mgorman@techsingularity.net, david@redhat.com, peterx@redhat.com, oleg@redhat.com, dave@stgolabs.net, paulmck@kernel.org, brauner@kernel.org, dhowells@redhat.com, hdanton@sina.com, hughd@google.com, lokeshgidra@google.com, minchan@google.com, jannh@google.com, shakeel.butt@linux.dev, souravpanda@google.com, pasha.tatashin@soleen.com, klarasmodin@gmail.com, corbet@lwn.net, linux-doc@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, kernel-team@android.com Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspamd-Queue-Id: 364151C0004 X-Stat-Signature: smgoceumrn3g68ka5f8qgfznj463pgmh X-Rspam-User: X-Rspamd-Server: rspam09 X-HE-Tag: 1734536548-406053 X-HE-Meta: U2FsdGVkX19th5s6S5/FlxWQUGfuYnfuCLYcMzdShoOB9WcYIyJu9T7KbQj1OvdxbXKXEHI2Dc92Q0YzmfWaPhIExWJU/vglxXZZ0AI5RFuPEAsNfHa1feRdlHdg6XVAjnPdeSvLEzc0r1EL66QsN8goXkBZ6yT1SUnhwHlDiMPip+TbYEVkpcJu3S7yMN7eEn28QhTJsmKcWwDj5aZYS6g+M5tRQYZxNV4Zx4G8CFpw4UEgihBJI15IigZYZaINYckzuIG2txW2ZR5kiDCFVTFYtvA0KGz1GAivVQhcICgxYjoYBmb+EQdDFhaZNts7BY5oK9jSxfb45aL4GRkXdSn+ucySeK3NDtpVvEuhZa1yJ7GDsHFx2BPEk6uGR5ZoHa44BVn6wDZbNBYXt1lNi+OsG6oxZSEzGg5vgkdOxy9WJwvlFyxd5qtA05IbNv/H7BPgTVb3jcu56LSaIQhrk6edsD/cdklvfZ8kC6XtBG5bpbWnQbQfVSauIVy66v3hFyHKV74tOcYgKRLENRv7dePHgrowCGfGfOzDoHXXjINUkie/ER0YlvLZNvLXOoX8hiQYOKSgx8scB+w2CHTbKbPNN/Pt9bM7W96IwbQ5s+wYKJphS/PIXCFxPoCsuP2CjRlDkH/EFOMezQ8Tb3Fnn6o66idvbYO1X1c+xQOooQmw9OXxxLv39PbNPJQOqHybiangb6e0/PQ6zetHHgHwnWSVYA4fdFsyFbXTzCpUNh8UJ2o3NYWt7Zgmmb5ErBnicRnA3v4G8DwbtbV4lRDJXdR2WxdkLJ3VDvXE6KpxzdmSk3fIk7wT7cj9jWfrWlJV5UeDv0SU5QC8P0Id2KFgXaepiFK8bxJsGOnD/7ixagzvapuW0KFAkFd6Guh+fSPpNXL6WnpcrDFhATh+465Qo//EfpKGrDiv6Ixf58wDa+8UKrBm7pD2AyLEjOO9qAOUnAp8fPf/QArd3UdQMIE wrHIiIHD jeWEnnEvsgJQOHgrL10X04JXoJFVfjkY9u8sCVx3eJy2VKGu8qW6dy1Yo5Q+8BvmGgy54Q++UX3xG8IBKo0BF800XC6r/+75tvWlOlJzGPNiRUEChAByIIndmMuOBZIa0xF7lmgwduIFCGWeylqGdYHO6aAck6gJ7r24VOfpyMaST85rR0jb7U34DV1ADmNl3uCMAnTbNea1aqtFmuyjGnWih89AaPt1jf598YKD01qmf3/ZehjADLP2R1YUXltgkLmuHyHDb8klQ/8JnEWfr9RNgNr8CNogSHmgSaLQyiIQkCfrS6/ZEUUj1kW4SWunbMQDMtjX8nitm4/QRn+9X9G0CsQlgbGMzW+P3SreB2H/ElUg= X-Bogosity: Ham, tests=bogofilter, spamicity=0.233842, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Wed, Dec 18, 2024 at 1:41=E2=80=AFAM Peter Zijlstra wrote: > > On Tue, Dec 17, 2024 at 08:27:46AM -0800, Suren Baghdasaryan wrote: > > > > So I just replied there, and no, I don't think it makes sense. Just p= ut > > > the kmem_cache_free() in vma_refcount_put(), to be done on 0. > > > > That's very appealing indeed and makes things much simpler. The > > problem I see with that is the case when we detach a vma from the tree > > to isolate it, then do some cleanup and only then free it. That's done > > in vms_gather_munmap_vmas() here: > > https://elixir.bootlin.com/linux/v6.12.5/source/mm/vma.c#L1240 and we > > even might reattach detached vmas back: > > https://elixir.bootlin.com/linux/v6.12.5/source/mm/vma.c#L1312. IOW, > > detached state is not final and we can't destroy the object that > > reached this state. > > Urgh, so that's the munmap() path, but arguably when that fails, the > map stays in place. > > I think this means you're marking detached too soon; you should only > mark detached once you reach the point of no return. > > That said, once you've reached the point of no return; and are about to > go remove the page-tables, you very much want to ensure a lack of > concurrency. > > So perhaps waiting for out-standing readers at this point isn't crazy. > > Also, I'm having a very hard time reading this maple tree stuff :/ > Afaict vms_gather_munmap_vmas() only adds the VMAs to be removed to a > second tree, it does not in fact unlink them from the mm yet. Yes, I think you are correct. > > AFAICT it's vma_iter_clear_gfp() that actually wipes the vmas from the > mm -- and that being able to fail is mind boggling and I suppose is what > gives rise to much of this insanity :/ > > Anyway, I would expect remove_vma() to be the one that marks it detached > (it's already unreachable through vma_lookup() at this point) and there > you should wait for concurrent readers to bugger off. There is an issue with that. Note that vms_complete_munmap_vmas() that's calling remove_vma() might drop the mmap write lock, so detaching without a write lock would break current rules. > > > We could change states to: 0=3Dunused (we can free > > the object), 1=3Ddetached, 2=3Dattached, etc. but then vma_start_read() > > should do something like refcount_inc_more_than_one() instead of > > refcount_inc_not_zero(). Would you be ok with such an approach? > > Urgh, I would strongly suggest ditching refcount_t if we go this route. > The thing is; refcount_t should remain a 'simple' straight forward > interface and not allow people to do the wrong thing. Its not meant to > be the kitchen sink -- we have atomic_t for that. Ack. If we go this route I'll use atomics directly. > > Anyway, the more common scheme at that point is using -1 for 'free', I > think folio->_mapcount uses that even. For that see: > atomic_add_negative*(). Thanks for the reference. > > > > Additionally, having vma_end_write() would allow you to put a lockdep > > > annotation in vma_{start,end}_write() -- which was I think the origin= al > > > reason I proposed it a while back, that and having improved clarity w= hen > > > reading the code, since explicitly marking the end of a section is > > > helpful. > > > > The vma->vmlock_dep_map is tracking vma->vm_refcnt, not the > > vma->vm_lock_seq (similar to how today vma->vm_lock has its lockdep > > tracking that rw_semaphore). If I implement vma_end_write() then it > > will simply be something like: > > > > void vma_end_write(vma) > > { > > vma_assert_write_locked(vma); > > vma->vm_lock_seq =3D UINT_MAX; > > } > > > > so, vmlock_dep_map would not be involved. > > That's just weird; why would you not track vma_{start,end}_write() with > the exclusive side of the 'rwsem' dep_map ? > > > If you want to track vma->vm_lock_seq with a separate lockdep, that > > would be more complicated. Specifically for vma_end_write_all() that > > would require us to call rwsem_release() on all locked vmas, however > > we currently do not track individual locked vmas. vma_end_write_all() > > allows us not to worry about tracking them, knowing that once we do > > mmap_write_unlock() they all will get unlocked with one increment of > > mm->mm_lock_seq. If your suggestion is to replace vma_end_write_all() > > with vma_end_write() and unlock vmas individually across the mm code, > > that would be a sizable effort. If that is indeed your ultimate goal, > > I can do that as a separate project: introduce vma_end_write(), > > gradually add them in required places (not yet sure how complex that > > would be), then retire vma_end_write_all() and add a lockdep for > > vma->vm_lock_seq. > > Yeah, so ultimately I think it would be clearer if you explicitly mark > the point where the vma modification is 'done'. But I don't suppose we > have to do that here. Ack.