Date: Wed, 18 Dec 2024 10:41:04 +0100
From: Peter Zijlstra
To: Suren Baghdasaryan
Cc: akpm@linux-foundation.org, willy@infradead.org, liam.howlett@oracle.com,
	lorenzo.stoakes@oracle.com, mhocko@suse.com, vbabka@suse.cz,
	hannes@cmpxchg.org, mjguzik@gmail.com, oliver.sang@intel.com,
	mgorman@techsingularity.net, david@redhat.com, peterx@redhat.com,
	oleg@redhat.com, dave@stgolabs.net, paulmck@kernel.org,
	brauner@kernel.org, dhowells@redhat.com, hdanton@sina.com,
	hughd@google.com, lokeshgidra@google.com, minchan@google.com,
	jannh@google.com, shakeel.butt@linux.dev, souravpanda@google.com,
	pasha.tatashin@soleen.com, klarasmodin@gmail.com, corbet@lwn.net,
	linux-doc@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, kernel-team@android.com
Subject: Re: [PATCH v6 10/16] mm: replace vm_lock and detached flag with a reference count
Message-ID: <20241218094104.GC2354@noisy.programming.kicks-ass.net>
References: <20241216192419.2970941-1-surenb@google.com>
	<20241216192419.2970941-11-surenb@google.com>
	<20241216213753.GD9803@noisy.programming.kicks-ass.net>
	<20241217103035.GD11133@noisy.programming.kicks-ass.net>

On Tue, Dec 17, 2024 at 08:27:46AM -0800, Suren Baghdasaryan wrote:

> > So I just replied there, and no, I don't think it makes sense. Just put
> > the kmem_cache_free() in vma_refcount_put(), to be done on 0.
>
> That's very appealing indeed and makes things much simpler. The
> problem I see with that is the case when we detach a vma from the tree
> to isolate it, then do some cleanup and only then free it.
> That's done in vms_gather_munmap_vmas() here:
> https://elixir.bootlin.com/linux/v6.12.5/source/mm/vma.c#L1240 and we
> even might reattach detached vmas back:
> https://elixir.bootlin.com/linux/v6.12.5/source/mm/vma.c#L1312. IOW,
> detached state is not final and we can't destroy the object that
> reached this state.

Urgh, so that's the munmap() path, but arguably when that fails, the
map stays in place.

I think this means you're marking detached too soon; you should only
mark detached once you reach the point of no return.

That said, once you've reached the point of no return, and are about to
go remove the page-tables, you very much want to ensure a lack of
concurrency.

So perhaps waiting for outstanding readers at this point isn't crazy.

Also, I'm having a very hard time reading this maple tree stuff :/
AFAICT vms_gather_munmap_vmas() only adds the VMAs to be removed to a
second tree, it does not in fact unlink them from the mm yet.

AFAICT it's vma_iter_clear_gfp() that actually wipes the vmas from the
mm -- and that being able to fail is mind-boggling and I suppose is what
gives rise to much of this insanity :/

Anyway, I would expect remove_vma() to be the one that marks it detached
(it's already unreachable through vma_lookup() at this point) and there
you should wait for concurrent readers to bugger off.

> We could change states to: 0=unused (we can free
> the object), 1=detached, 2=attached, etc. but then vma_start_read()
> should do something like refcount_inc_more_than_one() instead of
> refcount_inc_not_zero(). Would you be ok with such an approach?

Urgh, I would strongly suggest ditching refcount_t if we go this route.
The thing is; refcount_t should remain a 'simple' straightforward
interface and not allow people to do the wrong thing. It's not meant to
be the kitchen sink -- we have atomic_t for that.

Anyway, the more common scheme at that point is using -1 for 'free', I
think folio->_mapcount uses that even. For that see:
atomic_add_negative*().
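
[Editorial sketch, not part of the original message.] To make the
"-1 means free" idea concrete, here is a minimal userspace model using
C11 atomics. It is only an illustration under stated assumptions: the
struct, the helper names and the exact state encoding (-1 = free,
0 = attached with no readers, n = n readers) are hypothetical and are
not taken from the patch series. add_negative() and
inc_unless_negative() merely imitate what the kernel's
atomic_add_negative() and atomic_inc_unless_negative() do.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Hypothetical userspace model, NOT kernel code.
 *   cnt == -1 : free (nobody may use the object)
 *   cnt ==  0 : attached, no readers
 *   cnt ==  n : n readers still hold the object
 * The "attached" state itself counts as one reference, biased by -1.
 */
struct obj {
	atomic_int cnt;
	bool freed;	/* stands in for kmem_cache_free() */
};

/* Add i to *v and report whether the result is negative. */
static bool add_negative(atomic_int *v, int i)
{
	return atomic_fetch_add(v, i) + i < 0;
}

/* Take a reference unless the counter is already negative (free). */
static bool inc_unless_negative(atomic_int *v)
{
	int old = atomic_load(v);

	while (old >= 0) {
		/* On failure 'old' is reloaded with the current value. */
		if (atomic_compare_exchange_weak(v, &old, old + 1))
			return true;
	}
	return false;
}

static void obj_init(struct obj *o)
{
	atomic_init(&o->cnt, -1);
	o->freed = false;
}

/* Writer side: make the object visible to readers. */
static void obj_attach(struct obj *o)
{
	atomic_store(&o->cnt, 0);
}

/* Reader side: fails once the object is free. */
static bool obj_read_begin(struct obj *o)
{
	return inc_unless_negative(&o->cnt);
}

/*
 * Used both by a reader that is done and by the writer dropping the
 * "attached" reference; whoever brings the count to -1 frees.
 */
static void obj_put(struct obj *o)
{
	if (add_negative(&o->cnt, -1))
		o->freed = true;
}
```

In this model the free happens in obj_put() when the count drops to -1,
which is the "free on last put" scheme shifted down by one. It does not
by itself address reattaching a detached object or waiting for readers
at the point of no return; those would need extra states or a separate
wait.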

> > Additionally, having vma_end_write() would allow you to put a lockdep
> > annotation in vma_{start,end}_write() -- which was I think the original
> > reason I proposed it a while back, that and having improved clarity when
> > reading the code, since explicitly marking the end of a section is
> > helpful.
>
> The vma->vmlock_dep_map is tracking vma->vm_refcnt, not the
> vma->vm_lock_seq (similar to how today vma->vm_lock has its lockdep
> tracking that rw_semaphore). If I implement vma_end_write() then it
> will simply be something like:
>
> void vma_end_write(vma)
> {
>	vma_assert_write_locked(vma);
>	vma->vm_lock_seq = UINT_MAX;
> }
>
> so, vmlock_dep_map would not be involved.

That's just weird; why would you not track vma_{start,end}_write() with
the exclusive side of the 'rwsem' dep_map?

> If you want to track vma->vm_lock_seq with a separate lockdep, that
> would be more complicated. Specifically for vma_end_write_all() that
> would require us to call rwsem_release() on all locked vmas, however
> we currently do not track individual locked vmas. vma_end_write_all()
> allows us not to worry about tracking them, knowing that once we do
> mmap_write_unlock() they all will get unlocked with one increment of
> mm->mm_lock_seq. If your suggestion is to replace vma_end_write_all()
> with vma_end_write() and unlock vmas individually across the mm code,
> that would be a sizable effort. If that is indeed your ultimate goal,
> I can do that as a separate project: introduce vma_end_write(),
> gradually add them in required places (not yet sure how complex that
> would be), then retire vma_end_write_all() and add a lockdep for
> vma->vm_lock_seq.

Yeah, so ultimately I think it would be clearer if you explicitly mark
the point where the vma modification is 'done'.

But I don't suppose we have to do that here.
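
[Editorial sketch, not part of the original message.] The difference
between a per-vma vma_end_write() and the bulk vma_end_write_all() can
be modelled in a few lines of userspace C. Everything here is a
hypothetical model: the model_ prefixed names, the two structs and the
write_held counter (a crude stand-in for where lockdep acquire/release
annotations might go) are assumptions for illustration. Only the
sequence-number comparison and the UINT_MAX reset follow what the
thread describes.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical model of the mm-wide and per-vma sequence numbers. */
struct mm_model {
	unsigned int mm_lock_seq;
	int write_held;		/* stand-in for lockdep bookkeeping */
};

struct vma_model {
	struct mm_model *mm;
	unsigned int vm_lock_seq;	/* UINT_MAX = not write-locked */
};

/* A vma counts as write-locked while its sequence matches the mm's. */
static bool model_vma_is_write_locked(const struct vma_model *vma)
{
	return vma->vm_lock_seq == vma->mm->mm_lock_seq;
}

static void model_vma_start_write(struct vma_model *vma)
{
	if (model_vma_is_write_locked(vma))
		return;
	vma->vm_lock_seq = vma->mm->mm_lock_seq;
	vma->mm->write_held++;	/* an exclusive acquire could be annotated here */
}

/* Explicit end of one vma's write section, as in the quoted snippet. */
static void model_vma_end_write(struct vma_model *vma)
{
	assert(model_vma_is_write_locked(vma));
	vma->vm_lock_seq = UINT_MAX;
	vma->mm->write_held--;	/* the matching release could be annotated here */
}

/*
 * Bulk unlock: one increment unlocks every vma at once. The model does
 * not know which vmas were locked, so it can only reset the counter,
 * which mirrors the per-vma release problem raised in the thread.
 * Sequence wrap-around to UINT_MAX is ignored in this sketch.
 */
static void model_vma_end_write_all(struct mm_model *mm)
{
	mm->mm_lock_seq++;
	mm->write_held = 0;
}
```

With two vmas locked, ending the write on one leaves the other locked
until model_vma_end_write_all() bumps the mm sequence; the bulk path
never visits individual vmas, which is why pairing each start with an
explicit end is the part that would take a separate effort.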