From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 2F89CC3ABAA for ; Mon, 5 May 2025 13:50:54 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 5F6A56B0085; Mon, 5 May 2025 09:50:52 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 57D466B0089; Mon, 5 May 2025 09:50:52 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 41EBC6B0092; Mon, 5 May 2025 09:50:52 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0017.hostedemail.com [216.40.44.17]) by kanga.kvack.org (Postfix) with ESMTP id 20CBD6B0085 for ; Mon, 5 May 2025 09:50:52 -0400 (EDT) Received: from smtpin21.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay01.hostedemail.com (Postfix) with ESMTP id DBA6D1CC3CB for ; Mon, 5 May 2025 13:50:53 +0000 (UTC) X-FDA: 83408989986.21.32B7168 Received: from tor.source.kernel.org (tor.source.kernel.org [172.105.4.254]) by imf15.hostedemail.com (Postfix) with ESMTP id 40B7EA0017 for ; Mon, 5 May 2025 13:50:52 +0000 (UTC) Authentication-Results: imf15.hostedemail.com; dkim=pass header.d=kernel.org header.s=k20201202 header.b="hTQlqip/"; dmarc=pass (policy=quarantine) header.from=kernel.org; spf=pass (imf15.hostedemail.com: domain of brauner@kernel.org designates 172.105.4.254 as permitted sender) smtp.mailfrom=brauner@kernel.org ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1746453052; a=rsa-sha256; cv=none; b=Il72c8caIfaOAQqcjsmHnJTn9xShlC2xEHhb9kiV237bokc+FG4Vfao6FmBUis1JBJ0zME K8OV+otjoQRGeIAkvs9eH2KW2PShzt7pWD4HrH2Zdu1WPc9ociLBjZDhZj/d9Y+qo6WxOJ A1+vCNAWAY8JnDNLQJc8z00T72xUINw= ARC-Authentication-Results: i=1; imf15.hostedemail.com; dkim=pass header.d=kernel.org header.s=k20201202 header.b="hTQlqip/"; dmarc=pass (policy=quarantine) header.from=kernel.org; spf=pass (imf15.hostedemail.com: domain of brauner@kernel.org designates 172.105.4.254 as permitted sender) smtp.mailfrom=brauner@kernel.org ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1746453052; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=xo1jXVoCiMxXMbAACLU6SK5LvOzTpNeLH4R9QNIQZWQ=; b=QkJc8oFS23Hkb6pvmadGxzHn60Oy5ezEkdAQzZ6/WFH+HRoo3uCtKP4q9r1rhiVcKQC56p pPGLBrgFojhSSPOLLKawoCs0Js16w9j8IDrH3BwyiJh7Szmlu97evKKASNRiF6Y794DFWc 9bnt8jXNwI4pl8Fne9uPsz0nOnUIMxM= Received: from smtp.kernel.org (transwarp.subspace.kernel.org [100.75.92.58]) by tor.source.kernel.org (Postfix) with ESMTP id 2F8F561135; Mon, 5 May 2025 13:50:20 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id 17FD5C4CEE4; Mon, 5 May 2025 13:50:45 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1746453051; bh=Vqk0ZpZc2qCwVlVPa1gNgdMWQzO0ZUreacqfzsriyUM=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=hTQlqip/9RtDq3faJnMsVwE+9Nde3Mnr3H4IzjZ81T1juQF7EOoMHDLstfmKej10J H5gITaG4z0iFzIuIUu5YI7joHG0oT4jg+r6UknQ66QsthJD0whwF68MulFkkvvx6et WST+LOzBuWuy5KdJQ27kmB+EfVrnyofS9Igy5t4TJA6uKGA0YIxV3KfJwtKhE0pqQt 4t7Fv+aAPqZhR3s55I2uhb1r5gusiltCLG2pwI7myW53nhgqHhnin9Pc/ah2May61z 7/WfFkRLcv9dcAN0AftB7MhohCeBd6cdZO5VhJRPo7G8KY1PzQPG8f2UkLMhs1fkaW 5roincr1ws3Uw== Date: Mon, 5 May 2025 15:50:43 +0200 From: Christian Brauner To: Jann Horn Cc: Suren Baghdasaryan , Al Viro , linux-fsdevel@vger.kernel.org, akpm@linux-foundation.org, Liam.Howlett@oracle.com, lorenzo.stoakes@oracle.com, david@redhat.com, vbabka@suse.cz, peterx@redhat.com, hannes@cmpxchg.org, mhocko@kernel.org, paulmck@kernel.org, shuah@kernel.org, adobriyan@gmail.com, josef@toxicpanda.com, yebin10@huawei.com, linux@weissschuh.net, willy@infradead.org, osalvador@suse.de, andrii@kernel.org, ryan.roberts@arm.com, christophe.leroy@csgroup.eu, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-kselftest@vger.kernel.org Subject: Re: [PATCH v3 7/8] mm/maps: read proc/pid/maps under RCU Message-ID: <20250505-wachen-konform-3fe08f1b3214@brauner> References: <20250418174959.1431962-1-surenb@google.com> <20250418174959.1431962-8-surenb@google.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: X-Rspam-User: X-Rspamd-Queue-Id: 40B7EA0017 X-Rspamd-Server: rspam04 X-Stat-Signature: 7qwj793pues3ozoebhuykp9qrrph9g3b X-HE-Tag: 1746453052-273735 X-HE-Meta: U2FsdGVkX18hytkiJHf+zHkWegYlq8gWczhT84vFpEx7AMZi33vH52IrPK5VL8wQ111FxrKxga7Vvm17irNyUGfpk1PvT41JBZXTFKP0LxghpkcCk1HABCsH4Yics31wM3N8eoBZREwmOGbcSzmgEuDGPiqUTOYouBb8uJLOJApNJuHvnHykbn4WUT31SqK2fDkTNFCukN4KGrNNbOfeGWOPzr3PYT2xw95LVxOA96PDDa8cx4H7nF99u8p+6UJQk9ueAf0CSuoDcaY0IGqz2wM4rxNK6dU5GtSr+j7fZNqG8UZWNGSZ8gT1GSidCBeNOPLvM5hj0BvZRtFwImmvRyWntEUu49TC2R3pmROApc30N+tDbUUr0GHFq28n0Tf5M9YaYayeGpN2zVweLavBunlkPMlzR4hIdimyV5WcE1bW1BDrnIz4tNWbE9NpLPIsK1H6wf/kOIDZSoCYjRvILz9HRQCF33X118kjAmbeB+A54+dtnvGV6W8DyBUvLwOyOweAwM9tzNZ6xxrmOurQeYm3APOOCysoSWnZx+T4LJ0OQnsZzekB779L7VTzTdwLqVY7IUlIm4L0c2qae+xhrq6/CxBA9OAoypyx0nPpqgOXoT1UaEhPEDh6XNHcdH2/9MzqQimA2R/RWTeAz5GJYURcZixNYa1sFDkMkAbaA7KSp3DMJn/3ZbflyvFbJbkumsjKp3wFh9x6VK1guy+K09qeQZg7Q3mgFbIaA8rvvk1O0RXOBs0PIyqhgoH9/C7MMO3wG/OOrhjqLHkfYt8gr9Wi7quzkwIvvtecu5+41TX/8aK/x3IVjmj9KDpCdwPNySV0t3fDku1pUrOFa0UCoyhki+65qx64lE3zCKPCjMqUK/loUd1OGO/a2GD5GJp3IvfG8IsuDLMayN5o55pwbnuLYWnm47/F6sj2PsLXIuh5tfn3dxMariNn77XYzJzJKKesQscgNT+gk3ybeoE bIiJql0b cx8GyBd5b0eVdaQBACng/rpiWvhIGm4EqyHzM95WCBVQpsSMjg+35to6zZ9b7MZcBVxmqWWpFXUoLQC5MMWuaGRSvGXzGL+gYHQkh99VhZbWaug09lwiyaYGzb88KGcmXxpWJpTV/oMUSFHp9fwP4S0YmCD3z0NJ8AS3FM9iGYFzoUZ8CGpjwtL7pYsiycY6b251uIFsBt2asApj9FDRffJG8ryr0UEH7zLOg2zDRSnO0Pm6XoH8fnQPMKieOcN1AntjpzRDEnCID2Kniee0AtQWi6FahT0qfHe+ZZpWfvkI1ShoWjgYR8JyHS/REHBks70dj9EE40LuWlNKYC32DLqgCOwasmCrKj0K4 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Tue, Apr 29, 2025 at 08:54:58PM +0200, Jann Horn wrote: > On Tue, Apr 29, 2025 at 8:04 PM Suren Baghdasaryan wrote: > > On Tue, Apr 29, 2025 at 10:21 AM Jann Horn wrote: > > > > > > Hi! > > > > > > (I just noticed that I incorrectly assumed that VMAs use kfree_rcu > > > (not SLAB_TYPESAFE_BY_RCU) when I wrote my review of this, somehow I > > > forgot all about that...) > > > > Does this fact affect your previous comments? Just want to make sure > > I'm not missing something... > > When I suggested using "WRITE_ONCE(vma->vm_file, NULL)" when tearing > down a VMA, and using get_file_rcu() for the lockless lookup, I did > not realize that you could actually also race with all the other > places that set ->vm_file, like __mmap_new_file_vma() and so on; and I > did not think about whether any of those code paths might leave a VMA > with a dangling ->vm_file pointer. > > I guess maybe that means you really do need to do the lookup from the > copied data, as you did in your patch; and that might require calling > get_file_active() on the copied ->vm_file pointer (instead of > get_file_rcu()), even though I think that is not really how > get_file_active() is supposed to be used (it's supposed to be used > when you know the original file hasn't been freed yet). Really what I think it's fine for get_file_active() to be used in this way. That ->vm_file pointer usage should get a fat comment above it explaining how what you're doing is safe. > you'd want for that is basically a raw __get_file_rcu(), but that is > static and I think Christian wouldn't want to expose more of these > internals outside VFS... Yeah, no. I don't want that to be usable outside of that file. > (In that case, all the stuff below about get_file_rcu() would be moot.) > > Or you could pepper WRITE_ONCE() over all the places that write > ->vm_file, and ensure that ->vm_file is always NULLed before its > reference is dropped... but that seems a bit more ugly to me. > > > > On Tue, Apr 29, 2025 at 7:09 PM Suren Baghdasaryan wrote: > > > > On Tue, Apr 29, 2025 at 8:40 AM Jann Horn wrote: > > > > > On Fri, Apr 18, 2025 at 7:50 PM Suren Baghdasaryan wrote: > > > > > > With maple_tree supporting vma tree traversal under RCU and vma and > > > > > > its important members being RCU-safe, /proc/pid/maps can be read under > > > > > > RCU and without the need to read-lock mmap_lock. However vma content > > > > > > can change from under us, therefore we make a copy of the vma and we > > > > > > pin pointer fields used when generating the output (currently only > > > > > > vm_file and anon_name). Afterwards we check for concurrent address > > > > > > space modifications, wait for them to end and retry. While we take > > > > > > the mmap_lock for reading during such contention, we do that momentarily > > > > > > only to record new mm_wr_seq counter. This change is designed to reduce > > > > > > mmap_lock contention and prevent a process reading /proc/pid/maps files > > > > > > (often a low priority task, such as monitoring/data collection services) > > > > > > from blocking address space updates. > > > > > [...] > > > > > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c > > > > > > index b9e4fbbdf6e6..f9d50a61167c 100644 > > > > > > --- a/fs/proc/task_mmu.c > > > > > > +++ b/fs/proc/task_mmu.c > > > > > [...] > > > > > > +/* > > > > > > + * Take VMA snapshot and pin vm_file and anon_name as they are used by > > > > > > + * show_map_vma. > > > > > > + */ > > > > > > +static int get_vma_snapshot(struct proc_maps_private *priv, struct vm_area_struct *vma) > > > > > > +{ > > > > > > + struct vm_area_struct *copy = &priv->vma_copy; > > > > > > + int ret = -EAGAIN; > > > > > > + > > > > > > + memcpy(copy, vma, sizeof(*vma)); > > > > > > + if (copy->vm_file && !get_file_rcu(©->vm_file)) > > > > > > + goto out; > > > > > > > > > > I think this uses get_file_rcu() in a different way than intended. > > > > > > > > > > As I understand it, get_file_rcu() is supposed to be called on a > > > > > pointer which always points to a file with a non-zero refcount (except > > > > > when it is NULL). That's why it takes a file** instead of a file* - if > > > > > it observes a zero refcount, it assumes that the pointer must have > > > > > been updated in the meantime, and retries. Calling get_file_rcu() on a > > > > > pointer that points to a file with zero refcount, which I think can > > > > > happen with this patch, will cause an endless loop. > > > > > (Just as background: For other usecases, get_file_rcu() is supposed to > > > > > still behave nicely and not spuriously return NULL when the file* is > > > > > concurrently updated to point to another file*; that's what that loop > > > > > is for.) > > > > > > > > Ah, I see. I wasn't aware of this subtlety. I think this is fixable by > > > > checking the return value of get_file_rcu() and retrying speculation > > > > if it changed. > > > > > > I think you could probably still end up looping endlessly in get_file_rcu(). > > (Just to be clear: What I meant here is that get_file_rcu() loops > *internally*; get_file_rcu() is not guaranteed to ever return if the > pointed-to file has a zero refcount.) > > > By "retrying speculation" I meant it in the sense of > > get_vma_snapshot() retry when it takes the mmap_read_lock and then > > does mmap_lock_speculate_try_begin to restart speculation. I'm also > > thinking about Liam's concern of guaranteeing forward progress for the > > reader. Thinking maybe I should not drop mmap_read_lock immediately on > > contention but generate some output (one vma or one page worth of > > vmas) before dropping mmap_read_lock and proceeding with speculation. > > Hm, yeah, I guess you need that for forward progress... > > > > > > (If my understanding is correct, maybe we should document that more > > > > > explicitly...) > > > > > > > > Good point. I'll add comments for get_file_rcu() as a separate patch. > > > > > > > > > > > > > > Also, I think you are introducing an implicit assumption that > > > > > remove_vma() does not NULL out the ->vm_file pointer (because that > > > > > could cause tearing and could theoretically lead to a torn pointer > > > > > being accessed here). > > > > > > > > > > One alternative might be to change the paths that drop references to > > > > > vma->vm_file (search for vma_close to find them) such that they first > > > > > NULL out ->vm_file with a WRITE_ONCE() and do the fput() after that, > > > > > maybe with a new helper like this: > > > > > > > > > > static void vma_fput(struct vm_area_struct *vma) > > > > > { > > > > > struct file *file = vma->vm_file; > > > > > > > > > > if (file) { > > > > > WRITE_ONCE(vma->vm_file, NULL); > > > > > fput(file); > > > > > } > > > > > } > > > > > > > > > > Then on the lockless lookup path you could use get_file_rcu() on the > > > > > ->vm_file pointer _of the original VMA_, and store the returned file* > > > > > into copy->vm_file. > > > > > > > > Ack. Except for storing the return value of get_file_rcu(). I think > > > > once we detect that get_file_rcu() returns a different file we should > > > > bail out and retry. The change in file is an indication that the vma > > > > got changed from under us, so whatever we have is stale. > > > > > > What does "different file" mean here - what file* would you compare > > > the returned one against? > > > > Inside get_vma_snapshot() I would pass the original vma->vm_file to > > get_file_rcu() and check if it returns the same value. If the value > > got changed we jump to /* Address space got modified, vma might be > > stale. Re-lock and retry. */ section. That should work, right? > > Where do you get an "original vma->vm_file" from? > > To be clear, get_file_rcu(p) returns one of the values that *p had > while get_file_rcu(p) is running.