From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 350FDC369DC for ; Tue, 29 Apr 2025 18:55:40 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id CA5BB6B0005; Tue, 29 Apr 2025 14:55:38 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id C552B6B0006; Tue, 29 Apr 2025 14:55:38 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id AF61F6B0007; Tue, 29 Apr 2025 14:55:38 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0010.hostedemail.com [216.40.44.10]) by kanga.kvack.org (Postfix) with ESMTP id 8EBAC6B0005 for ; Tue, 29 Apr 2025 14:55:38 -0400 (EDT) Received: from smtpin15.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay06.hostedemail.com (Postfix) with ESMTP id A9753C710C for ; Tue, 29 Apr 2025 18:55:38 +0000 (UTC) X-FDA: 83387985156.15.CDD9206 Received: from mail-ed1-f53.google.com (mail-ed1-f53.google.com [209.85.208.53]) by imf19.hostedemail.com (Postfix) with ESMTP id B6A531A0006 for ; Tue, 29 Apr 2025 18:55:36 +0000 (UTC) Authentication-Results: imf19.hostedemail.com; dkim=pass header.d=google.com header.s=20230601 header.b=GOmfgvTf; dmarc=pass (policy=reject) header.from=google.com; spf=pass (imf19.hostedemail.com: domain of jannh@google.com designates 209.85.208.53 as permitted sender) smtp.mailfrom=jannh@google.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1745952936; a=rsa-sha256; cv=none; b=pc0MgUVoL6qAYjB5aaE7l9WlSfiOSqyzgLvfksrKs7kSV3/oOcsMDg6WY3QGA4EVHfxLqb CnfmY8v/vlYjkPzoBxjQX+0SXJy9KCEymvIxudMZ1SSMoqVDWpRjdwNnCyMYwrHp9jtUau CNlFdTMghFGTcaTa0NdmZq2fq7wsEKU= ARC-Authentication-Results: i=1; imf19.hostedemail.com; dkim=pass header.d=google.com header.s=20230601 header.b=GOmfgvTf; dmarc=pass (policy=reject) header.from=google.com; spf=pass (imf19.hostedemail.com: domain of jannh@google.com designates 209.85.208.53 as permitted sender) smtp.mailfrom=jannh@google.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1745952936; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=fGnPHWb56pcSnBYhY15/Vh6WYCrspAIDxl9o83SLKTU=; b=eThy6O92C649KALZyONDMhP+VWm4mUZmswZN5QqAdenyHKjJCoLXmyQPVCz7X7NIeXTWZD fv7qiRO76XHhoNnXLWYCtsBw0L+Voai9N8CsvQsunt4CsdiCfIwrp9QxIinv/oVd9AnGze b41v4uchyNEHBkFa9zvpWB0QnkdewYw= Received: by mail-ed1-f53.google.com with SMTP id 4fb4d7f45d1cf-5f632bada3bso4685a12.1 for ; Tue, 29 Apr 2025 11:55:36 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1745952935; x=1746557735; darn=kvack.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=fGnPHWb56pcSnBYhY15/Vh6WYCrspAIDxl9o83SLKTU=; b=GOmfgvTfr0wOEqQnf3daMqjWZrSHKLX2+V5Bua0zpMRyCeBJYYx6jq/+uB/LOz4Ix1 BSizMDnTFTBA+sVZqpbpjpZhORbFF4FoYkMv8McoSpH6ijDirPDTZWWC7/NcmiQx5ma7 tJoZ+rXDB6NT23fAnfB2xGO5CgZZ/BbFsHw+bAH7gG1mCu6rE3gGYUul6Oxw4WbZvff3 rLBvkORtCTws4ya/NrzDZwNyEqAbIrpfHZsVuYcVIe5vzdYpG/36+l+84iaUS53eSgvM SMOKO8yCxHYx1ovAW8Pe00rGLvx2gVk0y0BucVb8iWWA/60tQDb2IPnaXkubL1V0HTHr kcEA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1745952935; x=1746557735; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=fGnPHWb56pcSnBYhY15/Vh6WYCrspAIDxl9o83SLKTU=; b=SumrUcASa8St8qqjvOO9wRyQoFQgbJTIFY5zcOvn7Dwfi7tk/sS0ZmDX8jI1Q0g4+X 9rGvHrpF/vEfjl4L3cIO44YP54qqtXABxfKLK8T4fKqzcw9s69XM1yAzoKrfX1gGwFZh 4grM2qOOwKsd5nCpIiQU+rdIvoaXVRXDUsdfYGTFojzyBxJVynVlET2iEPrm8jQ/dtwM Itg0/UPg6AzmrPIShOzS/SWzVfkUVfsMelny5+7CB0qYqCWAmsr92D2zfKX8jPxMrllL QTAUimQwGe9FEr/46258jpKanhwgp/Wzk/r76GjUnVNkzME2OPMiJDqQNlYwA0mamCJQ 9KYQ== X-Forwarded-Encrypted: i=1; AJvYcCWznmlaJ96Q52hPgQBI3l3Qs+9+UdJcuv0DCGYsHKqtvW4qK2cxThc7W5mZRlaQbBWQeyz0PoU88g==@kvack.org X-Gm-Message-State: AOJu0YxRdfe/+JnQO9/L9IeM7XwXOX8sX24oCBMaZVFYwNij6bH4jo41 6D3nM8m+nKPH7xg8SL5n0uuRX1aIhZfsqx6s7T7flMr3VYSx2/+2Ih4ZOFJKoNPiScg/MLf9J8U votWP2+xkVgBscG31xGcenyjc5iMBujjYUpI9 X-Gm-Gg: ASbGncujv8XunSbRpp/6QdE1dBgnUtyW6CbZFo1XKBivLDj51q8Oj1AdCy4DwnWqLNp w+jLaAtqHWe53ygZ6Z4SEw2IwugucXMUtemQtD+TbO2qbk1FRiFcd9R43NW7rL9jdaGbyrl8S/C 7XDUHuXcl58v7Ripa6v7r42CTBmXETFQmoO/VKB9RiuYn/FEdM3g== X-Google-Smtp-Source: AGHT+IHmxauoWVK69ZX6csxpCjfd6+3DMPgSaiSiQF85VSZQGO77OL31TmN1qQanU5zZwaCdMYGDle7ECMfNccgth5k= X-Received: by 2002:aa7:d1cf:0:b0:5f8:6068:a45 with SMTP id 4fb4d7f45d1cf-5f89720c6c9mr13377a12.4.1745952934761; Tue, 29 Apr 2025 11:55:34 -0700 (PDT) MIME-Version: 1.0 References: <20250418174959.1431962-1-surenb@google.com> <20250418174959.1431962-8-surenb@google.com> In-Reply-To: From: Jann Horn Date: Tue, 29 Apr 2025 20:54:58 +0200 X-Gm-Features: ATxdqUGSx2jmcnf6ztX09dbGC0pwTc-91Nr3Oa-BIYmoIdsaikOVBQ9Xr0kogDA Message-ID: Subject: Re: [PATCH v3 7/8] mm/maps: read proc/pid/maps under RCU To: Suren Baghdasaryan , Al Viro , brauner@kernel.org, linux-fsdevel@vger.kernel.org Cc: akpm@linux-foundation.org, Liam.Howlett@oracle.com, lorenzo.stoakes@oracle.com, david@redhat.com, vbabka@suse.cz, peterx@redhat.com, hannes@cmpxchg.org, mhocko@kernel.org, paulmck@kernel.org, shuah@kernel.org, adobriyan@gmail.com, josef@toxicpanda.com, yebin10@huawei.com, linux@weissschuh.net, willy@infradead.org, osalvador@suse.de, andrii@kernel.org, ryan.roberts@arm.com, christophe.leroy@csgroup.eu, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-kselftest@vger.kernel.org Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspamd-Server: rspam03 X-Rspamd-Queue-Id: B6A531A0006 X-Stat-Signature: jxf15g6y89715rq8satgtg1rm4yjs9ht X-Rspam-User: X-HE-Tag: 1745952936-235144 X-HE-Meta: U2FsdGVkX1+LZ6n/DGkUdnW41bkJ6L/IZPltTAgeHp2PG2EUYTyaA5ioM2cWVcnyYJhH/UtP+LaJ8DFTBFu5KVXYMKef1BpzznyGZofSgaShfwht1hJ70In4+eK+VjBnT81U/87yNbCDgpdyAtSjl7xPwON8Kl8QXSCjYEZ1GdmcIpULgjsRvXlrMlNoTLzftuYGVLp8H5RaB/CD5FWZOJSVMw/koIcxSoaohhN4HXdehPGn4UiLStkmfIAYKDbTc7houlIA71wDIY0ZcbC6yea/B94a4iRrpmsRMB9xQtaAwPA8jI+5EbWppzMxsqxN/j1qKrnyl+iUXWUIamPv4WpVYZHOo8I7OreKvhn+WE0dCKhbPMh8v8IWXGIFAM3V0qcQLdeFhZSQtmrtjbdLvjjWg55Mjne8TofJX2oQ5YkmBrKefN8vDXgCWGPMPFW8DrxbhADrEbbqOGjoh+xDdbx6FglcqhXqkE2s6tB+7ZAKIbaxcoZZkCoskGQ4bw9HlDj5F8K4tKQGiADz5f/rRxeBAX7ZzGRMNy/VwWx3UYFVNo1HkDkpLanUBN+dM+ZWhUZaUeNHoCZ39ZVLmQnrAQ8107KZGo6AmoirIeD89tYOrXBgVjCsiG6DZrRQHcQ3G6yPMcKWlkjGg7K0q4G0NkN9HhLM4982+poRcKLVhBferpQuUjo3F2HKHQ59ql/M+vyh39N+Gk/H4ZrXu6jGsTEMmdiBl5qaoPl/fBOH0CBMtrrMVRk/ARYnuL613lKVxry+kPz8gZBpqlqXbjUa/1M+1qgDMIznAB+PgrkQLDFdYlbouxixbPCekIq2OX5N9VXjp9KqJtSaaEeyid3MX5SuAfm64Y8xwnGDnBqjdiGhLZUZKQ+QSERA6yWCWa3HR+MUS4CbXRiLMDTX7gxb46fUdJjeHXcc0ZonkrfJhFdtr779ezFmnt2UffTgugA1z7v3jnLJtEGQHFxZkW6 3WfVsdac mOEJbHPkbdS1iQgRKfaL847SvaYDL71nBSxxKeiLR5HF6Krw3iFPHNjUrUAItjuDDCMHT0GC5dgaRTAtL7tRnFyyQ6e0MLPuTS9WVjz9iydF1oMrq6B/gwTKaB6B1Sh2FXVxD+oAnS21ILp0UzxG6wn1s2QhKQ9+KKAHRzF4bOdNqcU9A+jFX5XefQaR04qIS5gor/dbq7/1kq43s0SVU7kIe63ltLJxA69M1 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Tue, Apr 29, 2025 at 8:04=E2=80=AFPM Suren Baghdasaryan wrote: > On Tue, Apr 29, 2025 at 10:21=E2=80=AFAM Jann Horn wro= te: > > > > Hi! > > > > (I just noticed that I incorrectly assumed that VMAs use kfree_rcu > > (not SLAB_TYPESAFE_BY_RCU) when I wrote my review of this, somehow I > > forgot all about that...) > > Does this fact affect your previous comments? Just want to make sure > I'm not missing something... When I suggested using "WRITE_ONCE(vma->vm_file, NULL)" when tearing down a VMA, and using get_file_rcu() for the lockless lookup, I did not realize that you could actually also race with all the other places that set ->vm_file, like __mmap_new_file_vma() and so on; and I did not think about whether any of those code paths might leave a VMA with a dangling ->vm_file pointer. I guess maybe that means you really do need to do the lookup from the copied data, as you did in your patch; and that might require calling get_file_active() on the copied ->vm_file pointer (instead of get_file_rcu()), even though I think that is not really how get_file_active() is supposed to be used (it's supposed to be used when you know the original file hasn't been freed yet). Really what you'd want for that is basically a raw __get_file_rcu(), but that is static and I think Christian wouldn't want to expose more of these internals outside VFS... (In that case, all the stuff below about get_file_rcu() would be moot.) Or you could pepper WRITE_ONCE() over all the places that write ->vm_file, and ensure that ->vm_file is always NULLed before its reference is dropped... but that seems a bit more ugly to me. > > On Tue, Apr 29, 2025 at 7:09=E2=80=AFPM Suren Baghdasaryan wrote: > > > On Tue, Apr 29, 2025 at 8:40=E2=80=AFAM Jann Horn = wrote: > > > > On Fri, Apr 18, 2025 at 7:50=E2=80=AFPM Suren Baghdasaryan wrote: > > > > > With maple_tree supporting vma tree traversal under RCU and vma a= nd > > > > > its important members being RCU-safe, /proc/pid/maps can be read = under > > > > > RCU and without the need to read-lock mmap_lock. However vma cont= ent > > > > > can change from under us, therefore we make a copy of the vma and= we > > > > > pin pointer fields used when generating the output (currently onl= y > > > > > vm_file and anon_name). Afterwards we check for concurrent addres= s > > > > > space modifications, wait for them to end and retry. While we tak= e > > > > > the mmap_lock for reading during such contention, we do that mome= ntarily > > > > > only to record new mm_wr_seq counter. This change is designed to = reduce > > > > > mmap_lock contention and prevent a process reading /proc/pid/maps= files > > > > > (often a low priority task, such as monitoring/data collection se= rvices) > > > > > from blocking address space updates. > > > > [...] > > > > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c > > > > > index b9e4fbbdf6e6..f9d50a61167c 100644 > > > > > --- a/fs/proc/task_mmu.c > > > > > +++ b/fs/proc/task_mmu.c > > > > [...] > > > > > +/* > > > > > + * Take VMA snapshot and pin vm_file and anon_name as they are u= sed by > > > > > + * show_map_vma. > > > > > + */ > > > > > +static int get_vma_snapshot(struct proc_maps_private *priv, stru= ct vm_area_struct *vma) > > > > > +{ > > > > > + struct vm_area_struct *copy =3D &priv->vma_copy; > > > > > + int ret =3D -EAGAIN; > > > > > + > > > > > + memcpy(copy, vma, sizeof(*vma)); > > > > > + if (copy->vm_file && !get_file_rcu(©->vm_file)) > > > > > + goto out; > > > > > > > > I think this uses get_file_rcu() in a different way than intended. > > > > > > > > As I understand it, get_file_rcu() is supposed to be called on a > > > > pointer which always points to a file with a non-zero refcount (exc= ept > > > > when it is NULL). That's why it takes a file** instead of a file* -= if > > > > it observes a zero refcount, it assumes that the pointer must have > > > > been updated in the meantime, and retries. Calling get_file_rcu() o= n a > > > > pointer that points to a file with zero refcount, which I think can > > > > happen with this patch, will cause an endless loop. > > > > (Just as background: For other usecases, get_file_rcu() is supposed= to > > > > still behave nicely and not spuriously return NULL when the file* i= s > > > > concurrently updated to point to another file*; that's what that lo= op > > > > is for.) > > > > > > Ah, I see. I wasn't aware of this subtlety. I think this is fixable b= y > > > checking the return value of get_file_rcu() and retrying speculation > > > if it changed. > > > > I think you could probably still end up looping endlessly in get_file_r= cu(). (Just to be clear: What I meant here is that get_file_rcu() loops *internally*; get_file_rcu() is not guaranteed to ever return if the pointed-to file has a zero refcount.) > By "retrying speculation" I meant it in the sense of > get_vma_snapshot() retry when it takes the mmap_read_lock and then > does mmap_lock_speculate_try_begin to restart speculation. I'm also > thinking about Liam's concern of guaranteeing forward progress for the > reader. Thinking maybe I should not drop mmap_read_lock immediately on > contention but generate some output (one vma or one page worth of > vmas) before dropping mmap_read_lock and proceeding with speculation. Hm, yeah, I guess you need that for forward progress... > > > > (If my understanding is correct, maybe we should document that more > > > > explicitly...) > > > > > > Good point. I'll add comments for get_file_rcu() as a separate patch. > > > > > > > > > > > Also, I think you are introducing an implicit assumption that > > > > remove_vma() does not NULL out the ->vm_file pointer (because that > > > > could cause tearing and could theoretically lead to a torn pointer > > > > being accessed here). > > > > > > > > One alternative might be to change the paths that drop references t= o > > > > vma->vm_file (search for vma_close to find them) such that they fir= st > > > > NULL out ->vm_file with a WRITE_ONCE() and do the fput() after that= , > > > > maybe with a new helper like this: > > > > > > > > static void vma_fput(struct vm_area_struct *vma) > > > > { > > > > struct file *file =3D vma->vm_file; > > > > > > > > if (file) { > > > > WRITE_ONCE(vma->vm_file, NULL); > > > > fput(file); > > > > } > > > > } > > > > > > > > Then on the lockless lookup path you could use get_file_rcu() on th= e > > > > ->vm_file pointer _of the original VMA_, and store the returned fil= e* > > > > into copy->vm_file. > > > > > > Ack. Except for storing the return value of get_file_rcu(). I think > > > once we detect that get_file_rcu() returns a different file we shoul= d > > > bail out and retry. The change in file is an indication that the vma > > > got changed from under us, so whatever we have is stale. > > > > What does "different file" mean here - what file* would you compare > > the returned one against? > > Inside get_vma_snapshot() I would pass the original vma->vm_file to > get_file_rcu() and check if it returns the same value. If the value > got changed we jump to /* Address space got modified, vma might be > stale. Re-lock and retry. */ section. That should work, right? Where do you get an "original vma->vm_file" from? To be clear, get_file_rcu(p) returns one of the values that *p had while get_file_rcu(p) is running.