From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 9160BC369DC for ; Tue, 29 Apr 2025 20:33:22 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 967556B0082; Tue, 29 Apr 2025 16:33:19 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 8EE046B0083; Tue, 29 Apr 2025 16:33:19 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 766106B0085; Tue, 29 Apr 2025 16:33:19 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0017.hostedemail.com [216.40.44.17]) by kanga.kvack.org (Postfix) with ESMTP id 29E206B0082 for ; Tue, 29 Apr 2025 16:33:19 -0400 (EDT) Received: from smtpin09.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay10.hostedemail.com (Postfix) with ESMTP id AD681C1210 for ; Tue, 29 Apr 2025 20:33:20 +0000 (UTC) X-FDA: 83388231360.09.2E0536F Received: from mail-qt1-f172.google.com (mail-qt1-f172.google.com [209.85.160.172]) by imf02.hostedemail.com (Postfix) with ESMTP id D08AE80012 for ; Tue, 29 Apr 2025 20:33:18 +0000 (UTC) Authentication-Results: imf02.hostedemail.com; dkim=pass header.d=google.com header.s=20230601 header.b=Bc0v3Rhb; dmarc=pass (policy=reject) header.from=google.com; spf=pass (imf02.hostedemail.com: domain of surenb@google.com designates 209.85.160.172 as permitted sender) smtp.mailfrom=surenb@google.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1745958798; a=rsa-sha256; cv=none; b=f5NCh3hsaseT2ddUkWr+/exvm2zo91EdrhzvQE9gvm2/uj4nQHDyt3biJ9v/8rRaykfSAg 9UieqAvOsNlcylN4SfeJeOhT0yaRDou/wAfJTRjPcxL8C1nmZipfQf8t3FiL/XYAfMkouZ xqsGXhiSx60XB5KAgXsNV4z+kgiCNRU= ARC-Authentication-Results: i=1; imf02.hostedemail.com; dkim=pass header.d=google.com header.s=20230601 header.b=Bc0v3Rhb; dmarc=pass (policy=reject) header.from=google.com; spf=pass (imf02.hostedemail.com: domain of surenb@google.com designates 209.85.160.172 as permitted sender) smtp.mailfrom=surenb@google.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1745958798; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=6RuHqTNBtcvgW69uG+veZI2+mQpj/OYd2zcCul6gKi8=; b=p2n1UTqpG0YzaOKEEcCDENrwnSlhRLuK2anVwZpCrXUmUAPuaffiLyZdwqKNc/pF9jgZKo 0BSAtw8Y664XGK2uCJvZGeHkafRERng9P0aznSOOaQDmZ1YeTqX6eY85QeOEltv+qT+MAP fu6JCSO3cYCbf/5695yJT8xzyUL0YtU= Received: by mail-qt1-f172.google.com with SMTP id d75a77b69052e-4774611d40bso347541cf.0 for ; Tue, 29 Apr 2025 13:33:18 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1745958798; x=1746563598; darn=kvack.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=6RuHqTNBtcvgW69uG+veZI2+mQpj/OYd2zcCul6gKi8=; b=Bc0v3RhbEQPoUNXr9jG51/1HeT//7ZUBzveeThZ/6ZzDlhbXzc1XwtWhD2BnhcE3g8 W0JYTJuf6Oa7wWVXfA71nI+aLow7pR9AaAdcXJTspCsbAoz8j0LdBU3yArZxLvNrAZUU GwSAsMyCJP6ld3YAXS0sTl22QFHmDCgvLhPGiFCz9Sx3zuXtzPjaWUGZqqHZscGc4XW3 QJsfFmJDJ9KUSKN4bUYS4CcVGw9xBFFCCSr27W2jPBH0BjnWhNdJfX95iXiG8x5aG72z q53A60SHhU5uXe9t4VKqZavS4y2ZmSv57c4mN3yxccgf58LtGcdP4mlwlCgyvqwP3nfm 1eVA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1745958798; x=1746563598; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=6RuHqTNBtcvgW69uG+veZI2+mQpj/OYd2zcCul6gKi8=; b=f7KmBXj/BC7zm8pli2wbJG5DAZ2ksgBpn5mKttHBf4UqAQExKnoXXjWyerj6nn6+OY z1JqRVoGdNUjfVbLHUMB/Gr88bDjsZTooqVU4o3rub8B9sxVIsuSnLlWCx7FY/fudzWp UsL1Z1ycu/udaolIOYXog5zaW/0N69T8DhbaKFprMQIeCy50ws6iDMH2p+6y9WJY21z2 Y0bQZEJASHjIpZ/f4SA5aR3fQsh/bHfEOUGtKwpbqV3HyOyFJp+fnd6sYUG02gILPkv+ ItzmAOM47xqyJeDrb28XTJBNAD4iu22IlNgJ8+mdXMRKxJK0dD/2+gMSQqvf0viLTqJr fOfQ== X-Forwarded-Encrypted: i=1; AJvYcCUriadSsirvS2dqZNCZKz0C/EGy3tGZ9Zihmxs/uPXt5gqAVnDQxUj8VQYT3pgiLgLCGjvPECsVxw==@kvack.org X-Gm-Message-State: AOJu0Yxpm+fEfbtJWbkAn8O4PLS/CIJtomS8usFiJWwseDqYlgEvlKw8 HddX6KdK+fnpybNhW/l70QVcfNW3VXXfcWeIgjex970V57VQuujNZvImg9Xsn34ZJCJVMjXnM7Z TXOQDlZhz0rbsZVxh/t/GXm2hR4qGf7aBHQeR X-Gm-Gg: ASbGnctcFde1AiVrlBO2QT9YgpeWt+BCFFmDnmFrJIyOfPwjIAjjhCvIYykayuD6Gky 12FHMTurP0WaaTf3e3+bzJGz4Xyg6y0T/QuHxyeFMrYJQh9y+WQ1NbwHAm8kX2gR1BTa9uuzw2h gauwfWQxfnyf0TMNnLXUiEMGN1t2dvOdisCAFoS/Zuyx9XP0M4hXJF3z+o9LPfTuE= X-Google-Smtp-Source: AGHT+IFfAqUaqm8TlmxiDs/TpA8pVnnKtFrG+o0jmeCwd2WChU9Uegcp05S3xNJHn3OsPIEFvPnGKyo2zqt4PAryZbg= X-Received: by 2002:ac8:578b:0:b0:489:7ea9:4263 with SMTP id d75a77b69052e-489b9838a60mr982441cf.10.1745958797430; Tue, 29 Apr 2025 13:33:17 -0700 (PDT) MIME-Version: 1.0 References: <20250418174959.1431962-1-surenb@google.com> <20250418174959.1431962-8-surenb@google.com> In-Reply-To: From: Suren Baghdasaryan Date: Tue, 29 Apr 2025 13:33:05 -0700 X-Gm-Features: ATxdqUEe002VsMDCR8tZV7bMaBYZb2LRjVYTvWo45vIOB09sAK6-nvvtt3xxSRU Message-ID: Subject: Re: [PATCH v3 7/8] mm/maps: read proc/pid/maps under RCU To: Jann Horn Cc: Al Viro , brauner@kernel.org, linux-fsdevel@vger.kernel.org, akpm@linux-foundation.org, Liam.Howlett@oracle.com, lorenzo.stoakes@oracle.com, david@redhat.com, vbabka@suse.cz, peterx@redhat.com, hannes@cmpxchg.org, mhocko@kernel.org, paulmck@kernel.org, shuah@kernel.org, adobriyan@gmail.com, josef@toxicpanda.com, yebin10@huawei.com, linux@weissschuh.net, willy@infradead.org, osalvador@suse.de, andrii@kernel.org, ryan.roberts@arm.com, christophe.leroy@csgroup.eu, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-kselftest@vger.kernel.org Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspamd-Queue-Id: D08AE80012 X-Rspam-User: X-Rspamd-Server: rspam07 X-Stat-Signature: 9zbp5u5p63zifz6ihib4efriz5nbweag X-HE-Tag: 1745958798-716424 X-HE-Meta: U2FsdGVkX19RnWhaa75V9r9wRgPa+Eem4iBUCUQWEyqURYimY4Cz20AMReA6AG0lDHxxlsD3rQDoLg9oautJOMnPZ68JmN3+589pgH5pbyW31dJTQDdcwzcnNLWIH0yQA4bt0n5I8VqVn5pAwpN3ejYJINC6xwJ1zXTDcyayCw6Yg5X/ZYkqW4SdOdWygX1PiD4XK9fczakPCfTekVMDLqPwwhXhKvEmG6uZhq66S3LiD0cu0fonSl4AIq0/cIVIUZUgXWxWnxOH/hifEMACIGJ0NUDvqwX/HHGKxHdkD7ie/VCsPWcb3fbwa41a0zTwnMj5DrJXaVrJpY+l64qtRsciy+6PNmm3unU35IB1XGEQlcQxSZV9dsOGSe77NPhNIo2IvppxG/OWKRm39bvjCeYPo6mzQnCeudYk4n6UpY85y4NBcegPwvkjNjEcwxl+PMWK2NTIEtUVYdsY1AVAcWDRIhi+jp4aOBP4tyP/tdNwB5Gy4rKS46SZq4oxE9JhKwbl8ZobiNkL6WpjTnFUz5Y91cGlsX+Dt7i2neCAGIxNq0EFm/XjAtJzTCwFpbaOeJYvnoH164qns5CBje/KCgFitg2vp1+LzDqeF8ibigb5OlghO79Rp0qd69yL9j1hZoPXR9Lu93PmbI9bjzhztUQGPvKa84d+6vN7a6RvqLdtGhPqjkq2aWcTg0fYx7PHbyzBEIdtXzNPyj9cQutrRpHpA0MxSGh4AN+e5XSdS9xzEis4lSq9m3w7TwN6ReKwy/mPCg5Ag6oEqYQOV+neqyxKU3Vod2idMwHYvYfRmIdFE6E68nw0ISwHvEmSi0jbg5bmMNN8mgdVxG0vTmgcZD0w+bVrCKZoS9nXHic6U66KLRitynRgnGjrHx4XzSRTLq+FvzP8kxpf9gfHNzF4mHR9g4W6xq2ry2Ot+aoIoldb18qezlvWd/u5w94AMzkqc932l+g5WQDtRlVNOZT ftIIuckF 8mg5MZ9b0B0af0LWapsCfuXbnNvafvfk2N0v2DVkabc7z1v3EPn4r2VFTynvL0U067hld3ijJkbQ6kuN1B9QWjFrYx3zhigWm4QpZI0JxBEr+LD+871YKXf8ME7QvOdgKzeJpQgJqop0/WuXRnlfV79PwrS5gwHscUwI7BAYxEjmpAQrTrlsuQLq7WdfUDV/pvEf2YgkTbdLounTpdVafIxxsvksYwijn2/sW X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Tue, Apr 29, 2025 at 11:55=E2=80=AFAM Jann Horn wrote= : > > On Tue, Apr 29, 2025 at 8:04=E2=80=AFPM Suren Baghdasaryan wrote: > > On Tue, Apr 29, 2025 at 10:21=E2=80=AFAM Jann Horn w= rote: > > > > > > Hi! > > > > > > (I just noticed that I incorrectly assumed that VMAs use kfree_rcu > > > (not SLAB_TYPESAFE_BY_RCU) when I wrote my review of this, somehow I > > > forgot all about that...) > > > > Does this fact affect your previous comments? Just want to make sure > > I'm not missing something... > > When I suggested using "WRITE_ONCE(vma->vm_file, NULL)" when tearing > down a VMA, and using get_file_rcu() for the lockless lookup, I did > not realize that you could actually also race with all the other > places that set ->vm_file, like __mmap_new_file_vma() and so on; and I > did not think about whether any of those code paths might leave a VMA > with a dangling ->vm_file pointer. So, let me summarize my understanding and see if it's correct. If we copy the original vma, ensure that it hasn't changed while we were copying (with mmap_lock_speculate_retry()) and then use get_file_rcu(©->vm_file) I think we are guaranteed no UAF because we are in RCU read section. At this point the only issue is that copy->vm_file might have lost its last refcount and get_file_rcu() would enter an infinite loop. So, to avoid that we have to use the original vma when calling get_file_rcu() but then we should also ensure that vma itself does not change from under us due to SLAB_TYPESAFE_BY_RCU used for vm_area_struct cache. If it does change from under us we might end up accessing an invalid address if vma->vm_file is being modified concurrently. > > I guess maybe that means you really do need to do the lookup from the > copied data, as you did in your patch; and that might require calling > get_file_active() on the copied ->vm_file pointer (instead of > get_file_rcu()), even though I think that is not really how > get_file_active() is supposed to be used (it's supposed to be used > when you know the original file hasn't been freed yet). Really what > you'd want for that is basically a raw __get_file_rcu(), but that is > static and I think Christian wouldn't want to expose more of these > internals outside VFS... > (In that case, all the stuff below about get_file_rcu() would be moot.) > > Or you could pepper WRITE_ONCE() over all the places that write > ->vm_file, and ensure that ->vm_file is always NULLed before its > reference is dropped... but that seems a bit more ugly to me. Ugh, yes. We either ensure no vma->vm_file tearing or use __get_file_rcu() on a copy of the vma. Or we have to stabilize the vma by locking it... Let me think about all these options. Thanks! > > > > On Tue, Apr 29, 2025 at 7:09=E2=80=AFPM Suren Baghdasaryan wrote: > > > > On Tue, Apr 29, 2025 at 8:40=E2=80=AFAM Jann Horn wrote: > > > > > On Fri, Apr 18, 2025 at 7:50=E2=80=AFPM Suren Baghdasaryan wrote: > > > > > > With maple_tree supporting vma tree traversal under RCU and vma= and > > > > > > its important members being RCU-safe, /proc/pid/maps can be rea= d under > > > > > > RCU and without the need to read-lock mmap_lock. However vma co= ntent > > > > > > can change from under us, therefore we make a copy of the vma a= nd we > > > > > > pin pointer fields used when generating the output (currently o= nly > > > > > > vm_file and anon_name). Afterwards we check for concurrent addr= ess > > > > > > space modifications, wait for them to end and retry. While we t= ake > > > > > > the mmap_lock for reading during such contention, we do that mo= mentarily > > > > > > only to record new mm_wr_seq counter. This change is designed t= o reduce > > > > > > mmap_lock contention and prevent a process reading /proc/pid/ma= ps files > > > > > > (often a low priority task, such as monitoring/data collection = services) > > > > > > from blocking address space updates. > > > > > [...] > > > > > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c > > > > > > index b9e4fbbdf6e6..f9d50a61167c 100644 > > > > > > --- a/fs/proc/task_mmu.c > > > > > > +++ b/fs/proc/task_mmu.c > > > > > [...] > > > > > > +/* > > > > > > + * Take VMA snapshot and pin vm_file and anon_name as they are= used by > > > > > > + * show_map_vma. > > > > > > + */ > > > > > > +static int get_vma_snapshot(struct proc_maps_private *priv, st= ruct vm_area_struct *vma) > > > > > > +{ > > > > > > + struct vm_area_struct *copy =3D &priv->vma_copy; > > > > > > + int ret =3D -EAGAIN; > > > > > > + > > > > > > + memcpy(copy, vma, sizeof(*vma)); > > > > > > + if (copy->vm_file && !get_file_rcu(©->vm_file)) > > > > > > + goto out; > > > > > > > > > > I think this uses get_file_rcu() in a different way than intended= . > > > > > > > > > > As I understand it, get_file_rcu() is supposed to be called on a > > > > > pointer which always points to a file with a non-zero refcount (e= xcept > > > > > when it is NULL). That's why it takes a file** instead of a file*= - if > > > > > it observes a zero refcount, it assumes that the pointer must hav= e > > > > > been updated in the meantime, and retries. Calling get_file_rcu()= on a > > > > > pointer that points to a file with zero refcount, which I think c= an > > > > > happen with this patch, will cause an endless loop. > > > > > (Just as background: For other usecases, get_file_rcu() is suppos= ed to > > > > > still behave nicely and not spuriously return NULL when the file*= is > > > > > concurrently updated to point to another file*; that's what that = loop > > > > > is for.) > > > > > > > > Ah, I see. I wasn't aware of this subtlety. I think this is fixable= by > > > > checking the return value of get_file_rcu() and retrying speculatio= n > > > > if it changed. > > > > > > I think you could probably still end up looping endlessly in get_file= _rcu(). > > (Just to be clear: What I meant here is that get_file_rcu() loops > *internally*; get_file_rcu() is not guaranteed to ever return if the > pointed-to file has a zero refcount.) > > > By "retrying speculation" I meant it in the sense of > > get_vma_snapshot() retry when it takes the mmap_read_lock and then > > does mmap_lock_speculate_try_begin to restart speculation. I'm also > > thinking about Liam's concern of guaranteeing forward progress for the > > reader. Thinking maybe I should not drop mmap_read_lock immediately on > > contention but generate some output (one vma or one page worth of > > vmas) before dropping mmap_read_lock and proceeding with speculation. > > Hm, yeah, I guess you need that for forward progress... > > > > > > (If my understanding is correct, maybe we should document that mo= re > > > > > explicitly...) > > > > > > > > Good point. I'll add comments for get_file_rcu() as a separate patc= h. > > > > > > > > > > > > > > Also, I think you are introducing an implicit assumption that > > > > > remove_vma() does not NULL out the ->vm_file pointer (because tha= t > > > > > could cause tearing and could theoretically lead to a torn pointe= r > > > > > being accessed here). > > > > > > > > > > One alternative might be to change the paths that drop references= to > > > > > vma->vm_file (search for vma_close to find them) such that they f= irst > > > > > NULL out ->vm_file with a WRITE_ONCE() and do the fput() after th= at, > > > > > maybe with a new helper like this: > > > > > > > > > > static void vma_fput(struct vm_area_struct *vma) > > > > > { > > > > > struct file *file =3D vma->vm_file; > > > > > > > > > > if (file) { > > > > > WRITE_ONCE(vma->vm_file, NULL); > > > > > fput(file); > > > > > } > > > > > } > > > > > > > > > > Then on the lockless lookup path you could use get_file_rcu() on = the > > > > > ->vm_file pointer _of the original VMA_, and store the returned f= ile* > > > > > into copy->vm_file. > > > > > > > > Ack. Except for storing the return value of get_file_rcu(). I think > > > > once we detect that get_file_rcu() returns a different file we sho= uld > > > > bail out and retry. The change in file is an indication that the vm= a > > > > got changed from under us, so whatever we have is stale. > > > > > > What does "different file" mean here - what file* would you compare > > > the returned one against? > > > > Inside get_vma_snapshot() I would pass the original vma->vm_file to > > get_file_rcu() and check if it returns the same value. If the value > > got changed we jump to /* Address space got modified, vma might be > > stale. Re-lock and retry. */ section. That should work, right? > > Where do you get an "original vma->vm_file" from? > > To be clear, get_file_rcu(p) returns one of the values that *p had > while get_file_rcu(p) is running. ``````