From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id A4101EDE9BF for ; Tue, 10 Sep 2024 20:58:28 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 3ADF98D00C4; Tue, 10 Sep 2024 16:58:28 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 35D908D0056; Tue, 10 Sep 2024 16:58:28 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 226298D00C4; Tue, 10 Sep 2024 16:58:28 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0014.hostedemail.com [216.40.44.14]) by kanga.kvack.org (Postfix) with ESMTP id 050808D0056 for ; Tue, 10 Sep 2024 16:58:27 -0400 (EDT) Received: from smtpin10.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay06.hostedemail.com (Postfix) with ESMTP id AD71AA7EBF for ; Tue, 10 Sep 2024 20:58:27 +0000 (UTC) X-FDA: 82550041854.10.E6F472E Received: from mail-pj1-f44.google.com (mail-pj1-f44.google.com [209.85.216.44]) by imf10.hostedemail.com (Postfix) with ESMTP id D6AA9C0015 for ; Tue, 10 Sep 2024 20:58:25 +0000 (UTC) Authentication-Results: imf10.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b=L9WjmEaS; spf=pass (imf10.hostedemail.com: domain of andrii.nakryiko@gmail.com designates 209.85.216.44 as permitted sender) smtp.mailfrom=andrii.nakryiko@gmail.com; dmarc=pass (policy=none) header.from=gmail.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1726001803; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=RZZkUf2WC6mfIlygG90tFs2dYgOhX5oqJ1ZGT74jCV0=; b=ZTadfPvhM4piP0X6zAuaSFHrvA6Liy/plwuF0Dd0Fg2RsNrWPWvYXt6Q4qjIBQz5hZAjJN UW7SRJFvCZAj707eSD8uswgJM5ed4d9NMkUuzsDuXi3xi2JqrG0x/iJLibAOYH+BxcehEh FB2EL0we+YWwmwzAlV3q1eZItFL17KU= ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1726001803; a=rsa-sha256; cv=none; b=pJ+qWdXTOimF6BVg6lwvBFwZBaX8FJSTYBFgztpc6WyQRQbu7aPJ5TxBbAEY2Yw/w5z94L H7Zfq7xPs9biIYzt+JgU0iLVLPgv5cXqZah8srl6a0Rxz9zNbnYoAI6JNfypEF8g6rvCTu kjSZEh2eLrM4SM5awiRtVfB68JDDyH8= ARC-Authentication-Results: i=1; imf10.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b=L9WjmEaS; spf=pass (imf10.hostedemail.com: domain of andrii.nakryiko@gmail.com designates 209.85.216.44 as permitted sender) smtp.mailfrom=andrii.nakryiko@gmail.com; dmarc=pass (policy=none) header.from=gmail.com Received: by mail-pj1-f44.google.com with SMTP id 98e67ed59e1d1-2d8abac30ddso923097a91.0 for ; Tue, 10 Sep 2024 13:58:25 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1726001904; x=1726606704; darn=kvack.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=RZZkUf2WC6mfIlygG90tFs2dYgOhX5oqJ1ZGT74jCV0=; b=L9WjmEaSbE6jXT590zdHkyHCF97XKJxah45XLyFaFFNxUZA+AZdmEnoiC2tf3BbTD3 fbfifJyn0wPHu8qgJv27AcP27uoaXD1/o/pRiCXtoj0vBeIpkqeuBLbHCClVPO0D4InC dng/AV5Smt6TCXAUHfVD85V8QY6tuYkU1CcA25Zd2lhnkO97n33AB/63amuG1aoH16ib DhNx9qnSValqavr41aeASxcquu99jKHbo21IyjrNK3lytNn3o9aGjqGB1fRZvxwrlBSJ LcCxHJlPrFtbGJRvP5tJBNpunzjlWPFsBffgxRkm7crF1irMv0HoqqEbDSmjw+7fTWcC lOEg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1726001904; x=1726606704; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=RZZkUf2WC6mfIlygG90tFs2dYgOhX5oqJ1ZGT74jCV0=; b=wWYPmeO4FAM6roTYCsAJYb0Tste0heUc1DS6nJbrGVVNsky6h06CVxmG/PgPNrMPYG PUxuu8KRyBoZQfQpCnNsjgi28jN01muWbb4it4Yzq2hvVi4SWgp6BfkByrgf2C7t5Q4V WOL2lGeZYTfuH5BLDNSHD3Z8MEVdI32QKxulTdTYVONbv6iYnpcD1R/MhrLVTZLT1spD 0UwaLg8Yz+UHqcI+y+SSypRREM9SeP5ys49ZXJ6s+Ar1/Xh1J+YAVuyL8j/9oWXeUZQ0 WiCHKXd5g0PBgZMXCRForCHSBIW1QNtQkTpZzuSNDYU3Ts/YtCWSkr5y6uTOL1YKpszm 6lAw== X-Forwarded-Encrypted: i=1; AJvYcCV+NZo9R7pCR6I40c3Y2CvdPEckJOEOcp14gnEhVb1H3NDBxcbt2gRlvHN7Jrb7PuggB9fBxgKwvA==@kvack.org X-Gm-Message-State: AOJu0YzWowK0lWBFHKVyj0JCAIpKOJk5mdJXMHlKDKRwRZq6Nq8s2pvt gD3Gt7+vs7vwlhtQHQCvAAzs+Ln6hPyga3ljyHoASL2ncL//cmgjKkzV2lA37nriYEgMWiwfd2Q fenSyAS2gcWmSQZP+t/aWarpQEJw= X-Google-Smtp-Source: AGHT+IFlYwX3ooMzADHl3WRC56Z25rc41qODOOier4HGtUasmVGD5FRx8r8/jNu4rZMhmraa47eo2BEJ07U35fQBNJg= X-Received: by 2002:a17:90a:f004:b0:2d8:d254:6cda with SMTP id 98e67ed59e1d1-2db82fec7admr1019356a91.20.1726001904425; Tue, 10 Sep 2024 13:58:24 -0700 (PDT) MIME-Version: 1.0 References: <20240906051205.530219-1-andrii@kernel.org> <20240906051205.530219-3-andrii@kernel.org> In-Reply-To: From: Andrii Nakryiko Date: Tue, 10 Sep 2024 13:58:10 -0700 Message-ID: Subject: Re: [PATCH 2/2] uprobes: add speculative lockless VMA-to-inode-to-uprobe resolution To: Suren Baghdasaryan Cc: Christian Brauner , Jann Horn , Liam Howlett , Andrii Nakryiko , linux-trace-kernel@vger.kernel.org, peterz@infradead.org, oleg@redhat.com, rostedt@goodmis.org, mhiramat@kernel.org, bpf@vger.kernel.org, linux-kernel@vger.kernel.org, jolsa@kernel.org, paulmck@kernel.org, willy@infradead.org, akpm@linux-foundation.org, linux-mm@kvack.org, mjguzik@gmail.com, Miklos Szeredi , Amir Goldstein , linux-fsdevel Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspamd-Queue-Id: D6AA9C0015 X-Stat-Signature: afud91t3sr1j93bzjz7nqgynd7b5f4w5 X-Rspamd-Server: rspam09 X-Rspam-User: X-HE-Tag: 1726001905-116669 X-HE-Meta: U2FsdGVkX18a9fj1JPN0JG/9/PtHYSE5WFD7quf6MLO2BfVvH0OR968nlhDTrnHXOXNSg9RwB9oaBNeXe/uWPDvpX9rfTX9Ods1m2dhSBkM5vEz8UxX4+ZHq66RCShsCpoVpood9LsRlzzFClrNR5vjEgDhOPZ+1BdwtxS0/cGDnc2ovlnuXJBZaZNx2RF3nGMBk/JbAkD0bV6vcqVsicrm1h1CaEOUOTRqf/FOepfOAzuIrJ1KrVzjvmImbdrr6uHjApCGUtmE9RSvCTnjzP+yy7Hzz6vAY75Wb2AjDyA7HurzcZbTvR4tkOhA8S+6TA+VoIFaTTec0IKwIFBA/bIL/T/0Euu75twSxk65u3spgcCJvBAIzBSfQKSIzNbeoLcT1TZY95Uz7fYkjRxjdARXUrV7D15Ey60b6SkU7n8cKGo0kbaazkU46lhmjotENoiTRC2HcV+GOBrSeHEDedE19ouJ6GUgo8+CLopSQblIjq9aZ2U6fNWF07rSc8NAc/90PANkKKfTJO1jfnwAxl9gTUHoa4CWmvoAAnupWvUC2r5MtMB03briYlfNLF9eTGPMAW9t7Hy1jHIE9FCV7aMXqEI0tRPwpQUhT1d9BC+RXZYq0LeqRXwpuTwWFaPg4Ur2+GKbBRn4Mry7gaOjs/3onjEIPs+A28W/vKAgxjEvRrtKb4wED4+YTcCasZ5h6A3sgbQbXR75H9BSBVpZejugrH211J9WWIF9/BpOxWN5YoPAhZTcFYVR+S96cMzV8DgsDMi7k3k79s7xzwXnJ32SsaHT0rupV6t5MSq0G0Vndxolk5KFp7HxJwF7PV9xBGua62brzKGnnetcEHnsMvI3hxU1jWaC4u35GcroKF6D08bJ3yv4ndwCvcIrxHu/ay+qEvTfbs4maZ+UO270v1UFeyY7FbMU6/JITmUZ4VtT1K6mkewc+rJodENjxw5dYI/OMoQoLhgHm37PsrE4 yjmo8Kd2 LouMIQxABvvdjxNpFqCxh/4japAsbDmBh9ZBSAm/1xuK1W1h3fOfEMjq1MegD3dPZqCPiMhNzIkK7tqEyqfZ9OrNQwQlIDKYfgpRWcQk8Z5PM6p8CW7v+Q/7YQDg+iQEl1byqrocp+ets0ycBQVzbFxjXPfHfCoZYIxzYg97WyB8VeZATATzuXeXAqs2iuLaSND/7UO2krAB9YTaLNk/+wRO5/ipk4iPO6FetOeQUbUuVzvrtxcaahAH6eBI5nT88KeZ9C4mTINNFCxVFEqGKp6fs6nLWvyg36OSH8U0V+dRHXRKo8D5V3/0hvsbcgvRUO9L2PGGvDZKllO4K6p60V8ACh8h5C+v5PkXL0/8OYhbE1Ji/LCXmi3eawjaUJCEGfVZpq4mS6/9iNlxOYu/K9mnewCpRC7ddqQyNohlzqE/83Wk= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Tue, Sep 10, 2024 at 9:32=E2=80=AFAM Suren Baghdasaryan wrote: > > On Mon, Sep 9, 2024 at 2:29=E2=80=AFPM Andrii Nakryiko > wrote: > > > > On Mon, Sep 9, 2024 at 6:13=E2=80=AFAM Jann Horn wro= te: > > > > > > On Fri, Sep 6, 2024 at 7:12=E2=80=AFAM Andrii Nakryiko wrote: > > > > Given filp_cachep is already marked SLAB_TYPESAFE_BY_RCU, we can sa= fely > > > > access vma->vm_file->f_inode field locklessly under just rcu_read_l= ock() > > > > > > No, not every file is SLAB_TYPESAFE_BY_RCU - see for example > > > ovl_mmap(), which uses backing_file_mmap(), which does > > > vma_set_file(vma, file) where "file" comes from ovl_mmap()'s > > > "realfile", which comes from file->private_data, which is set in > > > ovl_open() to the return value of ovl_open_realfile(), which comes > > > from backing_file_open(), which allocates a file with > > > alloc_empty_backing_file(), which uses a normal kzalloc() without any > > > RCU stuff, with this comment: > > > > > > * This is only for kernel internal use, and the allocate file must n= ot be > > > * installed into file tables or such. > > > > > > And when a backing_file is freed, you can see on the path > > > __fput() -> file_free() > > > that files with FMODE_BACKING are directly freed with kfree(), no RCU= delay. > > > > Good catch on FMODE_BACKING, I didn't realize there is this exception, = thanks! > > > > I think the way forward would be to detect that the backing file is in > > FMODE_BACKING and fall back to mmap_lock-protected code path. > > > > I guess I have the question to Liam and Suren, do you think it would > > be ok to add another bool after `bool detached` in struct > > vm_area_struct (guarded by CONFIG_PER_VMA_LOCK), or should we try to > > add an extra bit into vm_flags_t? The latter would work without > > CONFIG_PER_VMA_LOCK, but I don't know what's acceptable with mm folks. > > > > This flag can be set in vma_set_file() when swapping backing file and > > wherever else vma->vm_file might be set/updated (I need to audit the > > code). > > I understand that this would work but I'm not very eager to leak > vm_file attributes like FMODE_BACKING into vm_area_struct. > Instead maybe that exception can be avoided? Treating all vm_files I agree, that would be best, of course. It seems like [1] was an optimization to avoid kfree_rcu() calls, not sure how big of a deal it is to undo that, given we do have a use case that calls for it now. Let's see what Christian thinks. > equally as RCU-safe would be a much simpler solution. I see that this > exception was introduced in [1] and I don't know if this was done for > performance reasons or something else. Christian, CCing you here to > please clarify. > > [1] https://lore.kernel.org/all/20231005-sakralbau-wappnen-f5c31755ed70@b= rauner/ > > > > > > > > > So the RCU-ness of "struct file" is an implementation detail of the > > > VFS, and you can't rely on it for ->vm_file unless you get the VFS to > > > change how backing file lifetimes work, which might slow down some > > > other workload, or you find a way to figure out whether you're dealin= g > > > with a backing file without actually accessing the file. > > > > > > > +static struct uprobe *find_active_uprobe_speculative(unsigned long= bp_vaddr) > > > > +{ > > > > + const vm_flags_t flags =3D VM_HUGETLB | VM_MAYEXEC | VM_MAY= SHARE; > > > > + struct mm_struct *mm =3D current->mm; > > > > + struct uprobe *uprobe; > > > > + struct vm_area_struct *vma; > > > > + struct file *vm_file; > > > > + struct inode *vm_inode; > > > > + unsigned long vm_pgoff, vm_start; > > > > + int seq; > > > > + loff_t offset; > > > > + > > > > + if (!mmap_lock_speculation_start(mm, &seq)) > > > > + return NULL; > > > > + > > > > + rcu_read_lock(); > > > > + > > > > + vma =3D vma_lookup(mm, bp_vaddr); > > > > + if (!vma) > > > > + goto bail; > > > > + > > > > + vm_file =3D data_race(vma->vm_file); > > > > > > A plain "data_race()" says "I'm fine with this load tearing", but > > > you're relying on this load not tearing (since you access the vm_file > > > pointer below). > > > You're also relying on the "struct file" that vma->vm_file points to > > > being populated at this point, which means you need CONSUME semantics > > > here, which READ_ONCE() will give you, and something like RELEASE > > > semantics on any pairing store that populates vma->vm_file, which > > > means they'd all have to become something like smp_store_release()). > > > > vma->vm_file should be set in VMA before it is installed and is never > > modified afterwards, isn't that the case? So maybe no extra barrier > > are needed and READ_ONCE() would be enough. > > > > > > > > You might want to instead add another recheck of the sequence count > > > (which would involve at least a read memory barrier after the > > > preceding patch is fixed) after loading the ->vm_file pointer to > > > ensure that no one was concurrently changing the ->vm_file pointer > > > before you do memory accesses through it. > > > > > > > + if (!vm_file || (vma->vm_flags & flags) !=3D VM_MAYEXEC) > > > > + goto bail; > > > > > > missing data_race() annotation on the vma->vm_flags access > > > > ack > > > > > > > > > + vm_inode =3D data_race(vm_file->f_inode); > > > > > > As noted above, this doesn't work because you can't rely on having RC= U > > > lifetime for the file. One *very* ugly hack you could do, if you thin= k > > > this code is so performance-sensitive that you're willing to do fairl= y > > > atrocious things here, would be to do a "yes I am intentionally doing > > > a UAF read and I know the address might not even be mapped at this > > > point, it's fine, trust me" pattern, where you use > > > copy_from_kernel_nofault(), kind of like in prepend_copy() in > > > fs/d_path.c, and then immediately recheck the sequence count before > > > doing *anything* with this vm_inode pointer you just loaded. > > > > > > > > > > yeah, let's leave it as a very unfortunate plan B and try to solve it > > a bit cleaner. > > > > > > > > > > > + vm_pgoff =3D data_race(vma->vm_pgoff); > > > > + vm_start =3D data_race(vma->vm_start); > > > > + > > > > + offset =3D (loff_t)(vm_pgoff << PAGE_SHIFT) + (bp_vaddr - v= m_start); > > > > + uprobe =3D find_uprobe_rcu(vm_inode, offset); > > > > + if (!uprobe) > > > > + goto bail; > > > > + > > > > + /* now double check that nothing about MM changed */ > > > > + if (!mmap_lock_speculation_end(mm, seq)) > > > > + goto bail; > > > > + > > > > + rcu_read_unlock(); > > > > + > > > > + /* happy case, we speculated successfully */ > > > > + return uprobe; > > > > +bail: > > > > + rcu_read_unlock(); > > > > + return NULL; > > > > +}