From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id CE059EED631 for ; Thu, 12 Sep 2024 17:54:26 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 438C36B0085; Thu, 12 Sep 2024 13:54:26 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 3C1E16B0088; Thu, 12 Sep 2024 13:54:26 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 23C8F6B008A; Thu, 12 Sep 2024 13:54:26 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0011.hostedemail.com [216.40.44.11]) by kanga.kvack.org (Postfix) with ESMTP id 01DF26B0085 for ; Thu, 12 Sep 2024 13:54:25 -0400 (EDT) Received: from smtpin22.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay02.hostedemail.com (Postfix) with ESMTP id 494721213AB for ; Thu, 12 Sep 2024 17:54:25 +0000 (UTC) X-FDA: 82556835690.22.4D96A73 Received: from mail-pl1-f169.google.com (mail-pl1-f169.google.com [209.85.214.169]) by imf07.hostedemail.com (Postfix) with ESMTP id 6DE304001B for ; Thu, 12 Sep 2024 17:54:22 +0000 (UTC) Authentication-Results: imf07.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b=bh8UkJ7e; spf=pass (imf07.hostedemail.com: domain of andrii.nakryiko@gmail.com designates 209.85.214.169 as permitted sender) smtp.mailfrom=andrii.nakryiko@gmail.com; dmarc=pass (policy=none) header.from=gmail.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1726163522; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=pnI/woVTgkNTSrRDHq2roOu8tQNQ4D8UQtZh3Y6WuYw=; b=FlTRGjfBE1YbzC23MCrOFPQJXNsRAIlPbAp1TRG1plWSTskt4QpFfk7E5tmEwsrkhruaks UdHHdacGej4L5Ohafq+iGjvITLlFQWzu8su41Hi27iFymctv8zAl0xUtNnq96asQ4FOeB0 Sp+oOLcSBidzoxOqhBBweeK58bSm7jg= ARC-Authentication-Results: i=1; imf07.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b=bh8UkJ7e; spf=pass (imf07.hostedemail.com: domain of andrii.nakryiko@gmail.com designates 209.85.214.169 as permitted sender) smtp.mailfrom=andrii.nakryiko@gmail.com; dmarc=pass (policy=none) header.from=gmail.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1726163522; a=rsa-sha256; cv=none; b=fCviMKP2CZPw0COpsua4+h3BzrRIX9PxRsW3xM6riZxRTpsWv85SBvdnrrFpMAEawm+sGv UIkogqi+hZOyOyAEqibffChGWiF3/U8s7oT5WBbjGlVW3ClZjxAZdRwtcICNv34OzjVHND xPGpzGk///i61nOWEuY98nNFxFObVUU= Received: by mail-pl1-f169.google.com with SMTP id d9443c01a7336-2068acc8b98so12081365ad.3 for ; Thu, 12 Sep 2024 10:54:22 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1726163661; x=1726768461; darn=kvack.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=pnI/woVTgkNTSrRDHq2roOu8tQNQ4D8UQtZh3Y6WuYw=; b=bh8UkJ7elaRqhYPQsus35EMReUpBo0hQcl57rtjQ44NYNfarq2TZXKX8m9HAw9JqBs rOk3d0g5M/nst0NJTHpH7wA0E9bSDRGfSFl1lXJv4YtPKO7P+ClwRZYfqEJvRsiloO4k rmM2sBXUAoAnUiLLcXWq4KV7XTpaoXIzIRKfYlhzARzdKCKO95k/o9ud9NaKhyEi693K 6nVmCNGE6ic8DbZmlIZBMfXXvQuegtEbFQ59ERy8mRZr8Ts0z/5u2CQGsyM+DpYc8Hfh YkjQd3nSJ9rd4jZYpS06amoTEQKq50Kw8ZcuTJeD4kokOygeF3lzOed9xilmJ4E2/f2d aVoQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1726163661; x=1726768461; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=pnI/woVTgkNTSrRDHq2roOu8tQNQ4D8UQtZh3Y6WuYw=; b=qJRiSVzDKdGGSKNRlPXedQ3ewmkLpFQPlm5dUQACXUuGe+E3L8qezI+/3S6bn0Spls NflgmHvrBfRCJBpfqoKghoR1coAbg6KBIzeM0y/AD0FNY148Fm3kZXVGuZ1KC3szd4zc GcjjTsDYPciqG22YRKxLxmNyMmjf8sRs3363/cGwaM+6fVMyUzO2zK845FX+n9iaYTW/ ijDmENCxBLeLkFGwtUBpyLGQ/A3eaKYqX5GP6cS9JQIO1ZNvhos6dU62CIqSiG87qD2G HYAEYT8TVoh7iSQ7O9+tJlEaz82sloO7ShDyh+95C2r5AHHcRZ4o2vYQyTVTlRvkVTHY uCsg== X-Forwarded-Encrypted: i=1; AJvYcCWP/g3EW6c1KxD1vA1I6G9amoWz+A+ohmHOUsejy3OKkFFJpKVMuS5LPQ1XJlmPNboPWeqTVB3O8w==@kvack.org X-Gm-Message-State: AOJu0YzXD/iczBecVGuENY+CSWjpgTlIIJ6KZPquYFhHU+5c2g2Dm//j kWdKdf7UPTO9mQlT+Ti2xIcOrZ+n1ZyrAyQ87mYnVbfo5x4Vj/HoCe8fMi68BfQ+gOuA12oDRNG qkPKhSbL0wxk8GUKBWdXV6UncUzo= X-Google-Smtp-Source: AGHT+IF5Dr9dgApZdKn5lLwhFKi8T2rVtPmatqXg6u6KtfnDgDB7BVjPS6bSH+Hg14686iMVSU6nPTe/fII7VydFM+c= X-Received: by 2002:a17:902:f68e:b0:205:7829:9d83 with SMTP id d9443c01a7336-2076e412fafmr36150895ad.38.1726163660914; Thu, 12 Sep 2024 10:54:20 -0700 (PDT) MIME-Version: 1.0 References: <20240906051205.530219-1-andrii@kernel.org> <20240906051205.530219-3-andrii@kernel.org> <20240912-urenkel-umorientieren-c27ce893af09@brauner> In-Reply-To: <20240912-urenkel-umorientieren-c27ce893af09@brauner> From: Andrii Nakryiko Date: Thu, 12 Sep 2024 10:54:09 -0700 Message-ID: Subject: Re: [PATCH 2/2] uprobes: add speculative lockless VMA-to-inode-to-uprobe resolution To: Christian Brauner Cc: Suren Baghdasaryan , Jann Horn , Liam Howlett , Andrii Nakryiko , linux-trace-kernel@vger.kernel.org, peterz@infradead.org, oleg@redhat.com, rostedt@goodmis.org, mhiramat@kernel.org, bpf@vger.kernel.org, linux-kernel@vger.kernel.org, jolsa@kernel.org, paulmck@kernel.org, willy@infradead.org, akpm@linux-foundation.org, linux-mm@kvack.org, mjguzik@gmail.com, Miklos Szeredi , Amir Goldstein , linux-fsdevel Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspamd-Server: rspam06 X-Rspamd-Queue-Id: 6DE304001B X-Stat-Signature: zp9hefeie49us51fgk9mx8h8e3srd81o X-Rspam-User: X-HE-Tag: 1726163662-957203 X-HE-Meta: U2FsdGVkX1+sBuoAvxEa2g0W6GTooJUBBy0CgcQjjA9LfaU9R4FsCaUZMX6G6qvU0w4O3NEP7N3DGdoG8A3QJBsUL9OpUlosbYe9iuGsUEx3ynpksB/F5Df55dhy4soXLmkvq05juloqGsALyhbsc1FLEy4KAxfSB//wfJ+hbQrg9dBssuSBj4YNrd1eNlLaw26HDy/isZGmj2GGEs+gXSzcULIN44LBrk3WQVVWenInK7AY7dt9USINbOBQQAtLM6k4WbNf9yd8PvTWoN9odCH0adgUqgUxCm9a1tRwCLbQED4XkIMYdHixGFiAbIgggTuuxphTHG+Zxqf9ARlwFD+RPOgGO2v+BMKRETE6xeV48g0jaYgS0RIEvftHsBR/mWAHVuYeW4j4/P49nPaBfwbY4cEP7NwJIxzc//jJfB+jnPoDA3gmZ6XokiNgp6akiX52UMT//6teNvaQ7I5YjEFOHZH5gFZQBo/bGubG73gzFf4SJXL9H5flT73yGZU8ZOGRG6RJX50OWpt0Z9MlgwqmJ70RXHg0OLonU0E/Ah965ApByM9mTZGYezIr0+h1fwyyRuvXrHfuunOS3sz0V4KV8g90SQHENCYsyDN5oneRef9sXgMQsSCqrvC11PvfrYh1E0bZI36Qam4P0/B8lf6izVtdnOt0m9NBFFtNHmfqbizghQO9UKTZFbEOuU9llZ1yzSqoyXrjXbDX3jua7J8kIx3u+FyK34T5tCLU4EPeScwLytxx6IvzPP8yVqXhA42oUoKCEV5xU4UcjYZm3cuGGeS0t92FOTo6FAXPO2tEf5eYsBnAOJrU4DYUDOQlYqEuhF7agWeOTfvf30vu/0Ezh9fNeA4EvbcsLfYo0UsPFozX4RDJW2bFrJbP48JejJ76z/178GTR6lB/mIjSqZy7czyvoVmYRrceozNHM+pACmRVpXaTUAmcm4DguHjzAgVpNuwY2QReW94hNk8 UBpqf7tE UL3ERSGtDo0YxVSbVmMo8qnaLI1ChFCyYxK+RMiV8ns/i+iW5CVZJLhql/sytgu0FZ4QEVhM003vEalobxPjoNATXFfcFr7/kbFe3wTUcb736W/6jaYPMfoy0YmrVteOQVx4qUB9o6vYIyZAivljVdVKAo5kXReplDC/9zmV18G+TuSjJqQfcDCXCKs4tgtZqu+3USaHgqdOCoO5iRvLqPQVjDyh3P2NZOeySan3VKoqkl1BAE0mmh1tqpQNNfDXICaTc/U9Ou07nIPJfBE34pER4Vk852FTG6D41QMEMCrbw7Ian4uLd6uWHa1ADrKZ9iFXMWAQPcK6SxpMFOveWgVXE3wkwnHC1USdA1DCcNgBK3cDrMDrN67yJbBXk6bPeE/UvUxIH0hfGwqVK1zIMzyET9sjDAZNQ+NURuwbx/T1lrpI= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Thu, Sep 12, 2024 at 4:17=E2=80=AFAM Christian Brauner wrote: > > On Tue, Sep 10, 2024 at 01:58:10PM GMT, Andrii Nakryiko wrote: > > On Tue, Sep 10, 2024 at 9:32=E2=80=AFAM Suren Baghdasaryan wrote: > > > > > > On Mon, Sep 9, 2024 at 2:29=E2=80=AFPM Andrii Nakryiko > > > wrote: > > > > > > > > On Mon, Sep 9, 2024 at 6:13=E2=80=AFAM Jann Horn = wrote: > > > > > > > > > > On Fri, Sep 6, 2024 at 7:12=E2=80=AFAM Andrii Nakryiko wrote: > > > > > > Given filp_cachep is already marked SLAB_TYPESAFE_BY_RCU, we ca= n safely > > > > > > access vma->vm_file->f_inode field locklessly under just rcu_re= ad_lock() > > > > > > > > > > No, not every file is SLAB_TYPESAFE_BY_RCU - see for example > > > > > ovl_mmap(), which uses backing_file_mmap(), which does > > > > > vma_set_file(vma, file) where "file" comes from ovl_mmap()'s > > > > > "realfile", which comes from file->private_data, which is set in > > > > > ovl_open() to the return value of ovl_open_realfile(), which come= s > > > > > from backing_file_open(), which allocates a file with > > > > > alloc_empty_backing_file(), which uses a normal kzalloc() without= any > > > > > RCU stuff, with this comment: > > > > > > > > > > * This is only for kernel internal use, and the allocate file mu= st not be > > > > > * installed into file tables or such. > > > > > > > > > > And when a backing_file is freed, you can see on the path > > > > > __fput() -> file_free() > > > > > that files with FMODE_BACKING are directly freed with kfree(), no= RCU delay. > > > > > > > > Good catch on FMODE_BACKING, I didn't realize there is this excepti= on, thanks! > > > > > > > > I think the way forward would be to detect that the backing file is= in > > > > FMODE_BACKING and fall back to mmap_lock-protected code path. > > > > > > > > I guess I have the question to Liam and Suren, do you think it woul= d > > > > be ok to add another bool after `bool detached` in struct > > > > vm_area_struct (guarded by CONFIG_PER_VMA_LOCK), or should we try t= o > > > > add an extra bit into vm_flags_t? The latter would work without > > > > CONFIG_PER_VMA_LOCK, but I don't know what's acceptable with mm fol= ks. > > > > > > > > This flag can be set in vma_set_file() when swapping backing file a= nd > > > > wherever else vma->vm_file might be set/updated (I need to audit th= e > > > > code). > > > > > > I understand that this would work but I'm not very eager to leak > > > vm_file attributes like FMODE_BACKING into vm_area_struct. > > > Instead maybe that exception can be avoided? Treating all vm_files > > > > I agree, that would be best, of course. It seems like [1] was an > > optimization to avoid kfree_rcu() calls, not sure how big of a deal it > > is to undo that, given we do have a use case that calls for it now. > > Let's see what Christian thinks. > > Do you just mean? > > diff --git a/fs/file_table.c b/fs/file_table.c > index 7ce4d5dac080..03e58b28e539 100644 > --- a/fs/file_table.c > +++ b/fs/file_table.c > @@ -68,7 +68,7 @@ static inline void file_free(struct file *f) > put_cred(f->f_cred); > if (unlikely(f->f_mode & FMODE_BACKING)) { > path_put(backing_file_user_path(f)); > - kfree(backing_file(f)); > + kfree_rcu(backing_file(f)); > } else { > kmem_cache_free(filp_cachep, f); > } > > Then the only thing you can do with FMODE_BACKING is to skip it. I think > that should be fine since backing files right now are only used by > overlayfs and I don't think the kfree_rcu() will be a performance issue. Yes, something along those lines. Ok, great, if it's ok to add back kfree_rcu(), then I think that resolves the main problem I was running into. I'll incorporate adding back RCU-delated freeing as a separate patch into the future patch set, thanks! > > > > > > equally as RCU-safe would be a much simpler solution. I see that this > > > exception was introduced in [1] and I don't know if this was done for > > > performance reasons or something else. Christian, CCing you here to > > > please clarify. > > > > > > [1] https://lore.kernel.org/all/20231005-sakralbau-wappnen-f5c31755ed= 70@brauner/ > > > > > > > > > > > > > > > > > So the RCU-ness of "struct file" is an implementation detail of t= he > > > > > VFS, and you can't rely on it for ->vm_file unless you get the VF= S to > > > > > change how backing file lifetimes work, which might slow down som= e > > > > > other workload, or you find a way to figure out whether you're de= aling > > > > > with a backing file without actually accessing the file. > > > > > > > > > > > +static struct uprobe *find_active_uprobe_speculative(unsigned = long bp_vaddr) > > > > > > +{ > > > > > > + const vm_flags_t flags =3D VM_HUGETLB | VM_MAYEXEC | VM= _MAYSHARE; > > > > > > + struct mm_struct *mm =3D current->mm; > > > > > > + struct uprobe *uprobe; > > > > > > + struct vm_area_struct *vma; > > > > > > + struct file *vm_file; > > > > > > + struct inode *vm_inode; > > > > > > + unsigned long vm_pgoff, vm_start; > > > > > > + int seq; > > > > > > + loff_t offset; > > > > > > + > > > > > > + if (!mmap_lock_speculation_start(mm, &seq)) > > > > > > + return NULL; > > > > > > + > > > > > > + rcu_read_lock(); > > > > > > + > > > > > > + vma =3D vma_lookup(mm, bp_vaddr); > > > > > > + if (!vma) > > > > > > + goto bail; > > > > > > + > > > > > > + vm_file =3D data_race(vma->vm_file); > > > > > > > > > > A plain "data_race()" says "I'm fine with this load tearing", but > > > > > you're relying on this load not tearing (since you access the vm_= file > > > > > pointer below). > > > > > You're also relying on the "struct file" that vma->vm_file points= to > > > > > being populated at this point, which means you need CONSUME seman= tics > > > > > here, which READ_ONCE() will give you, and something like RELEASE > > > > > semantics on any pairing store that populates vma->vm_file, which > > > > > means they'd all have to become something like smp_store_release(= )). > > > > > > > > vma->vm_file should be set in VMA before it is installed and is nev= er > > > > modified afterwards, isn't that the case? So maybe no extra barrier > > > > are needed and READ_ONCE() would be enough. > > > > > > > > > > > > > > You might want to instead add another recheck of the sequence cou= nt > > > > > (which would involve at least a read memory barrier after the > > > > > preceding patch is fixed) after loading the ->vm_file pointer to > > > > > ensure that no one was concurrently changing the ->vm_file pointe= r > > > > > before you do memory accesses through it. > > > > > > > > > > > + if (!vm_file || (vma->vm_flags & flags) !=3D VM_MAYEXEC= ) > > > > > > + goto bail; > > > > > > > > > > missing data_race() annotation on the vma->vm_flags access > > > > > > > > ack > > > > > > > > > > > > > > > + vm_inode =3D data_race(vm_file->f_inode); > > > > > > > > > > As noted above, this doesn't work because you can't rely on havin= g RCU > > > > > lifetime for the file. One *very* ugly hack you could do, if you = think > > > > > this code is so performance-sensitive that you're willing to do f= airly > > > > > atrocious things here, would be to do a "yes I am intentionally d= oing > > > > > a UAF read and I know the address might not even be mapped at thi= s > > > > > point, it's fine, trust me" pattern, where you use > > > > > copy_from_kernel_nofault(), kind of like in prepend_copy() in > > > > > fs/d_path.c, and then immediately recheck the sequence count befo= re > > > > > doing *anything* with this vm_inode pointer you just loaded. > > > > > > > > > > > > > > > > > > yeah, let's leave it as a very unfortunate plan B and try to solve = it > > > > a bit cleaner. > > > > > > > > > > > > > > > > > > > + vm_pgoff =3D data_race(vma->vm_pgoff); > > > > > > + vm_start =3D data_race(vma->vm_start); > > > > > > + > > > > > > + offset =3D (loff_t)(vm_pgoff << PAGE_SHIFT) + (bp_vaddr= - vm_start); > > > > > > + uprobe =3D find_uprobe_rcu(vm_inode, offset); > > > > > > + if (!uprobe) > > > > > > + goto bail; > > > > > > + > > > > > > + /* now double check that nothing about MM changed */ > > > > > > + if (!mmap_lock_speculation_end(mm, seq)) > > > > > > + goto bail; > > > > > > + > > > > > > + rcu_read_unlock(); > > > > > > + > > > > > > + /* happy case, we speculated successfully */ > > > > > > + return uprobe; > > > > > > +bail: > > > > > > + rcu_read_unlock(); > > > > > > + return NULL; > > > > > > +}