From mboxrd@z Thu Jan 1 00:00:00 1970
From: Andrii Nakryiko
Date: Thu, 15 Aug 2024 09:47:26 -0700
Subject: Re: [PATCH RFC v3 13/13] uprobes: add speculative lockless VMA to inode resolution
To: Mateusz Guzik
Cc: Suren Baghdasaryan, Andrii Nakryiko, linux-trace-kernel@vger.kernel.org,
 peterz@infradead.org, oleg@redhat.com, rostedt@goodmis.org,
 mhiramat@kernel.org, bpf@vger.kernel.org, linux-kernel@vger.kernel.org,
 jolsa@kernel.org, paulmck@kernel.org, willy@infradead.org,
 akpm@linux-foundation.org, linux-mm@kvack.org
References: <20240813042917.506057-1-andrii@kernel.org>
 <20240813042917.506057-14-andrii@kernel.org>
 <7byqni7pmnufzjj73eqee2hvpk47tzgwot32gez3lb2u5lucs2@5m7dvjrvtmv2>
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"

On Thu, Aug 15, 2024 at 6:44 AM Mateusz Guzik wrote:
>
> On Tue, Aug 13, 2024 at 08:36:03AM -0700, Suren Baghdasaryan wrote:
> > On Mon, Aug 12, 2024 at 11:18 PM Mateusz Guzik wrote:
> > >
> > > On Mon, Aug 12, 2024 at 09:29:17PM -0700, Andrii Nakryiko wrote:
> > > > Now that files_cachep is SLAB_TYPESAFE_BY_RCU, we can safely access
> > > > vma->vm_file->f_inode lockless only under rcu_read_lock() protection,
> > > > attempting uprobe look up speculatively.
> > > >
> > > > We rely on newly added mmap_lock_speculation_{start,end}() helpers to
> > > > validate that mm_struct stays intact for entire duration of this
> > > > speculation. If not, we fall back to mmap_lock-protected lookup.
> > > >
> > > > This allows to avoid contention on mmap_lock in absolutely majority of
> > > > cases, nicely improving uprobe/uretprobe scalability.
> > > >
> > >
> > > Here I have to admit to being mostly ignorant about the mm, so bear with
> > > me. :>
> > >
> > > I note the result of find_active_uprobe_speculative is immediately stale
> > > in face of modifications.
> > >
> > > The thing I'm after is that the mmap_lock_speculation business adds
> > > overhead on archs where a release fence is not a de facto nop and I
> > > don't believe the commit message justifies it. Definitely a bummer to
> > > add merely it for uprobes. If there are bigger plans concerning it
> > > that's a different story of course.
> > >
> > > With this in mind I have to ask if instead you could perhaps get away
> > > with the already present per-vma sequence counter?
> >
> > per-vma sequence counter does not implement acquire/release logic, it
> > relies on vma->vm_lock for synchronization. So if we want to use it,
> > we would have to add additional memory barriers here. This is likely
> > possible but as I mentioned before we would need to ensure the
> > pagefault path does not regress. OTOH mm->mm_lock_seq already halfway
> > there (it implements acquire/release logic), we just had to ensure
> > mmap_write_lock() increments mm->mm_lock_seq.
> >
> > So, from the release fence overhead POV I think whether we use
> > mm->mm_lock_seq or vma->vm_lock, we would still need a proper fence
> > here.
> >
>
> Per my previous e-mail I'm not particularly familiar with mm internals,
> so I'm going to handwave a little bit with my $0,03 concerning multicore
> in general and if you disagree with it that's your business. For the
> time being I have no interest in digging into any of this.
>
> Before I do, to prevent this thread from being a total waste, here are
> some remarks concerning the patch with the assumption that the core idea
> lands.
>
> From the commit message:
> > Now that files_cachep is SLAB_TYPESAFE_BY_RCU, we can safely access
> > vma->vm_file->f_inode lockless only under rcu_read_lock() protection,
> > attempting uprobe look up speculatively.
>
> Just in case I'll note a nit that this paragraph will need to be removed
> since the patch adding the flag is getting dropped.

Yep, of course, I'll update all that for the next revision (I'll wait
for non-RFC patches to land first before reposting).

>
> A non-nit which may or may not end up mattering is that the flag (which
> *is* set on the filep slab cache) makes things more difficult to
> validate. Normal RCU usage guarantees that the object itself wont be
> freed as long you follow the rules. However, the SLAB_TYPESAFE_BY_RCU
> flag weakens it significantly -- the thing at hand will always be a
> 'struct file', but it may get reallocated to *another* file from under
> you. Whether this aspect plays a role here I don't know.

Yes, that's ok and is accounted for. We care about that memory not
going away from under us (I'm not even sure if it matters that it is
still a struct file, tbh; I think that shouldn't matter as we are
prepared to deal with completely garbage values read from struct
file).
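To spell out the shape of that argument, here is a rough standalone
sketch of the "snapshot a sequence count, read possibly-recycled
values, re-validate before trusting anything derived from them"
pattern. The counter and field names below are made up purely for
illustration, this is not the kernel code; in the series the role of
the counter is played by mm->mm_lock_seq via
mmap_lock_speculation_{start,end}():

/* speculate.c: illustrative only, not the kernel implementation.
 * Build: cc -std=c11 -O2 speculate.c
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* plays the role of mm->mm_lock_seq: a writer bumps it (with release
 * semantics) around any modification, so an even, unchanged value
 * means "nothing changed while we were looking"
 */
static atomic_uint seq;

static struct {
        _Atomic long inode_key; /* stands in for the vma/file fields we read */
        _Atomic long offset;
} shared = { 1234, 4096 };

/* returns true only if the snapshot is known to be consistent; on
 * false the caller falls back to the locked path
 */
static bool speculative_lookup(long *key, long *off)
{
        unsigned start = atomic_load_explicit(&seq, memory_order_acquire);

        if (start & 1)
                return false;   /* writer active, don't even try */

        /* these loads play the role of data_race(): the values may
         * belong to a recycled object and must not be dereferenced or
         * acted upon before the validation below succeeds
         */
        *key = atomic_load_explicit(&shared.inode_key, memory_order_relaxed);
        *off = atomic_load_explicit(&shared.offset, memory_order_relaxed);

        /* order the data loads before re-reading the sequence counter */
        atomic_thread_fence(memory_order_acquire);
        return atomic_load_explicit(&seq, memory_order_relaxed) == start;
}

int main(void)
{
        long key, off;

        if (speculative_lookup(&key, &off))
                printf("speculated: key=%ld off=%ld\n", key, off);
        else
                printf("fell back to the locked path\n");
        return 0;
}

The only property that matters is that the re-check happens before
anything derived from the snapshot is dereferenced or pinned; in the
quoted function below that is what the mmap_lock_speculation_end()
call before returning provides.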
>
> > +static struct uprobe *find_active_uprobe_speculative(unsigned long bp_vaddr)
> > +{
> > +	const vm_flags_t flags = VM_HUGETLB | VM_MAYEXEC | VM_MAYSHARE;
> > +	struct mm_struct *mm = current->mm;
> > +	struct uprobe *uprobe;
> > +	struct vm_area_struct *vma;
> > +	struct file *vm_file;
> > +	struct inode *vm_inode;
> > +	unsigned long vm_pgoff, vm_start;
> > +	int seq;
> > +	loff_t offset;
> > +
> > +	if (!mmap_lock_speculation_start(mm, &seq))
> > +		return NULL;
> > +
> > +	rcu_read_lock();
> > +
>
> I don't think there is a correctness problem here, but entering rcu
> *after* deciding to speculatively do the lookup feels backwards.

RCU should protect VMA and file, mm itself won't go anywhere, so this
seems ok.

>
> > +	vma = vma_lookup(mm, bp_vaddr);
> > +	if (!vma)
> > +		goto bail;
> > +
> > +	vm_file = data_race(vma->vm_file);
> > +	if (!vm_file || (vma->vm_flags & flags) != VM_MAYEXEC)
> > +		goto bail;
> > +
>
> If vma teardown is allowed to progress and the file got fput'ed...
>
> > +	vm_inode = data_race(vm_file->f_inode);
>
> ... the inode can be NULL, I don't know if that's handled.
>

Yep, the inode pointer value is part of the RB-tree key, so if it's
NULL, we just won't find a matching uprobe. Same for any other
"garbage" f_inode value. Importantly, we should never dereference such
inode pointers, at least until we find a valid uprobe (in which case
we keep an inode reference to it).

> More importantly though, per my previous description of
> SLAB_TYPESAFE_BY_RCU, by now the file could have been reallocated and
> the inode you did find is completely unrelated.
>
> I understand the intent is to backpedal from everything should the mm
> seqc change, but the above may happen to matter.

Yes, I think we took that into account. All that we care about is
memory "type safety", i.e., even if struct file's memory is reused,
it's ok, we'll eventually detect the change and will discard the wrong
uprobe that we might have looked up by accident (though probably in
most cases we just won't find a uprobe at all).

>
> > +	vm_pgoff = data_race(vma->vm_pgoff);
> > +	vm_start = data_race(vma->vm_start);
> > +
> > +	offset = (loff_t)(vm_pgoff << PAGE_SHIFT) + (bp_vaddr - vm_start);
> > +	uprobe = find_uprobe_rcu(vm_inode, offset);
> > +	if (!uprobe)
> > +		goto bail;
> > +
> > +	/* now double check that nothing about MM changed */
> > +	if (!mmap_lock_speculation_end(mm, seq))
> > +		goto bail;
>
> This leaks the reference obtained by find_uprobe_rcu().

find_uprobe_rcu() doesn't obtain a reference, uprobe is RCU-protected,
and if the caller needs a refcount bump it will have to use
try_get_uprobe() (which might fail).
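To make the lifetime rule explicit: the lookup hands back a borrowed
pointer that is only guaranteed valid for the duration of the RCU read
section, and pinning it beyond that requires a conditional get that
refuses to resurrect a zero refcount. A minimal sketch of that pattern
(placeholder names, not the actual uprobe code):

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct probe {                  /* placeholder for struct uprobe */
        atomic_long ref;        /* 0 means the object is on its way out */
        long inode_key;
        long offset;
};

static struct probe table[1] = { { 1, 1234, 4096 } };

/* analogous to find_uprobe_rcu(): returns a borrowed pointer and takes
 * no reference, so there is nothing to release on the bail path
 */
static struct probe *probe_find_borrowed(long inode_key, long offset)
{
        if (table[0].inode_key == inode_key && table[0].offset == offset)
                return &table[0];
        return NULL;
}

/* analogous to try_get_uprobe(): pins the object only while it is
 * still live; may fail, and the caller has to cope with that
 */
static bool probe_try_get(struct probe *p)
{
        long old = atomic_load_explicit(&p->ref, memory_order_relaxed);

        do {
                if (old == 0)
                        return false;   /* already dying, never resurrect */
        } while (!atomic_compare_exchange_weak_explicit(&p->ref, &old, old + 1,
                                                        memory_order_acquire,
                                                        memory_order_relaxed));
        return true;
}

int main(void)
{
        struct probe *p = probe_find_borrowed(1234, 4096);

        if (p && probe_try_get(p)) {
                /* p may now be used past the lookup; paired with a put */
                atomic_fetch_sub_explicit(&p->ref, 1, memory_order_release);
        }
        return 0;
}

That is why the bail path in the quoted code only needs
rcu_read_unlock(): there is no reference to drop.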
>
> > +
> > +	rcu_read_unlock();
> > +
> > +	/* happy case, we speculated successfully */
> > +	return uprobe;
> > +bail:
> > +	rcu_read_unlock();
> > +	return NULL;
> > +}
>
> Now to some handwaving, here it is:
>
> The core of my concern is that adding more work to down_write on the
> mmap semaphore comes with certain side-effects and plausibly more than a
> sufficient speed up can be achieved without doing it.
>
> An mm-wide mechanism is just incredibly coarse-grained and it may happen
> to perform poorly when faced with a program which likes to mess with its
> address space -- the fast path is going to keep failing and only
> inducing *more* overhead as the code decides to down_read the mmap
> semaphore.
>
> Furthermore there may be work currently synchronized with down_write
> which perhaps can transition to "merely" down_read, but by the time it
> happens this and possibly other consumers expect a change in the
> sequence counter, messing with it.
>
> To my understanding the kernel supports parallel faults with per-vma
> locking. I would find it surprising if the same machinery could not be
> used to sort out uprobe handling above.

per-vma locking is still *locking*. Which means memory sharing between
multiple CPUs, which means limited scalability. Lots of work in this
series went into avoiding even refcounting (as I pointed out for
find_uprobe_rcu()) for exactly that reason, so relying on per-VMA
locking would just shift the bottleneck from mmap_lock to
vma->vm_lock. The worst (and not uncommon) case is the same uprobe in
the same process (and thus the same VMA) being hit on multiple CPUs at
the same time. Whether that's protected by mmap_lock or vma->vm_lock
is immaterial at that point (from a scalability standpoint).

> I presume a down_read on vma around all the work would also sort out any
> issues concerning stability of the file or inode objects.
>
> Of course single-threaded performance would take a hit due to atomic
> stemming from down/up_read and parallel uprobe lookups on the same vma
> would also get slower, but I don't know if that's a problem for a real
> workload.
>
> I would not have any comments if all speed ups were achieved without
> modifying non-uprobe code.

I'm also not an mm-focused person, so I'll let Suren and others
address mm-specific concerns, but I (hopefully) addressed all the
uprobe-related questions and concerns you had.
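P.S. If it helps to see the cache-line argument in isolation from
uprobes, here is a throwaway userspace sketch (all names made up,
nothing to do with the kernel code). Timing it with a handful of
threads in read-only mode versus atomic mode gives a rough feel for
what moving from a lockless read path to any shared lock word costs
once several CPUs hammer the same address:

/* contention.c: toy illustration only.
 * Build: cc -std=c11 -O2 -pthread contention.c
 * Run:   time ./a.out 8          (read-only mode)
 *        time ./a.out 8 atomic   (shared atomic updates)
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

#define ITERS 10000000

static atomic_long shared_word;     /* stands in for a lock/refcount word */
static long shared_ro = 42;         /* stands in for RCU-protected data */

static void *atomic_loop(void *arg)
{
        (void)arg;
        /* every iteration writes the same cache line, like down_read/up_read */
        for (long i = 0; i < ITERS; i++)
                atomic_fetch_add_explicit(&shared_word, 1, memory_order_acq_rel);
        return NULL;
}

static void *readonly_loop(void *arg)
{
        long sum = 0;

        (void)arg;
        /* purely local reads of shared data, like the lockless lookup */
        for (long i = 0; i < ITERS; i++)
                sum += *(volatile long *)&shared_ro;
        return (void *)sum;
}

int main(int argc, char **argv)
{
        int nthreads = argc > 1 ? atoi(argv[1]) : 4;
        int use_atomics = argc > 2;     /* any extra argument switches mode */
        pthread_t tid[64];

        if (nthreads < 1 || nthreads > 64)
                nthreads = 4;
        for (int i = 0; i < nthreads; i++)
                pthread_create(&tid[i], NULL,
                               use_atomics ? atomic_loop : readonly_loop, NULL);
        for (int i = 0; i < nthreads; i++)
                pthread_join(tid[i], NULL);
        printf("done: %d threads, %s mode\n", nthreads,
               use_atomics ? "atomic" : "read-only");
        return 0;
}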