From: Suren Baghdasaryan <surenb@google.com>
Date: Thu, 15 Aug 2024 10:45:45 -0700
Subject: Re: [PATCH RFC v3 13/13] uprobes: add speculative lockless VMA to inode resolution
To: Andrii Nakryiko
Cc: Mateusz Guzik, Andrii Nakryiko, linux-trace-kernel@vger.kernel.org,
 peterz@infradead.org, oleg@redhat.com, rostedt@goodmis.org,
 mhiramat@kernel.org, bpf@vger.kernel.org, linux-kernel@vger.kernel.org,
 jolsa@kernel.org, paulmck@kernel.org, willy@infradead.org,
 akpm@linux-foundation.org, linux-mm@kvack.org, Jann Horn
References: <20240813042917.506057-1-andrii@kernel.org>
 <20240813042917.506057-14-andrii@kernel.org>
 <7byqni7pmnufzjj73eqee2hvpk47tzgwot32gez3lb2u5lucs2@5m7dvjrvtmv2>
On Thu, Aug 15, 2024 at 9:47 AM Andrii Nakryiko wrote:
>
> On Thu, Aug 15, 2024 at 6:44 AM Mateusz Guzik wrote:
> >
> > On Tue, Aug 13, 2024 at 08:36:03AM -0700, Suren Baghdasaryan wrote:
> > > On Mon, Aug 12, 2024 at 11:18 PM Mateusz Guzik wrote:
> > > >
> > > > On Mon, Aug 12, 2024 at 09:29:17PM -0700, Andrii Nakryiko wrote:
> > > > > Now that files_cachep is SLAB_TYPESAFE_BY_RCU, we can safely access
> > > > > vma->vm_file->f_inode lockless, only under rcu_read_lock() protection,
> > > > > attempting uprobe lookup speculatively.
> > > > >
> > > > > We rely on the newly added mmap_lock_speculation_{start,end}() helpers
> > > > > to validate that the mm_struct stays intact for the entire duration of
> > > > > this speculation. If not, we fall back to the mmap_lock-protected
> > > > > lookup.
> > > > >
> > > > > This allows us to avoid contention on mmap_lock in the absolute
> > > > > majority of cases, nicely improving uprobe/uretprobe scalability.
> > > > >
> > > >
> > > > Here I have to admit to being mostly ignorant about the mm, so bear
> > > > with me. :>
> > > >
> > > > I note the result of find_active_uprobe_speculative is immediately
> > > > stale in the face of modifications.
> > > >
> > > > The thing I'm after is that the mmap_lock_speculation business adds
> > > > overhead on archs where a release fence is not a de facto nop, and I
> > > > don't believe the commit message justifies it. Definitely a bummer to
> > > > add it merely for uprobes. If there are bigger plans concerning it,
> > > > that's a different story of course.
> > > >
> > > > With this in mind I have to ask whether you could instead get away
> > > > with the already present per-vma sequence counter?
> > >
> > > The per-vma sequence counter does not implement acquire/release logic;
> > > it relies on vma->vm_lock for synchronization. So if we want to use it,
> > > we would have to add additional memory barriers here. This is likely
> > > possible, but as I mentioned before, we would need to ensure that the
> > > pagefault path does not regress. OTOH mm->mm_lock_seq is already
> > > halfway there (it implements acquire/release logic); we just had to
> > > ensure that mmap_write_lock() increments mm->mm_lock_seq.
> > >
> > > So, from the release-fence-overhead POV, I think whether we use
> > > mm->mm_lock_seq or vma->vm_lock, we would still need a proper fence
> > > here.
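
To make the mm->mm_lock_seq discussion concrete: the helpers follow the
classic seqcount reader pattern. Below is a rough sketch of what
mmap_lock_speculation_{start|end} boil down to -- an illustration, not
the exact code from the RFC; in particular, encoding "write-locked" in
the low bit of mm->mm_lock_seq is an assumption of this sketch:

static inline bool mmap_lock_speculation_start(struct mm_struct *mm, int *seq)
{
	/* Pairs with the release store done when the write lock is dropped. */
	*seq = smp_load_acquire(&mm->mm_lock_seq);
	/* Speculation is only worth starting if no writer is active. */
	return (*seq & 1) == 0;
}

static inline bool mmap_lock_speculation_end(struct mm_struct *mm, int seq)
{
	/* Order all speculative reads before the counter re-check. */
	smp_rmb();
	/* Any intervening mmap_write_lock() bumps the counter. */
	return seq == READ_ONCE(mm->mm_lock_seq);
}

With mmap_write_lock() incrementing the counter on both acquire and
release, any write lock taken during the speculation window makes the
final comparison fail, and the caller falls back to the locked path.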
> >
> > Per my previous e-mail I'm not particularly familiar with mm internals,
> > so I'm going to handwave a little bit with my $0.03 concerning multicore
> > in general, and if you disagree with it that's your business. For the
> > time being I have no interest in digging into any of this.
> >
> > Before I do, to prevent this thread from being a total waste, here are
> > some remarks concerning the patch, with the assumption that the core
> > idea lands.
> >
> > From the commit message:
> > > Now that files_cachep is SLAB_TYPESAFE_BY_RCU, we can safely access
> > > vma->vm_file->f_inode lockless, only under rcu_read_lock() protection,
> > > attempting uprobe lookup speculatively.
> >
> > Just in case I'll note a nit that this paragraph will need to be removed
> > since the patch adding the flag is getting dropped.
>
> Yep, of course, I'll update all that for the next revision (I'll wait
> for the non-RFC patches to land first before reposting).
>
> >
> > A non-nit which may or may not end up mattering is that the flag (which
> > *is* set on the filp slab cache) makes things more difficult to
> > validate. Normal RCU usage guarantees that the object itself won't be
> > freed as long as you follow the rules. However, the SLAB_TYPESAFE_BY_RCU
> > flag weakens this significantly -- the thing at hand will always be a
> > 'struct file', but it may get reallocated to *another* file from under
> > you. Whether this aspect plays a role here I don't know.
>
> Yes, that's ok and is accounted for. We care about that memory not going
> away from under us (I'm not even sure it matters that it is still a
> struct file, tbh; I think that shouldn't matter, as we are prepared to
> deal with completely garbage values read from the struct file).

Correct. With SLAB_TYPESAFE_BY_RCU we do need an additional check that
vma->vm_file has not been freed and reused; that's where
mmap_lock_speculation_{start|end} helps us. For vma->vm_file to change
from under us, one would have to take mmap_lock for write. If that
happens, mmap_lock_speculation_{start|end} will detect it and terminate
our speculation.

> >
> > > +static struct uprobe *find_active_uprobe_speculative(unsigned long bp_vaddr)
> > > +{
> > > +	const vm_flags_t flags = VM_HUGETLB | VM_MAYEXEC | VM_MAYSHARE;
> > > +	struct mm_struct *mm = current->mm;
> > > +	struct uprobe *uprobe;
> > > +	struct vm_area_struct *vma;
> > > +	struct file *vm_file;
> > > +	struct inode *vm_inode;
> > > +	unsigned long vm_pgoff, vm_start;
> > > +	int seq;
> > > +	loff_t offset;
> > > +
> > > +	if (!mmap_lock_speculation_start(mm, &seq))
> > > +		return NULL;
> > > +
> > > +	rcu_read_lock();
> > > +
> >
> > I don't think there is a correctness problem here, but entering rcu
> > *after* deciding to speculatively do the lookup feels backwards.
>
> RCU should protect the VMA and the file; mm itself won't go anywhere, so
> this seems ok.
>
> >
> > > +	vma = vma_lookup(mm, bp_vaddr);
> > > +	if (!vma)
> > > +		goto bail;
> > > +
> > > +	vm_file = data_race(vma->vm_file);
> > > +	if (!vm_file || (vma->vm_flags & flags) != VM_MAYEXEC)
> > > +		goto bail;
> > > +
> >
> > If vma teardown is allowed to progress and the file got fput'ed...
> >
> > > +	vm_inode = data_race(vm_file->f_inode);
> >
> > ... the inode can be NULL, I don't know if that's handled.
>
> Yep. The inode pointer value is part of the RB-tree key, so if it's
> NULL, we just won't find a matching uprobe. Same for any other "garbage"
> f_inode value. Importantly, we never dereference such inode pointers, at
> least not until we find a valid uprobe (in which case we keep an inode
> reference to it).
>
> > More importantly though, per my previous description of
> > SLAB_TYPESAFE_BY_RCU, by now the file could have been reallocated and
> > the inode you did find is completely unrelated.
> >
> > I understand the intent is to backpedal from everything should the mm
> > sequence counter change, but the above may happen to matter.
>
> Yes, I think we took that into account. All that we care about is memory
> "type safety", i.e., even if struct file's memory is reused, it's ok:
> we'll eventually detect the change and discard the wrong uprobe that we
> might have looked up by accident (though probably in most cases we just
> won't find a uprobe at all).
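
To spell out the pattern being relied on here: with SLAB_TYPESAFE_BY_RCU,
rcu_read_lock() only guarantees that the memory stays allocated as a
struct file; it does not guarantee it is still *our* file. So every field
read is treated as untrusted until the final sequence check.
Schematically -- this is an illustration of the pattern with made-up
helper names (lookup_by_key, validate_speculation), not code from the
series:

rcu_read_lock();
file = READ_ONCE(vma->vm_file);     /* may be a recycled struct file */
key = data_race(file->f_inode);     /* possibly garbage; used only as a
                                       lookup key, never dereferenced */
res = lookup_by_key(key, off);      /* hypothetical key-based lookup */
if (!validate_speculation(mm, seq)) /* hypothetical re-check; any mm
                                       change discards everything */
	res = NULL;
rcu_read_unlock();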
>
> >
> > > +	vm_pgoff = data_race(vma->vm_pgoff);
> > > +	vm_start = data_race(vma->vm_start);
> > > +
> > > +	offset = (loff_t)(vm_pgoff << PAGE_SHIFT) + (bp_vaddr - vm_start);
> > > +	uprobe = find_uprobe_rcu(vm_inode, offset);
> > > +	if (!uprobe)
> > > +		goto bail;
> > > +
> > > +	/* now double check that nothing about MM changed */
> > > +	if (!mmap_lock_speculation_end(mm, seq))
> > > +		goto bail;
> >
> > This leaks the reference obtained by find_uprobe_rcu().
>
> find_uprobe_rcu() doesn't obtain a reference; the uprobe is
> RCU-protected, and if the caller needs a refcount bump it will have to
> use try_get_uprobe() (which might fail).
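
For completeness, a caller that needs the uprobe to outlive the RCU
section would follow roughly this pattern (illustrative, not a verbatim
call site from the series):

rcu_read_lock();
uprobe = find_uprobe_rcu(inode, offset); /* no reference is taken */
if (uprobe && !try_get_uprobe(uprobe))   /* can fail for a dying uprobe */
	uprobe = NULL;                   /* treat as not found */
rcu_read_unlock();
/* a non-NULL uprobe here is pinned by a real reference, not just RCU */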
>
> >
> > > +
> > > +	rcu_read_unlock();
> > > +
> > > +	/* happy case, we speculated successfully */
> > > +	return uprobe;
> > > +bail:
> > > +	rcu_read_unlock();
> > > +	return NULL;
> > > +}
> >
> > Now to some handwaving, here it is:
> >
> > The core of my concern is that adding more work to down_write on the
> > mmap semaphore comes with certain side-effects, and plausibly more than
> > a sufficient speed-up can be achieved without doing it.

AFAIK writers of mmap_lock are not considered a fast path. In a sense,
yes, we made every writer a bit heavier, but OTOH we also made
mm->mm_lock_seq a proper sequence count, which allows us to locklessly
check whether mmap_lock is write-locked. I think you asked whether there
will be other uses for mmap_lock_speculation_{start|end}, and yes: for
example, I am planning to use them for printing /proc/{pid}/maps without
taking mmap_lock (when it's uncontended). That said, if we get VMA seq
counter-based detection, it would be better (see below).

> >
> > An mm-wide mechanism is just incredibly coarse-grained and it may
> > happen to perform poorly when faced with a program which likes to mess
> > with its address space -- the fast path is going to keep failing and
> > only inducing *more* overhead as the code decides to down_read the mmap
> > semaphore.
> >
> > Furthermore, there may be work currently synchronized with down_write
> > which perhaps can transition to "merely" down_read, but by the time
> > that happens, this and possibly other consumers will expect a change in
> > the sequence counter, messing with it.
> >
> > To my understanding the kernel supports parallel faults with per-vma
> > locking. I would find it surprising if the same machinery could not be
> > used to sort out uprobe handling above.

From all the above, my understanding of your objection is that checking
mmap_lock during our speculation is too coarse-grained and you would
prefer to use the per-VMA seq counter to check that the VMA we are
working on is unchanged. I agree, that would be ideal. I had a quick chat
with Jann about this, and the conclusion we came to is that we would need
to add an additional smp_wmb() barrier inside vma_start_write() and an
smp_rmb() in the speculation code:

static inline void vma_start_write(struct vm_area_struct *vma)
{
	int mm_lock_seq;

	if (__is_vma_write_locked(vma, &mm_lock_seq))
		return;

	down_write(&vma->vm_lock->lock);
	/*
	 * We should use WRITE_ONCE() here because we can have concurrent reads
	 * from the early lockless pessimistic check in vma_start_read().
	 * We don't really care about the correctness of that early check, but
	 * we should use WRITE_ONCE() for cleanliness and to keep KCSAN happy.
	 */
	WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
+	smp_wmb();
	up_write(&vma->vm_lock->lock);
}

Note: up_write(&vma->vm_lock->lock) in vma_start_write() is not enough
because it is only one-way permeable (it's a "RELEASE operation"), so a
later vma->vm_file store (or any other VMA modification) can move up
before our vma->vm_lock_seq store.

This makes vma_start_write() heavier, but again, it's write-locking, so
it should not be considered a fast path.

With this change we can use the code suggested by Andrii in
https://lore.kernel.org/all/CAEf4BzZeLg0WsYw2M7KFy0+APrPaPVBY7FbawB9vjcA2+6k69Q@mail.gmail.com/
with an additional smp_rmb():

rcu_read_lock()
vma = find_vma(...)
if (!vma) /* bail */

vm_lock_seq = smp_load_acquire(&vma->vm_lock_seq);
mm_lock_seq = smp_load_acquire(&vma->mm->mm_lock_seq);
/* I think vm_lock has to be acquired first to avoid the race */
if (mm_lock_seq == vm_lock_seq)
        /* bail, vma is write-locked */

... perform uprobe lookup logic based on vma->vm_file->f_inode ...

smp_rmb();
if (vma->vm_lock_seq != vm_lock_seq)
        /* bail, VMA might have changed */

The smp_rmb() is needed so that the vma->vm_lock_seq load does not get
reordered and moved up before the speculation section.

I'm CC'ing Jann since he understands memory barriers way better than me
and will keep me honest.

>
> per-vma locking is still *locking*, which means memory sharing between
> multiple CPUs, which means limited scalability. Lots of work in this
> series went into avoiding even refcounting (as I pointed out for
> find_uprobe_rcu()) for exactly that reason, so relying on per-VMA
> locking would just shift the bottleneck from mmap_lock to vma->vm_lock.
> The worst (and not uncommon) case is the same uprobe in the same process
> (and thus VMA) being hit on multiple CPUs at the same time. Whether
> that's protected by mmap_lock or vma->vm_lock is immaterial at that
> point (from a scalability standpoint).
>
> >
> > I presume a down_read on the vma around all the work would also sort
> > out any issues concerning the stability of the file or inode objects.
> >
> > Of course single-threaded performance would take a hit due to the
> > atomics stemming from down/up_read, and parallel uprobe lookups on the
> > same vma would also get slower, but I don't know if that's a problem
> > for a real workload.
> >
> > I would not have any comments if all the speed-ups were achieved
> > without modifying non-uprobe code.
>
> I'm also not an mm-focused person, so I'll let Suren and others address
> the mm-specific concerns, but I have (hopefully) addressed all the
> uprobe-related questions and concerns you had.