From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 6A1E7D41D44 for ; Tue, 12 Nov 2024 01:05:09 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id DD7558D0014; Mon, 11 Nov 2024 20:05:08 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id D88848D0001; Mon, 11 Nov 2024 20:05:08 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id BDECB8D0014; Mon, 11 Nov 2024 20:05:08 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0010.hostedemail.com [216.40.44.10]) by kanga.kvack.org (Postfix) with ESMTP id 9D0F58D0001 for ; Mon, 11 Nov 2024 20:05:08 -0500 (EST) Received: from smtpin13.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay07.hostedemail.com (Postfix) with ESMTP id 4EFA5161B4D for ; Tue, 12 Nov 2024 01:05:08 +0000 (UTC) X-FDA: 82775646996.13.0C3C743 Received: from mail-qt1-f176.google.com (mail-qt1-f176.google.com [209.85.160.176]) by imf01.hostedemail.com (Postfix) with ESMTP id 7FE1140013 for ; Tue, 12 Nov 2024 01:04:34 +0000 (UTC) Authentication-Results: imf01.hostedemail.com; dkim=pass header.d=google.com header.s=20230601 header.b=fIHb0Twc; dmarc=pass (policy=reject) header.from=google.com; spf=pass (imf01.hostedemail.com: domain of surenb@google.com designates 209.85.160.176 as permitted sender) smtp.mailfrom=surenb@google.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1731373444; a=rsa-sha256; cv=none; b=s2kweSW4zoPr4nqALrW0UyXMOWd2ZP8eS2Cgwl37xKeOIBSbz0rLDoCtcVv8jkkKXfbzFW NvnTyAblN/7HfKYvrvSK53CH/f4Ur1gyFBbunQISU/okEHZiUlOQ/H7YbjxUSZUpR5s50W JNA9Ds6EDKuce+sM8akbxC8TLwP0HPw= ARC-Authentication-Results: i=1; imf01.hostedemail.com; dkim=pass header.d=google.com header.s=20230601 header.b=fIHb0Twc; dmarc=pass (policy=reject) header.from=google.com; spf=pass (imf01.hostedemail.com: domain of surenb@google.com designates 209.85.160.176 as permitted sender) smtp.mailfrom=surenb@google.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1731373444; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=zkdTVDim1ZzBmzlpVVSDkcbtyV/TYvAbmr4nsEJFTmI=; b=xucita1Ew1K79BEaZaPTi9Urhm0p9E0XfH3FF62+tJx6nes+ASLxO6pxGhgPH6iDSz2pLG goK6+UGOp3pFgIfuscmg1Sv02I1jNi5nrIux6GLLhhC17/G9PCfgsw7ZNOUjnQXT+/IP6i PFP6tKzPgBRccHT61xn7lr6bfWT4k2M= Received: by mail-qt1-f176.google.com with SMTP id d75a77b69052e-460969c49f2so119591cf.0 for ; Mon, 11 Nov 2024 17:05:06 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1731373505; x=1731978305; darn=kvack.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=zkdTVDim1ZzBmzlpVVSDkcbtyV/TYvAbmr4nsEJFTmI=; b=fIHb0TwcUEtfLkyy3YlZdellrjGMeeoAcUAxs3L2F4kB4qlnzTpqqu+Plrb2R7m33s WL1c96Jo0fCPUo7PF3b1nCEZ7lmFdrrJEsphmk+pygRMTl09W72CYvFEbOJa8csOi7Ck fXw19UBFFt8KZCSqRLG6W3v9mW4lSIDsuqEDCA5Z8Eb+tNkpSfFIhCwIPV0mN8DTEc5O NRhiybc48F6UunnXmsutBOnTji8fNPGx+EyDb8KQWzfjbHqGxeC4yBX4iHzLLjkh4rvE qX2ew/Lr5VhnM/DXl6X/HQh+QrIDUUzSjzDF21w7ym5b8iTVW/2jnnkLW0RZK2/15Hiy 0iQg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1731373505; x=1731978305; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=zkdTVDim1ZzBmzlpVVSDkcbtyV/TYvAbmr4nsEJFTmI=; b=WrD/yvE4QJYjHxfrH4f+ETckIkE6KuFAoRi6/cDgMYe/G0H6x4eFOPvzpMKZhiW+lE U/fYEtq/tKRbRrdm0MeyfnhifdZaRSNYcCcbV2qRDfGhWDoJy0849VZrXADpuVk7a8Zq 12/FSfg2GqMq70zV/wdBs05RVQiMwHBW66ypJTrO2apHb1KC2B3cdL9v2enAWBSF++Pa 2nCYMBAoo2DyVOYnIiMpj9LcqfsBOK1rKlLV/b4Bn8X80mZCyMysJG2P+i80AdVSUrrs StvxZBqivY60xXzyQrj5dbE9gwMvwEVx5ZXBUzxazyKl2cMrXtOJr4gHgbU4R8tqibnb Ru/Q== X-Forwarded-Encrypted: i=1; AJvYcCUrJFkWyRO0MzbyUdDRPtkrYwCnFHnOOQu1K42nevuMnN+XzVbKd8RmZRy8vPJL4fNyjBCaozYK3w==@kvack.org X-Gm-Message-State: AOJu0YzqvVhRV/1GipaudJtBDGS228YGPWcMIhVeDFEbCgvSTHpJeq70 3G6jIu0Hlfq5Z0gZR2Ft8pd8RBNRDa6wIZh5+PTEo9uiwqISd53mt57T7vl0XpAgS2Yc7S8ydji 4x6sSZsqk59xYyk2Gax920P5VVUKat5Fbv8B8 X-Gm-Gg: ASbGncsuEPOz/NVz4SVY9iAPGX5Ed0X5T2wdI0UQgntlxXqIUTbs//+YFBkn/N2cbV1 uiwo9kY4Iv9KR4vcKxVB0+cYe+LKyLyg= X-Google-Smtp-Source: AGHT+IFa81CijRE1MeCWud//eJQAC53OwMrr0PPUNOh3z+iinXXS0t7yiFi2lRvjks5Sh8V99+c/YY6jyEmTZnx2kQE= X-Received: by 2002:a05:622a:2c3:b0:461:67d8:1f50 with SMTP id d75a77b69052e-4633ef60960mr1574411cf.4.1731373505181; Mon, 11 Nov 2024 17:05:05 -0800 (PST) MIME-Version: 1.0 References: <20241028010818.2487581-1-andrii@kernel.org> <20241028010818.2487581-5-andrii@kernel.org> <20241112092816.cf5b0aa1ef10f50ce872892f@kernel.org> In-Reply-To: <20241112092816.cf5b0aa1ef10f50ce872892f@kernel.org> From: Suren Baghdasaryan Date: Mon, 11 Nov 2024 17:04:54 -0800 Message-ID: Subject: Re: [PATCH v4 tip/perf/core 4/4] uprobes: add speculative lockless VMA-to-inode-to-uprobe resolution To: Masami Hiramatsu Cc: Andrii Nakryiko , linux-trace-kernel@vger.kernel.org, linux-mm@kvack.org, akpm@linux-foundation.org, peterz@infradead.org, oleg@redhat.com, rostedt@goodmis.org, bpf@vger.kernel.org, linux-kernel@vger.kernel.org, jolsa@kernel.org, paulmck@kernel.org, willy@infradead.org, mjguzik@gmail.com, brauner@kernel.org, jannh@google.com, mhocko@kernel.org, vbabka@suse.cz, shakeel.butt@linux.dev, hannes@cmpxchg.org, Liam.Howlett@oracle.com, lorenzo.stoakes@oracle.com, david@redhat.com, arnd@arndb.de, richard.weiyang@gmail.com, zhangpeng.00@bytedance.com, linmiaohe@huawei.com, viro@zeniv.linux.org.uk, hca@linux.ibm.com Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspam-User: X-Rspamd-Queue-Id: 7FE1140013 X-Rspamd-Server: rspam11 X-Stat-Signature: n6p6grboms7gdob878ufxui1njmxb3fn X-HE-Tag: 1731373474-436339 X-HE-Meta: U2FsdGVkX19WgegN9pmmr3eIz9VXa8gFrXoTb5JoiEiVD0rg6TRS2D3KpzAHkX/rm4F56ucjF7+6b9VVC5F0tDPxGUtF8Wn8cvNcIjBL1PPZ5F8glLRlQ+z4891YPR7s0HXkBWteTsXAGVWNokRNclFsR6ZHrmKcteOcDXLkXe7OyqH8IdOEjygWwf6r0gPcSx9ZmrJRj0K45vfsNGmssn5pltWMaTYJTk70TmCjHBdZfbIjTkmyA7mVjGUwsTvEGTS+mkU552XoUADGj5wBh18Xn5OyTgOkx7ZZn8AvXVHlmW/sClg1k3vbgEELWNA71ZiloHu5muo6DLnMB5FXzAo/J1N41wRlpibzV6WHaSPf73nlLmr4Lkg5/IAVHihWoYWxpHpTccyA3GJDStVqo1aW8zpHyfF+xPOXjoFakhmtyw+e5BEbaXgz8cBytUv2KrEjg0gdmSssHbEbVcrVD4nVeLHHGClTzzFVVVK1APUC+3C9zql5otxwLyBFSnnwLnF2aukhwWYBy4J6CiadnVwk0Tm8B2kMG3cJGSARLmNsk+kYJBLhCNWiDufxqPlGewcmJetDEFSktyRi4IWATsSq6ev9EI7f3w/mCT5qUQqYzUs8nq6aPNw/KAPrpmNOtSL9nXZy1SNdPaxmZ/z6L400xN6AIY7XFDwtszulHg/fcMMeMAReFWdXKsuRJE6P/JMR9ffYtumXdYG4kV+EnZ2ar2iEtIOGvu3qIoQXCLpFILqFhPDreu0xBgrWLczZj0BUy5Ggy4A3KR2AGl2ZMJfI8+E/MXvkebyIUWzpfZ0EbYiV4GTMpE0gCW0WUeW49NMz674BM43UJotvgsAo1pzi4a0zCGtBVjd9fENNkyI6L2vCvLgRUINiLuajDe8AkSqm8Ug934lDyri0lJ4xjZJoZjvdZJS0B9JIUvDQgsgMqTMgzhkyPTIR99YW7KZeKr0KW6RhjgKUsnXKPG8 RuUJSXLk /stet2WYddtwPWJV24jt9ed+XNn8esba9C52BoDgtHhAtCHIM4yykRSLFNt+D/4GslH1dxyNOFFkMIbMVBu7KYhrC0YqRMN2e6BmRsSIcOkMcoyoQOP4f4jNfG+2MX5AMe34wX0vpD0v5Jj3elhh/URkNslmQShq/E9EnUAQYvHFckY3490xYUAUW/2GNuEc+BxSM0tsYUrrwpou1TtAis/uVo6VN+s8bK7MIu0Irmh5ycR3Wp5NnoPYRdbug7IlRX4M2Zhs99Tkd8Pcvoqx0lyXccq0dT4NSrfpQ6yYaH89Fa79C9an1PHMVIeFUhIehXqSKnqqtHTEEyOngwiaadhZpGov1cF+ctnfp3rpHpIkkL4hePrEao/qJbg== X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Mon, Nov 11, 2024 at 4:28=E2=80=AFPM Masami Hiramatsu wrote: > > On Sun, 27 Oct 2024 18:08:18 -0700 > Andrii Nakryiko wrote: > > > Given filp_cachep is marked SLAB_TYPESAFE_BY_RCU (and FMODE_BACKING > > files, a special case, now goes through RCU-delated freeing), we can > > safely access vma->vm_file->f_inode field locklessly under just > > rcu_read_lock() protection, which enables looking up uprobe from > > uprobes_tree completely locklessly and speculatively without the need t= o > > acquire mmap_lock for reads. In most cases, anyway, assuming that there > > are no parallel mm and/or VMA modifications. The underlying struct > > file's memory won't go away from under us (even if struct file can be > > reused in the meantime). > > > > We rely on newly added mmap_lock_speculation_{begin,end}() helpers to > > validate that mm_struct stays intact for entire duration of this > > speculation. If not, we fall back to mmap_lock-protected lookup. > > The speculative logic is written in such a way that it will safely > > handle any garbage values that might be read from vma or file structs. > > > > Benchmarking results speak for themselves. > > > > BEFORE (latest tip/perf/core) > > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D > > uprobe-nop ( 1 cpus): 3.384 =C2=B1 0.004M/s ( 3.384M/s/= cpu) > > uprobe-nop ( 2 cpus): 5.456 =C2=B1 0.005M/s ( 2.728M/s/= cpu) > > uprobe-nop ( 3 cpus): 7.863 =C2=B1 0.015M/s ( 2.621M/s/= cpu) > > uprobe-nop ( 4 cpus): 9.442 =C2=B1 0.008M/s ( 2.360M/s/= cpu) > > uprobe-nop ( 5 cpus): 11.036 =C2=B1 0.013M/s ( 2.207M/s/= cpu) > > uprobe-nop ( 6 cpus): 10.884 =C2=B1 0.019M/s ( 1.814M/s/= cpu) > > uprobe-nop ( 7 cpus): 7.897 =C2=B1 0.145M/s ( 1.128M/s/= cpu) > > uprobe-nop ( 8 cpus): 10.021 =C2=B1 0.128M/s ( 1.253M/s/= cpu) > > uprobe-nop (10 cpus): 9.932 =C2=B1 0.170M/s ( 0.993M/s/= cpu) > > uprobe-nop (12 cpus): 8.369 =C2=B1 0.056M/s ( 0.697M/s/= cpu) > > uprobe-nop (14 cpus): 8.678 =C2=B1 0.017M/s ( 0.620M/s/= cpu) > > uprobe-nop (16 cpus): 7.392 =C2=B1 0.003M/s ( 0.462M/s/= cpu) > > uprobe-nop (24 cpus): 5.326 =C2=B1 0.178M/s ( 0.222M/s/= cpu) > > uprobe-nop (32 cpus): 5.426 =C2=B1 0.059M/s ( 0.170M/s/= cpu) > > uprobe-nop (40 cpus): 5.262 =C2=B1 0.070M/s ( 0.132M/s/= cpu) > > uprobe-nop (48 cpus): 6.121 =C2=B1 0.010M/s ( 0.128M/s/= cpu) > > uprobe-nop (56 cpus): 6.252 =C2=B1 0.035M/s ( 0.112M/s/= cpu) > > uprobe-nop (64 cpus): 7.644 =C2=B1 0.023M/s ( 0.119M/s/= cpu) > > uprobe-nop (72 cpus): 7.781 =C2=B1 0.001M/s ( 0.108M/s/= cpu) > > uprobe-nop (80 cpus): 8.992 =C2=B1 0.048M/s ( 0.112M/s/= cpu) > > > > AFTER > > =3D=3D=3D=3D=3D > > uprobe-nop ( 1 cpus): 3.534 =C2=B1 0.033M/s ( 3.534M/s/= cpu) > > uprobe-nop ( 2 cpus): 6.701 =C2=B1 0.007M/s ( 3.351M/s/= cpu) > > uprobe-nop ( 3 cpus): 10.031 =C2=B1 0.007M/s ( 3.344M/s/= cpu) > > uprobe-nop ( 4 cpus): 13.003 =C2=B1 0.012M/s ( 3.251M/s/= cpu) > > uprobe-nop ( 5 cpus): 16.274 =C2=B1 0.006M/s ( 3.255M/s/= cpu) > > uprobe-nop ( 6 cpus): 19.563 =C2=B1 0.024M/s ( 3.261M/s/= cpu) > > uprobe-nop ( 7 cpus): 22.696 =C2=B1 0.054M/s ( 3.242M/s/= cpu) > > uprobe-nop ( 8 cpus): 24.534 =C2=B1 0.010M/s ( 3.067M/s/= cpu) > > uprobe-nop (10 cpus): 30.475 =C2=B1 0.117M/s ( 3.047M/s/= cpu) > > uprobe-nop (12 cpus): 33.371 =C2=B1 0.017M/s ( 2.781M/s/= cpu) > > uprobe-nop (14 cpus): 38.864 =C2=B1 0.004M/s ( 2.776M/s/= cpu) > > uprobe-nop (16 cpus): 41.476 =C2=B1 0.020M/s ( 2.592M/s/= cpu) > > uprobe-nop (24 cpus): 64.696 =C2=B1 0.021M/s ( 2.696M/s/= cpu) > > uprobe-nop (32 cpus): 85.054 =C2=B1 0.027M/s ( 2.658M/s/= cpu) > > uprobe-nop (40 cpus): 101.979 =C2=B1 0.032M/s ( 2.549M/s/= cpu) > > uprobe-nop (48 cpus): 110.518 =C2=B1 0.056M/s ( 2.302M/s/= cpu) > > uprobe-nop (56 cpus): 117.737 =C2=B1 0.020M/s ( 2.102M/s/= cpu) > > uprobe-nop (64 cpus): 124.613 =C2=B1 0.079M/s ( 1.947M/s/= cpu) > > uprobe-nop (72 cpus): 133.239 =C2=B1 0.032M/s ( 1.851M/s/= cpu) > > uprobe-nop (80 cpus): 142.037 =C2=B1 0.138M/s ( 1.775M/s/= cpu) > > > > Previously total throughput was maxing out at 11mln/s, and gradually > > declining past 8 cores. With this change, it now keeps growing with eac= h > > added CPU, reaching 142mln/s at 80 CPUs (this was measured on a 80-core > > Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz). > > > > Looks good to me, except one question below. > > > Reviewed-by: Oleg Nesterov > > Suggested-by: Matthew Wilcox > > Suggested-by: Peter Zijlstra > > Signed-off-by: Andrii Nakryiko > > --- > > kernel/events/uprobes.c | 45 +++++++++++++++++++++++++++++++++++++++++ > > 1 file changed, 45 insertions(+) > > > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c > > index 290c445768fa..efcd62f7051d 100644 > > --- a/kernel/events/uprobes.c > > +++ b/kernel/events/uprobes.c > > @@ -2074,6 +2074,47 @@ static int is_trap_at_addr(struct mm_struct *mm,= unsigned long vaddr) > > return is_trap_insn(&opcode); > > } > > > > +static struct uprobe *find_active_uprobe_speculative(unsigned long bp_= vaddr) > > +{ > > + struct mm_struct *mm =3D current->mm; > > + struct uprobe *uprobe =3D NULL; > > + struct vm_area_struct *vma; > > + struct file *vm_file; > > + loff_t offset; > > + unsigned int seq; > > + > > + guard(rcu)(); > > + > > + if (!mmap_lock_speculation_begin(mm, &seq)) > > + return NULL; > > + > > + vma =3D vma_lookup(mm, bp_vaddr); > > + if (!vma) > > + return NULL; > > + > > + /* > > + * vm_file memory can be reused for another instance of struct fi= le, > > + * but can't be freed from under us, so it's safe to read fields = from > > + * it, even if the values are some garbage values; ultimately > > + * find_uprobe_rcu() + mmap_lock_speculation_end() check will ens= ure > > + * that whatever we speculatively found is correct > > If vm_file is a garbage value, may `vm_file->f_inode` access be dangerous= ? > > > + */ > > + vm_file =3D READ_ONCE(vma->vm_file); > > + if (!vm_file) > > + return NULL; > > + > > + offset =3D (loff_t)(vma->vm_pgoff << PAGE_SHIFT) + (bp_vaddr - vm= a->vm_start); > > + uprobe =3D find_uprobe_rcu(vm_file->f_inode, offset); > ^^^^ Here > > if it only stores vm_file or NULL, there's no problem. IIRC correctly, vma->vm_file is RCU-safe and we are in the read RCU section, so it should not contain a garbage value. > > Thank you, > > > + if (!uprobe) > > + return NULL; > > + > > + /* now double check that nothing about MM changed */ > > + if (!mmap_lock_speculation_end(mm, seq)) > > + return NULL; > > + > > + return uprobe; > > +} > > + > > /* assumes being inside RCU protected region */ > > static struct uprobe *find_active_uprobe_rcu(unsigned long bp_vaddr, i= nt *is_swbp) > > { > > @@ -2081,6 +2122,10 @@ static struct uprobe *find_active_uprobe_rcu(uns= igned long bp_vaddr, int *is_swb > > struct uprobe *uprobe =3D NULL; > > struct vm_area_struct *vma; > > > > + uprobe =3D find_active_uprobe_speculative(bp_vaddr); > > + if (uprobe) > > + return uprobe; > > + > > mmap_read_lock(mm); > > vma =3D vma_lookup(mm, bp_vaddr); > > if (vma) { > > -- > > 2.43.5 > > > > > -- > Masami Hiramatsu (Google)