From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 43DA3C10F1A for ; Tue, 7 May 2024 18:52:46 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id CF4066B0088; Tue, 7 May 2024 14:52:45 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id CA3B46B0092; Tue, 7 May 2024 14:52:45 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id B93016B0099; Tue, 7 May 2024 14:52:45 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0013.hostedemail.com [216.40.44.13]) by kanga.kvack.org (Postfix) with ESMTP id 9D4956B0088 for ; Tue, 7 May 2024 14:52:45 -0400 (EDT) Received: from smtpin22.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay09.hostedemail.com (Postfix) with ESMTP id 445BE80A32 for ; Tue, 7 May 2024 18:52:45 +0000 (UTC) X-FDA: 82092496290.22.DF9CF19 Received: from mail-pl1-f176.google.com (mail-pl1-f176.google.com [209.85.214.176]) by imf20.hostedemail.com (Postfix) with ESMTP id 7AD7D1C0011 for ; Tue, 7 May 2024 18:52:43 +0000 (UTC) Authentication-Results: imf20.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b=fLfgwMjV; spf=pass (imf20.hostedemail.com: domain of andrii.nakryiko@gmail.com designates 209.85.214.176 as permitted sender) smtp.mailfrom=andrii.nakryiko@gmail.com; dmarc=pass (policy=none) header.from=gmail.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1715107963; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=AwLLEGYjxX+LEtBwUpGGf8LdmjjTex/ws/dicJ6KmUU=; b=xNfutJpT1j1oUCtq/pLRlNLKOHYjMUOFJ3x290VBmAYzEEjqW0lrcF58dfVg3ZKFZ19sJw GrmrdQZTA3MVaQ8J5cc2bymGhRPD25tDyt63q+089tokNEBRTscYcbFivEGWfL/5JIcnT8 EkZl0unqRkX65L2PsCN4+5gtwmf3z1I= ARC-Authentication-Results: i=1; imf20.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b=fLfgwMjV; spf=pass (imf20.hostedemail.com: domain of andrii.nakryiko@gmail.com designates 209.85.214.176 as permitted sender) smtp.mailfrom=andrii.nakryiko@gmail.com; dmarc=pass (policy=none) header.from=gmail.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1715107963; a=rsa-sha256; cv=none; b=VrRbMhMM3XVDeuOq5s+F43xSUcYgH8LiCmyFy6GuftfVCMdmdPqli4n+R4hM7ES5+x5uVn sJwqNyTBdfGhEBGYCj9bM5bVg0VmipuFbRqxMnfo0Sy9XsNwmbjtn7TPH0Fgga1Ci76E3z BO9kkGoY6mTZy6gciDl8OFgfuKhaQqI= Received: by mail-pl1-f176.google.com with SMTP id d9443c01a7336-1ecddf96313so30383665ad.2 for ; Tue, 07 May 2024 11:52:43 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1715107962; x=1715712762; darn=kvack.org; h=content-transfer-encoding:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=AwLLEGYjxX+LEtBwUpGGf8LdmjjTex/ws/dicJ6KmUU=; b=fLfgwMjVFcsfIeU4D5sh7hI1TW0wR+sAGfP+GplXIeX405KD+As1a9iJ5+lodV/rYI cO9Tfb8wSpyAImFlE5MuS0bhW3bP/r9vjdkJTfOYhFR9mMwnJfwkaPfvO94DsobLGBrx ce9BeZ1IqkFRf8xX8gk2c6mbgrltxd9vC5QrFfTB3d0UzAIItB34QREaBHaqnTWGU1tt tuDGx1sgQcv3+ygBojoXeTCOR6Lb1+2BUYxZruHTKa/y6O4NmT518IO057tHkHcr1i9a jP3CZ3MZFkN/m01gLLEnteK4cYLD52SboYI2zh/dP6eEmHyi6fC1139dV5U1DIPMjDy+ yJhQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1715107962; x=1715712762; h=content-transfer-encoding:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=AwLLEGYjxX+LEtBwUpGGf8LdmjjTex/ws/dicJ6KmUU=; b=jZId1Mzk7T55pC+6GYoi88kAYRZc72tjQFe69+vwaK/hVWLDoJYT0SjCc8/HxkWqKh 7cgZfkQcc8I62B6Dw6yNnnXiC0IMUk+rwRgG/kii1ZZ+AiU1WbmH2rpTEVgRPhicxrqi uMxaC6QbC1TXWsrX4QhIAJkLxFCe6pwtsjoryyvJqJ3dKWQT1fyklMK7hVxSG/dcBa1V kGQgqefpOFXrFFT0AVBw0PcNGUezNgzm6bH4cSu26UwjYsBZPAYf+GxFJC2bTYgShUnJ okC0FdYbZ3gXPZQ7sOl4PDUjiedtzHtH1zGEI1l3u1ZOvbWEov4/c/4UPpYAuM+rfzOa +pgg== X-Forwarded-Encrypted: i=1; AJvYcCVqp3/dvFr/AUmXuwqq9+zHOIWbN3nWsiWpr8+Jwhj7XhyPS0mhYwvsn4Fnf9QiWIVSrqQ9+omKS9UQdlK5z0kyxRQ= X-Gm-Message-State: AOJu0YwGLLGA5SbDDqy9bCoqLO7/9sLtjacgLMdRIqo/jtzu7Y7RSKRl ElPJACMPs48mxFeUKwStHtKH2nGbe9AFLLjhhGMhZL7o95/03TCYXJra3xH6IiZ+6+hzCxjOUNa zTfla8K3RWFDC1nyboPf4K5+BpAU= X-Google-Smtp-Source: AGHT+IEP2PRAErTlJKaX5mCGUjXUHTm1j2QYD4mJIWSnjVBKhEf4IrxDiaZsBvKjXK1rTpYf+lSwrEtYTWi4YA61IoU= X-Received: by 2002:a17:90a:fa02:b0:2b2:6339:b1a3 with SMTP id 98e67ed59e1d1-2b616aedaf7mr472183a91.37.1715107962298; Tue, 07 May 2024 11:52:42 -0700 (PDT) MIME-Version: 1.0 References: <20240504003006.3303334-1-andrii@kernel.org> <20240504003006.3303334-3-andrii@kernel.org> In-Reply-To: From: Andrii Nakryiko Date: Tue, 7 May 2024 11:52:30 -0700 Message-ID: Subject: Re: [PATCH 2/5] fs/procfs: implement efficient VMA querying API for /proc//maps To: "Liam R. Howlett" , Andrii Nakryiko , linux-fsdevel@vger.kernel.org, brauner@kernel.org, viro@zeniv.linux.org.uk, akpm@linux-foundation.org, linux-kernel@vger.kernel.org, bpf@vger.kernel.org, gregkh@linuxfoundation.org, linux-mm@kvack.org Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspam-User: X-Rspamd-Queue-Id: 7AD7D1C0011 X-Rspamd-Server: rspam06 X-Stat-Signature: gxfcifdwum4efu5dyst8os4bwigmgeeh X-HE-Tag: 1715107963-659861 X-HE-Meta: U2FsdGVkX19pwrZ4k+u5HFlomMbd8ZNCbqEHiiOjQkIZzx+Nk38+lGN32jst8Lny/hp5P5CXZVjCeXVwxiylv72xXYvoYNYTe6ImIXdXKBNs453K0YELBID1MDRQnMcKV3AD6RhulQ6AOXh0vjixGiBxZaapdXez63wGJ9ZW8A5pQzFa05vNIjIUDfcUY25/ilTxRcUr8ebqPF79fKUZ9eDtT5jKhuquyZA+gibCWdbYKJXg9pqziQQh5RKPdm0RfjgLCyi+YCa9uOLFmKrpQzuq5sRVPZ33VG8NElj4GeBzDIarQm3O3lITPxTJzjDAAJZFaj0Cwh8CRDRyaPRs35OF0ugW3ir8l8UvWev6SHnIAgYKSDNIYN0rMzo92qnkRo5m3tDhL5Y4DNTgRJg7LOcspbSGBZJQBN1e1p0qaMgj85D/fO/MU+BeaB5uPDmiu4uFoIdNEeJDg79nOUQVAOuJqmUCTnqO3yYKgkZ0qjrbYuuiL4e/3dxCP+SMT3Erpg+DxIl2m7uNXsXz+viMPQxYimlunUfCXWagd3hkeDXcysrF9MUg+pRiMIeIpJLDvmoD+t43DVtF3Wlyq5deKWV70oPZQrDLVGYd2/33UPllU9H77AS3P2KmdvxpX1h9SwxWGY7xWtZK0+sjR2vEmLtsNKpVASGiUL4jwhntOxoUTJUAVj6JIXD6+geG8MTtwgOC6nPf8owm6UVbl+5110+K0LRLijPnVpF1LPsilNfD/dEvq7wV1ON3Na8TKTGAFJkJD2B+VF21i3aFzR9+EelvXB8NO+48RnNyFZnhBxA7gLKHV8XzD+56yqXa9UOjJQnDlGUqfnyWrUDf9frzOHe9/p0mU5FVu3q0ZG/9tXFdyDu3CqPL1n/XFk9TkxzuTHMbScKeZkui42KcJsAkoDyN91rRX1LfvCC2G23HUUnlxUxEkBKXcT5Q0Mj7bX6e+93JXlG54yzVn3NcUCp MoqI+jA6 kEwA79+OFpOxUImmiuah6mhAh9gv16ZGjQEgOd9kD1j+fhGc8I0Q47YnPaLRAfJR6u4kICRARr32A+L5LPavwOlrT8cvti7r7o6hZ/pPPjKO7Y6NqKby+j3qBQ6LSvz3xYCAYKzfX3yiidHdIEhAfO+Wxm1ffIDctbY92JI9+bKP0YihtzQUhFsuqmYzpFlrcDk9ycOSBBweJ9AhqxQPAN+gIGKDeYQj5RbblDh8g0P14w5oQ1tpfKHBeq2nm6nK/X4Ty5R5K86VdinMph74tm5CGudK4zzkjYkABlvUksR8qi3qT5yLU0pnFpiAw73Dov0LSc/tT0WucxAJdx5uu0xuhG7gyDTRcbJCn8nDm2/UDnSbiMLafdjO784DWL55lgpdgjWRm3OMNE9DIjwtc/L349g+Y3j24dIyhgpC/tOYF/CE/WenU3j2XoJRe6u21AVRM/xoZH4equiAbcAMLXxUX5A== X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Tue, May 7, 2024 at 11:10=E2=80=AFAM Liam R. Howlett wrote: > > * Andrii Nakryiko [240503 20:30]: > > /proc//maps file is extremely useful in practice for various tasks > > involving figuring out process memory layout, what files are backing an= y > > given memory range, etc. One important class of applications that > > absolutely rely on this are profilers/stack symbolizers. They would > > normally capture stack trace containing absolute memory addresses of > > some functions, and would then use /proc//maps file to file > > corresponding backing ELF files, file offsets within them, and then > > continue from there to get yet more information (ELF symbols, DWARF > > information) to get human-readable symbolic information. > > > > As such, there are both performance and correctness requirement > > involved. This address to VMA information translation has to be done as > > efficiently as possible, but also not miss any VMA (especially in the > > case of loading/unloading shared libraries). > > > > Unfortunately, for all the /proc//maps file universality and > > usefulness, it doesn't fit the above 100%. > > > > First, it's text based, which makes its programmatic use from > > applications and libraries unnecessarily cumbersome and slow due to the > > need to do text parsing to get necessary pieces of information. > > > > Second, it's main purpose is to emit all VMAs sequentially, but in > > practice captured addresses would fall only into a small subset of all > > process' VMAs, mainly containing executable text. Yet, library would > > need to parse most or all of the contents to find needed VMAs, as there > > is no way to skip VMAs that are of no use. Efficient library can do the > > linear pass and it is still relatively efficient, but it's definitely a= n > > overhead that can be avoided, if there was a way to do more targeted > > querying of the relevant VMA information. > > > > Another problem when writing generic stack trace symbolization library > > is an unfortunate performance-vs-correctness tradeoff that needs to be > > made. Library has to make a decision to either cache parsed contents of > > /proc//maps for service future requests (if application requests t= o > > symbolize another set of addresses, captured at some later time, which > > is typical for periodic/continuous profiling cases) to avoid higher > > costs of needed to re-parse this file or caching the contents in memory > > to speed up future requests. In the former case, more memory is used fo= r > > the cache and there is a risk of getting stale data if application > > loaded/unloaded shared libraries, or otherwise changed its set of VMAs > > through additiona mmap() calls (and other means of altering memory > > address space). In the latter case, it's the performance hit that comes > > from re-opening the file and re-reading/re-parsing its contents all ove= r > > again. > > > > This patch aims to solve this problem by providing a new API built on > > top of /proc//maps. It is ioctl()-based and built as a binary > > interface, avoiding the cost and awkwardness of textual representation > > for programmatic use. It's designed to be extensible and > > forward/backward compatible by including user-specified field size and > > using copy_struct_from_user() approach. But, most importantly, it allow= s > > to do point queries for specific single address, specified by user. And > > this is done efficiently using VMA iterator. > > > > User has a choice to pick either getting VMA that covers provided > > address or -ENOENT if none is found (exact, least surprising, case). Or= , > > with an extra query flag (PROCFS_PROCMAP_EXACT_OR_NEXT_VMA), they can > > get either VMA that covers the address (if there is one), or the closes= t > > next VMA (i.e., VMA with the smallest vm_start > addr). The later allow= s > > more efficient use, but, given it could be a surprising behavior, > > requires an explicit opt-in. > > > > Basing this ioctl()-based API on top of /proc//maps's FD makes > > sense given it's querying the same set of VMA data. All the permissions > > checks performed on /proc//maps opening fit here as well. > > ioctl-based implementation is fetching remembered mm_struct reference, > > but otherwise doesn't interfere with seq_file-based implementation of > > /proc//maps textual interface, and so could be used together or > > independently without paying any price for that. > > > > There is one extra thing that /proc//maps doesn't currently > > provide, and that's an ability to fetch ELF build ID, if present. User > > has control over whether this piece of information is requested or not > > by either setting build_id_size field to zero or non-zero maximum buffe= r > > size they provided through build_id_addr field (which encodes user > > pointer as __u64 field). > > > > The need to get ELF build ID reliably is an important aspect when > > dealing with profiling and stack trace symbolization, and > > /proc//maps textual representation doesn't help with this, > > requiring applications to open underlying ELF binary through > > /proc//map_files/- symlink, which adds an extra > > permissions implications due giving a full access to the binary from > > (potentially) another process, while all application is interested in i= s > > build ID. Giving an ability to request just build ID doesn't introduce > > any additional security concerns, on top of what /proc//maps is > > already concerned with, simplifying the overall logic. > > > > Kernel already implements build ID fetching, which is used from BPF > > subsystem. We are reusing this code here, but plan a follow up changes > > to make it work better under more relaxed assumption (compared to what > > existing code assumes) of being called from user process context, in > > which page faults are allowed. BPF-specific implementation currently > > bails out if necessary part of ELF file is not paged in, all due to > > extra BPF-specific restrictions (like the need to fetch build ID in > > restrictive contexts such as NMI handler). > > > > Note also, that fetching VMA name (e.g., backing file path, or special > > hard-coded or user-provided names) is optional just like build ID. If > > user sets vma_name_size to zero, kernel code won't attempt to retrieve > > it, saving resources. > > > > Signed-off-by: Andrii Nakryiko > > --- > > fs/proc/task_mmu.c | 165 ++++++++++++++++++++++++++++++++++++++++ > > include/uapi/linux/fs.h | 32 ++++++++ > > 2 files changed, 197 insertions(+) > > > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c > > index 8e503a1635b7..cb7b1ff1a144 100644 > > --- a/fs/proc/task_mmu.c > > +++ b/fs/proc/task_mmu.c > > @@ -22,6 +22,7 @@ > > #include > > #include > > #include > > +#include > > > > #include > > #include > > @@ -375,11 +376,175 @@ static int pid_maps_open(struct inode *inode, st= ruct file *file) > > return do_maps_open(inode, file, &proc_pid_maps_op); > > } > > > > +static int do_procmap_query(struct proc_maps_private *priv, void __use= r *uarg) > > +{ > > + struct procfs_procmap_query karg; > > + struct vma_iterator iter; > > + struct vm_area_struct *vma; > > + struct mm_struct *mm; > > + const char *name =3D NULL; > > + char build_id_buf[BUILD_ID_SIZE_MAX], *name_buf =3D NULL; > > + __u64 usize; > > + int err; > > + > > + if (copy_from_user(&usize, (void __user *)uarg, sizeof(usize))) > > + return -EFAULT; > > + if (usize > PAGE_SIZE) > > + return -E2BIG; > > + if (usize < offsetofend(struct procfs_procmap_query, query_addr)) > > + return -EINVAL; > > + err =3D copy_struct_from_user(&karg, sizeof(karg), uarg, usize); > > + if (err) > > + return err; > > + > > + if (karg.query_flags & ~PROCFS_PROCMAP_EXACT_OR_NEXT_VMA) > > + return -EINVAL; > > + if (!!karg.vma_name_size !=3D !!karg.vma_name_addr) > > + return -EINVAL; > > + if (!!karg.build_id_size !=3D !!karg.build_id_addr) > > + return -EINVAL; > > + > > + mm =3D priv->mm; > > + if (!mm || !mmget_not_zero(mm)) > > + return -ESRCH; > > + if (mmap_read_lock_killable(mm)) { > > + mmput(mm); > > + return -EINTR; > > + } > > Using the rcu lookup here will allow for more success rate with less > lock contention. > If you have any code pointers, I'd appreciate it. If not, I'll try to find it myself, no worries. > > + > > + vma_iter_init(&iter, mm, karg.query_addr); > > + vma =3D vma_next(&iter); > > + if (!vma) { > > + err =3D -ENOENT; > > + goto out; > > + } > > + /* user wants covering VMA, not the closest next one */ > > + if (!(karg.query_flags & PROCFS_PROCMAP_EXACT_OR_NEXT_VMA) && > > + vma->vm_start > karg.query_addr) { > > + err =3D -ENOENT; > > + goto out; > > + } > > The interface you are using is a start address to search from to the end > of the address space, so this won't work as you intended with the > PROCFS_PROCMAP_EXACT_OR_NEXT_VMA flag. I do not think the vma iterator Maybe the name isn't the best, by "EXACT" here I meant "VMA that exactly covers provided address", so maybe "COVERING_OR_NEXT_VMA" would be better wording. With that out of the way, I think this API works exactly how I expect it to work: # cat /proc/3406/maps | grep -C1 7f42099fe000 7f42099fa000-7f42099fc000 rw-p 00000000 00:00 0 7f42099fc000-7f42099fe000 r--p 00000000 00:21 109331 /usr/local/fbcode/platform010-compat/lib/libz.so.1.2.8 7f42099fe000-7f4209a0e000 r-xp 00002000 00:21 109331 /usr/local/fbcode/platform010-compat/lib/libz.so.1.2.8 7f4209a0e000-7f4209a14000 r--p 00012000 00:21 109331 /usr/local/fbcode/platform010-compat/lib/libz.so.1.2.8 # cat addrs.txt 0x7f42099fe010 # ./procfs_query -f addrs.txt -p 3406 -v -Q PID: 3406 PATH: addrs.txt READ 1 addrs! SORTED ADDRS (1): ADDR #0: 0x7f42099fe010 VMA FOUND (addr 7f42099fe010): 7f42099fe000-7f4209a0e000 r-xp 00002000 00:21 109331 /usr/local/fbcode/platform010-compat/lib/libz.so.1.2.8 (build ID: NO, 0 bytes) RESOLVED ADDRS (1): RESOLVED #0: 0x7f42099fe010 -> OFF 0x2010 NAME /usr/local/fbcode/platform010-compat/lib/libz.so.1.2.8 You can see above that for the requested 0x7f42099fe010 address we got a VMA that starts before this address: 7f42099fe000-7f4209a0e000, which is what we want. Before submitting I ran the tool with /proc//maps and ioctl to "resolve" the exact same set of addresses and I compared results. They were identical. Note, there is a small bug in the tool I added in patch #5. I changed `-i` argument to `-Q` at the very last moment and haven't updated the code in one place. But other than that I didn't change anything. For the above output, I added "VMA FOUND" verbose logging to see all the details of VMA, not just resolved offset. I'll add that in v2. > has the desired interface you want as the single address lookup doesn't > use the vma iterator. I'd just run the vma_next() and check the limits. > See find_exact_vma() for the limit checks. > > > + > > + karg.vma_start =3D vma->vm_start; > > + karg.vma_end =3D vma->vm_end; > > + [...]