From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id C8E50C25B4F for ; Mon, 6 May 2024 18:53:48 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 66F8D6B0098; Mon, 6 May 2024 14:53:48 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 61FE26B0099; Mon, 6 May 2024 14:53:48 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 50F006B009A; Mon, 6 May 2024 14:53:48 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0011.hostedemail.com [216.40.44.11]) by kanga.kvack.org (Postfix) with ESMTP id 3036D6B0098 for ; Mon, 6 May 2024 14:53:48 -0400 (EDT) Received: from smtpin22.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay04.hostedemail.com (Postfix) with ESMTP id E995E1A0719 for ; Mon, 6 May 2024 18:53:47 +0000 (UTC) X-FDA: 82088870094.22.4A71435 Received: from dfw.source.kernel.org (dfw.source.kernel.org [139.178.84.217]) by imf22.hostedemail.com (Postfix) with ESMTP id 1FFDDC000D for ; Mon, 6 May 2024 18:53:45 +0000 (UTC) Authentication-Results: imf22.hostedemail.com; dkim=pass header.d=kernel.org header.s=k20201202 header.b=e4puEBak; dmarc=pass (policy=none) header.from=kernel.org; spf=pass (imf22.hostedemail.com: domain of acme@kernel.org designates 139.178.84.217 as permitted sender) smtp.mailfrom=acme@kernel.org ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1715021626; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=KN3r16H8NU3p0kABwC0vFdb0GNYlWbV+BEjCIY8CzVA=; b=xebkQB3O8MMV7SNGQsCnmE435qj0MS4vXogxO9ECwFy/k210cbfGe3KXJ2NOpJqicOx/ZS lzJd3LPReFahgYycVIYXG/zfisXCWQgow5ZnMpiYk4LsbeEKJZ72RYTqy3zHvIqECMmP2H TzJEzUKYBJVNpmR2DTF0YPu6VI+lylQ= ARC-Authentication-Results: i=1; imf22.hostedemail.com; dkim=pass header.d=kernel.org header.s=k20201202 header.b=e4puEBak; dmarc=pass (policy=none) header.from=kernel.org; spf=pass (imf22.hostedemail.com: domain of acme@kernel.org designates 139.178.84.217 as permitted sender) smtp.mailfrom=acme@kernel.org ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1715021626; a=rsa-sha256; cv=none; b=vAGWpUCrsUaf4PrlRDtC99XCk/TV+PTmyM12XwKcoPTEBFqY8F58j3YHT2uOnQDl9OPuUE 4NHsrRC/GQdcvVXHIvhHIhncwavHyK6iuMkebkaFZlxt6t7O/S1ChQBU9sisB5agJzJLkw rzHb7vVAgdJS2U3yep1VMJSTIhX3SvY= Received: from smtp.kernel.org (transwarp.subspace.kernel.org [100.75.92.58]) by dfw.source.kernel.org (Postfix) with ESMTP id 0F4626143F; Mon, 6 May 2024 18:53:45 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id 0382DC116B1; Mon, 6 May 2024 18:53:43 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1715021624; bh=GZr+p5TBW31QQMQhi32NoIs4LnmnAqFcXLqfJ3ZnwXU=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=e4puEBak3WAa1kzFEkJrSBSJU2y2u1NqdwQX9sOf7uFYjlQEh+ZITLfQklFNifYGb v96jmgBIGqU1+3jhHcV4a2khe4UP/dDmfqBe5MMO57OVvA05D5jNC+7W3ET4aw7hD5 XyBrOKM4LNUMRN72+NGaf3f0BTWp5K1yAml1LhdJA+eJb2jdjRRBAVC95J8h7bk74h RDMGaVsAmThZSVg4xb0/KIIhQeo2UAEvAoY6B5OoY/8Zw20Wg5O6OaOPEPH/qSTWxj 6ShriMWzn/ukTZDHxBDCSJaw/5p5zwtyhEc5P7OgbRYwe58+qxypLZOWmgk4Lx7HjC SLghAt1id7Q9g== Date: Mon, 6 May 2024 15:53:40 -0300 From: Arnaldo Carvalho de Melo To: Namhyung Kim Cc: Andrii Nakryiko , Jiri Olsa , Ian Rogers , Greg KH , Andrii Nakryiko , linux-fsdevel@vger.kernel.org, brauner@kernel.org, viro@zeniv.linux.org.uk, akpm@linux-foundation.org, linux-kernel@vger.kernel.org, bpf@vger.kernel.org, linux-mm@kvack.org, Daniel =?iso-8859-1?Q?M=FCller?= , "linux-perf-use." Subject: Re: [PATCH 2/5] fs/procfs: implement efficient VMA querying API for /proc//maps Message-ID: References: <20240504003006.3303334-1-andrii@kernel.org> <20240504003006.3303334-3-andrii@kernel.org> <2024050439-janitor-scoff-be04@gregkh> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: X-Rspam-User: X-Rspamd-Server: rspam09 X-Rspamd-Queue-Id: 1FFDDC000D X-Stat-Signature: 3ec7nkrydawhezhtwjprsrc8dxokrnrc X-HE-Tag: 1715021625-358239 X-HE-Meta: U2FsdGVkX1/5uCm5HogahKsTsxV20wQlGLDkgWFrcp72ndlZX8DWYpXCnIyt8+CCFaMSJZR97slp1FdNJ+0hcIVt4EClr5OoXKnpE++UisVAIRyZtwdOUoXixEPuHymTE7/1gSnH0ZOUL6y6keNcYI+OCmDLGS5q0spmxg1RyJkQzP0N1Vy8ehay4Vo3V70iUPQENCmKxWQMtJwp+uHMQEykw1uixAqC1RcynPGIVH3nUNFScLBUvcD1j2mvEYpsQcLWru5zcVR9yWEAjQDUpH9Sj6WE8GHK3QwbiyAW5CbWXJbdKqDpCBvLcgup+qxayp95onjES0fR3wxEiK73kfmMy/xVV8oIoyGgulBZ9XwRvsrm7Bsxu69n0FHSbZ6YtGxDmLEi+voNYAKbSOvUq00t0ITsbEyvUMGBys74zn25wfZVPf64O81x6pEt4NXLGtMEBdavkrKd9Y+LEq0ii6N4MOaF1q9tx0nv29ibOpNqedvcFTIEowbKdf7R5L+K5N+7wSrAl8GC3KBUVa8YqlbuggOUQmpsJp8SRZPQn22hI9hCZxP783DFousU5IL23csXtR/le7AI9PTSxoF8Yf5SIkWT2Xh3zx3KxdFpqZM4xURA5Pe79RzuNOeViKaHQLGLL1peeQ+9m1OBpYm/AXxFcSz/f72FmYkq11NdSM7iOj2Nigi8T5OU0qoUlftoHpDbuXT9hHHCl+IMoHuvc7ZbnbkGBqGeIVh2lgsU4P1g0vWFMl47s5YNH5bKMP3LX1moEoLyp9Jj4/mgnV5dcqoZYU2yWsU3VnDnBW3O9uqttxhxl/twEmMJHKugFlPk1vAckd4tSUlgiYf6y1Dy2qSEVuB9BgpANx3zenwHcl3dYF9DHyUrfjTts7YTvi+RwCDbyLqAxEmW4O3PGd3D+eJI2mRhMTrv4RgsnlqNK7kRzbDmnr/PuhIG22SDVQrDpNcDfuRfkFYq9N6kp2Q Hr1KOlPy CGR9X4DNoMkWX2kg77oAdKVcYBEbICd41AHF3eTXigEIsSXOOCyfWwJhCIrhbM0TSRSj+T2zflibM5eQWkJEs5y8uFY/eyRaoJPXpN2E7bNhSqggNQgKnQBB8V1aDEWMyfo3EdNJW7JmQyebTLTImlahqNZTwMqlhJbXMHAhgYBvJ6xC+RjhjdxTWbdpz0tDsD7GHOJvPzMVVg0i12Ic26mTnEnrDRhjDOeRZn+TFdeAzCRVN07bt2R/oGADQzGVjbmgjw3Qxd30mhUkidkieg2QwRVNsqpY8SccpS4DZ6zutcvEwtzYEyD31w5wjyF1yFqICdn4YyS7mhtXOi/3wN17vrIKggYYybTJP3K1QnSMJeZMcuizMlcFCHBTIXITDlsXE+CSABVkkduHYAsU/AvBu8Q== X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Mon, May 06, 2024 at 11:05:17AM -0700, Namhyung Kim wrote: > On Mon, May 6, 2024 at 6:58 AM Arnaldo Carvalho de Melo wrote: > > On Sat, May 04, 2024 at 02:50:31PM -0700, Andrii Nakryiko wrote: > > > On Sat, May 4, 2024 at 8:28 AM Greg KH wrote: > > > > On Fri, May 03, 2024 at 05:30:03PM -0700, Andrii Nakryiko wrote: > > > > > Note also, that fetching VMA name (e.g., backing file path, or special > > > > > hard-coded or user-provided names) is optional just like build ID. If > > > > > user sets vma_name_size to zero, kernel code won't attempt to retrieve > > > > > it, saving resources. > > > > > Signed-off-by: Andrii Nakryiko > > > > Where is the userspace code that uses this new api you have created? > > > So I added a faithful comparison of existing /proc//maps vs new > > > ioctl() API to solve a common problem (as described above) in patch > > > #5. The plan is to put it in mentioned blazesym library at the very > > > least. > > > > > > I'm sure perf would benefit from this as well (cc'ed Arnaldo and > > > linux-perf-user), as they need to do stack symbolization as well. > I think the general use case in perf is different. This ioctl API is great > for live tracing of a single (or a small number of) process(es). And > yes, perf tools have those tracing use cases too. But I think the > major use case of perf tools is system-wide profiling. > For system-wide profiling, you need to process samples of many > different processes at a high frequency. Now perf record doesn't > process them and just save it for offline processing (well, it does > at the end to find out build-ID but it can be omitted). Since: Author: Jiri Olsa Date: Mon Dec 14 11:54:49 2020 +0100 1ca6e80254141d26 ("perf tools: Store build id when available in PERF_RECORD_MMAP2 metadata events") We don't need to to process the events to find the build ids. I haven't checked if we still do it to find out which DSOs had hits, but we shouldn't need to do it for build-ids (unless they were not in memory when the kernel tried to stash them in the PERF_RECORD_MMAP2, which I haven't checked but IIRC is a possibility if that ELF part isn't in memory at the time we want to copy it). If we're still traversing it like that I guess we can have a knob and make it the default to not do that and instead create the perf.data build ID header table with all the build-ids we got from PERF_RECORD_MMAP2, a (slightly) bigger perf.data file but no event processing at the end of a 'perf record' session. > Doing it online is possible (like perf top) but it would add more > overhead during the profiling. And we cannot move processing It comes in the PERF_RECORD_MMAP2, filled by the kernel. > or symbolization to the end of profiling because some (short- > lived) tasks can go away. right > Also it should support perf report (offline) on data from a > different kernel or even a different machine. right > So it saves the memory map of processes and symbolizes > the stack trace with it later. Of course it needs to be updated > as the memory map changes and that's why it tracks mmap > or similar syscalls with PERF_RECORD_MMAP[2] records. > A problem with this approach is to get the initial state of all > (or a target for non-system-wide mode) existing processes. > We call it synthesizing, and read /proc/PID/maps to generate > the mmap records. > I think the below comment from Arnaldo talked about how > we can improve the synthesizing (which is sequential access > to proc maps) using BPF. Yes, I wonder how far Jiri went, Jiri? - Arnaldo > Thanks, > Namhyung > > > > > > At some point, when BPF iterators became a thing we thought about, IIRC > > Jiri did some experimentation, but I lost track, of using BPF to > > synthesize PERF_RECORD_MMAP2 records for pre-existing maps, the layout > > as in uapi/linux/perf_event.h: > > > > /* > > * The MMAP2 records are an augmented version of MMAP, they add > > * maj, min, ino numbers to be used to uniquely identify each mapping > > * > > * struct { > > * struct perf_event_header header; > > * > > * u32 pid, tid; > > * u64 addr; > > * u64 len; > > * u64 pgoff; > > * union { > > * struct { > > * u32 maj; > > * u32 min; > > * u64 ino; > > * u64 ino_generation; > > * }; > > * struct { > > * u8 build_id_size; > > * u8 __reserved_1; > > * u16 __reserved_2; > > * u8 build_id[20]; > > * }; > > * }; > > * u32 prot, flags; > > * char filename[]; > > * struct sample_id sample_id; > > * }; > > */ > > PERF_RECORD_MMAP2 = 10, > > > > * PERF_RECORD_MISC_MMAP_BUILD_ID - PERF_RECORD_MMAP2 event > > > > As perf.data files can be used for many purposes we want them all, so we > > setup a meta data perf file descriptor to go on receiving the new mmaps > > while we read /proc//maps, to reduce the chance of missing maps, do > > it in parallel, etc: > > > > ⬢[acme@toolbox perf-tools-next]$ perf record -h 'event synthesis' > > > > Usage: perf record [] [] > > or: perf record [] -- [] > > > > --num-thread-synthesize > > number of threads to run for event synthesis > > --synth > > Fine-tune event synthesis: default=all > > > > ⬢[acme@toolbox perf-tools-next]$ > > > > For this specific initial synthesis of everything the plan, as mentioned > > about Jiri's experiments, was to use a BPF iterator to just feed the > > perf ring buffer with those events, that way userspace would just > > receive the usual records it gets when a new mmap is put in place, the > > BPF iterator would just feed the preexisting mmaps, as instructed via > > the perf_event_attr for the perf_event_open syscall. > > > > For people not wanting BPF, i.e. disabling it altogether in perf or > > disabling just BPF skels, then we would fallback to the current method, > > or to the one being discussed here when it becomes available. > > > > One thing to have in mind is for this iterator not to generate duplicate > > records for non-pre-existing mmaps, i.e. we would need some generation > > number that would be bumped when asking for such pre-existing maps > > PERF_RECORD_MMAP2 dumps. > > > > > It will be up to other similar projects to adopt this, but we'll > > > definitely get this into blazesym as it is actually a problem for the > > > > At some point looking at plugging blazesym somehow with perf may be > > something to consider, indeed. > > > > - Arnaldo > > > > > abovementioned Oculus use case. We already had to make a tradeoff (see > > > [2], this wasn't done just because we could, but it was requested by > > > Oculus customers) to cache the contents of /proc//maps and run > > > the risk of missing some shared libraries that can be loaded later. It > > > would be great to not have to do this tradeoff, which this new API > > > would enable. > > > > > > [2] https://github.com/libbpf/blazesym/commit/6b521314126b3ae6f2add43e93234b59fed48ccf > > > > > > > > > > > > --- > > > > > fs/proc/task_mmu.c | 165 ++++++++++++++++++++++++++++++++++++++++ > > > > > include/uapi/linux/fs.h | 32 ++++++++ > > > > > 2 files changed, 197 insertions(+) > > > > > > > > > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c > > > > > index 8e503a1635b7..cb7b1ff1a144 100644 > > > > > --- a/fs/proc/task_mmu.c > > > > > +++ b/fs/proc/task_mmu.c > > > > > @@ -22,6 +22,7 @@ > > > > > #include > > > > > #include > > > > > #include > > > > > +#include > > > > > > > > > > #include > > > > > #include > > > > > @@ -375,11 +376,175 @@ static int pid_maps_open(struct inode *inode, struct file *file) > > > > > return do_maps_open(inode, file, &proc_pid_maps_op); > > > > > } > > > > > > > > > > +static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg) > > > > > +{ > > > > > + struct procfs_procmap_query karg; > > > > > + struct vma_iterator iter; > > > > > + struct vm_area_struct *vma; > > > > > + struct mm_struct *mm; > > > > > + const char *name = NULL; > > > > > + char build_id_buf[BUILD_ID_SIZE_MAX], *name_buf = NULL; > > > > > + __u64 usize; > > > > > + int err; > > > > > + > > > > > + if (copy_from_user(&usize, (void __user *)uarg, sizeof(usize))) > > > > > + return -EFAULT; > > > > > + if (usize > PAGE_SIZE) > > > > > > > > Nice, where did you document that? And how is that portable given that > > > > PAGE_SIZE can be different on different systems? > > > > > > I'm happy to document everything, can you please help by pointing > > > where this documentation has to live? > > > > > > This is mostly fool-proofing, though, because the user has to pass > > > sizeof(struct procfs_procmap_query), which I don't see ever getting > > > close to even 4KB (not even saying about 64KB). This is just to > > > prevent copy_struct_from_user() below to do too much zero-checking. > > > > > > > > > > > and why aren't you checking the actual structure size instead? You can > > > > easily run off the end here without knowing it. > > > > > > See copy_struct_from_user(), it does more checks. This is a helper > > > designed specifically to deal with use cases like this where kernel > > > struct size can change and user space might be newer or older. > > > copy_struct_from_user() has a nice documentation describing all these > > > nuances. > > > > > > > > > > > > + return -E2BIG; > > > > > + if (usize < offsetofend(struct procfs_procmap_query, query_addr)) > > > > > + return -EINVAL; > > > > > > > > Ok, so you have two checks? How can the first one ever fail? > > > > > > Hmm.. If usize = 8, copy_from_user() won't fail, usize > PAGE_SIZE > > > won't fail, but this one will fail. > > > > > > The point of this check is that user has to specify at least first > > > three fields of procfs_procmap_query (size, query_flags, and > > > query_addr), because without those the query is meaningless. > > > > > > > > > > > > > + err = copy_struct_from_user(&karg, sizeof(karg), uarg, usize); > > > > > > and this helper does more checks validating that the user either has a > > > shorter struct (and then zero-fills the rest of kernel-side struct) or > > > has longer (and then the longer part has to be zero filled). Do check > > > copy_struct_from_user() documentation, it's great. > > > > > > > > + if (err) > > > > > + return err; > > > > > + > > > > > + if (karg.query_flags & ~PROCFS_PROCMAP_EXACT_OR_NEXT_VMA) > > > > > + return -EINVAL; > > > > > + if (!!karg.vma_name_size != !!karg.vma_name_addr) > > > > > + return -EINVAL; > > > > > + if (!!karg.build_id_size != !!karg.build_id_addr) > > > > > + return -EINVAL; > > > > > > > > So you want values to be set, right? > > > > > > Either both should be set, or neither. It's ok for both size/addr > > > fields to be zero, in which case it indicates that the user doesn't > > > want this part of information (which is usually a bit more expensive > > > to get and might not be necessary for all the cases). > > > > > > > > > > > > + > > > > > + mm = priv->mm; > > > > > + if (!mm || !mmget_not_zero(mm)) > > > > > + return -ESRCH; > > > > > > > > What is this error for? Where is this documentned? > > > > > > I copied it from existing /proc//maps checks. I presume it's > > > guarding the case when mm might be already put. So if the process is > > > gone, but we have /proc//maps file open? > > > > > > > > > > > > + if (mmap_read_lock_killable(mm)) { > > > > > + mmput(mm); > > > > > + return -EINTR; > > > > > + } > > > > > + > > > > > + vma_iter_init(&iter, mm, karg.query_addr); > > > > > + vma = vma_next(&iter); > > > > > + if (!vma) { > > > > > + err = -ENOENT; > > > > > + goto out; > > > > > + } > > > > > + /* user wants covering VMA, not the closest next one */ > > > > > + if (!(karg.query_flags & PROCFS_PROCMAP_EXACT_OR_NEXT_VMA) && > > > > > + vma->vm_start > karg.query_addr) { > > > > > + err = -ENOENT; > > > > > + goto out; > > > > > + } > > > > > + > > > > > + karg.vma_start = vma->vm_start; > > > > > + karg.vma_end = vma->vm_end; > > > > > + > > > > > + if (vma->vm_file) { > > > > > + const struct inode *inode = file_user_inode(vma->vm_file); > > > > > + > > > > > + karg.vma_offset = ((__u64)vma->vm_pgoff) << PAGE_SHIFT; > > > > > + karg.dev_major = MAJOR(inode->i_sb->s_dev); > > > > > + karg.dev_minor = MINOR(inode->i_sb->s_dev); > > > > > > > > So the major/minor is that of the file superblock? Why? > > > > > > Because inode number is unique only within given super block (and even > > > then it's more complicated, e.g., btrfs subvolumes add more headaches, > > > I believe). inode + dev maj/min is sometimes used for cache/reuse of > > > per-binary information (e.g., pre-processed DWARF information, which > > > is *very* expensive, so anything that allows to avoid doing this is > > > helpful). > > > > > > > > > > > > + karg.inode = inode->i_ino; > > > > > > > > What is userspace going to do with this? > > > > > > > > > > See above. > > > > > > > > + } else { > > > > > + karg.vma_offset = 0; > > > > > + karg.dev_major = 0; > > > > > + karg.dev_minor = 0; > > > > > + karg.inode = 0; > > > > > > > > Why not set everything to 0 up above at the beginning so you never miss > > > > anything, and you don't miss any holes accidentally in the future. > > > > > > > > > > Stylistic preference, I find this more explicit, but I don't care much > > > one way or another. > > > > > > > > + } > > > > > + > > > > > + karg.vma_flags = 0; > > > > > + if (vma->vm_flags & VM_READ) > > > > > + karg.vma_flags |= PROCFS_PROCMAP_VMA_READABLE; > > > > > + if (vma->vm_flags & VM_WRITE) > > > > > + karg.vma_flags |= PROCFS_PROCMAP_VMA_WRITABLE; > > > > > + if (vma->vm_flags & VM_EXEC) > > > > > + karg.vma_flags |= PROCFS_PROCMAP_VMA_EXECUTABLE; > > > > > + if (vma->vm_flags & VM_MAYSHARE) > > > > > + karg.vma_flags |= PROCFS_PROCMAP_VMA_SHARED; > > > > > + > > > > > > [...] > > > > > > > > diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h > > > > > index 45e4e64fd664..fe8924a8d916 100644 > > > > > --- a/include/uapi/linux/fs.h > > > > > +++ b/include/uapi/linux/fs.h > > > > > @@ -393,4 +393,36 @@ struct pm_scan_arg { > > > > > __u64 return_mask; > > > > > }; > > > > > > > > > > +/* /proc//maps ioctl */ > > > > > +#define PROCFS_IOCTL_MAGIC 0x9f > > > > > > > > Don't you need to document this in the proper place? > > > > > > I probably do, but I'm asking for help in knowing where. procfs is not > > > a typical area of kernel I'm working with, so any pointers are highly > > > appreciated. > > > > > > > > > > > > +#define PROCFS_PROCMAP_QUERY _IOWR(PROCFS_IOCTL_MAGIC, 1, struct procfs_procmap_query) > > > > > + > > > > > +enum procmap_query_flags { > > > > > + PROCFS_PROCMAP_EXACT_OR_NEXT_VMA = 0x01, > > > > > +}; > > > > > + > > > > > +enum procmap_vma_flags { > > > > > + PROCFS_PROCMAP_VMA_READABLE = 0x01, > > > > > + PROCFS_PROCMAP_VMA_WRITABLE = 0x02, > > > > > + PROCFS_PROCMAP_VMA_EXECUTABLE = 0x04, > > > > > + PROCFS_PROCMAP_VMA_SHARED = 0x08, > > > > > > > > Are these bits? If so, please use the bit macro for it to make it > > > > obvious. > > > > > > > > > > Yes, they are. When I tried BIT(1), it didn't compile. I chose not to > > > add any extra #includes to this UAPI header, but I can figure out the > > > necessary dependency and do BIT(), I just didn't feel like BIT() adds > > > much here, tbh. > > > > > > > > +}; > > > > > + > > > > > +struct procfs_procmap_query { > > > > > + __u64 size; > > > > > + __u64 query_flags; /* in */ > > > > > > > > Does this map to the procmap_vma_flags enum? if so, please say so. > > > > > > no, procmap_query_flags, and yes, I will > > > > > > > > > > > > + __u64 query_addr; /* in */ > > > > > + __u64 vma_start; /* out */ > > > > > + __u64 vma_end; /* out */ > > > > > + __u64 vma_flags; /* out */ > > > > > + __u64 vma_offset; /* out */ > > > > > + __u64 inode; /* out */ > > > > > > > > What is the inode for, you have an inode for the file already, why give > > > > it another one? > > > > > > This is inode of vma's backing file, same as /proc//maps' file > > > column. What inode of file do I already have here? You mean of > > > /proc//maps itself? It's useless for the intended purposes. > > > > > > > > > > > > + __u32 dev_major; /* out */ > > > > > + __u32 dev_minor; /* out */ > > > > > > > > What is major/minor for? > > > > > > This is the same information as emitted by /proc//maps, > > > identifies superblock of vma's backing file. As I mentioned above, it > > > can be used for caching per-file (i.e., per-ELF binary) information > > > (for example). > > > > > > > > > > > > + __u32 vma_name_size; /* in/out */ > > > > > + __u32 build_id_size; /* in/out */ > > > > > + __u64 vma_name_addr; /* in */ > > > > > + __u64 build_id_addr; /* in */ > > > > > > > > Why not document this all using kerneldoc above the structure? > > > > > > Yes, sorry, I slacked a bit on adding this upfront. I knew we'll be > > > figuring out the best place and approach, and so wanted to avoid > > > documentation churn. > > > > > > Would something like what we have for pm_scan_arg and pagemap APIs > > > work? I see it added a few simple descriptions for pm_scan_arg struct, > > > and there is Documentation/admin-guide/mm/pagemap.rst. Should I add > > > Documentation/admin-guide/mm/procmap.rst (admin-guide part feels off, > > > though)? Anyways, I'm hoping for pointers where all this should be > > > documented. Thank you! > > > > > > > > > > > anyway, I don't like ioctls, but there is a place for them, you just > > > > have to actually justify the use for them and not say "not efficient > > > > enough" as that normally isn't an issue overall. > > > > > > I've written a demo tool in patch #5 which performs real-world task: > > > mapping addresses to their VMAs (specifically calculating file offset, > > > finding vma_start + vma_end range to further access files from > > > /proc//map_files/-). I did the implementation > > > faithfully, doing it in the most optimal way for both APIs. I showed > > > that for "typical" (it's hard to specify what typical is, of course, > > > too many variables) scenario (it was data collected on a real server > > > running real service, 30 seconds of process-specific stack traces were > > > captured, if I remember correctly). I showed that doing exactly the > > > same amount of work is ~35x times slower with /proc//maps. > > > > > > Take another process, another set of addresses, another anything, and > > > the numbers will be different, but I think it gives the right idea. > > > > > > But I think we are overpivoting on text vs binary distinction here. > > > It's the more targeted querying of VMAs that's beneficial here. This > > > allows applications to not cache anything and just re-query when doing > > > periodic or continuous profiling (where addresses are coming in not as > > > one batch, as a sequence of batches extended in time). > > > > > > /proc//maps, for all its usefulness, just can't provide this sort > > > of ability, as it wasn't designed to do that and is targeting > > > different use cases. > > > > > > And then, a new ability to request reliable (it's not 100% reliable > > > today, I'm going to address that as a follow up) build ID is *crucial* > > > for some scenarios. The mentioned Oculus use case, the need to fully > > > access underlying ELF binary just to get build ID is frowned upon. And > > > for a good reason. Profiler only needs build ID, which is no secret > > > and not sensitive information. This new (and binary, yes) API allows > > > to add this into an API without breaking any backwards compatibility. > > > > > > > > > > > thanks, > > > > > > > > greg k-h > >