From: Andrii Nakryiko
Date: Fri, 17 Mar 2023 23:08:44 -0700
Subject: Re: [PATCHv3 bpf-next 0/9] mm/bpf/perf: Store build id in file object
To: Al Viro
Cc: Matthew Wilcox, Ian Rogers, Jiri Olsa, Alexei Starovoitov, Andrii Nakryiko, Hao Luo, Andrew Morton, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, bpf@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-perf-users@vger.kernel.org, Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev, Daniel Borkmann, Namhyung Kim, Dave Chinner, kernel-team@meta.com
References: <20230316170149.4106586-1-jolsa@kernel.org> <20230317211403.GZ3390869@ZenIV> <20230317212125.GA3390869@ZenIV>
In-Reply-To: <20230317212125.GA3390869@ZenIV>

On Fri, Mar 17, 2023 at 2:21 PM Al Viro wrote:
>
> On Fri, Mar 17, 2023 at 09:14:03PM +0000, Al Viro wrote:
> > On Fri, Mar 17, 2023 at 09:33:17AM -0700, Andrii Nakryiko wrote:
> >
> > > > But build IDs are _generally_ available.
> > > > The only problem (AIUI)
> > > > is when you're trying to examine the contents of one container from
> > > > another container.  And to solve that problem, you're imposing a cost
> > > > on everybody else with (so far) pretty vague justifications.  I really
> > > > don't like to see you growing struct file for this (nor struct inode,
> > > > nor struct vm_area_struct).  It's all quite unsatisfactory and I don't
> > > > have a good suggestion.
> > >
> > > There is a lot of profiling, observability and debugging tooling built
> > > using BPF. And when capturing stack traces from BPF programs, if the
> > > build ID note is not physically present in memory, fetching it from
> > > the BPF program might fail in NMI (and other non-faultable contexts).
> > > This patch set is about making sure we always can fetch build ID, even
> > > from most restrictive environments. It's guarded by Kconfig to avoid
> > > adding 8 bytes of overhead to struct file for environment where this
> > > might be unacceptable, giving users and distros a choice.
> >
> > Lovely.  As an exercise you might want to collect the stats on the
> > number of struct file instances on the system vs. the number of files
> > that happen to be ELF objects and are currently mmapped anywhere.

That's a good suggestion. I wrote a simple script that uses the drgn
tool ([0]), which enables nice introspection of the state of kernel
memory for the running kernel. The script is at the bottom ([1]) for
anyone to sanity check. I didn't try to figure out which files are
mmapped as executable and which aren't, so let's take the worst case
and assume that none of the files are executable, and thus that the
8-byte pointer is a waste for all of them.

On my devserver I got:

task_cnt=15984 uniq_file_cnt=56780

On a randomly chosen production host I got:

task_cnt=6387 uniq_file_cnt=22514

So it seems like my devserver is "busier" than the production host.
:)

The numbers above suggest that my devserver's kernel has about 57000
*unique* `struct file *` instances. At 8 bytes each, that's about 450KB
of overhead. That's not much by any modern standard.

But let's say I'm way off, and we have 1 million struct files. That's
8MB of overhead. I'd argue that those 8MB are not a big deal even on a
normal laptop, even less so on production servers. Especially so if you
have 1 million active struct file instances in the system, as way more
memory will be used for application-specific needs.

> > That does depend upon the load, obviously, but it's not hard to collect -
> > you already have more than enough hooks inserted in the relevant places.
> > That might give a better appreciation of the reactions...
>
> One possibility would be a bit stolen from inode flags + hash keyed by
> struct inode address (middle bits make for a decent hash function);
> inode eviction would check that bit and kick the corresponding thing
> from hash if the bit is set.
>
> Associating that thing with inode => hash lookup/insert + set the bit.

This is an interesting idea, but now we are running into a few
unnecessary problems. We need to have a global dynamically sized hash
map in the system. If we fix the number of buckets, we risk either
wasting memory on an underutilized system (if we oversize), or
performance problems due to collisions (if we undersize) on a busy
system with lots of executables mapped in memory. If we don't pre-size,
then we are talking about reallocations, rehashing, and doing that
under a global lock or something like that. Further, we'd have to take
locks on buckets, which causes further problems for looking up build ID
from this hashmap in NMI context for perf events and BPF programs, as
locks can't be safely taken under those conditions, and thus fetching
build ID would still be unreliable (though less so than it is today, of
course).
All of this is solvable to some degree (but not perfectly and not with
simple and elegant approaches), but it seems like an unnecessary
overcomplication compared to the amount of memory that we hope to save.

It still feels like a Kconfig-guarded 8 byte field per struct file is
a reasonable price for gaining reliable build ID information for
profiling/tracing tools.

  [0] https://drgn.readthedocs.io/en/latest/index.html

  [1] Script I used:

from drgn.helpers.linux.pid import for_each_task
from drgn.helpers.linux.fs import for_each_file

# `prog` is the global program object provided by the drgn CLI.
task_cnt = 0
file_set = set()  # addresses of unique struct file instances

for task in for_each_task(prog):
    task_cnt += 1
    try:
        for (fd, file) in for_each_file(task):
            # The same struct file can be shared by several tasks/fds,
            # so dedup by pointer value.
            file_set.add(file.value_())
    except Exception:
        # Some tasks (e.g. ones without a files table) can't be walked;
        # skip them.
        pass

uniq_file_cnt = len(file_set)
print(f"task_cnt={task_cnt} uniq_file_cnt={uniq_file_cnt}")
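
For anyone who wants to redo the overhead arithmetic with their own
counts, here is a minimal sketch in plain Python (no drgn needed). It
assumes a 64-bit kernel, so one extra pointer costs 8 bytes, and it
takes the worst case where the pointer is wasted in every struct file.
The helper name overhead_bytes() and the "hypothetical" entry are made
up for illustration; the two real counts are the ones reported above.

```python
# Back-of-the-envelope cost of one extra pointer per struct file.
# Assumption: 64-bit kernel, so a pointer is 8 bytes.
POINTER_SIZE = 8


def overhead_bytes(file_cnt, ptr_size=POINTER_SIZE):
    """Worst case: every struct file carries an unused build ID pointer."""
    return file_cnt * ptr_size


# uniq_file_cnt values from the two hosts above, plus the deliberately
# pessimistic 1 million struct files case.
counts = {
    "devserver": 56780,
    "production host": 22514,
    "hypothetical 1M struct files": 1_000_000,
}

for name, cnt in counts.items():
    total = overhead_bytes(cnt)
    print(f"{name}: {cnt} files -> {total} bytes (~{total / 1000:.0f} KB)")
```

Plugging in a uniq_file_cnt from your own system gives the
corresponding worst-case number for that machine.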