References: <20230907204256.3700336-1-gpiccoli@igalia.com> <202310091034.4F58841@keescook>
In-Reply-To: <202310091034.4F58841@keescook>
From: Ryan Houdek
Date: Wed, 11 Oct 2023 16:53:43 -0700
Subject: Re: [RFC PATCH 0/2] Introduce a way to expose the interpreted file with binfmt_misc
To: Kees Cook
Cc: "Guilherme G. Piccoli", David Hildenbrand, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, kernel-dev@igalia.com, kernel@gpiccoli.net, ebiederm@xmission.com, oleg@redhat.com, yzaikin@google.com, mcgrof@kernel.org, akpm@linux-foundation.org, brauner@kernel.org, viro@zeniv.linux.org.uk, willy@infradead.org, dave@stgolabs.net, joshua@froggi.es
Content-Type: text/plain; charset="UTF-8"

On Mon, Oct 9, 2023 at 10:37 AM Kees Cook wrote:
>
> On Fri, Oct 06, 2023 at 02:07:16PM +0200, David Hildenbrand wrote:
> > On 07.09.23 22:24, Guilherme G. Piccoli wrote:
> > > Currently the kernel provides a symlink to the executable binary, in the
> > > form of procfs file exe_file (/proc/self/exe_file for example). But what
> > > happens in interpreted scenarios (like binfmt_misc) is that such link
> > > always points to the *interpreter*. For cases of Linux binary emulators,
> > > like FEX [0] for example, it's then necessary to somehow mask that and
> > > emulate the true binary path.
> >
> > I'm absolutely no expert on that, but I'm wondering if, instead of modifying
> > exe_file and adding an interpreter file, you'd want to leave exe_file alone
> > and instead provide an easier way to obtain the interpreted file.
> >
> > Can you maybe describe why modifying exe_file is desired (about which
> > consumers are we worrying? ) and what exactly FEX does to handle that (how
> > does it mask that?).
> >
> > So a bit more background on the challenges without this change would be
> > appreciated.
>
> Yeah, it sounds like you're dealing with a process that examines
> /proc/self/exe_file for itself only to find the binfmt_misc interpreter
> when it was run via binfmt_misc?
>
> What actually breaks? Or rather, why does the process to examine
> exe_file? I'm just trying to see if there are other solutions here that
> would avoid creating an ambiguous interface...
>
> --
> Kees Cook

Hey there, FEX-Emu developer here. I can try to explain some of the issues.

First, we should set the stage: there is a fundamental discrepancy between
how ELF interpreters and binfmt_misc interpreters are represented in procfs
exe. An ELF file today can be either static or dynamic, with dynamic ELF
files having a program header called PT_INTERP that tells the kernel where
the interpreter executable lives. In an x86-64 environment this is likely to
be something like /lib64/ld-linux-x86-64.so.2. Today, the kernel doesn't put
the PT_INTERP interpreter into procfs exe; it instead uses the dynamic ELF
that was originally launched.

In contrast, a file launched through execve via binfmt_misc may or may not
have ELF headers at all; it is left up to the binfmt_misc handler to do
whatever it needs. The kernel sets procfs exe to the binfmt_misc interpreter
instead of the executable. This contrasting behaviour is fundamentally what
is being improved here. It seems like this behaviour is an oversight of the
original binfmt_misc implementation rather than a deliberate attempt to
ensure there is a difference. The interface is already ambiguous, in that it
changes when an executable is run through binfmt_misc.

Some simple ways applications break:
- Applications like Chrome tend to relaunch themselves through execve with
  `/proc/self/exe`
  - Chrome does this. I think Flatpak or AppImage applications do this?
  - There are definitely more that I have noticed doing this.
- In the cover letter there was a link to Mesa, the OSS OpenGL/Vulkan
  drivers, using this
  - This library uses this interface to find out what application is
    running so it can apply workarounds for application bugs. Plenty of
    historical applications use the API badly or incorrectly and need
    specific driver workarounds.
- Some applications may use this path to open their own executable and
  then mmap it back in for tricky memory mirroring or dynamic linking of
  themselves.
  - Saw some old abandoned emulator software doing this.

There are likely more uses of this interface that I haven't noticed.

Onward to what FEX-Emu is and how it tries to work around the issue with a
fairly naive hack.

FEX-Emu is an x86 and x86-64 CPU emulator that gets installed as a
binfmt_misc interpreter. It then executes x86 and x86-64 ELF files on an
Arm64 device, effectively in a multi-arch capable fashion. It's lightweight
in that all application processes and threads are just regular Arm64
processes and threads. This is similar to how qemu-user operates.

When processing system calls, FEX intercepts any call that consumes a
pathname and inspects that pathname. If it is one of the ways procfs exe
can be accessed, FEX redirects it to the true x86/x86-64 executable. This
is an attempt to behave as if the ELF had been executed without a
binfmt_misc handler.

Pathnames captured in FEX-Emu today:
- /proc/self/exe
- /proc//exe
- /proc/thread-self/exe

This is very fragile and doesn't cover the full range of ways applications
could access procfs. Applications could end up using the *at variants of
syscalls with an FD that has /proc/self/ open. They could do simple tricks
like `/proc/self/../self/exe` and side-step this check. It's a game of
whack-a-mole with escalating overhead to try to close the gap, purely due
to what appears to be an oversight in how binfmt_misc and PT_INTERP are
handled.
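
To make the `/proc/self/../self/exe` side-step concrete, here is a minimal
Python sketch. It is not FEX-Emu's actual code: the names `INTERCEPTED`,
`naive_is_exe_path` and `normalized_is_exe_path` are made up for
illustration, and the per-PID form is left out for brevity. It only shows
why an exact string comparison misses trivially rewritten paths.

```python
import posixpath

# Hypothetical subset of the pathnames an emulator might intercept.
# Assumption for illustration only; not taken from FEX-Emu's source.
INTERCEPTED = {"/proc/self/exe", "/proc/thread-self/exe"}


def naive_is_exe_path(pathname: str) -> bool:
    """Exact string comparison: the fragile approach."""
    return pathname in INTERCEPTED


def normalized_is_exe_path(pathname: str) -> bool:
    """Lexically collapse '.' and '..' components before comparing."""
    return posixpath.normpath(pathname) in INTERCEPTED


if __name__ == "__main__":
    for candidate in ("/proc/self/exe",
                      "/proc/self/../self/exe",
                      "/proc/self/./exe"):
        print(candidate,
              naive_is_exe_path(candidate),
              normalized_is_exe_path(candidate))
```

Lexical normalization only closes this one gap. It gives the wrong answer
when a path component is a symlink, and it does nothing for the *at
variants used with a directory FD, so the whack-a-mole problem remains.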
Hopefully this explains why this is necessary and that reducing the differences between how PT_INTERP and binfmt_misc are represented is desired.