* [RFC PATCH 0/2] Introduce a way to expose the interpreted file with binfmt_misc
@ 2023-09-07 20:24 Guilherme G. Piccoli
2023-09-07 20:24 ` [RFC PATCH 1/2] binfmt_misc, fork, proc: Introduce flag to expose the interpreted binary in procfs Guilherme G. Piccoli
` (3 more replies)
0 siblings, 4 replies; 13+ messages in thread
From: Guilherme G. Piccoli @ 2023-09-07 20:24 UTC (permalink / raw)
To: linux-kernel, linux-fsdevel
Cc: linux-mm, kernel-dev, kernel, keescook, ebiederm, oleg, yzaikin,
mcgrof, akpm, brauner, viro, willy, david, dave, sonicadvance1,
joshua, Guilherme G. Piccoli
Currently the kernel provides a symlink to the executable binary, in the
form of procfs file exe_file (/proc/self/exe_file for example). But what
happens in interpreted scenarios (like binfmt_misc) is that such link
always points to the *interpreter*. For cases of Linux binary emulators,
like FEX [0] for example, it's then necessary to somehow mask that and
emulate the true binary path.
We hereby propose a way to expose such interpreted binary as exe_file if
the flag 'I' is selected on binfmt_misc. When that flag is set, the file
/proc/self/exe_file points to the *interpreted* file, be it ELF or not.
In order to allow users to distinguish if such flag is used or not without
checking the binfmt_misc filesystem, we propose also the /proc/self/interpreter
file, which always points to the *interpreter* in scenarios where
interpretation is set, like binfmt_misc. This file is empty / points to
nothing in the case of regular ELF execution, though we could consider
implementing a way to point to the LD preloader if that makes sense...
This was sent as RFC because of course it's a very core change, affecting
multiple areas and there are design choices (and questions) in each patch
so we could discuss and check the best way to implement the solution as
well as the corner cases handling. This is a very useful feature for
emulators and such, like FEX and Wine, which usually need to circumvent
this kernel limitation in order to expose the true emulated file name
(more examples at [1][2][3]).
This patchset is based on the currently v6.6-rc1 candidate (Linus tree
from yesterday) and was tested under QEMU as well as using FEX.
Thanks in advance for comments, any feedback is greatly appreciated!
Cheers,
Guilherme
[0] https://github.com/FEX-Emu/FEX
[1] Using an environment variable trick to override exe_file:
https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/util/u_process.c#L209
[2] https://github.com/baldurk/renderdoc/pull/2694
[3] FEX handling of the exe_file parsing:
https://github.com/FEX-Emu/FEX/blob/main/Source/Tools/FEXLoader/LinuxSyscalls/FileManagement.cpp#L499
Guilherme G. Piccoli (2):
binfmt_misc, fork, proc: Introduce flag to expose the interpreted binary in procfs
fork, procfs: Introduce /proc/self/interpreter symlink
Documentation/admin-guide/binfmt-misc.rst | 11 ++
arch/arc/kernel/troubleshoot.c | 5 +
fs/binfmt_elf.c | 7 ++
fs/binfmt_misc.c | 11 ++
fs/coredump.c | 5 +
fs/exec.c | 26 ++++-
fs/proc/base.c | 48 +++++---
include/linux/binfmts.h | 4 +
include/linux/mm.h | 7 +-
include/linux/mm_types.h | 2 +
kernel/audit.c | 5 +
kernel/audit_watch.c | 7 +-
kernel/fork.c | 131 +++++++++++++++++-----
kernel/signal.c | 7 +-
kernel/sys.c | 5 +
kernel/taskstats.c | 7 +-
security/tomoyo/util.c | 5 +
17 files changed, 241 insertions(+), 52 deletions(-)
--
2.42.0
^ permalink raw reply [flat|nested] 13+ messages in thread* [RFC PATCH 1/2] binfmt_misc, fork, proc: Introduce flag to expose the interpreted binary in procfs 2023-09-07 20:24 [RFC PATCH 0/2] Introduce a way to expose the interpreted file with binfmt_misc Guilherme G. Piccoli @ 2023-09-07 20:24 ` Guilherme G. Piccoli 2023-09-07 20:24 ` [RFC PATCH 2/2] fork, procfs: Introduce /proc/self/interpreter symlink Guilherme G. Piccoli ` (2 subsequent siblings) 3 siblings, 0 replies; 13+ messages in thread From: Guilherme G. Piccoli @ 2023-09-07 20:24 UTC (permalink / raw) To: linux-kernel, linux-fsdevel Cc: linux-mm, kernel-dev, kernel, keescook, ebiederm, oleg, yzaikin, mcgrof, akpm, brauner, viro, willy, david, dave, sonicadvance1, joshua, Guilherme G. Piccoli The procfs symlink /proc/self/exe_file points to the executable binary of the running process/thread. What happens though for binfmt_misc interpreted cases is that it effectively points to the *interpreter*, which is valid as the interpreter is in fact the one binary running. But there are cases in which this is considered a limitation - see for example the case of Linux architecture emulators, like FEX [0]. A binary running under such emulator could check its own symlink, and that'd be invalid, pointing instead to the emulator (its interpreter). This adds overhead to the emulation process, that must trap accesses to such symlink to guarantee it is exposed properly to the emulated binary. Add hereby the flag 'I' to binfmt_misc to allow override this default behavior of mapping the interpreter as /proc/self/exe - with this flag, the *interpreted* file is exposed in procfs instead. [0] https://github.com/FEX-Emu/FEX Suggested-by: Ryan Houdek <sonicadvance1@gmail.com> Tested-by: Ryan Houdek <sonicadvance1@gmail.com> Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com> --- Some design decisions / questions: (a) The patch makes use of interpreted_file == NULL to diverge if the flag is set or not. In other words: in any case but binfmt_misc with flag 'I' set, this pointer is NULL and the real exe_file is retrieved. This way, we don't need to propagate some flag from the ends of binfmt_misc up to procfs code. Does it make sense? (b) Of course there's various places affected by this change that I'm not sure if we should care or not, i.e., if we need to somehow change the behavior as well. Examples are audit code, tomoyo, etc. Even worse is the case of users setting the exe_file through prctl; what to do in these cases? I've marked them on code using comments starting with FIXME (BINPRM_FLAGS_EXPOSE_INTERP) so we can debate. (c) Keeping interpreted_file on mm_struct *seems* to make sense to me, though I'm not really sure of the impact of this new member there, for cache locality or anything else I'm not seeing. An alternative would be a small struct exec_files{} that contains both exe_file and interpreted_file, if that's somehow better... (d) Naming: both "interpreted_file" and the flag "I" were simple choices and subject to change to any community suggestion, I'm not attached to them in any way. Probably there's more implicit design decisions here and any feedback is greatly appreciated! Thanks in advance, Guilherme Documentation/admin-guide/binfmt-misc.rst | 11 +++ arch/arc/kernel/troubleshoot.c | 5 ++ fs/binfmt_elf.c | 7 ++ fs/binfmt_misc.c | 11 +++ fs/coredump.c | 5 ++ fs/exec.c | 18 +++- fs/proc/base.c | 2 +- include/linux/binfmts.h | 3 + include/linux/mm.h | 6 +- include/linux/mm_types.h | 1 + kernel/audit.c | 5 ++ kernel/audit_watch.c | 7 +- kernel/fork.c | 105 ++++++++++++++++------ kernel/signal.c | 7 +- kernel/sys.c | 5 ++ kernel/taskstats.c | 7 +- security/tomoyo/util.c | 5 ++ 17 files changed, 173 insertions(+), 37 deletions(-) diff --git a/Documentation/admin-guide/binfmt-misc.rst b/Documentation/admin-guide/binfmt-misc.rst index 59cd902e3549..175fca8439d6 100644 --- a/Documentation/admin-guide/binfmt-misc.rst +++ b/Documentation/admin-guide/binfmt-misc.rst @@ -88,6 +88,17 @@ Here is what the fields mean: emulation is installed and uses the opened image to spawn the emulator, meaning it is always available once installed, regardless of how the environment changes. + ``I`` - expose the interpreted file in the /proc/self/exe symlink + By default, binfmt_misc executing binaries expose their interpreter + as the /proc/self/exe file, which makes sense given that the actual + executable running is the interpreter indeed. But there are some + cases in which we want to change that behavior - imagine an emulator + of Linux binaries (of different architecture, for example) which + needs to deal with the different behaviors when running native - the + binary's symlink (/proc/self/exe) points to the binary itself - vs + the emulated case, whereas the link points to the interpreter. This + flag allows to change the default behavior and have the proc symlink + pointing to the **interpreted** file, not the interpreter. There are some restrictions: diff --git a/arch/arc/kernel/troubleshoot.c b/arch/arc/kernel/troubleshoot.c index d5b3ed2c58f5..d078af66f07b 100644 --- a/arch/arc/kernel/troubleshoot.c +++ b/arch/arc/kernel/troubleshoot.c @@ -62,6 +62,11 @@ static void print_task_path_n_nm(struct task_struct *tsk) if (!mm) goto done; + /* + * FIXME (BINPRM_FLAGS_EXPOSE_INTERP): observe that if using binfmt_misc + * with flag 'I' set, this functionality will diverge from the + * /proc/self/exe symlink with regards of what executable is running. + */ exe_file = get_mm_exe_file(mm); mmput(mm); diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c index 7b3d2d491407..fb0c22fa3635 100644 --- a/fs/binfmt_elf.c +++ b/fs/binfmt_elf.c @@ -1162,6 +1162,13 @@ static int load_elf_binary(struct linux_binprm *bprm) } } + /* + * FIXME (BINPRM_FLAGS_EXPOSE_INTERP): here is one of our problems - bprm->file is + * elf_mapped(), whereas our saved bprm->interpreted_file isn't. Now, why not just + * map it, right? Because we're not sure how to (or if it's indeed necessary). + * What if the interpreted file is not ELF? Could be anything that its interpreter + * is able to read and execute... + */ error = elf_map(bprm->file, load_bias + vaddr, elf_ppnt, elf_prot, elf_flags, total_size); if (BAD_ADDR(error)) { diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c index e0108d17b085..36350c3d73f5 100644 --- a/fs/binfmt_misc.c +++ b/fs/binfmt_misc.c @@ -48,6 +48,7 @@ enum {Enabled, Magic}; #define MISC_FMT_OPEN_BINARY (1UL << 30) #define MISC_FMT_CREDENTIALS (1UL << 29) #define MISC_FMT_OPEN_FILE (1UL << 28) +#define MISC_FMT_EXPOSE_INTERPRETED (1UL << 27) typedef struct { struct list_head list; @@ -181,6 +182,9 @@ static int load_misc_binary(struct linux_binprm *bprm) if (retval < 0) goto ret; + if (fmt->flags & MISC_FMT_EXPOSE_INTERPRETED) + bprm->interp_flags |= BINPRM_FLAGS_EXPOSE_INTERP; + if (fmt->flags & MISC_FMT_OPEN_FILE) { interp_file = file_clone_open(fmt->interp_file); if (!IS_ERR(interp_file)) @@ -258,6 +262,11 @@ static char *check_special_flags(char *sfs, Node *e) p++; e->flags |= MISC_FMT_OPEN_FILE; break; + case 'I': + pr_debug("register: flag: I: (expose interpreted binary)\n"); + p++; + e->flags |= MISC_FMT_EXPOSE_INTERPRETED; + break; default: cont = 0; } @@ -524,6 +533,8 @@ static void entry_status(Node *e, char *page) *dp++ = 'C'; if (e->flags & MISC_FMT_OPEN_FILE) *dp++ = 'F'; + if (e->flags & MISC_FMT_EXPOSE_INTERPRETED) + *dp++ = 'I'; *dp++ = '\n'; if (!test_bit(Magic, &e->flags)) { diff --git a/fs/coredump.c b/fs/coredump.c index 9d235fa14ab9..1a771c7cba67 100644 --- a/fs/coredump.c +++ b/fs/coredump.c @@ -164,6 +164,11 @@ static int cn_print_exe_file(struct core_name *cn, bool name_only) char *pathbuf, *path, *ptr; int ret; + /* + * FIXME (BINPRM_FLAGS_EXPOSE_INTERP): observe that if using binfmt_misc + * with flag 'I' set, this coredump functionality will diverge from the + * /proc/self/exe symlink with regards of what executable is running. + */ exe_file = get_mm_exe_file(current->mm); if (!exe_file) return cn_esc_printf(cn, "%s (path unknown)", current->comm); diff --git a/fs/exec.c b/fs/exec.c index 6518e33ea813..bb1574f37b67 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1280,7 +1280,8 @@ int begin_new_exec(struct linux_binprm * bprm) * not visible until then. Doing it here also ensures * we don't race against replace_mm_exe_file(). */ - retval = set_mm_exe_file(bprm->mm, bprm->file); + retval = set_mm_exe_file(bprm->mm, bprm->file, + bprm->interpreted_file); if (retval) goto out; @@ -1405,6 +1406,13 @@ int begin_new_exec(struct linux_binprm * bprm) fd_install(retval, bprm->executable); bprm->executable = NULL; bprm->execfd = retval; + /* + * Since bprm->interpreted_file points to bprm->executable and + * fd_install() consumes its refcount, we need to bump the refcount + * here to avoid warnings as "file count is 0" on kernel log. + */ + if (unlikely(bprm->interp_flags & BINPRM_FLAGS_EXPOSE_INTERP)) + get_file(bprm->interpreted_file); } return 0; @@ -1500,6 +1508,8 @@ static void free_bprm(struct linux_binprm *bprm) allow_write_access(bprm->file); fput(bprm->file); } + if (bprm->interpreted_file) + fput(bprm->interpreted_file); if (bprm->executable) fput(bprm->executable); /* If a binfmt changed the interp, free it. */ @@ -1789,6 +1799,9 @@ static int exec_binprm(struct linux_binprm *bprm) bprm->interpreter = NULL; allow_write_access(exec); + if (unlikely(bprm->interp_flags & BINPRM_FLAGS_EXPOSE_INTERP)) + bprm->interpreted_file = exec; + if (unlikely(bprm->have_execfd)) { if (bprm->executable) { fput(exec); @@ -1796,7 +1809,8 @@ static int exec_binprm(struct linux_binprm *bprm) } bprm->executable = exec; } else - fput(exec); + if (!(bprm->interp_flags & BINPRM_FLAGS_EXPOSE_INTERP)) + fput(exec); } audit_bprm(bprm); diff --git a/fs/proc/base.c b/fs/proc/base.c index ffd54617c354..a13fbfc46997 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -1727,7 +1727,7 @@ static int proc_exe_link(struct dentry *dentry, struct path *exe_path) task = get_proc_task(d_inode(dentry)); if (!task) return -ENOENT; - exe_file = get_task_exe_file(task); + exe_file = get_task_exe_file(task, true); put_task_struct(task); if (exe_file) { *exe_path = exe_file->f_path; diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h index 8d51f69f9f5e..5dde52de7877 100644 --- a/include/linux/binfmts.h +++ b/include/linux/binfmts.h @@ -45,6 +45,7 @@ struct linux_binprm { point_of_no_return:1; struct file *executable; /* Executable to pass to the interpreter */ struct file *interpreter; + struct file *interpreted_file; /* only for binfmt_misc with flag I */ struct file *file; struct cred *cred; /* new credentials */ int unsafe; /* how unsafe this exec is (mask of LSM_UNSAFE_*) */ @@ -75,6 +76,8 @@ struct linux_binprm { #define BINPRM_FLAGS_PRESERVE_ARGV0_BIT 3 #define BINPRM_FLAGS_PRESERVE_ARGV0 (1 << BINPRM_FLAGS_PRESERVE_ARGV0_BIT) +#define BINPRM_FLAGS_EXPOSE_INTERP_BIT 4 +#define BINPRM_FLAGS_EXPOSE_INTERP (1 << BINPRM_FLAGS_EXPOSE_INTERP_BIT) /* * This structure defines the functions that are used to load the binary formats that * linux accepts. diff --git a/include/linux/mm.h b/include/linux/mm.h index bf5d0b1b16f4..a00b32906604 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -3255,10 +3255,12 @@ static inline int check_data_rlimit(unsigned long rlim, extern int mm_take_all_locks(struct mm_struct *mm); extern void mm_drop_all_locks(struct mm_struct *mm); -extern int set_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file); +extern int set_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file, + struct file *new_interpreted_file); extern int replace_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file); extern struct file *get_mm_exe_file(struct mm_struct *mm); -extern struct file *get_task_exe_file(struct task_struct *task); +extern struct file *get_task_exe_file(struct task_struct *task, + bool prefer_interpreted); extern bool may_expand_vm(struct mm_struct *, vm_flags_t, unsigned long npages); extern void vm_stat_account(struct mm_struct *, vm_flags_t, long npages); diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 36c5b43999e6..346f81875f3e 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -842,6 +842,7 @@ struct mm_struct { /* store ref to file /proc/<pid>/exe symlink points to */ struct file __rcu *exe_file; + struct file __rcu *interpreted_file; /* see binfmt_misc flag I */ #ifdef CONFIG_MMU_NOTIFIER struct mmu_notifier_subscriptions *notifier_subscriptions; #endif diff --git a/kernel/audit.c b/kernel/audit.c index 16205dd29843..83c64c376c0c 100644 --- a/kernel/audit.c +++ b/kernel/audit.c @@ -2197,6 +2197,11 @@ void audit_log_d_path_exe(struct audit_buffer *ab, if (!mm) goto out_null; + /* + * FIXME (BINPRM_FLAGS_EXPOSE_INTERP): observe that if using binfmt_misc + * with flag 'I' set, this audit functionality will diverge from the + * /proc/self/exe symlink with regards of what executable is running. + */ exe_file = get_mm_exe_file(mm); if (!exe_file) goto out_null; diff --git a/kernel/audit_watch.c b/kernel/audit_watch.c index 65075f1e4ac8..b8f947849fb2 100644 --- a/kernel/audit_watch.c +++ b/kernel/audit_watch.c @@ -527,7 +527,12 @@ int audit_exe_compare(struct task_struct *tsk, struct audit_fsnotify_mark *mark) unsigned long ino; dev_t dev; - exe_file = get_task_exe_file(tsk); + /* + * FIXME (BINPRM_FLAGS_EXPOSE_INTERP): if using the binfmt_misc flag 'I', we diverge + * here from proc_exe_link(), exposing the true exe_file (instead of the interpreted + * binary as proc). Should we expose here the same exe_file as proc's one *always*? + */ + exe_file = get_task_exe_file(tsk, false); if (!exe_file) return 0; ino = file_inode(exe_file)->i_ino; diff --git a/kernel/fork.c b/kernel/fork.c index 3b6d20dfb9a8..8c4824dcc433 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -628,12 +628,47 @@ void free_task(struct task_struct *tsk) } EXPORT_SYMBOL(free_task); -static void dup_mm_exe_file(struct mm_struct *mm, struct mm_struct *oldmm) +/** + * __get_mm_exe_or_interp_file - helper that acquires a reference to the mm's + * executable file, or if prefer_interp is set, go with mm->interpreted_file + * instead. + * + * Returns %NULL if mm has no associated executable/interpreted file. + * User must release file via fput(). + */ +static inline struct file *__get_mm_exe_or_interp_file(struct mm_struct *mm, + bool prefer_interp) { struct file *exe_file; + rcu_read_lock(); + + if (unlikely(prefer_interp)) + exe_file = rcu_dereference(mm->interpreted_file); + else + exe_file = rcu_dereference(mm->exe_file); + + if (exe_file && !get_file_rcu(exe_file)) + exe_file = NULL; + rcu_read_unlock(); + return exe_file; +} + +struct file *get_mm_exe_file(struct mm_struct *mm) +{ + return __get_mm_exe_or_interp_file(mm, false); +} + +static void dup_mm_exe_file(struct mm_struct *mm, struct mm_struct *oldmm) +{ + struct file *exe_file, *interp_file; + exe_file = get_mm_exe_file(oldmm); RCU_INIT_POINTER(mm->exe_file, exe_file); + + interp_file = __get_mm_exe_or_interp_file(oldmm, true); + RCU_INIT_POINTER(mm->interpreted_file, interp_file); + /* * We depend on the oldmm having properly denied write access to the * exe_file already. @@ -1279,6 +1314,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p, mm_init_owner(mm, p); mm_pasid_init(mm); RCU_INIT_POINTER(mm->exe_file, NULL); + RCU_INIT_POINTER(mm->interpreted_file, NULL); mmu_notifier_subscriptions_init(mm); init_tlb_flush_pending(mm); #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS @@ -1348,7 +1384,7 @@ static inline void __mmput(struct mm_struct *mm) khugepaged_exit(mm); /* must run before exit_mmap */ exit_mmap(mm); mm_put_huge_zero_page(mm); - set_mm_exe_file(mm, NULL); + set_mm_exe_file(mm, NULL, NULL); if (!list_empty(&mm->mmlist)) { spin_lock(&mmlist_lock); list_del(&mm->mmlist); @@ -1394,7 +1430,9 @@ EXPORT_SYMBOL_GPL(mmput_async); /** * set_mm_exe_file - change a reference to the mm's executable file * - * This changes mm's executable file (shown as symlink /proc/[pid]/exe). + * This changes mm's executable file (shown as symlink /proc/[pid]/exe), + * and if new_interpreted_file != NULL, also sets this field (check the + * binfmt_misc documentation, flag 'I', for details about this). * * Main users are mmput() and sys_execve(). Callers prevent concurrent * invocations: in mmput() nobody alive left, in execve it happens before @@ -1402,16 +1440,19 @@ EXPORT_SYMBOL_GPL(mmput_async); * * Can only fail if new_exe_file != NULL. */ -int set_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file) +int set_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file, + struct file *new_interpreted_file) { struct file *old_exe_file; + struct file *old_interpreted_file; /* - * It is safe to dereference the exe_file without RCU as - * this function is only called if nobody else can access - * this mm -- see comment above for justification. + * It is safe to dereference exe_file / interpreted_file + * without RCU as this function is only called if nobody else + * can access this mm -- see comment above for justification. */ old_exe_file = rcu_dereference_raw(mm->exe_file); + old_interpreted_file = rcu_dereference_raw(mm->interpreted_file); if (new_exe_file) { /* @@ -1423,10 +1464,20 @@ int set_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file) get_file(new_exe_file); } rcu_assign_pointer(mm->exe_file, new_exe_file); + + /* For this one we don't care about write access... */ + if (new_interpreted_file) + get_file(new_interpreted_file); + rcu_assign_pointer(mm->interpreted_file, new_interpreted_file); + if (old_exe_file) { allow_write_access(old_exe_file); fput(old_exe_file); } + + if (old_interpreted_file) + fput(old_interpreted_file); + return 0; } @@ -1436,6 +1487,12 @@ int set_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file) * This changes mm's executable file (shown as symlink /proc/[pid]/exe). * * Main user is sys_prctl(PR_SET_MM_MAP/EXE_FILE). + * + * FIXME (BINPRM_FLAGS_EXPOSE_INTERP): imagine user performs the sys_prctl() + * aiming to change /proc/self/exe symlink - suppose user is interested + * in the executable path itself. With binfmt_misc flag 'I', this change + * **won't reflect** since procfs make use of interpreted_file. What to do + * in this case? Do we care? */ int replace_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file) { @@ -1482,31 +1539,15 @@ int replace_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file) } /** - * get_mm_exe_file - acquire a reference to the mm's executable file - * - * Returns %NULL if mm has no associated executable file. - * User must release file via fput(). - */ -struct file *get_mm_exe_file(struct mm_struct *mm) -{ - struct file *exe_file; - - rcu_read_lock(); - exe_file = rcu_dereference(mm->exe_file); - if (exe_file && !get_file_rcu(exe_file)) - exe_file = NULL; - rcu_read_unlock(); - return exe_file; -} - -/** - * get_task_exe_file - acquire a reference to the task's executable file + * get_task_exe_file - acquire a reference to the task's executable or + * interpreted file (only for procfs, when under binfmt_misc with flag 'I'). * * Returns %NULL if task's mm (if any) has no associated executable file or * this is a kernel thread with borrowed mm (see the comment above get_task_mm). * User must release file via fput(). */ -struct file *get_task_exe_file(struct task_struct *task) +struct file *get_task_exe_file(struct task_struct *task, + bool prefer_interpreted) { struct file *exe_file = NULL; struct mm_struct *mm; @@ -1514,8 +1555,14 @@ struct file *get_task_exe_file(struct task_struct *task) task_lock(task); mm = task->mm; if (mm) { - if (!(task->flags & PF_KTHREAD)) - exe_file = get_mm_exe_file(mm); + if (!(task->flags & PF_KTHREAD)) { + if (unlikely(prefer_interpreted)) { + exe_file = __get_mm_exe_or_interp_file(mm, true); + if (!exe_file) + exe_file = get_mm_exe_file(mm); + } else + exe_file = get_mm_exe_file(mm); + } } task_unlock(task); return exe_file; diff --git a/kernel/signal.c b/kernel/signal.c index 09019017d669..3a8d85a65c49 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -1263,7 +1263,12 @@ static void print_fatal_signal(int signr) struct pt_regs *regs = task_pt_regs(current); struct file *exe_file; - exe_file = get_task_exe_file(current); + /* + * FIXME (BINPRM_FLAGS_EXPOSE_INTERP): if using the binfmt_misc flag 'I', we diverge + * here from proc_exe_link(), exposing the true exe_file (instead of the interpreted + * binary as proc). Should we expose here the same exe_file as proc's one *always*? + */ + exe_file = get_task_exe_file(current, false); if (exe_file) { pr_info("%pD: %s: potentially unexpected fatal signal %d.\n", exe_file, current->comm, signr); diff --git a/kernel/sys.c b/kernel/sys.c index 2410e3999ebe..17fab2f71443 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -1912,6 +1912,11 @@ static int prctl_set_mm_exe_file(struct mm_struct *mm, unsigned int fd) if (err) goto exit; + /* + * FIXME (BINPRM_FLAGS_EXPOSE_INTERP): please read the comment + * on replace_mm_exe_file() to ponder about the divergence when + * using binfmt_misc with flag 'I'. + */ err = replace_mm_exe_file(mm, exe.file); exit: fdput(exe); diff --git a/kernel/taskstats.c b/kernel/taskstats.c index 8ce3fa0c19e2..a5d5afc1919a 100644 --- a/kernel/taskstats.c +++ b/kernel/taskstats.c @@ -157,7 +157,12 @@ static void send_cpu_listeners(struct sk_buff *skb, static void exe_add_tsk(struct taskstats *stats, struct task_struct *tsk) { /* No idea if I'm allowed to access that here, now. */ - struct file *exe_file = get_task_exe_file(tsk); + struct file *exe_file = get_task_exe_file(tsk, false); + /* + * FIXME (BINPRM_FLAGS_EXPOSE_INTERP): if using the binfmt_misc flag 'I', we diverge + * here from proc_exe_link(), exposing the true exe_file (instead of the interpreted + * binary as proc). Should we expose here the same exe_file as proc's one *always*? + */ if (exe_file) { /* Following cp_new_stat64() in stat.c . */ diff --git a/security/tomoyo/util.c b/security/tomoyo/util.c index 6799b1122c9d..844bdbd27240 100644 --- a/security/tomoyo/util.c +++ b/security/tomoyo/util.c @@ -971,6 +971,11 @@ const char *tomoyo_get_exe(void) if (!mm) return NULL; + /* + * FIXME (BINPRM_FLAGS_EXPOSE_INTERP): observe that if using binfmt_misc + * with flag 'I' set, this tomoyo functionality will diverge from the + * /proc/self/exe symlink with regards of what executable is running. + */ exe_file = get_mm_exe_file(mm); if (!exe_file) return NULL; -- 2.42.0 ^ permalink raw reply [flat|nested] 13+ messages in thread
* [RFC PATCH 2/2] fork, procfs: Introduce /proc/self/interpreter symlink 2023-09-07 20:24 [RFC PATCH 0/2] Introduce a way to expose the interpreted file with binfmt_misc Guilherme G. Piccoli 2023-09-07 20:24 ` [RFC PATCH 1/2] binfmt_misc, fork, proc: Introduce flag to expose the interpreted binary in procfs Guilherme G. Piccoli @ 2023-09-07 20:24 ` Guilherme G. Piccoli 2023-10-06 7:51 ` [RFC PATCH 0/2] Introduce a way to expose the interpreted file with binfmt_misc Guilherme G. Piccoli 2023-10-06 12:07 ` David Hildenbrand 3 siblings, 0 replies; 13+ messages in thread From: Guilherme G. Piccoli @ 2023-09-07 20:24 UTC (permalink / raw) To: linux-kernel, linux-fsdevel Cc: linux-mm, kernel-dev, kernel, keescook, ebiederm, oleg, yzaikin, mcgrof, akpm, brauner, viro, willy, david, dave, sonicadvance1, joshua, Guilherme G. Piccoli Currently we have the /proc/self/exe_file symlink, that points to the executable binary of the running process - or to the *interpreted* file if flag 'I' is set on binfmt_misc. In this second case, we then lose the ability of having a symlink to the interpreter. Introduce hereby the /proc/self/interpreter symlink which always points to the *interpreter* when we're under interpretation, like on binfmt_misc use. We don't require to have a new file pointer since mm_struct contains exe_file, which points to such interpreter. Suggested-by: Ryan Houdek <sonicadvance1@gmail.com> Tested-by: Ryan Houdek <sonicadvance1@gmail.com> Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com> --- Design choices / questions: (a) The file /proc/self/interpreter is always present, and in most cases, points to nothing (since we're running an ELF binary, with no interpreter!). Is it OK? The implementation follows the pattern of /proc/self/exe_file, returns -ENOENT when we have no interpreter. I'm not sure if there's a way to implement this as a procfs file that is not always visible, i.e., it would only show up if such interpreter exists. If that would be possible, is it better than the current implementation though? Also, we could somehow make /proc/self/interpreter points to the LD preloader for ELFs, if that's considered a better approach. (b) I like xmacros to avoid repeating code, but I'm totally fine changing that as well as naming of stuff. (c) Should we extend functionality present on exe_file to the interpreter, like prctl replacing mechanism, audit info collection, etc? Any feedback here is greatly appreciated - thanks in advance! Cheers, Guilherme fs/exec.c | 8 +++++++ fs/proc/base.c | 48 ++++++++++++++++++++++++++-------------- include/linux/binfmts.h | 1 + include/linux/mm.h | 1 + include/linux/mm_types.h | 1 + kernel/fork.c | 26 ++++++++++++++++++++++ 6 files changed, 69 insertions(+), 16 deletions(-) diff --git a/fs/exec.c b/fs/exec.c index bb1574f37b67..39f9c86d5ebc 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1285,6 +1285,13 @@ int begin_new_exec(struct linux_binprm * bprm) if (retval) goto out; + /* + * In case we're in an interpreted scenario (like binfmt_misc, + * for example), flag it on mm_struct in order to expose such + * interpreter as the symlink /proc/self/interpreter. + */ + bprm->mm->has_interpreter = bprm->has_interpreter; + /* If the binary is not readable then enforce mm->dumpable=0 */ would_dump(bprm, bprm->file); if (bprm->have_execfd) @@ -1796,6 +1803,7 @@ static int exec_binprm(struct linux_binprm *bprm) exec = bprm->file; bprm->file = bprm->interpreter; + bprm->has_interpreter = true; bprm->interpreter = NULL; allow_write_access(exec); diff --git a/fs/proc/base.c b/fs/proc/base.c index a13fbfc46997..ecd5ea05acb0 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -1719,25 +1719,39 @@ static const struct file_operations proc_pid_set_comm_operations = { .release = single_release, }; -static int proc_exe_link(struct dentry *dentry, struct path *exe_path) +static struct file *proc_get_task_exe_file(struct task_struct *task) { - struct task_struct *task; - struct file *exe_file; - - task = get_proc_task(d_inode(dentry)); - if (!task) - return -ENOENT; - exe_file = get_task_exe_file(task, true); - put_task_struct(task); - if (exe_file) { - *exe_path = exe_file->f_path; - path_get(&exe_file->f_path); - fput(exe_file); - return 0; - } else - return -ENOENT; + return get_task_exe_file(task, true); } +static struct file *proc_get_task_interpreter_file(struct task_struct *task) +{ + return get_task_interpreter_file(task); +} + +/* Definition of proc_exe_link and proc_interpreter_link. */ +#define PROC_GET_LINK_FUNC(type) \ +static int proc_##type##_link(struct dentry *dentry, struct path *path) \ +{ \ + struct task_struct *task; \ + struct file *file; \ + \ + task = get_proc_task(d_inode(dentry)); \ + if (!task) \ + return -ENOENT; \ + file = proc_get_task_##type##_file(task); \ + put_task_struct(task); \ + if (file) { \ + *path = file->f_path; \ + path_get(&file->f_path); \ + fput(file); \ + return 0; \ + } else \ + return -ENOENT; \ +} +PROC_GET_LINK_FUNC(exe); +PROC_GET_LINK_FUNC(interpreter); + static const char *proc_pid_get_link(struct dentry *dentry, struct inode *inode, struct delayed_call *done) @@ -3276,6 +3290,7 @@ static const struct pid_entry tgid_base_stuff[] = { LNK("cwd", proc_cwd_link), LNK("root", proc_root_link), LNK("exe", proc_exe_link), + LNK("interpreter", proc_interpreter_link), REG("mounts", S_IRUGO, proc_mounts_operations), REG("mountinfo", S_IRUGO, proc_mountinfo_operations), REG("mountstats", S_IRUSR, proc_mountstats_operations), @@ -3626,6 +3641,7 @@ static const struct pid_entry tid_base_stuff[] = { LNK("cwd", proc_cwd_link), LNK("root", proc_root_link), LNK("exe", proc_exe_link), + LNK("interpreter", proc_interpreter_link), REG("mounts", S_IRUGO, proc_mounts_operations), REG("mountinfo", S_IRUGO, proc_mountinfo_operations), #ifdef CONFIG_PROC_PAGE_MONITOR diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h index 5dde52de7877..2362c6bc6ead 100644 --- a/include/linux/binfmts.h +++ b/include/linux/binfmts.h @@ -46,6 +46,7 @@ struct linux_binprm { struct file *executable; /* Executable to pass to the interpreter */ struct file *interpreter; struct file *interpreted_file; /* only for binfmt_misc with flag I */ + bool has_interpreter; /* In order to expose /proc/self/interpreter */ struct file *file; struct cred *cred; /* new credentials */ int unsafe; /* how unsafe this exec is (mask of LSM_UNSAFE_*) */ diff --git a/include/linux/mm.h b/include/linux/mm.h index a00b32906604..e06b703db494 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -3261,6 +3261,7 @@ extern int replace_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file); extern struct file *get_mm_exe_file(struct mm_struct *mm); extern struct file *get_task_exe_file(struct task_struct *task, bool prefer_interpreted); +extern struct file *get_task_interpreter_file(struct task_struct *task); extern bool may_expand_vm(struct mm_struct *, vm_flags_t, unsigned long npages); extern void vm_stat_account(struct mm_struct *, vm_flags_t, long npages); diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 346f81875f3e..19a73c41991c 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -843,6 +843,7 @@ struct mm_struct { /* store ref to file /proc/<pid>/exe symlink points to */ struct file __rcu *exe_file; struct file __rcu *interpreted_file; /* see binfmt_misc flag I */ + bool has_interpreter; /* exposes (or not) /proc/self/interpreter */ #ifdef CONFIG_MMU_NOTIFIER struct mmu_notifier_subscriptions *notifier_subscriptions; #endif diff --git a/kernel/fork.c b/kernel/fork.c index 8c4824dcc433..5cb542f92d5e 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1568,6 +1568,32 @@ struct file *get_task_exe_file(struct task_struct *task, return exe_file; } +/** + * get_task_interpreter_file - acquire a reference to the task's *interpreter* + * executable (which is in fact the exe_file on mm_struct!). This is used in + * order to expose /proc/self/interpreter, if we're under an interpreted + * scenario (like binfmt_misc). + * + * Returns %NULL if exe_file is not an interpreter (i.e., it is the truly + * running binary indeed). + */ +struct file *get_task_interpreter_file(struct task_struct *task) +{ + struct file *interpreter_file = NULL; + struct mm_struct *mm; + + task_lock(task); + + mm = task->mm; + if (mm && mm->has_interpreter) { + if (!(task->flags & PF_KTHREAD)) + interpreter_file = get_mm_exe_file(mm); + } + + task_unlock(task); + return interpreter_file; +} + /** * get_task_mm - acquire a reference to the task's mm * -- 2.42.0 ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [RFC PATCH 0/2] Introduce a way to expose the interpreted file with binfmt_misc 2023-09-07 20:24 [RFC PATCH 0/2] Introduce a way to expose the interpreted file with binfmt_misc Guilherme G. Piccoli 2023-09-07 20:24 ` [RFC PATCH 1/2] binfmt_misc, fork, proc: Introduce flag to expose the interpreted binary in procfs Guilherme G. Piccoli 2023-09-07 20:24 ` [RFC PATCH 2/2] fork, procfs: Introduce /proc/self/interpreter symlink Guilherme G. Piccoli @ 2023-10-06 7:51 ` Guilherme G. Piccoli 2023-10-06 12:07 ` David Hildenbrand 3 siblings, 0 replies; 13+ messages in thread From: Guilherme G. Piccoli @ 2023-10-06 7:51 UTC (permalink / raw) To: linux-kernel, linux-fsdevel Cc: linux-mm, kernel-dev, kernel, keescook, ebiederm, oleg, yzaikin, mcgrof, akpm, brauner, viro, willy, david, dave, sonicadvance1, joshua On 07/09/2023 22:24, Guilherme G. Piccoli wrote: > Currently the kernel provides a symlink to the executable binary, in the > form of procfs file exe_file (/proc/self/exe_file for example). But what > happens in interpreted scenarios (like binfmt_misc) is that such link > always points to the *interpreter*. For cases of Linux binary emulators, > like FEX [0] for example, it's then necessary to somehow mask that and > emulate the true binary path. > > We hereby propose a way to expose such interpreted binary as exe_file if > the flag 'I' is selected on binfmt_misc. When that flag is set, the file > /proc/self/exe_file points to the *interpreted* file, be it ELF or not. > In order to allow users to distinguish if such flag is used or not without > checking the binfmt_misc filesystem, we propose also the /proc/self/interpreter > file, which always points to the *interpreter* in scenarios where > interpretation is set, like binfmt_misc. This file is empty / points to > nothing in the case of regular ELF execution, though we could consider > implementing a way to point to the LD preloader if that makes sense... > > This was sent as RFC because of course it's a very core change, affecting > multiple areas and there are design choices (and questions) in each patch > so we could discuss and check the best way to implement the solution as > well as the corner cases handling. This is a very useful feature for > emulators and such, like FEX and Wine, which usually need to circumvent > this kernel limitation in order to expose the true emulated file name > (more examples at [1][2][3]). > > This patchset is based on the currently v6.6-rc1 candidate (Linus tree > from yesterday) and was tested under QEMU as well as using FEX. > Thanks in advance for comments, any feedback is greatly appreciated! > Cheers, > > Guilherme > > > [0] https://github.com/FEX-Emu/FEX > > [1] Using an environment variable trick to override exe_file: > https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/util/u_process.c#L209 > > [2] https://github.com/baldurk/renderdoc/pull/2694 > > [3] FEX handling of the exe_file parsing: > https://github.com/FEX-Emu/FEX/blob/main/Source/Tools/FEXLoader/LinuxSyscalls/FileManagement.cpp#L499 > > Hi folks, gentle monthly ping. Any opinions / suggestions on that? Thanks in advance, Guilherme ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [RFC PATCH 0/2] Introduce a way to expose the interpreted file with binfmt_misc 2023-09-07 20:24 [RFC PATCH 0/2] Introduce a way to expose the interpreted file with binfmt_misc Guilherme G. Piccoli ` (2 preceding siblings ...) 2023-10-06 7:51 ` [RFC PATCH 0/2] Introduce a way to expose the interpreted file with binfmt_misc Guilherme G. Piccoli @ 2023-10-06 12:07 ` David Hildenbrand 2023-10-09 17:37 ` Kees Cook 3 siblings, 1 reply; 13+ messages in thread From: David Hildenbrand @ 2023-10-06 12:07 UTC (permalink / raw) To: Guilherme G. Piccoli, linux-kernel, linux-fsdevel Cc: linux-mm, kernel-dev, kernel, keescook, ebiederm, oleg, yzaikin, mcgrof, akpm, brauner, viro, willy, dave, sonicadvance1, joshua On 07.09.23 22:24, Guilherme G. Piccoli wrote: > Currently the kernel provides a symlink to the executable binary, in the > form of procfs file exe_file (/proc/self/exe_file for example). But what > happens in interpreted scenarios (like binfmt_misc) is that such link > always points to the *interpreter*. For cases of Linux binary emulators, > like FEX [0] for example, it's then necessary to somehow mask that and > emulate the true binary path. I'm absolutely no expert on that, but I'm wondering if, instead of modifying exe_file and adding an interpreter file, you'd want to leave exe_file alone and instead provide an easier way to obtain the interpreted file. Can you maybe describe why modifying exe_file is desired (about which consumers are we worrying? ) and what exactly FEX does to handle that (how does it mask that?). So a bit more background on the challenges without this change would be appreciated. -- Cheers, David / dhildenb ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [RFC PATCH 0/2] Introduce a way to expose the interpreted file with binfmt_misc 2023-10-06 12:07 ` David Hildenbrand @ 2023-10-09 17:37 ` Kees Cook 2023-10-11 23:53 ` Ryan Houdek 2023-11-13 17:33 ` Guilherme G. Piccoli 0 siblings, 2 replies; 13+ messages in thread From: Kees Cook @ 2023-10-09 17:37 UTC (permalink / raw) To: Guilherme G. Piccoli Cc: David Hildenbrand, linux-kernel, linux-fsdevel, linux-mm, kernel-dev, kernel, ebiederm, oleg, yzaikin, mcgrof, akpm, brauner, viro, willy, dave, sonicadvance1, joshua On Fri, Oct 06, 2023 at 02:07:16PM +0200, David Hildenbrand wrote: > On 07.09.23 22:24, Guilherme G. Piccoli wrote: > > Currently the kernel provides a symlink to the executable binary, in the > > form of procfs file exe_file (/proc/self/exe_file for example). But what > > happens in interpreted scenarios (like binfmt_misc) is that such link > > always points to the *interpreter*. For cases of Linux binary emulators, > > like FEX [0] for example, it's then necessary to somehow mask that and > > emulate the true binary path. > > I'm absolutely no expert on that, but I'm wondering if, instead of modifying > exe_file and adding an interpreter file, you'd want to leave exe_file alone > and instead provide an easier way to obtain the interpreted file. > > Can you maybe describe why modifying exe_file is desired (about which > consumers are we worrying? ) and what exactly FEX does to handle that (how > does it mask that?). > > So a bit more background on the challenges without this change would be > appreciated. Yeah, it sounds like you're dealing with a process that examines /proc/self/exe_file for itself only to find the binfmt_misc interpreter when it was run via binfmt_misc? What actually breaks? Or rather, why does the process to examine exe_file? I'm just trying to see if there are other solutions here that would avoid creating an ambiguous interface... -- Kees Cook ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [RFC PATCH 0/2] Introduce a way to expose the interpreted file with binfmt_misc 2023-10-09 17:37 ` Kees Cook @ 2023-10-11 23:53 ` Ryan Houdek 2023-11-13 17:33 ` Guilherme G. Piccoli 1 sibling, 0 replies; 13+ messages in thread From: Ryan Houdek @ 2023-10-11 23:53 UTC (permalink / raw) To: Kees Cook Cc: Guilherme G. Piccoli, David Hildenbrand, linux-kernel, linux-fsdevel, linux-mm, kernel-dev, kernel, ebiederm, oleg, yzaikin, mcgrof, akpm, brauner, viro, willy, dave, joshua On Mon, Oct 9, 2023 at 10:37 AM Kees Cook <keescook@chromium.org> wrote: > > On Fri, Oct 06, 2023 at 02:07:16PM +0200, David Hildenbrand wrote: > > On 07.09.23 22:24, Guilherme G. Piccoli wrote: > > > Currently the kernel provides a symlink to the executable binary, in the > > > form of procfs file exe_file (/proc/self/exe_file for example). But what > > > happens in interpreted scenarios (like binfmt_misc) is that such link > > > always points to the *interpreter*. For cases of Linux binary emulators, > > > like FEX [0] for example, it's then necessary to somehow mask that and > > > emulate the true binary path. > > > > I'm absolutely no expert on that, but I'm wondering if, instead of modifying > > exe_file and adding an interpreter file, you'd want to leave exe_file alone > > and instead provide an easier way to obtain the interpreted file. > > > > Can you maybe describe why modifying exe_file is desired (about which > > consumers are we worrying? ) and what exactly FEX does to handle that (how > > does it mask that?). > > > > So a bit more background on the challenges without this change would be > > appreciated. > > Yeah, it sounds like you're dealing with a process that examines > /proc/self/exe_file for itself only to find the binfmt_misc interpreter > when it was run via binfmt_misc? > > What actually breaks? Or rather, why does the process to examine > exe_file? I'm just trying to see if there are other solutions here that > would avoid creating an ambiguous interface... > > -- > Kees Cook Hey there, FEX-Emu developer here. I can try and explain some of the issues. First thing is that we should set the stage here that there is a fundamental discrepancy between how ELF interpreters are represented versus binfmt_misc interpreters when it comes to procfs exe. An ELF file today can either be static or dynamic, with the dynamic ELF files having a program header called PT_INTERP which will tell the kernel where its interpreter executable lives. In an x86-64 environment this is likely to be something like /lib64/ld-linux-x86-64.so.2. Today, the Kernel doesn't put the PT_INTERP handle into procfs exe, it instead uses the dynamic ELF that was originally launched. In contrast to how this behaviour works, a binfmt_misc interpreter file getting launched through execve may or may not have ELF header sections. But it is left up to the binfmt_misc handler to do whatever it may need. The kernel sets procfs exe to the binfmt_misc interpreter instead of the executable. This is fundamentally the contrasting behaviour that is trying to be improved. It seems like the this behaviour is an oversight of the original binfmt_misc implementation rather than any sort of ambition to ensure there is a difference. It's already ambiguous that the interface changes when executing an executable through binfmt_misc. Some simple ways applications break: - Applications like chrome tend to relaunch themselves through execve with `/proc/self/exe` - Chrome does this. I think Flatpaks or AppImage applications do this? - There are definitely more that do this that I have noticed. - In the cover letter there was a link to Mesa, the OSS OpenGL/Vulkan drivers using this - This library uses this interface to find out what application is running for applying workarounds for application bugs. Plenty of historical applications that use the API badly or incorrectly and need specific driver workarounds for them. - Some applications may use this path to open their own executable path and then mmap back in for doing tricky memory mirroring or dynamic linking of themselves. - Saw some old abandoned emulator software doing this. There's likely more uses that I haven't noticed from software using this interface. Onward to what FEX-Emu is and how it tries working around the issue with a fairly naive hack. FEX-Emu is an x86 and x86-64 CPU emulator that gets installed as a binfmt_misc interpreter. It then executes x86 and x86-64 ELF files on an Arm64 device as effectively a multi-arch capable fashion. It's lightweight in that all application processes and threads are just regular Arm64 processes and threads. This is similar to how qemu-user operates. When processing system calls, FEX will intercept any call that consumes a pathname, it will then inspect that path name and if it is one of the ways it is possible to access procfs/exe then it redirects to the true x86/x86-64 executable. This is an attempt to behave like how if the ELF was executed without a binfmt_misc handler. Pathnames captured in FEX-Emu today: - /proc/self/exe - /proc/<pid>/exe - /proc/thread-self/exe This is very fragile and doesn't cover the full range of how applications could access procfs. Applications could end up using the *at variants of syscalls with an FD that has /proc/self/ open. They could do simple tricks like `/proc/self/../self/exe` and it would side-step this check. It's a game of whack-a-mole and escalating overhead to try and close the gap purely due to, what appears to be, an oversight in how binfmt_misc and PT_INTERP is handled. Hopefully this explains why this is necessary and that reducing the differences between how PT_INTERP and binfmt_misc are represented is desired. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [RFC PATCH 0/2] Introduce a way to expose the interpreted file with binfmt_misc 2023-10-09 17:37 ` Kees Cook 2023-10-11 23:53 ` Ryan Houdek @ 2023-11-13 17:33 ` Guilherme G. Piccoli 2023-11-13 18:29 ` Eric W. Biederman 1 sibling, 1 reply; 13+ messages in thread From: Guilherme G. Piccoli @ 2023-11-13 17:33 UTC (permalink / raw) To: Kees Cook, David Hildenbrand, sonicadvance1 Cc: linux-kernel, linux-fsdevel, linux-mm, kernel-dev, kernel, ebiederm, oleg, yzaikin, mcgrof, akpm, brauner, viro, willy, dave, joshua On 09/10/2023 14:37, Kees Cook wrote: > On Fri, Oct 06, 2023 at 02:07:16PM +0200, David Hildenbrand wrote: >> On 07.09.23 22:24, Guilherme G. Piccoli wrote: >>> Currently the kernel provides a symlink to the executable binary, in the >>> form of procfs file exe_file (/proc/self/exe_file for example). But what >>> happens in interpreted scenarios (like binfmt_misc) is that such link >>> always points to the *interpreter*. For cases of Linux binary emulators, >>> like FEX [0] for example, it's then necessary to somehow mask that and >>> emulate the true binary path. >> >> I'm absolutely no expert on that, but I'm wondering if, instead of modifying >> exe_file and adding an interpreter file, you'd want to leave exe_file alone >> and instead provide an easier way to obtain the interpreted file. >> >> Can you maybe describe why modifying exe_file is desired (about which >> consumers are we worrying? ) and what exactly FEX does to handle that (how >> does it mask that?). >> >> So a bit more background on the challenges without this change would be >> appreciated. > > Yeah, it sounds like you're dealing with a process that examines > /proc/self/exe_file for itself only to find the binfmt_misc interpreter > when it was run via binfmt_misc? > > What actually breaks? Or rather, why does the process to examine > exe_file? I'm just trying to see if there are other solutions here that > would avoid creating an ambiguous interface... > Thanks Kees and David! Did Ryan's thorough comment addressed your questions? Do you have any take on the TODOs? I can maybe rebase against 6.7-rc1 and resubmit , if that makes sense! But would be better having the TODOs addressed, I guess. Thanks in advance for reviews and feedback on this. Cheers, Guilherme ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [RFC PATCH 0/2] Introduce a way to expose the interpreted file with binfmt_misc 2023-11-13 17:33 ` Guilherme G. Piccoli @ 2023-11-13 18:29 ` Eric W. Biederman 2023-11-13 19:16 ` David Hildenbrand 2023-11-13 19:17 ` Guilherme G. Piccoli 0 siblings, 2 replies; 13+ messages in thread From: Eric W. Biederman @ 2023-11-13 18:29 UTC (permalink / raw) To: Guilherme G. Piccoli Cc: Kees Cook, David Hildenbrand, sonicadvance1, linux-kernel, linux-fsdevel, linux-mm, kernel-dev, kernel, oleg, yzaikin, mcgrof, akpm, brauner, viro, willy, dave, joshua "Guilherme G. Piccoli" <gpiccoli@igalia.com> writes: > On 09/10/2023 14:37, Kees Cook wrote: >> On Fri, Oct 06, 2023 at 02:07:16PM +0200, David Hildenbrand wrote: >>> On 07.09.23 22:24, Guilherme G. Piccoli wrote: >>>> Currently the kernel provides a symlink to the executable binary, in the >>>> form of procfs file exe_file (/proc/self/exe_file for example). But what >>>> happens in interpreted scenarios (like binfmt_misc) is that such link >>>> always points to the *interpreter*. For cases of Linux binary emulators, >>>> like FEX [0] for example, it's then necessary to somehow mask that and >>>> emulate the true binary path. >>> >>> I'm absolutely no expert on that, but I'm wondering if, instead of modifying >>> exe_file and adding an interpreter file, you'd want to leave exe_file alone >>> and instead provide an easier way to obtain the interpreted file. >>> >>> Can you maybe describe why modifying exe_file is desired (about which >>> consumers are we worrying? ) and what exactly FEX does to handle that (how >>> does it mask that?). >>> >>> So a bit more background on the challenges without this change would be >>> appreciated. >> >> Yeah, it sounds like you're dealing with a process that examines >> /proc/self/exe_file for itself only to find the binfmt_misc interpreter >> when it was run via binfmt_misc? >> >> What actually breaks? Or rather, why does the process to examine >> exe_file? I'm just trying to see if there are other solutions here that >> would avoid creating an ambiguous interface... >> > > Thanks Kees and David! Did Ryan's thorough comment addressed your > questions? Do you have any take on the TODOs? > > I can maybe rebase against 6.7-rc1 and resubmit , if that makes sense! > But would be better having the TODOs addressed, I guess. Currently there is a mechanism in the kernel for changing /proc/self/exe. Would that be reasonable to use in this case? It came from the checkpoint/restart work, but given that it is already implemented it seems like the path of least resistance to get your binfmt_misc that wants to look like binfmt_elf to use that mechanism. Eric ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [RFC PATCH 0/2] Introduce a way to expose the interpreted file with binfmt_misc 2023-11-13 18:29 ` Eric W. Biederman @ 2023-11-13 19:16 ` David Hildenbrand 2023-11-14 16:11 ` Eric W. Biederman 2023-11-13 19:17 ` Guilherme G. Piccoli 1 sibling, 1 reply; 13+ messages in thread From: David Hildenbrand @ 2023-11-13 19:16 UTC (permalink / raw) To: Eric W. Biederman, Guilherme G. Piccoli Cc: Kees Cook, sonicadvance1, linux-kernel, linux-fsdevel, linux-mm, kernel-dev, kernel, oleg, yzaikin, mcgrof, akpm, brauner, viro, willy, dave, joshua On 13.11.23 19:29, Eric W. Biederman wrote: > "Guilherme G. Piccoli" <gpiccoli@igalia.com> writes: > >> On 09/10/2023 14:37, Kees Cook wrote: >>> On Fri, Oct 06, 2023 at 02:07:16PM +0200, David Hildenbrand wrote: >>>> On 07.09.23 22:24, Guilherme G. Piccoli wrote: >>>>> Currently the kernel provides a symlink to the executable binary, in the >>>>> form of procfs file exe_file (/proc/self/exe_file for example). But what >>>>> happens in interpreted scenarios (like binfmt_misc) is that such link >>>>> always points to the *interpreter*. For cases of Linux binary emulators, >>>>> like FEX [0] for example, it's then necessary to somehow mask that and >>>>> emulate the true binary path. >>>> >>>> I'm absolutely no expert on that, but I'm wondering if, instead of modifying >>>> exe_file and adding an interpreter file, you'd want to leave exe_file alone >>>> and instead provide an easier way to obtain the interpreted file. >>>> >>>> Can you maybe describe why modifying exe_file is desired (about which >>>> consumers are we worrying? ) and what exactly FEX does to handle that (how >>>> does it mask that?). >>>> >>>> So a bit more background on the challenges without this change would be >>>> appreciated. >>> >>> Yeah, it sounds like you're dealing with a process that examines >>> /proc/self/exe_file for itself only to find the binfmt_misc interpreter >>> when it was run via binfmt_misc? >>> >>> What actually breaks? Or rather, why does the process to examine >>> exe_file? I'm just trying to see if there are other solutions here that >>> would avoid creating an ambiguous interface... >>> >> >> Thanks Kees and David! Did Ryan's thorough comment addressed your >> questions? Do you have any take on the TODOs? >> >> I can maybe rebase against 6.7-rc1 and resubmit , if that makes sense! >> But would be better having the TODOs addressed, I guess. > > Currently there is a mechanism in the kernel for changing > /proc/self/exe. Would that be reasonable to use in this case? > > It came from the checkpoint/restart work, but given that it is already > implemented it seems like the path of least resistance to get your > binfmt_misc that wants to look like binfmt_elf to use that mechanism. I had that in mind as well, but prctl_set_mm_exe_file()->replace_mm_exe_file() fails if the executable is still mmaped (due to denywrite handling); that should be the case for the emulator I strongly assume. -- Cheers, David / dhildenb ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [RFC PATCH 0/2] Introduce a way to expose the interpreted file with binfmt_misc 2023-11-13 19:16 ` David Hildenbrand @ 2023-11-14 16:11 ` Eric W. Biederman 2023-11-14 16:14 ` David Hildenbrand 0 siblings, 1 reply; 13+ messages in thread From: Eric W. Biederman @ 2023-11-14 16:11 UTC (permalink / raw) To: David Hildenbrand Cc: Guilherme G. Piccoli, Kees Cook, sonicadvance1, linux-kernel, linux-fsdevel, linux-mm, kernel-dev, kernel, oleg, yzaikin, mcgrof, akpm, brauner, viro, willy, dave, joshua David Hildenbrand <david@redhat.com> writes: > On 13.11.23 19:29, Eric W. Biederman wrote: >> "Guilherme G. Piccoli" <gpiccoli@igalia.com> writes: >> >>> On 09/10/2023 14:37, Kees Cook wrote: >>>> On Fri, Oct 06, 2023 at 02:07:16PM +0200, David Hildenbrand wrote: >>>>> On 07.09.23 22:24, Guilherme G. Piccoli wrote: >>>>>> Currently the kernel provides a symlink to the executable binary, in the >>>>>> form of procfs file exe_file (/proc/self/exe_file for example). But what >>>>>> happens in interpreted scenarios (like binfmt_misc) is that such link >>>>>> always points to the *interpreter*. For cases of Linux binary emulators, >>>>>> like FEX [0] for example, it's then necessary to somehow mask that and >>>>>> emulate the true binary path. >>>>> >>>>> I'm absolutely no expert on that, but I'm wondering if, instead of modifying >>>>> exe_file and adding an interpreter file, you'd want to leave exe_file alone >>>>> and instead provide an easier way to obtain the interpreted file. >>>>> >>>>> Can you maybe describe why modifying exe_file is desired (about which >>>>> consumers are we worrying? ) and what exactly FEX does to handle that (how >>>>> does it mask that?). >>>>> >>>>> So a bit more background on the challenges without this change would be >>>>> appreciated. >>>> >>>> Yeah, it sounds like you're dealing with a process that examines >>>> /proc/self/exe_file for itself only to find the binfmt_misc interpreter >>>> when it was run via binfmt_misc? >>>> >>>> What actually breaks? Or rather, why does the process to examine >>>> exe_file? I'm just trying to see if there are other solutions here that >>>> would avoid creating an ambiguous interface... >>>> >>> >>> Thanks Kees and David! Did Ryan's thorough comment addressed your >>> questions? Do you have any take on the TODOs? >>> >>> I can maybe rebase against 6.7-rc1 and resubmit , if that makes sense! >>> But would be better having the TODOs addressed, I guess. >> Currently there is a mechanism in the kernel for changing >> /proc/self/exe. Would that be reasonable to use in this case? >> It came from the checkpoint/restart work, but given that it is >> already >> implemented it seems like the path of least resistance to get your >> binfmt_misc that wants to look like binfmt_elf to use that mechanism. > > I had that in mind as well, but > prctl_set_mm_exe_file()->replace_mm_exe_file() fails if the executable > is still mmaped (due to denywrite handling); that should be the case > for the emulator I strongly assume. Bah yes. The sanity check that that the old executable is no longer mapped does make it so that we can't trivially change the /proc/self/exe using prctl(PR_SET_MM_EXE_FILE). Eric ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [RFC PATCH 0/2] Introduce a way to expose the interpreted file with binfmt_misc 2023-11-14 16:11 ` Eric W. Biederman @ 2023-11-14 16:14 ` David Hildenbrand 0 siblings, 0 replies; 13+ messages in thread From: David Hildenbrand @ 2023-11-14 16:14 UTC (permalink / raw) To: Eric W. Biederman Cc: Guilherme G. Piccoli, Kees Cook, sonicadvance1, linux-kernel, linux-fsdevel, linux-mm, kernel-dev, kernel, oleg, yzaikin, mcgrof, akpm, brauner, viro, willy, dave, joshua On 14.11.23 17:11, Eric W. Biederman wrote: > David Hildenbrand <david@redhat.com> writes: > >> On 13.11.23 19:29, Eric W. Biederman wrote: >>> "Guilherme G. Piccoli" <gpiccoli@igalia.com> writes: >>> >>>> On 09/10/2023 14:37, Kees Cook wrote: >>>>> On Fri, Oct 06, 2023 at 02:07:16PM +0200, David Hildenbrand wrote: >>>>>> On 07.09.23 22:24, Guilherme G. Piccoli wrote: >>>>>>> Currently the kernel provides a symlink to the executable binary, in the >>>>>>> form of procfs file exe_file (/proc/self/exe_file for example). But what >>>>>>> happens in interpreted scenarios (like binfmt_misc) is that such link >>>>>>> always points to the *interpreter*. For cases of Linux binary emulators, >>>>>>> like FEX [0] for example, it's then necessary to somehow mask that and >>>>>>> emulate the true binary path. >>>>>> >>>>>> I'm absolutely no expert on that, but I'm wondering if, instead of modifying >>>>>> exe_file and adding an interpreter file, you'd want to leave exe_file alone >>>>>> and instead provide an easier way to obtain the interpreted file. >>>>>> >>>>>> Can you maybe describe why modifying exe_file is desired (about which >>>>>> consumers are we worrying? ) and what exactly FEX does to handle that (how >>>>>> does it mask that?). >>>>>> >>>>>> So a bit more background on the challenges without this change would be >>>>>> appreciated. >>>>> >>>>> Yeah, it sounds like you're dealing with a process that examines >>>>> /proc/self/exe_file for itself only to find the binfmt_misc interpreter >>>>> when it was run via binfmt_misc? >>>>> >>>>> What actually breaks? Or rather, why does the process to examine >>>>> exe_file? I'm just trying to see if there are other solutions here that >>>>> would avoid creating an ambiguous interface... >>>>> >>>> >>>> Thanks Kees and David! Did Ryan's thorough comment addressed your >>>> questions? Do you have any take on the TODOs? >>>> >>>> I can maybe rebase against 6.7-rc1 and resubmit , if that makes sense! >>>> But would be better having the TODOs addressed, I guess. >>> Currently there is a mechanism in the kernel for changing >>> /proc/self/exe. Would that be reasonable to use in this case? >>> It came from the checkpoint/restart work, but given that it is >>> already >>> implemented it seems like the path of least resistance to get your >>> binfmt_misc that wants to look like binfmt_elf to use that mechanism. >> >> I had that in mind as well, but >> prctl_set_mm_exe_file()->replace_mm_exe_file() fails if the executable >> is still mmaped (due to denywrite handling); that should be the case >> for the emulator I strongly assume. > > Bah yes. The sanity check that that the old executable is no longer > mapped does make it so that we can't trivially change the /proc/self/exe > using prctl(PR_SET_MM_EXE_FILE). I was wondering if we should have a new file (yet have to come up witha fitting name) that defaults to /proc/self/exe as long as that new file doesn't explicitly get set via a prctl. So /proc/self/exe would indeed always show the emulator (executable), but the new file could be adjusted to something that is being executed by the emulator. Just a thought ... I'd rather leave /proc/self/exe alone. -- Cheers, David / dhildenb ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [RFC PATCH 0/2] Introduce a way to expose the interpreted file with binfmt_misc 2023-11-13 18:29 ` Eric W. Biederman 2023-11-13 19:16 ` David Hildenbrand @ 2023-11-13 19:17 ` Guilherme G. Piccoli 1 sibling, 0 replies; 13+ messages in thread From: Guilherme G. Piccoli @ 2023-11-13 19:17 UTC (permalink / raw) To: Eric W. Biederman, sonicadvance1 Cc: Kees Cook, David Hildenbrand, linux-kernel, linux-fsdevel, linux-mm, kernel-dev, kernel, oleg, yzaikin, mcgrof, akpm, brauner, viro, willy, dave, joshua On 13/11/2023 15:29, Eric W. Biederman wrote: > [...] > Currently there is a mechanism in the kernel for changing > /proc/self/exe. Would that be reasonable to use in this case? > > It came from the checkpoint/restart work, but given that it is already > implemented it seems like the path of least resistance to get your > binfmt_misc that wants to look like binfmt_elf to use that mechanism. > > Eric > Thanks Eric! I'm curious on how that would work: we'd change the symlink of the emulator? So, the *emulated* software, when reading that, would see the correct symlink? Also, just to fully clarify: are you suggesting we hook the new binfmt_misc flag proposed here to the internal kernel way of changing the proc/self/exe symlink, or are you suggesting we use the prctl() tune from the emulator, like the userspace changing its own symlink? One of the biggest concerns I have with this kind of approach is that changing the symlink actually...changes it - the binary mapping itself, I mean. Whereas my way was a "fake" change, just expose one thing for the emulated app, but changes nothing else... Cheers, Guilherme ^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~2023-11-14 16:15 UTC | newest] Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2023-09-07 20:24 [RFC PATCH 0/2] Introduce a way to expose the interpreted file with binfmt_misc Guilherme G. Piccoli 2023-09-07 20:24 ` [RFC PATCH 1/2] binfmt_misc, fork, proc: Introduce flag to expose the interpreted binary in procfs Guilherme G. Piccoli 2023-09-07 20:24 ` [RFC PATCH 2/2] fork, procfs: Introduce /proc/self/interpreter symlink Guilherme G. Piccoli 2023-10-06 7:51 ` [RFC PATCH 0/2] Introduce a way to expose the interpreted file with binfmt_misc Guilherme G. Piccoli 2023-10-06 12:07 ` David Hildenbrand 2023-10-09 17:37 ` Kees Cook 2023-10-11 23:53 ` Ryan Houdek 2023-11-13 17:33 ` Guilherme G. Piccoli 2023-11-13 18:29 ` Eric W. Biederman 2023-11-13 19:16 ` David Hildenbrand 2023-11-14 16:11 ` Eric W. Biederman 2023-11-14 16:14 ` David Hildenbrand 2023-11-13 19:17 ` Guilherme G. Piccoli
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox