* Re: [PATCH v17] exec: Fix dead-lock in de_thread with ptrace_attach
2025-08-21 17:34 ` [PATCH v17] " Bernd Edlinger
@ 2025-10-27 6:26 ` Bernd Edlinger
2025-10-27 12:06 ` Peter Zijlstra
` (4 subsequent siblings)
5 siblings, 0 replies; 68+ messages in thread
From: Bernd Edlinger @ 2025-10-27 6:26 UTC (permalink / raw)
To: Alexander Viro, Alexey Dobriyan, Oleg Nesterov, Kees Cook,
Andy Lutomirski, Will Drewry, Christian Brauner, Andrew Morton,
Michal Hocko, Serge Hallyn, James Morris, Randy Dunlap,
Suren Baghdasaryan, Yafang Shao, Helge Deller, Eric W. Biederman,
Adrian Reber, Thomas Gleixner, Jens Axboe, Alexei Starovoitov,
linux-fsdevel, linux-kernel, linux-kselftest, linux-mm,
linux-security-module, tiozhang, Luis Chamberlain,
Paulo Alcantara (SUSE),
Sergey Senozhatsky, Frederic Weisbecker, YueHaibing, Paul Moore,
Aleksa Sarai, Stefan Roesch, Chao Yu, xu xin, Jeff Layton,
Jan Kara, David Hildenbrand, Dave Chinner, Shuah Khan,
Alexey Dobriyan, Jens Axboe, Paul Moore, Elena Reshetova,
David Windsor, Mateusz Guzik, YueHaibing, Ard Biesheuvel,
Joel Fernandes (Google), Matthew Wilcox (Oracle),
Hans Liljestrand, tiozhang, Penglei Jiang, Lorenzo Stoakes,
Adrian Ratiu, Ingo Molnar, Peter Zijlstra (Intel),
Cyrill Gorcunov, Eric Dumazet
Hi all,
This is a friendly ping, just a gentle reminder since this series has been around a while.
FYI the patch still applies cleanly to current kernel sources, compiles correctly, and
the tests still pass.
Thanks
Bernd.
On 8/21/25 19:34, Bernd Edlinger wrote:
> This introduces signal->exec_bprm, which is used to
> fix the case when at least one of the sibling threads
> is traced, and therefore the tracer process may deadlock
> in ptrace_attach, but de_thread will need to wait for the
> tracer to continue execution.
>
> The problem happens when a tracer tries to ptrace_attach
> to a multi-threaded process that does an execve in one of
> its threads at the same time, without doing that in a forked
> sub-process. That means there is a race condition when one
> or more of the threads are already ptraced, but the thread
> that invoked the execve is not yet traced. In this
> case the execve locks the cred_guard_mutex and waits for
> de_thread to complete. But de_thread waits for the traced
> sibling threads to exit, and those have to wait for the
> tracer to receive the exit signal, but the tracer cannot
> call wait right now, because it is waiting for the ptrace
> call to complete, which never happens.
> The traced process and the tracer are now in a deadlock
> situation, and can only be killed by a fatal signal.
>
> The solution is to detect this situation and allow
> ptrace_attach to continue by temporarily releasing the
> cred_guard_mutex, while de_thread() is still waiting for
> traced zombies to be eventually released by the tracer.
> In the case of the thread group leader we only have to wait
> for the thread to become a zombie, which may also need
> co-operation from the tracer due to PTRACE_O_TRACEEXIT.
>
> When a tracer wants to ptrace_attach a task that already
> is in execve, we simply retry the ptrace_may_access
> check while temporarily installing the new credentials
> and dumpability which are about to be used after execve
> completes. If the ptrace_attach happens on a thread that
> is a sibling-thread of the thread doing execve, it is
> sufficient to check against the old credentials, as this
> thread will be waited for, before the new credentials are
> installed.
>
> Other threads die quickly since the cred_guard_mutex is
> released, but a deadly signal is already pending. In case
> the mutex_lock_killable misses the signal, the non-zero
> current->signal->exec_bprm makes sure they release the
> mutex immediately and return with -ERESTARTNOINTR.
>
> This means there is no API change, unlike the previous
> version of this patch which was discussed here:
>
> https://lore.kernel.org/lkml/b6537ae6-31b1-5c50-f32b-8b8332ace882@hotmail.de/
>
> See tools/testing/selftests/ptrace/vmaccess.c
> for a test case that gets fixed by this change.
>
> Note that since the test case was originally designed to
> test the ptrace_attach returning an error in this situation,
> the test expectation needed to be adjusted, to allow the
> API to succeed at the first attempt.
>
> Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
> ---
> fs/exec.c | 69 ++++++++---
> fs/proc/base.c | 6 +
> include/linux/cred.h | 1 +
> include/linux/sched/signal.h | 18 +++
> kernel/cred.c | 30 ++++-
> kernel/ptrace.c | 32 +++++
> kernel/seccomp.c | 12 +-
> tools/testing/selftests/ptrace/vmaccess.c | 135 ++++++++++++++++++++--
> 8 files changed, 266 insertions(+), 37 deletions(-)
>
> v10: Changes to previous version: make PTRACE_ATTACH
> return -EAGAIN, instead of execve returning -ERESTARTSYS.
> Added some lessons learned to the description.
>
> v11: Check old and new credentials in PTRACE_ATTACH again without
> changing the API.
>
> Note: I actually got one response from an automated checker to the v11 patch,
>
> https://lore.kernel.org/lkml/202107121344.wu68hEPF-lkp@intel.com/
>
> which is complaining about:
>
>>>>> kernel/ptrace.c:425:26: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct cred const *old_cred @@ got struct cred const [noderef] __rcu *real_cred @@
>
> 417 struct linux_binprm *bprm = task->signal->exec_bprm;
> 418 const struct cred *old_cred;
> 419 struct mm_struct *old_mm;
> 420
> 421 retval = down_write_killable(&task->signal->exec_update_lock);
> 422 if (retval)
> 423 goto unlock_creds;
> 424 task_lock(task);
> > 425 old_cred = task->real_cred;
>
> v12: Essentially identical to v11.
>
> - Fixed a minor merge conflict in linux v5.17, and fixed the
> above mentioned nit by adding __rcu to the declaration.
>
> - re-tested the patch with all linux versions from v5.11 to v6.6
>
> v10 was an alternative approach which did imply an API change.
> But I would prefer to avoid such an API change.
>
> The difficult part is getting the right dumpability flags assigned
> before de_thread starts. I hope you like this version.
> If not, v10 is of course also acceptable.
>
> v13: Fixed duplicated Return section in function header of
> is_dumpability_changed which was reported by the kernel test robot
>
> v14: rebased to v6.7, refreshed and retested.
> And added a more detailed description of the actual bug.
>
> v15: rebased to v6.8-rc1, addressed some review comments.
> Split the test case vmaccess into vmaccess1 and vmaccess2
> to improve overall test coverage.
>
> v16: rebased to 6.17-rc2, fixed some minor merge conflicts.
>
> v17: avoid use of task->in_execve in ptrace_attach.
>
>
> Thanks
> Bernd.
>
> diff --git a/fs/exec.c b/fs/exec.c
> index 2a1e5e4042a1..31c6ceaa5f69 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -905,11 +905,13 @@ static int exec_mmap(struct mm_struct *mm)
> return 0;
> }
>
> -static int de_thread(struct task_struct *tsk)
> +static int de_thread(struct task_struct *tsk, struct linux_binprm *bprm)
> {
> struct signal_struct *sig = tsk->signal;
> struct sighand_struct *oldsighand = tsk->sighand;
> spinlock_t *lock = &oldsighand->siglock;
> + struct task_struct *t;
> + bool unsafe_execve_in_progress = false;
>
> if (thread_group_empty(tsk))
> goto no_thread_group;
> @@ -932,6 +934,19 @@ static int de_thread(struct task_struct *tsk)
> if (!thread_group_leader(tsk))
> sig->notify_count--;
>
> + for_other_threads(tsk, t) {
> + if (unlikely(t->ptrace)
> + && (t != tsk->group_leader || !t->exit_state))
> + unsafe_execve_in_progress = true;
> + }
> +
> + if (unlikely(unsafe_execve_in_progress)) {
> + spin_unlock_irq(lock);
> + sig->exec_bprm = bprm;
> + mutex_unlock(&sig->cred_guard_mutex);
> + spin_lock_irq(lock);
> + }
> +
> while (sig->notify_count) {
> __set_current_state(TASK_KILLABLE);
> spin_unlock_irq(lock);
> @@ -1021,6 +1036,11 @@ static int de_thread(struct task_struct *tsk)
> release_task(leader);
> }
>
> + if (unlikely(unsafe_execve_in_progress)) {
> + mutex_lock(&sig->cred_guard_mutex);
> + sig->exec_bprm = NULL;
> + }
> +
> sig->group_exec_task = NULL;
> sig->notify_count = 0;
>
> @@ -1032,6 +1052,11 @@ static int de_thread(struct task_struct *tsk)
> return 0;
>
> killed:
> + if (unlikely(unsafe_execve_in_progress)) {
> + mutex_lock(&sig->cred_guard_mutex);
> + sig->exec_bprm = NULL;
> + }
> +
> /* protects against exit_notify() and __exit_signal() */
> read_lock(&tasklist_lock);
> sig->group_exec_task = NULL;
> @@ -1114,13 +1139,31 @@ int begin_new_exec(struct linux_binprm * bprm)
> */
> trace_sched_prepare_exec(current, bprm);
>
> + /* If the binary is not readable then enforce mm->dumpable=0 */
> + would_dump(bprm, bprm->file);
> + if (bprm->have_execfd)
> + would_dump(bprm, bprm->executable);
> +
> + /*
> + * Figure out dumpability. Note that this checking only of current
> + * is wrong, but userspace depends on it. This should be testing
> + * bprm->secureexec instead.
> + */
> + if (bprm->interp_flags & BINPRM_FLAGS_ENFORCE_NONDUMP ||
> + is_dumpability_changed(current_cred(), bprm->cred) ||
> + !(uid_eq(current_euid(), current_uid()) &&
> + gid_eq(current_egid(), current_gid())))
> + set_dumpable(bprm->mm, suid_dumpable);
> + else
> + set_dumpable(bprm->mm, SUID_DUMP_USER);
> +
> /*
> * Ensure all future errors are fatal.
> */
> bprm->point_of_no_return = true;
>
> /* Make this the only thread in the thread group */
> - retval = de_thread(me);
> + retval = de_thread(me, bprm);
> if (retval)
> goto out;
> /* see the comment in check_unsafe_exec() */
> @@ -1144,11 +1187,6 @@ int begin_new_exec(struct linux_binprm * bprm)
> if (retval)
> goto out;
>
> - /* If the binary is not readable then enforce mm->dumpable=0 */
> - would_dump(bprm, bprm->file);
> - if (bprm->have_execfd)
> - would_dump(bprm, bprm->executable);
> -
> /*
> * Release all of the old mmap stuff
> */
> @@ -1210,18 +1248,6 @@ int begin_new_exec(struct linux_binprm * bprm)
>
> me->sas_ss_sp = me->sas_ss_size = 0;
>
> - /*
> - * Figure out dumpability. Note that this checking only of current
> - * is wrong, but userspace depends on it. This should be testing
> - * bprm->secureexec instead.
> - */
> - if (bprm->interp_flags & BINPRM_FLAGS_ENFORCE_NONDUMP ||
> - !(uid_eq(current_euid(), current_uid()) &&
> - gid_eq(current_egid(), current_gid())))
> - set_dumpable(current->mm, suid_dumpable);
> - else
> - set_dumpable(current->mm, SUID_DUMP_USER);
> -
> perf_event_exec();
>
> /*
> @@ -1361,6 +1387,11 @@ static int prepare_bprm_creds(struct linux_binprm *bprm)
> if (mutex_lock_interruptible(&current->signal->cred_guard_mutex))
> return -ERESTARTNOINTR;
>
> + if (unlikely(current->signal->exec_bprm)) {
> + mutex_unlock(&current->signal->cred_guard_mutex);
> + return -ERESTARTNOINTR;
> + }
> +
> bprm->cred = prepare_exec_creds();
> if (likely(bprm->cred))
> return 0;
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 62d35631ba8c..e5bcf812cee0 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -2838,6 +2838,12 @@ static ssize_t proc_pid_attr_write(struct file * file, const char __user * buf,
> if (rv < 0)
> goto out_free;
>
> + if (unlikely(current->signal->exec_bprm)) {
> + mutex_unlock(&current->signal->cred_guard_mutex);
> + rv = -ERESTARTNOINTR;
> + goto out_free;
> + }
> +
> rv = security_setprocattr(PROC_I(inode)->op.lsmid,
> file->f_path.dentry->d_name.name, page,
> count);
> diff --git a/include/linux/cred.h b/include/linux/cred.h
> index a102a10f833f..fb0361911489 100644
> --- a/include/linux/cred.h
> +++ b/include/linux/cred.h
> @@ -153,6 +153,7 @@ extern const struct cred *get_task_cred(struct task_struct *);
> extern struct cred *cred_alloc_blank(void);
> extern struct cred *prepare_creds(void);
> extern struct cred *prepare_exec_creds(void);
> +extern bool is_dumpability_changed(const struct cred *, const struct cred *);
> extern int commit_creds(struct cred *);
> extern void abort_creds(struct cred *);
> extern struct cred *prepare_kernel_cred(struct task_struct *);
> diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
> index 1ef1edbaaf79..3c47d8b55863 100644
> --- a/include/linux/sched/signal.h
> +++ b/include/linux/sched/signal.h
> @@ -237,9 +237,27 @@ struct signal_struct {
> struct mm_struct *oom_mm; /* recorded mm when the thread group got
> * killed by the oom killer */
>
> + struct linux_binprm *exec_bprm; /* Used to check ptrace_may_access
> + * against new credentials while
> + * de_thread is waiting for other
> + * traced threads to terminate.
> + * Set while de_thread is executing.
> + * The cred_guard_mutex is released
> + * after de_thread() has called
> + * zap_other_threads(), therefore
> + * a fatal signal is guaranteed to be
> + * already pending in the unlikely
> + * event, that
> + * current->signal->exec_bprm happens
> + * to be non-zero after the
> + * cred_guard_mutex was acquired.
> + */
> +
> struct mutex cred_guard_mutex; /* guard against foreign influences on
> * credential calculations
> * (notably. ptrace)
> + * Held while execve runs, except when
> + * a sibling thread is being traced.
> * Deprecated do not use in new code.
> * Use exec_update_lock instead.
> */
> diff --git a/kernel/cred.c b/kernel/cred.c
> index 9676965c0981..0b2822c762df 100644
> --- a/kernel/cred.c
> +++ b/kernel/cred.c
> @@ -375,6 +375,30 @@ static bool cred_cap_issubset(const struct cred *set, const struct cred *subset)
> return false;
> }
>
> +/**
> + * is_dumpability_changed - Will changing creds affect dumpability?
> + * @old: The old credentials.
> + * @new: The new credentials.
> + *
> + * If the @new credentials have no elevated privileges compared to the
> + * @old credentials, the task may remain dumpable. Otherwise we have
> + * to mark the task as undumpable to avoid information leaks from higher
> + * to lower privilege domains.
> + *
> + * Return: True if the task will become undumpable.
> + */
> +bool is_dumpability_changed(const struct cred *old, const struct cred *new)
> +{
> + if (!uid_eq(old->euid, new->euid) ||
> + !gid_eq(old->egid, new->egid) ||
> + !uid_eq(old->fsuid, new->fsuid) ||
> + !gid_eq(old->fsgid, new->fsgid) ||
> + !cred_cap_issubset(old, new))
> + return true;
> +
> + return false;
> +}
> +
> /**
> * commit_creds - Install new credentials upon the current task
> * @new: The credentials to be assigned
> @@ -403,11 +427,7 @@ int commit_creds(struct cred *new)
> get_cred(new); /* we will require a ref for the subj creds too */
>
> /* dumpability changes */
> - if (!uid_eq(old->euid, new->euid) ||
> - !gid_eq(old->egid, new->egid) ||
> - !uid_eq(old->fsuid, new->fsuid) ||
> - !gid_eq(old->fsgid, new->fsgid) ||
> - !cred_cap_issubset(old, new)) {
> + if (is_dumpability_changed(old, new)) {
> if (task->mm)
> set_dumpable(task->mm, suid_dumpable);
> task->pdeath_signal = 0;
> diff --git a/kernel/ptrace.c b/kernel/ptrace.c
> index 75a84efad40f..230298817dbf 100644
> --- a/kernel/ptrace.c
> +++ b/kernel/ptrace.c
> @@ -20,6 +20,7 @@
> #include <linux/pagemap.h>
> #include <linux/ptrace.h>
> #include <linux/security.h>
> +#include <linux/binfmts.h>
> #include <linux/signal.h>
> #include <linux/uio.h>
> #include <linux/audit.h>
> @@ -453,6 +454,28 @@ static int ptrace_attach(struct task_struct *task, long request,
> return retval;
> }
>
> + if (unlikely(task == task->signal->group_exec_task)) {
> + retval = down_write_killable(&task->signal->exec_update_lock);
> + if (retval)
> + return retval;
> +
> + scoped_guard (task_lock, task) {
> + struct linux_binprm *bprm = task->signal->exec_bprm;
> + const struct cred __rcu *old_cred = task->real_cred;
> + struct mm_struct *old_mm = task->mm;
> +
> + rcu_assign_pointer(task->real_cred, bprm->cred);
> + task->mm = bprm->mm;
> + retval = __ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS);
> + rcu_assign_pointer(task->real_cred, old_cred);
> + task->mm = old_mm;
> + }
> +
> + up_write(&task->signal->exec_update_lock);
> + if (retval)
> + return retval;
> + }
> +
> scoped_guard (write_lock_irq, &tasklist_lock) {
> if (unlikely(task->exit_state))
> return -EPERM;
> @@ -488,6 +511,14 @@ static int ptrace_traceme(void)
> {
> int ret = -EPERM;
>
> + if (mutex_lock_interruptible(&current->signal->cred_guard_mutex))
> + return -ERESTARTNOINTR;
> +
> + if (unlikely(current->signal->exec_bprm)) {
> + mutex_unlock(&current->signal->cred_guard_mutex);
> + return -ERESTARTNOINTR;
> + }
> +
> write_lock_irq(&tasklist_lock);
> /* Are we already being traced? */
> if (!current->ptrace) {
> @@ -503,6 +534,7 @@ static int ptrace_traceme(void)
> }
> }
> write_unlock_irq(&tasklist_lock);
> + mutex_unlock(&current->signal->cred_guard_mutex);
>
> return ret;
> }
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index 41aa761c7738..d61fc275235a 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -1994,9 +1994,15 @@ static long seccomp_set_mode_filter(unsigned int flags,
> * Make sure we cannot change seccomp or nnp state via TSYNC
> * while another thread is in the middle of calling exec.
> */
> - if (flags & SECCOMP_FILTER_FLAG_TSYNC &&
> - mutex_lock_killable(&current->signal->cred_guard_mutex))
> - goto out_put_fd;
> + if (flags & SECCOMP_FILTER_FLAG_TSYNC) {
> + if (mutex_lock_killable(&current->signal->cred_guard_mutex))
> + goto out_put_fd;
> +
> + if (unlikely(current->signal->exec_bprm)) {
> + mutex_unlock(&current->signal->cred_guard_mutex);
> + goto out_put_fd;
> + }
> + }
>
> spin_lock_irq(&current->sighand->siglock);
>
> diff --git a/tools/testing/selftests/ptrace/vmaccess.c b/tools/testing/selftests/ptrace/vmaccess.c
> index 4db327b44586..5d4a65eb5a8d 100644
> --- a/tools/testing/selftests/ptrace/vmaccess.c
> +++ b/tools/testing/selftests/ptrace/vmaccess.c
> @@ -14,6 +14,7 @@
> #include <signal.h>
> #include <unistd.h>
> #include <sys/ptrace.h>
> +#include <sys/syscall.h>
>
> static void *thread(void *arg)
> {
> @@ -23,7 +24,7 @@ static void *thread(void *arg)
>
> TEST(vmaccess)
> {
> - int f, pid = fork();
> + int s, f, pid = fork();
> char mm[64];
>
> if (!pid) {
> @@ -31,19 +32,42 @@ TEST(vmaccess)
>
> pthread_create(&pt, NULL, thread, NULL);
> pthread_join(pt, NULL);
> - execlp("true", "true", NULL);
> + execlp("false", "false", NULL);
> + return;
> }
>
> sleep(1);
> sprintf(mm, "/proc/%d/mem", pid);
> + /* deadlock did happen here */
> f = open(mm, O_RDONLY);
> ASSERT_GE(f, 0);
> close(f);
> - f = kill(pid, SIGCONT);
> - ASSERT_EQ(f, 0);
> + f = waitpid(-1, &s, WNOHANG);
> + ASSERT_NE(f, -1);
> + ASSERT_NE(f, 0);
> + ASSERT_NE(f, pid);
> + ASSERT_EQ(WIFEXITED(s), 1);
> + ASSERT_EQ(WEXITSTATUS(s), 0);
> + f = waitpid(-1, &s, 0);
> + ASSERT_EQ(f, pid);
> + ASSERT_EQ(WIFEXITED(s), 1);
> + ASSERT_EQ(WEXITSTATUS(s), 1);
> + f = waitpid(-1, NULL, 0);
> + ASSERT_EQ(f, -1);
> + ASSERT_EQ(errno, ECHILD);
> }
>
> -TEST(attach)
> +/*
> + * Same test as previous, except that
> + * we try to ptrace the group leader,
> + * which is about to call execve,
> + * when the other thread is already ptraced.
> + * This exercises the code in de_thread
> + * where it is waiting inside the
> + * while (sig->notify_count) {
> + * loop.
> + */
> +TEST(attach1)
> {
> int s, k, pid = fork();
>
> @@ -52,19 +76,76 @@ TEST(attach)
>
> pthread_create(&pt, NULL, thread, NULL);
> pthread_join(pt, NULL);
> - execlp("sleep", "sleep", "2", NULL);
> + execlp("false", "false", NULL);
> + return;
> }
>
> sleep(1);
> + /* deadlock may happen here */
> k = ptrace(PTRACE_ATTACH, pid, 0L, 0L);
> - ASSERT_EQ(errno, EAGAIN);
> - ASSERT_EQ(k, -1);
> + ASSERT_EQ(k, 0);
> k = waitpid(-1, &s, WNOHANG);
> ASSERT_NE(k, -1);
> ASSERT_NE(k, 0);
> ASSERT_NE(k, pid);
> ASSERT_EQ(WIFEXITED(s), 1);
> ASSERT_EQ(WEXITSTATUS(s), 0);
> + k = waitpid(-1, &s, 0);
> + ASSERT_EQ(k, pid);
> + ASSERT_EQ(WIFSTOPPED(s), 1);
> + ASSERT_EQ(WSTOPSIG(s), SIGTRAP);
> + k = waitpid(-1, &s, WNOHANG);
> + ASSERT_EQ(k, 0);
> + k = ptrace(PTRACE_CONT, pid, 0L, 0L);
> + ASSERT_EQ(k, 0);
> + k = waitpid(-1, &s, 0);
> + ASSERT_EQ(k, pid);
> + ASSERT_EQ(WIFSTOPPED(s), 1);
> + ASSERT_EQ(WSTOPSIG(s), SIGSTOP);
> + k = waitpid(-1, &s, WNOHANG);
> + ASSERT_EQ(k, 0);
> + k = ptrace(PTRACE_CONT, pid, 0L, 0L);
> + ASSERT_EQ(k, 0);
> + k = waitpid(-1, &s, 0);
> + ASSERT_EQ(k, pid);
> + ASSERT_EQ(WIFEXITED(s), 1);
> + ASSERT_EQ(WEXITSTATUS(s), 1);
> + k = waitpid(-1, NULL, 0);
> + ASSERT_EQ(k, -1);
> + ASSERT_EQ(errno, ECHILD);
> +}
> +
> +/*
> + * Same test as previous, except that
> + * the group leader is ptraced first,
> + * but this time with PTRACE_O_TRACEEXIT,
> + * and the thread that does execve is
> + * not yet ptraced. This exercises the
> + * code block in de_thread where the
> + * if (!thread_group_leader(tsk)) {
> + * is executed and enters a wait state.
> + */
> +static long thread2_tid;
> +static void *thread2(void *arg)
> +{
> + thread2_tid = syscall(__NR_gettid);
> + sleep(2);
> + execlp("false", "false", NULL);
> + return NULL;
> +}
> +
> +TEST(attach2)
> +{
> + int s, k, pid = fork();
> +
> + if (!pid) {
> + pthread_t pt;
> +
> + pthread_create(&pt, NULL, thread2, NULL);
> + pthread_join(pt, NULL);
> + return;
> + }
> +
> sleep(1);
> k = ptrace(PTRACE_ATTACH, pid, 0L, 0L);
> ASSERT_EQ(k, 0);
> @@ -72,12 +153,46 @@ TEST(attach)
> ASSERT_EQ(k, pid);
> ASSERT_EQ(WIFSTOPPED(s), 1);
> ASSERT_EQ(WSTOPSIG(s), SIGSTOP);
> - k = ptrace(PTRACE_DETACH, pid, 0L, 0L);
> + k = ptrace(PTRACE_SETOPTIONS, pid, 0L, PTRACE_O_TRACEEXIT);
> + ASSERT_EQ(k, 0);
> + thread2_tid = ptrace(PTRACE_PEEKDATA, pid, &thread2_tid, 0L);
> + ASSERT_NE(thread2_tid, -1);
> + ASSERT_NE(thread2_tid, 0);
> + ASSERT_NE(thread2_tid, pid);
> + k = waitpid(-1, &s, WNOHANG);
> + ASSERT_EQ(k, 0);
> + sleep(2);
> + /* deadlock may happen here */
> + k = ptrace(PTRACE_ATTACH, thread2_tid, 0L, 0L);
> + ASSERT_EQ(k, 0);
> + k = waitpid(-1, &s, WNOHANG);
> + ASSERT_EQ(k, pid);
> + ASSERT_EQ(WIFSTOPPED(s), 1);
> + ASSERT_EQ(WSTOPSIG(s), SIGTRAP);
> + k = waitpid(-1, &s, WNOHANG);
> + ASSERT_EQ(k, 0);
> + k = ptrace(PTRACE_CONT, pid, 0L, 0L);
> + ASSERT_EQ(k, 0);
> + k = waitpid(-1, &s, 0);
> + ASSERT_EQ(k, pid);
> + ASSERT_EQ(WIFSTOPPED(s), 1);
> + ASSERT_EQ(WSTOPSIG(s), SIGTRAP);
> + k = waitpid(-1, &s, WNOHANG);
> + ASSERT_EQ(k, 0);
> + k = ptrace(PTRACE_CONT, pid, 0L, 0L);
> + ASSERT_EQ(k, 0);
> + k = waitpid(-1, &s, 0);
> + ASSERT_EQ(k, pid);
> + ASSERT_EQ(WIFSTOPPED(s), 1);
> + ASSERT_EQ(WSTOPSIG(s), SIGSTOP);
> + k = waitpid(-1, &s, WNOHANG);
> + ASSERT_EQ(k, 0);
> + k = ptrace(PTRACE_CONT, pid, 0L, 0L);
> ASSERT_EQ(k, 0);
> k = waitpid(-1, &s, 0);
> ASSERT_EQ(k, pid);
> ASSERT_EQ(WIFEXITED(s), 1);
> - ASSERT_EQ(WEXITSTATUS(s), 0);
> + ASSERT_EQ(WEXITSTATUS(s), 1);
> k = waitpid(-1, NULL, 0);
> ASSERT_EQ(k, -1);
> ASSERT_EQ(errno, ECHILD);
* Re: [PATCH v17] exec: Fix dead-lock in de_thread with ptrace_attach
2025-08-21 17:34 ` [PATCH v17] " Bernd Edlinger
2025-10-27 6:26 ` Bernd Edlinger
@ 2025-10-27 12:06 ` Peter Zijlstra
2025-11-02 16:17 ` Oleg Nesterov
` (3 subsequent siblings)
5 siblings, 0 replies; 68+ messages in thread
From: Peter Zijlstra @ 2025-10-27 12:06 UTC (permalink / raw)
To: Bernd Edlinger
Cc: Alexander Viro, Alexey Dobriyan, Oleg Nesterov, Kees Cook,
Andy Lutomirski, Will Drewry, Christian Brauner, Andrew Morton,
Michal Hocko, Serge Hallyn, James Morris, Randy Dunlap,
Suren Baghdasaryan, Yafang Shao, Helge Deller, Eric W. Biederman,
Adrian Reber, Thomas Gleixner, Jens Axboe, Alexei Starovoitov,
linux-fsdevel, linux-kernel, linux-kselftest, linux-mm,
linux-security-module, tiozhang, Luis Chamberlain,
Paulo Alcantara (SUSE),
Sergey Senozhatsky, Frederic Weisbecker, YueHaibing, Paul Moore,
Aleksa Sarai, Stefan Roesch, Chao Yu, xu xin, Jeff Layton,
Jan Kara, David Hildenbrand, Dave Chinner, Shuah Khan,
Elena Reshetova, David Windsor, Mateusz Guzik, Ard Biesheuvel,
Joel Fernandes (Google), Matthew Wilcox (Oracle),
Hans Liljestrand, Penglei Jiang, Lorenzo Stoakes, Adrian Ratiu,
Ingo Molnar, Cyrill Gorcunov, Eric Dumazet
On Thu, Aug 21, 2025 at 07:34:58PM +0200, Bernd Edlinger wrote:
> The solution is to detect this situation and allow
> ptrace_attach to continue by temporarily releasing the
> cred_guard_mutex, while de_thread() is still waiting for
> traced zombies to be eventually released by the tracer.
> In the case of the thread group leader we only have to wait
> for the thread to become a zombie, which may also need
> co-operation from the tracer due to PTRACE_O_TRACEEXIT.
>
> When a tracer wants to ptrace_attach a task that already
> is in execve, we simply retry the ptrace_may_access
> check while temporarily installing the new credentials
> and dumpability which are about to be used after execve
> completes. If the ptrace_attach happens on a thread that
> is a sibling-thread of the thread doing execve, it is
> sufficient to check against the old credentials, as this
> thread will be waited for, before the new credentials are
> installed.
>
> Other threads die quickly since the cred_guard_mutex is
> released, but a deadly signal is already pending. In case
> the mutex_lock_killable misses the signal, the non-zero
> current->signal->exec_bprm makes sure they release the
> mutex immediately and return with -ERESTARTNOINTR.
> diff --git a/fs/exec.c b/fs/exec.c
> index 2a1e5e4042a1..31c6ceaa5f69 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -905,11 +905,13 @@ static int exec_mmap(struct mm_struct *mm)
> return 0;
> }
>
> -static int de_thread(struct task_struct *tsk)
> +static int de_thread(struct task_struct *tsk, struct linux_binprm *bprm)
> {
> struct signal_struct *sig = tsk->signal;
> struct sighand_struct *oldsighand = tsk->sighand;
> spinlock_t *lock = &oldsighand->siglock;
> + struct task_struct *t;
> + bool unsafe_execve_in_progress = false;
>
> if (thread_group_empty(tsk))
> goto no_thread_group;
> @@ -932,6 +934,19 @@ static int de_thread(struct task_struct *tsk)
> if (!thread_group_leader(tsk))
> sig->notify_count--;
>
> + for_other_threads(tsk, t) {
> + if (unlikely(t->ptrace)
> + && (t != tsk->group_leader || !t->exit_state))
&& goes at the end of the previous line
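A minimal userspace sketch of that layout, assuming a hypothetical struct fake_task in place of task_struct and a made-up helper name blocks_exec(). It only models the condition from the hunk, not the siglock handling around it:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the two task_struct fields the loop reads. */
struct fake_task {
    unsigned int ptrace;   /* non-zero while a tracer is attached */
    int exit_state;        /* non-zero once the task is a zombie or dead */
};

/*
 * Same condition as in the quoted hunk, with the operator kept at the
 * end of the first line instead of the start of the second.
 */
static bool blocks_exec(const struct fake_task *t,
                        const struct fake_task *leader)
{
    assert(t != NULL && leader != NULL);

    return t->ptrace &&
           (t != leader || !t->exit_state);
}
```

With this shape a traced sibling always counts, while a traced group leader only counts as long as it has not yet become a zombie.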
> + unsafe_execve_in_progress = true;
> + }
> +
> + if (unlikely(unsafe_execve_in_progress)) {
> + spin_unlock_irq(lock);
> + sig->exec_bprm = bprm;
> + mutex_unlock(&sig->cred_guard_mutex);
> + spin_lock_irq(lock);
I'm not clear why we need to drop and re-acquire siglock here.
And I would like a very large comment here explaining why it is safe to
drop cred_guard_mutex here.
> + }
> +
> while (sig->notify_count) {
> __set_current_state(TASK_KILLABLE);
> spin_unlock_irq(lock);
> @@ -1021,6 +1036,11 @@ static int de_thread(struct task_struct *tsk)
> release_task(leader);
> }
>
> + if (unlikely(unsafe_execve_in_progress)) {
> + mutex_lock(&sig->cred_guard_mutex);
> + sig->exec_bprm = NULL;
> + }
> +
> sig->group_exec_task = NULL;
> sig->notify_count = 0;
>
> @@ -1032,6 +1052,11 @@ static int de_thread(struct task_struct *tsk)
> return 0;
>
> killed:
> + if (unlikely(unsafe_execve_in_progress)) {
> + mutex_lock(&sig->cred_guard_mutex);
> + sig->exec_bprm = NULL;
> + }
> +
> /* protects against exit_notify() and __exit_signal() */
> read_lock(&tasklist_lock);
> sig->group_exec_task = NULL;
> @@ -1114,13 +1139,31 @@ int begin_new_exec(struct linux_binprm * bprm)
> */
> trace_sched_prepare_exec(current, bprm);
>
> + /* If the binary is not readable then enforce mm->dumpable=0 */
> + would_dump(bprm, bprm->file);
> + if (bprm->have_execfd)
> + would_dump(bprm, bprm->executable);
> +
> + /*
> + * Figure out dumpability. Note that this checking only of current
> + * is wrong, but userspace depends on it. This should be testing
> + * bprm->secureexec instead.
> + */
> + if (bprm->interp_flags & BINPRM_FLAGS_ENFORCE_NONDUMP ||
> + is_dumpability_changed(current_cred(), bprm->cred) ||
> + !(uid_eq(current_euid(), current_uid()) &&
> + gid_eq(current_egid(), current_gid())))
> + set_dumpable(bprm->mm, suid_dumpable);
> + else
> + set_dumpable(bprm->mm, SUID_DUMP_USER);
> +
I feel like moving this dumpable stuff around could be a separate patch.
Which can explain how that is correct and why it is needed and all that.
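For reference, the decision that the moved block makes can be modelled in userspace roughly as below. This is only a sketch: struct exec_inputs, its fields and pick_dumpable() are made-up names, and the booleans stand in for the real flag test and credential comparisons. The creds_change input is the term this patch adds to the condition.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Values mirror the kernel's SUID_DUMP_* constants. */
enum { SUID_DUMP_DISABLE = 0, SUID_DUMP_USER = 1, SUID_DUMP_ROOT = 2 };

/* Hypothetical summary of the inputs the kernel condition looks at. */
struct exec_inputs {
    bool enforce_nondump;  /* BINPRM_FLAGS_ENFORCE_NONDUMP is set */
    bool creds_change;     /* is_dumpability_changed(old, new) is true */
    bool euid_is_uid;      /* current euid equals current uid */
    bool egid_is_gid;      /* current egid equals current gid */
};

/* Returns the dumpable value that would be stored in the new mm. */
static int pick_dumpable(const struct exec_inputs *in, int suid_dumpable)
{
    assert(in != NULL);

    if (in->enforce_nondump || in->creds_change ||
        !(in->euid_is_uid && in->egid_is_gid))
        return suid_dumpable;

    return SUID_DUMP_USER;
}
```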
> /*
> * Ensure all future errors are fatal.
> */
> bprm->point_of_no_return = true;
>
> /* Make this the only thread in the thread group */
> - retval = de_thread(me);
> + retval = de_thread(me, bprm);
> if (retval)
> goto out;
> /* see the comment in check_unsafe_exec() */
> @@ -1144,11 +1187,6 @@ int begin_new_exec(struct linux_binprm * bprm)
> if (retval)
> goto out;
>
> - /* If the binary is not readable then enforce mm->dumpable=0 */
> - would_dump(bprm, bprm->file);
> - if (bprm->have_execfd)
> - would_dump(bprm, bprm->executable);
> -
> /*
> * Release all of the old mmap stuff
> */
> @@ -1210,18 +1248,6 @@ int begin_new_exec(struct linux_binprm * bprm)
>
> me->sas_ss_sp = me->sas_ss_size = 0;
>
> - /*
> - * Figure out dumpability. Note that this checking only of current
> - * is wrong, but userspace depends on it. This should be testing
> - * bprm->secureexec instead.
> - */
> - if (bprm->interp_flags & BINPRM_FLAGS_ENFORCE_NONDUMP ||
> - !(uid_eq(current_euid(), current_uid()) &&
> - gid_eq(current_egid(), current_gid())))
> - set_dumpable(current->mm, suid_dumpable);
> - else
> - set_dumpable(current->mm, SUID_DUMP_USER);
> -
> perf_event_exec();
>
> /*
> @@ -1361,6 +1387,11 @@ static int prepare_bprm_creds(struct linux_binprm *bprm)
> if (mutex_lock_interruptible(&current->signal->cred_guard_mutex))
> return -ERESTARTNOINTR;
>
> + if (unlikely(current->signal->exec_bprm)) {
> + mutex_unlock(&current->signal->cred_guard_mutex);
> + return -ERESTARTNOINTR;
> + }
#1
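The lock-then-recheck sequence in this hunk recurs in the proc_pid_attr_write, ptrace_traceme and seccomp_set_mode_filter hunks of this patch. A userspace sketch of the pattern as a single helper, assuming a hypothetical struct fake_signal and a made-up name lock_cred_guard(); -EAGAIN stands in for -ERESTARTNOINTR, which is kernel-internal and not defined in userspace errno.h:

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for the signal_struct members involved. */
struct fake_signal {
    pthread_mutex_t cred_guard_mutex;
    void *exec_bprm;  /* non-NULL while an exec has dropped the mutex */
};

/*
 * Take the mutex, then back out if an exec is in flight.
 * Returns 0 with the mutex held, or -EAGAIN with the mutex released.
 */
static int lock_cred_guard(struct fake_signal *sig)
{
    assert(sig != NULL);

    if (pthread_mutex_lock(&sig->cred_guard_mutex))
        return -EAGAIN;

    if (sig->exec_bprm) {
        /* An exec owns the credentials right now; do not proceed. */
        pthread_mutex_unlock(&sig->cred_guard_mutex);
        return -EAGAIN;
    }

    return 0;
}
```

Callers would unlock cred_guard_mutex themselves on the success path, as the existing call sites already do.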
> +
> bprm->cred = prepare_exec_creds();
> if (likely(bprm->cred))
> return 0;
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 62d35631ba8c..e5bcf812cee0 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -2838,6 +2838,12 @@ static ssize_t proc_pid_attr_write(struct file * file, const char __user * buf,
> if (rv < 0)
> goto out_free;
>
Comment explaining why this needs checking goes here.
> + if (unlikely(current->signal->exec_bprm)) {
> + mutex_unlock(&current->signal->cred_guard_mutex);
> + rv = -ERESTARTNOINTR;
> + goto out_free;
> + }
> +
> rv = security_setprocattr(PROC_I(inode)->op.lsmid,
> file->f_path.dentry->d_name.name, page,
> count);
> diff --git a/include/linux/cred.h b/include/linux/cred.h
> index a102a10f833f..fb0361911489 100644
> --- a/include/linux/cred.h
> +++ b/include/linux/cred.h
> @@ -153,6 +153,7 @@ extern const struct cred *get_task_cred(struct task_struct *);
> extern struct cred *cred_alloc_blank(void);
> extern struct cred *prepare_creds(void);
> extern struct cred *prepare_exec_creds(void);
> +extern bool is_dumpability_changed(const struct cred *, const struct cred *);
> extern int commit_creds(struct cred *);
> extern void abort_creds(struct cred *);
> extern struct cred *prepare_kernel_cred(struct task_struct *);
> diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
> index 1ef1edbaaf79..3c47d8b55863 100644
> --- a/include/linux/sched/signal.h
> +++ b/include/linux/sched/signal.h
> @@ -237,9 +237,27 @@ struct signal_struct {
> struct mm_struct *oom_mm; /* recorded mm when the thread group got
> * killed by the oom killer */
>
> + struct linux_binprm *exec_bprm; /* Used to check ptrace_may_access
> + * against new credentials while
> + * de_thread is waiting for other
> + * traced threads to terminate.
> + * Set while de_thread is executing.
> + * The cred_guard_mutex is released
> + * after de_thread() has called
> + * zap_other_threads(), therefore
> + * a fatal signal is guaranteed to be
> + * already pending in the unlikely
> + * event, that
> + * current->signal->exec_bprm happens
> + * to be non-zero after the
> + * cred_guard_mutex was acquired.
> + */
> +
> struct mutex cred_guard_mutex; /* guard against foreign influences on
> * credential calculations
> * (notably. ptrace)
> + * Held while execve runs, except when
> + * a sibling thread is being traced.
> * Deprecated do not use in new code.
> * Use exec_update_lock instead.
> */
> diff --git a/kernel/cred.c b/kernel/cred.c
> index 9676965c0981..0b2822c762df 100644
> --- a/kernel/cred.c
> +++ b/kernel/cred.c
> @@ -375,6 +375,30 @@ static bool cred_cap_issubset(const struct cred *set, const struct cred *subset)
> return false;
> }
>
> +/**
> + * is_dumpability_changed - Will changing creds affect dumpability?
> + * @old: The old credentials.
> + * @new: The new credentials.
> + *
> + * If the @new credentials have no elevated privileges compared to the
> + * @old credentials, the task may remain dumpable. Otherwise we have
> + * to mark the task as undumpable to avoid information leaks from higher
> + * to lower privilege domains.
> + *
> + * Return: True if the task will become undumpable.
> + */
> +bool is_dumpability_changed(const struct cred *old, const struct cred *new)
> +{
> + if (!uid_eq(old->euid, new->euid) ||
> + !gid_eq(old->egid, new->egid) ||
> + !uid_eq(old->fsuid, new->fsuid) ||
> + !gid_eq(old->fsgid, new->fsgid) ||
> + !cred_cap_issubset(old, new))
> + return true;
> +
> + return false;
> +}
> +
> /**
> * commit_creds - Install new credentials upon the current task
> * @new: The credentials to be assigned
> @@ -403,11 +427,7 @@ int commit_creds(struct cred *new)
> get_cred(new); /* we will require a ref for the subj creds too */
>
> /* dumpability changes */
> - if (!uid_eq(old->euid, new->euid) ||
> - !gid_eq(old->egid, new->egid) ||
> - !uid_eq(old->fsuid, new->fsuid) ||
> - !gid_eq(old->fsgid, new->fsgid) ||
> - !cred_cap_issubset(old, new)) {
> + if (is_dumpability_changed(old, new)) {
> if (task->mm)
> set_dumpable(task->mm, suid_dumpable);
> task->pdeath_signal = 0;
> diff --git a/kernel/ptrace.c b/kernel/ptrace.c
> index 75a84efad40f..230298817dbf 100644
> --- a/kernel/ptrace.c
> +++ b/kernel/ptrace.c
> @@ -20,6 +20,7 @@
> #include <linux/pagemap.h>
> #include <linux/ptrace.h>
> #include <linux/security.h>
> +#include <linux/binfmts.h>
> #include <linux/signal.h>
> #include <linux/uio.h>
> #include <linux/audit.h>
> @@ -453,6 +454,28 @@ static int ptrace_attach(struct task_struct *task, long request,
> return retval;
> }
>
> + if (unlikely(task == task->signal->group_exec_task)) {
> + retval = down_write_killable(&task->signal->exec_update_lock);
> + if (retval)
> + return retval;
This could be written like:
ACQUIRE(rwsem_write_kill, guard)(&task->signal->exec_update_lock);
retval = ACQUIRE_ERR(rwsem_write_kill, guard);
if (retval)
return retval;
> +
> + scoped_guard (task_lock, task) {
> + struct linux_binprm *bprm = task->signal->exec_bprm;
> + const struct cred __rcu *old_cred = task->real_cred;
> + struct mm_struct *old_mm = task->mm;
> +
> + rcu_assign_pointer(task->real_cred, bprm->cred);
> + task->mm = bprm->mm;
> + retval = __ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS);
> + rcu_assign_pointer(task->real_cred, old_cred);
> + task->mm = old_mm;
> + }
> +
> + up_write(&task->signal->exec_update_lock);
And then this goes away ^
> + if (retval)
> + return retval;
> + }
> +
> scoped_guard (write_lock_irq, &tasklist_lock) {
> if (unlikely(task->exit_state))
> return -EPERM;
> @@ -488,6 +511,14 @@ static int ptrace_traceme(void)
> {
> int ret = -EPERM;
>
This needs comments.
> + if (mutex_lock_interruptible(&current->signal->cred_guard_mutex))
> + return -ERESTARTNOINTR;
> +
> + if (unlikely(current->signal->exec_bprm)) {
> + mutex_unlock(&current->signal->cred_guard_mutex);
> + return -ERESTARTNOINTR;
> + }
#2
> +
> write_lock_irq(&tasklist_lock);
> /* Are we already being traced? */
> if (!current->ptrace) {
> @@ -503,6 +534,7 @@ static int ptrace_traceme(void)
> }
> }
> write_unlock_irq(&tasklist_lock);
> + mutex_unlock(&current->signal->cred_guard_mutex);
>
> return ret;
> }
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index 41aa761c7738..d61fc275235a 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -1994,9 +1994,15 @@ static long seccomp_set_mode_filter(unsigned int flags,
> * Make sure we cannot change seccomp or nnp state via TSYNC
> * while another thread is in the middle of calling exec.
> */
> - if (flags & SECCOMP_FILTER_FLAG_TSYNC &&
> - mutex_lock_killable(&current->signal->cred_guard_mutex))
> - goto out_put_fd;
> + if (flags & SECCOMP_FILTER_FLAG_TSYNC) {
> + if (mutex_lock_killable(&current->signal->cred_guard_mutex))
> + goto out_put_fd;
> +
> + if (unlikely(current->signal->exec_bprm)) {
> + mutex_unlock(&current->signal->cred_guard_mutex);
> + goto out_put_fd;
> + }
#3, and after typing this same pattern 3 times, you didn't think it
needed a helper function ?
> + }
>
> spin_lock_irq(&current->sighand->siglock);
>
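The ACQUIRE() suggestion above relies on scope-based cleanup, so the lock is dropped on every return path without an explicit unlock call. A minimal userspace sketch of that idea follows, assuming a GCC- or Clang-compatible compiler for the cleanup attribute. The names demo_lock, DEMO_GUARD and demo_check_access are hypothetical and exist only for this illustration; this is not the kernel's implementation.

```c
#include <assert.h>
#include <errno.h>

/* Toy lock, a hypothetical stand-in for exec_update_lock. */
struct demo_lock {
	int held;
};

static struct demo_lock demo_exec_update_lock;

/* Take the toy lock and hand it to the guard variable. */
static struct demo_lock *demo_guard_enter(struct demo_lock *l)
{
	assert(!l->held);
	l->held = 1;
	return l;
}

/* Runs automatically when the guarded variable leaves its scope. */
static void demo_guard_exit(struct demo_lock **lp)
{
	assert((*lp)->held);
	(*lp)->held = 0;
}

/* Declare a guard variable whose cleanup handler releases the lock. */
#define DEMO_GUARD(name, lock) \
	struct demo_lock *name __attribute__((cleanup(demo_guard_exit))) = \
		demo_guard_enter(lock)

/* Both return paths release the lock; neither calls an unlock function. */
static int demo_check_access(int permitted)
{
	DEMO_GUARD(guard, &demo_exec_update_lock);

	(void)guard;
	if (!permitted)
		return -EPERM;	/* lock dropped by demo_guard_exit() */
	return 0;		/* lock dropped here as well */
}
```

With this shape the error path and the success path cannot get out of sync, which is the reason the separate up_write() call can go away in the guarded variant.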
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH v17] exec: Fix dead-lock in de_thread with ptrace_attach
2025-08-21 17:34 ` [PATCH v17] " Bernd Edlinger
2025-10-27 6:26 ` Bernd Edlinger
2025-10-27 12:06 ` Peter Zijlstra
@ 2025-11-02 16:17 ` Oleg Nesterov
2025-11-05 14:32 ` Oleg Nesterov
` (2 subsequent siblings)
5 siblings, 0 replies; 68+ messages in thread
From: Oleg Nesterov @ 2025-11-02 16:17 UTC (permalink / raw)
To: Bernd Edlinger
Cc: Alexander Viro, Alexey Dobriyan, Kees Cook, Andy Lutomirski,
Will Drewry, Christian Brauner, Andrew Morton, Michal Hocko,
Serge Hallyn, James Morris, Randy Dunlap, Suren Baghdasaryan,
Yafang Shao, Helge Deller, Eric W. Biederman, Adrian Reber,
Thomas Gleixner, Jens Axboe, Alexei Starovoitov, linux-fsdevel,
linux-kernel, linux-kselftest, linux-mm, linux-security-module,
tiozhang, Luis Chamberlain, Paulo Alcantara (SUSE),
Sergey Senozhatsky, Frederic Weisbecker, YueHaibing, Paul Moore,
Aleksa Sarai, Stefan Roesch, Chao Yu, xu xin, Jeff Layton,
Jan Kara, David Hildenbrand, Dave Chinner, Shuah Khan,
Elena Reshetova, David Windsor, Mateusz Guzik, Ard Biesheuvel,
Joel Fernandes (Google), Matthew Wilcox (Oracle),
Hans Liljestrand, Penglei Jiang, Lorenzo Stoakes, Adrian Ratiu,
Ingo Molnar, Peter Zijlstra (Intel),
Cyrill Gorcunov, Eric Dumazet
On 08/21, Bernd Edlinger wrote:
>
> v16: rebased to 6.17-rc2, fixed some minor merge conflicts.
>
> v17: avoid use of task->in_execve in ptrace_attach.
So I guess this version doesn't really differ from v14 I tried to review...
(yes, iirc my review wasn't really good, sorry).
Perhaps I am wrong, but I still think we need another approach. de_thread()
should not wait until all sub-threads are reaped, it should drop cred_guard_mutex
earlier.
I mean, something like
[PATCH V2 1/2] exec: don't wait for zombie threads with cred_guard_mutex held
https://lore.kernel.org/lkml/20170213180454.GA2858@redhat.com/
which is hopelessly outdated now.
Again, perhaps I am wrong. I'll try to take another look next week.
Oleg.
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH v17] exec: Fix dead-lock in de_thread with ptrace_attach
2025-08-21 17:34 ` [PATCH v17] " Bernd Edlinger
` (2 preceding siblings ...)
2025-11-02 16:17 ` Oleg Nesterov
@ 2025-11-05 14:32 ` Oleg Nesterov
2025-11-11 9:21 ` Christian Brauner
2025-11-09 17:14 ` [RFC PATCH 0/3] mt-exec: fix deadlock with ptrace_attach() Oleg Nesterov
2025-11-18 18:13 ` [PATCH v18] exec: Fix dead-lock in de_thread with ptrace_attach Bernd Edlinger
5 siblings, 1 reply; 68+ messages in thread
From: Oleg Nesterov @ 2025-11-05 14:32 UTC (permalink / raw)
To: Bernd Edlinger
Cc: Alexander Viro, Alexey Dobriyan, Kees Cook, Andy Lutomirski,
Will Drewry, Christian Brauner, Andrew Morton, Michal Hocko,
Serge Hallyn, James Morris, Randy Dunlap, Suren Baghdasaryan,
Yafang Shao, Helge Deller, Eric W. Biederman, Adrian Reber,
Thomas Gleixner, Jens Axboe, Alexei Starovoitov, linux-fsdevel,
linux-kernel, linux-kselftest, linux-mm, linux-security-module,
tiozhang, Luis Chamberlain, Paulo Alcantara (SUSE),
Sergey Senozhatsky, Frederic Weisbecker, YueHaibing, Paul Moore,
Aleksa Sarai, Stefan Roesch, Chao Yu, xu xin, Jeff Layton,
Jan Kara, David Hildenbrand, Dave Chinner, Shuah Khan,
Elena Reshetova, David Windsor, Mateusz Guzik, Ard Biesheuvel,
Joel Fernandes (Google), Matthew Wilcox (Oracle),
Hans Liljestrand, Penglei Jiang, Lorenzo Stoakes, Adrian Ratiu,
Ingo Molnar, Peter Zijlstra (Intel),
Cyrill Gorcunov, Eric Dumazet
I am still thinking about another approach, will write another email.
But let me take a closer look at your patch.
First of all, can you split it? See below.
On 08/21, Bernd Edlinger wrote:
>
> -static int de_thread(struct task_struct *tsk)
> +static int de_thread(struct task_struct *tsk, struct linux_binprm *bprm)
> {
> struct signal_struct *sig = tsk->signal;
> struct sighand_struct *oldsighand = tsk->sighand;
> spinlock_t *lock = &oldsighand->siglock;
> + struct task_struct *t;
> + bool unsafe_execve_in_progress = false;
>
> if (thread_group_empty(tsk))
> goto no_thread_group;
> @@ -932,6 +934,19 @@ static int de_thread(struct task_struct *tsk)
> if (!thread_group_leader(tsk))
> sig->notify_count--;
>
> + for_other_threads(tsk, t) {
> + if (unlikely(t->ptrace)
> + && (t != tsk->group_leader || !t->exit_state))
> + unsafe_execve_in_progress = true;
you can add "break" into the "if ()" block...
But this is minor. Why do we need "bool unsafe_execve_in_progress" ?
If this patch is correct, de_thread() can drop/reacquire cred_guard_mutex
unconditionally.
If you really think it makes sense, please make another patch with the
changelog.
I'd certainly prefer to avoid this boolean at least for the start. If nothing
else to catch the potential problems earlier.
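The "break" remark can be illustrated with a small userspace model of the scan. This is only a sketch: struct demo_thread and demo_any_traced_sibling() are hypothetical stand-ins for the task list walk, and the condition mirrors the one in the quoted hunk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-thread state the loop inspects. */
struct demo_thread {
	bool ptraced;	/* models t->ptrace != 0 */
	bool is_leader;	/* models t == tsk->group_leader */
	bool exited;	/* models t->exit_state != 0 */
};

/*
 * Return true if any sibling is traced, ignoring a traced group leader
 * that has already exited. Stops at the first match. If @visited is not
 * NULL it is incremented once per inspected entry.
 */
static bool demo_any_traced_sibling(const struct demo_thread *threads,
				    size_t n, size_t *visited)
{
	bool found = false;
	size_t i;

	assert(threads || n == 0);

	for (i = 0; i < n; i++) {
		if (visited)
			(*visited)++;
		if (threads[i].ptraced &&
		    (!threads[i].is_leader || !threads[i].exited)) {
			found = true;
			break;	/* one match is enough */
		}
	}
	return found;
}
```

The result is the same with or without the break; the early exit only avoids walking the rest of the thread list once the answer is known.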
> + if (unlikely(unsafe_execve_in_progress)) {
> + spin_unlock_irq(lock);
> + sig->exec_bprm = bprm;
> + mutex_unlock(&sig->cred_guard_mutex);
> + spin_lock_irq(lock);
I don't think spin_unlock_irq() + spin_lock_irq() makes any sense...
> @@ -1114,13 +1139,31 @@ int begin_new_exec(struct linux_binprm * bprm)
> */
> trace_sched_prepare_exec(current, bprm);
>
> + /* If the binary is not readable then enforce mm->dumpable=0 */
> + would_dump(bprm, bprm->file);
> + if (bprm->have_execfd)
> + would_dump(bprm, bprm->executable);
> +
> + /*
> + * Figure out dumpability. Note that this checking only of current
> + * is wrong, but userspace depends on it. This should be testing
> + * bprm->secureexec instead.
> + */
> + if (bprm->interp_flags & BINPRM_FLAGS_ENFORCE_NONDUMP ||
> + is_dumpability_changed(current_cred(), bprm->cred) ||
> + !(uid_eq(current_euid(), current_uid()) &&
> + gid_eq(current_egid(), current_gid())))
> + set_dumpable(bprm->mm, suid_dumpable);
> + else
> + set_dumpable(bprm->mm, SUID_DUMP_USER);
> +
OK, we need to do this before de_thread() drops cred_guard_mutex.
But imo this too should be done in a separate patch, the changelog should
explain this change.
> @@ -1361,6 +1387,11 @@ static int prepare_bprm_creds(struct linux_binprm *bprm)
> if (mutex_lock_interruptible(&current->signal->cred_guard_mutex))
> return -ERESTARTNOINTR;
>
> + if (unlikely(current->signal->exec_bprm)) {
> + mutex_unlock(&current->signal->cred_guard_mutex);
> + return -ERESTARTNOINTR;
> + }
OK, if signal->exec_bprm != NULL, then current is already killed. But
proc_pid_attr_write() and ptrace_traceme() do the same. So how about
something like
	int lock_current_cgm(void)
	{
		if (mutex_lock_interruptible(&current->signal->cred_guard_mutex))
			return -ERESTARTNOINTR;

		if (!current->signal->group_exec_task)
			return 0;

		WARN_ON(!fatal_signal_pending(current));
		mutex_unlock(&current->signal->cred_guard_mutex);
		return -ERESTARTNOINTR;
	}
?
Note that it checks ->group_exec_task, not ->exec_bprm. So this change can
come in a separate patch too, but I won't insist.
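To show how call sites shrink with such a helper, here is a rough userspace model. Everything in it is an assumption made for illustration: struct demo_signal is a toy object, exec_in_progress stands in for a non-NULL ->group_exec_task, and -EINTR stands in for -ERESTARTNOINTR, which is not available outside the kernel.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of the relevant signal_struct state. */
struct demo_signal {
	bool cgm_held;		/* models cred_guard_mutex being held */
	bool exec_in_progress;	/* models ->group_exec_task != NULL */
};

/* Lock, then back out if an exec is already in flight. */
static int demo_lock_current_cgm(struct demo_signal *sig)
{
	assert(!sig->cgm_held);	/* toy lock: contention is not modeled */
	sig->cgm_held = true;

	if (!sig->exec_in_progress)
		return 0;

	sig->cgm_held = false;
	return -EINTR;
}

static void demo_unlock_current_cgm(struct demo_signal *sig)
{
	assert(sig->cgm_held);
	sig->cgm_held = false;
}

/* A caller no longer repeats the check-and-unlock pattern. */
static int demo_traceme(struct demo_signal *sig)
{
	int ret = demo_lock_current_cgm(sig);

	if (ret)
		return ret;

	/* ... the protected work would happen here ... */

	demo_unlock_current_cgm(sig);
	return 0;
}
```

Each of the three call sites discussed in this thread would reduce to the same two-line prologue used in demo_traceme().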
> @@ -453,6 +454,28 @@ static int ptrace_attach(struct task_struct *task, long request,
> return retval;
> }
>
> + if (unlikely(task == task->signal->group_exec_task)) {
> + retval = down_write_killable(&task->signal->exec_update_lock);
> + if (retval)
> + return retval;
> +
> + scoped_guard (task_lock, task) {
> + struct linux_binprm *bprm = task->signal->exec_bprm;
> + const struct cred __rcu *old_cred = task->real_cred;
> + struct mm_struct *old_mm = task->mm;
> +
> + rcu_assign_pointer(task->real_cred, bprm->cred);
> + task->mm = bprm->mm;
> + retval = __ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS);
> + rcu_assign_pointer(task->real_cred, old_cred);
> + task->mm = old_mm;
> + }
This is the most problematic change which I can't review...
Firstly, it changes task->mm/real_cred for __ptrace_may_access() and this
looks dangerous to me.
Say, current_is_single_threaded() called by another CLONE_VM process can
miss group_exec_task and falsely return true. Probably not that bad, in
this case old_mm should go away soon, but still...
And I don't know if this can fool the users of task_cred_xxx/__task_cred
somehow.
Or. check_unsafe_exec() sets LSM_UNSAFE_PTRACE if ptrace. Is it safe to
ptrace the execing task after that? I have no idea what the security hooks
can do...
Again, can't review this part.
Oleg.
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH v17] exec: Fix dead-lock in de_thread with ptrace_attach
2025-11-05 14:32 ` Oleg Nesterov
@ 2025-11-11 9:21 ` Christian Brauner
2025-11-11 11:07 ` Bernd Edlinger
2025-11-17 6:31 ` Bernd Edlinger
0 siblings, 2 replies; 68+ messages in thread
From: Christian Brauner @ 2025-11-11 9:21 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Bernd Edlinger, Alexander Viro, Alexey Dobriyan, Kees Cook,
Andy Lutomirski, Will Drewry, Andrew Morton, Michal Hocko,
Serge Hallyn, James Morris, Randy Dunlap, Suren Baghdasaryan,
Yafang Shao, Helge Deller, Eric W. Biederman, Adrian Reber,
Thomas Gleixner, Jens Axboe, Alexei Starovoitov, linux-fsdevel,
linux-kernel, linux-kselftest, linux-mm, linux-security-module,
tiozhang, Luis Chamberlain, Paulo Alcantara (SUSE),
Sergey Senozhatsky, Frederic Weisbecker, YueHaibing, Paul Moore,
Aleksa Sarai, Stefan Roesch, Chao Yu, xu xin, Jeff Layton,
Jan Kara, David Hildenbrand, Dave Chinner, Shuah Khan,
Elena Reshetova, David Windsor, Mateusz Guzik, Ard Biesheuvel,
Joel Fernandes (Google), Matthew Wilcox (Oracle),
Hans Liljestrand, Penglei Jiang, Lorenzo Stoakes, Adrian Ratiu,
Ingo Molnar, Peter Zijlstra (Intel),
Cyrill Gorcunov, Eric Dumazet
On Wed, Nov 05, 2025 at 03:32:10PM +0100, Oleg Nesterov wrote:
> I am still thinking about another approach, will write another email.
> But let me take a closer look at your patch.
>
> First of all, can you split it? See below.
>
> On 08/21, Bernd Edlinger wrote:
> >
> > -static int de_thread(struct task_struct *tsk)
> > +static int de_thread(struct task_struct *tsk, struct linux_binprm *bprm)
> > {
> > struct signal_struct *sig = tsk->signal;
> > struct sighand_struct *oldsighand = tsk->sighand;
> > spinlock_t *lock = &oldsighand->siglock;
> > + struct task_struct *t;
> > + bool unsafe_execve_in_progress = false;
> >
> > if (thread_group_empty(tsk))
> > goto no_thread_group;
> > @@ -932,6 +934,19 @@ static int de_thread(struct task_struct *tsk)
> > if (!thread_group_leader(tsk))
> > sig->notify_count--;
> >
> > + for_other_threads(tsk, t) {
> > + if (unlikely(t->ptrace)
> > + && (t != tsk->group_leader || !t->exit_state))
> > + unsafe_execve_in_progress = true;
>
> you can add "break" into the "if ()" block...
>
> But this is minor. Why do we need "bool unsafe_execve_in_progress" ?
> If this patch is correct, de_thread() can drop/reacquire cred_guard_mutex
> unconditionally.
>
> If you really think it makes sense, please make another patch with the
> changelog.
>
> I'd certainly prefer to avoid this boolean at least for the start. If nothing
> else to catch the potential problems earlier.
>
> > + if (unlikely(unsafe_execve_in_progress)) {
> > + spin_unlock_irq(lock);
> > + sig->exec_bprm = bprm;
> > + mutex_unlock(&sig->cred_guard_mutex);
> > + spin_lock_irq(lock);
>
> I don't think spin_unlock_irq() + spin_lock_irq() makes any sense...
>
> > @@ -1114,13 +1139,31 @@ int begin_new_exec(struct linux_binprm * bprm)
> > */
> > trace_sched_prepare_exec(current, bprm);
> >
> > + /* If the binary is not readable then enforce mm->dumpable=0 */
> > + would_dump(bprm, bprm->file);
> > + if (bprm->have_execfd)
> > + would_dump(bprm, bprm->executable);
> > +
> > + /*
> > + * Figure out dumpability. Note that this checking only of current
> > + * is wrong, but userspace depends on it. This should be testing
> > + * bprm->secureexec instead.
> > + */
> > + if (bprm->interp_flags & BINPRM_FLAGS_ENFORCE_NONDUMP ||
> > + is_dumpability_changed(current_cred(), bprm->cred) ||
> > + !(uid_eq(current_euid(), current_uid()) &&
> > + gid_eq(current_egid(), current_gid())))
> > + set_dumpable(bprm->mm, suid_dumpable);
> > + else
> > + set_dumpable(bprm->mm, SUID_DUMP_USER);
> > +
>
> OK, we need to do this before de_thread() drops cred_guard_mutex.
> But imo this too should be done in a separate patch, the changelog should
> explain this change.
>
> > @@ -1361,6 +1387,11 @@ static int prepare_bprm_creds(struct linux_binprm *bprm)
> > if (mutex_lock_interruptible(&current->signal->cred_guard_mutex))
> > return -ERESTARTNOINTR;
> >
> > + if (unlikely(current->signal->exec_bprm)) {
> > + mutex_unlock(&current->signal->cred_guard_mutex);
> > + return -ERESTARTNOINTR;
> > + }
>
> OK, if signal->exec_bprm != NULL, then current is already killed. But
> proc_pid_attr_write() and ptrace_traceme() do the same. So how about
> something like
>
> int lock_current_cgm(void)
> {
> if (mutex_lock_interruptible(&current->signal->cred_guard_mutex))
> return -ERESTARTNOINTR;
>
> if (!current->signal->group_exec_task)
> return 0;
>
> WARN_ON(!fatal_signal_pending(current));
> mutex_unlock(&current->signal->cred_guard_mutex);
> return -ERESTARTNOINTR;
> }
>
> ?
>
> Note that it checks ->group_exec_task, not ->exec_bprm. So this change can
> come in a separate patch too, but I won't insist.
>
> > @@ -453,6 +454,28 @@ static int ptrace_attach(struct task_struct *task, long request,
> > return retval;
> > }
> >
> > + if (unlikely(task == task->signal->group_exec_task)) {
> > + retval = down_write_killable(&task->signal->exec_update_lock);
> > + if (retval)
> > + return retval;
> > +
> > + scoped_guard (task_lock, task) {
> > + struct linux_binprm *bprm = task->signal->exec_bprm;
> > + const struct cred __rcu *old_cred = task->real_cred;
> > + struct mm_struct *old_mm = task->mm;
> > +
> > + rcu_assign_pointer(task->real_cred, bprm->cred);
> > + task->mm = bprm->mm;
> > + retval = __ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS);
> > + rcu_assign_pointer(task->real_cred, old_cred);
> > + task->mm = old_mm;
> > + }
>
> This is the most problematic change which I can't review...
>
> Firstly, it changes task->mm/real_cred for __ptrace_may_access() and this
> looks dangerous to me.
Yeah, that is not ok. This is effectively override_creds for real_cred
and that is not a pattern I want to see us establish at all! Temporary
credential overrides for the subjective credentials is already terrible
but at least we have the explicit split between real_cred and cred
expressly for that. So no, that's not an acceptable solution.
>
> Say, current_is_single_threaded() called by another CLONE_VM process can
> miss group_exec_task and falsely return true. Probably not that bad, in
> this case old_mm should go away soon, but still...
>
> And I don't know if this can fool the users of task_cred_xxx/__task_cred
> somehow.
>
> Or. check_unsafe_exec() sets LSM_UNSAFE_PTRACE if ptrace. Is it safe to
> ptrace the execing task after that? I have no idea what the security hooks
> can do...
>
> Again, can't review this part.
>
> Oleg.
>
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH v17] exec: Fix dead-lock in de_thread with ptrace_attach
2025-11-11 9:21 ` Christian Brauner
@ 2025-11-11 11:07 ` Bernd Edlinger
2025-11-11 13:12 ` Oleg Nesterov
2025-11-17 6:31 ` Bernd Edlinger
1 sibling, 1 reply; 68+ messages in thread
From: Bernd Edlinger @ 2025-11-11 11:07 UTC (permalink / raw)
To: Christian Brauner, Oleg Nesterov
Cc: Alexander Viro, Alexey Dobriyan, Kees Cook, Andy Lutomirski,
Will Drewry, Andrew Morton, Michal Hocko, Serge Hallyn,
James Morris, Randy Dunlap, Suren Baghdasaryan, Yafang Shao,
Helge Deller, Eric W. Biederman, Adrian Reber, Thomas Gleixner,
Jens Axboe, Alexei Starovoitov, linux-fsdevel, linux-kernel,
linux-kselftest, linux-mm, linux-security-module, tiozhang,
Luis Chamberlain, Paulo Alcantara (SUSE),
Sergey Senozhatsky, Frederic Weisbecker, YueHaibing, Paul Moore,
Aleksa Sarai, Stefan Roesch, Chao Yu, xu xin, Jeff Layton,
Jan Kara, David Hildenbrand, Dave Chinner, Shuah Khan,
Elena Reshetova, David Windsor, Mateusz Guzik, Ard Biesheuvel,
Joel Fernandes (Google), Matthew Wilcox (Oracle),
Hans Liljestrand, Penglei Jiang, Lorenzo Stoakes, Adrian Ratiu,
Ingo Molnar, Peter Zijlstra (Intel),
Cyrill Gorcunov, Eric Dumazet
On 11/11/25 10:21, Christian Brauner wrote:
> On Wed, Nov 05, 2025 at 03:32:10PM +0100, Oleg Nesterov wrote:
>> I am still thinking about another approach, will write another email.
>> But let me take a closer look at your patch.
>>
>> First of all, can you split it? See below.
>>
>> On 08/21, Bernd Edlinger wrote:
>>>
>>> -static int de_thread(struct task_struct *tsk)
>>> +static int de_thread(struct task_struct *tsk, struct linux_binprm *bprm)
>>> {
>>> struct signal_struct *sig = tsk->signal;
>>> struct sighand_struct *oldsighand = tsk->sighand;
>>> spinlock_t *lock = &oldsighand->siglock;
>>> + struct task_struct *t;
>>> + bool unsafe_execve_in_progress = false;
>>>
>>> if (thread_group_empty(tsk))
>>> goto no_thread_group;
>>> @@ -932,6 +934,19 @@ static int de_thread(struct task_struct *tsk)
>>> if (!thread_group_leader(tsk))
>>> sig->notify_count--;
>>>
>>> + for_other_threads(tsk, t) {
>>> + if (unlikely(t->ptrace)
>>> + && (t != tsk->group_leader || !t->exit_state))
>>> + unsafe_execve_in_progress = true;
>>
>> you can add "break" into the "if ()" block...
>>
>> But this is minor. Why do we need "bool unsafe_execve_in_progress" ?
>> If this patch is correct, de_thread() can drop/reacquire cred_guard_mutex
>> unconditionally.
>>
>> If you really think it makes sense, please make another patch with the
>> changelog.
>>
>> I'd certainly prefer to avoid this boolean at least for the start. If nothing
>> else to catch the potential problems earlier.
>>
>>> + if (unlikely(unsafe_execve_in_progress)) {
>>> + spin_unlock_irq(lock);
>>> + sig->exec_bprm = bprm;
>>> + mutex_unlock(&sig->cred_guard_mutex);
>>> + spin_lock_irq(lock);
>>
>> I don't think spin_unlock_irq() + spin_lock_irq() makes any sense...
>>
>>> @@ -1114,13 +1139,31 @@ int begin_new_exec(struct linux_binprm * bprm)
>>> */
>>> trace_sched_prepare_exec(current, bprm);
>>>
>>> + /* If the binary is not readable then enforce mm->dumpable=0 */
>>> + would_dump(bprm, bprm->file);
>>> + if (bprm->have_execfd)
>>> + would_dump(bprm, bprm->executable);
>>> +
>>> + /*
>>> + * Figure out dumpability. Note that this checking only of current
>>> + * is wrong, but userspace depends on it. This should be testing
>>> + * bprm->secureexec instead.
>>> + */
>>> + if (bprm->interp_flags & BINPRM_FLAGS_ENFORCE_NONDUMP ||
>>> + is_dumpability_changed(current_cred(), bprm->cred) ||
>>> + !(uid_eq(current_euid(), current_uid()) &&
>>> + gid_eq(current_egid(), current_gid())))
>>> + set_dumpable(bprm->mm, suid_dumpable);
>>> + else
>>> + set_dumpable(bprm->mm, SUID_DUMP_USER);
>>> +
>>
>> OK, we need to do this before de_thread() drops cred_guard_mutex.
>> But imo this too should be done in a separate patch, the changelog should
>> explain this change.
>>
>>> @@ -1361,6 +1387,11 @@ static int prepare_bprm_creds(struct linux_binprm *bprm)
>>> if (mutex_lock_interruptible(&current->signal->cred_guard_mutex))
>>> return -ERESTARTNOINTR;
>>>
>>> + if (unlikely(current->signal->exec_bprm)) {
>>> + mutex_unlock(&current->signal->cred_guard_mutex);
>>> + return -ERESTARTNOINTR;
>>> + }
>>
>> OK, if signal->exec_bprm != NULL, then current is already killed. But
>> proc_pid_attr_write() and ptrace_traceme() do the same. So how about
>> something like
>>
>> int lock_current_cgm(void)
>> {
>> if (mutex_lock_interruptible(&current->signal->cred_guard_mutex))
>> return -ERESTARTNOINTR;
>>
>> if (!current->signal->group_exec_task)
>> return 0;
>>
>> WARN_ON(!fatal_signal_pending(current));
>> mutex_unlock(&current->signal->cred_guard_mutex);
>> return -ERESTARTNOINTR;
>> }
>>
>> ?
>>
>> Note that it checks ->group_exec_task, not ->exec_bprm. So this change can
>> come in a separate patch too, but I won't insist.
>>
>>> @@ -453,6 +454,28 @@ static int ptrace_attach(struct task_struct *task, long request,
>>> return retval;
>>> }
>>>
>>> + if (unlikely(task == task->signal->group_exec_task)) {
>>> + retval = down_write_killable(&task->signal->exec_update_lock);
>>> + if (retval)
>>> + return retval;
>>> +
>>> + scoped_guard (task_lock, task) {
>>> + struct linux_binprm *bprm = task->signal->exec_bprm;
>>> + const struct cred __rcu *old_cred = task->real_cred;
>>> + struct mm_struct *old_mm = task->mm;
>>> +
>>> + rcu_assign_pointer(task->real_cred, bprm->cred);
>>> + task->mm = bprm->mm;
>>> + retval = __ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS);
>>> + rcu_assign_pointer(task->real_cred, old_cred);
>>> + task->mm = old_mm;
>>> + }
>>
>> This is the most problematic change which I can't review...
>>
>> Firstly, it changes task->mm/real_cred for __ptrace_may_access() and this
>> looks dangerous to me.
>
> Yeah, that is not ok. This is effectively override_creds for real_cred
> and that is not a pattern I want to see us establish at all! Temporary
> credential overrides for the subjective credentials is already terrible
> but at least we have the explicit split between real_cred and cred
> expressly for that. So no, that's not an acceptable solution.
>
Well when this is absolutely not acceptable then I would have to change
all security engines to be aware of the current and the new credentials.
That may well be possible, but it would be a rather big change.
Of course that was only meant as a big exception, and somehow safe
as long as it is protected under the right mutexes: cred_guard_mutex,
exec_update_lock and task_lock at least.
>>
>> Say, current_is_single_threaded() called by another CLONE_VM process can
>> miss group_exec_task and falsely return true. Probably not that bad, in
>> this case old_mm should go away soon, but still...
>>
>> And I don't know if this can fool the users of task_cred_xxx/__task_cred
>> somehow.
>>
>> Or. check_unsafe_exec() sets LSM_UNSAFE_PTRACE if ptrace. Is it safe to
>> ptrace the execing task after that? I have no idea what the security hooks
>> can do...
>>
>> Again, can't review this part.
>>
Never mind, your review was really helpful. At the very least it pointed
out some places where better comments are needed.
Thanks
Bernd.
>> Oleg.
>>
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH v17] exec: Fix dead-lock in de_thread with ptrace_attach
2025-11-11 11:07 ` Bernd Edlinger
@ 2025-11-11 13:12 ` Oleg Nesterov
2025-11-11 13:45 ` Bernd Edlinger
0 siblings, 1 reply; 68+ messages in thread
From: Oleg Nesterov @ 2025-11-11 13:12 UTC (permalink / raw)
To: Bernd Edlinger
Cc: Christian Brauner, Alexander Viro, Alexey Dobriyan, Kees Cook,
Andy Lutomirski, Will Drewry, Andrew Morton, Michal Hocko,
Serge Hallyn, James Morris, Randy Dunlap, Suren Baghdasaryan,
Yafang Shao, Helge Deller, Eric W. Biederman, Adrian Reber,
Thomas Gleixner, Jens Axboe, Alexei Starovoitov, linux-fsdevel,
linux-kernel, linux-kselftest, linux-mm, linux-security-module,
tiozhang, Luis Chamberlain, Paulo Alcantara (SUSE),
Sergey Senozhatsky, Frederic Weisbecker, YueHaibing, Paul Moore,
Aleksa Sarai, Stefan Roesch, Chao Yu, xu xin, Jeff Layton,
Jan Kara, David Hildenbrand, Dave Chinner, Shuah Khan,
Elena Reshetova, David Windsor, Mateusz Guzik, Ard Biesheuvel,
Joel Fernandes (Google), Matthew Wilcox (Oracle),
Hans Liljestrand, Penglei Jiang, Lorenzo Stoakes, Adrian Ratiu,
Ingo Molnar, Peter Zijlstra (Intel),
Cyrill Gorcunov, Eric Dumazet
On 11/11, Bernd Edlinger wrote:
>
> On 11/11/25 10:21, Christian Brauner wrote:
> > On Wed, Nov 05, 2025 at 03:32:10PM +0100, Oleg Nesterov wrote:
> >>
> >> This is the most problematic change which I can't review...
> >>
> >> Firstly, it changes task->mm/real_cred for __ptrace_may_access() and this
> >> looks dangerous to me.
> >
> > Yeah, that is not ok. This is effectively override_creds for real_cred
> > and that is not a pattern I want to see us establish at all! Temporary
> > credential overrides for the subjective credentials is already terrible
> > but at least we have the explicit split between real_cred and cred
> > expressly for that. So no, that's not an acceptable solution.
> >
>
> Well when this is absolutely not acceptable then I would have to change
> all security engines to be aware of the current and the new credentials.
Hmm... even if we find another way to avoid the deadlock? Say, the patches
I sent...
Oleg.
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH v17] exec: Fix dead-lock in de_thread with ptrace_attach
2025-11-11 13:12 ` Oleg Nesterov
@ 2025-11-11 13:45 ` Bernd Edlinger
2025-11-12 9:52 ` Oleg Nesterov
0 siblings, 1 reply; 68+ messages in thread
From: Bernd Edlinger @ 2025-11-11 13:45 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Christian Brauner, Alexander Viro, Alexey Dobriyan, Kees Cook,
Andy Lutomirski, Will Drewry, Andrew Morton, Michal Hocko,
Serge Hallyn, James Morris, Randy Dunlap, Suren Baghdasaryan,
Yafang Shao, Helge Deller, Eric W. Biederman, Adrian Reber,
Thomas Gleixner, Jens Axboe, Alexei Starovoitov, linux-fsdevel,
linux-kernel, linux-kselftest, linux-mm, linux-security-module,
tiozhang, Luis Chamberlain, Paulo Alcantara (SUSE),
Sergey Senozhatsky, Frederic Weisbecker, YueHaibing, Paul Moore,
Aleksa Sarai, Stefan Roesch, Chao Yu, xu xin, Jeff Layton,
Jan Kara, David Hildenbrand, Dave Chinner, Shuah Khan,
Elena Reshetova, David Windsor, Mateusz Guzik, Ard Biesheuvel,
Joel Fernandes (Google), Matthew Wilcox (Oracle),
Hans Liljestrand, Penglei Jiang, Lorenzo Stoakes, Adrian Ratiu,
Ingo Molnar, Peter Zijlstra (Intel),
Cyrill Gorcunov, Eric Dumazet
On 11/11/25 14:12, Oleg Nesterov wrote:
> On 11/11, Bernd Edlinger wrote:
>>
>> On 11/11/25 10:21, Christian Brauner wrote:
>>> On Wed, Nov 05, 2025 at 03:32:10PM +0100, Oleg Nesterov wrote:
>>>>
>>>> This is the most problematic change which I can't review...
>>>>
>>>> Firstly, it changes task->mm/real_cred for __ptrace_may_access() and this
>>>> looks dangerous to me.
>>>
>>> Yeah, that is not ok. This is effectively override_creds for real_cred
>>> and that is not a pattern I want to see us establish at all! Temporary
>>> credential overrides for the subjective credentials is already terrible
>>> but at least we have the explicit split between real_cred and cred
>>> expressly for that. So no, that's not an acceptable solution.
>>>
>>
>> Well when this is absolutely not acceptable then I would have to change
>> all security engines to be aware of the current and the new credentials.
>
> Hmm... even if we find another way to avoid the deadlock? Say, the patches
> I sent...
>
Maybe, but it looks almost too simple ;-)
164 sleep(2);
165 /* deadlock may happen here */
166 k = ptrace(PTRACE_ATTACH, thread2_tid, 0L, 0L);
what happens if you change the test expectation here, that the
ptrace may fail instead of succeed?
What signals does the debugger receive after that point?
Is the debugger notified that the debugged process continues,
has the same PID, and is no longer ptraced?
Thanks
Bernd.
> Oleg.
>
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH v17] exec: Fix dead-lock in de_thread with ptrace_attach
2025-11-11 13:45 ` Bernd Edlinger
@ 2025-11-12 9:52 ` Oleg Nesterov
0 siblings, 0 replies; 68+ messages in thread
From: Oleg Nesterov @ 2025-11-12 9:52 UTC (permalink / raw)
To: Bernd Edlinger
Cc: Christian Brauner, Alexander Viro, Alexey Dobriyan, Kees Cook,
Andy Lutomirski, Will Drewry, Andrew Morton, Michal Hocko,
Serge Hallyn, James Morris, Randy Dunlap, Suren Baghdasaryan,
Yafang Shao, Helge Deller, Eric W. Biederman, Adrian Reber,
Thomas Gleixner, Jens Axboe, Alexei Starovoitov, linux-fsdevel,
linux-kernel, linux-kselftest, linux-mm, linux-security-module,
tiozhang, Luis Chamberlain, Paulo Alcantara (SUSE),
Sergey Senozhatsky, Frederic Weisbecker, YueHaibing, Paul Moore,
Aleksa Sarai, Stefan Roesch, Chao Yu, xu xin, Jeff Layton,
Jan Kara, David Hildenbrand, Dave Chinner, Shuah Khan,
Elena Reshetova, David Windsor, Mateusz Guzik, Ard Biesheuvel,
Joel Fernandes (Google), Matthew Wilcox (Oracle),
Hans Liljestrand, Penglei Jiang, Lorenzo Stoakes, Adrian Ratiu,
Ingo Molnar, Peter Zijlstra (Intel),
Cyrill Gorcunov, Eric Dumazet
On 11/11, Bernd Edlinger wrote:
>
> On 11/11/25 14:12, Oleg Nesterov wrote:
> > On 11/11, Bernd Edlinger wrote:
> >>
> >> Well when this is absolutely not acceptable then I would have to change
> >> all security engines to be aware of the current and the new credentials.
> >
> > Hmm... even if we find another way to avoid the deadlock? Say, the patches
> > I sent...
> >
>
> Maybe, but it looks almost too simple ;-)
>
> 164 sleep(2);
> 165 /* deadlock may happen here */
> 166 k = ptrace(PTRACE_ATTACH, thread2_tid, 0L, 0L);
>
> what happens if you change the test expectation here, that the
> ptrace may fail instead of succeed?
>
> What signals does the debugger receive after that point?
> Is the debugger notified that the debugged process continues,
> has the same PID, and is no longer ptraced?
Ah, but this is another thing... OK, you dislike 3/3 and I have to agree.
Yes, de_thread() silently untraces/reaps the old leader and after 3/3 the debugger
can't rely on PTRACE_EVENT_EXIT, so unless the debugger has already attached to
all sub-threads (at least to the execing thread) it looks as if the leader was just
untraced somehow.
OK, this is probably too bad, we need another solution...
Oleg.
* Re: [PATCH v17] exec: Fix dead-lock in de_thread with ptrace_attach
2025-11-11 9:21 ` Christian Brauner
2025-11-11 11:07 ` Bernd Edlinger
@ 2025-11-17 6:31 ` Bernd Edlinger
2025-11-17 15:01 ` Oleg Nesterov
1 sibling, 1 reply; 68+ messages in thread
From: Bernd Edlinger @ 2025-11-17 6:31 UTC (permalink / raw)
To: Christian Brauner, Oleg Nesterov
Cc: Alexander Viro, Alexey Dobriyan, Kees Cook, Andy Lutomirski,
Will Drewry, Andrew Morton, Michal Hocko, Serge Hallyn,
James Morris, Randy Dunlap, Suren Baghdasaryan, Yafang Shao,
Helge Deller, Eric W. Biederman, Adrian Reber, Thomas Gleixner,
Jens Axboe, Alexei Starovoitov, linux-fsdevel, linux-kernel,
linux-kselftest, linux-mm, linux-security-module, tiozhang,
Luis Chamberlain, Paulo Alcantara (SUSE),
Sergey Senozhatsky, Frederic Weisbecker, YueHaibing, Paul Moore,
Aleksa Sarai, Stefan Roesch, Chao Yu, xu xin, Jeff Layton,
Jan Kara, David Hildenbrand, Dave Chinner, Shuah Khan,
Elena Reshetova, David Windsor, Mateusz Guzik, Ard Biesheuvel,
Joel Fernandes (Google), Matthew Wilcox (Oracle),
Hans Liljestrand, Penglei Jiang, Lorenzo Stoakes, Adrian Ratiu,
Ingo Molnar, Peter Zijlstra (Intel),
Cyrill Gorcunov, Eric Dumazet
On 11/11/25 10:21, Christian Brauner wrote:
> On Wed, Nov 05, 2025 at 03:32:10PM +0100, Oleg Nesterov wrote:
>> I am still thinking about another approach, will write another email.
>> But let me take a closer look at your patch.
>>
>> First of all, can you split it? See below.
>>
>> On 08/21, Bernd Edlinger wrote:
>>>
>>> -static int de_thread(struct task_struct *tsk)
>>> +static int de_thread(struct task_struct *tsk, struct linux_binprm *bprm)
>>> {
>>> struct signal_struct *sig = tsk->signal;
>>> struct sighand_struct *oldsighand = tsk->sighand;
>>> spinlock_t *lock = &oldsighand->siglock;
>>> + struct task_struct *t;
>>> + bool unsafe_execve_in_progress = false;
>>>
>>> if (thread_group_empty(tsk))
>>> goto no_thread_group;
>>> @@ -932,6 +934,19 @@ static int de_thread(struct task_struct *tsk)
>>> if (!thread_group_leader(tsk))
>>> sig->notify_count--;
>>>
>>> + for_other_threads(tsk, t) {
>>> + if (unlikely(t->ptrace)
>>> + && (t != tsk->group_leader || !t->exit_state))
>>> + unsafe_execve_in_progress = true;
>>
>> you can add "break" into the "if ()" block...
>>
ok.
>> But this is minor. Why do we need "bool unsafe_execve_in_progress" ?
>> If this patch is correct, de_thread() can drop/reacquire cred_guard_mutex
>> unconditionally.
>>
I would not like to drop the mutex when not absolutely necessary, for performance reasons.
But I can at least try out whether something crashes immediately if that is always done.
>> If you really think it makes sense, please make another patch with the
>> changelog.
>>
>> I'd certainly prefer to avoid this boolean at least for the start. If nothing
>> else to catch the potential problems earlier.
>>
>>> + if (unlikely(unsafe_execve_in_progress)) {
>>> + spin_unlock_irq(lock);
>>> + sig->exec_bprm = bprm;
>>> + mutex_unlock(&sig->cred_guard_mutex);
>>> + spin_lock_irq(lock);
>>
>> I don't think spin_unlock_irq() + spin_lock_irq() makes any sense...
>>
Since the spin lock was acquired while holding the mutex, both should be
unlocked in reverse sequence and the spin lock re-acquired after releasing
the mutex.
I'd expect the scheduler to do a task switch after the cred_guard_mutex is
unlocked, at least in the RT-linux variant, while the spin lock is not yet
unlocked.
>>> @@ -1114,13 +1139,31 @@ int begin_new_exec(struct linux_binprm * bprm)
>>> */
>>> trace_sched_prepare_exec(current, bprm);
>>>
>>> + /* If the binary is not readable then enforce mm->dumpable=0 */
>>> + would_dump(bprm, bprm->file);
>>> + if (bprm->have_execfd)
>>> + would_dump(bprm, bprm->executable);
>>> +
>>> + /*
>>> + * Figure out dumpability. Note that this checking only of current
>>> + * is wrong, but userspace depends on it. This should be testing
>>> + * bprm->secureexec instead.
>>> + */
>>> + if (bprm->interp_flags & BINPRM_FLAGS_ENFORCE_NONDUMP ||
>>> + is_dumpability_changed(current_cred(), bprm->cred) ||
>>> + !(uid_eq(current_euid(), current_uid()) &&
>>> + gid_eq(current_egid(), current_gid())))
>>> + set_dumpable(bprm->mm, suid_dumpable);
>>> + else
>>> + set_dumpable(bprm->mm, SUID_DUMP_USER);
>>> +
>>
>> OK, we need to do this before de_thread() drops cred_guard_mutex.
>> But imo this too should be done in a separate patch, the changelog should
>> explain this change.
>>
The dumpability needs to be determined before de_thread, because ptrace_may_access
needs this information to determine if the tracer is allowed to ptrace. That is
part of the core of the patch, it would not work without that.
I will add more comments to make that easier to understand.
>>> @@ -1361,6 +1387,11 @@ static int prepare_bprm_creds(struct linux_binprm *bprm)
>>> if (mutex_lock_interruptible(&current->signal->cred_guard_mutex))
>>> return -ERESTARTNOINTR;
>>>
>>> + if (unlikely(current->signal->exec_bprm)) {
>>> + mutex_unlock(&current->signal->cred_guard_mutex);
>>> + return -ERESTARTNOINTR;
>>> + }
>>
>> OK, if signal->exec_bprm != NULL, then current is already killed. But
>> proc_pid_attr_write() and ptrace_traceme() do the same. So how about
>> something like
>>
>> int lock_current_cgm(void)
>> {
>> if (mutex_lock_interruptible(&current->signal->cred_guard_mutex))
>> return -ERESTARTNOINTR;
>>
>> if (!current->signal->group_exec_task)
>> return 0;
>>
>> WARN_ON(!fatal_signal_pending(current));
>> mutex_unlock(&current->signal->cred_guard_mutex);
>> return -ERESTARTNOINTR;
>> }
>>
>> ?
>>
Some use mutex_lock_interruptible and some use mutex_lock_killable here,
so it won't work for all of them. I would not consider this a new kind
of dead-lock free mutex, but just an open-coded state machine, handling
the state that the tasks have while de_thread is running.
>> Note that it checks ->group_exec_task, not ->exec_bprm. So this change can
>> come in a separate patch too, but I won't insist.
>>
>>> @@ -453,6 +454,28 @@ static int ptrace_attach(struct task_struct *task, long request,
>>> return retval;
>>> }
>>>
>>> + if (unlikely(task == task->signal->group_exec_task)) {
>>> + retval = down_write_killable(&task->signal->exec_update_lock);
>>> + if (retval)
>>> + return retval;
>>> +
>>> + scoped_guard (task_lock, task) {
>>> + struct linux_binprm *bprm = task->signal->exec_bprm;
>>> + const struct cred __rcu *old_cred = task->real_cred;
>>> + struct mm_struct *old_mm = task->mm;
>>> +
>>> + rcu_assign_pointer(task->real_cred, bprm->cred);
>>> + task->mm = bprm->mm;
>>> + retval = __ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS);
>>> + rcu_assign_pointer(task->real_cred, old_cred);
>>> + task->mm = old_mm;
>>> + }
>>
>> This is the most problematic change which I can't review...
>>
>> Firstly, it changes task->mm/real_cred for __ptrace_may_access() and this
>> looks dangerous to me.
>
> Yeah, that is not ok. This is effectively override_creds for real_cred
> and that is not a pattern I want to see us establish at all! Temporary
> credential overrides for the subjective credentials is already terrible
> but at least we have the explicit split between real_cred and cred
> expressely for that. So no, that's not an acceptable solution.
>
Okay, I understand your point.
I did this originally just to avoid having to change the interface to all
the security engines, but instead I could add a flag PTRACE_MODE_BPRMCREDS to
ptrace_may_access, which must be handled in all security engines, to use
child->signal->exec_bprm->cred instead of __task_cred(child).
>>
>> Say, current_is_single_threaded() called by another CLONE_VM process can
>> miss group_exec_task and falsely return true. Probably not that bad, in
>> this case old_mm should go away soon, but still...
>>
Oh, that's nice, I was not aware of that one.
Accesses to current are not a problem, since the task is trapped in de_thread.
However, by code review I also found other places where task credentials are
checked and then used without holding any lock, e.g. in rdtgroup_task_write_permission
and in quite similar code in __cgroup1_procs_write. I don't know if that is a
security problem.
>> And I don't know if this can fool the users of task_cred_xxx/__task_cred
>> somehow.
>>
>> Or. check_unsafe_exec() sets LSM_UNSAFE_PTRACE if ptrace. Is it safe to
>> ptrace the execing task after that? I have no idea what the security hooks
>> can do...
That means the tracee is already ptraced before the execve, and SUID bits
do not work as usual, and are more or less ignored. But in this patch
the tracee is not yet ptraced, only some sibling threads are. They will
either be zapped and go away, or the tracer wants to attach to the main thread,
and in that case the tracer is only able to ptrace the main thread if it
also has access permissions for the credentials that are in effect after the execve
completes.
>>
>> Again, can't review this part.
>>
>> Oleg.
>>
Thanks
Bernd.
* Re: [PATCH v17] exec: Fix dead-lock in de_thread with ptrace_attach
2025-11-17 6:31 ` Bernd Edlinger
@ 2025-11-17 15:01 ` Oleg Nesterov
2025-11-17 20:08 ` Bernd Edlinger
0 siblings, 1 reply; 68+ messages in thread
From: Oleg Nesterov @ 2025-11-17 15:01 UTC (permalink / raw)
To: Bernd Edlinger
Cc: Christian Brauner, Alexander Viro, Alexey Dobriyan, Kees Cook,
Andy Lutomirski, Will Drewry, Andrew Morton, Michal Hocko,
Serge Hallyn, James Morris, Randy Dunlap, Suren Baghdasaryan,
Yafang Shao, Helge Deller, Eric W. Biederman, Adrian Reber,
Thomas Gleixner, Jens Axboe, Alexei Starovoitov, linux-fsdevel,
linux-kernel, linux-kselftest, linux-mm, linux-security-module,
tiozhang, Luis Chamberlain, Paulo Alcantara (SUSE),
Sergey Senozhatsky, Frederic Weisbecker, YueHaibing, Paul Moore,
Aleksa Sarai, Stefan Roesch, Chao Yu, xu xin, Jeff Layton,
Jan Kara, David Hildenbrand, Dave Chinner, Shuah Khan,
Elena Reshetova, David Windsor, Mateusz Guzik, Ard Biesheuvel,
Joel Fernandes (Google), Matthew Wilcox (Oracle),
Hans Liljestrand, Penglei Jiang, Lorenzo Stoakes, Adrian Ratiu,
Ingo Molnar, Peter Zijlstra (Intel),
Cyrill Gorcunov, Eric Dumazet
On 11/17, Bernd Edlinger wrote:
>
> On 11/11/25 10:21, Christian Brauner wrote:
> > On Wed, Nov 05, 2025 at 03:32:10PM +0100, Oleg Nesterov wrote:
>
> >> But this is minor. Why do we need "bool unsafe_execve_in_progress" ?
> >> If this patch is correct, de_thread() can drop/reacquire cred_guard_mutex
> >> unconditionally.
> >>
>
> I would not like to drop the mutex when no absolutely necessary for performance reasons.
OK, I won't insist... But I don't really understand how this can help to
improve the performance. If nothing else, this adds another for_other_threads()
loop.
And again, the unsafe_execve_in_progress == T case is unlikely. I'm afraid this
case (de_thread() without cred_guard_mutex) won't have enough testing.
In any case, why do you dislike the suggestion to add this unsafe_execve_in_progress
logic in a separate patch?
> >>> + if (unlikely(unsafe_execve_in_progress)) {
> >>> + spin_unlock_irq(lock);
> >>> + sig->exec_bprm = bprm;
> >>> + mutex_unlock(&sig->cred_guard_mutex);
> >>> + spin_lock_irq(lock);
> >>
> >> I don't think spin_unlock_irq() + spin_lock_irq() makes any sense...
> >>
>
> Since the spin lock was acquired while holding the mutex, both should be
> unlocked in reverse sequence and the spin lock re-acquired after releasing
> the mutex.
Why?
> I'd expect the scheduler to do a task switch after the cred_guard_mutex is
> unlocked, at least in the RT-linux variant, while the spin lock is not yet
> unlocked.
I must have missed something, but I still don't understand why this would
be wrong...
> >>> @@ -1114,13 +1139,31 @@ int begin_new_exec(struct linux_binprm * bprm)
> >>> */
> >>> trace_sched_prepare_exec(current, bprm);
> >>>
> >>> + /* If the binary is not readable then enforce mm->dumpable=0 */
> >>> + would_dump(bprm, bprm->file);
> >>> + if (bprm->have_execfd)
> >>> + would_dump(bprm, bprm->executable);
> >>> +
> >>> + /*
> >>> + * Figure out dumpability. Note that this checking only of current
> >>> + * is wrong, but userspace depends on it. This should be testing
> >>> + * bprm->secureexec instead.
> >>> + */
> >>> + if (bprm->interp_flags & BINPRM_FLAGS_ENFORCE_NONDUMP ||
> >>> + is_dumpability_changed(current_cred(), bprm->cred) ||
> >>> + !(uid_eq(current_euid(), current_uid()) &&
> >>> + gid_eq(current_egid(), current_gid())))
> >>> + set_dumpable(bprm->mm, suid_dumpable);
> >>> + else
> >>> + set_dumpable(bprm->mm, SUID_DUMP_USER);
> >>> +
> >>
> >> OK, we need to do this before de_thread() drops cred_guard_mutex.
> >> But imo this too should be done in a separate patch, the changelog should
> >> explain this change.
> >>
>
> The dumpability need to be determined before de_thread, because ptrace_may_access
> needs this information to determine if the tracer is allowed to ptrace. That is
> part of the core of the patch, it would not work without that.
Yes,
> I will add more comments to make that more easy to understand.
But again, why can't this change come in a separate patch? Before the patch which
drops cred_guard_mutex in de_thread().
> >> int lock_current_cgm(void)
> >> {
> >> if (mutex_lock_interruptible(&current->signal->cred_guard_mutex))
> >> return -ERESTARTNOINTR;
> >>
> >> if (!current->signal->group_exec_task)
> >> return 0;
> >>
> >> WARN_ON(!fatal_signal_pending(current));
> >> mutex_unlock(&current->signal->cred_guard_mutex);
> >> return -ERESTARTNOINTR;
> >> }
> >>
> >> ?
> >>
>
> Some use mutex_lock_interruptible and some use mutex_lock_killable here,
> so it wont work for all of them. I would not consider this a new kind
> of dead-lock free mutex, but just an open-coded state machine, handling
> the state that the tasks have whild de_thread is running.
OK. and we don't have mutex_lock_state(). I think that all users could
use mutex_lock_killable(), but you are right anyway, and this is minor.
> >> Note that it checks ->group_exec_task, not ->exec_bprm. So this change can
> >> come in a separate patch too, but I won't insist.
Yes. Although this is minor too ;)
> >> This is the most problematic change which I can't review...
> >>
> >> Firstly, it changes task->mm/real_cred for __ptrace_may_access() and this
> >> looks dangerous to me.
> >
> > Yeah, that is not ok. This is effectively override_creds for real_cred
> > and that is not a pattern I want to see us establish at all! Temporary
> > credential overrides for the subjective credentials is already terrible
> > but at least we have the explicit split between real_cred and cred
> > expressely for that. So no, that's not an acceptable solution.
> >
>
> Okay I understand your point.
> I did this originally just to avoid to have to change the interface to all
> the security engines, but instead I could add a flag PTRACE_MODE_BPRMCREDS to
> the ptrace_may_access which must be handled in all security engines, to use
> child->signal->exec_bprm->creds instead of __task_cred(child).
Can't comment... I don't understand your idea, but this is my fault. I guess
this needs more changes, in particular __ptrace_may_access_mm_cred(), but
most probably I misunderstood your idea.
>
> >> Or. check_unsafe_exec() sets LSM_UNSAFE_PTRACE if ptrace. Is it safe to
> >> ptrace the execing task after that? I have no idea what the security hooks
> >> can do...
>
> That means the tracee is already ptraced before the execve, and SUID-bits
> do not work as usual, and are more or less ignored. But in this patch
> the tracee is not yet ptraced.
Well. I meant that if LSM_UNSAFE_PTRACE is not set, then currently (say)
security_bprm_committing_creds() has all rights to assume that the execing
task is not ptraced. Yes, I don't see any potential problem right now, but
still.
And just in case... Let's look at this code
+ rcu_assign_pointer(task->real_cred, bprm->cred);
+ task->mm = bprm->mm;
+ retval = __ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS);
+ rcu_assign_pointer(task->real_cred, old_cred);
+ task->mm = old_mm;
again.
This is mostly theoretical, but what if begin_new_exec() fails after de_thread()
and before exec_mmap() and/or commit_creds(bprm->cred) ? In this case the execing
thread will report SIGSEGV to debugger which can (say) read old_mm.
No?
I am starting to think that ptrace_attach() should simply fail with -EWOULDBLOCK
if it detects "unsafe_execve_in_progress" ... And perhaps this is what you already
tried to do in the past, I can't recall :/
Oleg.
* Re: [PATCH v17] exec: Fix dead-lock in de_thread with ptrace_attach
2025-11-17 15:01 ` Oleg Nesterov
@ 2025-11-17 20:08 ` Bernd Edlinger
2025-11-23 18:32 ` Oleg Nesterov
0 siblings, 1 reply; 68+ messages in thread
From: Bernd Edlinger @ 2025-11-17 20:08 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Christian Brauner, Alexander Viro, Alexey Dobriyan, Kees Cook,
Andy Lutomirski, Will Drewry, Andrew Morton, Michal Hocko,
Serge Hallyn, James Morris, Randy Dunlap, Suren Baghdasaryan,
Yafang Shao, Helge Deller, Eric W. Biederman, Adrian Reber,
Thomas Gleixner, Jens Axboe, Alexei Starovoitov, linux-fsdevel,
linux-kernel, linux-kselftest, linux-mm, linux-security-module,
tiozhang, Luis Chamberlain, Paulo Alcantara (SUSE),
Sergey Senozhatsky, Frederic Weisbecker, YueHaibing, Paul Moore,
Aleksa Sarai, Stefan Roesch, Chao Yu, xu xin, Jeff Layton,
Jan Kara, David Hildenbrand, Dave Chinner, Shuah Khan,
Elena Reshetova, David Windsor, Mateusz Guzik, Ard Biesheuvel,
Joel Fernandes (Google), Matthew Wilcox (Oracle),
Hans Liljestrand, Penglei Jiang, Lorenzo Stoakes, Adrian Ratiu,
Ingo Molnar, Peter Zijlstra (Intel),
Cyrill Gorcunov, Eric Dumazet
On 11/17/25 16:01, Oleg Nesterov wrote:
> On 11/17, Bernd Edlinger wrote:
>>
>> On 11/11/25 10:21, Christian Brauner wrote:
>>> On Wed, Nov 05, 2025 at 03:32:10PM +0100, Oleg Nesterov wrote:
>>
>>>> But this is minor. Why do we need "bool unsafe_execve_in_progress" ?
>>>> If this patch is correct, de_thread() can drop/reacquire cred_guard_mutex
>>>> unconditionally.
>>>>
>>
>> I would not like to drop the mutex when no absolutely necessary for performance reasons.
>
> OK, I won't insist... But I don't really understand how this can help to
> improve the performance. If nothing else, this adds another for_other_threads()
> loop.
>
If no dead-lock is possible it is better to complete the de_thread without
releasing the mutex. For the debugger it is also the better experience:
no matter when the ptrace_attach happens, it will succeed rather quickly, either
before the execve or after the execve.
> And again, the unsafe_execve_in_progress == T case is unlikely. I'm afraid this
> case (de_thread() without cred_guard_mutex) won't have enough testing.
>
The same is true for a multi-threaded application that does execve while other threads
are still alive. But I have test cases, and they are pretty good at reproducing the
dead-locks.
> In any case, why you dislike the suggestion to add this unsafe_execve_in_progress
> logic in a separate patch?
>
I do not want to regress use cases where there is no dead-lock possible.
The saying "if it ain't broke, don't fix it" means you shouldn't try to change something
that is already working well, because meddling with it could potentially make it worse.
>>>>> + if (unlikely(unsafe_execve_in_progress)) {
>>>>> + spin_unlock_irq(lock);
>>>>> + sig->exec_bprm = bprm;
>>>>> + mutex_unlock(&sig->cred_guard_mutex);
>>>>> + spin_lock_irq(lock);
>>>>
>>>> I don't think spin_unlock_irq() + spin_lock_irq() makes any sense...
>>>>
>>
>> Since the spin lock was acquired while holding the mutex, both should be
>> unlocked in reverse sequence and the spin lock re-acquired after releasing
>> the mutex.
>
> Why?
>
It is generally safer when each thread acquires its mutexes in order and
releases them in reverse order.
Consider this:
Thread A:
holds spin_lock_irq(siglock);
does mutex_unlock(cred_guard_mutex); with irq disabled.
task switch happens to Thread B which has irq enabled
and is waiting for cred_guard_mutex.
Thread B:
does mutex_lock(cred_guard_mutex);
but is interrupted at this point and the interrupt handler I executes.
now interrupt handler I wants to take siglock and is blocked,
because the system has one single CPU core.
>> I'd expect the scheduler to do a task switch after the cred_guard_mutex is
>> unlocked, at least in the RT-linux variant, while the spin lock is not yet
>> unlocked.
>
> I must have missed something, but I still don't understand why this would
> be wrong...
>
>>>>> @@ -1114,13 +1139,31 @@ int begin_new_exec(struct linux_binprm * bprm)
>>>>> */
>>>>> trace_sched_prepare_exec(current, bprm);
>>>>>
>>>>> + /* If the binary is not readable then enforce mm->dumpable=0 */
>>>>> + would_dump(bprm, bprm->file);
>>>>> + if (bprm->have_execfd)
>>>>> + would_dump(bprm, bprm->executable);
>>>>> +
>>>>> + /*
>>>>> + * Figure out dumpability. Note that this checking only of current
>>>>> + * is wrong, but userspace depends on it. This should be testing
>>>>> + * bprm->secureexec instead.
>>>>> + */
>>>>> + if (bprm->interp_flags & BINPRM_FLAGS_ENFORCE_NONDUMP ||
>>>>> + is_dumpability_changed(current_cred(), bprm->cred) ||
>>>>> + !(uid_eq(current_euid(), current_uid()) &&
>>>>> + gid_eq(current_egid(), current_gid())))
>>>>> + set_dumpable(bprm->mm, suid_dumpable);
>>>>> + else
>>>>> + set_dumpable(bprm->mm, SUID_DUMP_USER);
>>>>> +
>>>>
>>>> OK, we need to do this before de_thread() drops cred_guard_mutex.
>>>> But imo this too should be done in a separate patch, the changelog should
>>>> explain this change.
>>>>
>>
>> The dumpability need to be determined before de_thread, because ptrace_may_access
>> needs this information to determine if the tracer is allowed to ptrace. That is
>> part of the core of the patch, it would not work without that.
>
> Yes,
>
>> I will add more comments to make that more easy to understand.
>
> But again, why this change can't come in a separate patch? Before the patch which
> drops cred_guard_mutex in de_thread().
>
I made this mistake already: the end result was that a patch was split into 8 different
parts, but one of them was not accepted, and therefore we have had the test failure
in the vmaccess test for 5 years now, because the test was designed to test the complete
puzzle, but one puzzle piece was missing...
>>>> int lock_current_cgm(void)
>>>> {
>>>> if (mutex_lock_interruptible(&current->signal->cred_guard_mutex))
>>>> return -ERESTARTNOINTR;
>>>>
>>>> if (!current->signal->group_exec_task)
>>>> return 0;
>>>>
>>>> WARN_ON(!fatal_signal_pending(current));
>>>> mutex_unlock(&current->signal->cred_guard_mutex);
>>>> return -ERESTARTNOINTR;
>>>> }
>>>>
>>>> ?
>>>>
>>
>> Some use mutex_lock_interruptible and some use mutex_lock_killable here,
>> so it wont work for all of them. I would not consider this a new kind
>> of dead-lock free mutex, but just an open-coded state machine, handling
>> the state that the tasks have whild de_thread is running.
>
> OK. and we don't have mutex_lock_state(). I think that all users could
> use mutex_lock_killable(), but you are right anyway, and this is minor.
>
>>>> Note that it checks ->group_exec_task, not ->exec_bprm. So this change can
>>>> come in a separate patch too, but I won't insist.
>
> Yes. Although this is minor too ;)
>
>>>> This is the most problematic change which I can't review...
>>>>
>>>> Firstly, it changes task->mm/real_cred for __ptrace_may_access() and this
>>>> looks dangerous to me.
>>>
>>> Yeah, that is not ok. This is effectively override_creds for real_cred
>>> and that is not a pattern I want to see us establish at all! Temporary
>>> credential overrides for the subjective credentials is already terrible
>>> but at least we have the explicit split between real_cred and cred
>>> expressely for that. So no, that's not an acceptable solution.
>>>
>>
>> Okay I understand your point.
>> I did this originally just to avoid to have to change the interface to all
>> the security engines, but instead I could add a flag PTRACE_MODE_BPRMCREDS to
>> the ptrace_may_access which must be handled in all security engines, to use
>> child->signal->exec_bprm->creds instead of __task_cred(child).
>
> Can't comment... I don't understand your idea, but this is my fault. I guess
> this needs more changes, in particular __ptrace_may_access_mm_cred(), but
> most probably I misunderstood your idea.
>
>>
>>>> Or. check_unsafe_exec() sets LSM_UNSAFE_PTRACE if ptrace. Is it safe to
>>>> ptrace the execing task after that? I have no idea what the security hooks
>>>> can do...
>>
>> That means the tracee is already ptraced before the execve, and SUID-bits
>> do not work as usual, and are more or less ignored. But in this patch
>> the tracee is not yet ptraced.
>
> Well. I meant that if LSM_UNSAFE_PTRACE is not set, then currently (say)
> security_bprm_committing_creds() has all rights to assume that the execing
> task is not ptraced. Yes, I don't see any potential problem right now, but
> still.
>
> And just in case... Lets look at this code
>
> + rcu_assign_pointer(task->real_cred, bprm->cred);
> + task->mm = bprm->mm;
> + retval = __ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS);
> + rcu_assign_pointer(task->real_cred, old_cred);
> + task->mm = old_mm;
>
> again.
>
> This is mostly theoretical, but what if begin_new_exec() fails after de_thread()
> and before exec_mmap() and/or commit_creds(bprm->cred) ? In this case the execing
> thread will report SIGSEGV to debugger which can (say) read old_mm.
>
> No?
>
Yes, and that is the reason why the debugger has to prove that it possesses the access rights
to the process before the execve, which are necessary in case execve fails. The debugger
may then inspect the result. The debugger's access rights must also be sufficient
to ptrace the process after execve succeeds. Moreover, the debugged process shall stop
right at the first instruction where the new process starts.
> I am starting to think that ptrace_attach() should simply fail with -EWOULDBLOCK
> if it detects "unsafe_execve_in_progress" ... And perhaps this is what you already
> tried to do in the past, I can't recall :/
>
This was indeed the previous attempt, but I changed my mind in the meantime,
as the return code -EAGAIN from the ptrace_attach is not documented, and is therefore
an API change, but also the debugger might misunderstand that hint, and try the same
ptrace_attach in a loop, instead of calling wait for the pending SIGCHLD signal.
I know Linus pointed out that the tracer would better use a signal handler, to avoid
the problem, but I think that no debugger wants to implement the state machine that
handles the ptrace events in a signal handler.
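To make the concern about a retry loop concrete, here is a minimal user-space
sketch of how a tracer could react if PTRACE_ATTACH were to fail with EAGAIN
while an execve is in progress. This is only an illustration under assumptions:
mainline does not document EAGAIN for PTRACE_ATTACH, and the helper names
attach_errno_means_wait() and attach_without_spinning() are hypothetical.
On Linux, EWOULDBLOCK has the same value as EAGAIN.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>

/*
 * Hypothetical helper: true if a failed PTRACE_ATTACH should be answered
 * with wait() before retrying. EAGAIN here is an assumed, undocumented
 * return code ("exec in progress").
 */
static bool attach_errno_means_wait(int err)
{
	return err == EAGAIN;
}

/*
 * Hypothetical helper: attach to tid. On EAGAIN, consume one pending
 * event from an already-traced thread instead of retrying in a tight loop.
 * Returns 0 on success, -1 on failure.
 */
int attach_without_spinning(pid_t tid)
{
	int status;

	for (int attempt = 0; attempt < 2; attempt++) {
		if (ptrace(PTRACE_ATTACH, tid, 0L, 0L) == 0)
			return 0;
		if (!attach_errno_means_wait(errno))
			return -1;
		/*
		 * Block for one event from any tracee. A real debugger would
		 * decode status here and update its thread list.
		 */
		if (waitpid(-1, &status, __WALL) < 0)
			return -1;
	}
	return -1;
}
```

A debugger that only checks for failure and calls ptrace() again would never
reach the waitpid() step, which is the misunderstanding described above.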
Thanks
Bernd.
> Oleg.
>
* Re: [PATCH v17] exec: Fix dead-lock in de_thread with ptrace_attach
2025-11-17 20:08 ` Bernd Edlinger
@ 2025-11-23 18:32 ` Oleg Nesterov
2025-11-29 15:06 ` Bernd Edlinger
0 siblings, 1 reply; 68+ messages in thread
From: Oleg Nesterov @ 2025-11-23 18:32 UTC (permalink / raw)
To: Bernd Edlinger
Cc: Christian Brauner, Alexander Viro, Alexey Dobriyan, Kees Cook,
Andy Lutomirski, Will Drewry, Andrew Morton, Michal Hocko,
Serge Hallyn, James Morris, Randy Dunlap, Suren Baghdasaryan,
Yafang Shao, Helge Deller, Eric W. Biederman, Adrian Reber,
Thomas Gleixner, Jens Axboe, Alexei Starovoitov, linux-fsdevel,
linux-kernel, linux-kselftest, linux-mm, linux-security-module,
tiozhang, Luis Chamberlain, Paulo Alcantara (SUSE),
Sergey Senozhatsky, Frederic Weisbecker, YueHaibing, Paul Moore,
Aleksa Sarai, Stefan Roesch, Chao Yu, xu xin, Jeff Layton,
Jan Kara, David Hildenbrand, Dave Chinner, Shuah Khan,
Elena Reshetova, David Windsor, Mateusz Guzik, Ard Biesheuvel,
Joel Fernandes (Google), Matthew Wilcox (Oracle),
Hans Liljestrand, Penglei Jiang, Lorenzo Stoakes, Adrian Ratiu,
Ingo Molnar, Peter Zijlstra (Intel),
Cyrill Gorcunov, Eric Dumazet
Hi Bernd,
sorry for the delay, I am on PTO, didn't read emails this week...
On 11/17, Bernd Edlinger wrote:
>
> On 11/17/25 16:01, Oleg Nesterov wrote:
> > On 11/17, Bernd Edlinger wrote:
> >>
> >> On 11/11/25 10:21, Christian Brauner wrote:
> >>> On Wed, Nov 05, 2025 at 03:32:10PM +0100, Oleg Nesterov wrote:
> >>
> >>>> But this is minor. Why do we need "bool unsafe_execve_in_progress" ?
> >>>> If this patch is correct, de_thread() can drop/reacquire cred_guard_mutex
> >>>> unconditionally.
> >>>>
> >>
> >> I would not like to drop the mutex when no absolutely necessary for performance reasons.
> >
> > OK, I won't insist... But I don't really understand how this can help to
> > improve the performance. If nothing else, this adds another for_other_threads()
> > loop.
> >
>
> If no dead-lock is possible it is better to complete the de_thread without
> releasing the mutex. For the debugger it is also the better experience,
> no matter when the ptrace_attack happens it will succeed rather quickly either
> before the execve or after the execve.
I still disagree, I still don't understand the "performance reasons", but since I can't
convince you I won't really argue.
> >>>>> + if (unlikely(unsafe_execve_in_progress)) {
> >>>>> + spin_unlock_irq(lock);
> >>>>> + sig->exec_bprm = bprm;
> >>>>> + mutex_unlock(&sig->cred_guard_mutex);
> >>>>> + spin_lock_irq(lock);
> >>>>
> >>>> I don't think spin_unlock_irq() + spin_lock_irq() makes any sense...
> >>>>
> >>
> >> Since the spin lock was acquired while holding the mutex, both should be
> >> unlocked in reverse sequence and the spin lock re-acquired after releasing
> >> the mutex.
> >
> > Why?
> >
>
> It is generally more safe when each thread acquires its mutexes in order and
> releases them in reverse order.
> Consider this:
> Thread A:
> holds spin_lock_irq(siglock);
> does mutes_unlock(cred_guard_mutex); with irq disabled.
> task switch happens to Thread B which has irq enabled.
> and is waiting for cred_guard_mutex.
> Thrad B:
> does mutex_lock(cred_guard_mutex);
> but is interrupted this point and the interrupt handler I executes
> now iterrupt handler I wants to take siglock and is blocked,
> because the system one single CPU core.
I don't follow. Do you mean PREEMPT_RT ?
If yes. In this case spin_lock_irq() is rt_spin_lock() which doesn't disable irqs,
it does rtlock_lock() (takes rt_mutex) + migrate_disable().
I do think that spin/mutex/whatever_unlock() is always safe. In any order, and
regardless of RT.
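As a rough user-space illustration of the unlock-ordering point (not kernel
code): the sketch below takes two pthread mutexes in a fixed order and then
releases the outer one first. The names outer_lock, inner_lock and
unlock_out_of_order() are hypothetical stand-ins for cred_guard_mutex, siglock
and the de_thread() path. It only shows that unlock never blocks, so the
release order by itself cannot deadlock; it says nothing about irq context.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-ins for cred_guard_mutex (outer) and siglock (inner). */
static pthread_mutex_t outer_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t inner_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Acquire outer then inner, but release outer first.
 * Returns 0 if every step succeeded and both locks are free afterwards.
 */
static int unlock_out_of_order(void)
{
	if (pthread_mutex_lock(&outer_lock))
		return -1;
	if (pthread_mutex_lock(&inner_lock)) {
		pthread_mutex_unlock(&outer_lock);
		return -1;
	}

	/* Release in acquisition order, not in reverse order. */
	if (pthread_mutex_unlock(&outer_lock))
		return -1;
	if (pthread_mutex_unlock(&inner_lock))
		return -1;

	/* Both locks must be available again. */
	if (pthread_mutex_trylock(&outer_lock))
		return -1;
	pthread_mutex_unlock(&outer_lock);
	if (pthread_mutex_trylock(&inner_lock))
		return -1;
	pthread_mutex_unlock(&inner_lock);
	return 0;
}
```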
> > And just in case... Lets look at this code
> >
> > + rcu_assign_pointer(task->real_cred, bprm->cred);
> > + task->mm = bprm->mm;
> > + retval = __ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS);
> > + rcu_assign_pointer(task->real_cred, old_cred);
> > + task->mm = old_mm;
> >
> > again.
> >
> > This is mostly theoretical, but what if begin_new_exec() fails after de_thread()
> > and before exec_mmap() and/or commit_creds(bprm->cred) ? In this case the execing
> > thread will report SIGSEGV to debugger which can (say) read old_mm.
> >
> > No?
> >
>
> Yes, and that is the reason why the debugger has to prove the possession of access rights
> to the process before the execve which are necessary in case execve fails, yes the debugger
> may inspect the result, and as well the debugger's access rights must be also sufficient
> to ptrace the process after execve succeeds, moreover the debugged process shall stop
> right at the first instruction where the new process starts.
Not sure I understand... OK, I see that you sent V18, and in this version ptrace_attach()
calls __ptrace_may_access() twice, so IIUC ptrace_attach() can only succeed if the debugger
has rights to trace the execing thread both before and after exec...
Oleg.
* Re: [PATCH v17] exec: Fix dead-lock in de_thread with ptrace_attach
2025-11-23 18:32 ` Oleg Nesterov
@ 2025-11-29 15:06 ` Bernd Edlinger
2025-12-01 15:13 ` Oleg Nesterov
0 siblings, 1 reply; 68+ messages in thread
From: Bernd Edlinger @ 2025-11-29 15:06 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Christian Brauner, Alexander Viro, Alexey Dobriyan, Kees Cook,
Andy Lutomirski, Will Drewry, Andrew Morton, Michal Hocko,
Serge Hallyn, James Morris, Randy Dunlap, Suren Baghdasaryan,
Yafang Shao, Helge Deller, Eric W. Biederman, Adrian Reber,
Thomas Gleixner, Jens Axboe, Alexei Starovoitov, linux-fsdevel,
linux-kernel, linux-kselftest, linux-mm, linux-security-module,
tiozhang, Luis Chamberlain, Paulo Alcantara (SUSE),
Sergey Senozhatsky, Frederic Weisbecker, YueHaibing, Paul Moore,
Aleksa Sarai, Stefan Roesch, Chao Yu, xu xin, Jeff Layton,
Jan Kara, David Hildenbrand, Dave Chinner, Shuah Khan,
Elena Reshetova, David Windsor, Mateusz Guzik, Ard Biesheuvel,
Joel Fernandes (Google), Matthew Wilcox (Oracle),
Hans Liljestrand, Penglei Jiang, Lorenzo Stoakes, Adrian Ratiu,
Ingo Molnar, Peter Zijlstra (Intel),
Cyrill Gorcunov, Eric Dumazet
On 11/23/25 19:32, Oleg Nesterov wrote:
> Hi Bernd,
>
> sorry for delay, I am on PTO, didn't read emails this week...
>
> On 11/17, Bernd Edlinger wrote:
>>
>> On 11/17/25 16:01, Oleg Nesterov wrote:
>>> On 11/17, Bernd Edlinger wrote:
>>>>
>>>> On 11/11/25 10:21, Christian Brauner wrote:
>>>>> On Wed, Nov 05, 2025 at 03:32:10PM +0100, Oleg Nesterov wrote:
>>>>
>>>>>> But this is minor. Why do we need "bool unsafe_execve_in_progress" ?
>>>>>> If this patch is correct, de_thread() can drop/reacquire cred_guard_mutex
>>>>>> unconditionally.
>>>>>>
>>>>
>>>> I would not like to drop the mutex when not absolutely necessary for performance reasons.
>>>
>>> OK, I won't insist... But I don't really understand how this can help to
>>> improve the performance. If nothing else, this adds another for_other_threads()
>>> loop.
>>>
>>
>> If no dead-lock is possible it is better to complete the de_thread without
>> releasing the mutex. For the debugger it is also the better experience,
>> no matter when the ptrace_attach happens it will succeed rather quickly either
>> before the execve or after the execve.
>
> I still disagree, I still don't understand the "performance reasons", but since I can't
> convince you I won't really argue.
>
>>>>>>> + if (unlikely(unsafe_execve_in_progress)) {
>>>>>>> + spin_unlock_irq(lock);
>>>>>>> + sig->exec_bprm = bprm;
>>>>>>> + mutex_unlock(&sig->cred_guard_mutex);
>>>>>>> + spin_lock_irq(lock);
>>>>>>
>>>>>> I don't think spin_unlock_irq() + spin_lock_irq() makes any sense...
>>>>>>
>>>>
>>>> Since the spin lock was acquired while holding the mutex, both should be
>>>> unlocked in reverse sequence and the spin lock re-acquired after releasing
>>>> the mutex.
>>>
>>> Why?
>>>
>>
>> It is generally more safe when each thread acquires its mutexes in order and
>> releases them in reverse order.
>> Consider this:
>> Thread A:
>> holds spin_lock_irq(siglock);
>> does mutex_unlock(cred_guard_mutex); with irq disabled.
>> task switch happens to Thread B which has irq enabled.
>> and is waiting for cred_guard_mutex.
>> Thread B:
>> does mutex_lock(cred_guard_mutex);
>> but is interrupted at this point and the interrupt handler I executes
>> now interrupt handler I wants to take siglock and is blocked,
>> because the system has only one single CPU core.
>
> I don't follow. Do you mean PREEMPT_RT ?
>
> If yes. In this case spin_lock_irq() is rt_spin_lock() which doesn't disable irqs,
> it does rtlock_lock() (takes rt_mutex) + migrate_disable().
>
> I do think that spin/mutex/whatever_unlock() is always safe. In any order, and
> regardless of RT.
>
Well, based on my experience with other embedded real-time O/S-es, I would
expect that something named spin_lock_irq locks the task-specific IRQ, and
prevents task switches due to time-slicing, while something called
mutex_unlock may cause an explicit task switch, when another task is waiting
for the mutex.
It is hard to follow how linux implements that spin_lock_irq exactly, but
to me it looks like it is done this way:
include/linux/spinlock_api_smp.h:static inline void __raw_spin_lock_irq(raw_spinlock_t *lock)
include/linux/spinlock_api_smp.h-{
include/linux/spinlock_api_smp.h- local_irq_disable();
include/linux/spinlock_api_smp.h- preempt_disable();
include/linux/spinlock_api_smp.h- spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
include/linux/spinlock_api_smp.h- LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
include/linux/spinlock_api_smp.h-}
so an explicit task switch while local_irq_disable() is in effect looks
very dangerous to me. Do you know other places where such
a code pattern is used?
I am just asking because a close look at those might reveal
some serious bugs, WDYT?
Thanks
Bernd.
* Re: [PATCH v17] exec: Fix dead-lock in de_thread with ptrace_attach
2025-11-29 15:06 ` Bernd Edlinger
@ 2025-12-01 15:13 ` Oleg Nesterov
0 siblings, 0 replies; 68+ messages in thread
From: Oleg Nesterov @ 2025-12-01 15:13 UTC (permalink / raw)
To: Bernd Edlinger
Cc: Christian Brauner, Alexander Viro, Alexey Dobriyan, Kees Cook,
Andy Lutomirski, Will Drewry, Andrew Morton, Michal Hocko,
Serge Hallyn, James Morris, Randy Dunlap, Suren Baghdasaryan,
Yafang Shao, Helge Deller, Eric W. Biederman, Adrian Reber,
Thomas Gleixner, Jens Axboe, Alexei Starovoitov, linux-fsdevel,
linux-kernel, linux-kselftest, linux-mm, linux-security-module,
tiozhang, Luis Chamberlain, Paulo Alcantara (SUSE),
Sergey Senozhatsky, Frederic Weisbecker, YueHaibing, Paul Moore,
Aleksa Sarai, Stefan Roesch, Chao Yu, xu xin, Jeff Layton,
Jan Kara, David Hildenbrand, Dave Chinner, Shuah Khan,
Elena Reshetova, David Windsor, Mateusz Guzik, Ard Biesheuvel,
Joel Fernandes (Google), Matthew Wilcox (Oracle),
Hans Liljestrand, Penglei Jiang, Lorenzo Stoakes, Adrian Ratiu,
Ingo Molnar, Peter Zijlstra (Intel),
Cyrill Gorcunov, Eric Dumazet
On 11/29, Bernd Edlinger wrote:
>
> On 11/23/25 19:32, Oleg Nesterov wrote:
> > I don't follow. Do you mean PREEMPT_RT ?
> >
> > If yes. In this case spin_lock_irq() is rt_spin_lock() which doesn't disable irqs,
> > it does rtlock_lock() (takes rt_mutex) + migrate_disable().
> >
> > I do think that spin/mutex/whatever_unlock() is always safe. In any order, and
> > regardless of RT.
> >
>
> It is hard to follow how linux implements that spin_lock_irq exactly,
Yes ;)
> but
> to me it looks like it is done this way:
>
> include/linux/spinlock_api_smp.h:static inline void __raw_spin_lock_irq(raw_spinlock_t *lock)
> include/linux/spinlock_api_smp.h-{
> include/linux/spinlock_api_smp.h- local_irq_disable();
> include/linux/spinlock_api_smp.h- preempt_disable();
> include/linux/spinlock_api_smp.h- spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
> include/linux/spinlock_api_smp.h- LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
> include/linux/spinlock_api_smp.h-}
Again, I will assume you mean RT.
In this case spinlock_t and raw_spinlock_t are not the same thing.
include/linux/spinlock_types.h:
typedef struct spinlock {
struct rt_mutex_base lock;
#ifdef CONFIG_DEBUG_LOCK_ALLOC
struct lockdep_map dep_map;
#endif
} spinlock_t;
include/linux/spinlock_rt.h:
static __always_inline void spin_lock_irq(spinlock_t *lock)
{
rt_spin_lock(lock);
}
rt_spin_lock() doesn't disable irqs, it takes "rt_mutex_base lock" and
disables migration.
> so an explicit task switch while local_irq_disable() is in effect looks
> very dangerous to me.
raw_spin_lock_irq() disables irqs/preemption regardless of RT, task switch
is not possible.
> Do you know other places where such
> a code pattern is used?
For example, double_lock_irq(). See task_numa_group(),
double_lock_irq(&my_grp->lock, &grp->lock);
....
spin_unlock(&my_grp->lock);
spin_unlock_irq(&grp->lock);
this can unlock the locks in reverse order.
I am sure there are more examples.
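The point about release order can be illustrated with a small userspace sketch. It is a toy model under stated assumptions, not kernel code: `toy_spinlock`, `toy_double_lock()` and `toy_unlock_in_acquisition_order()` are hypothetical names, the lock is a C11 `atomic_flag`, and nothing here models interrupts or PREEMPT_RT. It only shows that acquisition needs a consistent order (here: by address, in the spirit of double_lock_irq()), while release is a plain store that never waits and so can happen in any order.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy userspace spinlock; hypothetical, NOT the kernel's spinlock_t. */
struct toy_spinlock {
	atomic_flag held;
};

/* Returns true if the lock was free and is now held by the caller. */
bool toy_spin_trylock(struct toy_spinlock *l)
{
	return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

/* Acquisition may have to wait, so the order of acquisition matters. */
void toy_spin_lock(struct toy_spinlock *l)
{
	while (!toy_spin_trylock(l))
		;
}

/* Release is a single store; it never waits for anybody. */
void toy_spin_unlock(struct toy_spinlock *l)
{
	atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/* Take two locks in address order, whatever order the caller passed. */
void toy_double_lock(struct toy_spinlock *a, struct toy_spinlock *b)
{
	if ((uintptr_t)a > (uintptr_t)b) {
		struct toy_spinlock *tmp = a;
		a = b;
		b = tmp;
	}
	toy_spin_lock(a);
	toy_spin_lock(b);
}

/*
 * Acquire both locks, then release the first-acquired lock first
 * (i.e. NOT in reverse order). Returns 1 if both locks are free again.
 */
int toy_unlock_in_acquisition_order(void)
{
	struct toy_spinlock locks[2] = { { ATOMIC_FLAG_INIT }, { ATOMIC_FLAG_INIT } };
	bool first_free, second_free;

	toy_double_lock(&locks[1], &locks[0]);	/* takes locks[0], then locks[1] */
	assert(!toy_spin_trylock(&locks[0]));	/* both are held at this point */
	assert(!toy_spin_trylock(&locks[1]));

	toy_spin_unlock(&locks[0]);		/* first acquired, first released */
	toy_spin_unlock(&locks[1]);

	first_free = toy_spin_trylock(&locks[0]);
	second_free = toy_spin_trylock(&locks[1]);
	return first_free && second_free;
}
```

Calling `toy_unlock_in_acquisition_order()` from a test harness should return 1: both locks end up free although they were not released in reverse order.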
> I do just ask, because a close look at those might reveal
> some serious bugs, WDYT?
See above, I don't understand your concerns...
Oleg.
* [RFC PATCH 0/3] mt-exec: fix deadlock with ptrace_attach()
2025-08-21 17:34 ` [PATCH v17] " Bernd Edlinger
` (3 preceding siblings ...)
2025-11-05 14:32 ` Oleg Nesterov
@ 2025-11-09 17:14 ` Oleg Nesterov
2025-11-09 17:14 ` [RFC PATCH 1/3] exec: make setup_new_exec() return int Oleg Nesterov
` (3 more replies)
2025-11-18 18:13 ` [PATCH v18] exec: Fix dead-lock in de_thread with ptrace_attach Bernd Edlinger
5 siblings, 4 replies; 68+ messages in thread
From: Oleg Nesterov @ 2025-11-09 17:14 UTC (permalink / raw)
To: Bernd Edlinger, Linus Torvalds, Dmitry Levin
Cc: Alexander Viro, Alexey Dobriyan, Kees Cook, Andy Lutomirski,
Will Drewry, Christian Brauner, Andrew Morton, Michal Hocko,
Serge Hallyn, James Morris, Randy Dunlap, Suren Baghdasaryan,
Yafang Shao, Helge Deller, Eric W. Biederman, Adrian Reber,
Thomas Gleixner, Jens Axboe, Alexei Starovoitov, linux-fsdevel,
linux-kernel, linux-kselftest, linux-mm, linux-security-module,
tiozhang, Luis Chamberlain, Paulo Alcantara (SUSE),
Sergey Senozhatsky, Frederic Weisbecker, YueHaibing, Paul Moore,
Aleksa Sarai, Stefan Roesch, Chao Yu, xu xin, Jeff Layton,
Jan Kara, David Hildenbrand, Dave Chinner, Shuah Khan,
Elena Reshetova, David Windsor, Mateusz Guzik, Ard Biesheuvel,
Joel Fernandes (Google), Matthew Wilcox (Oracle),
Hans Liljestrand, Penglei Jiang, Lorenzo Stoakes, Adrian Ratiu,
Ingo Molnar, Peter Zijlstra (Intel),
Cyrill Gorcunov, Eric Dumazet
Not for inclusion yet. 2/2 is untested, incomplete, possibly buggy.
But could you review at least the intent? Do you see any problem with
this approach?
This problem is very, very old. It seems that nobody can suggest a
simple/clean fix...
Oleg.
---
fs/binfmt_elf.c | 4 +-
fs/binfmt_elf_fdpic.c | 4 +-
fs/binfmt_flat.c | 4 +-
fs/exec.c | 142 +++++++++++++++++++++++-------------------------
include/linux/binfmts.h | 2 +-
kernel/exit.c | 9 +--
kernel/signal.c | 6 +-
7 files changed, 87 insertions(+), 84 deletions(-)
* [RFC PATCH 1/3] exec: make setup_new_exec() return int
2025-11-09 17:14 ` [RFC PATCH 0/3] mt-exec: fix deadlock with ptrace_attach() Oleg Nesterov
@ 2025-11-09 17:14 ` Oleg Nesterov
2025-11-09 17:15 ` [RFC PATCH 2/3] exec: don't wait for zombie threads with cred_guard_mutex held Oleg Nesterov
` (2 subsequent siblings)
3 siblings, 0 replies; 68+ messages in thread
From: Oleg Nesterov @ 2025-11-09 17:14 UTC (permalink / raw)
To: Bernd Edlinger, Linus Torvalds, Dmitry Levin
Cc: Alexander Viro, Alexey Dobriyan, Kees Cook, Andy Lutomirski,
Will Drewry, Christian Brauner, Andrew Morton, Michal Hocko,
Serge Hallyn, James Morris, Randy Dunlap, Suren Baghdasaryan,
Yafang Shao, Helge Deller, Eric W. Biederman, Adrian Reber,
Thomas Gleixner, Jens Axboe, Alexei Starovoitov, linux-fsdevel,
linux-kernel, linux-kselftest, linux-mm, linux-security-module,
tiozhang, Luis Chamberlain, Paulo Alcantara (SUSE),
Sergey Senozhatsky, Frederic Weisbecker, YueHaibing, Paul Moore,
Aleksa Sarai, Stefan Roesch, Chao Yu, xu xin, Jeff Layton,
Jan Kara, David Hildenbrand, Dave Chinner, Shuah Khan,
Elena Reshetova, David Windsor, Mateusz Guzik, Ard Biesheuvel,
Joel Fernandes (Google), Matthew Wilcox (Oracle),
Hans Liljestrand, Penglei Jiang, Lorenzo Stoakes, Adrian Ratiu,
Ingo Molnar, Peter Zijlstra (Intel),
Cyrill Gorcunov, Eric Dumazet
Preparation. After the next change setup_new_exec() can fail.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
fs/binfmt_elf.c | 4 +++-
fs/binfmt_elf_fdpic.c | 4 +++-
fs/binfmt_flat.c | 4 +++-
fs/exec.c | 4 +++-
include/linux/binfmts.h | 2 +-
5 files changed, 13 insertions(+), 5 deletions(-)
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index e4653bb99946..8a38689ae4b0 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -1021,7 +1021,9 @@ static int load_elf_binary(struct linux_binprm *bprm)
if (!(current->personality & ADDR_NO_RANDOMIZE) && snapshot_randomize_va_space)
current->flags |= PF_RANDOMIZE;
- setup_new_exec(bprm);
+ retval = setup_new_exec(bprm);
+ if (retval)
+ goto out_free_dentry;
/* Do this so that we can load the interpreter, if need be. We will
change some of these later */
diff --git a/fs/binfmt_elf_fdpic.c b/fs/binfmt_elf_fdpic.c
index 48fd2de3bca0..7ad98b8132fc 100644
--- a/fs/binfmt_elf_fdpic.c
+++ b/fs/binfmt_elf_fdpic.c
@@ -351,7 +351,9 @@ static int load_elf_fdpic_binary(struct linux_binprm *bprm)
if (elf_read_implies_exec(&exec_params.hdr, executable_stack))
current->personality |= READ_IMPLIES_EXEC;
- setup_new_exec(bprm);
+ retval = setup_new_exec(bprm);
+ if (retval)
+ goto error;
set_binfmt(&elf_fdpic_format);
diff --git a/fs/binfmt_flat.c b/fs/binfmt_flat.c
index b5b5ca1a44f7..78e9ca128ea7 100644
--- a/fs/binfmt_flat.c
+++ b/fs/binfmt_flat.c
@@ -512,7 +512,9 @@ static int load_flat_file(struct linux_binprm *bprm,
/* OK, This is the point of no return */
set_personality(PER_LINUX_32BIT);
- setup_new_exec(bprm);
+ ret = setup_new_exec(bprm);
+ if (ret)
+ goto err;
/*
* calculate the extra space we need to map in
diff --git a/fs/exec.c b/fs/exec.c
index 6b70c6726d31..136a7ab5d91c 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1321,7 +1321,7 @@ void would_dump(struct linux_binprm *bprm, struct file *file)
}
EXPORT_SYMBOL(would_dump);
-void setup_new_exec(struct linux_binprm * bprm)
+int setup_new_exec(struct linux_binprm * bprm)
{
/* Setup things that can depend upon the personality */
struct task_struct *me = current;
@@ -1337,6 +1337,8 @@ void setup_new_exec(struct linux_binprm * bprm)
me->mm->task_size = TASK_SIZE;
up_write(&me->signal->exec_update_lock);
mutex_unlock(&me->signal->cred_guard_mutex);
+
+ return 0;
}
EXPORT_SYMBOL(setup_new_exec);
diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index 65abd5ab8836..678b7525ac5a 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -123,7 +123,7 @@ extern void unregister_binfmt(struct linux_binfmt *);
extern int __must_check remove_arg_zero(struct linux_binprm *);
extern int begin_new_exec(struct linux_binprm * bprm);
-extern void setup_new_exec(struct linux_binprm * bprm);
+extern int setup_new_exec(struct linux_binprm * bprm);
extern void finalize_exec(struct linux_binprm *bprm);
extern void would_dump(struct linux_binprm *, struct file *);
--
2.25.1.362.g51ebf55
* [RFC PATCH 2/3] exec: don't wait for zombie threads with cred_guard_mutex held
2025-11-09 17:14 ` [RFC PATCH 0/3] mt-exec: fix deadlock with ptrace_attach() Oleg Nesterov
2025-11-09 17:14 ` [RFC PATCH 1/3] exec: make setup_new_exec() return int Oleg Nesterov
@ 2025-11-09 17:15 ` Oleg Nesterov
2025-11-10 10:58 ` Cyrill Gorcunov
2025-11-09 17:16 ` [RFC PATCH 3/3] ptrace: ensure PTRACE_EVENT_EXIT won't stop if the tracee is killed by exec Oleg Nesterov
2025-11-10 5:28 ` [RFC PATCH 0/3] mt-exec: fix deadlock with ptrace_attach() Bernd Edlinger
3 siblings, 1 reply; 68+ messages in thread
From: Oleg Nesterov @ 2025-11-09 17:15 UTC (permalink / raw)
To: Bernd Edlinger, Linus Torvalds, Dmitry Levin
Cc: Alexander Viro, Alexey Dobriyan, Kees Cook, Andy Lutomirski,
Will Drewry, Christian Brauner, Andrew Morton, Michal Hocko,
Serge Hallyn, James Morris, Randy Dunlap, Suren Baghdasaryan,
Yafang Shao, Helge Deller, Eric W. Biederman, Adrian Reber,
Thomas Gleixner, Jens Axboe, Alexei Starovoitov, linux-fsdevel,
linux-kernel, linux-kselftest, linux-mm, linux-security-module,
tiozhang, Luis Chamberlain, Paulo Alcantara (SUSE),
Sergey Senozhatsky, Frederic Weisbecker, YueHaibing, Paul Moore,
Aleksa Sarai, Stefan Roesch, Chao Yu, xu xin, Jeff Layton,
Jan Kara, David Hildenbrand, Dave Chinner, Shuah Khan,
Elena Reshetova, David Windsor, Mateusz Guzik, Ard Biesheuvel,
Joel Fernandes (Google), Matthew Wilcox (Oracle),
Hans Liljestrand, Penglei Jiang, Lorenzo Stoakes, Adrian Ratiu,
Ingo Molnar, Peter Zijlstra (Intel),
Cyrill Gorcunov, Eric Dumazet
This simple program
#include <unistd.h>
#include <signal.h>
#include <sys/ptrace.h>
#include <pthread.h>
void *thread(void *arg)
{
ptrace(PTRACE_TRACEME, 0,0,0);
return NULL;
}
int main(void)
{
int pid = fork();
if (!pid) {
pthread_t pt;
pthread_create(&pt, NULL, thread, NULL);
pthread_join(pt, NULL);
execlp("echo", "echo", "passed", NULL);
}
sleep(1);
ptrace(PTRACE_ATTACH, pid, 0,0);
kill(pid, SIGCONT);
return 0;
}
hangs because de_thread() waits for debugger which should release the killed
thread with cred_guard_mutex held, while the debugger sleeps waiting for the
same mutex. Not really that bad, the tracer can be killed, but still this is
a bug and people hit it in practice.
With this patch:
- de_thread() waits until all the sub-threads pass exit_notify() and
become zombies.
- setup_new_exec() waits until all the sub-threads are reaped without
cred_guard_mutex held.
- unshare_sighand() and flush_signal_handlers() are moved from
begin_new_exec() to setup_new_exec(), we can't call them until all
sub-threads go away.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
fs/exec.c | 140 +++++++++++++++++++++++-------------------------
kernel/exit.c | 9 ++--
kernel/signal.c | 2 +-
3 files changed, 71 insertions(+), 80 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c
index 136a7ab5d91c..2bac7deb9a98 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -905,42 +905,56 @@ static int exec_mmap(struct mm_struct *mm)
return 0;
}
-static int de_thread(struct task_struct *tsk)
+static int kill_sub_threads(struct task_struct *tsk)
{
struct signal_struct *sig = tsk->signal;
- struct sighand_struct *oldsighand = tsk->sighand;
- spinlock_t *lock = &oldsighand->siglock;
-
- if (thread_group_empty(tsk))
- goto no_thread_group;
+ int err = -EINTR;
- /*
- * Kill all other threads in the thread group.
- */
- spin_lock_irq(lock);
- if ((sig->flags & SIGNAL_GROUP_EXIT) || sig->group_exec_task) {
- /*
- * Another group action in progress, just
- * return so that the signal is processed.
- */
- spin_unlock_irq(lock);
- return -EAGAIN;
+ read_lock(&tasklist_lock);
+ spin_lock_irq(&tsk->sighand->siglock);
+ if (!((sig->flags & SIGNAL_GROUP_EXIT) || sig->group_exec_task)) {
+ sig->group_exec_task = tsk;
+ sig->notify_count = -zap_other_threads(tsk);
+ err = 0;
}
+ spin_unlock_irq(&tsk->sighand->siglock);
+ read_unlock(&tasklist_lock);
- sig->group_exec_task = tsk;
- sig->notify_count = zap_other_threads(tsk);
- if (!thread_group_leader(tsk))
- sig->notify_count--;
+ return err;
+}
- while (sig->notify_count) {
- __set_current_state(TASK_KILLABLE);
- spin_unlock_irq(lock);
- schedule();
+static int wait_for_notify_count(struct task_struct *tsk)
+{
+ for (;;) {
if (__fatal_signal_pending(tsk))
- goto killed;
- spin_lock_irq(lock);
+ return -EINTR;
+ set_current_state(TASK_KILLABLE);
+ if (!tsk->signal->notify_count)
+ break;
+ schedule();
}
- spin_unlock_irq(lock);
+ __set_current_state(TASK_RUNNING);
+ return 0;
+}
+
+static void clear_group_exec_task(struct task_struct *tsk)
+{
+ struct signal_struct *sig = tsk->signal;
+
+ /* protects against exit_notify() and __exit_signal() */
+ read_lock(&tasklist_lock);
+ sig->group_exec_task = NULL;
+ sig->notify_count = 0;
+ read_unlock(&tasklist_lock);
+}
+
+static int de_thread(struct task_struct *tsk)
+{
+ if (thread_group_empty(tsk))
+ goto no_thread_group;
+
+ if (kill_sub_threads(tsk) || wait_for_notify_count(tsk))
+ return -EINTR;
/*
* At this point all other threads have exited, all we have to
@@ -948,26 +962,10 @@ static int de_thread(struct task_struct *tsk)
* and to assume its PID:
*/
if (!thread_group_leader(tsk)) {
- struct task_struct *leader = tsk->group_leader;
-
- for (;;) {
- cgroup_threadgroup_change_begin(tsk);
- write_lock_irq(&tasklist_lock);
- /*
- * Do this under tasklist_lock to ensure that
- * exit_notify() can't miss ->group_exec_task
- */
- sig->notify_count = -1;
- if (likely(leader->exit_state))
- break;
- __set_current_state(TASK_KILLABLE);
- write_unlock_irq(&tasklist_lock);
- cgroup_threadgroup_change_end(tsk);
- schedule();
- if (__fatal_signal_pending(tsk))
- goto killed;
- }
+ struct task_struct *leader = tsk->group_leader, *t;
+ cgroup_threadgroup_change_begin(tsk);
+ write_lock_irq(&tasklist_lock);
/*
* The only record we have of the real-time age of a
* process, regardless of execs it's done, is start_time.
@@ -1000,8 +998,8 @@ static int de_thread(struct task_struct *tsk)
list_replace_rcu(&leader->tasks, &tsk->tasks);
list_replace_init(&leader->sibling, &tsk->sibling);
- tsk->group_leader = tsk;
- leader->group_leader = tsk;
+ for_each_thread(tsk, t)
+ t->group_leader = tsk;
tsk->exit_signal = SIGCHLD;
leader->exit_signal = -1;
@@ -1021,23 +1019,11 @@ static int de_thread(struct task_struct *tsk)
release_task(leader);
}
- sig->group_exec_task = NULL;
- sig->notify_count = 0;
-
no_thread_group:
/* we have changed execution domain */
tsk->exit_signal = SIGCHLD;
-
BUG_ON(!thread_group_leader(tsk));
return 0;
-
-killed:
- /* protects against exit_notify() and __exit_signal() */
- read_lock(&tasklist_lock);
- sig->group_exec_task = NULL;
- sig->notify_count = 0;
- read_unlock(&tasklist_lock);
- return -EAGAIN;
}
@@ -1171,13 +1157,6 @@ int begin_new_exec(struct linux_binprm * bprm)
flush_itimer_signals();
#endif
- /*
- * Make the signal table private.
- */
- retval = unshare_sighand(me);
- if (retval)
- goto out_unlock;
-
me->flags &= ~(PF_RANDOMIZE | PF_FORKNOEXEC |
PF_NOFREEZE | PF_NO_SETAFFINITY);
flush_thread();
@@ -1249,7 +1228,6 @@ int begin_new_exec(struct linux_binprm * bprm)
/* An exec changes our domain. We are no longer part of the thread
group */
WRITE_ONCE(me->self_exec_id, me->self_exec_id + 1);
- flush_signal_handlers(me, 0);
retval = set_cred_ucounts(bprm->cred);
if (retval < 0)
@@ -1293,8 +1271,9 @@ int begin_new_exec(struct linux_binprm * bprm)
up_write(&me->signal->exec_update_lock);
if (!bprm->cred)
mutex_unlock(&me->signal->cred_guard_mutex);
-
out:
+ if (me->signal->group_exec_task == me)
+ clear_group_exec_task(me);
return retval;
}
EXPORT_SYMBOL(begin_new_exec);
@@ -1325,6 +1304,8 @@ int setup_new_exec(struct linux_binprm * bprm)
{
/* Setup things that can depend upon the personality */
struct task_struct *me = current;
+ struct signal_struct *sig = me->signal;
+ int err = 0;
arch_pick_mmap_layout(me->mm, &bprm->rlim_stack);
@@ -1335,10 +1316,23 @@ int setup_new_exec(struct linux_binprm * bprm)
* some architectures like powerpc
*/
me->mm->task_size = TASK_SIZE;
- up_write(&me->signal->exec_update_lock);
- mutex_unlock(&me->signal->cred_guard_mutex);
+ up_write(&sig->exec_update_lock);
+ mutex_unlock(&sig->cred_guard_mutex);
- return 0;
+ if (sig->group_exec_task) {
+ spin_lock_irq(&me->sighand->siglock);
+ sig->notify_count = sig->nr_threads - 1;
+ spin_unlock_irq(&me->sighand->siglock);
+
+ err = wait_for_notify_count(me);
+ clear_group_exec_task(me);
+ }
+
+ if (!err)
+ err = unshare_sighand(me);
+ if (!err)
+ flush_signal_handlers(me, 0);
+ return err;
}
EXPORT_SYMBOL(setup_new_exec);
diff --git a/kernel/exit.c b/kernel/exit.c
index f041f0c05ebb..bcde78c97253 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -178,10 +178,7 @@ static void __exit_signal(struct release_task_post *post, struct task_struct *ts
tty = sig->tty;
sig->tty = NULL;
} else {
- /*
- * If there is any task waiting for the group exit
- * then notify it:
- */
+ /* mt-exec, setup_new_exec() -> wait_for_notify_count() */
if (sig->notify_count > 0 && !--sig->notify_count)
wake_up_process(sig->group_exec_task);
@@ -766,8 +763,8 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
list_add(&tsk->ptrace_entry, &dead);
}
- /* mt-exec, de_thread() is waiting for group leader */
- if (unlikely(tsk->signal->notify_count < 0))
+ /* mt-exec, de_thread() -> wait_for_notify_count() */
+ if (tsk->signal->notify_count < 0 && !++tsk->signal->notify_count)
wake_up_process(tsk->signal->group_exec_task);
write_unlock_irq(&tasklist_lock);
diff --git a/kernel/signal.c b/kernel/signal.c
index fe9190d84f28..334212044940 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1343,13 +1343,13 @@ int zap_other_threads(struct task_struct *p)
for_other_threads(p, t) {
task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK);
- count++;
/* Don't bother with already dead threads */
if (t->exit_state)
continue;
sigaddset(&t->pending.signal, SIGKILL);
signal_wake_up(t, 1);
+ count++;
}
return count;
--
2.25.1.362.g51ebf55
* Re: [RFC PATCH 2/3] exec: don't wait for zombie threads with cred_guard_mutex held
2025-11-09 17:15 ` [RFC PATCH 2/3] exec: don't wait for zombie threads with cred_guard_mutex held Oleg Nesterov
@ 2025-11-10 10:58 ` Cyrill Gorcunov
2025-11-10 15:09 ` Oleg Nesterov
0 siblings, 1 reply; 68+ messages in thread
From: Cyrill Gorcunov @ 2025-11-10 10:58 UTC (permalink / raw)
To: Oleg Nesterov
Cc: linux-fsdevel, linux-kernel, linux-kselftest, linux-mm,
linux-security-module
On Sun, Nov 09, 2025 at 06:15:33PM +0100, Oleg Nesterov wrote:
..
> static int kill_sub_threads(struct task_struct *tsk)
> {
> struct signal_struct *sig = tsk->signal;
> int err = -EINTR;
>
> read_lock(&tasklist_lock);
> spin_lock_irq(&tsk->sighand->siglock);
> if (!((sig->flags & SIGNAL_GROUP_EXIT) || sig->group_exec_task)) {
> sig->group_exec_task = tsk;
> sig->notify_count = -zap_other_threads(tsk);
Hi Oleg! I somehow managed to miss this point -- why is the result negated here?
> err = 0;
> }
> spin_unlock_irq(&tsk->sighand->siglock);
> read_unlock(&tasklist_lock);
>
> return err;
> }
P.S. I've dropped the long CC list but left the mailing lists intact :)
Cyrill
* Re: [RFC PATCH 2/3] exec: don't wait for zombie threads with cred_guard_mutex held
2025-11-10 10:58 ` Cyrill Gorcunov
@ 2025-11-10 15:09 ` Oleg Nesterov
2025-11-10 21:49 ` Cyrill Gorcunov
0 siblings, 1 reply; 68+ messages in thread
From: Oleg Nesterov @ 2025-11-10 15:09 UTC (permalink / raw)
To: Cyrill Gorcunov
Cc: linux-fsdevel, linux-kernel, linux-kselftest, linux-mm,
linux-security-module
Hi Cyrill,
On 11/10, Cyrill Gorcunov wrote:
>
> On Sun, Nov 09, 2025 at 06:15:33PM +0100, Oleg Nesterov wrote:
> ..
> > static int kill_sub_threads(struct task_struct *tsk)
> > {
> > struct signal_struct *sig = tsk->signal;
> > int err = -EINTR;
> >
> > read_lock(&tasklist_lock);
> > spin_lock_irq(&tsk->sighand->siglock);
> > if (!((sig->flags & SIGNAL_GROUP_EXIT) || sig->group_exec_task)) {
> > sig->group_exec_task = tsk;
> > sig->notify_count = -zap_other_threads(tsk);
>
> > Hi Oleg! I somehow managed to miss this point -- why is the result negated here?
You know, initially I wrote
sig->notify_count = 0 - zap_other_threads(tsk);
to make it clear that this is not a typo ;)
This is for exit_notify() which does
/* mt-exec, de_thread() -> wait_for_notify_count() */
if (tsk->signal->notify_count < 0 && !++tsk->signal->notify_count)
wake_up_process(tsk->signal->group_exec_task);
Then setup_new_exec() sets notify_count > 0 for __exit_signal() which does
/* mt-exec, setup_new_exec() -> wait_for_notify_count() */
if (sig->notify_count > 0 && !--sig->notify_count)
wake_up_process(sig->group_exec_task);
Yes this needs more comments and (with or without this patch) cleanups.
Note that exit_notify() and __exit_signal() already (before this patch)
use ->notify_count almost the same way, just exit_notify() assumes that
notify_count < 0 means the !thread_group_leader() case in de_thread().
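The sign convention described above can be modelled in a few lines of userspace C. This is only a sketch under stated assumptions: `toy_signal` and the `toy_*()` helpers are hypothetical names, wake_up_process() is replaced by a counter, and all locking is omitted. Phase one counts a negative value up towards zero as sub-threads pass exit_notify(); phase two counts a positive value down towards zero as they are reaped in __exit_signal().

```c
#include <assert.h>

/* Hypothetical stand-in for the two signal_struct fields involved. */
struct toy_signal {
	int notify_count;
	int wakeups;		/* stands in for wake_up_process() calls */
};

/* kill_sub_threads(): negative count, one unit per zapped sub-thread. */
void toy_kill_sub_threads(struct toy_signal *sig, int zapped)
{
	sig->notify_count = -zapped;
}

/* exit_notify(): counts up; the last sub-thread wakes the exec'ing task. */
void toy_exit_notify(struct toy_signal *sig)
{
	if (sig->notify_count < 0 && !++sig->notify_count)
		sig->wakeups++;
}

/* setup_new_exec(): positive count, one unit per thread still to be reaped. */
void toy_setup_new_exec(struct toy_signal *sig, int nr_threads)
{
	sig->notify_count = nr_threads - 1;
}

/* __exit_signal(): counts down; the last reaped thread wakes the waiter. */
void toy_exit_signal(struct toy_signal *sig)
{
	if (sig->notify_count > 0 && !--sig->notify_count)
		sig->wakeups++;
}

/* Run both phases for a group with sub_threads siblings; return wakeups. */
int toy_two_phase_wakeups(int sub_threads)
{
	struct toy_signal sig = { 0, 0 };
	int i;

	toy_kill_sub_threads(&sig, sub_threads);
	for (i = 0; i < sub_threads; i++)
		toy_exit_notify(&sig);
	assert(sig.notify_count == 0);	/* phase one is complete */

	toy_setup_new_exec(&sig, sub_threads + 1);
	for (i = 0; i < sub_threads; i++)
		toy_exit_signal(&sig);
	assert(sig.notify_count == 0);	/* phase two is complete */

	return sig.wakeups;
}
```

In this model `toy_two_phase_wakeups(3)` yields exactly two wakeups, one per phase, because the sign of the counter tells each exit path whether it is the one expected to count.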
Oleg.
* Re: [RFC PATCH 2/3] exec: don't wait for zombie threads with cred_guard_mutex held
2025-11-10 15:09 ` Oleg Nesterov
@ 2025-11-10 21:49 ` Cyrill Gorcunov
2025-11-11 14:09 ` Oleg Nesterov
0 siblings, 1 reply; 68+ messages in thread
From: Cyrill Gorcunov @ 2025-11-10 21:49 UTC (permalink / raw)
To: Oleg Nesterov
Cc: linux-fsdevel, linux-kernel, linux-kselftest, linux-mm,
linux-security-module
On Mon, Nov 10, 2025 at 04:09:05PM +0100, Oleg Nesterov wrote:
...
> > > if (!((sig->flags & SIGNAL_GROUP_EXIT) || sig->group_exec_task)) {
> > > sig->group_exec_task = tsk;
> > > sig->notify_count = -zap_other_threads(tsk);
> >
> > Hi Oleg! I somehow managed to miss this point -- why is the result negated here?
>
> You know, initially I wrote
>
> sig->notify_count = 0 - zap_other_threads(tsk);
>
> to make it clear that this is not a typo ;)
Aha! Thanks a lot for the explanation :)
>
> This is for exit_notify() which does
>
> /* mt-exec, de_thread() -> wait_for_notify_count() */
> if (tsk->signal->notify_count < 0 && !++tsk->signal->notify_count)
> wake_up_process(tsk->signal->group_exec_task);
>
> Then setup_new_exec() sets notify_count > 0 for __exit_signal() which does
>
> /* mt-exec, setup_new_exec() -> wait_for_notify_count() */
> if (sig->notify_count > 0 && !--sig->notify_count)
> wake_up_process(sig->group_exec_task);
>
> Yes this needs more comments and (with or without this patch) cleanups.
> Note that exit_notify() and __exit_signal() already (before this patch)
> use ->notify_count almost the same way, just exit_notify() assumes that
> notify_count < 0 means the !thread_group_leader() case in de_thread().
Yeah, just realized. It's been a long time since I looked into the signal- and
task-related code, so to be honest I don't think I would be of much help here :)
Anyway, while looking into the patch I started to wonder why
+static int wait_for_notify_count(struct task_struct *tsk)
+{
+ for (;;) {
+ if (__fatal_signal_pending(tsk))
+ return -EINTR;
+ set_current_state(TASK_KILLABLE);
+ if (!tsk->signal->notify_count)
+ break;
Don't we need any barrier here when fetching @notify_count? I mean, this
value is updated under locks (spin or read/write), while the condition
test here is a raw read. Not a big deal, since set_current_state() and schedule()
act as memory barriers by themselves and, after all, a delayed update of
notify_count would simply cost us one more schedule() call, but I was
a bit confused that we don't use READ_ONCE() here or something similar.
Another (more likely) possibility is that I've just said something stupid :)
+ schedule();
}
+ __set_current_state(TASK_RUNNING);
+ return 0;
+}
Cyrill
* Re: [RFC PATCH 2/3] exec: don't wait for zombie threads with cred_guard_mutex held
2025-11-10 21:49 ` Cyrill Gorcunov
@ 2025-11-11 14:09 ` Oleg Nesterov
0 siblings, 0 replies; 68+ messages in thread
From: Oleg Nesterov @ 2025-11-11 14:09 UTC (permalink / raw)
To: Cyrill Gorcunov
Cc: linux-fsdevel, linux-kernel, linux-kselftest, linux-mm,
linux-security-module
On 11/11, Cyrill Gorcunov wrote:
>
> Anyway, while looking into the patch I started to wonder why
>
> +static int wait_for_notify_count(struct task_struct *tsk)
> +{
> + for (;;) {
> + if (__fatal_signal_pending(tsk))
> + return -EINTR;
> + set_current_state(TASK_KILLABLE);
> + if (!tsk->signal->notify_count)
> + break;
>
> Don't we need some barrier here when fetching @notify_count? I mean, this
> value is updated under locks (spin or read/write), whereas the condition
> test is a raw read. Not a big deal, since set_current_state() and schedule()
Yes, so I think that, correctness-wise, this doesn't need additional barriers.
> but I've
> been a bit confused that we don't use some read_once here or something.
Yes, this needs READ_ONCE() to avoid the warnings from KCSAN. And in fact
this code was written with READ_ONCE() but I removed it before sending this
RFC.
I was going to do this later. I always forget how KCSAN works, IIUC I also
need to add WRITE_ONCE() into exit_notify() and __exit_signal() to make
KCSAN happy, even if ->notify_count is always updated under the lock.
Same for the "if (me->signal->group_exec_task == me)" check in begin_new_exec().
Right now I would like to know if this RFC (approach) makes any sense,
especially because 3/3 adds a user-visible change.
Oleg.
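
As an illustration of the READ_ONCE()/WRITE_ONCE() point above, here is a minimal userspace sketch, not kernel code. It assumes C11 relaxed atomics as a rough analogue of the kernel's marked accesses; the names fake_signal, fake_exit_notify() and fake_wait_for_notify_count() are hypothetical and appear nowhere in the patch. The counter is only ever written under a lock, yet both sides use marked accesses because the waiter reads it locklessly.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

/* Hypothetical stand-in for signal_struct; only the counter is modelled. */
struct fake_signal {
	pthread_mutex_t lock;    /* plays the role of siglock / tasklist_lock */
	atomic_int notify_count; /* number of threads the waiter still waits for */
};

/*
 * Writer side: the update happens under the lock, but the store is still
 * marked (the analogue of WRITE_ONCE()) because a lockless reader exists.
 */
static void fake_exit_notify(struct fake_signal *sig)
{
	pthread_mutex_lock(&sig->lock);
	int left = atomic_load_explicit(&sig->notify_count,
					memory_order_relaxed) - 1;
	atomic_store_explicit(&sig->notify_count, left, memory_order_relaxed);
	pthread_mutex_unlock(&sig->lock);
}

/*
 * Reader side: a lockless, marked load (the analogue of READ_ONCE()).
 * sched_yield() stands in for schedule(); a stale read only costs one
 * more trip around the loop.
 */
static int fake_wait_for_notify_count(struct fake_signal *sig)
{
	while (atomic_load_explicit(&sig->notify_count, memory_order_relaxed))
		sched_yield();
	return 0;
}
```

The sketch leaves out the fatal-signal check and the task-state handling of the real loop; it only shows why an access can be marked even though every writer holds a lock.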
^ permalink raw reply [flat|nested] 68+ messages in thread
* [RFC PATCH 3/3] ptrace: ensure PTRACE_EVENT_EXIT won't stop if the tracee is killed by exec
2025-11-09 17:14 ` [RFC PATCH 0/3] mt-exec: fix deadlock with ptrace_attach() Oleg Nesterov
2025-11-09 17:14 ` [RFC PATCH 1/3] exec: make setup_new_exec() return int Oleg Nesterov
2025-11-09 17:15 ` [RFC PATCH 2/3] exec: don't wait for zombie threads with cred_guard_mutex held Oleg Nesterov
@ 2025-11-09 17:16 ` Oleg Nesterov
2025-11-10 5:28 ` [RFC PATCH 0/3] mt-exec: fix deadlock with ptrace_attach() Bernd Edlinger
3 siblings, 0 replies; 68+ messages in thread
From: Oleg Nesterov @ 2025-11-09 17:16 UTC (permalink / raw)
To: Bernd Edlinger, Linus Torvalds, Dmitry Levin
Cc: Alexander Viro, Alexey Dobriyan, Kees Cook, Andy Lutomirski,
Will Drewry, Christian Brauner, Andrew Morton, Michal Hocko,
Serge Hallyn, James Morris, Randy Dunlap, Suren Baghdasaryan,
Yafang Shao, Helge Deller, Eric W. Biederman, Adrian Reber,
Thomas Gleixner, Jens Axboe, Alexei Starovoitov, linux-fsdevel,
linux-kernel, linux-kselftest, linux-mm, linux-security-module,
tiozhang, Luis Chamberlain, Paulo Alcantara (SUSE),
Sergey Senozhatsky, Frederic Weisbecker, YueHaibing, Paul Moore,
Aleksa Sarai, Stefan Roesch, Chao Yu, xu xin, Jeff Layton,
Jan Kara, David Hildenbrand, Dave Chinner, Shuah Khan,
Elena Reshetova, David Windsor, Mateusz Guzik, Ard Biesheuvel,
Joel Fernandes (Google), Matthew Wilcox (Oracle),
Hans Liljestrand, Penglei Jiang, Lorenzo Stoakes, Adrian Ratiu,
Ingo Molnar, Peter Zijlstra (Intel),
Cyrill Gorcunov, Eric Dumazet
The previous patch fixed the deadlock when mt-exec waits for debugger
which should reap a zombie thread, but we can hit the same problem if
the killed sub-thread stops in ptrace_event(PTRACE_EVENT_EXIT). Change
ptrace_stop() to check signal->group_exec_task.
This is a user-visible change. But hopefully it can't break anything.
Note that the semantics of PTRACE_EVENT_EXIT was never really defined,
it depends on /dev/random. Just for example, currently a sub-thread
killed by exec will stop, but if it exits on its own and races with
exec it will not stop, so nobody can rely on PTRACE_EVENT_EXIT anyway.
We really need to finally define what PTRACE_EVENT_EXIT should actually
do, but this needs other changes.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
kernel/signal.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/kernel/signal.c b/kernel/signal.c
index 334212044940..59f61e07905b 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2376,6 +2376,10 @@ static int ptrace_stop(int exit_code, int why, unsigned long message,
if (!current->ptrace || __fatal_signal_pending(current))
return exit_code;
+ /* de_thread() -> wait_for_notify_count() waits for us */
+ if (current->signal->group_exec_task)
+ return exit_code;
+
set_special_state(TASK_TRACED);
current->jobctl |= JOBCTL_TRACED;
--
2.25.1.362.g51ebf55
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [RFC PATCH 0/3] mt-exec: fix deadlock with ptrace_attach()
2025-11-09 17:14 ` [RFC PATCH 0/3] mt-exec: fix deadlock with ptrace_attach() Oleg Nesterov
` (2 preceding siblings ...)
2025-11-09 17:16 ` [RFC PATCH 3/3] ptrace: ensure PTRACE_EVENT_EXIT won't stop if the tracee is killed by exec Oleg Nesterov
@ 2025-11-10 5:28 ` Bernd Edlinger
2025-11-10 14:47 ` Oleg Nesterov
3 siblings, 1 reply; 68+ messages in thread
From: Bernd Edlinger @ 2025-11-10 5:28 UTC (permalink / raw)
To: Oleg Nesterov, Linus Torvalds, Dmitry Levin
Cc: Alexander Viro, Alexey Dobriyan, Kees Cook, Andy Lutomirski,
Will Drewry, Christian Brauner, Andrew Morton, Michal Hocko,
Serge Hallyn, James Morris, Randy Dunlap, Suren Baghdasaryan,
Yafang Shao, Helge Deller, Eric W. Biederman, Adrian Reber,
Thomas Gleixner, Jens Axboe, Alexei Starovoitov, linux-fsdevel,
linux-kernel, linux-kselftest, linux-mm, linux-security-module,
tiozhang, Luis Chamberlain, Paulo Alcantara (SUSE),
Sergey Senozhatsky, Frederic Weisbecker, YueHaibing, Paul Moore,
Aleksa Sarai, Stefan Roesch, Chao Yu, xu xin, Jeff Layton,
Jan Kara, David Hildenbrand, Dave Chinner, Shuah Khan,
Elena Reshetova, David Windsor, Mateusz Guzik, Ard Biesheuvel,
Joel Fernandes (Google), Matthew Wilcox (Oracle),
Hans Liljestrand, Penglei Jiang, Lorenzo Stoakes, Adrian Ratiu,
Ingo Molnar, Peter Zijlstra (Intel),
Cyrill Gorcunov, Eric Dumazet
Hi Oleg,
I have not been able to update my patch with your and Peter Zijlstra's
kind suggestions, because I am currently too busy with my role as openssl maintainer.
Just for clarification, my patch is 10% about deadlocks, and 90% about security.
The idea is that while de_thread is blocked, the debugger may be trying to
ptrace the exec thread. That must either succeed or fail, so that the debugger
can release the zombie threads.
The security issue is when the debugged process tries to exec a SUID process
like /usr/bin/passwd
In that case the new credentials are determined differently when the PTRACE is
already attached (i.e. non-root), than when it is not yet attached (root user).
My attempt at fixing this determines the new credentials and the new dumpability
as root user when the debugger has not yet attached before de_thread,
and keeps this decision.
When the debugger wants to attach the de_thread the debug-user access rights are
checked against the current user and additionally against the new user credentials.
This I did by quickly switching the user credentials to the next user and back again,
under the cred_guard_mutex, which should make that safe.
So at this time I have only one request for you.
Could you please try out how the test case in my patch behaves with your fix?
Thanks
Bernd.
On 11/9/25 18:14, Oleg Nesterov wrote:
> Not for inclusion yet. 2/2 is untested, incomplete, possibly buggy.
>
> But could you review at least the intent? Do you see any problem with
> this approach?
>
> This problem is very, very old. It seems that nobody can suggest a
> simple/clean fix...
>
> Oleg.
> ---
>
> fs/binfmt_elf.c | 4 +-
> fs/binfmt_elf_fdpic.c | 4 +-
> fs/binfmt_flat.c | 4 +-
> fs/exec.c | 142 +++++++++++++++++++++++-------------------------
> include/linux/binfmts.h | 2 +-
> kernel/exit.c | 9 +--
> kernel/signal.c | 6 +-
> 7 files changed, 87 insertions(+), 84 deletions(-)
>
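
The double credential check described at the top of this message can be modelled in a few lines. This is a deliberately simplified userspace sketch under stated assumptions: credentials are reduced to a single uid, the access rule is a toy one, and the names model_cred, model_task, model_may_access() and model_may_attach() are hypothetical. It only illustrates that an attach during execve has to pass against both the current and the pending credentials.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified credentials: just one uid. */
struct model_cred {
	unsigned int uid;
};

/* Hypothetical model of a task that may be in the middle of execve(). */
struct model_task {
	const struct model_cred *cred;      /* credentials in effect now */
	const struct model_cred *exec_cred; /* pending exec credentials, or NULL */
};

/* Toy access rule: uid 0 may trace anyone, others only a matching uid. */
static bool model_may_access(const struct model_cred *tracer,
			     const struct model_cred *target)
{
	return tracer->uid == 0 || tracer->uid == target->uid;
}

/* Attach is allowed only if the old AND the pending credentials permit it. */
static bool model_may_attach(const struct model_cred *tracer,
			     const struct model_task *task)
{
	if (!model_may_access(tracer, task->cred))
		return false;
	if (task->exec_cred && !model_may_access(tracer, task->exec_cred))
		return false;
	return true;
}
```

With a non-root tracer and a target that is about to exec a set-uid-root binary, the first check passes and the second one fails, which is the case the security argument is about.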
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [RFC PATCH 0/3] mt-exec: fix deadlock with ptrace_attach()
2025-11-10 5:28 ` [RFC PATCH 0/3] mt-exec: fix deadlock with ptrace_attach() Bernd Edlinger
@ 2025-11-10 14:47 ` Oleg Nesterov
0 siblings, 0 replies; 68+ messages in thread
From: Oleg Nesterov @ 2025-11-10 14:47 UTC (permalink / raw)
To: Bernd Edlinger
Cc: Linus Torvalds, Dmitry Levin, Alexander Viro, Alexey Dobriyan,
Kees Cook, Andy Lutomirski, Will Drewry, Christian Brauner,
Andrew Morton, Michal Hocko, Serge Hallyn, James Morris,
Randy Dunlap, Suren Baghdasaryan, Yafang Shao, Helge Deller,
Eric W. Biederman, Adrian Reber, Thomas Gleixner, Jens Axboe,
Alexei Starovoitov, linux-fsdevel, linux-kernel, linux-kselftest,
linux-mm, linux-security-module, tiozhang, Luis Chamberlain,
Paulo Alcantara (SUSE),
Sergey Senozhatsky, Frederic Weisbecker, YueHaibing, Paul Moore,
Aleksa Sarai, Stefan Roesch, Chao Yu, xu xin, Jeff Layton,
Jan Kara, David Hildenbrand, Dave Chinner, Shuah Khan,
Elena Reshetova, David Windsor, Mateusz Guzik, Ard Biesheuvel,
Joel Fernandes (Google), Matthew Wilcox (Oracle),
Hans Liljestrand, Penglei Jiang, Lorenzo Stoakes, Adrian Ratiu,
Ingo Molnar, Peter Zijlstra (Intel),
Cyrill Gorcunov, Eric Dumazet
Hi Bernd,
On 11/10, Bernd Edlinger wrote:
>
> When the debugger wants to attach the de_thread the debug-user access rights are
> checked against the current user and additionally against the new user credentials.
> This I did by quickly switching the user credentials to the next user and back again,
> under the cred_guard_mutex, which should make that safe.
Let me repeat, I can't really comment on this part, I don't know if it is
actually safe. But the very fact your patch changes ->mm and ->cred of
the execing task in ptrace_attach() makes me worry... At least I think
you should update or remove this comment in begin_new_exec:
/*
* cred_guard_mutex must be held at least to this point to prevent
* ptrace_attach() from altering our determination of the task's
* credentials; any time after this it may be unlocked.
*/
security_bprm_committed_creds(bprm);
> So at this time I have only one request for you.
> Could you please try out how the test case in my patch behaves with your fix?
The new TEST(attach2) added by your patch fails as expected, see 3/3.
	static long thread2_tid;
	static void *thread2(void *arg)
	{
		thread2_tid = syscall(__NR_gettid);
		sleep(2);
		execlp("false", "false", NULL);
		return NULL;
	}

	TEST(attach2)
	{
		int s, k, pid = fork();

		if (!pid) {
			pthread_t pt;

			pthread_create(&pt, NULL, thread2, NULL);
			pthread_join(pt, NULL);
			return;
		}

		sleep(1);
		k = ptrace(PTRACE_ATTACH, pid, 0L, 0L);
		ASSERT_EQ(k, 0);
		k = waitpid(-1, &s, 0);
		ASSERT_EQ(k, pid);
		ASSERT_EQ(WIFSTOPPED(s), 1);
		ASSERT_EQ(WSTOPSIG(s), SIGSTOP);
		k = ptrace(PTRACE_SETOPTIONS, pid, 0L, PTRACE_O_TRACEEXIT);
		ASSERT_EQ(k, 0);
		thread2_tid = ptrace(PTRACE_PEEKDATA, pid, &thread2_tid, 0L);
		ASSERT_NE(thread2_tid, -1);
		ASSERT_NE(thread2_tid, 0);
		ASSERT_NE(thread2_tid, pid);
		k = waitpid(-1, &s, WNOHANG);
		ASSERT_EQ(k, 0);
		sleep(2);
		/* deadlock may happen here */
		k = ptrace(PTRACE_ATTACH, thread2_tid, 0L, 0L);
PTRACE_ATTACH fails.
thread2() kills the old leader, takes its pid, execlp() succeeds.
Oleg.
^ permalink raw reply [flat|nested] 68+ messages in thread
* [PATCH v18] exec: Fix dead-lock in de_thread with ptrace_attach
2025-08-21 17:34 ` [PATCH v17] " Bernd Edlinger
` (4 preceding siblings ...)
2025-11-09 17:14 ` [RFC PATCH 0/3] mt-exec: fix deadlock with ptrace_attach() Oleg Nesterov
@ 2025-11-18 18:13 ` Bernd Edlinger
2025-11-20 15:15 ` Eric W. Biederman
5 siblings, 1 reply; 68+ messages in thread
From: Bernd Edlinger @ 2025-11-18 18:13 UTC (permalink / raw)
To: Alexander Viro, Alexey Dobriyan, Oleg Nesterov, Kees Cook,
Andy Lutomirski, Will Drewry, Christian Brauner, Andrew Morton,
Michal Hocko, Serge Hallyn, James Morris, Randy Dunlap,
Suren Baghdasaryan, Yafang Shao, Helge Deller, Eric W. Biederman,
Adrian Reber, Thomas Gleixner, Jens Axboe, Alexei Starovoitov,
linux-fsdevel, linux-kernel, linux-kselftest, linux-mm,
linux-security-module, tiozhang, Luis Chamberlain,
Paulo Alcantara (SUSE),
Sergey Senozhatsky, Frederic Weisbecker, YueHaibing, Paul Moore,
Aleksa Sarai, Stefan Roesch, Chao Yu, xu xin, Jeff Layton,
Jan Kara, David Hildenbrand, Dave Chinner, Shuah Khan,
Alexey Dobriyan, Jens Axboe, Paul Moore, Elena Reshetova,
David Windsor, Mateusz Guzik, YueHaibing, Ard Biesheuvel,
Joel Fernandes (Google), Matthew Wilcox (Oracle),
Hans Liljestrand, tiozhang, Penglei Jiang, Lorenzo Stoakes,
Adrian Ratiu, Ingo Molnar, Peter Zijlstra (Intel),
Cyrill Gorcunov, Eric Dumazet
This introduces signal->exec_bprm, which is used to
fix the case when at least one of the sibling threads
is traced, and therefore the trace process may dead-lock
in ptrace_attach, but de_thread will need to wait for the
tracer to continue execution.
The problem happens when a tracer tries to ptrace_attach
to a multi-threaded process, that does an execve in one of
the threads at the same time, without doing that in a forked
sub-process. That means: There is a race condition, when one
or more of the threads are already ptraced, but the thread
that invoked the execve is not yet traced. Now in this
case the execve locks the cred_guard_mutex and waits for
de_thread to complete. But that waits for the traced
sibling threads to exit, and those have to wait for the
tracer to receive the exit signal, but the tracer cannot
call wait right now, because it is waiting for the ptrace
call to complete, and this never happens.
The traced process and the tracer are now in a deadlock
situation, and can only be killed by a fatal signal.
The solution is to detect this situation and allow
ptrace_attach to continue by temporarily releasing the
cred_guard_mutex, while de_thread() is still waiting for
traced zombies to be eventually released by the tracer.
In the case of the thread group leader we only have to wait
for the thread to become a zombie, which may also need
co-operation from the tracer due to PTRACE_O_TRACEEXIT.
When a tracer wants to ptrace_attach a task that already
is in execve, we simply retry the ptrace_may_access
check with the new PTRACE_MODE_BPRMCREDS flag to check
that the tracer has permission to trace the new credentials
and dumpability which are about to be used after execve
completes. If the ptrace_attach happens on a thread that
is a sibling-thread of the thread doing execve, it is
sufficient to check against the old credentials, as this
thread will be waited for, before the new credentials are
installed.
Other threads die quickly since the cred_guard_mutex is
released, but a deadly signal is already pending. In case
the mutex_lock_killable misses the signal, the non-zero
current->signal->exec_bprm makes sure they release the
mutex immediately and return with -ERESTARTNOINTR.
This means there is no API change, unlike the previous
version of this patch which was discussed here:
https://lore.kernel.org/lkml/b6537ae6-31b1-5c50-f32b-8b8332ace882@hotmail.de/
See tools/testing/selftests/ptrace/vmaccess.c
for a test case that gets fixed by this change.
Note that since the test case was originally designed to
test the ptrace_attach returning an error in this situation,
the test expectation needed to be adjusted, to allow the
API to succeed at the first attempt.
Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
---
fs/exec.c | 86 +++++++++++---
fs/proc/base.c | 12 ++
include/linux/cred.h | 1 +
include/linux/ptrace.h | 1 +
include/linux/sched/signal.h | 18 +++
kernel/cred.c | 30 ++++-
kernel/ptrace.c | 29 ++++-
kernel/seccomp.c | 18 ++-
security/apparmor/lsm.c | 5 +-
security/commoncap.c | 5 +-
security/landlock/task.c | 7 +-
security/selinux/hooks.c | 7 +-
security/smack/smack_lsm.c | 5 +-
security/yama/yama_lsm.c | 11 +-
tools/testing/selftests/ptrace/vmaccess.c | 135 ++++++++++++++++++++--
15 files changed, 324 insertions(+), 46 deletions(-)
v10: Changes to previous version, make the PTRACE_ATTACH
return -EAGAIN, instead of execve return -ERESTARTSYS.
Added some lessons learned to the description.
v11: Check old and new credentials in PTRACE_ATTACH again without
changing the API.
Note: I actually got one response from an automatic checker to the v11 patch,
https://lore.kernel.org/lkml/202107121344.wu68hEPF-lkp@intel.com/
which is complaining about:
>> >> kernel/ptrace.c:425:26: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct cred const *old_cred @@ got struct cred const [noderef] __rcu *real_cred @@
417 struct linux_binprm *bprm = task->signal->exec_bprm;
418 const struct cred *old_cred;
419 struct mm_struct *old_mm;
420
421 retval = down_write_killable(&task->signal->exec_update_lock);
422 if (retval)
423 goto unlock_creds;
424 task_lock(task);
> 425 old_cred = task->real_cred;
v12: Essentially identical to v11.
- Fixed a minor merge conflict in linux v5.17, and fixed the
above mentioned nit by adding __rcu to the declaration.
- re-tested the patch with all linux versions from v5.11 to v6.6
v10 was an alternative approach which did imply an API change.
But I would prefer to avoid such an API change.
The difficult part is getting the right dumpability flags assigned
before de_thread starts, hope you like this version.
If not, the v10 is of course also acceptable.
v13: Fixed duplicated Return section in function header of
is_dumpability_changed which was reported by the kernel test robot
v14: rebased to v6.7, refreshed and retested.
And added a more detailed description of the actual bug.
v15: rebased to v6.8-rc1, addressed some review comments.
Split the test case vmaccess into vmaccess1 and vmaccess2
to improve overall test coverage.
v16: rebased to 6.17-rc2, fixed some minor merge conflicts.
v17: avoid use of task->in_execve in ptrace_attach.
v18: Add some more comments, avoid the temporary impersonation of the
new credentials, and use instead a new option to ptrace_may_access.
All security engines have to handle this option, but the advantage is
that the engines could detect and maybe also deny the unsafe execve.
Thanks
Bernd.
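
The "take cred_guard_mutex, then bail out if exec_bprm is set" pattern that several hunks below repeat can be sketched in userspace as follows. This is only an illustration under stated assumptions: a pthread mutex stands in for the kernel mutex, MODEL_ERESTART is a placeholder for the kernel-internal restart code, and model_signal and model_lock_cred_guard() are hypothetical names that do not exist in the patch.

```c
#include <pthread.h>
#include <stddef.h>

/* Placeholder for the kernel-internal restart error; value is arbitrary. */
#define MODEL_ERESTART (-1)

/* Hypothetical stand-in for the two signal_struct members involved. */
struct model_signal {
	pthread_mutex_t cred_guard_mutex;
	const void *exec_bprm; /* non-NULL while the exec'ing thread has
				* temporarily dropped the mutex */
};

/*
 * Take the mutex, but treat a non-NULL exec_bprm like a failed lock:
 * a sibling thread is inside its exec and has only lent the mutex out,
 * so the caller must release it again at once and back off.
 * Returns 0 with the mutex held, or MODEL_ERESTART with it released.
 */
static int model_lock_cred_guard(struct model_signal *sig)
{
	if (pthread_mutex_lock(&sig->cred_guard_mutex))
		return MODEL_ERESTART;

	if (sig->exec_bprm) {
		pthread_mutex_unlock(&sig->cred_guard_mutex);
		return MODEL_ERESTART;
	}
	return 0;
}
```

A caller that gets 0 back owns the mutex and unlocks it as usual; a caller that gets MODEL_ERESTART owns nothing and simply returns the error.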
diff --git a/fs/exec.c b/fs/exec.c
index 4298e7e08d5d..02f3e8469125 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -905,11 +905,13 @@ static int exec_mmap(struct mm_struct *mm)
return 0;
}
-static int de_thread(struct task_struct *tsk)
+static int de_thread(struct task_struct *tsk, struct linux_binprm *bprm)
{
struct signal_struct *sig = tsk->signal;
struct sighand_struct *oldsighand = tsk->sighand;
spinlock_t *lock = &oldsighand->siglock;
+ struct task_struct *t;
+ bool unsafe_execve_in_progress = false;
if (thread_group_empty(tsk))
goto no_thread_group;
@@ -932,6 +934,36 @@ static int de_thread(struct task_struct *tsk)
if (!thread_group_leader(tsk))
sig->notify_count--;
+ for_other_threads(tsk, t) {
+ if (unlikely(t->ptrace) &&
+ (t != tsk->group_leader || !t->exit_state)) {
+ unsafe_execve_in_progress = true;
+ break;
+ }
+ }
+
+ if (unlikely(unsafe_execve_in_progress)) {
+ /*
+ * Since the spin lock was acquired while holding the
+ * mutex, both should be unlocked in reverse sequence and
+ * the spin lock re-acquired after releasing the mutex.
+ */
+ spin_unlock_irq(lock);
+ /*
+ * Sibling threads are notified by the non-zero exec_bprm,
+ * that they have just been zapped, and the cred_guard_mutex
+ * is to be released by them immediately.
+ * The caller of ptrace_attach on the other hand is allowed
+ * to ptrace any additional sibling threads that may not yet
+ * have been ptraced, but if the group_exec_task is being
+ * ptraced, an additional check has to be performed, that the
+ * tracer is allowed to ptrace the new exec credentials.
+ */
+ sig->exec_bprm = bprm;
+ mutex_unlock(&sig->cred_guard_mutex);
+ spin_lock_irq(lock);
+ }
+
while (sig->notify_count) {
__set_current_state(TASK_KILLABLE);
spin_unlock_irq(lock);
@@ -1021,6 +1053,11 @@ static int de_thread(struct task_struct *tsk)
release_task(leader);
}
+ if (unlikely(unsafe_execve_in_progress)) {
+ mutex_lock(&sig->cred_guard_mutex);
+ sig->exec_bprm = NULL;
+ }
+
sig->group_exec_task = NULL;
sig->notify_count = 0;
@@ -1032,6 +1069,11 @@ static int de_thread(struct task_struct *tsk)
return 0;
killed:
+ if (unlikely(unsafe_execve_in_progress)) {
+ mutex_lock(&sig->cred_guard_mutex);
+ sig->exec_bprm = NULL;
+ }
+
/* protects against exit_notify() and __exit_signal() */
read_lock(&tasklist_lock);
sig->group_exec_task = NULL;
@@ -1114,13 +1156,31 @@ int begin_new_exec(struct linux_binprm * bprm)
*/
trace_sched_prepare_exec(current, bprm);
+ /* If the binary is not readable then enforce mm->dumpable=0 */
+ would_dump(bprm, bprm->file);
+ if (bprm->have_execfd)
+ would_dump(bprm, bprm->executable);
+
+ /*
+ * Figure out dumpability. Note that this checking only of current
+ * is wrong, but userspace depends on it. This should be testing
+ * bprm->secureexec instead.
+ */
+ if (bprm->interp_flags & BINPRM_FLAGS_ENFORCE_NONDUMP ||
+ is_dumpability_changed(current_cred(), bprm->cred) ||
+ !(uid_eq(current_euid(), current_uid()) &&
+ gid_eq(current_egid(), current_gid())))
+ set_dumpable(bprm->mm, suid_dumpable);
+ else
+ set_dumpable(bprm->mm, SUID_DUMP_USER);
+
/*
* Ensure all future errors are fatal.
*/
bprm->point_of_no_return = true;
/* Make this the only thread in the thread group */
- retval = de_thread(me);
+ retval = de_thread(me, bprm);
if (retval)
goto out;
/* see the comment in check_unsafe_exec() */
@@ -1144,11 +1204,6 @@ int begin_new_exec(struct linux_binprm * bprm)
if (retval)
goto out;
- /* If the binary is not readable then enforce mm->dumpable=0 */
- would_dump(bprm, bprm->file);
- if (bprm->have_execfd)
- would_dump(bprm, bprm->executable);
-
/*
* Release all of the old mmap stuff
*/
@@ -1210,18 +1265,6 @@ int begin_new_exec(struct linux_binprm * bprm)
me->sas_ss_sp = me->sas_ss_size = 0;
- /*
- * Figure out dumpability. Note that this checking only of current
- * is wrong, but userspace depends on it. This should be testing
- * bprm->secureexec instead.
- */
- if (bprm->interp_flags & BINPRM_FLAGS_ENFORCE_NONDUMP ||
- !(uid_eq(current_euid(), current_uid()) &&
- gid_eq(current_egid(), current_gid())))
- set_dumpable(current->mm, suid_dumpable);
- else
- set_dumpable(current->mm, SUID_DUMP_USER);
-
perf_event_exec();
/*
@@ -1275,6 +1318,10 @@ int begin_new_exec(struct linux_binprm * bprm)
* cred_guard_mutex must be held at least to this point to prevent
* ptrace_attach() from altering our determination of the task's
* credentials; any time after this it may be unlocked.
+ * Note that de_thread may temporarily release the cred_guard_mutex,
+ * but the credentials are pre-determined in that case and the ptrace
+ * access check guarantees, that the access permissions of the tracer
+ * are sufficient to trace the task also with the new credentials.
*/
security_bprm_committed_creds(bprm);
@@ -1361,6 +1408,7 @@ static int prepare_bprm_creds(struct linux_binprm *bprm)
if (mutex_lock_interruptible(&current->signal->cred_guard_mutex))
return -ERESTARTNOINTR;
+ /* It is not necessary to check current->signal->exec_bprm here. */
bprm->cred = prepare_exec_creds();
if (likely(bprm->cred))
return 0;
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 6299878e3d97..f554c2638ffb 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2838,6 +2838,18 @@ static ssize_t proc_pid_attr_write(struct file * file, const char __user * buf,
if (rv < 0)
goto out_free;
+ /*
+ * A fatal signal is guaranteed to be already pending in the
+ * unlikely event, that current->signal->exec_bprm happens
+ * to be non-zero here, so just release the mutex again
+ * and continue as if mutex_lock_interruptible did fail.
+ */
+ if (unlikely(current->signal->exec_bprm)) {
+ mutex_unlock(&current->signal->cred_guard_mutex);
+ rv = -ERESTARTNOINTR;
+ goto out_free;
+ }
+
rv = security_setprocattr(PROC_I(inode)->op.lsmid,
file->f_path.dentry->d_name.name, page,
count);
diff --git a/include/linux/cred.h b/include/linux/cred.h
index 89ae50ad2ace..4204c19de714 100644
--- a/include/linux/cred.h
+++ b/include/linux/cred.h
@@ -153,6 +153,7 @@ extern const struct cred *get_task_cred(struct task_struct *);
extern struct cred *cred_alloc_blank(void);
extern struct cred *prepare_creds(void);
extern struct cred *prepare_exec_creds(void);
+extern bool is_dumpability_changed(const struct cred *, const struct cred *);
extern int commit_creds(struct cred *);
extern void abort_creds(struct cred *);
extern struct cred *prepare_kernel_cred(struct task_struct *);
diff --git a/include/linux/ptrace.h b/include/linux/ptrace.h
index 90507d4afcd6..dbd58a4807bc 100644
--- a/include/linux/ptrace.h
+++ b/include/linux/ptrace.h
@@ -64,6 +64,7 @@ extern void exit_ptrace(struct task_struct *tracer, struct list_head *dead);
#define PTRACE_MODE_NOAUDIT 0x04
#define PTRACE_MODE_FSCREDS 0x08
#define PTRACE_MODE_REALCREDS 0x10
+#define PTRACE_MODE_BPRMCREDS 0x20
/* shorthands for READ/ATTACH and FSCREDS/REALCREDS combinations */
#define PTRACE_MODE_READ_FSCREDS (PTRACE_MODE_READ | PTRACE_MODE_FSCREDS)
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 7d6449982822..ade7d7173875 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -241,9 +241,27 @@ struct signal_struct {
struct mm_struct *oom_mm; /* recorded mm when the thread group got
* killed by the oom killer */
+ struct linux_binprm *exec_bprm; /* Used to check ptrace_may_access
+ * against new credentials while
+ * de_thread is waiting for other
+ * traced threads to terminate.
+ * Set while de_thread is executing.
+ * The cred_guard_mutex is released
+ * after de_thread() has called
+ * zap_other_threads(), therefore
+ * a fatal signal is guaranteed to be
+ * already pending in the unlikely
+ * event, that
+ * current->signal->exec_bprm happens
+ * to be non-zero after the
+ * cred_guard_mutex was acquired.
+ */
+
struct mutex cred_guard_mutex; /* guard against foreign influences on
* credential calculations
* (notably. ptrace)
+ * Held while execve runs, except when
+ * a sibling thread is being traced.
* Deprecated do not use in new code.
* Use exec_update_lock instead.
*/
diff --git a/kernel/cred.c b/kernel/cred.c
index dbf6b687dc5c..69fd0de813c0 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -375,6 +375,30 @@ static bool cred_cap_issubset(const struct cred *set, const struct cred *subset)
return false;
}
+/**
+ * is_dumpability_changed - Will changing creds affect dumpability?
+ * @old: The old credentials.
+ * @new: The new credentials.
+ *
+ * If the @new credentials have no elevated privileges compared to the
+ * @old credentials, the task may remain dumpable. Otherwise we have
+ * to mark the task as undumpable to avoid information leaks from higher
+ * to lower privilege domains.
+ *
+ * Return: True if the task will become undumpable.
+ */
+bool is_dumpability_changed(const struct cred *old, const struct cred *new)
+{
+ if (!uid_eq(old->euid, new->euid) ||
+ !gid_eq(old->egid, new->egid) ||
+ !uid_eq(old->fsuid, new->fsuid) ||
+ !gid_eq(old->fsgid, new->fsgid) ||
+ !cred_cap_issubset(old, new))
+ return true;
+
+ return false;
+}
+
/**
* commit_creds - Install new credentials upon the current task
* @new: The credentials to be assigned
@@ -403,11 +427,7 @@ int commit_creds(struct cred *new)
get_cred(new); /* we will require a ref for the subj creds too */
/* dumpability changes */
- if (!uid_eq(old->euid, new->euid) ||
- !gid_eq(old->egid, new->egid) ||
- !uid_eq(old->fsuid, new->fsuid) ||
- !gid_eq(old->fsgid, new->fsgid) ||
- !cred_cap_issubset(old, new)) {
+ if (is_dumpability_changed(old, new)) {
if (task->mm)
set_dumpable(task->mm, suid_dumpable);
task->pdeath_signal = 0;
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 75a84efad40f..ac750d1ccd04 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -20,6 +20,7 @@
#include <linux/pagemap.h>
#include <linux/ptrace.h>
#include <linux/security.h>
+#include <linux/binfmts.h>
#include <linux/signal.h>
#include <linux/uio.h>
#include <linux/audit.h>
@@ -285,6 +286,11 @@ static int __ptrace_may_access(struct task_struct *task, unsigned int mode)
return -EPERM;
}
+ if ((mode & PTRACE_MODE_BPRMCREDS) && !task->signal->exec_bprm) {
+ WARN(1, "denying ptrace access check with PTRACE_MODE_BPRMCREDS\n");
+ return -EPERM;
+ }
+
/* May we inspect the given task?
* This check is used both for attaching with ptrace
* and for allowing access to sensitive information in /proc.
@@ -313,7 +319,10 @@ static int __ptrace_may_access(struct task_struct *task, unsigned int mode)
caller_uid = cred->uid;
caller_gid = cred->gid;
}
- tcred = __task_cred(task);
+ if (mode & PTRACE_MODE_BPRMCREDS)
+ tcred = task->signal->exec_bprm->cred;
+ else
+ tcred = __task_cred(task);
if (uid_eq(caller_uid, tcred->euid) &&
uid_eq(caller_uid, tcred->suid) &&
uid_eq(caller_uid, tcred->uid) &&
@@ -337,7 +346,10 @@ static int __ptrace_may_access(struct task_struct *task, unsigned int mode)
* Pairs with a write barrier in commit_creds().
*/
smp_rmb();
- mm = task->mm;
+ if (mode & PTRACE_MODE_BPRMCREDS)
+ mm = task->signal->exec_bprm->mm;
+ else
+ mm = task->mm;
if (mm &&
((get_dumpable(mm) != SUID_DUMP_USER) &&
!ptrace_has_cap(mm->user_ns, mode)))
@@ -451,6 +463,14 @@ static int ptrace_attach(struct task_struct *task, long request,
retval = __ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS);
if (retval)
return retval;
+
+ if (unlikely(task == task->signal->group_exec_task)) {
+ retval = __ptrace_may_access(task,
+ PTRACE_MODE_ATTACH_REALCREDS |
+ PTRACE_MODE_BPRMCREDS);
+ if (retval)
+ return retval;
+ }
}
scoped_guard (write_lock_irq, &tasklist_lock) {
@@ -488,6 +508,10 @@ static int ptrace_traceme(void)
{
int ret = -EPERM;
+ if (mutex_lock_interruptible(&current->signal->cred_guard_mutex))
+ return -ERESTARTNOINTR;
+
+ /* It is not necessary to check current->signal->exec_bprm here. */
write_lock_irq(&tasklist_lock);
/* Are we already being traced? */
if (!current->ptrace) {
@@ -503,6 +527,7 @@ static int ptrace_traceme(void)
}
}
write_unlock_irq(&tasklist_lock);
+ mutex_unlock(&current->signal->cred_guard_mutex);
return ret;
}
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 25f62867a16d..6e4ff108faa0 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -2010,9 +2010,21 @@ static long seccomp_set_mode_filter(unsigned int flags,
* Make sure we cannot change seccomp or nnp state via TSYNC
* while another thread is in the middle of calling exec.
*/
- if (flags & SECCOMP_FILTER_FLAG_TSYNC &&
- mutex_lock_killable(&current->signal->cred_guard_mutex))
- goto out_put_fd;
+ if (flags & SECCOMP_FILTER_FLAG_TSYNC) {
+ if (mutex_lock_killable(&current->signal->cred_guard_mutex))
+ goto out_put_fd;
+
+ /*
+ * A fatal signal is guaranteed to be already pending in the
+ * unlikely event, that current->signal->exec_bprm happens
+ * to be non-zero here, so just release the mutex again
+ * and continue as if mutex_lock_killable did fail.
+ */
+ if (unlikely(current->signal->exec_bprm)) {
+ mutex_unlock(&current->signal->cred_guard_mutex);
+ goto out_put_fd;
+ }
+ }
spin_lock_irq(&current->sighand->siglock);
diff --git a/security/apparmor/lsm.c b/security/apparmor/lsm.c
index b3f7a3258a2c..80c2cd968f05 100644
--- a/security/apparmor/lsm.c
+++ b/security/apparmor/lsm.c
@@ -129,7 +129,10 @@ static int apparmor_ptrace_access_check(struct task_struct *child,
int error;
bool needput;
- cred = get_task_cred(child);
+ if (mode & PTRACE_MODE_BPRMCREDS)
+ cred = get_cred(child->signal->exec_bprm->cred);
+ else
+ cred = get_task_cred(child);
tracee = cred_label(cred); /* ref count on cred */
tracer = __begin_current_label_crit_section(&needput);
error = aa_may_ptrace(current_cred(), tracer, cred, tracee,
diff --git a/security/commoncap.c b/security/commoncap.c
index 6bd4adeb4795..e23a78a4514c 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -169,7 +169,10 @@ int cap_ptrace_access_check(struct task_struct *child, unsigned int mode)
rcu_read_lock();
cred = current_cred();
- child_cred = __task_cred(child);
+ if (mode & PTRACE_MODE_BPRMCREDS)
+ child_cred = child->signal->exec_bprm->cred;
+ else
+ child_cred = __task_cred(child);
if (mode & PTRACE_MODE_FSCREDS)
caller_caps = &cred->cap_effective;
else
diff --git a/security/landlock/task.c b/security/landlock/task.c
index 2385017418ca..46ce9b6e4728 100644
--- a/security/landlock/task.c
+++ b/security/landlock/task.c
@@ -17,6 +17,7 @@
#include <linux/rcupdate.h>
#include <linux/sched.h>
#include <linux/sched/signal.h>
+#include <linux/binfmts.h>
#include <net/af_unix.h>
#include <net/sock.h>
@@ -96,7 +97,11 @@ static int hook_ptrace_access_check(struct task_struct *const child,
scoped_guard(rcu)
{
- child_dom = landlock_get_task_domain(child);
+ if (mode & PTRACE_MODE_BPRMCREDS)
+ child_dom = landlock_cred(child->signal->
+ exec_bprm->cred)->domain;
+ else
+ child_dom = landlock_get_task_domain(child);
err = domain_ptrace(parent_subject->domain, child_dom);
}
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index dfc22da42f30..016e21180e96 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -2111,7 +2111,12 @@ static int selinux_ptrace_access_check(struct task_struct *child,
unsigned int mode)
{
u32 sid = current_sid();
- u32 csid = task_sid_obj(child);
+ u32 csid;
+
+ if (mode & PTRACE_MODE_BPRMCREDS)
+ csid = cred_sid(child->signal->exec_bprm->cred);
+ else
+ csid = task_sid_obj(child);
if (mode & PTRACE_MODE_READ)
return avc_has_perm(sid, csid, SECCLASS_FILE, FILE__READ,
diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
index af986587841d..5c2ca49baa05 100644
--- a/security/smack/smack_lsm.c
+++ b/security/smack/smack_lsm.c
@@ -476,7 +476,10 @@ static int smack_ptrace_access_check(struct task_struct *ctp, unsigned int mode)
{
struct smack_known *skp;
- skp = smk_of_task_struct_obj(ctp);
+ if (mode & PTRACE_MODE_BPRMCREDS)
+ skp = smk_of_task(smack_cred(ctp->signal->exec_bprm->cred));
+ else
+ skp = smk_of_task_struct_obj(ctp);
return smk_ptrace_rule_check(current, skp, mode, __func__);
}
diff --git a/security/yama/yama_lsm.c b/security/yama/yama_lsm.c
index 3d064dd4e03f..8ac9ce41d4f0 100644
--- a/security/yama/yama_lsm.c
+++ b/security/yama/yama_lsm.c
@@ -11,6 +11,7 @@
#include <linux/lsm_hooks.h>
#include <linux/sysctl.h>
#include <linux/ptrace.h>
+#include <linux/binfmts.h>
#include <linux/prctl.h>
#include <linux/ratelimit.h>
#include <linux/workqueue.h>
@@ -363,13 +364,19 @@ static int yama_ptrace_access_check(struct task_struct *child,
rc = -EPERM;
if (!rc && !task_is_descendant(current, child) &&
!ptracer_exception_found(current, child) &&
- !ns_capable(__task_cred(child)->user_ns, CAP_SYS_PTRACE))
+ !ns_capable(mode & PTRACE_MODE_BPRMCREDS ?
+ child->signal->exec_bprm->cred->user_ns :
+ __task_cred(child)->user_ns,
+ CAP_SYS_PTRACE))
rc = -EPERM;
rcu_read_unlock();
break;
case YAMA_SCOPE_CAPABILITY:
rcu_read_lock();
- if (!ns_capable(__task_cred(child)->user_ns, CAP_SYS_PTRACE))
+ if (!ns_capable(mode & PTRACE_MODE_BPRMCREDS ?
+ child->signal->exec_bprm->cred->user_ns :
+ __task_cred(child)->user_ns,
+ CAP_SYS_PTRACE))
rc = -EPERM;
rcu_read_unlock();
break;
diff --git a/tools/testing/selftests/ptrace/vmaccess.c b/tools/testing/selftests/ptrace/vmaccess.c
index 4db327b44586..5d4a65eb5a8d 100644
--- a/tools/testing/selftests/ptrace/vmaccess.c
+++ b/tools/testing/selftests/ptrace/vmaccess.c
@@ -14,6 +14,7 @@
#include <signal.h>
#include <unistd.h>
#include <sys/ptrace.h>
+#include <sys/syscall.h>
static void *thread(void *arg)
{
@@ -23,7 +24,7 @@ static void *thread(void *arg)
TEST(vmaccess)
{
- int f, pid = fork();
+ int s, f, pid = fork();
char mm[64];
if (!pid) {
@@ -31,19 +32,42 @@ TEST(vmaccess)
pthread_create(&pt, NULL, thread, NULL);
pthread_join(pt, NULL);
- execlp("true", "true", NULL);
+ execlp("false", "false", NULL);
+ return;
}
sleep(1);
sprintf(mm, "/proc/%d/mem", pid);
+ /* deadlock did happen here */
f = open(mm, O_RDONLY);
ASSERT_GE(f, 0);
close(f);
- f = kill(pid, SIGCONT);
- ASSERT_EQ(f, 0);
+ f = waitpid(-1, &s, WNOHANG);
+ ASSERT_NE(f, -1);
+ ASSERT_NE(f, 0);
+ ASSERT_NE(f, pid);
+ ASSERT_EQ(WIFEXITED(s), 1);
+ ASSERT_EQ(WEXITSTATUS(s), 0);
+ f = waitpid(-1, &s, 0);
+ ASSERT_EQ(f, pid);
+ ASSERT_EQ(WIFEXITED(s), 1);
+ ASSERT_EQ(WEXITSTATUS(s), 1);
+ f = waitpid(-1, NULL, 0);
+ ASSERT_EQ(f, -1);
+ ASSERT_EQ(errno, ECHILD);
}
-TEST(attach)
+/*
+ * Same test as previous, except that
+ * we try to ptrace the group leader,
+ * which is about to call execve,
+ * when the other thread is already ptraced.
+ * This exercises the code in de_thread
+ * where it is waiting inside the
+ * while (sig->notify_count) {
+ * loop.
+ */
+TEST(attach1)
{
int s, k, pid = fork();
@@ -52,19 +76,76 @@ TEST(attach)
pthread_create(&pt, NULL, thread, NULL);
pthread_join(pt, NULL);
- execlp("sleep", "sleep", "2", NULL);
+ execlp("false", "false", NULL);
+ return;
}
sleep(1);
+ /* deadlock may happen here */
k = ptrace(PTRACE_ATTACH, pid, 0L, 0L);
- ASSERT_EQ(errno, EAGAIN);
- ASSERT_EQ(k, -1);
+ ASSERT_EQ(k, 0);
k = waitpid(-1, &s, WNOHANG);
ASSERT_NE(k, -1);
ASSERT_NE(k, 0);
ASSERT_NE(k, pid);
ASSERT_EQ(WIFEXITED(s), 1);
ASSERT_EQ(WEXITSTATUS(s), 0);
+ k = waitpid(-1, &s, 0);
+ ASSERT_EQ(k, pid);
+ ASSERT_EQ(WIFSTOPPED(s), 1);
+ ASSERT_EQ(WSTOPSIG(s), SIGTRAP);
+ k = waitpid(-1, &s, WNOHANG);
+ ASSERT_EQ(k, 0);
+ k = ptrace(PTRACE_CONT, pid, 0L, 0L);
+ ASSERT_EQ(k, 0);
+ k = waitpid(-1, &s, 0);
+ ASSERT_EQ(k, pid);
+ ASSERT_EQ(WIFSTOPPED(s), 1);
+ ASSERT_EQ(WSTOPSIG(s), SIGSTOP);
+ k = waitpid(-1, &s, WNOHANG);
+ ASSERT_EQ(k, 0);
+ k = ptrace(PTRACE_CONT, pid, 0L, 0L);
+ ASSERT_EQ(k, 0);
+ k = waitpid(-1, &s, 0);
+ ASSERT_EQ(k, pid);
+ ASSERT_EQ(WIFEXITED(s), 1);
+ ASSERT_EQ(WEXITSTATUS(s), 1);
+ k = waitpid(-1, NULL, 0);
+ ASSERT_EQ(k, -1);
+ ASSERT_EQ(errno, ECHILD);
+}
+
+/*
+ * Same test as previous, except that
+ * the group leader is ptraced first,
+ * but this time with PTRACE_O_TRACEEXIT,
+ * and the thread that does execve is
+ * not yet ptraced. This exercises the
+ * code block in de_thread where the
+ * if (!thread_group_leader(tsk)) {
+ * is executed and enters a wait state.
+ */
+static long thread2_tid;
+static void *thread2(void *arg)
+{
+ thread2_tid = syscall(__NR_gettid);
+ sleep(2);
+ execlp("false", "false", NULL);
+ return NULL;
+}
+
+TEST(attach2)
+{
+ int s, k, pid = fork();
+
+ if (!pid) {
+ pthread_t pt;
+
+ pthread_create(&pt, NULL, thread2, NULL);
+ pthread_join(pt, NULL);
+ return;
+ }
+
sleep(1);
k = ptrace(PTRACE_ATTACH, pid, 0L, 0L);
ASSERT_EQ(k, 0);
@@ -72,12 +153,46 @@ TEST(attach)
ASSERT_EQ(k, pid);
ASSERT_EQ(WIFSTOPPED(s), 1);
ASSERT_EQ(WSTOPSIG(s), SIGSTOP);
- k = ptrace(PTRACE_DETACH, pid, 0L, 0L);
+ k = ptrace(PTRACE_SETOPTIONS, pid, 0L, PTRACE_O_TRACEEXIT);
+ ASSERT_EQ(k, 0);
+ thread2_tid = ptrace(PTRACE_PEEKDATA, pid, &thread2_tid, 0L);
+ ASSERT_NE(thread2_tid, -1);
+ ASSERT_NE(thread2_tid, 0);
+ ASSERT_NE(thread2_tid, pid);
+ k = waitpid(-1, &s, WNOHANG);
+ ASSERT_EQ(k, 0);
+ sleep(2);
+ /* deadlock may happen here */
+ k = ptrace(PTRACE_ATTACH, thread2_tid, 0L, 0L);
+ ASSERT_EQ(k, 0);
+ k = waitpid(-1, &s, WNOHANG);
+ ASSERT_EQ(k, pid);
+ ASSERT_EQ(WIFSTOPPED(s), 1);
+ ASSERT_EQ(WSTOPSIG(s), SIGTRAP);
+ k = waitpid(-1, &s, WNOHANG);
+ ASSERT_EQ(k, 0);
+ k = ptrace(PTRACE_CONT, pid, 0L, 0L);
+ ASSERT_EQ(k, 0);
+ k = waitpid(-1, &s, 0);
+ ASSERT_EQ(k, pid);
+ ASSERT_EQ(WIFSTOPPED(s), 1);
+ ASSERT_EQ(WSTOPSIG(s), SIGTRAP);
+ k = waitpid(-1, &s, WNOHANG);
+ ASSERT_EQ(k, 0);
+ k = ptrace(PTRACE_CONT, pid, 0L, 0L);
+ ASSERT_EQ(k, 0);
+ k = waitpid(-1, &s, 0);
+ ASSERT_EQ(k, pid);
+ ASSERT_EQ(WIFSTOPPED(s), 1);
+ ASSERT_EQ(WSTOPSIG(s), SIGSTOP);
+ k = waitpid(-1, &s, WNOHANG);
+ ASSERT_EQ(k, 0);
+ k = ptrace(PTRACE_CONT, pid, 0L, 0L);
ASSERT_EQ(k, 0);
k = waitpid(-1, &s, 0);
ASSERT_EQ(k, pid);
ASSERT_EQ(WIFEXITED(s), 1);
- ASSERT_EQ(WEXITSTATUS(s), 0);
+ ASSERT_EQ(WEXITSTATUS(s), 1);
k = waitpid(-1, NULL, 0);
ASSERT_EQ(k, -1);
ASSERT_EQ(errno, ECHILD);
--
2.39.5
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH v18] exec: Fix dead-lock in de_thread with ptrace_attach
2025-11-18 18:13 ` [PATCH v18] exec: Fix dead-lock in de_thread with ptrace_attach Bernd Edlinger
@ 2025-11-20 15:15 ` Eric W. Biederman
2025-11-20 17:29 ` Eric W. Biederman
0 siblings, 1 reply; 68+ messages in thread
From: Eric W. Biederman @ 2025-11-20 15:15 UTC (permalink / raw)
To: Bernd Edlinger
Cc: Alexander Viro, Alexey Dobriyan, Oleg Nesterov, Kees Cook,
Andy Lutomirski, Will Drewry, Christian Brauner, Andrew Morton,
Michal Hocko, Serge Hallyn, James Morris, Randy Dunlap,
Suren Baghdasaryan, Yafang Shao, Helge Deller, Adrian Reber,
Thomas Gleixner, Jens Axboe, Alexei Starovoitov, linux-fsdevel,
linux-kernel, linux-kselftest, linux-mm, linux-security-module,
tiozhang, Luis Chamberlain, Paulo Alcantara (SUSE),
Sergey Senozhatsky, Frederic Weisbecker, YueHaibing, Paul Moore,
Aleksa Sarai, Stefan Roesch, Chao Yu, xu xin, Jeff Layton,
Jan Kara, David Hildenbrand, Dave Chinner, Shuah Khan,
Elena Reshetova, David Windsor, Mateusz Guzik, Ard Biesheuvel,
Joel Fernandes (Google), Matthew Wilcox (Oracle),
Hans Liljestrand, Penglei Jiang, Lorenzo Stoakes, Adrian Ratiu,
Ingo Molnar, Peter Zijlstra (Intel),
Cyrill Gorcunov, Eric Dumazet
Bernd Edlinger <bernd.edlinger@hotmail.de> writes:
> This introduces signal->exec_bprm, which is used to
> fix the case when at least one of the sibling threads
> is traced, and therefore the trace process may dead-lock
> in ptrace_attach, but de_thread will need to wait for the
> tracer to continue execution.
A small quibble: it isn't a dead lock. It isn't even really a live lock,
as it is possible to SIGKILL our way out.
Thinking about this there is a really silly and simple way we can deal
with this situation for PTRACE_ATTACH. We can send SIGSTOP and wait for
the thread to stop before doing anything with cred_guard_mutex.
PTRACE_ATTACH already implies sending SIGSTOP so as long as we have
enough permissions to send SIGSTOP I don't see that being a problem.
The worst case I can see is that we get a case where we stop the
process, the permission check fails under cred_guard_mutex, and
ptrace attach fails and has to send SIGCONT to undo its
premature SIGSTOP. That might almost be visible, but it would still
be legitimate because we can still check that we have permission to
send SIGSTOP.
Eric
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH v18] exec: Fix dead-lock in de_thread with ptrace_attach
2025-11-20 15:15 ` Eric W. Biederman
@ 2025-11-20 17:29 ` Eric W. Biederman
2025-11-20 20:57 ` [RFC][PATCH] exec: Move cred computation under exec_update_lock Eric W. Biederman
2025-11-22 17:10 ` [PATCH v18] exec: Fix dead-lock in de_thread with ptrace_attach Bernd Edlinger
0 siblings, 2 replies; 68+ messages in thread
From: Eric W. Biederman @ 2025-11-20 17:29 UTC (permalink / raw)
To: Bernd Edlinger
Cc: Alexander Viro, Alexey Dobriyan, Oleg Nesterov, Kees Cook,
Andy Lutomirski, Will Drewry, Christian Brauner, Andrew Morton,
Michal Hocko, Serge Hallyn, James Morris, Randy Dunlap,
Suren Baghdasaryan, Yafang Shao, Helge Deller, Adrian Reber,
Thomas Gleixner, Jens Axboe, Alexei Starovoitov, linux-fsdevel,
linux-kernel, linux-kselftest, linux-mm, linux-security-module,
tiozhang, Luis Chamberlain, Paulo Alcantara (SUSE),
Sergey Senozhatsky, Frederic Weisbecker, YueHaibing, Paul Moore,
Aleksa Sarai, Stefan Roesch, Chao Yu, xu xin, Jeff Layton,
Jan Kara, David Hildenbrand, Dave Chinner, Shuah Khan,
Elena Reshetova, David Windsor, Mateusz Guzik, Ard Biesheuvel,
Joel Fernandes (Google), Matthew Wilcox (Oracle),
Hans Liljestrand, Penglei Jiang, Lorenzo Stoakes, Adrian Ratiu,
Ingo Molnar, Peter Zijlstra (Intel),
Cyrill Gorcunov, Eric Dumazet
"Eric W. Biederman" <ebiederm@xmission.com> writes:
> Bernd Edlinger <bernd.edlinger@hotmail.de> writes:
>
>> This introduces signal->exec_bprm, which is used to
>> fix the case when at least one of the sibling threads
>> is traced, and therefore the trace process may dead-lock
>> in ptrace_attach, but de_thread will need to wait for the
>> tracer to continue execution.
>
> A small quibble: it isn't a dead lock. It isn't even really a live lock,
> as it is possible to SIGKILL our way out.
>
> Thinking about this there is a really silly and simple way we can deal
> with this situation for PTRACE_ATTACH. We can send SIGSTOP and wait for
> the thread to stop before doing anything with cred_guard_mutex.
>
> PTRACE_ATTACH already implies sending SIGSTOP so as long as we have
> enough permissions to send SIGSTOP I don't see that being a problem.
>
> The worst case I can see is that we get a case where we stop the
> process, the permission check fails under cred_guard_mutex, and
> ptrace attach fails and has to send SIGCONT to undo its
> premature SIGSTOP. That might almost be visible, but it would still
> be legitimate because we can still check that we have permission to
> send SIGSTOP.
Bah no I am full of it.
The challenging behavior is in the semantics of the kernel operations.
We need to describe it as such please.
It is the same class of problem as when a single-threaded process calls
exec with a pipe attached to both stdin and stdout of the new process.
For the stdin and stdout we can say just use poll and nonblocking I/O.
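To make the pipe analogy concrete, here is a minimal userspace sketch (not taken from this thread) of the poll plus nonblocking I/O approach. The helper name roundtrip_through_cat() and the choice of cat(1) as the child are assumptions made purely for illustration. The point is that the parent never sits in a blocking write() while the child is itself stuck writing back, which is how a naive "write everything, then read" loop can hang once both pipes are full.

```c
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (name and shape are assumptions): push len bytes
 * through a child running cat(1) and collect the echo in out.
 * Returns the number of bytes read back, or -1 if setup failed.
 */
static ssize_t roundtrip_through_cat(const char *in, size_t len, char *out)
{
	int to_child[2], from_child[2];
	size_t sent = 0, got = 0;
	int rfd, wfd;
	pid_t pid;

	if (pipe(to_child))
		return -1;
	if (pipe(from_child)) {
		close(to_child[0]);
		close(to_child[1]);
		return -1;
	}

	pid = fork();
	if (pid < 0) {
		close(to_child[0]);
		close(to_child[1]);
		close(from_child[0]);
		close(from_child[1]);
		return -1;
	}
	if (pid == 0) {
		/* Child: pipes become stdin/stdout of the new program. */
		dup2(to_child[0], STDIN_FILENO);
		dup2(from_child[1], STDOUT_FILENO);
		close(to_child[0]);
		close(to_child[1]);
		close(from_child[0]);
		close(from_child[1]);
		execlp("cat", "cat", (char *)NULL);
		_exit(127);
	}

	close(to_child[0]);
	close(from_child[1]);
	wfd = to_child[1];
	rfd = from_child[0];

	/* Neither end may block: the child can be stuck writing to us. */
	fcntl(wfd, F_SETFL, O_NONBLOCK);
	fcntl(rfd, F_SETFL, O_NONBLOCK);
	/* Process-wide side effect: a dead child yields EPIPE, not a signal. */
	signal(SIGPIPE, SIG_IGN);

	while (got < len) {
		struct pollfd p[2] = {
			{ .fd = rfd, .events = POLLIN },
			{ .fd = wfd, .events = POLLOUT }, /* poll() skips fd < 0 */
		};
		ssize_t n;

		if (poll(p, 2, -1) < 0) {
			if (errno == EINTR)
				continue;
			break;
		}
		if (p[1].revents & (POLLOUT | POLLERR)) {
			n = write(wfd, in + sent, len - sent);
			if (n > 0)
				sent += n;
			else if (n < 0 && errno != EAGAIN && errno != EINTR)
				break;
			if (sent == len) {
				close(wfd);	/* lets cat see EOF */
				wfd = -1;
			}
		}
		if (p[0].revents & (POLLIN | POLLHUP)) {
			n = read(rfd, out + got, len - got);
			if (n > 0)
				got += n;
			else if (n == 0 || (errno != EAGAIN && errno != EINTR))
				break;
		}
	}

	if (wfd >= 0)
		close(wfd);
	close(rfd);
	waitpid(pid, NULL, 0);
	return (ssize_t)got;
}
```

No comparable escape hatch exists for the ptrace case described next, since PTRACE_ATTACH and PTRACE_SEIZE have no nonblocking mode.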
The problem is that both PTRACE_ATTACH and PTRACE_SEIZE block over
the duration of exec, and if exec is waiting for a thread to exit,
and that thread is blocked in PTRACE_EVENT_EXIT waiting for that very
same tracer, those processes will hang. Not deadlock.
I haven't seen anyone clearly describe the problem lately so I am
repeating it.
Just looking at the code I don't think there is any fundamental reason
to call commit_creds after de_thread. If we can change that we can sort
this out without any change in userspace semantics.
If we can't move commit_creds we have to either give
PTRACE_ATTACH/PTRACE_SEIZE a non-block mode, or break out of
PTRACE_EVENT_EXIT in de_thread.
I will post a proof of concept of moving commit_creds in just a minute.
Eric
^ permalink raw reply [flat|nested] 68+ messages in thread
* [RFC][PATCH] exec: Move cred computation under exec_update_lock
2025-11-20 17:29 ` Eric W. Biederman
@ 2025-11-20 20:57 ` Eric W. Biederman
2025-11-20 23:50 ` Eric W. Biederman
` (2 more replies)
2025-11-22 17:10 ` [PATCH v18] exec: Fix dead-lock in de_thread with ptrace_attach Bernd Edlinger
1 sibling, 3 replies; 68+ messages in thread
From: Eric W. Biederman @ 2025-11-20 20:57 UTC (permalink / raw)
To: Bernd Edlinger
Cc: Alexander Viro, Alexey Dobriyan, Oleg Nesterov, Kees Cook,
Andy Lutomirski, Will Drewry, Christian Brauner, Andrew Morton,
Michal Hocko, Serge Hallyn, James Morris, Randy Dunlap,
Suren Baghdasaryan, Yafang Shao, Helge Deller, Adrian Reber,
Thomas Gleixner, Jens Axboe, Alexei Starovoitov, linux-fsdevel,
linux-kernel, linux-kselftest, linux-mm, linux-security-module,
tiozhang, Luis Chamberlain, Paulo Alcantara (SUSE),
Sergey Senozhatsky, Frederic Weisbecker, YueHaibing, Paul Moore,
Aleksa Sarai, Stefan Roesch, Chao Yu, xu xin, Jeff Layton,
Jan Kara, David Hildenbrand, Dave Chinner, Shuah Khan,
Elena Reshetova, David Windsor, Mateusz Guzik, Ard Biesheuvel,
Joel Fernandes (Google), Matthew Wilcox (Oracle),
Hans Liljestrand, Penglei Jiang, Lorenzo Stoakes, Adrian Ratiu,
Ingo Molnar, Peter Zijlstra (Intel),
Cyrill Gorcunov, Eric Dumazet
Instead of computing the new cred before we pass the point of no
return compute the new cred just before we use it.
This allows the removal of fs_struct->in_exec and cred_guard_mutex.
I am not certain why we wanted to compute the cred for the new
executable so early. Perhaps I missed something but I did not see any
common errors being signaled. So I don't think we lose anything by
computing the new cred later.
We gain a lot.
We stop holding the cred_guard_mutex over places where the code sleeps
and waits for userspace. These places include the waiting for the
tracer in PTRACE_EVENT_EXIT, "put_user(0, tsk->clear_child_tid)" in
mm_release, and "get_user(futex_offset, ...)" in exit_robust_list.
We can remove fs_struct->in_exec. The case where it was used simply
never comes up, when we compute the cred after de_thread completes.
We remove the possibility of a hang between a tracer calling
PTRACE_ATTACH/PTRACE_SEIZE and the kernel waiting for the tracer
in PTRACE_EVENT_EXIT.
---
Oleg, Kees, Bernd, Can you see anything I am missing?
The code compiles but I haven't tested it yet.
I thought I was going to move commit_creds before de_thread, but that
would have taken commit_creds out of exec_update_lock (which introduces
races).
However I can't see any drawbacks of going the other direction.
fs/exec.c | 88 ++++++++++++++----------------------
fs/fs_struct.c | 1 -
fs/proc/base.c | 4 +-
include/linux/fs_struct.h | 1 -
include/linux/sched/signal.h | 6 ---
init/init_task.c | 1 -
kernel/cred.c | 2 +-
kernel/fork.c | 8 +---
kernel/ptrace.c | 4 +-
kernel/seccomp.c | 12 ++---
10 files changed, 45 insertions(+), 82 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c
index 4298e7e08d5d..5ae96584dab0 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1090,6 +1090,9 @@ void __set_task_comm(struct task_struct *tsk, const char *buf, bool exec)
perf_event_comm(tsk, exec);
}
+static int prepare_bprm_creds(struct linux_binprm *bprm);
+static void check_unsafe_exec(struct linux_binprm *bprm);
+
/*
* Calling this is the point of no return. None of the failures will be
* seen by userspace since either the process is already taking a fatal
@@ -1101,10 +1104,6 @@ int begin_new_exec(struct linux_binprm * bprm)
struct task_struct *me = current;
int retval;
- /* Once we are committed compute the creds */
- retval = bprm_creds_from_file(bprm);
- if (retval)
- return retval;
/*
* This tracepoint marks the point before flushing the old exec where
@@ -1123,8 +1122,6 @@ int begin_new_exec(struct linux_binprm * bprm)
retval = de_thread(me);
if (retval)
goto out;
- /* see the comment in check_unsafe_exec() */
- current->fs->in_exec = 0;
/*
* Cancel any io_uring activity across execve
*/
@@ -1251,6 +1248,25 @@ int begin_new_exec(struct linux_binprm * bprm)
WRITE_ONCE(me->self_exec_id, me->self_exec_id + 1);
flush_signal_handlers(me, 0);
+ retval = prepare_bprm_creds(bprm);
+ if (retval)
+ goto out_unlock;
+
+ /*
+ * Check for unsafe execution states before exec_binprm(), which
+ * will call back into begin_new_exec(), into bprm_creds_from_file(),
+ * where setuid-ness is evaluated.
+ */
+ check_unsafe_exec(bprm);
+
+ /* Set the unchanging part of bprm->cred */
+ retval = security_bprm_creds_for_exec(bprm);
+
+ /* Once we are committed compute the creds */
+ retval = bprm_creds_from_file(bprm);
+ if (retval)
+ goto out_unlock;
+
retval = set_cred_ucounts(bprm->cred);
if (retval < 0)
goto out_unlock;
@@ -1272,9 +1288,9 @@ int begin_new_exec(struct linux_binprm * bprm)
if (get_dumpable(me->mm) != SUID_DUMP_USER)
perf_event_exit_task(me);
/*
- * cred_guard_mutex must be held at least to this point to prevent
+ * exec_update_lock must be held at least to this point to prevent
* ptrace_attach() from altering our determination of the task's
- * credentials; any time after this it may be unlocked.
+ * credentials.
*/
security_bprm_committed_creds(bprm);
@@ -1291,8 +1307,6 @@ int begin_new_exec(struct linux_binprm * bprm)
out_unlock:
up_write(&me->signal->exec_update_lock);
- if (!bprm->cred)
- mutex_unlock(&me->signal->cred_guard_mutex);
out:
return retval;
@@ -1336,7 +1350,6 @@ void setup_new_exec(struct linux_binprm * bprm)
*/
me->mm->task_size = TASK_SIZE;
up_write(&me->signal->exec_update_lock);
- mutex_unlock(&me->signal->cred_guard_mutex);
}
EXPORT_SYMBOL(setup_new_exec);
@@ -1351,21 +1364,15 @@ void finalize_exec(struct linux_binprm *bprm)
EXPORT_SYMBOL(finalize_exec);
/*
- * Prepare credentials and lock ->cred_guard_mutex.
- * setup_new_exec() commits the new creds and drops the lock.
- * Or, if exec fails before, free_bprm() should release ->cred
- * and unlock.
+ * Prepare credentials. begin_new_exec() commits the new creds.
+ * Or, if exec fails before, free_bprm() should release ->cred.
*/
static int prepare_bprm_creds(struct linux_binprm *bprm)
{
- if (mutex_lock_interruptible(&current->signal->cred_guard_mutex))
- return -ERESTARTNOINTR;
-
bprm->cred = prepare_exec_creds();
if (likely(bprm->cred))
return 0;
- mutex_unlock(&current->signal->cred_guard_mutex);
return -ENOMEM;
}
@@ -1386,9 +1393,7 @@ static void free_bprm(struct linux_binprm *bprm)
}
free_arg_pages(bprm);
if (bprm->cred) {
- /* in case exec fails before de_thread() succeeds */
- current->fs->in_exec = 0;
- mutex_unlock(&current->signal->cred_guard_mutex);
+ /* in case exec fails before commit_creds succeeds */
abort_creds(bprm->cred);
}
do_close_execat(bprm->file);
@@ -1486,13 +1491,12 @@ EXPORT_SYMBOL(bprm_change_interp);
/*
* determine how safe it is to execute the proposed program
- * - the caller must hold ->cred_guard_mutex to protect against
+ * - the caller must hold ->exec_update_lock to protect against
* PTRACE_ATTACH or seccomp thread-sync
*/
static void check_unsafe_exec(struct linux_binprm *bprm)
{
- struct task_struct *p = current, *t;
- unsigned n_fs;
+ struct task_struct *p = current;
if (p->ptrace)
bprm->unsafe |= LSM_UNSAFE_PTRACE;
@@ -1509,25 +1513,9 @@ static void check_unsafe_exec(struct linux_binprm *bprm)
* suid exec because the differently privileged task
* will be able to manipulate the current directory, etc.
* It would be nice to force an unshare instead...
- *
- * Otherwise we set fs->in_exec = 1 to deny clone(CLONE_FS)
- * from another sub-thread until de_thread() succeeds, this
- * state is protected by cred_guard_mutex we hold.
*/
- n_fs = 1;
- read_seqlock_excl(&p->fs->seq);
- rcu_read_lock();
- for_other_threads(p, t) {
- if (t->fs == p->fs)
- n_fs++;
- }
- rcu_read_unlock();
-
- /* "users" and "in_exec" locked for copy_fs() */
- if (p->fs->users > n_fs)
+ if (p->fs->users > 1)
bprm->unsafe |= LSM_UNSAFE_SHARE;
- else
- p->fs->in_exec = 1;
read_sequnlock_excl(&p->fs->seq);
}
@@ -1731,25 +1719,15 @@ static int bprm_execve(struct linux_binprm *bprm)
{
int retval;
- retval = prepare_bprm_creds(bprm);
- if (retval)
- return retval;
+ if (bprm->is_check)
+ return 0;
- /*
- * Check for unsafe execution states before exec_binprm(), which
- * will call back into begin_new_exec(), into bprm_creds_from_file(),
- * where setuid-ness is evaluated.
- */
- check_unsafe_exec(bprm);
current->in_execve = 1;
sched_mm_cid_before_execve(current);
sched_exec();
- /* Set the unchanging part of bprm->cred */
- retval = security_bprm_creds_for_exec(bprm);
- if (retval || bprm->is_check)
- goto out;
+
retval = exec_binprm(bprm);
if (retval < 0)
diff --git a/fs/fs_struct.c b/fs/fs_struct.c
index 28be762ac1c6..945bc0916f65 100644
--- a/fs/fs_struct.c
+++ b/fs/fs_struct.c
@@ -109,7 +109,6 @@ struct fs_struct *copy_fs_struct(struct fs_struct *old)
/* We don't need to lock fs - think why ;-) */
if (fs) {
fs->users = 1;
- fs->in_exec = 0;
seqlock_init(&fs->seq);
fs->umask = old->umask;
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 6299878e3d97..7041fb4d1689 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2834,14 +2834,14 @@ static ssize_t proc_pid_attr_write(struct file * file, const char __user * buf,
}
/* Guard against adverse ptrace interaction */
- rv = mutex_lock_interruptible(&current->signal->cred_guard_mutex);
+ rv = down_write_killable(&current->signal->exec_update_lock);
if (rv < 0)
goto out_free;
rv = security_setprocattr(PROC_I(inode)->op.lsmid,
file->f_path.dentry->d_name.name, page,
count);
- mutex_unlock(&current->signal->cred_guard_mutex);
+ up_write(&current->signal->exec_update_lock);
out_free:
kfree(page);
out:
diff --git a/include/linux/fs_struct.h b/include/linux/fs_struct.h
index baf200ab5c77..29d0f7d57743 100644
--- a/include/linux/fs_struct.h
+++ b/include/linux/fs_struct.h
@@ -10,7 +10,6 @@ struct fs_struct {
int users;
seqlock_t seq;
int umask;
- int in_exec;
struct path root, pwd;
} __randomize_layout;
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 7d6449982822..7e9259c8fb2b 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -241,12 +241,6 @@ struct signal_struct {
struct mm_struct *oom_mm; /* recorded mm when the thread group got
* killed by the oom killer */
- struct mutex cred_guard_mutex; /* guard against foreign influences on
- * credential calculations
- * (notably. ptrace)
- * Deprecated do not use in new code.
- * Use exec_update_lock instead.
- */
struct rw_semaphore exec_update_lock; /* Held while task_struct is
* being updated during exec,
* and may have inconsistent
diff --git a/init/init_task.c b/init/init_task.c
index a55e2189206f..4813bffe217e 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -30,7 +30,6 @@ static struct signal_struct init_signals = {
#ifdef CONFIG_CGROUPS
.cgroup_threadgroup_rwsem = __RWSEM_INITIALIZER(init_signals.cgroup_threadgroup_rwsem),
#endif
- .cred_guard_mutex = __MUTEX_INITIALIZER(init_signals.cred_guard_mutex),
.exec_update_lock = __RWSEM_INITIALIZER(init_signals.exec_update_lock),
#ifdef CONFIG_POSIX_TIMERS
.posix_timers = HLIST_HEAD_INIT,
diff --git a/kernel/cred.c b/kernel/cred.c
index dbf6b687dc5c..80e376ce005f 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -252,7 +252,7 @@ EXPORT_SYMBOL(prepare_creds);
/*
* Prepare credentials for current to perform an execve()
- * - The caller must hold ->cred_guard_mutex
+ * - The caller must hold ->exec_update_lock
*/
struct cred *prepare_exec_creds(void)
{
diff --git a/kernel/fork.c b/kernel/fork.c
index 3da0f08615a9..996c649b9a4c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1555,11 +1555,6 @@ static int copy_fs(u64 clone_flags, struct task_struct *tsk)
if (clone_flags & CLONE_FS) {
/* tsk->fs is already what we want */
read_seqlock_excl(&fs->seq);
- /* "users" and "in_exec" locked for check_unsafe_exec() */
- if (fs->in_exec) {
- read_sequnlock_excl(&fs->seq);
- return -EAGAIN;
- }
fs->users++;
read_sequnlock_excl(&fs->seq);
return 0;
@@ -1699,7 +1694,6 @@ static int copy_signal(u64 clone_flags, struct task_struct *tsk)
sig->oom_score_adj = current->signal->oom_score_adj;
sig->oom_score_adj_min = current->signal->oom_score_adj_min;
- mutex_init(&sig->cred_guard_mutex);
init_rwsem(&sig->exec_update_lock);
return 0;
@@ -1710,7 +1704,7 @@ static void copy_seccomp(struct task_struct *p)
#ifdef CONFIG_SECCOMP
/*
* Must be called with sighand->lock held, which is common to
- * all threads in the group. Holding cred_guard_mutex is not
+ * all threads in the group. Holding exec_update_lock is not
* needed because this new task is not yet running and cannot
* be racing exec.
*/
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 75a84efad40f..8140d4bfc279 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -444,8 +444,8 @@ static int ptrace_attach(struct task_struct *task, long request,
* SUID, SGID and LSM creds get determined differently
* under ptrace.
*/
- scoped_cond_guard (mutex_intr, return -ERESTARTNOINTR,
- &task->signal->cred_guard_mutex) {
+ scoped_cond_guard (rwsem_read_intr, return -ERESTARTNOINTR,
+ &task->signal->exec_update_lock) {
scoped_guard (task_lock, task) {
retval = __ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS);
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 25f62867a16d..87de8d47d876 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -479,7 +479,7 @@ static int is_ancestor(struct seccomp_filter *parent,
/**
* seccomp_can_sync_threads: checks if all threads can be synchronized
*
- * Expects sighand and cred_guard_mutex locks to be held.
+ * Expects sighand and exec_update_lock locks to be held.
*
* Returns 0 on success, -ve on error, or the pid of a thread which was
* either not in the correct seccomp mode or did not have an ancestral
@@ -489,7 +489,7 @@ static inline pid_t seccomp_can_sync_threads(void)
{
struct task_struct *thread, *caller;
- BUG_ON(!mutex_is_locked(&current->signal->cred_guard_mutex));
+ BUG_ON(!rwsem_is_locked(&current->signal->exec_update_lock));
assert_spin_locked(&current->sighand->siglock);
/* Validate all threads being eligible for synchronization. */
@@ -590,7 +590,7 @@ void seccomp_filter_release(struct task_struct *tsk)
*
* @flags: SECCOMP_FILTER_FLAG_* flags to set during sync.
*
- * Expects sighand and cred_guard_mutex locks to be held, and for
+ * Expects sighand and exec_update_lock locks to be held, and for
* seccomp_can_sync_threads() to have returned success already
* without dropping the locks.
*
@@ -599,7 +599,7 @@ static inline void seccomp_sync_threads(unsigned long flags)
{
struct task_struct *thread, *caller;
- BUG_ON(!mutex_is_locked(&current->signal->cred_guard_mutex));
+ BUG_ON(!rwsem_is_locked(&current->signal->exec_update_lock));
assert_spin_locked(&current->sighand->siglock);
/*
@@ -2011,7 +2011,7 @@ static long seccomp_set_mode_filter(unsigned int flags,
* while another thread is in the middle of calling exec.
*/
if (flags & SECCOMP_FILTER_FLAG_TSYNC &&
- mutex_lock_killable(&current->signal->cred_guard_mutex))
+ down_read_killable(&current->signal->exec_update_lock))
goto out_put_fd;
spin_lock_irq(&current->sighand->siglock);
@@ -2034,7 +2034,7 @@ static long seccomp_set_mode_filter(unsigned int flags,
out:
spin_unlock_irq(&current->sighand->siglock);
if (flags & SECCOMP_FILTER_FLAG_TSYNC)
- mutex_unlock(&current->signal->cred_guard_mutex);
+ up_read(&current->signal->exec_update_lock);
out_put_fd:
if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) {
if (ret) {
--
2.41.0
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [RFC][PATCH] exec: Move cred computation under exec_update_lock
2025-11-20 20:57 ` [RFC][PATCH] exec: Move cred computation under exec_update_lock Eric W. Biederman
@ 2025-11-20 23:50 ` Eric W. Biederman
2025-11-21 2:59 ` Bernd Edlinger
2025-11-23 18:52 ` Oleg Nesterov
2025-11-25 11:55 ` Roberto Sassu
2 siblings, 1 reply; 68+ messages in thread
From: Eric W. Biederman @ 2025-11-20 23:50 UTC (permalink / raw)
To: Bernd Edlinger
Cc: Alexander Viro, Alexey Dobriyan, Oleg Nesterov, Kees Cook,
Andy Lutomirski, Will Drewry, Christian Brauner, Andrew Morton,
Michal Hocko, Serge Hallyn, James Morris, Randy Dunlap,
Suren Baghdasaryan, Yafang Shao, Helge Deller, Adrian Reber,
Thomas Gleixner, Jens Axboe, Alexei Starovoitov, linux-fsdevel,
linux-kernel, linux-kselftest, linux-mm, linux-security-module,
tiozhang, Luis Chamberlain, Paulo Alcantara (SUSE),
Sergey Senozhatsky, Frederic Weisbecker, YueHaibing, Paul Moore,
Aleksa Sarai, Stefan Roesch, Chao Yu, xu xin, Jeff Layton,
Jan Kara, David Hildenbrand, Dave Chinner, Shuah Khan,
Elena Reshetova, David Windsor, Mateusz Guzik, Ard Biesheuvel,
Joel Fernandes (Google), Matthew Wilcox (Oracle),
Hans Liljestrand, Penglei Jiang, Lorenzo Stoakes, Adrian Ratiu,
Ingo Molnar, Peter Zijlstra (Intel),
Cyrill Gorcunov, Eric Dumazet
"Eric W. Biederman" <ebiederm@xmission.com> writes:
> Instead of computing the new cred before we pass the point of no
> return compute the new cred just before we use it.
>
> This allows the removal of fs_struct->in_exec and cred_guard_mutex.
>
> I am not certain why we wanted to compute the cred for the new
> executable so early. Perhaps I missed something but I did not see any
> common errors being signaled. So I don't think we lose anything by
> computing the new cred later.
I should add that the permission checks happen in open_exec,
everything that follows credential wise is just about representing in
struct cred the credentials the new executable will have.
So I am really at a loss why we have had this complicated way of
computing the credentials all of these years, full of
time-of-check to time-of-use problems.
Eric
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [RFC][PATCH] exec: Move cred computation under exec_update_lock
2025-11-20 23:50 ` Eric W. Biederman
@ 2025-11-21 2:59 ` Bernd Edlinger
2025-11-21 7:18 ` Eric W. Biederman
0 siblings, 1 reply; 68+ messages in thread
From: Bernd Edlinger @ 2025-11-21 2:59 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Alexander Viro, Alexey Dobriyan, Oleg Nesterov, Kees Cook,
Andy Lutomirski, Will Drewry, Christian Brauner, Andrew Morton,
Michal Hocko, Serge Hallyn, James Morris, Randy Dunlap,
Suren Baghdasaryan, Yafang Shao, Helge Deller, Adrian Reber,
Thomas Gleixner, Jens Axboe, Alexei Starovoitov, linux-fsdevel,
linux-kernel, linux-kselftest, linux-mm, linux-security-module,
tiozhang, Luis Chamberlain, Paulo Alcantara (SUSE),
Sergey Senozhatsky, Frederic Weisbecker, YueHaibing, Paul Moore,
Aleksa Sarai, Stefan Roesch, Chao Yu, xu xin, Jeff Layton,
Jan Kara, David Hildenbrand, Dave Chinner, Shuah Khan,
Elena Reshetova, David Windsor, Mateusz Guzik, Ard Biesheuvel,
Joel Fernandes (Google), Matthew Wilcox (Oracle),
Hans Liljestrand, Penglei Jiang, Lorenzo Stoakes, Adrian Ratiu,
Ingo Molnar, Peter Zijlstra (Intel),
Cyrill Gorcunov, Eric Dumazet
Hi Eric,
thanks for your valuable input on the topic.
On 11/21/25 00:50, Eric W. Biederman wrote:
> "Eric W. Biederman" <ebiederm@xmission.com> writes:
>
>> Instead of computing the new cred before we pass the point of no
>> return compute the new cred just before we use it.
>>
>> This allows the removal of fs_struct->in_exec and cred_guard_mutex.
>>
>> I am not certain why we wanted to compute the cred for the new
>> executable so early. Perhaps I missed something but I did not see any
>> common errors being signaled. So I don't think we loose anything by
>> computing the new cred later.
>
> I should add that the permission checks happen in open_exec,
> everything that follows credential wise is just about representing in
> struct cred the credentials the new executable will have.
>
> So I am really at a loss why we have had this complicated way of
> computing of computed the credentials all of these years full of
> time of check to time of use problems.
>
Well, I think I see a problem with your patch:
When the security engine gets the LSM_UNSAFE_PTRACE flag, it might
e.g. return -EPERM in bprm_creds_for_exec in the apparmor, selinux
or the smack security engines at least. Previously that callback
was called before the point of no return, and the return code should
be returned to the caller of execve. But if we move
that check after the point of no return, the caller will get killed
due to the failed security check.
Or did I miss something?
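A minimal user-space sketch of the two outcomes under discussion, as seen
by the parent of the exec'ing task. The helper try_exec() and the enum are
hypothetical names used only for illustration; the sketch merely classifies
whether execve() returned an error to its caller or the task was killed by
a signal, it does not reproduce the LSM denial itself.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical classification of how an exec attempt ended. */
enum exec_outcome {
	EXEC_RETURNED_ERROR,	/* execve() returned -1, the caller kept running */
	EXEC_KILLED_BY_SIGNAL,	/* the task died, e.g. after the point of no return */
	EXEC_OTHER		/* exec succeeded, or fork/wait failed */
};

/*
 * Fork a child that calls execve(path). If execve() returns, the child
 * passes errno back through its exit status. In this sketch a non-zero
 * exit status of a successfully exec'ed program would be misread as an
 * errno value, so only use it with paths that are expected to fail.
 */
enum exec_outcome try_exec(const char *path, int *err_or_sig)
{
	int status;
	pid_t pid = fork();

	if (pid < 0)
		return EXEC_OTHER;

	if (pid == 0) {
		char *const argv[] = { (char *)path, NULL };
		char *const envp[] = { NULL };

		execve(path, argv, envp);
		/* Only reached when execve() failed before the point of no return. */
		_exit(errno);
	}

	if (waitpid(pid, &status, 0) != pid)
		return EXEC_OTHER;

	if (WIFSIGNALED(status)) {
		*err_or_sig = WTERMSIG(status);
		return EXEC_KILLED_BY_SIGNAL;
	}
	if (WIFEXITED(status) && WEXITSTATUS(status) != 0) {
		*err_or_sig = WEXITSTATUS(status);
		return EXEC_RETURNED_ERROR;
	}
	return EXEC_OTHER;
}
```

With the callback before the point of no return, a denied exec would show
up as EXEC_RETURNED_ERROR with EPERM or EACCES; with the check moved later,
the same denial would be observed as EXEC_KILLED_BY_SIGNAL.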
Thanks
Bernd.
> Eric
* Re: [RFC][PATCH] exec: Move cred computation under exec_update_lock
2025-11-21 2:59 ` Bernd Edlinger
@ 2025-11-21 7:18 ` Eric W. Biederman
2025-11-21 9:35 ` Bernd Edlinger
0 siblings, 1 reply; 68+ messages in thread
From: Eric W. Biederman @ 2025-11-21 7:18 UTC (permalink / raw)
To: Bernd Edlinger
Cc: Alexander Viro, Alexey Dobriyan, Oleg Nesterov, Kees Cook,
Andy Lutomirski, Will Drewry, Christian Brauner, Andrew Morton,
Michal Hocko, Serge Hallyn, James Morris, Randy Dunlap,
Suren Baghdasaryan, Yafang Shao, Helge Deller, Adrian Reber,
Thomas Gleixner, Jens Axboe, Alexei Starovoitov, linux-fsdevel,
linux-kernel, linux-kselftest, linux-mm, linux-security-module,
tiozhang, Luis Chamberlain, Paulo Alcantara (SUSE),
Sergey Senozhatsky, Frederic Weisbecker, YueHaibing, Paul Moore,
Aleksa Sarai, Stefan Roesch, Chao Yu, xu xin, Jeff Layton,
Jan Kara, David Hildenbrand, Dave Chinner, Shuah Khan,
Elena Reshetova, David Windsor, Mateusz Guzik, Ard Biesheuvel,
Joel Fernandes (Google), Matthew Wilcox (Oracle),
Hans Liljestrand, Penglei Jiang, Lorenzo Stoakes, Adrian Ratiu,
Ingo Molnar, Peter Zijlstra (Intel),
Cyrill Gorcunov, Eric Dumazet
Bernd Edlinger <bernd.edlinger@hotmail.de> writes:
> Hi Eric,
>
> thanks for you valuable input on the topic.
>
> On 11/21/25 00:50, Eric W. Biederman wrote:
>> "Eric W. Biederman" <ebiederm@xmission.com> writes:
>>
>>> Instead of computing the new cred before we pass the point of no
>>> return compute the new cred just before we use it.
>>>
>>> This allows the removal of fs_struct->in_exec and cred_guard_mutex.
>>>
>>> I am not certain why we wanted to compute the cred for the new
>>> executable so early. Perhaps I missed something but I did not see any
>>> common errors being signaled. So I don't think we loose anything by
>>> computing the new cred later.
>>
>> I should add that the permission checks happen in open_exec,
>> everything that follows credential wise is just about representing in
>> struct cred the credentials the new executable will have.
>>
>> So I am really at a loss why we have had this complicated way of
>> computing of computed the credentials all of these years full of
>> time of check to time of use problems.
>>
>
> Well, I think I see a problem with your patch:
>
> When the security engine gets the LSM_UNSAFE_PTRACE flag, it might
> e.g. return -EPERM in bprm_creds_for_exec in the apparmor, selinux
> or the smack security engines at least. Previously that callback
> was called before the point of no return, and the return code should
> be returned as a return code the the caller of execve. But if we move
> that check after the point of no return, the caller will get killed
> due to the failed security check.
>
> Or did I miss something?
I think we definitely need to document this change in behavior. I would
call ending the exec with SIGSEGV vs -EPERM a quality of implementation
issue. The exec is failing one way or the other so I don't see it as a
correctness issue.
In the case of ptrace in general I think it is a bug if the mere act of
debugging a program changes its behavior. So which buggy behavior
should we prefer? SIGSEGV where it is totally clear that the behavior
has changed or -EPERM and ask the debugged program to handle it.
I lean towards SIGSEGV because then it is clear the code should not
handle it.
In the case of LSM_UNSAFE_NO_NEW_PRIVS I believe the preferred way to
handle unexpected things happening is to terminate the application.
In the case of LSM_UNSAFE_SHARE -EPERM might be better. I don't know
of any good uses of sys_clone(CLONE_FS ...) outside
of CLONE_THREAD.
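For readers unfamiliar with the CLONE_FS case, here is a small sketch of
what sharing an fs_struct outside a thread group looks like from user
space. The function name clone_fs_shares_cwd() is hypothetical; it assumes
glibc's clone() wrapper and that /tmp exists. It temporarily changes the
working directory of the calling process and restores it before returning.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <limits.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define CHILD_STACK_SIZE (64 * 1024)

/* Runs in the child: change the (shared) working directory to "/". */
static int child_chdir_root(void *arg)
{
	(void)arg;
	return chdir("/") == 0 ? 0 : 1;
}

/*
 * Returns 1 if a chdir() done by a CLONE_FS child is visible in the
 * parent, 0 if it is not, and -1 on any error.
 */
int clone_fs_shares_cwd(void)
{
	char saved[PATH_MAX], after[PATH_MAX];
	char *stack;
	int status, shared = -1;
	pid_t pid;

	if (!getcwd(saved, sizeof(saved)) || chdir("/tmp") != 0)
		return -1;

	stack = malloc(CHILD_STACK_SIZE);
	if (stack) {
		/* CLONE_FS without CLONE_THREAD: a separate process, same fs_struct. */
		pid = clone(child_chdir_root, stack + CHILD_STACK_SIZE,
			    CLONE_FS | SIGCHLD, NULL);
		if (pid > 0 && waitpid(pid, &status, 0) == pid &&
		    WIFEXITED(status) && WEXITSTATUS(status) == 0 &&
		    getcwd(after, sizeof(after)))
			shared = strcmp(after, "/") == 0;
		free(stack);
	}

	/* Put the caller back where it started. */
	if (chdir(saved) != 0)
		return -1;
	return shared;
}
```

While such a child is alive, an exec in either task sees an fs_struct with
users outside its own thread group, which is the situation the
LSM_UNSAFE_SHARE flag describes.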
Plus all of these things are only considerations if we are exec'ing a
program that transitions to a different set of credentials. Something
that happens but is quite rare itself.
In practice I don't expect there is anything that depends on the exact
behavior of what happens when exec'ing a suid executable to gain
privileges when ptraced. The closest I can imagine is upstart and
I think upstart ran as root when ptracing other programs so there is no
gaining of privilege and thus no reason for a security module to
complain.
Who knows, I could be wrong, and someone could actually care. Which is
why I think we should document it.
Eric
* Re: [RFC][PATCH] exec: Move cred computation under exec_update_lock
2025-11-21 7:18 ` Eric W. Biederman
@ 2025-11-21 9:35 ` Bernd Edlinger
2025-11-21 11:26 ` Bernd Edlinger
0 siblings, 1 reply; 68+ messages in thread
From: Bernd Edlinger @ 2025-11-21 9:35 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Alexander Viro, Alexey Dobriyan, Oleg Nesterov, Kees Cook,
Andy Lutomirski, Will Drewry, Christian Brauner, Andrew Morton,
Michal Hocko, Serge Hallyn, James Morris, Randy Dunlap,
Suren Baghdasaryan, Yafang Shao, Helge Deller, Adrian Reber,
Thomas Gleixner, Jens Axboe, Alexei Starovoitov, linux-fsdevel,
linux-kernel, linux-kselftest, linux-mm, linux-security-module,
tiozhang, Luis Chamberlain, Paulo Alcantara (SUSE),
Sergey Senozhatsky, Frederic Weisbecker, YueHaibing, Paul Moore,
Aleksa Sarai, Stefan Roesch, Chao Yu, xu xin, Jeff Layton,
Jan Kara, David Hildenbrand, Dave Chinner, Shuah Khan,
Elena Reshetova, David Windsor, Mateusz Guzik, Ard Biesheuvel,
Joel Fernandes (Google), Matthew Wilcox (Oracle),
Hans Liljestrand, Penglei Jiang, Lorenzo Stoakes, Adrian Ratiu,
Ingo Molnar, Peter Zijlstra (Intel),
Cyrill Gorcunov, Eric Dumazet
On 11/21/25 08:18, Eric W. Biederman wrote:
> Bernd Edlinger <bernd.edlinger@hotmail.de> writes:
>
>> Hi Eric,
>>
>> thanks for you valuable input on the topic.
>>
>> On 11/21/25 00:50, Eric W. Biederman wrote:
>>> "Eric W. Biederman" <ebiederm@xmission.com> writes:
>>>
>>>> Instead of computing the new cred before we pass the point of no
>>>> return compute the new cred just before we use it.
>>>>
>>>> This allows the removal of fs_struct->in_exec and cred_guard_mutex.
>>>>
>>>> I am not certain why we wanted to compute the cred for the new
>>>> executable so early. Perhaps I missed something but I did not see any
>>>> common errors being signaled. So I don't think we loose anything by
>>>> computing the new cred later.
>>>
>>> I should add that the permission checks happen in open_exec,
>>> everything that follows credential wise is just about representing in
>>> struct cred the credentials the new executable will have.
>>>
>>> So I am really at a loss why we have had this complicated way of
>>> computing of computed the credentials all of these years full of
>>> time of check to time of use problems.
>>>
>>
>> Well, I think I see a problem with your patch:
>>
>> When the security engine gets the LSM_UNSAFE_PTRACE flag, it might
>> e.g. return -EPERM in bprm_creds_for_exec in the apparmor, selinux
>> or the smack security engines at least. Previously that callback
>> was called before the point of no return, and the return code should
>> be returned as a return code the the caller of execve. But if we move
>> that check after the point of no return, the caller will get killed
>> due to the failed security check.
>>
>> Or did I miss something?
>
> I think we definitely need to document this change in behavior. I would
> call ending the exec with SIGSEGV vs -EPERM a quality of implementation
> issue. The exec is failing one way or the other so I don't see it as a
> correctness issue.
>
> In the case of ptrace in general I think it is a bug if the mere act of
> debugging a program changes it's behavior. So which buggy behavior
> should we prefer? SIGSEGV where it is totally clear that the behavior
> has changed or -EPERM and ask the debugged program to handle it.
> I lean towards SIGSEGV because then it is clear the code should not
> handle it.
>
> In the case of LSM_UNSAFE_NO_NEW_PRIVS I believe the preferred way to
> handle unexpected things happening is to terminate the application.
>
> In the case of LSM_UNSAFE_SHARE -EPERM might be better. I don't know
> of any good uses of any good uses of sys_clone(CLONE_FS ...) outside
> of CLONE_THREAD.
>
>
> Plus all of these things are only considerations if we are exec'ing a
> program that transitions to a different set of credentials. Something
> that happens but is quite rare itself.
>
> In practice I don't expect there is anything that depends on the exact
> behavior of what happens when exec'ing a suid executable to gain
> privileges when ptraced. The closes I can imagine is upstart and
> I think upstart ran as root when ptracing other programs so there is no
> gaining of privilege and thus no reason for a security module to
> complain.
>
> Who knows I could be wrong, and someone could actually care. Which is
> why I think we should document it.
>
Well, I don't know for sure, but the security engine could deny the execution
for any reason, not only because of being ptraced.
Maybe there can be a policy which forbids user X from executing e.g. any suid programs.
Bernd.
* Re: [RFC][PATCH] exec: Move cred computation under exec_update_lock
2025-11-21 9:35 ` Bernd Edlinger
@ 2025-11-21 11:26 ` Bernd Edlinger
2025-11-21 19:19 ` Eric W. Biederman
0 siblings, 1 reply; 68+ messages in thread
From: Bernd Edlinger @ 2025-11-21 11:26 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Alexander Viro, Alexey Dobriyan, Oleg Nesterov, Kees Cook,
Andy Lutomirski, Will Drewry, Christian Brauner, Andrew Morton,
Michal Hocko, Serge Hallyn, James Morris, Randy Dunlap,
Suren Baghdasaryan, Yafang Shao, Helge Deller, Adrian Reber,
Thomas Gleixner, Jens Axboe, Alexei Starovoitov, linux-fsdevel,
linux-kernel, linux-kselftest, linux-mm, linux-security-module,
tiozhang, Luis Chamberlain, Paulo Alcantara (SUSE),
Sergey Senozhatsky, Frederic Weisbecker, YueHaibing, Paul Moore,
Aleksa Sarai, Stefan Roesch, Chao Yu, xu xin, Jeff Layton,
Jan Kara, David Hildenbrand, Dave Chinner, Shuah Khan,
Elena Reshetova, David Windsor, Mateusz Guzik, Ard Biesheuvel,
Joel Fernandes (Google), Matthew Wilcox (Oracle),
Hans Liljestrand, Penglei Jiang, Lorenzo Stoakes, Adrian Ratiu,
Ingo Molnar, Peter Zijlstra (Intel),
Cyrill Gorcunov, Eric Dumazet
On 11/21/25 10:35, Bernd Edlinger wrote:
> On 11/21/25 08:18, Eric W. Biederman wrote:
>> Bernd Edlinger <bernd.edlinger@hotmail.de> writes:
>>
>>> Hi Eric,
>>>
>>> thanks for you valuable input on the topic.
>>>
>>> On 11/21/25 00:50, Eric W. Biederman wrote:
>>>> "Eric W. Biederman" <ebiederm@xmission.com> writes:
>>>>
>>>>> Instead of computing the new cred before we pass the point of no
>>>>> return compute the new cred just before we use it.
>>>>>
>>>>> This allows the removal of fs_struct->in_exec and cred_guard_mutex.
>>>>>
>>>>> I am not certain why we wanted to compute the cred for the new
>>>>> executable so early. Perhaps I missed something but I did not see any
>>>>> common errors being signaled. So I don't think we loose anything by
>>>>> computing the new cred later.
>>>>
>>>> I should add that the permission checks happen in open_exec,
>>>> everything that follows credential wise is just about representing in
>>>> struct cred the credentials the new executable will have.
>>>>
>>>> So I am really at a loss why we have had this complicated way of
>>>> computing of computed the credentials all of these years full of
>>>> time of check to time of use problems.
>>>>
>>>
>>> Well, I think I see a problem with your patch:
>>>
>>> When the security engine gets the LSM_UNSAFE_PTRACE flag, it might
>>> e.g. return -EPERM in bprm_creds_for_exec in the apparmor, selinux
>>> or the smack security engines at least. Previously that callback
>>> was called before the point of no return, and the return code should
>>> be returned as a return code the the caller of execve. But if we move
>>> that check after the point of no return, the caller will get killed
>>> due to the failed security check.
>>>
>>> Or did I miss something?
>>
>> I think we definitely need to document this change in behavior. I would
>> call ending the exec with SIGSEGV vs -EPERM a quality of implementation
>> issue. The exec is failing one way or the other so I don't see it as a
>> correctness issue.
>>
>> In the case of ptrace in general I think it is a bug if the mere act of
>> debugging a program changes it's behavior. So which buggy behavior
>> should we prefer? SIGSEGV where it is totally clear that the behavior
>> has changed or -EPERM and ask the debugged program to handle it.
>> I lean towards SIGSEGV because then it is clear the code should not
>> handle it.
>>
>> In the case of LSM_UNSAFE_NO_NEW_PRIVS I believe the preferred way to
>> handle unexpected things happening is to terminate the application.
>>
>> In the case of LSM_UNSAFE_SHARE -EPERM might be better. I don't know
>> of any good uses of any good uses of sys_clone(CLONE_FS ...) outside
>> of CLONE_THREAD.
>>
>>
>> Plus all of these things are only considerations if we are exec'ing a
>> program that transitions to a different set of credentials. Something
>> that happens but is quite rare itself.
>>
>> In practice I don't expect there is anything that depends on the exact
>> behavior of what happens when exec'ing a suid executable to gain
>> privileges when ptraced. The closes I can imagine is upstart and
>> I think upstart ran as root when ptracing other programs so there is no
>> gaining of privilege and thus no reason for a security module to
>> complain.
>>
>> Who knows I could be wrong, and someone could actually care. Which is
>> why I think we should document it.
>>
>
>
> Well, I dont know for sure, but the security engine could deny the execution
> for any reason, not only because of being ptraced.
> Maybe there can be a policy which denies user X to execute e.g. any suid programs.
>
>
> Bernd.
>
Hmm, funny..
I installed this patch on top of
commit fd95357fd8c6778ac7dea6c57a19b8b182b6e91f (HEAD -> master, origin/master, origin/HEAD)
Merge: c966813ea120 7b6216baae75
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date: Thu Nov 20 11:04:37 2025 -0800
but it does panic when I try to boot:
[ 0.870539] TERM=linux
[ 0.870573] Starting init: /bin/sh exists but couldn't execute it (error -14)
[ 0.870575] Kernel panic - not syncing: No working init found. Try passing init= option to kernel. See Linux Documentation/admin-guide/init.rst for guidance.
[ 0.870577] CPU: UID: 0 PID: 1 Comm: sh Not tainted 6.18.0-rc6+ #1 PREEMPT(voluntary)
[ 0.870579] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[ 0.870580] Call Trace:
[ 0.870590] <TASK>
[ 0.870592] vpanic+0x36d/0x380
[ 0.870607] ? __pfx_kernel_init+0x10/0x10
[ 0.870615] panic+0x5b/0x60
[ 0.870617] kernel_init+0x17d/0x1c0
[ 0.870623] ret_from_fork+0x124/0x150
[ 0.870625] ? __pfx_kernel_init+0x10/0x10
[ 0.870627] ret_from_fork_asm+0x1a/0x30
[ 0.870632] </TASK>
[ 0.870663] Kernel Offset: 0x3a800000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 0.880034] ---[ end Kernel panic - not syncing: No working init found. Try passing init= option to kernel. See Linux Documentation/admin-guide/init.rst for guidance. ]---
Is that a known problem?
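As a reading aid for the log above: on Linux, error -14 is -EFAULT ("Bad
address"). A tiny sketch for decoding such kernel-style return codes in
user space, where kernel_err_text() is a hypothetical helper name and the
exact message text is whatever the C library's strerror() provides:

```c
#include <errno.h>
#include <string.h>

/*
 * Map a negative kernel-style return code (e.g. -14) to the C library's
 * description of the corresponding errno value.
 */
const char *kernel_err_text(int ret)
{
	return strerror(-ret);
}
```

So "couldn't execute it (error -14)" means the exec of /bin/sh failed with
EFAULT rather than with a permission error.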
Bernd.
* Re: [RFC][PATCH] exec: Move cred computation under exec_update_lock
2025-11-21 11:26 ` Bernd Edlinger
@ 2025-11-21 19:19 ` Eric W. Biederman
2025-11-21 23:06 ` Ryan Lee
0 siblings, 1 reply; 68+ messages in thread
From: Eric W. Biederman @ 2025-11-21 19:19 UTC (permalink / raw)
To: Bernd Edlinger
Cc: Alexander Viro, Alexey Dobriyan, Oleg Nesterov, Kees Cook,
Andy Lutomirski, Will Drewry, Christian Brauner, Andrew Morton,
Michal Hocko, Serge Hallyn, James Morris, Randy Dunlap,
Suren Baghdasaryan, Yafang Shao, Helge Deller, Adrian Reber,
Thomas Gleixner, Jens Axboe, Alexei Starovoitov, linux-fsdevel,
linux-kernel, linux-kselftest, linux-mm, linux-security-module,
tiozhang, Luis Chamberlain, Paulo Alcantara (SUSE),
Sergey Senozhatsky, Frederic Weisbecker, YueHaibing, Paul Moore,
Aleksa Sarai, Stefan Roesch, Chao Yu, xu xin, Jeff Layton,
Jan Kara, David Hildenbrand, Dave Chinner, Shuah Khan,
Elena Reshetova, David Windsor, Mateusz Guzik, Ard Biesheuvel,
Joel Fernandes (Google), Matthew Wilcox (Oracle),
Hans Liljestrand, Penglei Jiang, Lorenzo Stoakes, Adrian Ratiu,
Ingo Molnar, Peter Zijlstra (Intel),
Cyrill Gorcunov, Eric Dumazet
Bernd Edlinger <bernd.edlinger@hotmail.de> writes:
> On 11/21/25 10:35, Bernd Edlinger wrote:
>> On 11/21/25 08:18, Eric W. Biederman wrote:
>>> Bernd Edlinger <bernd.edlinger@hotmail.de> writes:
>>>
>>>> Hi Eric,
>>>>
>>>> thanks for you valuable input on the topic.
>>>>
>>>> On 11/21/25 00:50, Eric W. Biederman wrote:
>>>>> "Eric W. Biederman" <ebiederm@xmission.com> writes:
>>>>>
>>>>>> Instead of computing the new cred before we pass the point of no
>>>>>> return compute the new cred just before we use it.
>>>>>>
>>>>>> This allows the removal of fs_struct->in_exec and cred_guard_mutex.
>>>>>>
>>>>>> I am not certain why we wanted to compute the cred for the new
>>>>>> executable so early. Perhaps I missed something but I did not see any
>>>>>> common errors being signaled. So I don't think we loose anything by
>>>>>> computing the new cred later.
>>>>>
>>>>> I should add that the permission checks happen in open_exec,
>>>>> everything that follows credential wise is just about representing in
>>>>> struct cred the credentials the new executable will have.
>>>>>
>>>>> So I am really at a loss why we have had this complicated way of
>>>>> computing of computed the credentials all of these years full of
>>>>> time of check to time of use problems.
>>>>>
>>>>
>>>> Well, I think I see a problem with your patch:
>>>>
>>>> When the security engine gets the LSM_UNSAFE_PTRACE flag, it might
>>>> e.g. return -EPERM in bprm_creds_for_exec in the apparmor, selinux
>>>> or the smack security engines at least. Previously that callback
>>>> was called before the point of no return, and the return code should
>>>> be returned as a return code the the caller of execve. But if we move
>>>> that check after the point of no return, the caller will get killed
>>>> due to the failed security check.
>>>>
>>>> Or did I miss something?
>>>
>>> I think we definitely need to document this change in behavior. I would
>>> call ending the exec with SIGSEGV vs -EPERM a quality of implementation
>>> issue. The exec is failing one way or the other so I don't see it as a
>>> correctness issue.
>>>
>>> In the case of ptrace in general I think it is a bug if the mere act of
>>> debugging a program changes it's behavior. So which buggy behavior
>>> should we prefer? SIGSEGV where it is totally clear that the behavior
>>> has changed or -EPERM and ask the debugged program to handle it.
>>> I lean towards SIGSEGV because then it is clear the code should not
>>> handle it.
>>>
>>> In the case of LSM_UNSAFE_NO_NEW_PRIVS I believe the preferred way to
>>> handle unexpected things happening is to terminate the application.
>>>
>>> In the case of LSM_UNSAFE_SHARE -EPERM might be better. I don't know
>>> of any good uses of any good uses of sys_clone(CLONE_FS ...) outside
>>> of CLONE_THREAD.
>>>
>>>
>>> Plus all of these things are only considerations if we are exec'ing a
>>> program that transitions to a different set of credentials. Something
>>> that happens but is quite rare itself.
>>>
>>> In practice I don't expect there is anything that depends on the exact
>>> behavior of what happens when exec'ing a suid executable to gain
>>> privileges when ptraced. The closes I can imagine is upstart and
>>> I think upstart ran as root when ptracing other programs so there is no
>>> gaining of privilege and thus no reason for a security module to
>>> complain.
>>>
>>> Who knows I could be wrong, and someone could actually care. Which is
>>> why I think we should document it.
>>>
>>
>>
>> Well, I dont know for sure, but the security engine could deny the execution
>> for any reason, not only because of being ptraced.
>> Maybe there can be a policy which denies user X to execute e.g. any suid programs.
>>
>>
>> Bernd.
>>
>
> Hmm, funny..
>
> I installed this patch on top of
>
> commit fd95357fd8c6778ac7dea6c57a19b8b182b6e91f (HEAD -> master, origin/master, origin/HEAD)
> Merge: c966813ea120 7b6216baae75
> Author: Linus Torvalds <torvalds@linux-foundation.org>
> Date: Thu Nov 20 11:04:37 2025 -0800
>
> but it does panic when I try to boot:
>
> [ 0.870539] TERM=linux
> [ 0.870573] Starting init: /bin/sh exists but couldn't execute it (error -14)
> [ 0.870575] Kernel panic - not syncing: No working init found. Try passing init= option to kernel. See Linux Documentation/admin-guide/init.rst for guidance.
> [ 0.870577] CPU: UID: 0 PID: 1 Comm: sh Not tainted 6.18.0-rc6+ #1 PREEMPT(voluntary)
> [ 0.870579] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
> [ 0.870580] Call Trace:
> [ 0.870590] <TASK>
> [ 0.870592] vpanic+0x36d/0x380
> [ 0.870607] ? __pfx_kernel_init+0x10/0x10
> [ 0.870615] panic+0x5b/0x60
> [ 0.870617] kernel_init+0x17d/0x1c0
> [ 0.870623] ret_from_fork+0x124/0x150
> [ 0.870625] ? __pfx_kernel_init+0x10/0x10
> [ 0.870627] ret_from_fork_asm+0x1a/0x30
> [ 0.870632] </TASK>
> [ 0.870663] Kernel Offset: 0x3a800000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> [ 0.880034] ---[ end Kernel panic - not syncing: No working init found. Try passing init= option to kernel. See Linux Documentation/admin-guide/init.rst for guidance. ]---
>
>
> Is that a known problem?
Nope. It looks like the code needs a little bit of bug fixing and testing.
I will see about taking a look.
Eric
* Re: [RFC][PATCH] exec: Move cred computation under exec_update_lock
2025-11-21 19:19 ` Eric W. Biederman
@ 2025-11-21 23:06 ` Ryan Lee
0 siblings, 0 replies; 68+ messages in thread
From: Ryan Lee @ 2025-11-21 23:06 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Bernd Edlinger, Alexander Viro, Alexey Dobriyan, Oleg Nesterov,
Kees Cook, Andy Lutomirski, Will Drewry, Christian Brauner,
Andrew Morton, Michal Hocko, Serge Hallyn, James Morris,
Randy Dunlap, Suren Baghdasaryan, Yafang Shao, Helge Deller,
Adrian Reber, Thomas Gleixner, Jens Axboe, Alexei Starovoitov,
linux-fsdevel, linux-kernel, linux-kselftest, linux-mm,
linux-security-module, tiozhang, Luis Chamberlain,
Paulo Alcantara (SUSE),
Sergey Senozhatsky, Frederic Weisbecker, YueHaibing, Paul Moore,
Aleksa Sarai, Stefan Roesch, Chao Yu, xu xin, Jeff Layton,
Jan Kara, David Hildenbrand, Dave Chinner, Shuah Khan,
Elena Reshetova, David Windsor, Mateusz Guzik, Ard Biesheuvel,
Joel Fernandes (Google), Matthew Wilcox (Oracle),
Hans Liljestrand, Penglei Jiang, Lorenzo Stoakes, Adrian Ratiu,
Ingo Molnar, Peter Zijlstra (Intel),
Cyrill Gorcunov, Eric Dumazet, apparmor
On Fri, Nov 21, 2025 at 11:20 AM Eric W. Biederman
<ebiederm@xmission.com> wrote:
>
> Bernd Edlinger <bernd.edlinger@hotmail.de> writes:
>
> > On 11/21/25 10:35, Bernd Edlinger wrote:
> >> On 11/21/25 08:18, Eric W. Biederman wrote:
> >>> Bernd Edlinger <bernd.edlinger@hotmail.de> writes:
> >>>
> >>>> Hi Eric,
> >>>>
> >>>> thanks for you valuable input on the topic.
> >>>>
> >>>> On 11/21/25 00:50, Eric W. Biederman wrote:
> >>>>> "Eric W. Biederman" <ebiederm@xmission.com> writes:
> >>>>>
> >>>>>> Instead of computing the new cred before we pass the point of no
> >>>>>> return compute the new cred just before we use it.
> >>>>>>
> >>>>>> This allows the removal of fs_struct->in_exec and cred_guard_mutex.
> >>>>>>
> >>>>>> I am not certain why we wanted to compute the cred for the new
> >>>>>> executable so early. Perhaps I missed something but I did not see any
> >>>>>> common errors being signaled. So I don't think we loose anything by
> >>>>>> computing the new cred later.
> >>>>>
> >>>>> I should add that the permission checks happen in open_exec,
> >>>>> everything that follows credential wise is just about representing in
> >>>>> struct cred the credentials the new executable will have.
> >>>>>
> >>>>> So I am really at a loss why we have had this complicated way of
> >>>>> computing of computed the credentials all of these years full of
> >>>>> time of check to time of use problems.
> >>>>>
> >>>>
> >>>> Well, I think I see a problem with your patch:
> >>>>
> >>>> When the security engine gets the LSM_UNSAFE_PTRACE flag, it might
> >>>> e.g. return -EPERM in bprm_creds_for_exec in the apparmor, selinux
> >>>> or the smack security engines at least. Previously that callback
> >>>> was called before the point of no return, and the return code should
> >>>> be returned as a return code the the caller of execve. But if we move
> >>>> that check after the point of no return, the caller will get killed
> >>>> due to the failed security check.
> >>>>
> >>>> Or did I miss something?
> >>>
> >>> I think we definitely need to document this change in behavior. I would
> >>> call ending the exec with SIGSEGV vs -EPERM a quality of implementation
> >>> issue. The exec is failing one way or the other so I don't see it as a
> >>> correctness issue.
> >>>
> >>> In the case of ptrace in general I think it is a bug if the mere act of
> >>> debugging a program changes it's behavior. So which buggy behavior
> >>> should we prefer? SIGSEGV where it is totally clear that the behavior
> >>> has changed or -EPERM and ask the debugged program to handle it.
> >>> I lean towards SIGSEGV because then it is clear the code should not
> >>> handle it.
> >>>
> >>> In the case of LSM_UNSAFE_NO_NEW_PRIVS I believe the preferred way to
> >>> handle unexpected things happening is to terminate the application.
> >>>
> >>> In the case of LSM_UNSAFE_SHARE -EPERM might be better. I don't know
> >>> of any good uses of any good uses of sys_clone(CLONE_FS ...) outside
> >>> of CLONE_THREAD.
> >>>
> >>>
> >>> Plus all of these things are only considerations if we are exec'ing a
> >>> program that transitions to a different set of credentials. Something
> >>> that happens but is quite rare itself.
AppArmor's exec rules rely heavily on transitioning to different creds
on exec. For example, an AppArmor policy like
profile example_1 /usr/bin/example_1 {
/usr/bin/example_2 Px -> example_2_profile,
/usr/bin/example_3 Px,
}
will allow binary example_1 to execute binaries example_2 and
example_3, launching those processes under a different confinement
(example_2_profile and a profile that attaches to /usr/bin/example_3,
respectively). We will need to look into how much this patch (or a
corresponding change in behavior) would affect our use case, but
confinement transitions (where the confinement information is stored
as an LSM blob on the cred struct) are extremely common in a system
that uses AppArmor as an LSM.
> >>>
> >>> In practice I don't expect there is anything that depends on the exact
> >>> behavior of what happens when exec'ing a suid executable to gain
> >>> privileges when ptraced. The closes I can imagine is upstart and
> >>> I think upstart ran as root when ptracing other programs so there is no
> >>> gaining of privilege and thus no reason for a security module to
> >>> complain.
> >>>
> >>> Who knows I could be wrong, and someone could actually care. Which is
> >>> why I think we should document it.
> >>>
> >>
> >>
> >> Well, I dont know for sure, but the security engine could deny the execution
> >> for any reason, not only because of being ptraced.
> >> Maybe there can be a policy which denies user X to execute e.g. any suid programs.
> >>
> >>
> >> Bernd.
> >>
> >
> > Hmm, funny..
> >
> > I installed this patch on top of
> >
> > commit fd95357fd8c6778ac7dea6c57a19b8b182b6e91f (HEAD -> master, origin/master, origin/HEAD)
> > Merge: c966813ea120 7b6216baae75
> > Author: Linus Torvalds <torvalds@linux-foundation.org>
> > Date: Thu Nov 20 11:04:37 2025 -0800
> >
> > but it does panic when I try to boot:
> >
> > [ 0.870539] TERM=linux
> > [ 0.870573] Starting init: /bin/sh exists but couldn't execute it (error -14)
> > [ 0.870575] Kernel panic - not syncing: No working init found. Try passing init= option to kernel. See Linux Documentation/admin-guide/init.rst for guidance.
> > [ 0.870577] CPU: UID: 0 PID: 1 Comm: sh Not tainted 6.18.0-rc6+ #1 PREEMPT(voluntary)
> > [ 0.870579] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
> > [ 0.870580] Call Trace:
> > [ 0.870590] <TASK>
> > [ 0.870592] vpanic+0x36d/0x380
> > [ 0.870607] ? __pfx_kernel_init+0x10/0x10
> > [ 0.870615] panic+0x5b/0x60
> > [ 0.870617] kernel_init+0x17d/0x1c0
> > [ 0.870623] ret_from_fork+0x124/0x150
> > [ 0.870625] ? __pfx_kernel_init+0x10/0x10
> > [ 0.870627] ret_from_fork_asm+0x1a/0x30
> > [ 0.870632] </TASK>
> > [ 0.870663] Kernel Offset: 0x3a800000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> > [ 0.880034] ---[ end Kernel panic - not syncing: No working init found. Try passing init= option to kernel. See Linux Documentation/admin-guide/init.rst for guidance. ]---
> >
> >
> > Is that a known problem?
>
> Nope. It looks like the code needs a little bit bug fixing testing.
>
> I will take see about taking a look.
>
> Eric
>
I've also CC'ed the AppArmor mailing list on this patch to facilitate
discussion if, upon further investigation, this patch would require
changes or cause other problems on the AppArmor side.
* Re: [RFC][PATCH] exec: Move cred computation under exec_update_lock
2025-11-20 20:57 ` [RFC][PATCH] exec: Move cred computation under exec_update_lock Eric W. Biederman
2025-11-20 23:50 ` Eric W. Biederman
@ 2025-11-23 18:52 ` Oleg Nesterov
2025-11-23 23:22 ` Eric W. Biederman
2025-11-25 11:55 ` Roberto Sassu
2 siblings, 1 reply; 68+ messages in thread
From: Oleg Nesterov @ 2025-11-23 18:52 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Bernd Edlinger, Alexander Viro, Alexey Dobriyan, Kees Cook,
Andy Lutomirski, Will Drewry, Christian Brauner, Andrew Morton,
Michal Hocko, Serge Hallyn, James Morris, Randy Dunlap,
Suren Baghdasaryan, Yafang Shao, Helge Deller, Adrian Reber,
Thomas Gleixner, Jens Axboe, Alexei Starovoitov, linux-fsdevel,
linux-kernel, linux-kselftest, linux-mm, linux-security-module,
tiozhang, Luis Chamberlain, Paulo Alcantara (SUSE),
Sergey Senozhatsky, Frederic Weisbecker, YueHaibing, Paul Moore,
Aleksa Sarai, Stefan Roesch, Chao Yu, xu xin, Jeff Layton,
Jan Kara, David Hildenbrand, Dave Chinner, Shuah Khan,
Elena Reshetova, David Windsor, Mateusz Guzik, Ard Biesheuvel,
Joel Fernandes (Google), Matthew Wilcox (Oracle),
Hans Liljestrand, Penglei Jiang, Lorenzo Stoakes, Adrian Ratiu,
Ingo Molnar, Peter Zijlstra (Intel),
Cyrill Gorcunov, Eric Dumazet
Eric,
sorry for delay, I am on PTO, didn't read emails this week...
On 11/20, Eric W. Biederman wrote:
>
> Instead of computing the new cred before we pass the point of no
> return compute the new cred just before we use it.
>
> This allows the removal of fs_struct->in_exec and cred_guard_mutex.
>
> I am not certain why we wanted to compute the cred for the new
> executable so early. Perhaps I missed something but I did not see any
> common errors being signaled. So I don't think we lose anything by
> computing the new cred later.
>
> We gain a lot.
Yes. I LIKE your approach after a quick glance. And I swear, I thought about
it too ;)
But is it correct? I don't know. I'll try to actually read your patch next
week (I am on PTO until the end of November), but I am not sure I can
provide valuable feedback.
One "obvious" problem is that, after this patch, the execing process can crash
in a case when currently exec() returns an error...
Oleg.
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [RFC][PATCH] exec: Move cred computation under exec_update_lock
2025-11-23 18:52 ` Oleg Nesterov
@ 2025-11-23 23:22 ` Eric W. Biederman
2025-11-25 16:19 ` Bernd Edlinger
0 siblings, 1 reply; 68+ messages in thread
From: Eric W. Biederman @ 2025-11-23 23:22 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Bernd Edlinger, Alexander Viro, Alexey Dobriyan, Kees Cook,
Andy Lutomirski, Will Drewry, Christian Brauner, Andrew Morton,
Michal Hocko, Serge Hallyn, James Morris, Randy Dunlap,
Suren Baghdasaryan, Yafang Shao, Helge Deller, Adrian Reber,
Thomas Gleixner, Jens Axboe, Alexei Starovoitov, linux-fsdevel,
linux-kernel, linux-kselftest, linux-mm, linux-security-module,
tiozhang, Luis Chamberlain, Paulo Alcantara (SUSE),
Sergey Senozhatsky, Frederic Weisbecker, YueHaibing, Paul Moore,
Aleksa Sarai, Stefan Roesch, Chao Yu, xu xin, Jeff Layton,
Jan Kara, David Hildenbrand, Dave Chinner, Shuah Khan,
Elena Reshetova, David Windsor, Mateusz Guzik, Ard Biesheuvel,
Joel Fernandes (Google), Matthew Wilcox (Oracle),
Hans Liljestrand, Penglei Jiang, Lorenzo Stoakes, Adrian Ratiu,
Ingo Molnar, Peter Zijlstra (Intel),
Cyrill Gorcunov, Eric Dumazet
Oleg Nesterov <oleg@redhat.com> writes:
> Eric,
>
> sorry for delay, I am on PTO, didn't read emails this week...
>
> On 11/20, Eric W. Biederman wrote:
>>
>> Instead of computing the new cred before we pass the point of no
>> return compute the new cred just before we use it.
>>
>> This allows the removal of fs_struct->in_exec and cred_guard_mutex.
>>
>> I am not certain why we wanted to compute the cred for the new
>> executable so early. Perhaps I missed something but I did not see any
>> common errors being signaled. So I don't think we lose anything by
>> computing the new cred later.
>>
>> We gain a lot.
>
> Yes. I LIKE your approach after a quick glance. And I swear, I thought about
> it too ;)
>
> But is it correct? I don't know. I'll try to actually read your patch next
> week (I am on PTO until the end of November), but I am not sure I can
> provide valuable feedback.
>
> One "obvious" problem is that, after this patch, the execing process can crash
> in a case when currently exec() returns an error...
Yes.
I have been testing and looking at it, and I have found a few issues,
and I am trying to see if I can resolve them.
The good news is that with the advent of AT_EXECVE_CHECK we have a
really clear API boundary between errors that must be diagnosed
and errors of happenstance like running out of memory.
The bad news is that the implementation of AT_EXECVE_CHECK seems to have been
rather hackish especially with respect to security_bprm_creds_for_exec.
What I am hoping for is to get the 3 causes of errors of bprm->unsafe
( LSM_UNSAFE_SHARE, LSM_UNSAFE_PTRACE, and LSM_UNSAFE_NO_NEW_PRIVS )
handled cleanly outside of the cred_guard_mutex, and simply
retested when it is time to build the credentials of the new process.
In practice that should get the same failure modes as we have now
but it would get SIGSEGV in rare instances where things changed
during exec. That feels acceptable.
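The "check early, retest late" idea can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not kernel code: struct task_model, the MODEL_UNSAFE_* flags, model_exec() and the race callback are all hypothetical names invented for the illustration, and the policy that any unsafe flag is disallowed is a simplification. It shows that a condition already present at the early check yields an ordinary error return, while one that appears in between is only caught at the retest, where the only option left is a fatal signal.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the three bprm->unsafe causes. */
enum {
    MODEL_UNSAFE_SHARE        = 1,
    MODEL_UNSAFE_PTRACE       = 2,
    MODEL_UNSAFE_NO_NEW_PRIVS = 4,
};

/* Hypothetical, heavily simplified view of the execing task. */
struct task_model {
    int  fs_users;      /* number of users of the fs_struct */
    bool ptraced;       /* a tracer is attached */
    bool no_new_privs;  /* no_new_privs is set */
};

enum exec_outcome {
    EXEC_OK,              /* exec completed */
    EXEC_ERROR_RETURNED,  /* failed before the point of no return */
    EXEC_FATAL_SIGNAL,    /* failed after it (SIGSEGV in the real case) */
};

/* Compute the unsafe flags from the current state of the task. */
static unsigned compute_unsafe(const struct task_model *t)
{
    unsigned unsafe = 0;

    if (t->fs_users > 1)
        unsafe |= MODEL_UNSAFE_SHARE;
    if (t->ptraced)
        unsafe |= MODEL_UNSAFE_PTRACE;
    if (t->no_new_privs)
        unsafe |= MODEL_UNSAFE_NO_NEW_PRIVS;
    return unsafe;
}

/* Callback that models something changing while exec is in progress. */
typedef void (*race_fn)(struct task_model *t);

static enum exec_outcome model_exec(struct task_model *t, unsigned allowed,
                                    race_fn race)
{
    /* Early check: a failure here is still an ordinary error return. */
    if (compute_unsafe(t) & ~allowed)
        return EXEC_ERROR_RETURNED;

    /* Point of no return; the state may change while we wait. */
    if (race)
        race(t);

    /* Retest when the credentials are built: too late for an error code. */
    if (compute_unsafe(t) & ~allowed)
        return EXEC_FATAL_SIGNAL;

    return EXEC_OK;
}

/* Example race: a tracer attaches in the middle of the exec. */
static void tracer_attaches(struct task_model *t)
{
    t->ptraced = true;
}
```

In this model the common case (nothing changes) behaves exactly as if only the early check existed; the fatal outcome needs a state change inside the window between the two checks.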
I thought of one other approach that might be enough to put the issue to
bed if cleaning up exec is too much work. We could have ptrace_attach
use a trylock and fail when it doesn't succeed. That would solve the
worst of the symptoms.
I think this would be a complete patch:
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 75a84efad40f..5dd2144e5789 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -444,7 +444,7 @@ static int ptrace_attach(struct task_struct *task, long request,
* SUID, SGID and LSM creds get determined differently
* under ptrace.
*/
- scoped_cond_guard (mutex_intr, return -ERESTARTNOINTR,
+ scoped_cond_guard (mutex_try, return -EAGAIN,
&task->signal->cred_guard_mutex) {
scoped_guard (task_lock, task) {
--
2.41.0
Eric
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [RFC][PATCH] exec: Move cred computation under exec_update_lock
2025-11-23 23:22 ` Eric W. Biederman
@ 2025-11-25 16:19 ` Bernd Edlinger
0 siblings, 0 replies; 68+ messages in thread
From: Bernd Edlinger @ 2025-11-25 16:19 UTC (permalink / raw)
To: Eric W. Biederman, Oleg Nesterov
Cc: Alexander Viro, Alexey Dobriyan, Kees Cook, Andy Lutomirski,
Will Drewry, Christian Brauner, Andrew Morton, Michal Hocko,
Serge Hallyn, James Morris, Randy Dunlap, Suren Baghdasaryan,
Yafang Shao, Helge Deller, Adrian Reber, Thomas Gleixner,
Jens Axboe, Alexei Starovoitov, linux-fsdevel, linux-kernel,
linux-kselftest, linux-mm, linux-security-module, tiozhang,
Luis Chamberlain, Paulo Alcantara (SUSE),
Sergey Senozhatsky, Frederic Weisbecker, YueHaibing, Paul Moore,
Aleksa Sarai, Stefan Roesch, Chao Yu, xu xin, Jeff Layton,
Jan Kara, David Hildenbrand, Dave Chinner, Shuah Khan,
Elena Reshetova, David Windsor, Mateusz Guzik, Ard Biesheuvel,
Joel Fernandes (Google), Matthew Wilcox (Oracle),
Hans Liljestrand, Penglei Jiang, Lorenzo Stoakes, Adrian Ratiu,
Ingo Molnar, Peter Zijlstra (Intel),
Cyrill Gorcunov, Eric Dumazet
On 11/24/25 00:22, Eric W. Biederman wrote:
> Oleg Nesterov <oleg@redhat.com> writes:
>
>> Eric,
>>
>> sorry for delay, I am on PTO, didn't read emails this week...
>>
>> On 11/20, Eric W. Biederman wrote:
>>>
>>> Instead of computing the new cred before we pass the point of no
>>> return compute the new cred just before we use it.
>>>
>>> This allows the removal of fs_struct->in_exec and cred_guard_mutex.
>>>
>>> I am not certain why we wanted to compute the cred for the new
>>> executable so early. Perhaps I missed something but I did not see any
>>> common errors being signaled. So I don't think we lose anything by
>>> computing the new cred later.
>>>
>>> We gain a lot.
>>
>> Yes. I LIKE your approach after a quick glance. And I swear, I thought about
>> it too ;)
>>
>> But is it correct? I don't know. I'll try to actually read your patch next
>> week (I am on PTO until the end of November), but I am not sure I can
>> provide valuable feedback.
>>
>> One "obvious" problem is that, after this patch, the execing process can crash
>> in a case when currently exec() returns an error...
>
> Yes.
>
> I have been testing and looking at it, and I have found a few issues,
> and I am trying to see if I can resolve them.
>
> The good news is that with the advent of AT_EXECVE_CHECK we have a
> really clear API boundary between errors that must be diagnosed
> and errors of happenstance like running out of memory.
>
> The bad news is that the implementation of AT_EXECVE_CHECK seems to have been
> rather hackish especially with respect to security_bprm_creds_for_exec.
>
> What I am hoping for is to get the 3 causes of errors of bprm->unsafe
> ( LSM_UNSAFE_SHARE, LSM_UNSAFE_PTRACE, and LSM_UNSAFE_NO_NEW_PRIVS )
> handled cleanly outside of the cred_guard_mutex, and simply
> retested when it is time to build the credentials of the new process.
>
> In practice that should get the same failure modes as we have now
> but it would get SIGSEGV in rare instances where things changed
> during exec. That feels acceptable.
>
>
>
> I thought of one other approach that might be enough to put the issue to
> bed if cleaning up exec is too much work. We could have ptrace_attach
> use a trylock and fail when it doesn't succeed. That would solve the
> worst of the symptoms.
>
> I think this would be a complete patch:
>
> diff --git a/kernel/ptrace.c b/kernel/ptrace.c
> index 75a84efad40f..5dd2144e5789 100644
> --- a/kernel/ptrace.c
> +++ b/kernel/ptrace.c
> @@ -444,7 +444,7 @@ static int ptrace_attach(struct task_struct *task, long request,
> * SUID, SGID and LSM creds get determined differently
> * under ptrace.
> */
> - scoped_cond_guard (mutex_intr, return -ERESTARTNOINTR,
> + scoped_cond_guard (mutex_try, return -EAGAIN,
> &task->signal->cred_guard_mutex) {
>
> scoped_guard (task_lock, task) {
This is very similar to my initial attempt at fixing the problem. As you
can see, the test expectation of the currently failing test in vmattach.c
is that ptrace(PTRACE_ATTACH, pid, 0L, 0L) returns -1 with errno = EAGAIN.
The disadvantage of that approach was that it is a user-visible API change,
and also that the debugger does not know when to retry the PTRACE_ATTACH;
in the worst case it will go into an endless loop, not knowing that a waitpid
and/or PTRACE_CONT is necessary to unblock the traced process.
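To make the retry problem concrete, here is a minimal sketch of what a debugger would have to do if PTRACE_ATTACH started returning EAGAIN. The names attach_fn, attach_with_retry() and fake_attach() are hypothetical and exist only for this illustration; a real callback would wrap ptrace(PTRACE_ATTACH, pid, 0L, 0L). The loop is bounded precisely because the caller cannot tell from EAGAIN alone when another attempt can succeed.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical callback: performs one attach attempt and returns 0 on
 * success or -1 with errno set, like ptrace(PTRACE_ATTACH, ...) does.
 */
typedef int (*attach_fn)(void *ctx);

/*
 * Try to attach at most max_tries times, retrying only on EAGAIN.
 * Returns the result of the last attempt; *tries_used (if non-NULL)
 * receives the number of attempts made.
 */
static int attach_with_retry(attach_fn try_attach, void *ctx,
                             int max_tries, int *tries_used)
{
    int n = 0;
    int rc = -1;

    assert(try_attach != NULL && max_tries > 0);

    while (n < max_tries) {
        n++;
        errno = 0;
        rc = try_attach(ctx);
        if (rc == 0 || errno != EAGAIN)
            break;
        /*
         * EAGAIN: the target is busy in execve. Retrying blindly only
         * helps if nothing is waiting on this caller. A debugger that
         * already traces a sibling thread would have to call waitpid
         * and/or PTRACE_CONT here, otherwise every retry fails again.
         */
    }

    if (tries_used)
        *tries_used = n;
    return rc;
}

/* Fake attach for demonstration: busy for *remaining more attempts. */
static int fake_attach(void *ctx)
{
    int *remaining = ctx;

    if (*remaining > 0) {
        (*remaining)--;
        errno = EAGAIN;
        return -1;
    }
    return 0;
}
```

Without the max_tries bound this is exactly the endless loop described above: the error code carries no hint about what the caller must do to make progress.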
But the main reason why I preferred the overlapping lifetime of the current
and the new credentials is that the tracee can escape the PTRACE_ATTACH
if it is very short-lived, and indeed I had to cheat a little to make the
test case function TEST(attach) pass reliably:
The traced process does execlp("sleep", "sleep", "2", NULL);
If it did execlp("true", "true", NULL); like the first test case, it would
have failed randomly, because the debugger could not attach quickly enough,
and IMHO the expectation of the debugger is probably to be able to stop the
new process at the first instruction after the execve.
Bernd.
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [RFC][PATCH] exec: Move cred computation under exec_update_lock
2025-11-20 20:57 ` [RFC][PATCH] exec: Move cred computation under exec_update_lock Eric W. Biederman
2025-11-20 23:50 ` Eric W. Biederman
2025-11-23 18:52 ` Oleg Nesterov
@ 2025-11-25 11:55 ` Roberto Sassu
2025-12-01 16:06 ` Are setuid shell scripts safe? (Implied by security_bprm_creds_for_exec) Eric W. Biederman
2 siblings, 1 reply; 68+ messages in thread
From: Roberto Sassu @ 2025-11-25 11:55 UTC (permalink / raw)
To: Eric W. Biederman, Bernd Edlinger
Cc: Alexander Viro, Alexey Dobriyan, Oleg Nesterov, Kees Cook,
Andy Lutomirski, Will Drewry, Christian Brauner, Andrew Morton,
Michal Hocko, Serge Hallyn, James Morris, Randy Dunlap,
Suren Baghdasaryan, Yafang Shao, Helge Deller, Adrian Reber,
Thomas Gleixner, Jens Axboe, Alexei Starovoitov, linux-fsdevel,
linux-kernel, linux-kselftest, linux-mm, linux-security-module,
tiozhang, Luis Chamberlain, Paulo Alcantara (SUSE),
Sergey Senozhatsky, Frederic Weisbecker, YueHaibing, Paul Moore,
Aleksa Sarai, Stefan Roesch, Chao Yu, xu xin, Jeff Layton,
Jan Kara, David Hildenbrand, Dave Chinner, Shuah Khan,
Elena Reshetova, David Windsor, Mateusz Guzik, Ard Biesheuvel,
Joel Fernandes (Google), Matthew Wilcox (Oracle),
Hans Liljestrand, Penglei Jiang, Lorenzo Stoakes, Adrian Ratiu,
Ingo Molnar, Peter Zijlstra (Intel),
Cyrill Gorcunov, Eric Dumazet, zohar, linux-integrity
On Thu, 2025-11-20 at 14:57 -0600, Eric W. Biederman wrote:
> Instead of computing the new cred before we pass the point of no
> return compute the new cred just before we use it.
>
> This allows the removal of fs_struct->in_exec and cred_guard_mutex.
>
> I am not certain why we wanted to compute the cred for the new
> executable so early. Perhaps I missed something but I did not see any
> common errors being signaled. So I don't think we lose anything by
> computing the new cred later.
>
> We gain a lot.
>
> We stop holding the cred_guard_mutex over places where the code sleeps
> and waits for userspace. These places include the waiting for the
> tracer in PTRACE_EVENT_EXIT, "put_user(0, tsk->clear_child_tid)" in
> mm_release, and "get_user(futex_offset, ...)" in exit_robust_mutex.
>
> We can remove fs_struct->in_exec. The case where it was used simply
> never comes up, when we compute the cred after de_thread completes.
>
> We remove the possibility of a hang between a tracer calling
> PTRACE_ATTACH/PTRACE_SEIZE and the kernel waiting for the tracer
> in PTRACE_EVENT_EXIT.
>
> ---
> Oleg, Kees, Bernd, Can you see anything I am missing?
+ Mimi, linux-integrity (would be nice if we are in CC when linux-
security-module is in CC).
Apologies for not answering earlier, it seems I don't receive the
emails from the linux-security-module mailing list (thanks Serge for
letting me know!).
I tested your patch but there are a few warnings like this:
[ 2.702374] =====================================
[ 2.702854] WARNING: bad unlock balance detected!
[ 2.703350] 6.18.0-rc6+ #409 Not tainted
[ 2.703755] -------------------------------------
[ 2.704241] init/1 is trying to release lock (init_fs.seq) at:
[ 2.704829] [<ffffffff81836100>] begin_new_exec+0xfe0/0x1710
[ 2.705421] but there are no more locks to release!
[ 2.705931]
[ 2.705931] other info that might help us debug this:
[ 2.706610] 1 lock held by init/1:
[ 2.706958] #0: ffff88810083e538 (&sig->exec_update_lock){+.+.}-{4:4}, at: begin_new_exec+0x769/0x1710
and then the system hangs.
I see two main effects of this patch. First, the bprm_check_security
hook implementations will not see bprm->cred populated. That was a
problem before we made this patch:
https://patchew.org/linux/20251008113503.2433343-1-roberto.sassu@huaweicloud.com/
to work around the problem of not calculating the final DAC credentials
early enough (well, we actually had to change our CREDS_CHECK hook
behavior).
The second, I could not check. If I remember well, unlike the
capability LSM, SELinux/Apparmor/SMACK calculate the final credentials
based on the first file being executed (thus the script, not the
interpreter). Is this patch keeping the same behavior despite preparing
the credentials when the final binary is found?
Thanks
Roberto
> The code compiles but I haven't tested it yet.
>
> I thought I was going to move commit_creds before de_thread, but that
> would have taken commit_cred out of exec_update_lock (which introduces
> races).
>
> However I can't see any drawbacks of going the other direction.
>
>
> fs/exec.c | 88 ++++++++++++++----------------------
> fs/fs_struct.c | 1 -
> fs/proc/base.c | 4 +-
> include/linux/fs_struct.h | 1 -
> include/linux/sched/signal.h | 6 ---
> init/init_task.c | 1 -
> kernel/cred.c | 2 +-
> kernel/fork.c | 8 +---
> kernel/ptrace.c | 4 +-
> kernel/seccomp.c | 12 ++---
> 10 files changed, 45 insertions(+), 82 deletions(-)
>
> diff --git a/fs/exec.c b/fs/exec.c
> index 4298e7e08d5d..5ae96584dab0 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1090,6 +1090,9 @@ void __set_task_comm(struct task_struct *tsk, const char *buf, bool exec)
> perf_event_comm(tsk, exec);
> }
>
> +static int prepare_bprm_creds(struct linux_binprm *bprm);
> +static void check_unsafe_exec(struct linux_binprm *bprm);
> +
> /*
> * Calling this is the point of no return. None of the failures will be
> * seen by userspace since either the process is already taking a fatal
> @@ -1101,10 +1104,6 @@ int begin_new_exec(struct linux_binprm * bprm)
> struct task_struct *me = current;
> int retval;
>
> - /* Once we are committed compute the creds */
> - retval = bprm_creds_from_file(bprm);
> - if (retval)
> - return retval;
>
> /*
> * This tracepoint marks the point before flushing the old exec where
> @@ -1123,8 +1122,6 @@ int begin_new_exec(struct linux_binprm * bprm)
> retval = de_thread(me);
> if (retval)
> goto out;
> - /* see the comment in check_unsafe_exec() */
> - current->fs->in_exec = 0;
> /*
> * Cancel any io_uring activity across execve
> */
> @@ -1251,6 +1248,25 @@ int begin_new_exec(struct linux_binprm * bprm)
> WRITE_ONCE(me->self_exec_id, me->self_exec_id + 1);
> flush_signal_handlers(me, 0);
>
> + retval = prepare_bprm_creds(bprm);
> + if (retval)
> + goto out_unlock;
> +
> + /*
> + * Check for unsafe execution states before exec_binprm(), which
> + * will call back into begin_new_exec(), into bprm_creds_from_file(),
> + * where setuid-ness is evaluated.
> + */
> + check_unsafe_exec(bprm);
> +
> + /* Set the unchanging part of bprm->cred */
> + retval = security_bprm_creds_for_exec(bprm);
> +
> + /* Once we are committed compute the creds */
> + retval = bprm_creds_from_file(bprm);
> + if (retval)
> + goto out_unlock;
> +
> retval = set_cred_ucounts(bprm->cred);
> if (retval < 0)
> goto out_unlock;
> @@ -1272,9 +1288,9 @@ int begin_new_exec(struct linux_binprm * bprm)
> if (get_dumpable(me->mm) != SUID_DUMP_USER)
> perf_event_exit_task(me);
> /*
> - * cred_guard_mutex must be held at least to this point to prevent
> + * exec_update_lock must be held at least to this point to prevent
> * ptrace_attach() from altering our determination of the task's
> - * credentials; any time after this it may be unlocked.
> + * credentials.
> */
> security_bprm_committed_creds(bprm);
>
> @@ -1291,8 +1307,6 @@ int begin_new_exec(struct linux_binprm * bprm)
>
> out_unlock:
> up_write(&me->signal->exec_update_lock);
> - if (!bprm->cred)
> - mutex_unlock(&me->signal->cred_guard_mutex);
>
> out:
> return retval;
> @@ -1336,7 +1350,6 @@ void setup_new_exec(struct linux_binprm * bprm)
> */
> me->mm->task_size = TASK_SIZE;
> up_write(&me->signal->exec_update_lock);
> - mutex_unlock(&me->signal->cred_guard_mutex);
> }
> EXPORT_SYMBOL(setup_new_exec);
>
> @@ -1351,21 +1364,15 @@ void finalize_exec(struct linux_binprm *bprm)
> EXPORT_SYMBOL(finalize_exec);
>
> /*
> - * Prepare credentials and lock ->cred_guard_mutex.
> - * setup_new_exec() commits the new creds and drops the lock.
> - * Or, if exec fails before, free_bprm() should release ->cred
> - * and unlock.
> + * Prepare credentials. begin_new_exec() commits the new creds.
> + * Or, if exec fails before, free_bprm() should release ->cred.
> */
> static int prepare_bprm_creds(struct linux_binprm *bprm)
> {
> - if (mutex_lock_interruptible(&current->signal->cred_guard_mutex))
> - return -ERESTARTNOINTR;
> -
> bprm->cred = prepare_exec_creds();
> if (likely(bprm->cred))
> return 0;
>
> - mutex_unlock(&current->signal->cred_guard_mutex);
> return -ENOMEM;
> }
>
> @@ -1386,9 +1393,7 @@ static void free_bprm(struct linux_binprm *bprm)
> }
> free_arg_pages(bprm);
> if (bprm->cred) {
> - /* in case exec fails before de_thread() succeeds */
> - current->fs->in_exec = 0;
> - mutex_unlock(&current->signal->cred_guard_mutex);
> + /* in case exec fails before commit_creds succeeds */
> abort_creds(bprm->cred);
> }
> do_close_execat(bprm->file);
> @@ -1486,13 +1491,12 @@ EXPORT_SYMBOL(bprm_change_interp);
>
> /*
> * determine how safe it is to execute the proposed program
> - * - the caller must hold ->cred_guard_mutex to protect against
> + * - the caller must hold ->exec_update_lock to protect against
> * PTRACE_ATTACH or seccomp thread-sync
> */
> static void check_unsafe_exec(struct linux_binprm *bprm)
> {
> - struct task_struct *p = current, *t;
> - unsigned n_fs;
> + struct task_struct *p = current;
>
> if (p->ptrace)
> bprm->unsafe |= LSM_UNSAFE_PTRACE;
> @@ -1509,25 +1513,9 @@ static void check_unsafe_exec(struct linux_binprm *bprm)
> * suid exec because the differently privileged task
> * will be able to manipulate the current directory, etc.
> * It would be nice to force an unshare instead...
> - *
> - * Otherwise we set fs->in_exec = 1 to deny clone(CLONE_FS)
> - * from another sub-thread until de_thread() succeeds, this
> - * state is protected by cred_guard_mutex we hold.
> */
> - n_fs = 1;
> - read_seqlock_excl(&p->fs->seq);
> - rcu_read_lock();
> - for_other_threads(p, t) {
> - if (t->fs == p->fs)
> - n_fs++;
> - }
> - rcu_read_unlock();
> -
> - /* "users" and "in_exec" locked for copy_fs() */
> - if (p->fs->users > n_fs)
> + if (p->fs->users > 1)
> bprm->unsafe |= LSM_UNSAFE_SHARE;
> - else
> - p->fs->in_exec = 1;
> read_sequnlock_excl(&p->fs->seq);
> }
>
> @@ -1731,25 +1719,15 @@ static int bprm_execve(struct linux_binprm *bprm)
> {
> int retval;
>
> - retval = prepare_bprm_creds(bprm);
> - if (retval)
> - return retval;
> + if (bprm->is_check)
> + return 0;
>
> - /*
> - * Check for unsafe execution states before exec_binprm(), which
> - * will call back into begin_new_exec(), into bprm_creds_from_file(),
> - * where setuid-ness is evaluated.
> - */
> - check_unsafe_exec(bprm);
> current->in_execve = 1;
> sched_mm_cid_before_execve(current);
>
> sched_exec();
>
> - /* Set the unchanging part of bprm->cred */
> - retval = security_bprm_creds_for_exec(bprm);
> - if (retval || bprm->is_check)
> - goto out;
> +
>
> retval = exec_binprm(bprm);
> if (retval < 0)
> diff --git a/fs/fs_struct.c b/fs/fs_struct.c
> index 28be762ac1c6..945bc0916f65 100644
> --- a/fs/fs_struct.c
> +++ b/fs/fs_struct.c
> @@ -109,7 +109,6 @@ struct fs_struct *copy_fs_struct(struct fs_struct *old)
> /* We don't need to lock fs - think why ;-) */
> if (fs) {
> fs->users = 1;
> - fs->in_exec = 0;
> seqlock_init(&fs->seq);
> fs->umask = old->umask;
>
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 6299878e3d97..7041fb4d1689 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -2834,14 +2834,14 @@ static ssize_t proc_pid_attr_write(struct file * file, const char __user * buf,
> }
>
> /* Guard against adverse ptrace interaction */
> - rv = mutex_lock_interruptible(&current->signal->cred_guard_mutex);
> + rv = down_write_killable(&current->signal->exec_update_lock);
> if (rv < 0)
> goto out_free;
>
> rv = security_setprocattr(PROC_I(inode)->op.lsmid,
> file->f_path.dentry->d_name.name, page,
> count);
> - mutex_unlock(&current->signal->cred_guard_mutex);
> + up_write(&current->signal->exec_update_lock);
> out_free:
> kfree(page);
> out:
> diff --git a/include/linux/fs_struct.h b/include/linux/fs_struct.h
> index baf200ab5c77..29d0f7d57743 100644
> --- a/include/linux/fs_struct.h
> +++ b/include/linux/fs_struct.h
> @@ -10,7 +10,6 @@ struct fs_struct {
> int users;
> seqlock_t seq;
> int umask;
> - int in_exec;
> struct path root, pwd;
> } __randomize_layout;
>
> diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
> index 7d6449982822..7e9259c8fb2b 100644
> --- a/include/linux/sched/signal.h
> +++ b/include/linux/sched/signal.h
> @@ -241,12 +241,6 @@ struct signal_struct {
> struct mm_struct *oom_mm; /* recorded mm when the thread group got
> * killed by the oom killer */
>
> - struct mutex cred_guard_mutex; /* guard against foreign influences on
> - * credential calculations
> - * (notably. ptrace)
> - * Deprecated do not use in new code.
> - * Use exec_update_lock instead.
> - */
> struct rw_semaphore exec_update_lock; /* Held while task_struct is
> * being updated during exec,
> * and may have inconsistent
> diff --git a/init/init_task.c b/init/init_task.c
> index a55e2189206f..4813bffe217e 100644
> --- a/init/init_task.c
> +++ b/init/init_task.c
> @@ -30,7 +30,6 @@ static struct signal_struct init_signals = {
> #ifdef CONFIG_CGROUPS
> .cgroup_threadgroup_rwsem = __RWSEM_INITIALIZER(init_signals.cgroup_threadgroup_rwsem),
> #endif
> - .cred_guard_mutex = __MUTEX_INITIALIZER(init_signals.cred_guard_mutex),
> .exec_update_lock = __RWSEM_INITIALIZER(init_signals.exec_update_lock),
> #ifdef CONFIG_POSIX_TIMERS
> .posix_timers = HLIST_HEAD_INIT,
> diff --git a/kernel/cred.c b/kernel/cred.c
> index dbf6b687dc5c..80e376ce005f 100644
> --- a/kernel/cred.c
> +++ b/kernel/cred.c
> @@ -252,7 +252,7 @@ EXPORT_SYMBOL(prepare_creds);
>
> /*
> * Prepare credentials for current to perform an execve()
> - * - The caller must hold ->cred_guard_mutex
> + * - The caller must hold ->exec_update_lock
> */
> struct cred *prepare_exec_creds(void)
> {
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 3da0f08615a9..996c649b9a4c 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1555,11 +1555,6 @@ static int copy_fs(u64 clone_flags, struct task_struct *tsk)
> if (clone_flags & CLONE_FS) {
> /* tsk->fs is already what we want */
> read_seqlock_excl(&fs->seq);
> - /* "users" and "in_exec" locked for check_unsafe_exec() */
> - if (fs->in_exec) {
> - read_sequnlock_excl(&fs->seq);
> - return -EAGAIN;
> - }
> fs->users++;
> read_sequnlock_excl(&fs->seq);
> return 0;
> @@ -1699,7 +1694,6 @@ static int copy_signal(u64 clone_flags, struct task_struct *tsk)
> sig->oom_score_adj = current->signal->oom_score_adj;
> sig->oom_score_adj_min = current->signal->oom_score_adj_min;
>
> - mutex_init(&sig->cred_guard_mutex);
> init_rwsem(&sig->exec_update_lock);
>
> return 0;
> @@ -1710,7 +1704,7 @@ static void copy_seccomp(struct task_struct *p)
> #ifdef CONFIG_SECCOMP
> /*
> * Must be called with sighand->lock held, which is common to
> - * all threads in the group. Holding cred_guard_mutex is not
> + * all threads in the group. Holding exec_update_lock is not
> * needed because this new task is not yet running and cannot
> * be racing exec.
> */
> diff --git a/kernel/ptrace.c b/kernel/ptrace.c
> index 75a84efad40f..8140d4bfc279 100644
> --- a/kernel/ptrace.c
> +++ b/kernel/ptrace.c
> @@ -444,8 +444,8 @@ static int ptrace_attach(struct task_struct *task, long request,
> * SUID, SGID and LSM creds get determined differently
> * under ptrace.
> */
> - scoped_cond_guard (mutex_intr, return -ERESTARTNOINTR,
> - &task->signal->cred_guard_mutex) {
> + scoped_cond_guard (rwsem_read_intr, return -ERESTARTNOINTR,
> + &task->signal->exec_update_lock) {
>
> scoped_guard (task_lock, task) {
> retval = __ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS);
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index 25f62867a16d..87de8d47d876 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -479,7 +479,7 @@ static int is_ancestor(struct seccomp_filter *parent,
> /**
> * seccomp_can_sync_threads: checks if all threads can be synchronized
> *
> - * Expects sighand and cred_guard_mutex locks to be held.
> + * Expects sighand and exec_update_lock locks to be held.
> *
> * Returns 0 on success, -ve on error, or the pid of a thread which was
> * either not in the correct seccomp mode or did not have an ancestral
> @@ -489,7 +489,7 @@ static inline pid_t seccomp_can_sync_threads(void)
> {
> struct task_struct *thread, *caller;
>
> - BUG_ON(!mutex_is_locked(&current->signal->cred_guard_mutex));
> + BUG_ON(!rwsem_is_locked(&current->signal->exec_update_lock));
> assert_spin_locked(&current->sighand->siglock);
>
> /* Validate all threads being eligible for synchronization. */
> @@ -590,7 +590,7 @@ void seccomp_filter_release(struct task_struct *tsk)
> *
> * @flags: SECCOMP_FILTER_FLAG_* flags to set during sync.
> *
> - * Expects sighand and cred_guard_mutex locks to be held, and for
> + * Expects sighand and exec_update_lock locks to be held, and for
> * seccomp_can_sync_threads() to have returned success already
> * without dropping the locks.
> *
> @@ -599,7 +599,7 @@ static inline void seccomp_sync_threads(unsigned long flags)
> {
> struct task_struct *thread, *caller;
>
> - BUG_ON(!mutex_is_locked(&current->signal->cred_guard_mutex));
> + BUG_ON(!rwsem_is_locked(&current->signal->exec_update_lock));
> assert_spin_locked(&current->sighand->siglock);
>
> /*
> @@ -2011,7 +2011,7 @@ static long seccomp_set_mode_filter(unsigned int flags,
> * while another thread is in the middle of calling exec.
> */
> if (flags & SECCOMP_FILTER_FLAG_TSYNC &&
> - mutex_lock_killable(&current->signal->cred_guard_mutex))
> + down_read_killable(&current->signal->exec_update_lock))
> goto out_put_fd;
>
> spin_lock_irq(&current->sighand->siglock);
> @@ -2034,7 +2034,7 @@ static long seccomp_set_mode_filter(unsigned int flags,
> out:
> spin_unlock_irq(&current->sighand->siglock);
> if (flags & SECCOMP_FILTER_FLAG_TSYNC)
> - mutex_unlock(&current->signal->cred_guard_mutex);
> + up_read(&current->signal->exec_update_lock);
> out_put_fd:
> if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) {
> if (ret) {
^ permalink raw reply [flat|nested] 68+ messages in thread
* Are setuid shell scripts safe? (Implied by security_bprm_creds_for_exec)
2025-11-25 11:55 ` Roberto Sassu
@ 2025-12-01 16:06 ` Eric W. Biederman
2025-12-01 16:49 ` Roberto Sassu
2025-12-04 15:43 ` Stephen Smalley
0 siblings, 2 replies; 68+ messages in thread
From: Eric W. Biederman @ 2025-12-01 16:06 UTC (permalink / raw)
To: Roberto Sassu
Cc: Bernd Edlinger, Alexander Viro, Alexey Dobriyan, Oleg Nesterov,
Kees Cook, Andy Lutomirski, Will Drewry, Christian Brauner,
Andrew Morton, Michal Hocko, Serge Hallyn, James Morris,
Randy Dunlap, Suren Baghdasaryan, Yafang Shao, Helge Deller,
Adrian Reber, Thomas Gleixner, Jens Axboe, Alexei Starovoitov,
linux-fsdevel, linux-kernel, linux-kselftest, linux-mm,
linux-security-module, tiozhang, Luis Chamberlain,
Paulo Alcantara (SUSE),
Sergey Senozhatsky, Frederic Weisbecker, YueHaibing, Paul Moore,
Aleksa Sarai, Stefan Roesch, Chao Yu, xu xin, Jeff Layton,
Jan Kara, David Hildenbrand, Dave Chinner, Shuah Khan,
Elena Reshetova, David Windsor, Mateusz Guzik, Ard Biesheuvel,
Joel Fernandes (Google), Matthew Wilcox (Oracle),
Hans Liljestrand, Penglei Jiang, Lorenzo Stoakes, Adrian Ratiu,
Ingo Molnar, Peter Zijlstra (Intel),
Cyrill Gorcunov, Eric Dumazet, zohar, linux-integrity, Ryan Lee,
apparmor
Roberto Sassu <roberto.sassu@huaweicloud.com> writes:
> + Mimi, linux-integrity (would be nice if we are in CC when linux-
> security-module is in CC).
>
> Apologies for not answering earlier, it seems I don't receive the
> emails from the linux-security-module mailing list (thanks Serge for
> letting me know!).
>
> I see two main effects of this patch. First, the bprm_check_security
> hook implementations will not see bprm->cred populated. That was a
> problem before we made this patch:
>
> https://patchew.org/linux/20251008113503.2433343-1-roberto.sassu@huaweicloud.com/
Thanks, that is definitely needed.
Does calling process_measurement(CREDS_CHECK) on only the final file
pass review? Do you know of any cases where that will break things?
As it stands I don't think it should be assumed that any LSM has
computed its final creds until bprm_creds_from_file. Not just the
uid and gid.
If the patch you posted for review works that helps sort that mess out.
> to work around the problem of not calculating the final DAC credentials
> early enough (well, we actually had to change our CREDS_CHECK hook
> behavior).
>
> The second, I could not check. If I remember well, unlike the
> capability LSM, SELinux/Apparmor/SMACK calculate the final credentials
> based on the first file being executed (thus the script, not the
> interpreter). Is this patch keeping the same behavior despite preparing
> the credentials when the final binary is found?
The patch I posted was.
My brain is still reeling from the realization that our security modules
have the implicit assumption that it is safe to calculate their security
information from shell scripts.
In the first half of the 90's I remember there was lots of effort to try
and make setuid shell scripts and setuid perl scripts work, and the
final conclusion was it was a lost cause.
Now I look at security_bprm_creds_for_exec and security_bprm_check which
both have the implicit assumption that it is indeed safe to compute the
credentials from a shell script.
When passing a file descriptor to execveat we have
BINPRM_FLAGS_PATH_INACCESSIBLE and use /dev/fd/NNN as the filename
which reduces some of the races.
However when just plain executing a shell script we pass the filename of
the shell script as a command line argument, and expect the shell to
open the filename again. This has been a time of check to time of use
race for decades, and one of the reasons we don't have setuid shell
scripts.
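A minimal userspace sketch of the race described above, not kernel code. It assumes /tmp is writable, and run_script_sim() is a hypothetical helper invented for illustration. The "check" step opens the script path, the path is then flipped to another file, and the "use" step either reopens the path by name (the plain "#!" case) or keeps reading the descriptor that was checked (the /dev/fd/NNN style):

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Create a small file with the given contents; returns 0 on success. */
static int write_file(const char *path, const char *text)
{
	int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	ssize_t n;

	if (fd < 0)
		return -1;
	n = write(fd, text, strlen(text));
	close(fd);
	return n == (ssize_t)strlen(text) ? 0 : -1;
}

/*
 * Hypothetical helper: simulate "check, then use" on a script path.
 * reuse_fd == 0: the interpreter reopens the path by name.
 * reuse_fd != 0: the interpreter reads the descriptor that was checked.
 * Stores what the interpreter would read in buf; returns 0 on success.
 */
int run_script_sim(int reuse_fd, char *buf, size_t len)
{
	char dir[] = "/tmp/toctou-XXXXXX";
	char trusted[64], evil[64], script[64];
	int checked_fd = -1, use_fd = -1, ret = -1;
	ssize_t n;

	if (!buf || len == 0 || !mkdtemp(dir))
		return -1;
	snprintf(trusted, sizeof(trusted), "%s/trusted", dir);
	snprintf(evil, sizeof(evil), "%s/evil", dir);
	snprintf(script, sizeof(script), "%s/script", dir);

	if (write_file(trusted, "trusted") || write_file(evil, "evil"))
		goto cleanup;
	if (symlink("trusted", script))
		goto cleanup;

	/* Time of check: the decision is based on this file. */
	checked_fd = open(script, O_RDONLY);
	if (checked_fd < 0)
		goto cleanup;

	/* The window: the path is flipped to point at another file. */
	if (unlink(script) || symlink("evil", script))
		goto cleanup;

	/* Time of use: reopen by name, or keep the checked file. */
	use_fd = reuse_fd ? checked_fd : open(script, O_RDONLY);
	if (use_fd < 0)
		goto cleanup;
	n = read(use_fd, buf, len - 1);
	if (n < 0)
		goto cleanup;
	buf[n] = '\0';
	ret = 0;

cleanup:
	if (use_fd >= 0 && use_fd != checked_fd)
		close(use_fd);
	if (checked_fd >= 0)
		close(checked_fd);
	unlink(script);
	unlink(trusted);
	unlink(evil);
	rmdir(dir);
	return ret;
}
```

Calling run_script_sim(0, buf, sizeof(buf)) leaves the contents of the swapped-in file in buf, while run_script_sim(1, buf, sizeof(buf)) still yields the file that was checked. The sketch only models the reopen-by-name window; it says nothing about the other historical issues with setuid scripts.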
Yet the IMA implementation (without the above mentioned patch) assumes
the final creds will be calculated before security_bprm_check is called,
and security_bprm_creds_for_exec busily calculates the final creds.
For some of the security modules I believe anyone can set any label they
want on a file and they remain secure (At which point I don't understand
the point of having labels on files). I don't believe that is the case
for selinux, or in general.
So just to remove the TOCTOU race the security_bprm_creds_for_exec
and security_bprm_check hooks need to be removed, after moving their
code into something like security_bprm_creds_from_file.
Or am I missing something and even with the TOCTOU race are setuid shell
scripts somehow safe now?
Eric
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: Are setuid shell scripts safe? (Implied by security_bprm_creds_for_exec)
2025-12-01 16:06 ` Are setuid shell scripts safe? (Implied by security_bprm_creds_for_exec) Eric W. Biederman
@ 2025-12-01 16:49 ` Roberto Sassu
2025-12-01 18:53 ` Eric W. Biederman
2025-12-04 15:43 ` Stephen Smalley
1 sibling, 1 reply; 68+ messages in thread
From: Roberto Sassu @ 2025-12-01 16:49 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Bernd Edlinger, Alexander Viro, Alexey Dobriyan, Oleg Nesterov,
Kees Cook, Andy Lutomirski, Will Drewry, Christian Brauner,
Andrew Morton, Michal Hocko, Serge Hallyn, James Morris,
Randy Dunlap, Suren Baghdasaryan, Yafang Shao, Helge Deller,
Adrian Reber, Thomas Gleixner, Jens Axboe, Alexei Starovoitov,
linux-fsdevel, linux-kernel, linux-kselftest, linux-mm,
linux-security-module, tiozhang, Luis Chamberlain,
Paulo Alcantara (SUSE),
Sergey Senozhatsky, Frederic Weisbecker, YueHaibing, Paul Moore,
Aleksa Sarai, Stefan Roesch, Chao Yu, xu xin, Jeff Layton,
Jan Kara, David Hildenbrand, Dave Chinner, Shuah Khan,
Elena Reshetova, David Windsor, Mateusz Guzik, Ard Biesheuvel,
Joel Fernandes (Google), Matthew Wilcox (Oracle),
Hans Liljestrand, Penglei Jiang, Lorenzo Stoakes, Adrian Ratiu,
Ingo Molnar, Peter Zijlstra (Intel),
Cyrill Gorcunov, Eric Dumazet, zohar, linux-integrity, Ryan Lee,
apparmor
On Mon, 2025-12-01 at 10:06 -0600, Eric W. Biederman wrote:
> Roberto Sassu <roberto.sassu@huaweicloud.com> writes:
>
> > + Mimi, linux-integrity (would be nice if we are in CC when linux-
> > security-module is in CC).
> >
> > Apologies for not answering earlier, it seems I don't receive the
> > emails from the linux-security-module mailing list (thanks Serge for
> > letting me know!).
> >
> > I see two main effects of this patch. First, the bprm_check_security
> > hook implementations will not see bprm->cred populated. That was a
> > problem before we made this patch:
> >
> > https://patchew.org/linux/20251008113503.2433343-1-roberto.sassu@huaweicloud.com/
>
> Thanks, that is definitely needed.
>
> Does calling process_measurement(CREDS_CHECK) on only the final file
> pass review? Do you know of any cases where that will break things?
We intentionally changed the behavior of CREDS_CHECK to be invoked only
for the final file. We are monitoring for bug reports; if we receive
complaints from people that the patch breaks their expectations, we will
revisit the issue.
Any LSM implementing bprm_check_security looking for bprm->cred would
be affected by recalculating the DAC credentials for the final binary.
> As it stands I don't think it should be assumed that any LSM has
> computed it's final creds until bprm_creds_from_file. Not just the
> uid and gid.
Uhm, I can be wrong, but most LSMs calculate their state change in
bprm_creds_for_exec (git grep bprm_creds_for_exec|grep LSM_HOOK_INIT).
> If the patch you posted for review works that helps sort that mess out.
Well, it works because we changed the expectation :)
> > to work around the problem of not calculating the final DAC credentials
> > early enough (well, we actually had to change our CREDS_CHECK hook
> > behavior).
> >
> > The second, I could not check. If I remember well, unlike the
> > capability LSM, SELinux/Apparmor/SMACK calculate the final credentials
> > based on the first file being executed (thus the script, not the
> > interpreter). Is this patch keeping the same behavior despite preparing
> > the credentials when the final binary is found?
>
> The patch I posted was.
>
> My brain is still reeling from the realization that our security modules
> have the implicit assumption that it is safe to calculate their security
> information from shell scripts.
If I'm interpreting this behavior correctly (any LSM maintainer, please
comment on it), the intent is just to transition to a different
security context where a different set of rules could apply (since we
are executing a script).
If the security transition for every script were based on the
interpreter, it would be hard to differentiate between scripts and
to associate different security labels with the respective processes.
> In the first half of the 90's I remember there was lots of effort to try
> and make setuid shell scripts and setuid perl scripts work, and the
> final conclusion was it was a lost cause.
Definitely I lack a lot of context...
> Now I look at security_bprm_creds_for_exec and security_bprm_check which
> both have the implicit assumption that it is indeed safe to compute the
> credentials from a shell script.
>
> When passing a file descriptor to execat we have
> BINPRM_FLAGS_PATH_INACCESSIBLE and use /dev/fd/NNN as the filename
> which reduces some of the races.
>
> However when just plain executing a shell script we pass the filename of
> the shell script as a command line argument, and expect the shell to
> open the filename again. This has been a time of check to time of use
> race for decades, and one of the reasons we don't have setuid shell
> scripts.
Yes, it would be really nice to fix it!
> Yet the IMA implementation (without the above mentioned patch) assumes
> the final creds will be calculated before security_bprm_check is called,
> and security_bprm_creds_for_exec busily calculate the final creds.
>
> For some of the security modules I believe anyone can set any label they
> want on a file and they remain secure (At which point I don't understand
> the point of having labels on files). I don't believe that is the case
> for selinux, or in general.
A simple example for SELinux. Suppose that the parent process has type
initrc_t, then the SELinux policy configures the following transitions
based on the label of the first file executed (sesearch -T -s initrc_t
-c process):
type_transition initrc_t NetworkManager_dispatcher_exec_t:process NetworkManager_dispatcher_t;
type_transition initrc_t NetworkManager_exec_t:process NetworkManager_t;
type_transition initrc_t NetworkManager_initrc_exec_t:process initrc_t;
type_transition initrc_t NetworkManager_priv_helper_exec_t:process NetworkManager_priv_helper_t;
type_transition initrc_t abrt_dump_oops_exec_t:process abrt_dump_oops_t;
type_transition initrc_t abrt_exec_t:process abrt_t;
[...]
(there are 747 rules in my system).
If the transition would be based on the interpreter label, it would be
hard to express with rules.
If the transition does not occur for any reason the parent process
policy would still apply, but maybe it would not have the necessary
permissions for the execution of the script.
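A toy model of the lookup described above, written as a sketch and not as how SELinux actually stores or evaluates policy. The struct, the three-entry table and the transition() helper are hypothetical; the type names in the table come from the rules listed above, and shell_exec_t is assumed here as the interpreter's label:

```c
#include <stddef.h>
#include <string.h>

/* One type_transition rule for the "process" class. */
struct type_transition {
	const char *source;     /* domain of the process calling execve */
	const char *target;     /* label of the file being executed */
	const char *new_domain; /* domain after the transition */
};

/* A few of the rules quoted above; a real policy has many more. */
static const struct type_transition rules[] = {
	{ "initrc_t", "NetworkManager_exec_t", "NetworkManager_t" },
	{ "initrc_t", "NetworkManager_initrc_exec_t", "initrc_t" },
	{ "initrc_t", "abrt_exec_t", "abrt_t" },
};

/*
 * Return the domain a process in "source" ends up in after executing a
 * file labelled "file_label". Without a matching rule the caller's
 * domain is kept, as in the "transition does not occur" case.
 */
const char *transition(const char *source, const char *file_label)
{
	size_t i;

	for (i = 0; i < sizeof(rules) / sizeof(rules[0]); i++) {
		if (!strcmp(rules[i].source, source) &&
		    !strcmp(rules[i].target, file_label))
			return rules[i].new_domain;
	}
	return source;
}
```

Keyed on the script's own label, transition("initrc_t", "abrt_exec_t") and transition("initrc_t", "NetworkManager_exec_t") select different domains. Keyed on the interpreter's label, every script run by the same shell would present the same key, here the assumed shell_exec_t, and the table could no longer tell them apart.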
> So just to remove the TOCTOU race the security_bprm_creds_for_exec
> and security_bprm_check hooks need to be removed, after moving their
> code into something like security_bprm_creds_from_file.
>
> Or am I missing something and even with the TOCTOU race are setuid shell
> scripts somehow safe now?
Take this with a looot of salt, if there is a TOCTOU race, the script
will be executed with a security context that does not belong to it.
But the transition already happened. Not sure if it is safe.
I also don't know how the TOCTOU race could be solved, but I also would
like it to be fixed. I'm available to comment on any proposal!
Roberto
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: Are setuid shell scripts safe? (Implied by security_bprm_creds_for_exec)
2025-12-01 16:49 ` Roberto Sassu
@ 2025-12-01 18:53 ` Eric W. Biederman
2025-12-01 21:39 ` David Laight
2025-12-03 13:16 ` Bernd Edlinger
0 siblings, 2 replies; 68+ messages in thread
From: Eric W. Biederman @ 2025-12-01 18:53 UTC (permalink / raw)
To: Roberto Sassu
Cc: Bernd Edlinger, Alexander Viro, Alexey Dobriyan, Oleg Nesterov,
Kees Cook, Andy Lutomirski, Will Drewry, Christian Brauner,
Andrew Morton, Michal Hocko, Serge Hallyn, James Morris,
Randy Dunlap, Suren Baghdasaryan, Yafang Shao, Helge Deller,
Adrian Reber, Thomas Gleixner, Jens Axboe, Alexei Starovoitov,
linux-fsdevel, linux-kernel, linux-kselftest, linux-mm,
linux-security-module, tiozhang, Luis Chamberlain,
Paulo Alcantara (SUSE),
Sergey Senozhatsky, Frederic Weisbecker, YueHaibing, Paul Moore,
Aleksa Sarai, Stefan Roesch, Chao Yu, xu xin, Jeff Layton,
Jan Kara, David Hildenbrand, Dave Chinner, Shuah Khan,
Elena Reshetova, David Windsor, Mateusz Guzik, Ard Biesheuvel,
Joel Fernandes (Google), Matthew Wilcox (Oracle),
Hans Liljestrand, Penglei Jiang, Lorenzo Stoakes, Adrian Ratiu,
Ingo Molnar, Peter Zijlstra (Intel),
Cyrill Gorcunov, Eric Dumazet, zohar, linux-integrity, Ryan Lee,
apparmor
Roberto Sassu <roberto.sassu@huaweicloud.com> writes:
> On Mon, 2025-12-01 at 10:06 -0600, Eric W. Biederman wrote:
>> Roberto Sassu <roberto.sassu@huaweicloud.com> writes:
>>
>> > + Mimi, linux-integrity (would be nice if we are in CC when linux-
>> > security-module is in CC).
>> >
>> > Apologies for not answering earlier, it seems I don't receive the
>> > emails from the linux-security-module mailing list (thanks Serge for
>> > letting me know!).
>> >
>> > I see two main effects of this patch. First, the bprm_check_security
>> > hook implementations will not see bprm->cred populated. That was a
>> > problem before we made this patch:
>> >
>> > https://patchew.org/linux/20251008113503.2433343-1-roberto.sassu@huaweicloud.com/
>>
>> Thanks, that is definitely needed.
>>
>> Does calling process_measurement(CREDS_CHECK) on only the final file
>> pass review? Do you know of any cases where that will break things?
>
> We intentionally changed the behavior of CREDS_CHECK to be invoked only
> for the final file. We are monitoring for bug reports, if we receive
> complains from people that the patch breaks their expectation we will
> revisit the issue.
>
> Any LSM implementing bprm_check_security looking for brpm->cred would
> be affected by recalculating the DAC credentials for the final binary.
>
>> As it stands I don't think it should be assumed that any LSM has
>> computed it's final creds until bprm_creds_from_file. Not just the
>> uid and gid.
>
> Uhm, I can be wrong, but most LSMs calculate their state change in
> bprm_creds_for_exec (git grep bprm_creds_for_exec|grep LSM_HOOK_INIT).
>
>> If the patch you posted for review works that helps sort that mess out.
>
> Well, it works because we changed the expectation :)
I just haven't seen that code land in Linus's tree yet so I am a bit
cautious in adopting that. It is definitely needed as the behavior
of IMA as of v6.18 simply does not work in general.
>> > to work around the problem of not calculating the final DAC credentials
>> > early enough (well, we actually had to change our CREDS_CHECK hook
>> > behavior).
>> >
>> > The second, I could not check. If I remember well, unlike the
>> > capability LSM, SELinux/Apparmor/SMACK calculate the final credentials
>> > based on the first file being executed (thus the script, not the
>> > interpreter). Is this patch keeping the same behavior despite preparing
>> > the credentials when the final binary is found?
>>
>> The patch I posted was.
>>
>> My brain is still reeling from the realization that our security modules
>> have the implicit assumption that it is safe to calculate their security
>> information from shell scripts.
>
> If I'm interpreting this behavior correctly (please any LSM maintainer
> could comment on it), the intent is just to transition to a different
> security context where a different set of rules could apply (since we
> are executing a script).
>
> Imagine if for every script, the security transition is based on the
> interpreter, it would be hard to differentiate between scripts and
> associate to the respective processes different security labels.
>
>> In the first half of the 90's I remember there was lots of effort to try
>> and make setuid shell scripts and setuid perl scripts work, and the
>> final conclusion was it was a lost cause.
>
> Definitely I lack a lot of context...
From the usenet comp.unix.faq that was probably updated in 1994:
http://www.faqs.org/faqs/unix-faq/faq/part4/section-7.html
I have been trying to remember enough details by looking it up, but the
short version is that one of the big problems is there is a race between
the kernel doing its thing and the shell opening the shell script.
Clever people have been able to take advantage of that race and insert
arbitrary code in that window for the shell to execute. All you have to
do is google for how to find a reproducer if the one in the link above
is not enough.
>> Now I look at security_bprm_creds_for_exec and security_bprm_check which
>> both have the implicit assumption that it is indeed safe to compute the
>> credentials from a shell script.
>>
>> When passing a file descriptor to execat we have
>> BINPRM_FLAGS_PATH_INACCESSIBLE and use /dev/fd/NNN as the filename
>> which reduces some of the races.
>>
>> However when just plain executing a shell script we pass the filename of
>> the shell script as a command line argument, and expect the shell to
>> open the filename again. This has been a time of check to time of use
>> race for decades, and one of the reasons we don't have setuid shell
>> scripts.
>
> Yes, it would be really nice to fix it!
After 30 years I really don't expect that is even a reasonable request.
I think we are solidly into "Don't do that then", and the LSM security
hooks are definitely doing that.
There is the partial solution of passing /dev/fd instead of passing the
name of the script. I suspect that would break things. I don't
remember why that was never adopted.
I think even with the TOCTOU race fixed there were other serious issues.
I really think it behooves any security module people who want to use
the shell script as the basis of their security decisions to research
all of the old well known issues and describe how they don't apply.
All I have energy for is to point out it is broken as is and to start
moving code down into bprm_creds_from_file to avoid the race.
Right now as far as I can tell anything based upon the script itself
is worthless junk so changing that would not be breaking anything that
wasn't already broken.
>> Yet the IMA implementation (without the above mentioned patch) assumes
>> the final creds will be calculated before security_bprm_check is called,
>> and security_bprm_creds_for_exec busily calculate the final creds.
>>
>> For some of the security modules I believe anyone can set any label they
>> want on a file and they remain secure (At which point I don't understand
>> the point of having labels on files). I don't believe that is the case
>> for selinux, or in general.
>
> A simple example for SELinux. Suppose that the parent process has type
> initrc_t, then the SELinux policy configures the following transitions
> based on the label of the first file executed (sesearch -T -s initrc_t
> -c process):
>
> type_transition initrc_t NetworkManager_dispatcher_exec_t:process NetworkManager_dispatcher_t;
> type_transition initrc_t NetworkManager_exec_t:process NetworkManager_t;
> type_transition initrc_t NetworkManager_initrc_exec_t:process initrc_t;
> type_transition initrc_t NetworkManager_priv_helper_exec_t:process NetworkManager_priv_helper_t;
> type_transition initrc_t abrt_dump_oops_exec_t:process abrt_dump_oops_t;
> type_transition initrc_t abrt_exec_t:process abrt_t;
> [...]
>
> (there are 747 rules in my system).
>
> If the transition would be based on the interpreter label, it would be
> hard to express with rules.
Which is a problem for the people making the rules engine. Because
30 years of experience with this problem says basing anything on the
script is already broken.
I understand the frustration, but it requires a new way of launching
shell scripts to even begin to make it secure.
> If the transition does not occur for any reason the parent process
> policy would still apply, but maybe it would not have the necessary
> permissions for the execution of the script.
Yep.
>> So just to remove the TOCTOU race the security_bprm_creds_for_exec
>> and security_bprm_check hooks need to be removed, after moving their
>> code into something like security_bprm_creds_from_file.
>>
>> Or am I missing something and even with the TOCTOU race are setuid shell
>> scripts somehow safe now?
>
> Take this with a looot of salt, if there is a TOCTOU race, the script
> will be executed with a security context that does not belong to it.
> But the transition already happened. Not sure if it is safe.
Historically it hasn't been safe.
> I also don't know how the TOCTOU race could be solved, but I also would
> like it to be fixed. I'm available to comment on any proposal!
I am hoping someone who helped put these security hooks where they are
will speak up, and tell me what I am missing.
All I have the energy for right now is to point out security policies
based upon shell scripts appear to be security policies that only
protect you from well behaved programs.
Eric
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: Are setuid shell scripts safe? (Implied by security_bprm_creds_for_exec)
2025-12-01 18:53 ` Eric W. Biederman
@ 2025-12-01 21:39 ` David Laight
2025-12-03 13:16 ` Bernd Edlinger
1 sibling, 0 replies; 68+ messages in thread
From: David Laight @ 2025-12-01 21:39 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Roberto Sassu, Bernd Edlinger, Alexander Viro, Alexey Dobriyan,
Oleg Nesterov, Kees Cook, Andy Lutomirski, Will Drewry,
Christian Brauner, Andrew Morton, Michal Hocko, Serge Hallyn,
James Morris, Randy Dunlap, Suren Baghdasaryan, Yafang Shao,
Helge Deller, Adrian Reber, Thomas Gleixner, Jens Axboe,
Alexei Starovoitov, linux-fsdevel, linux-kernel, linux-kselftest,
linux-mm, linux-security-module, tiozhang, Luis Chamberlain,
Paulo Alcantara (SUSE),
Sergey Senozhatsky, Frederic Weisbecker, YueHaibing, Paul Moore,
Aleksa Sarai, Stefan Roesch, Chao Yu, xu xin, Jeff Layton,
Jan Kara, David Hildenbrand, Dave Chinner, Shuah Khan,
Elena Reshetova, David Windsor, Mateusz Guzik, Ard Biesheuvel,
Joel Fernandes (Google), Matthew Wilcox (Oracle),
Hans Liljestrand, Penglei Jiang, Lorenzo Stoakes, Adrian Ratiu,
Ingo Molnar, Peter Zijlstra (Intel),
Cyrill Gorcunov, Eric Dumazet, zohar, linux-integrity, Ryan Lee,
apparmor
On Mon, 01 Dec 2025 12:53:10 -0600
"Eric W. Biederman" <ebiederm@xmission.com> wrote:
> Roberto Sassu <roberto.sassu@huaweicloud.com> writes:
...
> There is the partial solution of passing /dev/fd instead of passing the
> name of the script. I suspect that would break things. I don't
> remember why that was never adopted.
I thought that was what was done - and stopped the problem of a user
flipping a symlink between a suid script and one the user had written.
It has only ever been done for suid scripts when the uid actually changes.
Which makes it possible to set the permissions so that owner can't
run the script!
(The kernel only needs 'x' access, the shell needs 'r' access, so with 'x+s'
the owner can't execute the script but everyone else can.)
There is a much older problem that probably only affected the original 1970s
'sh' (not even the SYSV/SunOS version) that quoted redirects on the command
line would get actioned when the parameter was substituted - which I think
means the original 'sh' did post-substitution syntax analysis (the same
as cmd.exe still does).
That doesn't affect any shells used since the early 1980s.
David
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: Are setuid shell scripts safe? (Implied by security_bprm_creds_for_exec)
2025-12-01 18:53 ` Eric W. Biederman
2025-12-01 21:39 ` David Laight
@ 2025-12-03 13:16 ` Bernd Edlinger
2025-12-04 5:49 ` Al Viro
1 sibling, 1 reply; 68+ messages in thread
From: Bernd Edlinger @ 2025-12-03 13:16 UTC (permalink / raw)
To: Eric W. Biederman, Roberto Sassu
Cc: Alexander Viro, Alexey Dobriyan, Oleg Nesterov, Kees Cook,
Andy Lutomirski, Will Drewry, Christian Brauner, Andrew Morton,
Michal Hocko, Serge Hallyn, James Morris, Randy Dunlap,
Suren Baghdasaryan, Yafang Shao, Helge Deller, Adrian Reber,
Thomas Gleixner, Jens Axboe, Alexei Starovoitov, linux-fsdevel,
linux-kernel, linux-kselftest, linux-mm, linux-security-module,
tiozhang, Luis Chamberlain, Paulo Alcantara (SUSE),
Sergey Senozhatsky, Frederic Weisbecker, YueHaibing, Paul Moore,
Aleksa Sarai, Stefan Roesch, Chao Yu, xu xin, Jeff Layton,
Jan Kara, David Hildenbrand, Dave Chinner, Shuah Khan,
Elena Reshetova, David Windsor, Mateusz Guzik, Ard Biesheuvel,
Joel Fernandes (Google), Matthew Wilcox (Oracle),
Hans Liljestrand, Penglei Jiang, Lorenzo Stoakes, Adrian Ratiu,
Ingo Molnar, Peter Zijlstra (Intel),
Cyrill Gorcunov, Eric Dumazet, zohar, linux-integrity, Ryan Lee,
apparmor
On 12/1/25 19:53, Eric W. Biederman wrote:
> Roberto Sassu <roberto.sassu@huaweicloud.com> writes:
>
>> On Mon, 2025-12-01 at 10:06 -0600, Eric W. Biederman wrote:
>>> Roberto Sassu <roberto.sassu@huaweicloud.com> writes:
>>>
>>>> + Mimi, linux-integrity (would be nice if we are in CC when linux-
>>>> security-module is in CC).
>>>>
>>>> Apologies for not answering earlier, it seems I don't receive the
>>>> emails from the linux-security-module mailing list (thanks Serge for
>>>> letting me know!).
>>>>
>>>> I see two main effects of this patch. First, the bprm_check_security
>>>> hook implementations will not see bprm->cred populated. That was a
>>>> problem before we made this patch:
>>>>
>>>> https://patchew.org/linux/20251008113503.2433343-1-roberto.sassu@huaweicloud.com/
>>>
>>> Thanks, that is definitely needed.
>>>
>>> Does calling process_measurement(CREDS_CHECK) on only the final file
>>> pass review? Do you know of any cases where that will break things?
>>
>> We intentionally changed the behavior of CREDS_CHECK to be invoked only
>> for the final file. We are monitoring for bug reports, if we receive
>> complains from people that the patch breaks their expectation we will
>> revisit the issue.
>>
>> Any LSM implementing bprm_check_security looking for brpm->cred would
>> be affected by recalculating the DAC credentials for the final binary.
>>
>>> As it stands I don't think it should be assumed that any LSM has
>>> computed it's final creds until bprm_creds_from_file. Not just the
>>> uid and gid.
>>
>> Uhm, I can be wrong, but most LSMs calculate their state change in
>> bprm_creds_for_exec (git grep bprm_creds_for_exec|grep LSM_HOOK_INIT).
>>
>>> If the patch you posted for review works that helps sort that mess out.
>>
>> Well, it works because we changed the expectation :)
>
> I just haven't seen that code land in Linus's tree yet so I am a bit
> cautious in adopting that. It is definitely needed as the behavior
> of IMA as v6.18 simply does not work in general.
>
>>>> to work around the problem of not calculating the final DAC credentials
>>>> early enough (well, we actually had to change our CREDS_CHECK hook
>>>> behavior).
>>>>
>>>> The second, I could not check. If I remember well, unlike the
>>>> capability LSM, SELinux/Apparmor/SMACK calculate the final credentials
>>>> based on the first file being executed (thus the script, not the
>>>> interpreter). Is this patch keeping the same behavior despite preparing
>>>> the credentials when the final binary is found?
>>>
>>> The patch I posted was.
>>>
>>> My brain is still reeling from the realization that our security modules
>>> have the implicit assumption that it is safe to calculate their security
>>> information from shell scripts.
>>
>> If I'm interpreting this behavior correctly (please any LSM maintainer
>> could comment on it), the intent is just to transition to a different
>> security context where a different set of rules could apply (since we
>> are executing a script).
>>
>> Imagine if for every script, the security transition is based on the
>> interpreter, it would be hard to differentiate between scripts and
>> associate to the respective processes different security labels.
>>
>>> In the first half of the 90's I remember there was lots of effort to try
>>> and make setuid shell scripts and setuid perl scripts work, and the
>>> final conclusion was it was a lost cause.
>>
>> Definitely I lack a lot of context...
>
> From the usenet comp.unix.faq that was probably updated in 1994:
> http://www.faqs.org/faqs/unix-faq/faq/part4/section-7.html
>
> I have been trying to remember enough details by looking it up, but the
> short version is that one of the big problems is there is a race between
> the kernel doing it's thing and the shell opening the shell script.
>
> Clever people have been able to take advantage of that race and insert
> arbitrary code in that window for the shell to execute. All you have to
> do is google for how to find a reproducer if the one in the link above
> is not enough.
>
>>> Now I look at security_bprm_creds_for_exec and security_bprm_check which
>>> both have the implicit assumption that it is indeed safe to compute the
>>> credentials from a shell script.
>>>
>>> When passing a file descriptor to execat we have
>>> BINPRM_FLAGS_PATH_INACCESSIBLE and use /dev/fd/NNN as the filename
>>> which reduces some of the races.
>>>
>>> However when just plain executing a shell script we pass the filename of
>>> the shell script as a command line argument, and expect the shell to
>>> open the filename again. This has been a time of check to time of use
>>> race for decades, and one of the reasons we don't have setuid shell
>>> scripts.
>>
>> Yes, it would be really nice to fix it!
>
> After 30 years I really don't expect that is even a reasonable request.
>
> I think we are solidly into "Don't do that then", and the LSM security
> hooks are definitely doing that.
>
> There is the partial solution of passing /dev/fd instead of passing the
> name of the script. I suspect that would break things. I don't
> remember why that was never adopted.
>
> I think even with the TOCTOU race fixed there were other serious issues.
>
> I really think it behooves any security module people who want to use
> the shell script as the basis of their security decisions to research
> all of the old well known issues and describe how they don't apply.
>
> All I have energy for is to point out it is broken as is and to start
> moving code down into bprm_creds_from_file to avoid the race.
>
> Right now as far as I can tell anything based upon the script itself
> is worthless junk so changing that would not be breaking anything that
> wasn't already broken.
>
>>> Yet the IMA implementation (without the above mentioned patch) assumes
>>> the final creds will be calculated before security_bprm_check is called,
>>> and security_bprm_creds_for_exec busily calculate the final creds.
>>>
>>> For some of the security modules I believe anyone can set any label they
>>> want on a file and they remain secure (At which point I don't understand
>>> the point of having labels on files). I don't believe that is the case
>>> for selinux, or in general.
>>
>> A simple example for SELinux. Suppose that the parent process has type
>> initrc_t, then the SELinux policy configures the following transitions
>> based on the label of the first file executed (sesearch -T -s initrc_t
>> -c process):
>>
>> type_transition initrc_t NetworkManager_dispatcher_exec_t:process NetworkManager_dispatcher_t;
>> type_transition initrc_t NetworkManager_exec_t:process NetworkManager_t;
>> type_transition initrc_t NetworkManager_initrc_exec_t:process initrc_t;
>> type_transition initrc_t NetworkManager_priv_helper_exec_t:process NetworkManager_priv_helper_t;
>> type_transition initrc_t abrt_dump_oops_exec_t:process abrt_dump_oops_t;
>> type_transition initrc_t abrt_exec_t:process abrt_t;
>> [...]
>>
>> (there are 747 rules in my system).
>>
>> If the transition would be based on the interpreter label, it would be
>> hard to express with rules.
>
> Which is a problem for the people making the rules engine. Because
> 30 years of experience with this problem says basing anything on the
> script is already broken.
>
> I understand the frustration, but it requires a new way of launching
> shell scripts to even begin to make it secure.
>
>> If the transition does not occur for any reason the parent process
>> policy would still apply, but maybe it would not have the necessary
>> permissions for the execution of the script.
>
> Yep.
>
>>> So just to remove the TOCTOU race the security_bprm_creds_for_exec
>>> and security_bprm_check hooks need to be removed, after moving their
>>> code into something like security_bprm_creds_from_file.
>>>
>>> Or am I missing something and even with the TOCTOU race are setuid shell
>>> scripts somehow safe now?
>>
>> Take this with a looot of salt, if there is a TOCTOU race, the script
>> will be executed with a security context that does not belong to it.
>> But the transition already happened. Not sure if it is safe.
>
> Historically it hasn't been safe.
>
>> I also don't know how the TOCTOU race could be solved, but I also would
>> like it to be fixed. I'm available to comment on any proposal!
>
> I am hoping someone who helped put these security hooks where they are
> will speak up, and tell me what I am missing.
>
> All I have the energy for right now is to point out security policies
> based upon shell scripts appear to be security policies that only
> protect you from well behaved programs.
>
Hmm, yes, that looks like an issue.
I would have expected the security engine to look at bprm->filename,
especially in the case when bprm->interp != bprm->filename,
and check that it is not a symlink with write access for the
current user, and of course also that bprm->file is not a regular file
which is writable by the current user. If that is the case, I would have expected
the security engine to enforce non-new-privs on a SUID executable somehow.
Bernd.
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: Are setuid shell scripts safe? (Implied by security_bprm_creds_for_exec)
2025-12-03 13:16 ` Bernd Edlinger
@ 2025-12-04 5:49 ` Al Viro
2025-12-04 9:32 ` David Laight
2025-12-04 13:03 ` Bernd Edlinger
0 siblings, 2 replies; 68+ messages in thread
From: Al Viro @ 2025-12-04 5:49 UTC (permalink / raw)
To: Bernd Edlinger
Cc: Eric W. Biederman, Roberto Sassu, Alexey Dobriyan, Oleg Nesterov,
Kees Cook, Andy Lutomirski, Will Drewry, Christian Brauner,
Andrew Morton, Michal Hocko, Serge Hallyn, James Morris,
Randy Dunlap, Suren Baghdasaryan, Yafang Shao, Helge Deller,
Adrian Reber, Thomas Gleixner, Jens Axboe, Alexei Starovoitov,
linux-fsdevel, linux-kernel, linux-kselftest, linux-mm,
linux-security-module, tiozhang, Luis Chamberlain,
Paulo Alcantara (SUSE),
Sergey Senozhatsky, Frederic Weisbecker, YueHaibing, Paul Moore,
Aleksa Sarai, Stefan Roesch, Chao Yu, xu xin, Jeff Layton,
Jan Kara, David Hildenbrand, Dave Chinner, Shuah Khan,
Elena Reshetova, David Windsor, Mateusz Guzik, Ard Biesheuvel,
Joel Fernandes (Google), Matthew Wilcox (Oracle),
Hans Liljestrand, Penglei Jiang, Lorenzo Stoakes, Adrian Ratiu,
Ingo Molnar, Peter Zijlstra (Intel),
Cyrill Gorcunov, Eric Dumazet, zohar, linux-integrity, Ryan Lee,
apparmor
On Wed, Dec 03, 2025 at 02:16:29PM +0100, Bernd Edlinger wrote:
> Hmm, yes, that looks like an issue.
>
> I would have expected the security engine to look at bprm->filename,
> especially in the case when bprm->interp != bprm->filename,
> and check that it is not a sym-link with write-access for the
> current user, and of course also that bprm->file is not a regular file
> which is writable by the current user. If that is the case, I would have expected
> the security engine to enforce non-new-privs on a SUID executable somehow.
Check that _what_ is not a symlink? And while we are at it, what do write
permissions to any symlinks have to do with anything whatsoever?
* Re: Are setuid shell scripts safe? (Implied by security_bprm_creds_for_exec)
2025-12-04 5:49 ` Al Viro
@ 2025-12-04 9:32 ` David Laight
2025-12-04 13:03 ` Bernd Edlinger
1 sibling, 0 replies; 68+ messages in thread
From: David Laight @ 2025-12-04 9:32 UTC (permalink / raw)
To: Al Viro
Cc: Bernd Edlinger, Eric W. Biederman, Roberto Sassu,
Alexey Dobriyan, Oleg Nesterov, Kees Cook, Andy Lutomirski,
Will Drewry, Christian Brauner, Andrew Morton, Michal Hocko,
Serge Hallyn, James Morris, Randy Dunlap, Suren Baghdasaryan,
Yafang Shao, Helge Deller, Adrian Reber, Thomas Gleixner,
Jens Axboe, Alexei Starovoitov, linux-fsdevel, linux-kernel,
linux-kselftest, linux-mm, linux-security-module, tiozhang,
Luis Chamberlain, Paulo Alcantara (SUSE),
Sergey Senozhatsky, Frederic Weisbecker, YueHaibing, Paul Moore,
Aleksa Sarai, Stefan Roesch, Chao Yu, xu xin, Jeff Layton,
Jan Kara, David Hildenbrand, Dave Chinner, Shuah Khan,
Elena Reshetova, David Windsor, Mateusz Guzik, Ard Biesheuvel,
Joel Fernandes (Google), Matthew Wilcox (Oracle),
Hans Liljestrand, Penglei Jiang, Lorenzo Stoakes, Adrian Ratiu,
Ingo Molnar, Peter Zijlstra (Intel),
Cyrill Gorcunov, Eric Dumazet, zohar, linux-integrity, Ryan Lee,
apparmor
On Thu, 4 Dec 2025 05:49:15 +0000
Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Wed, Dec 03, 2025 at 02:16:29PM +0100, Bernd Edlinger wrote:
>
> > Hmm, yes, that looks like an issue.
> >
> > I would have expected the security engine to look at bprm->filename,
> > especially in the case when bprm->interp != bprm->filename,
> > and check that it is not a sym-link with write-access for the
> > current user, and of course also that bprm->file is not a regular file
> > which is writable by the current user. If that is the case, I would have expected
> > the security engine to enforce non-new-privs on a SUID executable somehow.
>
> Check that _what_ is not a symlink? And while we are at it, what do write
> permissions to any symlinks have to do with anything whatsoever?
>
You'd need to check for write permissions to all the directories in the
full path of the symlink and in all the directories traversed by the symlink.
(and that may not be enough....)
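A minimal sketch of the directory check described above, assuming GNU coreutils (`stat -c`). The function name `list_loose_dirs` and its optional stop directory are hypothetical and exist only for illustration. It walks the lexical path only and does not follow the directories a symlink traverses, so, as noted, it may not be enough:

```shell
# list_loose_dirs FILE [STOP]   (hypothetical helper, illustration only)
# Print every ancestor directory of FILE, up to and including STOP
# (default "/"), whose group- or other-write bit is set.
list_loose_dirs() {
    dir=$(dirname -- "$1")
    stop=${2:-/}
    while :; do
        mode=$(stat -c '%a' -- "$dir") || return 1
        # The leading 0 makes the shell read the mode as octal;
        # 022 is the mask for group-write plus other-write.
        if [ $(( 0$mode & 022 )) -ne 0 ]; then
            printf '%s\n' "$dir"
        fi
        if [ "$dir" = "$stop" ] || [ "$dir" = "/" ]; then
            break
        fi
        dir=$(dirname -- "$dir")
    done
}
```

An empty result only says that no directory on the lexical path is group- or other-writable. It says nothing about ownership, ACLs, or a rename that happens after the walk.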
Passing the shell (or whatever) /dev/fd/n doesn't seem (to me) any different
from what happens when the elf interpreter runs a suid program.
You might want to check for non-owner write permissions to the /dev/fd/n entry,
but that is true for any suid executable, not just scripts.
FWIW the SYSV shells normally set the effective uid back to the real uid.
So making a script suid didn't work unless the script started with "#!/bin/sh -p".
Whether that improved security (rather than being annoying) is another matter.
David
* Re: Are setuid shell scripts safe? (Implied by security_bprm_creds_for_exec)
2025-12-04 5:49 ` Al Viro
2025-12-04 9:32 ` David Laight
@ 2025-12-04 13:03 ` Bernd Edlinger
1 sibling, 0 replies; 68+ messages in thread
From: Bernd Edlinger @ 2025-12-04 13:03 UTC (permalink / raw)
To: Al Viro
Cc: Eric W. Biederman, Roberto Sassu, Alexey Dobriyan, Oleg Nesterov,
Kees Cook, Andy Lutomirski, Will Drewry, Christian Brauner,
Andrew Morton, Michal Hocko, Serge Hallyn, James Morris,
Randy Dunlap, Suren Baghdasaryan, Yafang Shao, Helge Deller,
Adrian Reber, Thomas Gleixner, Jens Axboe, Alexei Starovoitov,
linux-fsdevel, linux-kernel, linux-kselftest, linux-mm,
linux-security-module, tiozhang, Luis Chamberlain,
Paulo Alcantara (SUSE),
Sergey Senozhatsky, Frederic Weisbecker, YueHaibing, Paul Moore,
Aleksa Sarai, Stefan Roesch, Chao Yu, xu xin, Jeff Layton,
Jan Kara, David Hildenbrand, Dave Chinner, Shuah Khan,
Elena Reshetova, David Windsor, Mateusz Guzik, Ard Biesheuvel,
Joel Fernandes (Google), Matthew Wilcox (Oracle),
Hans Liljestrand, Penglei Jiang, Lorenzo Stoakes, Adrian Ratiu,
Ingo Molnar, Peter Zijlstra (Intel),
Cyrill Gorcunov, Eric Dumazet, zohar, linux-integrity, Ryan Lee,
apparmor
On 12/4/25 06:49, Al Viro wrote:
> On Wed, Dec 03, 2025 at 02:16:29PM +0100, Bernd Edlinger wrote:
>
>> Hmm, yes, that looks like an issue.
>>
>> I would have expected the security engine to look at bprm->filename,
>> especially in the case when bprm->interp != bprm->filename,
>> and check that it is not a sym-link with write-access for the
>> current user, and of course also that bprm->file is not a regular file
>> which is writable by the current user. If that is the case, I would have expected
>> the security engine to enforce non-new-privs on a SUID executable somehow.
>
> Check that _what_ is not a symlink? And while we are at it, what do write
> permissions to any symlinks have to do with anything whatsoever?
When we execve a normal executable, we open the binary file with deny_write_access,
so this might allow the security engine to inspect the binary before it is used.
However, this behavior changed recently: there are now some exceptions where even
this is no longer guaranteed for binary executables, due to
commit 0357ef03c94ef835bd44a0658b8edb672a9dbf51, but why? I have no idea...
But with shell scripts an attack is possible where a sym-link is executed
and the SUID bit of the target file is used, but a race condition might allow
the attacker to replace the script that is used by the shell:
Consider this:
ln -s /usr/bin/legitimate-suid-script.sh
where legitimate-suid-script.sh starts with "#! /bin/bash -"
and the attack works this way:
./legitimate-suid-script.sh &
ln -f -s do-what-i-want.sh legitimate-suid-script.sh
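The window can be illustrated without any setuid file. This is only a sketch under stated assumptions: every file name is hypothetical and lives in a throwaway temp directory, and the "check" and "use" steps run sequentially here, whereas a real attack has to win a race against execve:

```shell
# Illustration only: no setuid bit is involved and every file name is hypothetical.
demo_symlink_swap() {
    dir=$(mktemp -d) || return 1
    printf 'echo legitimate\n' > "$dir/legitimate.sh"
    printf 'echo attacker\n'   > "$dir/do-what-i-want.sh"

    # The attacker owns only the symlink, not the file it points to.
    ln -s "$dir/legitimate.sh" "$dir/link.sh"

    # Time of check: the name still resolves to the trusted script.
    checked=$(basename "$(readlink "$dir/link.sh")")

    # The attacker re-points the link before the interpreter reopens it.
    ln -sf "$dir/do-what-i-want.sh" "$dir/link.sh"

    # Time of use: the interpreter opens the same name a second time.
    used=$(sh "$dir/link.sh")

    rm -rf "$dir"
    printf 'checked=%s used=%s\n' "$checked" "$used"
}

demo_symlink_swap
```

The point of the sketch is that whatever was inspected under the name at check time is not necessarily what the interpreter later reads under that same name.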
Bernd.
* Re: Are setuid shell scripts safe? (Implied by security_bprm_creds_for_exec)
2025-12-01 16:06 ` Are setuid shell scripts safe? (Implied by security_bprm_creds_for_exec) Eric W. Biederman
2025-12-01 16:49 ` Roberto Sassu
@ 2025-12-04 15:43 ` Stephen Smalley
1 sibling, 0 replies; 68+ messages in thread
From: Stephen Smalley @ 2025-12-04 15:43 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Roberto Sassu, Bernd Edlinger, Alexander Viro, Alexey Dobriyan,
Oleg Nesterov, Kees Cook, Andy Lutomirski, Will Drewry,
Christian Brauner, Andrew Morton, Michal Hocko, Serge Hallyn,
James Morris, Randy Dunlap, Suren Baghdasaryan, Yafang Shao,
Helge Deller, Adrian Reber, Thomas Gleixner, Jens Axboe,
Alexei Starovoitov, linux-fsdevel, linux-kernel, linux-kselftest,
linux-mm, linux-security-module, tiozhang, Luis Chamberlain,
Paulo Alcantara (SUSE),
Sergey Senozhatsky, Frederic Weisbecker, YueHaibing, Paul Moore,
Aleksa Sarai, Stefan Roesch, Chao Yu, xu xin, Jeff Layton,
Jan Kara, David Hildenbrand, Dave Chinner, Shuah Khan,
Elena Reshetova, David Windsor, Mateusz Guzik, Ard Biesheuvel,
Joel Fernandes (Google), Matthew Wilcox (Oracle),
Hans Liljestrand, Penglei Jiang, Lorenzo Stoakes, Adrian Ratiu,
Ingo Molnar, Peter Zijlstra (Intel),
Cyrill Gorcunov, Eric Dumazet, zohar, linux-integrity, Ryan Lee,
apparmor
On Mon, Dec 1, 2025 at 11:34 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> Roberto Sassu <roberto.sassu@huaweicloud.com> writes:
>
> > + Mimi, linux-integrity (would be nice if we are in CC when linux-
> > security-module is in CC).
> >
> > Apologies for not answering earlier, it seems I don't receive the
> > emails from the linux-security-module mailing list (thanks Serge for
> > letting me know!).
> >
> > I see two main effects of this patch. First, the bprm_check_security
> > hook implementations will not see bprm->cred populated. That was a
> > problem before we made this patch:
> >
> > https://patchew.org/linux/20251008113503.2433343-1-roberto.sassu@huaweicloud.com/
>
> Thanks, that is definitely needed.
>
> Does calling process_measurement(CREDS_CHECK) on only the final file
> pass review? Do you know of any cases where that will break things?
>
> As it stands I don't think it should be assumed that any LSM has
> computed its final creds until bprm_creds_from_file. Not just the
> uid and gid.
>
> If the patch you posted for review works that helps sort that mess out.
>
> > to work around the problem of not calculating the final DAC credentials
> > early enough (well, we actually had to change our CREDS_CHECK hook
> > behavior).
> >
> > The second, I could not check. If I remember well, unlike the
> > capability LSM, SELinux/Apparmor/SMACK calculate the final credentials
> > based on the first file being executed (thus the script, not the
> > interpreter). Is this patch keeping the same behavior despite preparing
> > the credentials when the final binary is found?
>
> The patch I posted was.
>
> My brain is still reeling from the realization that our security modules
> have the implicit assumption that it is safe to calculate their security
> information from shell scripts.
>
> In the first half of the 90's I remember there was lots of effort to try
> and make setuid shell scripts and setuid perl scripts work, and the
> final conclusion was it was a lost cause.
>
> Now I look at security_bprm_creds_for_exec and security_bprm_check which
> both have the implicit assumption that it is indeed safe to compute the
> credentials from a shell script.
>
> When passing a file descriptor to execveat we have
> BINPRM_FLAGS_PATH_INACCESSIBLE and use /dev/fd/NNN as the filename
> which reduces some of the races.
>
> However when just plain executing a shell script we pass the filename of
> the shell script as a command line argument, and expect the shell to
> open the filename again. This has been a time of check to time of use
> race for decades, and one of the reasons we don't have setuid shell
> scripts.
>
> Yet the IMA implementation (without the above mentioned patch) assumes
> the final creds will be calculated before security_bprm_check is called,
> and security_bprm_creds_for_exec busily calculates the final creds.
>
> For some of the security modules I believe anyone can set any label they
> want on a file and they remain secure (At which point I don't understand
> the point of having labels on files). I don't believe that is the case
> for selinux, or in general.
>
> So just to remove the TOCTOU race the security_bprm_creds_for_exec
> and security_bprm_check hooks need to be removed, after moving their
> code into something like security_bprm_creds_from_file.
>
> Or am I missing something and even with the TOCTOU race are setuid shell
> scripts somehow safe now?
setuid shell scripts are not safe. But SELinux (and likely AppArmor
and others) have long relied on the ability to transition on shell
scripts to _shed_ permissions. That's a matter of writing your policy
sensibly.
Changing it would break existing userspace and policies.
* Re: [PATCH v18] exec: Fix dead-lock in de_thread with ptrace_attach
2025-11-20 17:29 ` Eric W. Biederman
2025-11-20 20:57 ` [RFC][PATCH] exec: Move cred computation under exec_update_lock Eric W. Biederman
@ 2025-11-22 17:10 ` Bernd Edlinger
1 sibling, 0 replies; 68+ messages in thread
From: Bernd Edlinger @ 2025-11-22 17:10 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Alexander Viro, Alexey Dobriyan, Oleg Nesterov, Kees Cook,
Andy Lutomirski, Will Drewry, Christian Brauner, Andrew Morton,
Michal Hocko, Serge Hallyn, James Morris, Randy Dunlap,
Suren Baghdasaryan, Yafang Shao, Helge Deller, Adrian Reber,
Thomas Gleixner, Jens Axboe, Alexei Starovoitov, linux-fsdevel,
linux-kernel, linux-kselftest, linux-mm, linux-security-module,
tiozhang, Luis Chamberlain, Paulo Alcantara (SUSE),
Sergey Senozhatsky, Frederic Weisbecker, YueHaibing, Paul Moore,
Aleksa Sarai, Stefan Roesch, Chao Yu, xu xin, Jeff Layton,
Jan Kara, David Hildenbrand, Dave Chinner, Shuah Khan,
Elena Reshetova, David Windsor, Mateusz Guzik, Ard Biesheuvel,
Joel Fernandes (Google), Matthew Wilcox (Oracle),
Hans Liljestrand, Penglei Jiang, Lorenzo Stoakes, Adrian Ratiu,
Ingo Molnar, Peter Zijlstra (Intel),
Cyrill Gorcunov, Eric Dumazet, Ryan Lee, apparmor, selinux
On 11/20/25 18:29, Eric W. Biederman wrote:
> "Eric W. Biederman" <ebiederm@xmission.com> writes:
>
>> Bernd Edlinger <bernd.edlinger@hotmail.de> writes:
>>
>>> This introduces signal->exec_bprm, which is used to
>>> fix the case when at least one of the sibling threads
>>> is traced, and therefore the trace process may dead-lock
>>> in ptrace_attach, but de_thread will need to wait for the
>>> tracer to continue execution.
>>
>> A small quibble: it isn't a dead lock. It isn't even really a live lock,
>> as it is possible to SIGKILL our way out.
>>
That is of course what I meant to say with that sentence. In my language an
application process can "dead-lock" by waiting on a mutex forever.
Indeed the original problem with the dead-lock (I think in mm_access) was
worse, as both involved processes were only killable by "kill -9", but with the
remaining problem in ptrace_attach, the debugger can be killed by a simple CTRL-C.
So if I understand you right, you want me to use a different term here, like
"the trace process may be blocked in ptrace_attach" or so?
Or could you please give me a concrete suggestion for how to rephrase the patch
description?
BTW, unless there are objections I would also want to rephrase the description
of cred_guard_mutex to replace the term "Deprecated" with "Not recommended",
like this:
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -260,11 +260,11 @@ struct signal_struct {
struct mutex cred_guard_mutex; /* guard against foreign influences on
* credential calculations
* (notably. ptrace)
* Held while execve runs, except when
* a sibling thread is being traced.
- * Deprecated do not use in new code.
+ * Not recommended to use in new code.
* Use exec_update_lock instead.
*/
>> Thinking about this there is a really silly and simple way we can deal
>> with this situation for PTRACE_ATTACH. We can send SIGSTOP and wait for
>> the thread to stop before doing anything with cred_guard_mutex.
>>
>> PTRACE_ATTACH already implies sending SIGSTOP so as long as we have
>> enough permissions to send SIGSTOP I don't see that being a problem.
>>
>> The worst case I can see is that we get a case where we stop the
>> process, the permission check fails under cred_guard_mutex, and
>> ptrace attach fails and has to send SIGCONT to undo its
>> premature SIGSTOP. That might almost be visible, but it would still
>> be legitimate because we can still check that we have permission to
>> send SIGSTOP.
>
> Bah no I am full of it.
>
> The challenging behavior is in the semantics of the kernel operations.
> We need to describe it as such please.
>
> It is the same class of problem as when a single-threaded process calls exec
> with a pipe attached to both stdin and stdout of the new process.
>
> For the stdin and stdout we can say just use poll and nonblocking I/O.
>
> The problem is that both PTRACE_ATTACH and PTRACE_SEIZE block over
> the duration of exec, and if exec is waiting for a thread to exit,
> and that thread is blocked in PTRACE_EVENT_EXIT waiting for that very
> same tracer those processes will hang. Not deadlock.
>
>
> I haven't seen anyone clearly describe the problem lately so I am
> repeating it.
>
>
> Just looking at the code I don't think there is any fundamental reason
> to call commit_creds after de_thread. If we can change that we can sort
> this out without any change in userspace semantics.
>
> If we can't move commit_creds we have to either give
> PTRACE_ATTACH/PTRACE_SEIZE a non-block mode, or break out of
> PTRACE_EVENT_EXIT in de_thread.
>
> I will post a proof of concept of moving commit_creds in just a minute.
>
> Eric
Note: I forgot to add the apparmor and selinux mailing lists to this patch. Previous
versions of this tried to avoid touching the security engine code, and
instead temporarily installed the new credentials, mostly for the benefit of the
security engines. But that is considered an unacceptable solution, therefore
I want to use a new option to ptrace_may_access instead.
All security engines have to handle this option, but the advantage is that the
engines could detect and maybe also deny the unsafe execve.
This is an alternative to Eric's patch: "exec: Move cred computation under
exec_update_lock" that is supposed to solve the same problem, but tries instead
to avoid any user visible API change.
Thanks
Bernd.