From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id C5E49CCFA13 for ; Sun, 9 Nov 2025 17:16:08 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 15DBC8E000C; Sun, 9 Nov 2025 12:16:08 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 135F98E0002; Sun, 9 Nov 2025 12:16:08 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 0243C8E000C; Sun, 9 Nov 2025 12:16:07 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0012.hostedemail.com [216.40.44.12]) by kanga.kvack.org (Postfix) with ESMTP id E38488E0002 for ; Sun, 9 Nov 2025 12:16:07 -0500 (EST) Received: from smtpin07.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay01.hostedemail.com (Postfix) with ESMTP id 2E5B54C466 for ; Sun, 9 Nov 2025 17:16:07 +0000 (UTC) X-FDA: 84091721574.07.D47861B Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by imf18.hostedemail.com (Postfix) with ESMTP id 28EB21C0002 for ; Sun, 9 Nov 2025 17:16:04 +0000 (UTC) Authentication-Results: imf18.hostedemail.com; dkim=pass header.d=redhat.com header.s=mimecast20190719 header.b=YOi6bzAG; spf=pass (imf18.hostedemail.com: domain of oleg@redhat.com designates 170.10.129.124 as permitted sender) smtp.mailfrom=oleg@redhat.com; dmarc=pass (policy=quarantine) header.from=redhat.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1762708565; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=eJT6y1GDlaasTCIPvbrA0Yxc9LFx5w+afKdTXNnGcpk=; b=QtCvc5vSACBQIHoU6OX5a+oo/4HV1Hjb15Zl4QMATB8yM3bGCLsTntOzog8Uy4/FLsN05G 6Ks5EkTcH6ZLpEdeqSrKyvIPri3w0uA7KeBuMaaUJCqKZByMvOQg3toeab1xdshDbJRTBj A+nBww0LIjq3yDIsJqveJkcTN+n9gr0= ARC-Authentication-Results: i=1; imf18.hostedemail.com; dkim=pass header.d=redhat.com header.s=mimecast20190719 header.b=YOi6bzAG; spf=pass (imf18.hostedemail.com: domain of oleg@redhat.com designates 170.10.129.124 as permitted sender) smtp.mailfrom=oleg@redhat.com; dmarc=pass (policy=quarantine) header.from=redhat.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1762708565; a=rsa-sha256; cv=none; b=xcfSuVC5JVEGr3r7l2cHXwMcB+Lf912xaJbAL12qsBFyeyE9TwmmxvRYqkiJMclATV34r+ 97syQh3ieBSC+AzukiLm/rP3cfMB5qAFbNSr4Wud96Yy2etV9oBi21wR4aNtZeT4zIE+Cw 5BhXURrlFSs/yq5WoZZ7ilY89xJoFT0= DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1762708564; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=eJT6y1GDlaasTCIPvbrA0Yxc9LFx5w+afKdTXNnGcpk=; b=YOi6bzAGjPIQ+3SZ47QSmFUbGeCNGG724XpKFsm+t2I+JanZwCkPnqd07DzYhtqKjfAODf TD1vD2g8cewg5kmNh4xcTa2oEYHSXm/bN0w8bhMGoJ8kXVaJILhIb321LsDMLLl/SbWBHq YGyMfAoYk3MIkeVJjZwmfycU3qRpqes= Received: from mx-prod-mc-05.mail-002.prod.us-west-2.aws.redhat.com (ec2-54-186-198-63.us-west-2.compute.amazonaws.com [54.186.198.63]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-223-ftEFn_sbMku-0ty6xr4nTQ-1; Sun, 09 Nov 2025 12:16:01 -0500 X-MC-Unique: ftEFn_sbMku-0ty6xr4nTQ-1 X-Mimecast-MFC-AGG-ID: ftEFn_sbMku-0ty6xr4nTQ_1762708556 Received: from mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com (mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.111]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mx-prod-mc-05.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS id 9EFD419560A2; Sun, 9 Nov 2025 17:15:55 +0000 (UTC) Received: from fedora (unknown [10.44.32.53]) by mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with SMTP id 60B861800361; Sun, 9 Nov 2025 17:15:35 +0000 (UTC) Received: by fedora (nbSMTP-1.00) for uid 1000 oleg@redhat.com; Sun, 9 Nov 2025 18:15:55 +0100 (CET) Date: Sun, 9 Nov 2025 18:15:33 +0100 From: Oleg Nesterov To: Bernd Edlinger , Linus Torvalds , Dmitry Levin Cc: Alexander Viro , Alexey Dobriyan , Kees Cook , Andy Lutomirski , Will Drewry , Christian Brauner , Andrew Morton , Michal Hocko , Serge Hallyn , James Morris , Randy Dunlap , Suren Baghdasaryan , Yafang Shao , Helge Deller , "Eric W. Biederman" , Adrian Reber , Thomas Gleixner , Jens Axboe , Alexei Starovoitov , "linux-fsdevel@vger.kernel.org" , "linux-kernel@vger.kernel.org" , linux-kselftest@vger.kernel.org, linux-mm@kvack.org, linux-security-module@vger.kernel.org, tiozhang , Luis Chamberlain , "Paulo Alcantara (SUSE)" , Sergey Senozhatsky , Frederic Weisbecker , YueHaibing , Paul Moore , Aleksa Sarai , Stefan Roesch , Chao Yu , xu xin , Jeff Layton , Jan Kara , David Hildenbrand , Dave Chinner , Shuah Khan , Elena Reshetova , David Windsor , Mateusz Guzik , Ard Biesheuvel , "Joel Fernandes (Google)" , "Matthew Wilcox (Oracle)" , Hans Liljestrand , Penglei Jiang , Lorenzo Stoakes , Adrian Ratiu , Ingo Molnar , "Peter Zijlstra (Intel)" , Cyrill Gorcunov , Eric Dumazet Subject: [RFC PATCH 2/3] exec: don't wait for zombie threads with cred_guard_mutex held Message-ID: References: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: X-Scanned-By: MIMEDefang 3.4.1 on 10.30.177.111 X-Rspamd-Server: rspam09 X-Rspamd-Queue-Id: 28EB21C0002 X-Stat-Signature: y5kmhuogmmmxass4g4p5iyrq46jw7ieo X-Rspam-User: X-HE-Tag: 1762708564-639341 X-HE-Meta: U2FsdGVkX1+Iz84KJxQGdWOGfrVpFiCFvJclplAp5M42oSvUfKnxHkgcDTpxpEmvULuxI40GwL0vEiU4lFexHP43flDb4XCFlA1KeGmlolBs4ijhkrGjbgfizZ8gbxWhSLWsoYytSKJgm11LRkyFvW8FhKuAB0odZPFmbzjFh1rjyFbAAom4Wp7zUOetk9dklAu0vcTU66rS/uYqi4KcwXuRRMJeFddlHzs5WZjPO45Ou174U4n/KnjLM6CbNB83q4Ku9d92CUwtxDOqDCURp7LZG9M9LQQGbBwDlCMkG9rmOEitANMKLziOtJNQHXkmaEYm0i9ybJXuEeJNOBjyP8E5a5aQE6YIQU28fPo8NVLJugqy/kOM8ZfCsuUZbTBgf06mPyoHc3qOOXZB/hYv0yBpkzZ1VcskPvNlU5WFFw1ndAKC/5IyzqhUnt8qCbKGCah2LqxsZNr4nBo+byr+FiUlqU+2BgFeB07p0DIQpbQzwOBdLOuGHYFKxCQy7PODdKgi6qHyUEV6mCcL0DVjNzGDvdsemiGmM6QwjM+50YCrdnJwtM2HJUVIpnFsXn+940hLeuYdHmcitSkdwNMihWGLiLB5txYQM8PjEATEooqFZ8PgKZRg/db6fDGQj10nhW6Bvct9kmUuDep27UVkzuglKMBVYj6k568loMtMLeQSMv0NTfPAigcRXOrCxBFRChkYYZ9Gh84Fyv86NFdokHPCRs23U5n1ghKBhLetPiLg2nuDfI05MnhdwEuDANvVnD2KLmKbG8VxJvjwnOZp9cH8kzCUyUDE9felyARzMi9j8sTu4Kc5tNzAJoAHOHiwoWnXUAgFzyhe63lp2kIgkb9PjeR97bUkGENlOwNABnZO4x1GaoUfzJNmXZcYbelzzWzEkM5Say3Vk123ojcPrB434qRr3Ijn70yDX7mai05fa8i2zdhgJgSyBcwShz3DvlnlDy/nzuJFmOiyTJq bwy6K6vC WCHCtkWcGwOK+XZxcBxtV8+qBV/BQ0WdfNBmRJZhj+BAIvktomXMnJzn65c8/UTgk/1YyWBskCRCJ4iffXz5LENWBnMvnj+IfMxwrnuZwZgm58dZ/yTlSpI2TRWpo4DRkO9tzc2mlrT9haPCchfeBc0Lxv3uoLJ1QuKLiJB0gquKnX4+KBIBMAn8eqLAaLPs+hpZM8+tuueMrZOr5xS3tiocNnrw25GKa3XQ1Qrje5NTTsjaEbZ202Z28MqICZmqmT6WeIiNREu974PpGV9Ylr1LC9Bqrk6Bz2RpeEKrj/funOmgVXs9e0+DN2bWxbCicdVWD9gjkg4pX/XSOzgBH3bJyN3iWLun735KDbpRBjTB8mD0SmVmOC8CAGdxwHwwG1+CK2x9CZ8xRfPjdjtnu7FpGWeuP3vW5GDNiTXrKY1More2NSIRYGbANCjOmB3NBFhxcBs6B0Ta3kzXaJS3wZcZZpROGB8moO+V4FqT3cXFfdOFLxJ9WDan9ZiJYCpRhaE8FZ/uPAM7G4FdYsr8/TGJpMQ== X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: This simple program #include #include #include #include void *thread(void *arg) { ptrace(PTRACE_TRACEME, 0,0,0); return NULL; } int main(void) { int pid = fork(); if (!pid) { pthread_t pt; pthread_create(&pt, NULL, thread, NULL); pthread_join(pt, NULL); execlp("echo", "echo", "passed", NULL); } sleep(1); ptrace(PTRACE_ATTACH, pid, 0,0); kill(pid, SIGCONT); return 0; } hangs because de_thread() waits for debugger which should release the killed thread with cred_guard_mutex held, while the debugger sleeps waiting for the same mutex. Not really that bad, the tracer can be killed, but still this is a bug and people hit it in practice. With this patch: - de_thread() waits until all the sub-threads pass exit_notify() and become zombies. - setup_new_exec() waits until all the sub-threads are reaped without cred_guard_mutex held. - unshare_sighand() and flush_signal_handlers() are moved from begin_new_exec() to setup_new_exec(), we can't call them until all sub-threads go away. Signed-off-by: Oleg Nesterov --- fs/exec.c | 140 +++++++++++++++++++++++------------------------- kernel/exit.c | 9 ++-- kernel/signal.c | 2 +- 3 files changed, 71 insertions(+), 80 deletions(-) diff --git a/fs/exec.c b/fs/exec.c index 136a7ab5d91c..2bac7deb9a98 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -905,42 +905,56 @@ static int exec_mmap(struct mm_struct *mm) return 0; } -static int de_thread(struct task_struct *tsk) +static int kill_sub_threads(struct task_struct *tsk) { struct signal_struct *sig = tsk->signal; - struct sighand_struct *oldsighand = tsk->sighand; - spinlock_t *lock = &oldsighand->siglock; - - if (thread_group_empty(tsk)) - goto no_thread_group; + int err = -EINTR; - /* - * Kill all other threads in the thread group. - */ - spin_lock_irq(lock); - if ((sig->flags & SIGNAL_GROUP_EXIT) || sig->group_exec_task) { - /* - * Another group action in progress, just - * return so that the signal is processed. - */ - spin_unlock_irq(lock); - return -EAGAIN; + read_lock(&tasklist_lock); + spin_lock_irq(&tsk->sighand->siglock); + if (!((sig->flags & SIGNAL_GROUP_EXIT) || sig->group_exec_task)) { + sig->group_exec_task = tsk; + sig->notify_count = -zap_other_threads(tsk); + err = 0; } + spin_unlock_irq(&tsk->sighand->siglock); + read_unlock(&tasklist_lock); - sig->group_exec_task = tsk; - sig->notify_count = zap_other_threads(tsk); - if (!thread_group_leader(tsk)) - sig->notify_count--; + return err; +} - while (sig->notify_count) { - __set_current_state(TASK_KILLABLE); - spin_unlock_irq(lock); - schedule(); +static int wait_for_notify_count(struct task_struct *tsk) +{ + for (;;) { if (__fatal_signal_pending(tsk)) - goto killed; - spin_lock_irq(lock); + return -EINTR; + set_current_state(TASK_KILLABLE); + if (!tsk->signal->notify_count) + break; + schedule(); } - spin_unlock_irq(lock); + __set_current_state(TASK_RUNNING); + return 0; +} + +static void clear_group_exec_task(struct task_struct *tsk) +{ + struct signal_struct *sig = tsk->signal; + + /* protects against exit_notify() and __exit_signal() */ + read_lock(&tasklist_lock); + sig->group_exec_task = NULL; + sig->notify_count = 0; + read_unlock(&tasklist_lock); +} + +static int de_thread(struct task_struct *tsk) +{ + if (thread_group_empty(tsk)) + goto no_thread_group; + + if (kill_sub_threads(tsk) || wait_for_notify_count(tsk)) + return -EINTR; /* * At this point all other threads have exited, all we have to @@ -948,26 +962,10 @@ static int de_thread(struct task_struct *tsk) * and to assume its PID: */ if (!thread_group_leader(tsk)) { - struct task_struct *leader = tsk->group_leader; - - for (;;) { - cgroup_threadgroup_change_begin(tsk); - write_lock_irq(&tasklist_lock); - /* - * Do this under tasklist_lock to ensure that - * exit_notify() can't miss ->group_exec_task - */ - sig->notify_count = -1; - if (likely(leader->exit_state)) - break; - __set_current_state(TASK_KILLABLE); - write_unlock_irq(&tasklist_lock); - cgroup_threadgroup_change_end(tsk); - schedule(); - if (__fatal_signal_pending(tsk)) - goto killed; - } + struct task_struct *leader = tsk->group_leader, *t; + cgroup_threadgroup_change_begin(tsk); + write_lock_irq(&tasklist_lock); /* * The only record we have of the real-time age of a * process, regardless of execs it's done, is start_time. @@ -1000,8 +998,8 @@ static int de_thread(struct task_struct *tsk) list_replace_rcu(&leader->tasks, &tsk->tasks); list_replace_init(&leader->sibling, &tsk->sibling); - tsk->group_leader = tsk; - leader->group_leader = tsk; + for_each_thread(tsk, t) + t->group_leader = tsk; tsk->exit_signal = SIGCHLD; leader->exit_signal = -1; @@ -1021,23 +1019,11 @@ static int de_thread(struct task_struct *tsk) release_task(leader); } - sig->group_exec_task = NULL; - sig->notify_count = 0; - no_thread_group: /* we have changed execution domain */ tsk->exit_signal = SIGCHLD; - BUG_ON(!thread_group_leader(tsk)); return 0; - -killed: - /* protects against exit_notify() and __exit_signal() */ - read_lock(&tasklist_lock); - sig->group_exec_task = NULL; - sig->notify_count = 0; - read_unlock(&tasklist_lock); - return -EAGAIN; } @@ -1171,13 +1157,6 @@ int begin_new_exec(struct linux_binprm * bprm) flush_itimer_signals(); #endif - /* - * Make the signal table private. - */ - retval = unshare_sighand(me); - if (retval) - goto out_unlock; - me->flags &= ~(PF_RANDOMIZE | PF_FORKNOEXEC | PF_NOFREEZE | PF_NO_SETAFFINITY); flush_thread(); @@ -1249,7 +1228,6 @@ int begin_new_exec(struct linux_binprm * bprm) /* An exec changes our domain. We are no longer part of the thread group */ WRITE_ONCE(me->self_exec_id, me->self_exec_id + 1); - flush_signal_handlers(me, 0); retval = set_cred_ucounts(bprm->cred); if (retval < 0) @@ -1293,8 +1271,9 @@ int begin_new_exec(struct linux_binprm * bprm) up_write(&me->signal->exec_update_lock); if (!bprm->cred) mutex_unlock(&me->signal->cred_guard_mutex); - out: + if (me->signal->group_exec_task == me) + clear_group_exec_task(me); return retval; } EXPORT_SYMBOL(begin_new_exec); @@ -1325,6 +1304,8 @@ int setup_new_exec(struct linux_binprm * bprm) { /* Setup things that can depend upon the personality */ struct task_struct *me = current; + struct signal_struct *sig = me->signal; + int err = 0; arch_pick_mmap_layout(me->mm, &bprm->rlim_stack); @@ -1335,10 +1316,23 @@ int setup_new_exec(struct linux_binprm * bprm) * some architectures like powerpc */ me->mm->task_size = TASK_SIZE; - up_write(&me->signal->exec_update_lock); - mutex_unlock(&me->signal->cred_guard_mutex); + up_write(&sig->exec_update_lock); + mutex_unlock(&sig->cred_guard_mutex); - return 0; + if (sig->group_exec_task) { + spin_lock_irq(&me->sighand->siglock); + sig->notify_count = sig->nr_threads - 1; + spin_unlock_irq(&me->sighand->siglock); + + err = wait_for_notify_count(me); + clear_group_exec_task(me); + } + + if (!err) + err = unshare_sighand(me); + if (!err) + flush_signal_handlers(me, 0); + return err; } EXPORT_SYMBOL(setup_new_exec); diff --git a/kernel/exit.c b/kernel/exit.c index f041f0c05ebb..bcde78c97253 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -178,10 +178,7 @@ static void __exit_signal(struct release_task_post *post, struct task_struct *ts tty = sig->tty; sig->tty = NULL; } else { - /* - * If there is any task waiting for the group exit - * then notify it: - */ + /* mt-exec, setup_new_exec() -> wait_for_notify_count() */ if (sig->notify_count > 0 && !--sig->notify_count) wake_up_process(sig->group_exec_task); @@ -766,8 +763,8 @@ static void exit_notify(struct task_struct *tsk, int group_dead) list_add(&tsk->ptrace_entry, &dead); } - /* mt-exec, de_thread() is waiting for group leader */ - if (unlikely(tsk->signal->notify_count < 0)) + /* mt-exec, de_thread() -> wait_for_notify_count() */ + if (tsk->signal->notify_count < 0 && !++tsk->signal->notify_count) wake_up_process(tsk->signal->group_exec_task); write_unlock_irq(&tasklist_lock); diff --git a/kernel/signal.c b/kernel/signal.c index fe9190d84f28..334212044940 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -1343,13 +1343,13 @@ int zap_other_threads(struct task_struct *p) for_other_threads(p, t) { task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK); - count++; /* Don't bother with already dead threads */ if (t->exit_state) continue; sigaddset(&t->pending.signal, SIGKILL); signal_wake_up(t, 1); + count++; } return count; -- 2.25.1.362.g51ebf55