[PATCH v2 0/2] pid_namespace: make init creation more flexible

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v2 0/2] pid_namespace: make init creation more flexible
@ 2026-02-23 20:01 Pavel Tikhomirov
  2026-02-23 20:01 ` [PATCH v2 1/2] pid_namespace: allow opening pid_for_children before init was created Pavel Tikhomirov
  2026-02-23 20:01 ` [PATCH v2 2/2] selftests: Add tests for creating pidns init via setns Pavel Tikhomirov
  0 siblings, 2 replies; 11+ messages in thread
From: Pavel Tikhomirov @ 2026-02-23 20:01 UTC (permalink / raw)
  To: Christian Brauner, Shuah Khan
  Cc: Kees Cook, Andrew Morton, David Hildenbrand, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Jan Kara,
	Oleg Nesterov, Aleksa Sarai, Andrei Vagin, Kirill Tkhai,
	Alexander Mikhalitsyn, Adrian Reber, Pavel Tikhomirov,
	linux-kernel, linux-mm, linux-kselftest

The first patch allows to join pid namespace before pid namespace init
is created, that allows to create pid namespace by one process and then
create pid namespace init from another process after setns(). Please see
the detailed description in the patch commit message.

The second patch is a comprehansive test, which tests both basic usecase
of creating pid namespace and init separately, and a more specific
usecase which shows how we can improve clone3(set_tid) usability after
this change.

This is generally useful as it makes clone3(set_tid) more universal, and
work in all the cases evenly. Also is highly useful to CRIU to handle
nested containers.

v2: Use *_ONCE for ->child_reaper accesses atomicity, and avoid taking
task_list lock for reading it. Rebase to master.

This series is also available here:
https://github.com/Snorch/linux/commits/allow-creating-pid-namespace-init-after-setns-v2/

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>

Pavel Tikhomirov (2):
  pid_namespace: allow opening pid_for_children before init was created
  selftests: Add tests for creating pidns init via setns

 kernel/exit.c                                 |   2 +-
 kernel/fork.c                                 |   2 +-
 kernel/pid.c                                  |   5 +-
 kernel/pid_namespace.c                        |   9 -
 .../selftests/pid_namespace/.gitignore        |   1 +
 .../testing/selftests/pid_namespace/Makefile  |   2 +-
 .../pid_namespace/pidns_init_via_setns.c      | 238 ++++++++++++++++++
 7 files changed, 245 insertions(+), 14 deletions(-)
 create mode 100644 tools/testing/selftests/pid_namespace/pidns_init_via_setns.c

-- 
2.53.0

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH v2 1/2] pid_namespace: allow opening pid_for_children before init was created
  2026-02-23 20:01 [PATCH v2 0/2] pid_namespace: make init creation more flexible Pavel Tikhomirov
@ 2026-02-23 20:01 ` Pavel Tikhomirov
  2026-02-24  7:02   ` Andrei Vagin
  2026-02-24 12:09   ` Oleg Nesterov
  2026-02-23 20:01 ` [PATCH v2 2/2] selftests: Add tests for creating pidns init via setns Pavel Tikhomirov
  1 sibling, 2 replies; 11+ messages in thread
From: Pavel Tikhomirov @ 2026-02-23 20:01 UTC (permalink / raw)
  To: Christian Brauner, Shuah Khan
  Cc: Kees Cook, Andrew Morton, David Hildenbrand, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Jan Kara,
	Oleg Nesterov, Aleksa Sarai, Andrei Vagin, Kirill Tkhai,
	Alexander Mikhalitsyn, Adrian Reber, Pavel Tikhomirov,
	linux-kernel, linux-mm, linux-kselftest

This effectively gives us an ability to create the pid namespace init as
a child of the process (setns-ed to the pid namespace) different to the
process which created the pid namespace itself.

Original problem:

There is a cool set_tid feature in clone3() syscall, it allows you to
create process with desired pids on multiple pid namespace levels. Which
is useful to restore processes in CRIU for nested pid namespace case.

In nested container case we can potentially see this kind of pid/user
namespace tree:

                            Process
                          ┌─────────┐
    User NS0 ──▶ Pid NS0 ──▶ Pid p0 │
        │           │     │         │
        ▼           ▼     │         │
    User NS1 ──▶ Pid NS1 ──▶ Pid p1 │
        │           │     │         │
       ...         ...    │   ...   │
        │           │     │         │
        ▼           ▼     │         │
    User NSn ──▶ Pid NSn ──▶ Pid pn │
                          └─────────┘

So to create the "Process" and set pids {p0, p1, ... pn} for it on all
pid namespace levels we can use clone3() syscall set_tid feature, BUT
the syscall does not allow you to set pid on pid namespace levels you
don't have permission to. So basically you have to be in "User NS0" when
creating the "Process" to actually be able to set pids on all levels.

It is ok for almost any process, but with pid namespace init this does
not work, as currently we can only create pid namespace init and the pid
namespace itself simultaneously, so to make "Pid NSn" owned by "User
NSn" we have to be in the "User NSn".

We can't possibly be in "User NS0" and "User NSn" at the same time,
hence the problem.

Alternative solution:

Yes, for the case of pid namespace init we can use old and gold
/proc/sys/kernel/ns_last_pid interface on the levels lower than n. But
it is much more complicated and introduces tons of extra code to do. It
would be nice to make clone3() set_tid interface also aplicable to this
corner case.

Implementation:

Now when anyone can setns to the pid namespace before the creation of
init, and thus multiple processes can fork children to the pid
namespace, we enforce that the first process created is always the init,
and only allow other processes after the init sets
pid_namespace->child_reaper.

To avoid possible problems related to cpu/compiler optimizations around
->child_reaper, let's use WRITE_ONCE (additional to task_list lock)
everywhere we write it and use READ_ONCE everywhere we read it without
explicit lock. Note: we already had READ_ONCE in nsfs_fh_to_dentry().

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>

--
v2: Use *_ONCE for ->child_reaper accesses atomicity, and avoid taking
task_list lock for reading it. Rebase to master, and thus remove
now excess pidns_ready variable.

Note: I didn't find anything in copy_process() around setting the
->child_reaper which can influence the pid namespace, so it looks like
the pid namespace is fully setup at the point when init sets
->child_reaper to receive more processes. Thus tasklist lock looks
excess in pidns_for_children_get()'s ->child_reaper check and it should
be safe not to have it in the corresponding checks in alloc_pid().
---
 kernel/exit.c          | 2 +-
 kernel/fork.c          | 2 +-
 kernel/pid.c           | 5 +++--
 kernel/pid_namespace.c | 9 ---------
 4 files changed, 5 insertions(+), 13 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index 8a87021211ae..567fc3b7b0f9 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -608,7 +608,7 @@ static struct task_struct *find_child_reaper(struct task_struct *father,
 
 	reaper = find_alive_thread(father);
 	if (reaper) {
-		pid_ns->child_reaper = reaper;
+		WRITE_ONCE(pid_ns->child_reaper, reaper);
 		return reaper;
 	}
 
diff --git a/kernel/fork.c b/kernel/fork.c
index e832da9d15a4..27d0cdbca67e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2423,7 +2423,7 @@ __latent_entropy struct task_struct *copy_process(
 			init_task_pid(p, PIDTYPE_SID, task_session(current));
 
 			if (is_child_reaper(pid)) {
-				ns_of_pid(pid)->child_reaper = p;
+				WRITE_ONCE(ns_of_pid(pid)->child_reaper, p);
 				p->signal->flags |= SIGNAL_UNKILLABLE;
 			}
 			p->signal->shared_pending.signal = delayed.signal;
diff --git a/kernel/pid.c b/kernel/pid.c
index 3b96571d0fe6..e6116e131d8d 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -219,7 +219,7 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *arg_set_tid,
 			 * Also fail if a PID != 1 is requested and
 			 * no PID 1 exists.
 			 */
-			if (tid != 1 && !tmp->child_reaper)
+			if (tid != 1 && !READ_ONCE(tmp->child_reaper))
 				goto out_abort;
 			retval = -EPERM;
 			if (!checkpoint_restore_ns_capable(tmp->user_ns))
@@ -247,8 +247,9 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *arg_set_tid,
 			 * alreay in use. Return EEXIST in that case.
 			 */
 			if (nr == -ENOSPC)
-
 				nr = -EEXIST;
+		} else if (!READ_ONCE(tmp->child_reaper) && idr_get_cursor(&tmp->idr) != 0) {
+			nr = -EINVAL;
 		} else {
 			int pid_min = 1;
 			/*
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index e48f5de41361..d36afc58ee1d 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -369,15 +369,6 @@ static struct ns_common *pidns_for_children_get(struct task_struct *task)
 	}
 	task_unlock(task);
 
-	if (ns) {
-		read_lock(&tasklist_lock);
-		if (!ns->child_reaper) {
-			put_pid_ns(ns);
-			ns = NULL;
-		}
-		read_unlock(&tasklist_lock);
-	}
-
 	return ns ? &ns->ns : NULL;
 }
 
-- 
2.53.0



^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH v2 2/2] selftests: Add tests for creating pidns init via setns
  2026-02-23 20:01 [PATCH v2 0/2] pid_namespace: make init creation more flexible Pavel Tikhomirov
  2026-02-23 20:01 ` [PATCH v2 1/2] pid_namespace: allow opening pid_for_children before init was created Pavel Tikhomirov
@ 2026-02-23 20:01 ` Pavel Tikhomirov
  1 sibling, 0 replies; 11+ messages in thread
From: Pavel Tikhomirov @ 2026-02-23 20:01 UTC (permalink / raw)
  To: Christian Brauner, Shuah Khan
  Cc: Kees Cook, Andrew Morton, David Hildenbrand, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Jan Kara,
	Oleg Nesterov, Aleksa Sarai, Andrei Vagin, Kirill Tkhai,
	Alexander Mikhalitsyn, Adrian Reber, Pavel Tikhomirov,
	linux-kernel, linux-mm, linux-kselftest

First testcase "pidns_init_via_setns" checks that a process can become
Pid 1 (init) in a new Pid namespace created via unshare() and joined via
setns().

Second testcase "pidns_init_via_setns_set_tid" checks that during this
process we can use clone3() + set_tid and set the pid in both the new
and old pid namespaces (owned by different user namespaces).

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
---
 .../selftests/pid_namespace/.gitignore        |   1 +
 .../testing/selftests/pid_namespace/Makefile  |   2 +-
 .../pid_namespace/pidns_init_via_setns.c      | 238 ++++++++++++++++++
 3 files changed, 240 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/pid_namespace/pidns_init_via_setns.c

diff --git a/tools/testing/selftests/pid_namespace/.gitignore b/tools/testing/selftests/pid_namespace/.gitignore
index 5118f0f3edf4..c647c6eb3367 100644
--- a/tools/testing/selftests/pid_namespace/.gitignore
+++ b/tools/testing/selftests/pid_namespace/.gitignore
@@ -1,2 +1,3 @@
 pid_max
+pidns_init_via_setns
 regression_enomem
diff --git a/tools/testing/selftests/pid_namespace/Makefile b/tools/testing/selftests/pid_namespace/Makefile
index b972f55d07ae..b01a924ac04b 100644
--- a/tools/testing/selftests/pid_namespace/Makefile
+++ b/tools/testing/selftests/pid_namespace/Makefile
@@ -1,7 +1,7 @@
 # SPDX-License-Identifier: GPL-2.0
 CFLAGS += -g $(KHDR_INCLUDES)
 
-TEST_GEN_PROGS = regression_enomem pid_max
+TEST_GEN_PROGS = regression_enomem pid_max pidns_init_via_setns
 
 LOCAL_HDRS += $(selfdir)/pidfd/pidfd.h
 
diff --git a/tools/testing/selftests/pid_namespace/pidns_init_via_setns.c b/tools/testing/selftests/pid_namespace/pidns_init_via_setns.c
new file mode 100644
index 000000000000..7e4c610291d3
--- /dev/null
+++ b/tools/testing/selftests/pid_namespace/pidns_init_via_setns.c
@@ -0,0 +1,238 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#include <fcntl.h>
+#include <sched.h>
+#include <stdio.h>
+#include <sys/types.h>
+#include <unistd.h>
+
+#include "kselftest_harness.h"
+#include "../pidfd/pidfd.h"
+
+/*
+ * Test that a process can become PID 1 (init) in a new PID namespace
+ * created via unshare() and joined via setns().
+ *
+ * Flow:
+ *  1. Parent creates a pipe for synchronization.
+ *  2. Parent forks a child.
+ *  3. Parent calls unshare(CLONE_NEWPID) to create a new PID namespace.
+ *  4. Parent signals the child via the pipe.
+ *  5. Child opens parent's /proc/<ppid>/ns/pid_for_children and calls
+ *     setns(fd, CLONE_NEWPID) to join the new namespace.
+ *  6. Child forks a grandchild.
+ *  7. Grandchild verifies getpid() == 1.
+ */
+TEST(pidns_init_via_setns)
+{
+	pid_t child, parent_pid;
+	int pipe_fd[2];
+	char buf;
+
+	parent_pid = getpid();
+
+	ASSERT_EQ(0, pipe(pipe_fd));
+
+	child = fork();
+	ASSERT_GE(child, 0);
+
+	if (child == 0) {
+		char path[256];
+		int nsfd;
+		pid_t grandchild;
+
+		close(pipe_fd[1]);
+
+		/* Wait for parent to complete unshare */
+		ASSERT_EQ(1, read_nointr(pipe_fd[0], &buf, 1));
+		close(pipe_fd[0]);
+
+		snprintf(path, sizeof(path),
+			 "/proc/%d/ns/pid_for_children", parent_pid);
+		nsfd = open(path, O_RDONLY);
+		ASSERT_GE(nsfd, 0);
+
+		ASSERT_EQ(0, setns(nsfd, CLONE_NEWPID));
+		close(nsfd);
+
+		grandchild = fork();
+		ASSERT_GE(grandchild, 0);
+
+		if (grandchild == 0) {
+			/* Should be init (PID 1) in the new namespace */
+			if (getpid() != 1)
+				_exit(1);
+			_exit(0);
+		}
+
+		ASSERT_EQ(0, wait_for_pid(grandchild));
+		_exit(0);
+	}
+
+	close(pipe_fd[0]);
+
+	if (geteuid())
+		ASSERT_EQ(0, unshare(CLONE_NEWUSER));
+
+	ASSERT_EQ(0, unshare(CLONE_NEWPID));
+
+	/* Signal child that the new PID namespace is ready */
+	buf = 0;
+	ASSERT_EQ(1, write_nointr(pipe_fd[1], &buf, 1));
+	close(pipe_fd[1]);
+
+	ASSERT_EQ(0, wait_for_pid(child));
+}
+
+/*
+ * Similar to pidns_init_via_setns, but:
+ *  1. Parent enters a new PID namespace right from the start to be able to
+ *     later freely use pid 1001 in it.
+ *  2. After forking child, parent also calls unshare(CLONE_NEWUSER)
+ *     before unshare(CLONE_NEWPID) so that new old and new pid namespaces have
+ *     different user namespace owners.
+ *  3. Child uses clone3() with set_tid={1, 1001} instead of fork() and
+ *     grandchild checks that it gets desired pids .
+ *
+ * Flow:
+ *  1. Test process creates a new PID namespace and forks a wrapper
+ *     (PID 1 in the outer namespace).
+ *  2. Wrapper forks a child.
+ *  3. Wrapper calls unshare(CLONE_NEWUSER) + unshare(CLONE_NEWPID)
+ *     to create an inner PID namespace.
+ *  4. Wrapper signals the child via pipe.
+ *  5. Child opens wrapper's /proc/<pid>/ns/pid_for_children and calls
+ *     setns(fd, CLONE_NEWPID) to join the inner namespace.
+ *  6. Child calls clone3() with set_tid={1, 1001}.
+ *  7. Grandchild verifies its NSpid ends with "1001 1".
+ */
+
+pid_t set_tid[] = {1, 1001};
+
+static int pidns_init_via_setns_set_tid_grandchild(struct __test_metadata *_metadata)
+{
+	char *line = NULL;
+	size_t len = 0;
+	int found = 0;
+	FILE *gf;
+
+	gf = fopen("/proc/self/status", "r");
+	ASSERT_NE(gf, NULL);
+
+	while (getline(&line, &len, gf) != -1) {
+		if (strncmp(line, "NSpid:", 6) != 0)
+			continue;
+
+		for (int i = 0; i < 2; i++) {
+			char *last = strrchr(line, '\t');
+			pid_t pid;
+
+			ASSERT_NE(last, NULL);
+			ASSERT_EQ(sscanf(last, "%d", &pid), 1);
+			ASSERT_EQ(pid, set_tid[i]);
+			*last = '\0';
+		}
+
+		found = true;
+		break;
+	}
+
+	free(line);
+	fclose(gf);
+	ASSERT_TRUE(found);
+	return 0;
+}
+
+static int pidns_init_via_setns_set_tid_child(struct __test_metadata *_metadata,
+					      pid_t parent_pid, int pipe_fd[2])
+{
+	struct __clone_args args = {
+		.exit_signal	= SIGCHLD,
+		.set_tid	= ptr_to_u64(set_tid),
+		.set_tid_size	= 2,
+	};
+	pid_t grandchild;
+	char path[256];
+	char buf;
+	int nsfd;
+
+	close(pipe_fd[1]);
+
+	ASSERT_EQ(1, read_nointr(pipe_fd[0], &buf, 1));
+	close(pipe_fd[0]);
+
+	snprintf(path, sizeof(path),
+		 "/proc/%d/ns/pid_for_children", parent_pid);
+	nsfd = open(path, O_RDONLY);
+	ASSERT_GE(nsfd, 0);
+
+	ASSERT_EQ(0, setns(nsfd, CLONE_NEWPID));
+	close(nsfd);
+
+	grandchild = sys_clone3(&args, sizeof(args));
+	ASSERT_GE(grandchild, 0);
+
+	if (grandchild == 0)
+		_exit(pidns_init_via_setns_set_tid_grandchild(_metadata));
+
+	ASSERT_EQ(0, wait_for_pid(grandchild));
+	return 0;
+}
+
+static int pidns_init_via_setns_set_tid_wrapper(struct __test_metadata *_metadata)
+{
+	int pipe_fd[2];
+	pid_t child, parent_pid;
+	char buf;
+	FILE *f;
+
+	/*
+	 * We are PID 1 inside the new namespace, but /proc is
+	 * mounted from the host.  Read our host-visible PID so
+	 * the child can reach our pid_for_children via /proc.
+	 */
+	f = fopen("/proc/self/stat", "r");
+	ASSERT_NE(f, NULL);
+	ASSERT_EQ(fscanf(f, "%d", &parent_pid), 1);
+	ASSERT_EQ(0, pipe(pipe_fd));
+
+	child = fork();
+	ASSERT_GE(child, 0);
+
+	if (child == 0)
+		_exit(pidns_init_via_setns_set_tid_child(_metadata, parent_pid, pipe_fd));
+
+	close(pipe_fd[0]);
+
+	ASSERT_EQ(0, unshare(CLONE_NEWUSER));
+	ASSERT_EQ(0, unshare(CLONE_NEWPID));
+
+	buf = 0;
+	ASSERT_EQ(1, write_nointr(pipe_fd[1], &buf, 1));
+	close(pipe_fd[1]);
+
+	ASSERT_EQ(0, wait_for_pid(child));
+
+	fclose(f);
+	return 0;
+}
+
+TEST(pidns_init_via_setns_set_tid)
+{
+	pid_t wrapper;
+
+	if (geteuid())
+		ASSERT_EQ(0, unshare(CLONE_NEWUSER));
+
+	ASSERT_EQ(0, unshare(CLONE_NEWPID));
+
+	wrapper = fork();
+	ASSERT_GE(wrapper, 0);
+
+	if (wrapper == 0)
+		_exit(pidns_init_via_setns_set_tid_wrapper(_metadata));
+
+	ASSERT_EQ(0, wait_for_pid(wrapper));
+}
+
+TEST_HARNESS_MAIN
-- 
2.53.0



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 1/2] pid_namespace: allow opening pid_for_children before init was created
  2026-02-23 20:01 ` [PATCH v2 1/2] pid_namespace: allow opening pid_for_children before init was created Pavel Tikhomirov
@ 2026-02-24  7:02   ` Andrei Vagin
  2026-02-24 10:37     ` Pavel Tikhomirov
  2026-02-24 12:09   ` Oleg Nesterov
  1 sibling, 1 reply; 11+ messages in thread
From: Andrei Vagin @ 2026-02-24  7:02 UTC (permalink / raw)
  To: Pavel Tikhomirov
  Cc: Christian Brauner, Shuah Khan, Kees Cook, Andrew Morton,
	David Hildenbrand, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Jan Kara, Oleg Nesterov, Aleksa Sarai,
	Kirill Tkhai, Alexander Mikhalitsyn, Adrian Reber, linux-kernel,
	linux-mm, linux-kselftest

On Mon, Feb 23, 2026 at 3:03 PM Pavel Tikhomirov
<ptikhomirov@virtuozzo.com> wrote:
>
> This effectively gives us an ability to create the pid namespace init as
> a child of the process (setns-ed to the pid namespace) different to the
> process which created the pid namespace itself.
>
> Original problem:
>
> There is a cool set_tid feature in clone3() syscall, it allows you to
> create process with desired pids on multiple pid namespace levels. Which
> is useful to restore processes in CRIU for nested pid namespace case.
>
> In nested container case we can potentially see this kind of pid/user
> namespace tree:
>
>                             Process
>                           ┌─────────┐
>     User NS0 ──▶ Pid NS0 ──▶ Pid p0 │
>         │           │     │         │
>         ▼           ▼     │         │
>     User NS1 ──▶ Pid NS1 ──▶ Pid p1 │
>         │           │     │         │
>        ...         ...    │   ...   │
>         │           │     │         │
>         ▼           ▼     │         │
>     User NSn ──▶ Pid NSn ──▶ Pid pn │
>                           └─────────┘
>
> So to create the "Process" and set pids {p0, p1, ... pn} for it on all
> pid namespace levels we can use clone3() syscall set_tid feature, BUT
> the syscall does not allow you to set pid on pid namespace levels you
> don't have permission to. So basically you have to be in "User NS0" when
> creating the "Process" to actually be able to set pids on all levels.
>
> It is ok for almost any process, but with pid namespace init this does
> not work, as currently we can only create pid namespace init and the pid
> namespace itself simultaneously, so to make "Pid NSn" owned by "User
> NSn" we have to be in the "User NSn".
>
> We can't possibly be in "User NS0" and "User NSn" at the same time,
> hence the problem.
>
> Alternative solution:
>
> Yes, for the case of pid namespace init we can use old and gold
> /proc/sys/kernel/ns_last_pid interface on the levels lower than n. But
> it is much more complicated and introduces tons of extra code to do. It
> would be nice to make clone3() set_tid interface also aplicable to this
> corner case.
>
> Implementation:
>
> Now when anyone can setns to the pid namespace before the creation of
> init, and thus multiple processes can fork children to the pid
> namespace, we enforce that the first process created is always the init,
> and only allow other processes after the init sets
> pid_namespace->child_reaper.
>
> To avoid possible problems related to cpu/compiler optimizations around
> ->child_reaper, let's use WRITE_ONCE (additional to task_list lock)
> everywhere we write it and use READ_ONCE everywhere we read it without
> explicit lock. Note: we already had READ_ONCE in nsfs_fh_to_dentry().
>
> Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
>
> --
> v2: Use *_ONCE for ->child_reaper accesses atomicity, and avoid taking
> task_list lock for reading it. Rebase to master, and thus remove
> now excess pidns_ready variable.
>
> Note: I didn't find anything in copy_process() around setting the
> ->child_reaper which can influence the pid namespace, so it looks like
> the pid namespace is fully setup at the point when init sets
> ->child_reaper to receive more processes. Thus tasklist lock looks
> excess in pidns_for_children_get()'s ->child_reaper check and it should
> be safe not to have it in the corresponding checks in alloc_pid().
> ---
>  kernel/exit.c          | 2 +-
>  kernel/fork.c          | 2 +-
>  kernel/pid.c           | 5 +++--
>  kernel/pid_namespace.c | 9 ---------
>  4 files changed, 5 insertions(+), 13 deletions(-)
>
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 8a87021211ae..567fc3b7b0f9 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -608,7 +608,7 @@ static struct task_struct *find_child_reaper(struct task_struct *father,
>
>         reaper = find_alive_thread(father);
>         if (reaper) {
> -               pid_ns->child_reaper = reaper;
> +               WRITE_ONCE(pid_ns->child_reaper, reaper);
>                 return reaper;
>         }
>
> diff --git a/kernel/fork.c b/kernel/fork.c
> index e832da9d15a4..27d0cdbca67e 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -2423,7 +2423,7 @@ __latent_entropy struct task_struct *copy_process(
>                         init_task_pid(p, PIDTYPE_SID, task_session(current));
>
>                         if (is_child_reaper(pid)) {
> -                               ns_of_pid(pid)->child_reaper = p;
> +                               WRITE_ONCE(ns_of_pid(pid)->child_reaper, p);
>                                 p->signal->flags |= SIGNAL_UNKILLABLE;
>                         }
>                         p->signal->shared_pending.signal = delayed.signal;
> diff --git a/kernel/pid.c b/kernel/pid.c
> index 3b96571d0fe6..e6116e131d8d 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -219,7 +219,7 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *arg_set_tid,
>                          * Also fail if a PID != 1 is requested and
>                          * no PID 1 exists.
>                          */
> -                       if (tid != 1 && !tmp->child_reaper)
> +                       if (tid != 1 && !READ_ONCE(tmp->child_reaper))
>                                 goto out_abort;
>                         retval = -EPERM;
>                         if (!checkpoint_restore_ns_capable(tmp->user_ns))
> @@ -247,8 +247,9 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *arg_set_tid,
>                          * alreay in use. Return EEXIST in that case.
>                          */
>                         if (nr == -ENOSPC)
> -
>                                 nr = -EEXIST;
> +               } else if (!READ_ONCE(tmp->child_reaper) && idr_get_cursor(&tmp->idr) != 0) {

I think it is better to update pid_ns_ctl_handler to prevent setting
ns_last_pid in a pidns
without init. Otherwise, figuring out why fork returns EINVAL can be tricky.

> +                       nr = -EINVAL;
>                 } else {
>                         int pid_min = 1;
>                         /*


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 1/2] pid_namespace: allow opening pid_for_children before init was created
  2026-02-24  7:02   ` Andrei Vagin
@ 2026-02-24 10:37     ` Pavel Tikhomirov
  2026-02-24 15:38       ` Andrei Vagin
  0 siblings, 1 reply; 11+ messages in thread
From: Pavel Tikhomirov @ 2026-02-24 10:37 UTC (permalink / raw)
  To: Andrei Vagin
  Cc: Christian Brauner, Shuah Khan, Kees Cook, Andrew Morton,
	David Hildenbrand, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Jan Kara, Oleg Nesterov, Aleksa Sarai,
	Kirill Tkhai, Alexander Mikhalitsyn, Adrian Reber, linux-kernel,
	linux-mm, linux-kselftest



On 2/24/26 08:02, Andrei Vagin wrote:
> On Mon, Feb 23, 2026 at 3:03 PM Pavel Tikhomirov
> <ptikhomirov@virtuozzo.com> wrote:
>>
>> This effectively gives us an ability to create the pid namespace init as
>> a child of the process (setns-ed to the pid namespace) different to the
>> process which created the pid namespace itself.
>>
>> Original problem:
>>
>> There is a cool set_tid feature in clone3() syscall, it allows you to
>> create process with desired pids on multiple pid namespace levels. Which
>> is useful to restore processes in CRIU for nested pid namespace case.
>>
>> In nested container case we can potentially see this kind of pid/user
>> namespace tree:
>>
>>                             Process
>>                           ┌─────────┐
>>     User NS0 ──▶ Pid NS0 ──▶ Pid p0 │
>>         │           │     │         │
>>         ▼           ▼     │         │
>>     User NS1 ──▶ Pid NS1 ──▶ Pid p1 │
>>         │           │     │         │
>>        ...         ...    │   ...   │
>>         │           │     │         │
>>         ▼           ▼     │         │
>>     User NSn ──▶ Pid NSn ──▶ Pid pn │
>>                           └─────────┘
>>
>> So to create the "Process" and set pids {p0, p1, ... pn} for it on all
>> pid namespace levels we can use clone3() syscall set_tid feature, BUT
>> the syscall does not allow you to set pid on pid namespace levels you
>> don't have permission to. So basically you have to be in "User NS0" when
>> creating the "Process" to actually be able to set pids on all levels.
>>
>> It is ok for almost any process, but with pid namespace init this does
>> not work, as currently we can only create pid namespace init and the pid
>> namespace itself simultaneously, so to make "Pid NSn" owned by "User
>> NSn" we have to be in the "User NSn".
>>
>> We can't possibly be in "User NS0" and "User NSn" at the same time,
>> hence the problem.
>>
>> Alternative solution:
>>
>> Yes, for the case of pid namespace init we can use old and gold
>> /proc/sys/kernel/ns_last_pid interface on the levels lower than n. But
>> it is much more complicated and introduces tons of extra code to do. It
>> would be nice to make clone3() set_tid interface also aplicable to this
>> corner case.
>>
>> Implementation:
>>
>> Now when anyone can setns to the pid namespace before the creation of
>> init, and thus multiple processes can fork children to the pid
>> namespace, we enforce that the first process created is always the init,
>> and only allow other processes after the init sets
>> pid_namespace->child_reaper.
>>
>> To avoid possible problems related to cpu/compiler optimizations around
>> ->child_reaper, let's use WRITE_ONCE (additional to task_list lock)
>> everywhere we write it and use READ_ONCE everywhere we read it without
>> explicit lock. Note: we already had READ_ONCE in nsfs_fh_to_dentry().
>>
>> Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
>>
>> --
>> v2: Use *_ONCE for ->child_reaper accesses atomicity, and avoid taking
>> task_list lock for reading it. Rebase to master, and thus remove
>> now excess pidns_ready variable.
>>
>> Note: I didn't find anything in copy_process() around setting the
>> ->child_reaper which can influence the pid namespace, so it looks like
>> the pid namespace is fully setup at the point when init sets
>> ->child_reaper to receive more processes. Thus tasklist lock looks
>> excess in pidns_for_children_get()'s ->child_reaper check and it should
>> be safe not to have it in the corresponding checks in alloc_pid().
>> ---
>>  kernel/exit.c          | 2 +-
>>  kernel/fork.c          | 2 +-
>>  kernel/pid.c           | 5 +++--
>>  kernel/pid_namespace.c | 9 ---------
>>  4 files changed, 5 insertions(+), 13 deletions(-)
>>
>> diff --git a/kernel/exit.c b/kernel/exit.c
>> index 8a87021211ae..567fc3b7b0f9 100644
>> --- a/kernel/exit.c
>> +++ b/kernel/exit.c
>> @@ -608,7 +608,7 @@ static struct task_struct *find_child_reaper(struct task_struct *father,
>>
>>         reaper = find_alive_thread(father);
>>         if (reaper) {
>> -               pid_ns->child_reaper = reaper;
>> +               WRITE_ONCE(pid_ns->child_reaper, reaper);
>>                 return reaper;
>>         }
>>
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index e832da9d15a4..27d0cdbca67e 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -2423,7 +2423,7 @@ __latent_entropy struct task_struct *copy_process(
>>                         init_task_pid(p, PIDTYPE_SID, task_session(current));
>>
>>                         if (is_child_reaper(pid)) {
>> -                               ns_of_pid(pid)->child_reaper = p;
>> +                               WRITE_ONCE(ns_of_pid(pid)->child_reaper, p);
>>                                 p->signal->flags |= SIGNAL_UNKILLABLE;
>>                         }
>>                         p->signal->shared_pending.signal = delayed.signal;
>> diff --git a/kernel/pid.c b/kernel/pid.c
>> index 3b96571d0fe6..e6116e131d8d 100644
>> --- a/kernel/pid.c
>> +++ b/kernel/pid.c
>> @@ -219,7 +219,7 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *arg_set_tid,
>>                          * Also fail if a PID != 1 is requested and
>>                          * no PID 1 exists.
>>                          */
>> -                       if (tid != 1 && !tmp->child_reaper)
>> +                       if (tid != 1 && !READ_ONCE(tmp->child_reaper))
>>                                 goto out_abort;
>>                         retval = -EPERM;
>>                         if (!checkpoint_restore_ns_capable(tmp->user_ns))
>> @@ -247,8 +247,9 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *arg_set_tid,
>>                          * alreay in use. Return EEXIST in that case.
>>                          */
>>                         if (nr == -ENOSPC)
>> -
>>                                 nr = -EEXIST;
>> +               } else if (!READ_ONCE(tmp->child_reaper) && idr_get_cursor(&tmp->idr) != 0) {
> 
> I think it is better to update pid_ns_ctl_handler to prevent setting
> ns_last_pid in a pidns
> without init. Otherwise, figuring out why fork returns EINVAL can be tricky.

Hm, I think pid_ns_ctl_handler(), as it uses current active pid namespace can 
only work if current is already fully (ns/pid) in the pid namespace, and thus
the init is also already there. So it's implicitly protected from change
before init creation.

This check here is more for the concurrent alloc_pid() case. When one process 
in alloc_pid() successfully allocated the pid and than, for instance, hit the
pidfs_add_pid() error and is going to free_pid(), but the pid 1 is remains yet
allocated from idr and the cursor is on 2 at the moment. At the same time the
concurrent process may get to alloc_pid(), and will see cursor on 2, it should 
not be able to create a process as this process will get pid 2 and will be 
created before init.

And in general (non concurrent case) it makes sense to only allow allocating 1,
for the first process.

> 
>> +                       nr = -EINVAL;
>>                 } else {
>>                         int pid_min = 1;
>>                         /*

-- 
Best regards, Pavel Tikhomirov
Senior Software Developer, Virtuozzo.



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 1/2] pid_namespace: allow opening pid_for_children before init was created
  2026-02-23 20:01 ` [PATCH v2 1/2] pid_namespace: allow opening pid_for_children before init was created Pavel Tikhomirov
  2026-02-24  7:02   ` Andrei Vagin
@ 2026-02-24 12:09   ` Oleg Nesterov
  2026-02-24 13:23     ` Pavel Tikhomirov
  1 sibling, 1 reply; 11+ messages in thread
From: Oleg Nesterov @ 2026-02-24 12:09 UTC (permalink / raw)
  To: Pavel Tikhomirov
  Cc: Christian Brauner, Shuah Khan, Kees Cook, Andrew Morton,
	David Hildenbrand, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Jan Kara, Aleksa Sarai, Andrei Vagin,
	Kirill Tkhai, Alexander Mikhalitsyn, Adrian Reber, linux-kernel,
	linux-mm, linux-kselftest, Mateusz Guzik

On 02/23, Pavel Tikhomirov wrote:
>
> To avoid possible problems related to cpu/compiler optimizations around
> ->child_reaper, let's use WRITE_ONCE (additional to task_list lock)
> everywhere we write it and use READ_ONCE everywhere we read it without
> explicit lock.

Yes, this is what I meant... but I can never recall if READ_ONCE() alone
is enough to make KCSAN happy...

I won't insist, but I think it would be better to do this in a separate
change for documenation purposes and for discussion.

> @@ -247,8 +247,9 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *arg_set_tid,
>  			 * alreay in use. Return EEXIST in that case.
>  			 */
>  			if (nr == -ENOSPC)
> -
>  				nr = -EEXIST;
> +		} else if (!READ_ONCE(tmp->child_reaper) && idr_get_cursor(&tmp->idr) != 0) {
> +			nr = -EINVAL;
>  		} else {

Oh, this doesn't look clear/clean... This even looks racy even if it is not.
Can you move this check into the "else" branch which does another get_cursor
and unify this check with the RESERVED_PIDS check?

Either way, I don't like the fact we check ->child_reaper != NULL twice.
Perhaps something like the preparational patch below makes sense ? Not
sure this is actually better...

Oleg.

--- x/kernel/pid.c
+++ x/kernel/pid.c
@@ -215,12 +215,6 @@ struct pid *alloc_pid(struct pid_namespa
 			retval = -EINVAL;
 			if (tid < 1 || tid >= pid_max[ns->level - i])
 				goto out_abort;
-			/*
-			 * Also fail if a PID != 1 is requested and
-			 * no PID 1 exists.
-			 */
-			if (tid != 1 && !tmp->child_reaper)
-				goto out_abort;
 			retval = -EPERM;
 			if (!checkpoint_restore_ns_capable(tmp->user_ns))
 				goto out_abort;
@@ -299,6 +293,11 @@ struct pid *alloc_pid(struct pid_namespa
 		tmp = tmp->parent;
 		i--;
 		retried_preload = false;
+
+		if (!READ_ONCE(tmp->child_reaper) && nr != 1) {
+			retval = -EINVAL;
+			goto out_free;
+		}
 	}
 
 	/*



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 1/2] pid_namespace: allow opening pid_for_children before init was created
  2026-02-24 12:09   ` Oleg Nesterov
@ 2026-02-24 13:23     ` Pavel Tikhomirov
  2026-02-24 16:03       ` Oleg Nesterov
  0 siblings, 1 reply; 11+ messages in thread
From: Pavel Tikhomirov @ 2026-02-24 13:23 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Christian Brauner, Shuah Khan, Kees Cook, Andrew Morton,
	David Hildenbrand, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Jan Kara, Aleksa Sarai, Andrei Vagin,
	Kirill Tkhai, Alexander Mikhalitsyn, Adrian Reber, linux-kernel,
	linux-mm, linux-kselftest, Mateusz Guzik



On 2/24/26 13:09, Oleg Nesterov wrote:
> On 02/23, Pavel Tikhomirov wrote:
>>
>> To avoid possible problems related to cpu/compiler optimizations around
>> ->child_reaper, let's use WRITE_ONCE (additional to task_list lock)
>> everywhere we write it and use READ_ONCE everywhere we read it without
>> explicit lock.
> 
> Yes, this is what I meant... but I can never recall if READ_ONCE() alone
> is enough to make KCSAN happy...

AFAICS this should be fine for memory safety of accesses to ->child_reaper.
I would love if someone more experienced in this area would confirm.

> 
> I won't insist, but I think it would be better to do this in a separate
> change for documenation purposes and for discussion.

Ok, will do. It will be a bit ugly as I will first add READ_ONCE to the
pidns_for_children_get() and then remove the hunk with it in the next patch.

> 
>> @@ -247,8 +247,9 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *arg_set_tid,
>>  			 * alreay in use. Return EEXIST in that case.
>>  			 */
>>  			if (nr == -ENOSPC)
>> -
>>  				nr = -EEXIST;
>> +		} else if (!READ_ONCE(tmp->child_reaper) && idr_get_cursor(&tmp->idr) != 0) {
>> +			nr = -EINVAL;
>>  		} else {
> 
> Oh, this doesn't look clear/clean... This even looks racy even if it is not.
> Can you move this check into the "else" branch which does another get_cursor
> and unify this check with the RESERVED_PIDS check?
> 
> Either way, I don't like the fact we check ->child_reaper != NULL twice.

Notice that we don't really read it twice, it is only twice in code but any real
execution either passes the first check or the second one, not both at the same
time as they are in tid set/unset branches correspondingly.

> Perhaps something like the preparational patch below makes sense ? Not
> sure this is actually better...

This looks more universal at least, as instead of two checks we have one in one
place. My only concern of putting the check where I put it was to avoid extra
idr_alloc_cyclic() + idr_remove(), if we already know we don't need it. But it's
only in last pid_namespace we can have ->child_reaper unset so we do alloc/remove
for all other namespaces anyway in error case, should not be a big deal.

I will add the preparation patches: for below patch and related to _ONCE.

> 
> Oleg.
> 
> --- x/kernel/pid.c
> +++ x/kernel/pid.c
> @@ -215,12 +215,6 @@ struct pid *alloc_pid(struct pid_namespa
>  			retval = -EINVAL;
>  			if (tid < 1 || tid >= pid_max[ns->level - i])
>  				goto out_abort;
> -			/*
> -			 * Also fail if a PID != 1 is requested and
> -			 * no PID 1 exists.
> -			 */
> -			if (tid != 1 && !tmp->child_reaper)
> -				goto out_abort;
>  			retval = -EPERM;
>  			if (!checkpoint_restore_ns_capable(tmp->user_ns))
>  				goto out_abort;
> @@ -299,6 +293,11 @@ struct pid *alloc_pid(struct pid_namespa
>  		tmp = tmp->parent;
>  		i--;
>  		retried_preload = false;
> +
> +		if (!READ_ONCE(tmp->child_reaper) && nr != 1) {
> +			retval = -EINVAL;
> +			goto out_free;
> +		}
>  	}
>  
>  	/*
> 

-- 
Best regards, Pavel Tikhomirov
Senior Software Developer, Virtuozzo.



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 1/2] pid_namespace: allow opening pid_for_children before init was created
  2026-02-24 10:37     ` Pavel Tikhomirov
@ 2026-02-24 15:38       ` Andrei Vagin
  2026-02-24 16:09         ` Oleg Nesterov
  0 siblings, 1 reply; 11+ messages in thread
From: Andrei Vagin @ 2026-02-24 15:38 UTC (permalink / raw)
  To: Pavel Tikhomirov
  Cc: Christian Brauner, Shuah Khan, Kees Cook, Andrew Morton,
	David Hildenbrand, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Jan Kara, Oleg Nesterov, Aleksa Sarai,
	Kirill Tkhai, Alexander Mikhalitsyn, Adrian Reber, linux-kernel,
	linux-mm, linux-kselftest

On Tue, Feb 24, 2026 at 5:38 AM Pavel Tikhomirov
<ptikhomirov@virtuozzo.com> wrote:
>
>
>
> On 2/24/26 08:02, Andrei Vagin wrote:
> > On Mon, Feb 23, 2026 at 3:03 PM Pavel Tikhomirov
> > <ptikhomirov@virtuozzo.com> wrote:
> >>
> >> This effectively gives us an ability to create the pid namespace init as
> >> a child of the process (setns-ed to the pid namespace) different to the
> >> process which created the pid namespace itself.
> >>
> >> Original problem:
> >>
> >> There is a cool set_tid feature in clone3() syscall, it allows you to
> >> create process with desired pids on multiple pid namespace levels. Which
> >> is useful to restore processes in CRIU for nested pid namespace case.
> >>
> >> In nested container case we can potentially see this kind of pid/user
> >> namespace tree:
> >>
> >>                             Process
> >>                           ┌─────────┐
> >>     User NS0 ──▶ Pid NS0 ──▶ Pid p0 │
> >>         │           │     │         │
> >>         ▼           ▼     │         │
> >>     User NS1 ──▶ Pid NS1 ──▶ Pid p1 │
> >>         │           │     │         │
> >>        ...         ...    │   ...   │
> >>         │           │     │         │
> >>         ▼           ▼     │         │
> >>     User NSn ──▶ Pid NSn ──▶ Pid pn │
> >>                           └─────────┘
> >>
> >> So to create the "Process" and set pids {p0, p1, ... pn} for it on all
> >> pid namespace levels we can use clone3() syscall set_tid feature, BUT
> >> the syscall does not allow you to set pid on pid namespace levels you
> >> don't have permission to. So basically you have to be in "User NS0" when
> >> creating the "Process" to actually be able to set pids on all levels.
> >>
> >> It is ok for almost any process, but with pid namespace init this does
> >> not work, as currently we can only create pid namespace init and the pid
> >> namespace itself simultaneously, so to make "Pid NSn" owned by "User
> >> NSn" we have to be in the "User NSn".
> >>
> >> We can't possibly be in "User NS0" and "User NSn" at the same time,
> >> hence the problem.
> >>
> >> Alternative solution:
> >>
> >> Yes, for the case of pid namespace init we can use old and gold
> >> /proc/sys/kernel/ns_last_pid interface on the levels lower than n. But
> >> it is much more complicated and introduces tons of extra code to do. It
> >> would be nice to make clone3() set_tid interface also aplicable to this
> >> corner case.
> >>
> >> Implementation:
> >>
> >> Now when anyone can setns to the pid namespace before the creation of
> >> init, and thus multiple processes can fork children to the pid
> >> namespace, we enforce that the first process created is always the init,
> >> and only allow other processes after the init sets
> >> pid_namespace->child_reaper.
> >>
> >> To avoid possible problems related to cpu/compiler optimizations around
> >> ->child_reaper, let's use WRITE_ONCE (additional to task_list lock)
> >> everywhere we write it and use READ_ONCE everywhere we read it without
> >> explicit lock. Note: we already had READ_ONCE in nsfs_fh_to_dentry().
> >>
> >> Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
> >>
> >> --
> >> v2: Use *_ONCE for ->child_reaper accesses atomicity, and avoid taking
> >> task_list lock for reading it. Rebase to master, and thus remove
> >> now excess pidns_ready variable.
> >>
> >> Note: I didn't find anything in copy_process() around setting the
> >> ->child_reaper which can influence the pid namespace, so it looks like
> >> the pid namespace is fully setup at the point when init sets
> >> ->child_reaper to receive more processes. Thus tasklist lock looks
> >> excess in pidns_for_children_get()'s ->child_reaper check and it should
> >> be safe not to have it in the corresponding checks in alloc_pid().
> >> ---
> >>  kernel/exit.c          | 2 +-
> >>  kernel/fork.c          | 2 +-
> >>  kernel/pid.c           | 5 +++--
> >>  kernel/pid_namespace.c | 9 ---------
> >>  4 files changed, 5 insertions(+), 13 deletions(-)
> >>
> >> diff --git a/kernel/exit.c b/kernel/exit.c
> >> index 8a87021211ae..567fc3b7b0f9 100644
> >> --- a/kernel/exit.c
> >> +++ b/kernel/exit.c
> >> @@ -608,7 +608,7 @@ static struct task_struct *find_child_reaper(struct task_struct *father,
> >>
> >>         reaper = find_alive_thread(father);
> >>         if (reaper) {
> >> -               pid_ns->child_reaper = reaper;
> >> +               WRITE_ONCE(pid_ns->child_reaper, reaper);
> >>                 return reaper;
> >>         }
> >>
> >> diff --git a/kernel/fork.c b/kernel/fork.c
> >> index e832da9d15a4..27d0cdbca67e 100644
> >> --- a/kernel/fork.c
> >> +++ b/kernel/fork.c
> >> @@ -2423,7 +2423,7 @@ __latent_entropy struct task_struct *copy_process(
> >>                         init_task_pid(p, PIDTYPE_SID, task_session(current));
> >>
> >>                         if (is_child_reaper(pid)) {
> >> -                               ns_of_pid(pid)->child_reaper = p;
> >> +                               WRITE_ONCE(ns_of_pid(pid)->child_reaper, p);
> >>                                 p->signal->flags |= SIGNAL_UNKILLABLE;
> >>                         }
> >>                         p->signal->shared_pending.signal = delayed.signal;
> >> diff --git a/kernel/pid.c b/kernel/pid.c
> >> index 3b96571d0fe6..e6116e131d8d 100644
> >> --- a/kernel/pid.c
> >> +++ b/kernel/pid.c
> >> @@ -219,7 +219,7 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *arg_set_tid,
> >>                          * Also fail if a PID != 1 is requested and
> >>                          * no PID 1 exists.
> >>                          */
> >> -                       if (tid != 1 && !tmp->child_reaper)
> >> +                       if (tid != 1 && !READ_ONCE(tmp->child_reaper))
> >>                                 goto out_abort;
> >>                         retval = -EPERM;
> >>                         if (!checkpoint_restore_ns_capable(tmp->user_ns))
> >> @@ -247,8 +247,9 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *arg_set_tid,
> >>                          * alreay in use. Return EEXIST in that case.
> >>                          */
> >>                         if (nr == -ENOSPC)
> >> -
> >>                                 nr = -EEXIST;
> >> +               } else if (!READ_ONCE(tmp->child_reaper) && idr_get_cursor(&tmp->idr) != 0) {
> >
> > I think it is better to update pid_ns_ctl_handler to prevent setting
> > ns_last_pid in a pidns
> > without init. Otherwise, figuring out why fork returns EINVAL can be tricky.
>
> Hm, I think pid_ns_ctl_handler(), as it uses current active pid namespace can
> only work if current is already fully (ns/pid) in the pid namespace, and thus
> the init is also already there. So it's implicitly protected from change
> before init creation.
>
> This check here is more for the concurrent alloc_pid() case. When one process
> in alloc_pid() successfully allocated the pid and than, for instance, hit the
> pidfs_add_pid() error and is going to free_pid(), but the pid 1 is remains yet
> allocated from idr and the cursor is on 2 at the moment. At the same time the
> concurrent process may get to alloc_pid(), and will see cursor on 2, it should
> not be able to create a process as this process will get pid 2 and will be
> created before init.
>
> And in general (non concurrent case) it makes sense to only allow allocating 1,
> for the first process.

In this case, there is likely a race condition. Two alloc_pid() calls
can run concurrently,
where idr_get_cursor() returns 0 in both instances. Consequently, both
will attempt to
allocate PIDs, but only one will actually receive PID 1. I think this
check needs to
be moved after idr_alloc_cyclic() to verify the actual value that was allocated.


>
> >
> >> +                       nr = -EINVAL;
> >>                 } else {
> >>                         int pid_min = 1;
> >>                         /*
>
> --
> Best regards, Pavel Tikhomirov
> Senior Software Developer, Virtuozzo.
>


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 1/2] pid_namespace: allow opening pid_for_children before init was created
  2026-02-24 13:23     ` Pavel Tikhomirov
@ 2026-02-24 16:03       ` Oleg Nesterov
  2026-02-24 16:35         ` Pavel Tikhomirov
  0 siblings, 1 reply; 11+ messages in thread
From: Oleg Nesterov @ 2026-02-24 16:03 UTC (permalink / raw)
  To: Pavel Tikhomirov
  Cc: Christian Brauner, Shuah Khan, Kees Cook, Andrew Morton,
	David Hildenbrand, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Jan Kara, Aleksa Sarai, Andrei Vagin,
	Kirill Tkhai, Alexander Mikhalitsyn, Adrian Reber, linux-kernel,
	linux-mm, linux-kselftest, Mateusz Guzik

On 02/24, Pavel Tikhomirov wrote:
>
> On 2/24/26 13:09, Oleg Nesterov wrote:
> > On 02/23, Pavel Tikhomirov wrote:
> >>
> >> To avoid possible problems related to cpu/compiler optimizations around
> >> ->child_reaper, let's use WRITE_ONCE (additional to task_list lock)
> >> everywhere we write it and use READ_ONCE everywhere we read it without
> >> explicit lock.
> >
> > Yes, this is what I meant... but I can never recall if READ_ONCE() alone
> > is enough to make KCSAN happy...
>
> AFAICS this should be fine for memory safety of accesses to ->child_reaper.
> I would love if someone more experienced in this area would confirm.

__READ_ONCE() uses volatile cast, DEFINE_TSAN_VOLATILE_READ_WRITE()
will pass KCSAN_ACCESS_ATOMIC to check_access(), so it seems that
READ_ONCE() should be enough...

But I am not sure, I don't really understand this code.

> > I won't insist, but I think it would be better to do this in a separate
> > change for documenation purposes and for discussion.
>
> Ok, will do. It will be a bit ugly as I will first add READ_ONCE to the
> pidns_for_children_get() and then remove the hunk with it in the next patch.

Agreed, this is ugly. I almost regret I mentioned _ONCE() in the previous
discussions, I only tried to "nack" another read_lock(tasklist).

So lets avoid a separate change and WRITE_ONCE()'s in copy_process/find_child_reaper,
we can add them later if KCSAN complains, they are not needed for correctness.

But up to you, I am fine either way.

> > Perhaps something like the preparational patch below makes sense ? Not
> > sure this is actually better...
>
> This looks more universal at least, as instead of two checks we have one in one
> place. My only concern of putting the check where I put it was to avoid extra
> idr_alloc_cyclic() + idr_remove(), if we already know we don't need it. But it's
> only in last pid_namespace we can have ->child_reaper unset so we do alloc/remove
> for all other namespaces anyway in error case, should not be a big deal.

Yes...

> I will add the preparation patches: for below patch and related to _ONCE.

Again, up to you. But either way it would be nice to have a comment or at
least a note in the changelog to explain that this is also needed to avoid
the race between alloc_pid() + fail and another alloc_pid(). This is subtle.

Oleg.



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 1/2] pid_namespace: allow opening pid_for_children before init was created
  2026-02-24 15:38       ` Andrei Vagin
@ 2026-02-24 16:09         ` Oleg Nesterov
  0 siblings, 0 replies; 11+ messages in thread
From: Oleg Nesterov @ 2026-02-24 16:09 UTC (permalink / raw)
  To: Andrei Vagin
  Cc: Pavel Tikhomirov, Christian Brauner, Shuah Khan, Kees Cook,
	Andrew Morton, David Hildenbrand, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Jan Kara, Aleksa Sarai,
	Kirill Tkhai, Alexander Mikhalitsyn, Adrian Reber, linux-kernel,
	linux-mm, linux-kselftest

On 02/24, Andrei Vagin wrote:
>
> On Tue, Feb 24, 2026 at 5:38 AM Pavel Tikhomirov
> >
> > This check here is more for the concurrent alloc_pid() case. When one process
> > in alloc_pid() successfully allocated the pid and than, for instance, hit the
> > pidfs_add_pid() error and is going to free_pid(), but the pid 1 is remains yet
> > allocated from idr and the cursor is on 2 at the moment. At the same time the
> > concurrent process may get to alloc_pid(), and will see cursor on 2, it should
> > not be able to create a process as this process will get pid 2 and will be
> > created before init.
> >
> > And in general (non concurrent case) it makes sense to only allow allocating 1,
> > for the first process.
>
> In this case, there is likely a race condition. Two alloc_pid() calls
> can run concurrently,
> where idr_get_cursor() returns 0 in both instances.

This code runs with pidmap_lock held.

But,

> I think this
> check needs to
> be moved after idr_alloc_cyclic() to verify the actual value that was allocated.

Agreed, see other emails in this thread.

Oleg.



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 1/2] pid_namespace: allow opening pid_for_children before init was created
  2026-02-24 16:03       ` Oleg Nesterov
@ 2026-02-24 16:35         ` Pavel Tikhomirov
  0 siblings, 0 replies; 11+ messages in thread
From: Pavel Tikhomirov @ 2026-02-24 16:35 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Christian Brauner, Shuah Khan, Kees Cook, Andrew Morton,
	David Hildenbrand, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Jan Kara, Aleksa Sarai, Andrei Vagin,
	Kirill Tkhai, Alexander Mikhalitsyn, Adrian Reber, linux-kernel,
	linux-mm, linux-kselftest, Mateusz Guzik



On 2/24/26 17:03, Oleg Nesterov wrote:
> On 02/24, Pavel Tikhomirov wrote:
>>
>> On 2/24/26 13:09, Oleg Nesterov wrote:
>>> On 02/23, Pavel Tikhomirov wrote:
>>>>
>>>> To avoid possible problems related to cpu/compiler optimizations around
>>>> ->child_reaper, let's use WRITE_ONCE (additional to task_list lock)
>>>> everywhere we write it and use READ_ONCE everywhere we read it without
>>>> explicit lock.
>>>
>>> Yes, this is what I meant... but I can never recall if READ_ONCE() alone
>>> is enough to make KCSAN happy...
>>
>> AFAICS this should be fine for memory safety of accesses to ->child_reaper.
>> I would love if someone more experienced in this area would confirm.
> 
> __READ_ONCE() uses volatile cast, DEFINE_TSAN_VOLATILE_READ_WRITE()
> will pass KCSAN_ACCESS_ATOMIC to check_access(), so it seems that
> READ_ONCE() should be enough...
> 
> But I am not sure, I don't really understand this code.
> 
>>> I won't insist, but I think it would be better to do this in a separate
>>> change for documenation purposes and for discussion.
>>
>> Ok, will do. It will be a bit ugly as I will first add READ_ONCE to the
>> pidns_for_children_get() and then remove the hunk with it in the next patch.
> 
> Agreed, this is ugly. I almost regret I mentioned _ONCE() in the previous
> discussions, I only tried to "nack" another read_lock(tasklist).

Heh, I missed that in pidns_for_children_get() we have read under tasklist
lock, so I don't need to add _ONCE there so it should not be that ugly =).

I will test and send v3 with prep patches soon.

> 
> So lets avoid a separate change and WRITE_ONCE()'s in copy_process/find_child_reaper,
> we can add them later if KCSAN complains, they are not needed for correctness.
> 
> But up to you, I am fine either way.

I also noticed that 
https://docs.kernel.org/dev-tools/kcsan.html#c.ASSERT_EXCLUSIVE_WRITER
asks to add ASSERT_EXCLUSIVE_WRITER, so I will try to accommodate that.

> 
>>> Perhaps something like the preparational patch below makes sense ? Not
>>> sure this is actually better...
>>
>> This looks more universal at least, as instead of two checks we have one in one
>> place. My only concern of putting the check where I put it was to avoid extra
>> idr_alloc_cyclic() + idr_remove(), if we already know we don't need it. But it's
>> only in last pid_namespace we can have ->child_reaper unset so we do alloc/remove
>> for all other namespaces anyway in error case, should not be a big deal.
> 
> Yes...
> 
>> I will add the preparation patches: for below patch and related to _ONCE.
> 
> Again, up to you. But either way it would be nice to have a comment or at
> least a note in the changelog to explain that this is also needed to avoid
> the race between alloc_pid() + fail and another alloc_pid(). This is subtle.

Yes I will try to emphasize this.

> 
> Oleg.
> 

Thank you!

-- 
Best regards, Pavel Tikhomirov
Senior Software Developer, Virtuozzo.



^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2026-02-24 16:36 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-02-23 20:01 [PATCH v2 0/2] pid_namespace: make init creation more flexible Pavel Tikhomirov
2026-02-23 20:01 ` [PATCH v2 1/2] pid_namespace: allow opening pid_for_children before init was created Pavel Tikhomirov
2026-02-24  7:02   ` Andrei Vagin
2026-02-24 10:37     ` Pavel Tikhomirov
2026-02-24 15:38       ` Andrei Vagin
2026-02-24 16:09         ` Oleg Nesterov
2026-02-24 12:09   ` Oleg Nesterov
2026-02-24 13:23     ` Pavel Tikhomirov
2026-02-24 16:03       ` Oleg Nesterov
2026-02-24 16:35         ` Pavel Tikhomirov
2026-02-23 20:01 ` [PATCH v2 2/2] selftests: Add tests for creating pidns init via setns Pavel Tikhomirov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox