linux-mm.kvack.org archive mirror
* [PATCH v6 0/5] reduce tasklist_lock hold time on exit and do some pid cleanup
@ 2025-02-06 16:44 Mateusz Guzik
  2025-02-06 16:44 ` [PATCH v6 1/5] exit: perform add_device_randomness() without tasklist_lock Mateusz Guzik
                   ` (6 more replies)
  0 siblings, 7 replies; 9+ messages in thread
From: Mateusz Guzik @ 2025-02-06 16:44 UTC (permalink / raw)
  To: ebiederm, oleg
  Cc: brauner, akpm, Liam.Howlett, linux-mm, linux-kernel, Mateusz Guzik

I fixed the fat-fingering and touched up some commit messages. git diff
between the old and new branches does not show any code changes.

hopefully we will be done here after this iteration 8->

old cover letter:

The clone side contends with the exit side in a way which avoidably
exacerbates the problem: the latter waits on locks held by the former
while itself holding tasklist_lock.

Whacking this for both add_device_randomness() and pid allocation gives
me a 15% speedup for thread creation/destruction in a 24-core VM.

The random patch is worth about 4%.

The new bottleneck is pidmap_lock itself, with the biggest problem being
the allocation itself taking the lock *twice*.

Bench (plop into will-it-scale):
$ cat tests/threadspawn1.c

char *testcase_description = "Thread creation and teardown";

static void *worker(void *arg)
{
	return (NULL);
}

void testcase(unsigned long long *iterations, unsigned long nr)
{
	pthread_t thread;
	int error;

	while (1) {
		error = pthread_create(&thread, NULL, worker, NULL);
		assert(error == 0);
		error = pthread_join(thread, NULL);
		assert(error == 0);
		(*iterations)++;
	}
}

v6:
- expand on the commit message in 3/5 pid: sprinkle tasklist_lock asserts
- move fat-fingered free_pids call to the right patch

v5:
- whack scripts/selinux/genheaders/genheaders which accidentally got in
- rebased on next-20250205

v4:
- justify moving get_pid in the commit message with a one-liner
- drop the tty unref patch -- it is completely optional and Oleg has his
  own variant
- add the ACK by Oleg

v3:
- keep procfs flush where it was, instead hoist get_pid outside of the
  lock
- make detach_pid et al accept an array argument of pids to populate
- sprinkle asserts
- drop irq trips around pidmap_lock
- move tty unref outside of tasklist_lock



Mateusz Guzik (5):
  exit: perform add_device_randomness() without tasklist_lock
  exit: hoist get_pid() in release_task() outside of tasklist_lock
  pid: sprinkle tasklist_lock asserts
  pid: perform free_pid() calls outside of tasklist_lock
  pid: drop irq disablement around pidmap_lock

 include/linux/pid.h |  7 ++--
 kernel/exit.c       | 36 +++++++++++++-------
 kernel/pid.c        | 82 +++++++++++++++++++++++++--------------------
 kernel/sys.c        | 14 +++++---
 4 files changed, 82 insertions(+), 57 deletions(-)

-- 
2.43.0



^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH v6 1/5] exit: perform add_device_randomness() without tasklist_lock
  2025-02-06 16:44 [PATCH v6 0/5] reduce tasklist_lock hold time on exit and do some pid cleanup Mateusz Guzik
@ 2025-02-06 16:44 ` Mateusz Guzik
  2025-02-06 16:44 ` [PATCH v6 2/5] exit: hoist get_pid() in release_task() outside of tasklist_lock Mateusz Guzik
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 9+ messages in thread
From: Mateusz Guzik @ 2025-02-06 16:44 UTC (permalink / raw)
  To: ebiederm, oleg
  Cc: brauner, akpm, Liam.Howlett, linux-mm, linux-kernel, Mateusz Guzik

Parallel calls to add_device_randomness() contend on their own.

The clone side already runs outside of tasklist_lock, which in turn means
any caller on the exit side extends the tasklist_lock hold time while
contending on the random-private lock.

Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
---
 kernel/exit.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index 3485e5fc499e..9d7acd7b0040 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -174,9 +174,6 @@ static void __exit_signal(struct task_struct *tsk)
 			sig->curr_target = next_thread(tsk);
 	}
 
-	add_device_randomness((const void*) &tsk->se.sum_exec_runtime,
-			      sizeof(unsigned long long));
-
 	/*
 	 * Accumulate here the counters for all threads as they die. We could
 	 * skip the group leader because it is the last user of signal_struct,
@@ -278,6 +275,8 @@ void release_task(struct task_struct *p)
 	write_unlock_irq(&tasklist_lock);
 	proc_flush_pid(thread_pid);
 	put_pid(thread_pid);
+	add_device_randomness(&p->se.sum_exec_runtime,
+			      sizeof(p->se.sum_exec_runtime));
 	release_thread(p);
 	put_task_struct_rcu_user(p);
 
-- 
2.43.0




* [PATCH v6 2/5] exit: hoist get_pid() in release_task() outside of tasklist_lock
  2025-02-06 16:44 [PATCH v6 0/5] reduce tasklist_lock hold time on exit and do some pid cleanup Mateusz Guzik
  2025-02-06 16:44 ` [PATCH v6 1/5] exit: perform add_device_randomness() without tasklist_lock Mateusz Guzik
@ 2025-02-06 16:44 ` Mateusz Guzik
  2025-02-06 16:44 ` [PATCH v6 3/5] pid: sprinkle tasklist_lock asserts Mateusz Guzik
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 9+ messages in thread
From: Mateusz Guzik @ 2025-02-06 16:44 UTC (permalink / raw)
  To: ebiederm, oleg
  Cc: brauner, akpm, Liam.Howlett, linux-mm, linux-kernel, Mateusz Guzik

This reduces hold time, as get_pid() performs an atomic operation.

Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
---
 kernel/exit.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index 9d7acd7b0040..8c39c84582b7 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -248,9 +248,10 @@ void release_task(struct task_struct *p)
 
 	cgroup_release(p);
 
+	thread_pid = get_pid(p->thread_pid);
+
 	write_lock_irq(&tasklist_lock);
 	ptrace_release_task(p);
-	thread_pid = get_pid(p->thread_pid);
 	__exit_signal(p);
 
 	/*
-- 
2.43.0




* [PATCH v6 3/5] pid: sprinkle tasklist_lock asserts
  2025-02-06 16:44 [PATCH v6 0/5] reduce tasklist_lock hold time on exit and do some pid cleanup Mateusz Guzik
  2025-02-06 16:44 ` [PATCH v6 1/5] exit: perform add_device_randomness() without tasklist_lock Mateusz Guzik
  2025-02-06 16:44 ` [PATCH v6 2/5] exit: hoist get_pid() in release_task() outside of tasklist_lock Mateusz Guzik
@ 2025-02-06 16:44 ` Mateusz Guzik
  2025-02-06 16:44 ` [PATCH v6 4/5] pid: perform free_pid() calls outside of tasklist_lock Mateusz Guzik
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 9+ messages in thread
From: Mateusz Guzik @ 2025-02-06 16:44 UTC (permalink / raw)
  To: ebiederm, oleg
  Cc: brauner, akpm, Liam.Howlett, linux-mm, linux-kernel, Mateusz Guzik

They cost nothing on production kernels and document the requirement of
holding tasklist_lock in the respective routines.

Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
---
 kernel/pid.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/kernel/pid.c b/kernel/pid.c
index 924084713be8..2ae872f689a7 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -339,17 +339,23 @@ static struct pid **task_pid_ptr(struct task_struct *task, enum pid_type type)
  */
 void attach_pid(struct task_struct *task, enum pid_type type)
 {
-	struct pid *pid = *task_pid_ptr(task, type);
+	struct pid *pid;
+
+	lockdep_assert_held_write(&tasklist_lock);
+
+	pid = *task_pid_ptr(task, type);
 	hlist_add_head_rcu(&task->pid_links[type], &pid->tasks[type]);
 }
 
 static void __change_pid(struct task_struct *task, enum pid_type type,
 			struct pid *new)
 {
-	struct pid **pid_ptr = task_pid_ptr(task, type);
-	struct pid *pid;
+	struct pid **pid_ptr, *pid;
 	int tmp;
 
+	lockdep_assert_held_write(&tasklist_lock);
+
+	pid_ptr = task_pid_ptr(task, type);
 	pid = *pid_ptr;
 
 	hlist_del_rcu(&task->pid_links[type]);
@@ -386,6 +392,8 @@ void exchange_tids(struct task_struct *left, struct task_struct *right)
 	struct hlist_head *head1 = &pid1->tasks[PIDTYPE_PID];
 	struct hlist_head *head2 = &pid2->tasks[PIDTYPE_PID];
 
+	lockdep_assert_held_write(&tasklist_lock);
+
 	/* Swap the single entry tid lists */
 	hlists_swap_heads_rcu(head1, head2);
 
@@ -403,6 +411,7 @@ void transfer_pid(struct task_struct *old, struct task_struct *new,
 			   enum pid_type type)
 {
 	WARN_ON_ONCE(type == PIDTYPE_PID);
+	lockdep_assert_held_write(&tasklist_lock);
 	hlist_replace_rcu(&old->pid_links[type], &new->pid_links[type]);
 }
 
-- 
2.43.0




* [PATCH v6 4/5] pid: perform free_pid() calls outside of tasklist_lock
  2025-02-06 16:44 [PATCH v6 0/5] reduce tasklist_lock hold time on exit and do some pid cleanup Mateusz Guzik
                   ` (2 preceding siblings ...)
  2025-02-06 16:44 ` [PATCH v6 3/5] pid: sprinkle tasklist_lock asserts Mateusz Guzik
@ 2025-02-06 16:44 ` Mateusz Guzik
  2025-02-06 16:44 ` [PATCH v6 5/5] pid: drop irq disablement around pidmap_lock Mateusz Guzik
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 9+ messages in thread
From: Mateusz Guzik @ 2025-02-06 16:44 UTC (permalink / raw)
  To: ebiederm, oleg
  Cc: brauner, akpm, Liam.Howlett, linux-mm, linux-kernel, Mateusz Guzik

As the clone side already executes pid allocation with only pidmap_lock
held, issuing free_pid() while still holding tasklist_lock needlessly
extends the total hold time of the latter.

More things may show up later which require initial cleanup with the
lock held and allow finishing without it. For that reason a struct to
collect such work is added instead of merely passing the pid array.

Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
---
 include/linux/pid.h |  7 ++++---
 kernel/exit.c       | 28 ++++++++++++++++++++--------
 kernel/pid.c        | 44 ++++++++++++++++++++++----------------------
 kernel/sys.c        | 14 +++++++++-----
 4 files changed, 55 insertions(+), 38 deletions(-)

diff --git a/include/linux/pid.h b/include/linux/pid.h
index 98837a1ff0f3..311ecebd7d56 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -101,9 +101,9 @@ extern struct pid *get_task_pid(struct task_struct *task, enum pid_type type);
  * these helpers must be called with the tasklist_lock write-held.
  */
 extern void attach_pid(struct task_struct *task, enum pid_type);
-extern void detach_pid(struct task_struct *task, enum pid_type);
-extern void change_pid(struct task_struct *task, enum pid_type,
-			struct pid *pid);
+void detach_pid(struct pid **pids, struct task_struct *task, enum pid_type);
+void change_pid(struct pid **pids, struct task_struct *task, enum pid_type,
+		struct pid *pid);
 extern void exchange_tids(struct task_struct *task, struct task_struct *old);
 extern void transfer_pid(struct task_struct *old, struct task_struct *new,
 			 enum pid_type);
@@ -129,6 +129,7 @@ extern struct pid *find_ge_pid(int nr, struct pid_namespace *);
 extern struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
 			     size_t set_tid_size);
 extern void free_pid(struct pid *pid);
+void free_pids(struct pid **pids);
 extern void disable_pid_allocation(struct pid_namespace *ns);
 
 /*
diff --git a/kernel/exit.c b/kernel/exit.c
index 8c39c84582b7..0d6df671c8a8 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -122,14 +122,22 @@ static __init int kernel_exit_sysfs_init(void)
 late_initcall(kernel_exit_sysfs_init);
 #endif
 
-static void __unhash_process(struct task_struct *p, bool group_dead)
+/*
+ * For things release_task() would like to do *after* tasklist_lock is released.
+ */
+struct release_task_post {
+	struct pid *pids[PIDTYPE_MAX];
+};
+
+static void __unhash_process(struct release_task_post *post, struct task_struct *p,
+			     bool group_dead)
 {
 	nr_threads--;
-	detach_pid(p, PIDTYPE_PID);
+	detach_pid(post->pids, p, PIDTYPE_PID);
 	if (group_dead) {
-		detach_pid(p, PIDTYPE_TGID);
-		detach_pid(p, PIDTYPE_PGID);
-		detach_pid(p, PIDTYPE_SID);
+		detach_pid(post->pids, p, PIDTYPE_TGID);
+		detach_pid(post->pids, p, PIDTYPE_PGID);
+		detach_pid(post->pids, p, PIDTYPE_SID);
 
 		list_del_rcu(&p->tasks);
 		list_del_init(&p->sibling);
@@ -141,7 +149,7 @@ static void __unhash_process(struct task_struct *p, bool group_dead)
 /*
  * This function expects the tasklist_lock write-locked.
  */
-static void __exit_signal(struct task_struct *tsk)
+static void __exit_signal(struct release_task_post *post, struct task_struct *tsk)
 {
 	struct signal_struct *sig = tsk->signal;
 	bool group_dead = thread_group_leader(tsk);
@@ -194,7 +202,7 @@ static void __exit_signal(struct task_struct *tsk)
 	task_io_accounting_add(&sig->ioac, &tsk->ioac);
 	sig->sum_sched_runtime += tsk->se.sum_exec_runtime;
 	sig->nr_threads--;
-	__unhash_process(tsk, group_dead);
+	__unhash_process(post, tsk, group_dead);
 	write_sequnlock(&sig->stats_lock);
 
 	/*
@@ -236,10 +244,13 @@ void __weak release_thread(struct task_struct *dead_task)
 
 void release_task(struct task_struct *p)
 {
+	struct release_task_post post;
 	struct task_struct *leader;
 	struct pid *thread_pid;
 	int zap_leader;
 repeat:
+	memset(&post, 0, sizeof(post));
+
 	/* don't need to get the RCU readlock here - the process is dead and
 	 * can't be modifying its own credentials. But shut RCU-lockdep up */
 	rcu_read_lock();
@@ -252,7 +263,7 @@ void release_task(struct task_struct *p)
 
 	write_lock_irq(&tasklist_lock);
 	ptrace_release_task(p);
-	__exit_signal(p);
+	__exit_signal(&post, p);
 
 	/*
 	 * If we are the last non-leader member of the thread
@@ -278,6 +289,7 @@ void release_task(struct task_struct *p)
 	put_pid(thread_pid);
 	add_device_randomness(&p->se.sum_exec_runtime,
 			      sizeof(p->se.sum_exec_runtime));
+	free_pids(post.pids);
 	release_thread(p);
 	put_task_struct_rcu_user(p);
 
diff --git a/kernel/pid.c b/kernel/pid.c
index 2ae872f689a7..73625f28c166 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -88,20 +88,6 @@ struct pid_namespace init_pid_ns = {
 };
 EXPORT_SYMBOL_GPL(init_pid_ns);
 
-/*
- * Note: disable interrupts while the pidmap_lock is held as an
- * interrupt might come in and do read_lock(&tasklist_lock).
- *
- * If we don't disable interrupts there is a nasty deadlock between
- * detach_pid()->free_pid() and another cpu that does
- * spin_lock(&pidmap_lock) followed by an interrupt routine that does
- * read_lock(&tasklist_lock);
- *
- * After we clean up the tasklist_lock and know there are no
- * irq handlers that take it we can leave the interrupts enabled.
- * For now it is easier to be safe than to prove it can't happen.
- */
-
 static  __cacheline_aligned_in_smp DEFINE_SPINLOCK(pidmap_lock);
 seqcount_spinlock_t pidmap_lock_seq = SEQCNT_SPINLOCK_ZERO(pidmap_lock_seq, &pidmap_lock);
 
@@ -128,10 +114,11 @@ static void delayed_put_pid(struct rcu_head *rhp)
 
 void free_pid(struct pid *pid)
 {
-	/* We can be called with write_lock_irq(&tasklist_lock) held */
 	int i;
 	unsigned long flags;
 
+	lockdep_assert_not_held(&tasklist_lock);
+
 	spin_lock_irqsave(&pidmap_lock, flags);
 	for (i = 0; i <= pid->level; i++) {
 		struct upid *upid = pid->numbers + i;
@@ -160,6 +147,18 @@ void free_pid(struct pid *pid)
 	call_rcu(&pid->rcu, delayed_put_pid);
 }
 
+void free_pids(struct pid **pids)
+{
+	int tmp;
+
+	/*
+	 * This can batch pidmap_lock.
+	 */
+	for (tmp = PIDTYPE_MAX; --tmp >= 0; )
+		if (pids[tmp])
+			free_pid(pids[tmp]);
+}
+
 struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
 		      size_t set_tid_size)
 {
@@ -347,8 +346,8 @@ void attach_pid(struct task_struct *task, enum pid_type type)
 	hlist_add_head_rcu(&task->pid_links[type], &pid->tasks[type]);
 }
 
-static void __change_pid(struct task_struct *task, enum pid_type type,
-			struct pid *new)
+static void __change_pid(struct pid **pids, struct task_struct *task,
+			 enum pid_type type, struct pid *new)
 {
 	struct pid **pid_ptr, *pid;
 	int tmp;
@@ -370,18 +369,19 @@ static void __change_pid(struct task_struct *task, enum pid_type type,
 		if (pid_has_task(pid, tmp))
 			return;
 
-	free_pid(pid);
+	WARN_ON(pids[type]);
+	pids[type] = pid;
 }
 
-void detach_pid(struct task_struct *task, enum pid_type type)
+void detach_pid(struct pid **pids, struct task_struct *task, enum pid_type type)
 {
-	__change_pid(task, type, NULL);
+	__change_pid(pids, task, type, NULL);
 }
 
-void change_pid(struct task_struct *task, enum pid_type type,
+void change_pid(struct pid **pids, struct task_struct *task, enum pid_type type,
 		struct pid *pid)
 {
-	__change_pid(task, type, pid);
+	__change_pid(pids, task, type, pid);
 	attach_pid(task, type);
 }
 
diff --git a/kernel/sys.c b/kernel/sys.c
index cb366ff8703a..4efca8a97d62 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1085,6 +1085,7 @@ SYSCALL_DEFINE2(setpgid, pid_t, pid, pid_t, pgid)
 {
 	struct task_struct *p;
 	struct task_struct *group_leader = current->group_leader;
+	struct pid *pids[PIDTYPE_MAX] = { 0 };
 	struct pid *pgrp;
 	int err;
 
@@ -1142,13 +1143,14 @@ SYSCALL_DEFINE2(setpgid, pid_t, pid, pid_t, pgid)
 		goto out;
 
 	if (task_pgrp(p) != pgrp)
-		change_pid(p, PIDTYPE_PGID, pgrp);
+		change_pid(pids, p, PIDTYPE_PGID, pgrp);
 
 	err = 0;
 out:
 	/* All paths lead to here, thus we are safe. -DaveM */
 	write_unlock_irq(&tasklist_lock);
 	rcu_read_unlock();
+	free_pids(pids);
 	return err;
 }
 
@@ -1222,21 +1224,22 @@ SYSCALL_DEFINE1(getsid, pid_t, pid)
 	return retval;
 }
 
-static void set_special_pids(struct pid *pid)
+static void set_special_pids(struct pid **pids, struct pid *pid)
 {
 	struct task_struct *curr = current->group_leader;
 
 	if (task_session(curr) != pid)
-		change_pid(curr, PIDTYPE_SID, pid);
+		change_pid(pids, curr, PIDTYPE_SID, pid);
 
 	if (task_pgrp(curr) != pid)
-		change_pid(curr, PIDTYPE_PGID, pid);
+		change_pid(pids, curr, PIDTYPE_PGID, pid);
 }
 
 int ksys_setsid(void)
 {
 	struct task_struct *group_leader = current->group_leader;
 	struct pid *sid = task_pid(group_leader);
+	struct pid *pids[PIDTYPE_MAX] = { 0 };
 	pid_t session = pid_vnr(sid);
 	int err = -EPERM;
 
@@ -1252,13 +1255,14 @@ int ksys_setsid(void)
 		goto out;
 
 	group_leader->signal->leader = 1;
-	set_special_pids(sid);
+	set_special_pids(pids, sid);
 
 	proc_clear_tty(group_leader);
 
 	err = session;
 out:
 	write_unlock_irq(&tasklist_lock);
+	free_pids(pids);
 	if (err > 0) {
 		proc_sid_connector(group_leader);
 		sched_autogroup_create_attach(group_leader);
-- 
2.43.0




* [PATCH v6 5/5] pid: drop irq disablement around pidmap_lock
  2025-02-06 16:44 [PATCH v6 0/5] reduce tasklist_lock hold time on exit and do some pid cleanup Mateusz Guzik
                   ` (3 preceding siblings ...)
  2025-02-06 16:44 ` [PATCH v6 4/5] pid: perform free_pid() calls outside of tasklist_lock Mateusz Guzik
@ 2025-02-06 16:44 ` Mateusz Guzik
  2025-02-06 17:14 ` [PATCH v6 0/5] reduce tasklist_lock hold time on exit and do some pid cleanup Liam R. Howlett
  2025-02-07  9:18 ` Christian Brauner
  6 siblings, 0 replies; 9+ messages in thread
From: Mateusz Guzik @ 2025-02-06 16:44 UTC (permalink / raw)
  To: ebiederm, oleg
  Cc: brauner, akpm, Liam.Howlett, linux-mm, linux-kernel, Mateusz Guzik

It no longer serves any purpose now that the tasklist_lock ->
pidmap_lock ordering has been eliminated.

Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
---
 kernel/pid.c | 23 +++++++++++------------
 1 file changed, 11 insertions(+), 12 deletions(-)

diff --git a/kernel/pid.c b/kernel/pid.c
index 73625f28c166..900193de4232 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -115,11 +115,10 @@ static void delayed_put_pid(struct rcu_head *rhp)
 void free_pid(struct pid *pid)
 {
 	int i;
-	unsigned long flags;
 
 	lockdep_assert_not_held(&tasklist_lock);
 
-	spin_lock_irqsave(&pidmap_lock, flags);
+	spin_lock(&pidmap_lock);
 	for (i = 0; i <= pid->level; i++) {
 		struct upid *upid = pid->numbers + i;
 		struct pid_namespace *ns = upid->ns;
@@ -142,7 +141,7 @@ void free_pid(struct pid *pid)
 		idr_remove(&ns->idr, upid->nr);
 	}
 	pidfs_remove_pid(pid);
-	spin_unlock_irqrestore(&pidmap_lock, flags);
+	spin_unlock(&pidmap_lock);
 
 	call_rcu(&pid->rcu, delayed_put_pid);
 }
@@ -210,7 +209,7 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
 		}
 
 		idr_preload(GFP_KERNEL);
-		spin_lock_irq(&pidmap_lock);
+		spin_lock(&pidmap_lock);
 
 		if (tid) {
 			nr = idr_alloc(&tmp->idr, NULL, tid,
@@ -237,7 +236,7 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
 			nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min,
 					      pid_max, GFP_ATOMIC);
 		}
-		spin_unlock_irq(&pidmap_lock);
+		spin_unlock(&pidmap_lock);
 		idr_preload_end();
 
 		if (nr < 0) {
@@ -271,7 +270,7 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
 
 	upid = pid->numbers + ns->level;
 	idr_preload(GFP_KERNEL);
-	spin_lock_irq(&pidmap_lock);
+	spin_lock(&pidmap_lock);
 	if (!(ns->pid_allocated & PIDNS_ADDING))
 		goto out_unlock;
 	pidfs_add_pid(pid);
@@ -280,18 +279,18 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
 		idr_replace(&upid->ns->idr, pid, upid->nr);
 		upid->ns->pid_allocated++;
 	}
-	spin_unlock_irq(&pidmap_lock);
+	spin_unlock(&pidmap_lock);
 	idr_preload_end();
 
 	return pid;
 
 out_unlock:
-	spin_unlock_irq(&pidmap_lock);
+	spin_unlock(&pidmap_lock);
 	idr_preload_end();
 	put_pid_ns(ns);
 
 out_free:
-	spin_lock_irq(&pidmap_lock);
+	spin_lock(&pidmap_lock);
 	while (++i <= ns->level) {
 		upid = pid->numbers + i;
 		idr_remove(&upid->ns->idr, upid->nr);
@@ -301,7 +300,7 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
 	if (ns->pid_allocated == PIDNS_ADDING)
 		idr_set_cursor(&ns->idr, 0);
 
-	spin_unlock_irq(&pidmap_lock);
+	spin_unlock(&pidmap_lock);
 
 	kmem_cache_free(ns->pid_cachep, pid);
 	return ERR_PTR(retval);
@@ -309,9 +308,9 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
 
 void disable_pid_allocation(struct pid_namespace *ns)
 {
-	spin_lock_irq(&pidmap_lock);
+	spin_lock(&pidmap_lock);
 	ns->pid_allocated &= ~PIDNS_ADDING;
-	spin_unlock_irq(&pidmap_lock);
+	spin_unlock(&pidmap_lock);
 }
 
 struct pid *find_pid_ns(int nr, struct pid_namespace *ns)
-- 
2.43.0




* Re: [PATCH v6 0/5] reduce tasklist_lock hold time on exit and do some pid cleanup
  2025-02-06 16:44 [PATCH v6 0/5] reduce tasklist_lock hold time on exit and do some pid cleanup Mateusz Guzik
                   ` (4 preceding siblings ...)
  2025-02-06 16:44 ` [PATCH v6 5/5] pid: drop irq disablement around pidmap_lock Mateusz Guzik
@ 2025-02-06 17:14 ` Liam R. Howlett
  2025-02-07  9:18 ` Christian Brauner
  6 siblings, 0 replies; 9+ messages in thread
From: Liam R. Howlett @ 2025-02-06 17:14 UTC (permalink / raw)
  To: Mateusz Guzik; +Cc: ebiederm, oleg, brauner, akpm, linux-mm, linux-kernel

* Mateusz Guzik <mjguzik@gmail.com> [250206 11:44]:
> I fixed the fat-fingering and touched up some commit messages. git diff
> between the old and new branches does not show any code changes.


Thanks for doing these changes.

Acked-by: Liam R. Howlett <Liam.Howlett@Oracle.com>

> 
> [...]



* Re: [PATCH v6 0/5] reduce tasklist_lock hold time on exit and do some pid cleanup
  2025-02-06 16:44 [PATCH v6 0/5] reduce tasklist_lock hold time on exit and do some pid cleanup Mateusz Guzik
                   ` (5 preceding siblings ...)
  2025-02-06 17:14 ` [PATCH v6 0/5] reduce tasklist_lock hold time on exit and do some pid cleanup Liam R. Howlett
@ 2025-02-07  9:18 ` Christian Brauner
  2025-02-07  9:42   ` Oleg Nesterov
  6 siblings, 1 reply; 9+ messages in thread
From: Christian Brauner @ 2025-02-07  9:18 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: Christian Brauner, akpm, Liam.Howlett, linux-mm, linux-kernel,
	ebiederm, oleg

On Thu, 06 Feb 2025 17:44:09 +0100, Mateusz Guzik wrote:
> I fixed the fat-fingering and touched up some commit messages. git diff
> between the old and new branches does not show any code changes.
> 
> hopefully we will be done here after this iteration 8->
> 
> old cover letter:
> 
> [...]

This is great!

---

Applied to the kernel-6.15.tasklist_lock branch of the vfs/vfs.git tree.
Patches in the kernel-6.15.tasklist_lock branch should appear in linux-next soon.

Please report any outstanding bugs that were missed during review in a
reply to the original patch series, allowing us to drop it.

It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.

Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: kernel-6.15.tasklist_lock

[1/5] exit: perform add_device_randomness() without tasklist_lock
      https://git.kernel.org/vfs/vfs/c/c8be3764feff
[2/5] exit: hoist get_pid() in release_task() outside of tasklist_lock
      https://git.kernel.org/vfs/vfs/c/9ee2997cbf14
[3/5] pid: sprinkle tasklist_lock asserts
      https://git.kernel.org/vfs/vfs/c/7d0030e1660c
[4/5] pid: perform free_pid() calls outside of tasklist_lock
      https://git.kernel.org/vfs/vfs/c/7b15589a575f
[5/5] pid: drop irq disablement around pidmap_lock
      https://git.kernel.org/vfs/vfs/c/07e518c464e8



* Re: [PATCH v6 0/5] reduce tasklist_lock hold time on exit and do some pid cleanup
  2025-02-07  9:18 ` Christian Brauner
@ 2025-02-07  9:42   ` Oleg Nesterov
  0 siblings, 0 replies; 9+ messages in thread
From: Oleg Nesterov @ 2025-02-07  9:42 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Mateusz Guzik, akpm, Liam.Howlett, linux-mm, linux-kernel, ebiederm

On 02/07, Christian Brauner wrote:
>
> Applied to the kernel-6.15.tasklist_lock branch of the vfs/vfs.git tree.

Great.

Then perhaps you can also take the related

	[PATCH v2 0/2] exit: change the release_task() paths to call flush_sigqueue() lockless
	https://lore.kernel.org/all/20250206152244.GA14609@redhat.com/

changes?

Oleg.



