linux-mm.kvack.org archive mirror
* [PATCH v3 0/6] reduce tasklist_lock hold time on exit and do some
@ 2025-02-01 16:31 Mateusz Guzik
  2025-02-01 16:31 ` [PATCH v3 1/6] exit: perform add_device_randomness() without tasklist_lock Mateusz Guzik
                   ` (6 more replies)
  0 siblings, 7 replies; 33+ messages in thread
From: Mateusz Guzik @ 2025-02-01 16:31 UTC (permalink / raw)
  To: ebiederm, oleg; +Cc: brauner, akpm, linux-mm, linux-kernel, Mateusz Guzik

The clone side contends with the exit side in a way that avoidably
exacerbates the problem: the exit side waits on locks held by the clone
side while itself holding tasklist_lock.

Whacking this for both add_device_randomness() and pid allocation gives
me a 15% speedup for thread creation/destruction in a 24-core VM.

The random patch is worth about 4%.

Nothing blew up with lockdep; only lightly tested so far.

Bench (plop into will-it-scale):
$ cat tests/threadspawn1.c

#include <assert.h>
#include <pthread.h>

char *testcase_description = "Thread creation and teardown";

static void *worker(void *arg)
{
	return (NULL);
}

void testcase(unsigned long long *iterations, unsigned long nr)
{
	pthread_t thread;
	int error;

	while (1) {
		error = pthread_create(&thread, NULL, worker, NULL);
		assert(error == 0);
		error = pthread_join(thread, NULL);
		assert(error == 0);
		(*iterations)++;
	}
}


v3:
- keep procfs flush where it was, instead hoist get_pid outside of the
  lock
- make detach_pid et al accept an array argument of pids to populate
- sprinkle asserts
- drop irq trips around pidmap_lock
- move tty unref outside of tasklist_lock

Mateusz Guzik (6):
  exit: perform add_device_randomness() without tasklist_lock
  exit: hoist get_pid() in release_task() outside of tasklist_lock
  exit: postpone tty_kref_put() until after tasklist_lock is dropped
  pid: sprinkle tasklist_lock asserts
  pid: perform free_pid() calls outside of tasklist_lock
  pid: drop irq disablement around pidmap_lock

 include/linux/pid.h |  7 ++--
 kernel/exit.c       | 45 +++++++++++++++----------
 kernel/pid.c        | 82 +++++++++++++++++++++++++--------------------
 kernel/sys.c        | 14 +++++---
 4 files changed, 86 insertions(+), 62 deletions(-)

-- 
2.43.0



^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH v3 1/6] exit: perform add_device_randomness() without tasklist_lock
  2025-02-01 16:31 [PATCH v3 0/6] reduce tasklist_lock hold time on exit and do some Mateusz Guzik
@ 2025-02-01 16:31 ` Mateusz Guzik
  2025-02-03 17:51   ` Oleg Nesterov
  2025-02-01 16:31 ` [PATCH v3 2/6] exit: hoist get_pid() in release_task() outside of tasklist_lock Mateusz Guzik
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 33+ messages in thread
From: Mateusz Guzik @ 2025-02-01 16:31 UTC (permalink / raw)
  To: ebiederm, oleg; +Cc: brauner, akpm, linux-mm, linux-kernel, Mateusz Guzik

Parallel calls to add_device_randomness() contend on their own.

The clone side already runs outside of tasklist_lock, which in turn means
any caller on the exit side extends the tasklist_lock hold time while
contending on the random-private lock.

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
---
 kernel/exit.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index 3485e5fc499e..1eb2e7d36ce4 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -174,9 +174,6 @@ static void __exit_signal(struct task_struct *tsk)
 			sig->curr_target = next_thread(tsk);
 	}
 
-	add_device_randomness((const void*) &tsk->se.sum_exec_runtime,
-			      sizeof(unsigned long long));
-
 	/*
 	 * Accumulate here the counters for all threads as they die. We could
 	 * skip the group leader because it is the last user of signal_struct,
@@ -278,6 +275,8 @@ void release_task(struct task_struct *p)
 	write_unlock_irq(&tasklist_lock);
 	proc_flush_pid(thread_pid);
 	put_pid(thread_pid);
+	add_device_randomness((const void*) &p->se.sum_exec_runtime,
+			      sizeof(unsigned long long));
 	release_thread(p);
 	put_task_struct_rcu_user(p);
 
-- 
2.43.0




* [PATCH v3 2/6] exit: hoist get_pid() in release_task() outside of tasklist_lock
  2025-02-01 16:31 [PATCH v3 0/6] reduce tasklist_lock hold time on exit and do some Mateusz Guzik
  2025-02-01 16:31 ` [PATCH v3 1/6] exit: perform add_device_randomness() without tasklist_lock Mateusz Guzik
@ 2025-02-01 16:31 ` Mateusz Guzik
  2025-02-03 19:27   ` Liam R. Howlett
  2025-02-01 16:31 ` [PATCH v3 3/6] exit: postpone tty_kref_put() until after tasklist_lock is dropped Mateusz Guzik
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 33+ messages in thread
From: Mateusz Guzik @ 2025-02-01 16:31 UTC (permalink / raw)
  To: ebiederm, oleg; +Cc: brauner, akpm, linux-mm, linux-kernel, Mateusz Guzik

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
---
 kernel/exit.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index 1eb2e7d36ce4..257dd8ed45ea 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -248,9 +248,10 @@ void release_task(struct task_struct *p)
 
 	cgroup_release(p);
 
+	thread_pid = get_pid(p->thread_pid);
+
 	write_lock_irq(&tasklist_lock);
 	ptrace_release_task(p);
-	thread_pid = get_pid(p->thread_pid);
 	__exit_signal(p);
 
 	/*
-- 
2.43.0




* [PATCH v3 3/6] exit: postpone tty_kref_put() until after tasklist_lock is dropped
  2025-02-01 16:31 [PATCH v3 0/6] reduce tasklist_lock hold time on exit and do some Mateusz Guzik
  2025-02-01 16:31 ` [PATCH v3 1/6] exit: perform add_device_randomness() without tasklist_lock Mateusz Guzik
  2025-02-01 16:31 ` [PATCH v3 2/6] exit: hoist get_pid() in release_task() outside of tasklist_lock Mateusz Guzik
@ 2025-02-01 16:31 ` Mateusz Guzik
  2025-02-03 18:06   ` Oleg Nesterov
  2025-02-01 16:31 ` [PATCH v3 4/6] pid: sprinkle tasklist_lock asserts Mateusz Guzik
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 33+ messages in thread
From: Mateusz Guzik @ 2025-02-01 16:31 UTC (permalink / raw)
  To: ebiederm, oleg; +Cc: brauner, akpm, linux-mm, linux-kernel, Mateusz Guzik

Instead of smuggling the tty pointer directly, use a struct so that more
things can be added later.

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
---
 kernel/exit.c | 22 +++++++++++++++-------
 1 file changed, 15 insertions(+), 7 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index 257dd8ed45ea..d2c74f93f7d2 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -122,6 +122,13 @@ static __init int kernel_exit_sysfs_init(void)
 late_initcall(kernel_exit_sysfs_init);
 #endif
 
+/*
+ * For things release_task() would like to do *after* tasklist_lock is released.
+ */
+struct release_task_post {
+	struct tty_struct *tty;
+};
+
 static void __unhash_process(struct task_struct *p, bool group_dead)
 {
 	nr_threads--;
@@ -141,12 +148,11 @@ static void __unhash_process(struct task_struct *p, bool group_dead)
 /*
  * This function expects the tasklist_lock write-locked.
  */
-static void __exit_signal(struct task_struct *tsk)
+static void __exit_signal(struct release_task_post *post, struct task_struct *tsk)
 {
 	struct signal_struct *sig = tsk->signal;
 	bool group_dead = thread_group_leader(tsk);
 	struct sighand_struct *sighand;
-	struct tty_struct *tty;
 	u64 utime, stime;
 
 	sighand = rcu_dereference_check(tsk->sighand,
@@ -160,7 +166,7 @@ static void __exit_signal(struct task_struct *tsk)
 #endif
 
 	if (group_dead) {
-		tty = sig->tty;
+		post->tty = sig->tty;
 		sig->tty = NULL;
 	} else {
 		/*
@@ -207,10 +213,8 @@ static void __exit_signal(struct task_struct *tsk)
 
 	__cleanup_sighand(sighand);
 	clear_tsk_thread_flag(tsk, TIF_SIGPENDING);
-	if (group_dead) {
+	if (group_dead)
 		flush_sigqueue(&sig->shared_pending);
-		tty_kref_put(tty);
-	}
 }
 
 static void delayed_put_task_struct(struct rcu_head *rhp)
@@ -236,10 +240,13 @@ void __weak release_thread(struct task_struct *dead_task)
 
 void release_task(struct task_struct *p)
 {
+	struct release_task_post post;
 	struct task_struct *leader;
 	struct pid *thread_pid;
 	int zap_leader;
 repeat:
+	memset(&post, 0, sizeof(post));
+
 	/* don't need to get the RCU readlock here - the process is dead and
 	 * can't be modifying its own credentials. But shut RCU-lockdep up */
 	rcu_read_lock();
@@ -252,7 +259,7 @@ void release_task(struct task_struct *p)
 
 	write_lock_irq(&tasklist_lock);
 	ptrace_release_task(p);
-	__exit_signal(p);
+	__exit_signal(&post, p);
 
 	/*
 	 * If we are the last non-leader member of the thread
@@ -280,6 +287,7 @@ void release_task(struct task_struct *p)
 			      sizeof(unsigned long long));
 	release_thread(p);
 	put_task_struct_rcu_user(p);
+	tty_kref_put(post.tty);
 
 	p = leader;
 	if (unlikely(zap_leader))
-- 
2.43.0




* [PATCH v3 4/6] pid: sprinkle tasklist_lock asserts
  2025-02-01 16:31 [PATCH v3 0/6] reduce tasklist_lock hold time on exit and do some Mateusz Guzik
                   ` (2 preceding siblings ...)
  2025-02-01 16:31 ` [PATCH v3 3/6] exit: postpone tty_kref_put() until after tasklist_lock is dropped Mateusz Guzik
@ 2025-02-01 16:31 ` Mateusz Guzik
  2025-02-01 16:31 ` [PATCH v3 5/6] pid: perform free_pid() calls outside of tasklist_lock Mateusz Guzik
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 33+ messages in thread
From: Mateusz Guzik @ 2025-02-01 16:31 UTC (permalink / raw)
  To: ebiederm, oleg; +Cc: brauner, akpm, linux-mm, linux-kernel, Mateusz Guzik

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
---
 kernel/pid.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/kernel/pid.c b/kernel/pid.c
index 924084713be8..2ae872f689a7 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -339,17 +339,23 @@ static struct pid **task_pid_ptr(struct task_struct *task, enum pid_type type)
  */
 void attach_pid(struct task_struct *task, enum pid_type type)
 {
-	struct pid *pid = *task_pid_ptr(task, type);
+	struct pid *pid;
+
+	lockdep_assert_held_write(&tasklist_lock);
+
+	pid = *task_pid_ptr(task, type);
 	hlist_add_head_rcu(&task->pid_links[type], &pid->tasks[type]);
 }
 
 static void __change_pid(struct task_struct *task, enum pid_type type,
 			struct pid *new)
 {
-	struct pid **pid_ptr = task_pid_ptr(task, type);
-	struct pid *pid;
+	struct pid **pid_ptr, *pid;
 	int tmp;
 
+	lockdep_assert_held_write(&tasklist_lock);
+
+	pid_ptr = task_pid_ptr(task, type);
 	pid = *pid_ptr;
 
 	hlist_del_rcu(&task->pid_links[type]);
@@ -386,6 +392,8 @@ void exchange_tids(struct task_struct *left, struct task_struct *right)
 	struct hlist_head *head1 = &pid1->tasks[PIDTYPE_PID];
 	struct hlist_head *head2 = &pid2->tasks[PIDTYPE_PID];
 
+	lockdep_assert_held_write(&tasklist_lock);
+
 	/* Swap the single entry tid lists */
 	hlists_swap_heads_rcu(head1, head2);
 
@@ -403,6 +411,7 @@ void transfer_pid(struct task_struct *old, struct task_struct *new,
 			   enum pid_type type)
 {
 	WARN_ON_ONCE(type == PIDTYPE_PID);
+	lockdep_assert_held_write(&tasklist_lock);
 	hlist_replace_rcu(&old->pid_links[type], &new->pid_links[type]);
 }
 
-- 
2.43.0




* [PATCH v3 5/6] pid: perform free_pid() calls outside of tasklist_lock
  2025-02-01 16:31 [PATCH v3 0/6] reduce tasklist_lock hold time on exit and do some Mateusz Guzik
                   ` (3 preceding siblings ...)
  2025-02-01 16:31 ` [PATCH v3 4/6] pid: sprinkle tasklist_lock asserts Mateusz Guzik
@ 2025-02-01 16:31 ` Mateusz Guzik
  2025-02-03 18:49   ` Oleg Nesterov
  2025-02-01 16:31 ` [PATCH v3 6/6] pid: drop irq disablement around pidmap_lock Mateusz Guzik
  2025-02-03 17:47 ` [PATCH v3 0/6] reduce tasklist_lock hold time on exit and do some Oleg Nesterov
  6 siblings, 1 reply; 33+ messages in thread
From: Mateusz Guzik @ 2025-02-01 16:31 UTC (permalink / raw)
  To: ebiederm, oleg; +Cc: brauner, akpm, linux-mm, linux-kernel, Mateusz Guzik

As the clone side already executes pid allocation with only pidmap_lock
held, issuing free_pid() while still holding tasklist_lock needlessly
extends the total hold time of the latter.

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
---
 include/linux/pid.h |  7 ++++---
 kernel/exit.c       | 15 +++++++++------
 kernel/pid.c        | 44 ++++++++++++++++++++++----------------------
 kernel/sys.c        | 14 +++++++++-----
 4 files changed, 44 insertions(+), 36 deletions(-)

diff --git a/include/linux/pid.h b/include/linux/pid.h
index 98837a1ff0f3..311ecebd7d56 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -101,9 +101,9 @@ extern struct pid *get_task_pid(struct task_struct *task, enum pid_type type);
  * these helpers must be called with the tasklist_lock write-held.
  */
 extern void attach_pid(struct task_struct *task, enum pid_type);
-extern void detach_pid(struct task_struct *task, enum pid_type);
-extern void change_pid(struct task_struct *task, enum pid_type,
-			struct pid *pid);
+void detach_pid(struct pid **pids, struct task_struct *task, enum pid_type);
+void change_pid(struct pid **pids, struct task_struct *task, enum pid_type,
+		struct pid *pid);
 extern void exchange_tids(struct task_struct *task, struct task_struct *old);
 extern void transfer_pid(struct task_struct *old, struct task_struct *new,
 			 enum pid_type);
@@ -129,6 +129,7 @@ extern struct pid *find_ge_pid(int nr, struct pid_namespace *);
 extern struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
 			     size_t set_tid_size);
 extern void free_pid(struct pid *pid);
+void free_pids(struct pid **pids);
 extern void disable_pid_allocation(struct pid_namespace *ns);
 
 /*
diff --git a/kernel/exit.c b/kernel/exit.c
index d2c74f93f7d2..a90a0d159570 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -127,16 +127,18 @@ late_initcall(kernel_exit_sysfs_init);
  */
 struct release_task_post {
 	struct tty_struct *tty;
+	struct pid *pids[PIDTYPE_MAX];
 };
 
-static void __unhash_process(struct task_struct *p, bool group_dead)
+static void __unhash_process(struct release_task_post *post, struct task_struct *p,
+			     bool group_dead)
 {
 	nr_threads--;
-	detach_pid(p, PIDTYPE_PID);
+	detach_pid(post->pids, p, PIDTYPE_PID);
 	if (group_dead) {
-		detach_pid(p, PIDTYPE_TGID);
-		detach_pid(p, PIDTYPE_PGID);
-		detach_pid(p, PIDTYPE_SID);
+		detach_pid(post->pids, p, PIDTYPE_TGID);
+		detach_pid(post->pids, p, PIDTYPE_PGID);
+		detach_pid(post->pids, p, PIDTYPE_SID);
 
 		list_del_rcu(&p->tasks);
 		list_del_init(&p->sibling);
@@ -200,7 +202,7 @@ static void __exit_signal(struct release_task_post *post, struct task_struct *ts
 	task_io_accounting_add(&sig->ioac, &tsk->ioac);
 	sig->sum_sched_runtime += tsk->se.sum_exec_runtime;
 	sig->nr_threads--;
-	__unhash_process(tsk, group_dead);
+	__unhash_process(post, tsk, group_dead);
 	write_sequnlock(&sig->stats_lock);
 
 	/*
@@ -285,6 +287,7 @@ void release_task(struct task_struct *p)
 	put_pid(thread_pid);
 	add_device_randomness((const void*) &p->se.sum_exec_runtime,
 			      sizeof(unsigned long long));
+	free_pids(post.pids);
 	release_thread(p);
 	put_task_struct_rcu_user(p);
 	tty_kref_put(post.tty);
diff --git a/kernel/pid.c b/kernel/pid.c
index 2ae872f689a7..73625f28c166 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -88,20 +88,6 @@ struct pid_namespace init_pid_ns = {
 };
 EXPORT_SYMBOL_GPL(init_pid_ns);
 
-/*
- * Note: disable interrupts while the pidmap_lock is held as an
- * interrupt might come in and do read_lock(&tasklist_lock).
- *
- * If we don't disable interrupts there is a nasty deadlock between
- * detach_pid()->free_pid() and another cpu that does
- * spin_lock(&pidmap_lock) followed by an interrupt routine that does
- * read_lock(&tasklist_lock);
- *
- * After we clean up the tasklist_lock and know there are no
- * irq handlers that take it we can leave the interrupts enabled.
- * For now it is easier to be safe than to prove it can't happen.
- */
-
 static  __cacheline_aligned_in_smp DEFINE_SPINLOCK(pidmap_lock);
 seqcount_spinlock_t pidmap_lock_seq = SEQCNT_SPINLOCK_ZERO(pidmap_lock_seq, &pidmap_lock);
 
@@ -128,10 +114,11 @@ static void delayed_put_pid(struct rcu_head *rhp)
 
 void free_pid(struct pid *pid)
 {
-	/* We can be called with write_lock_irq(&tasklist_lock) held */
 	int i;
 	unsigned long flags;
 
+	lockdep_assert_not_held(&tasklist_lock);
+
 	spin_lock_irqsave(&pidmap_lock, flags);
 	for (i = 0; i <= pid->level; i++) {
 		struct upid *upid = pid->numbers + i;
@@ -160,6 +147,18 @@ void free_pid(struct pid *pid)
 	call_rcu(&pid->rcu, delayed_put_pid);
 }
 
+void free_pids(struct pid **pids)
+{
+	int tmp;
+
+	/*
+	 * This can batch pidmap_lock.
+	 */
+	for (tmp = PIDTYPE_MAX; --tmp >= 0; )
+		if (pids[tmp])
+			free_pid(pids[tmp]);
+}
+
 struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
 		      size_t set_tid_size)
 {
@@ -347,8 +346,8 @@ void attach_pid(struct task_struct *task, enum pid_type type)
 	hlist_add_head_rcu(&task->pid_links[type], &pid->tasks[type]);
 }
 
-static void __change_pid(struct task_struct *task, enum pid_type type,
-			struct pid *new)
+static void __change_pid(struct pid **pids, struct task_struct *task,
+			 enum pid_type type, struct pid *new)
 {
 	struct pid **pid_ptr, *pid;
 	int tmp;
@@ -370,18 +369,19 @@ static void __change_pid(struct task_struct *task, enum pid_type type,
 		if (pid_has_task(pid, tmp))
 			return;
 
-	free_pid(pid);
+	WARN_ON(pids[type]);
+	pids[type] = pid;
 }
 
-void detach_pid(struct task_struct *task, enum pid_type type)
+void detach_pid(struct pid **pids, struct task_struct *task, enum pid_type type)
 {
-	__change_pid(task, type, NULL);
+	__change_pid(pids, task, type, NULL);
 }
 
-void change_pid(struct task_struct *task, enum pid_type type,
+void change_pid(struct pid **pids, struct task_struct *task, enum pid_type type,
 		struct pid *pid)
 {
-	__change_pid(task, type, pid);
+	__change_pid(pids, task, type, pid);
 	attach_pid(task, type);
 }
 
diff --git a/kernel/sys.c b/kernel/sys.c
index cb366ff8703a..4efca8a97d62 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1085,6 +1085,7 @@ SYSCALL_DEFINE2(setpgid, pid_t, pid, pid_t, pgid)
 {
 	struct task_struct *p;
 	struct task_struct *group_leader = current->group_leader;
+	struct pid *pids[PIDTYPE_MAX] = { 0 };
 	struct pid *pgrp;
 	int err;
 
@@ -1142,13 +1143,14 @@ SYSCALL_DEFINE2(setpgid, pid_t, pid, pid_t, pgid)
 		goto out;
 
 	if (task_pgrp(p) != pgrp)
-		change_pid(p, PIDTYPE_PGID, pgrp);
+		change_pid(pids, p, PIDTYPE_PGID, pgrp);
 
 	err = 0;
 out:
 	/* All paths lead to here, thus we are safe. -DaveM */
 	write_unlock_irq(&tasklist_lock);
 	rcu_read_unlock();
+	free_pids(pids);
 	return err;
 }
 
@@ -1222,21 +1224,22 @@ SYSCALL_DEFINE1(getsid, pid_t, pid)
 	return retval;
 }
 
-static void set_special_pids(struct pid *pid)
+static void set_special_pids(struct pid **pids, struct pid *pid)
 {
 	struct task_struct *curr = current->group_leader;
 
 	if (task_session(curr) != pid)
-		change_pid(curr, PIDTYPE_SID, pid);
+		change_pid(pids, curr, PIDTYPE_SID, pid);
 
 	if (task_pgrp(curr) != pid)
-		change_pid(curr, PIDTYPE_PGID, pid);
+		change_pid(pids, curr, PIDTYPE_PGID, pid);
 }
 
 int ksys_setsid(void)
 {
 	struct task_struct *group_leader = current->group_leader;
 	struct pid *sid = task_pid(group_leader);
+	struct pid *pids[PIDTYPE_MAX] = { 0 };
 	pid_t session = pid_vnr(sid);
 	int err = -EPERM;
 
@@ -1252,13 +1255,14 @@ int ksys_setsid(void)
 		goto out;
 
 	group_leader->signal->leader = 1;
-	set_special_pids(sid);
+	set_special_pids(pids, sid);
 
 	proc_clear_tty(group_leader);
 
 	err = session;
 out:
 	write_unlock_irq(&tasklist_lock);
+	free_pids(pids);
 	if (err > 0) {
 		proc_sid_connector(group_leader);
 		sched_autogroup_create_attach(group_leader);
-- 
2.43.0




* [PATCH v3 6/6] pid: drop irq disablement around pidmap_lock
  2025-02-01 16:31 [PATCH v3 0/6] reduce tasklist_lock hold time on exit and do some Mateusz Guzik
                   ` (4 preceding siblings ...)
  2025-02-01 16:31 ` [PATCH v3 5/6] pid: perform free_pid() calls outside of tasklist_lock Mateusz Guzik
@ 2025-02-01 16:31 ` Mateusz Guzik
  2025-02-01 17:42   ` Oleg Nesterov
  2025-02-01 18:19   ` David Laight
  2025-02-03 17:47 ` [PATCH v3 0/6] reduce tasklist_lock hold time on exit and do some Oleg Nesterov
  6 siblings, 2 replies; 33+ messages in thread
From: Mateusz Guzik @ 2025-02-01 16:31 UTC (permalink / raw)
  To: ebiederm, oleg; +Cc: brauner, akpm, linux-mm, linux-kernel, Mateusz Guzik

It no longer serves any purpose now that the tasklist_lock ->
pidmap_lock ordering got eliminated.

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
---
 kernel/pid.c | 23 +++++++++++------------
 1 file changed, 11 insertions(+), 12 deletions(-)

diff --git a/kernel/pid.c b/kernel/pid.c
index 73625f28c166..900193de4232 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -115,11 +115,10 @@ static void delayed_put_pid(struct rcu_head *rhp)
 void free_pid(struct pid *pid)
 {
 	int i;
-	unsigned long flags;
 
 	lockdep_assert_not_held(&tasklist_lock);
 
-	spin_lock_irqsave(&pidmap_lock, flags);
+	spin_lock(&pidmap_lock);
 	for (i = 0; i <= pid->level; i++) {
 		struct upid *upid = pid->numbers + i;
 		struct pid_namespace *ns = upid->ns;
@@ -142,7 +141,7 @@ void free_pid(struct pid *pid)
 		idr_remove(&ns->idr, upid->nr);
 	}
 	pidfs_remove_pid(pid);
-	spin_unlock_irqrestore(&pidmap_lock, flags);
+	spin_unlock(&pidmap_lock);
 
 	call_rcu(&pid->rcu, delayed_put_pid);
 }
@@ -210,7 +209,7 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
 		}
 
 		idr_preload(GFP_KERNEL);
-		spin_lock_irq(&pidmap_lock);
+		spin_lock(&pidmap_lock);
 
 		if (tid) {
 			nr = idr_alloc(&tmp->idr, NULL, tid,
@@ -237,7 +236,7 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
 			nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min,
 					      pid_max, GFP_ATOMIC);
 		}
-		spin_unlock_irq(&pidmap_lock);
+		spin_unlock(&pidmap_lock);
 		idr_preload_end();
 
 		if (nr < 0) {
@@ -271,7 +270,7 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
 
 	upid = pid->numbers + ns->level;
 	idr_preload(GFP_KERNEL);
-	spin_lock_irq(&pidmap_lock);
+	spin_lock(&pidmap_lock);
 	if (!(ns->pid_allocated & PIDNS_ADDING))
 		goto out_unlock;
 	pidfs_add_pid(pid);
@@ -280,18 +279,18 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
 		idr_replace(&upid->ns->idr, pid, upid->nr);
 		upid->ns->pid_allocated++;
 	}
-	spin_unlock_irq(&pidmap_lock);
+	spin_unlock(&pidmap_lock);
 	idr_preload_end();
 
 	return pid;
 
 out_unlock:
-	spin_unlock_irq(&pidmap_lock);
+	spin_unlock(&pidmap_lock);
 	idr_preload_end();
 	put_pid_ns(ns);
 
 out_free:
-	spin_lock_irq(&pidmap_lock);
+	spin_lock(&pidmap_lock);
 	while (++i <= ns->level) {
 		upid = pid->numbers + i;
 		idr_remove(&upid->ns->idr, upid->nr);
@@ -301,7 +300,7 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
 	if (ns->pid_allocated == PIDNS_ADDING)
 		idr_set_cursor(&ns->idr, 0);
 
-	spin_unlock_irq(&pidmap_lock);
+	spin_unlock(&pidmap_lock);
 
 	kmem_cache_free(ns->pid_cachep, pid);
 	return ERR_PTR(retval);
@@ -309,9 +308,9 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
 
 void disable_pid_allocation(struct pid_namespace *ns)
 {
-	spin_lock_irq(&pidmap_lock);
+	spin_lock(&pidmap_lock);
 	ns->pid_allocated &= ~PIDNS_ADDING;
-	spin_unlock_irq(&pidmap_lock);
+	spin_unlock(&pidmap_lock);
 }
 
 struct pid *find_pid_ns(int nr, struct pid_namespace *ns)
-- 
2.43.0




* Re: [PATCH v3 6/6] pid: drop irq disablement around pidmap_lock
  2025-02-01 16:31 ` [PATCH v3 6/6] pid: drop irq disablement around pidmap_lock Mateusz Guzik
@ 2025-02-01 17:42   ` Oleg Nesterov
  2025-02-01 17:45     ` Oleg Nesterov
  2025-02-01 18:19   ` David Laight
  1 sibling, 1 reply; 33+ messages in thread
From: Oleg Nesterov @ 2025-02-01 17:42 UTC (permalink / raw)
  To: Mateusz Guzik; +Cc: ebiederm, brauner, akpm, linux-mm, linux-kernel

I'll try to review on Monday, but I see nothing obviously wrong after
a very quick glance. I am not sure 4/6 is really useful, but I won't
argue.

However, shouldn't this patch also remove the comment which explains
the possible lock inversion? Above put_pid(),

	/*
	 * Note: disable interrupts while the pidmap_lock is held as an
	 * interrupt might come in and do read_lock(&tasklist_lock).
	 *
	 * If we don't disable interrupts there is a nasty deadlock between
	 * detach_pid()->free_pid() and another cpu that does
	 * spin_lock(&pidmap_lock) followed by an interrupt routine that does
	 * read_lock(&tasklist_lock);
	 *
	 * After we clean up the tasklist_lock and know there are no
	 * irq handlers that take it we can leave the interrupts enabled.
	 * For now it is easier to be safe than to prove it can't happen.
	 */

Oleg.

On 02/01, Mateusz Guzik wrote:
>
> It no longer serves any purpose now that the tasklist_lock ->
> pidmap_lock ordering got eliminated.
> 
> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
> ---
>  kernel/pid.c | 23 +++++++++++------------
>  1 file changed, 11 insertions(+), 12 deletions(-)
> 
> diff --git a/kernel/pid.c b/kernel/pid.c
> index 73625f28c166..900193de4232 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -115,11 +115,10 @@ static void delayed_put_pid(struct rcu_head *rhp)
>  void free_pid(struct pid *pid)
>  {
>  	int i;
> -	unsigned long flags;
>  
>  	lockdep_assert_not_held(&tasklist_lock);
>  
> -	spin_lock_irqsave(&pidmap_lock, flags);
> +	spin_lock(&pidmap_lock);
>  	for (i = 0; i <= pid->level; i++) {
>  		struct upid *upid = pid->numbers + i;
>  		struct pid_namespace *ns = upid->ns;
> @@ -142,7 +141,7 @@ void free_pid(struct pid *pid)
>  		idr_remove(&ns->idr, upid->nr);
>  	}
>  	pidfs_remove_pid(pid);
> -	spin_unlock_irqrestore(&pidmap_lock, flags);
> +	spin_unlock(&pidmap_lock);
>  
>  	call_rcu(&pid->rcu, delayed_put_pid);
>  }
> @@ -210,7 +209,7 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
>  		}
>  
>  		idr_preload(GFP_KERNEL);
> -		spin_lock_irq(&pidmap_lock);
> +		spin_lock(&pidmap_lock);
>  
>  		if (tid) {
>  			nr = idr_alloc(&tmp->idr, NULL, tid,
> @@ -237,7 +236,7 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
>  			nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min,
>  					      pid_max, GFP_ATOMIC);
>  		}
> -		spin_unlock_irq(&pidmap_lock);
> +		spin_unlock(&pidmap_lock);
>  		idr_preload_end();
>  
>  		if (nr < 0) {
> @@ -271,7 +270,7 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
>  
>  	upid = pid->numbers + ns->level;
>  	idr_preload(GFP_KERNEL);
> -	spin_lock_irq(&pidmap_lock);
> +	spin_lock(&pidmap_lock);
>  	if (!(ns->pid_allocated & PIDNS_ADDING))
>  		goto out_unlock;
>  	pidfs_add_pid(pid);
> @@ -280,18 +279,18 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
>  		idr_replace(&upid->ns->idr, pid, upid->nr);
>  		upid->ns->pid_allocated++;
>  	}
> -	spin_unlock_irq(&pidmap_lock);
> +	spin_unlock(&pidmap_lock);
>  	idr_preload_end();
>  
>  	return pid;
>  
>  out_unlock:
> -	spin_unlock_irq(&pidmap_lock);
> +	spin_unlock(&pidmap_lock);
>  	idr_preload_end();
>  	put_pid_ns(ns);
>  
>  out_free:
> -	spin_lock_irq(&pidmap_lock);
> +	spin_lock(&pidmap_lock);
>  	while (++i <= ns->level) {
>  		upid = pid->numbers + i;
>  		idr_remove(&upid->ns->idr, upid->nr);
> @@ -301,7 +300,7 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
>  	if (ns->pid_allocated == PIDNS_ADDING)
>  		idr_set_cursor(&ns->idr, 0);
>  
> -	spin_unlock_irq(&pidmap_lock);
> +	spin_unlock(&pidmap_lock);
>  
>  	kmem_cache_free(ns->pid_cachep, pid);
>  	return ERR_PTR(retval);
> @@ -309,9 +308,9 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
>  
>  void disable_pid_allocation(struct pid_namespace *ns)
>  {
> -	spin_lock_irq(&pidmap_lock);
> +	spin_lock(&pidmap_lock);
>  	ns->pid_allocated &= ~PIDNS_ADDING;
> -	spin_unlock_irq(&pidmap_lock);
> +	spin_unlock(&pidmap_lock);
>  }
>  
>  struct pid *find_pid_ns(int nr, struct pid_namespace *ns)
> -- 
> 2.43.0
> 




* Re: [PATCH v3 6/6] pid: drop irq disablement around pidmap_lock
  2025-02-01 17:42   ` Oleg Nesterov
@ 2025-02-01 17:45     ` Oleg Nesterov
  0 siblings, 0 replies; 33+ messages in thread
From: Oleg Nesterov @ 2025-02-01 17:45 UTC (permalink / raw)
  To: Mateusz Guzik; +Cc: ebiederm, brauner, akpm, linux-mm, linux-kernel

On 02/01, Oleg Nesterov wrote:
>
> However, shouldn't this patch also remove the comment which explains
> the possible lock inversion? Above put_pid(),
>
> 	/*
> 	 * Note: disable interrupts while the pidmap_lock is held as an
> 	 * interrupt might come in and do read_lock(&tasklist_lock).
> 	 *
> 	 * If we don't disable interrupts there is a nasty deadlock between
> 	 * detach_pid()->free_pid() and another cpu that does
> 	 * spin_lock(&pidmap_lock) followed by an interrupt routine that does
> 	 * read_lock(&tasklist_lock);
> 	 *
> 	 * After we clean up the tasklist_lock and know there are no
> 	 * irq handlers that take it we can leave the interrupts enabled.
> 	 * For now it is easier to be safe than to prove it can't happen.
> 	 */

Ah, sorry, please forget: you did it in the previous patch, which probably
needs some more discussion...

Oleg.




* Re: [PATCH v3 6/6] pid: drop irq disablement around pidmap_lock
  2025-02-01 16:31 ` [PATCH v3 6/6] pid: drop irq disablement around pidmap_lock Mateusz Guzik
  2025-02-01 17:42   ` Oleg Nesterov
@ 2025-02-01 18:19   ` David Laight
  2025-02-01 18:42     ` Mateusz Guzik
  1 sibling, 1 reply; 33+ messages in thread
From: David Laight @ 2025-02-01 18:19 UTC (permalink / raw)
  To: Mateusz Guzik; +Cc: ebiederm, oleg, brauner, akpm, linux-mm, linux-kernel

On Sat,  1 Feb 2025 17:31:06 +0100
Mateusz Guzik <mjguzik@gmail.com> wrote:

> It no longer serves any purpose now that the tasklist_lock ->
> pidmap_lock ordering got eliminated.

Not disabling interrupts may make things worse.
It is a trade-off between interrupt latency and lock hold time.

If interrupts are disabled then (clearly) they can get delayed because
the lock is held.
Provided the lock is only held for a short time it probably doesn't matter.
Indeed, unless it is the worst one, it probably doesn't matter at all.
After all, spin locks shouldn't really be held for significant periods.

OTOH if the lock doesn't disable interrupts then an interrupt will
increase the length of time a lock is held for.
This can be significant - and I mean upwards of 1ms.
Network interrupts can take a while - and then the work that is deferred
to 'softint' context happens as well (I don't think a spinlock stops
the softint code).

I've a feeling that unless a spin lock is held for 'far longer than one
should ever be held for' then you really want to disable interrupts.

In this case, if you get a network interrupt + softint while the pidmap_lock
is held, then all other CPUs won't be able to acquire the lock until the
network code finishes.

The same issue makes futex pretty much useless in anything trying to do
audio processing.

	David
 



* Re: [PATCH v3 6/6] pid: drop irq disablement around pidmap_lock
  2025-02-01 18:19   ` David Laight
@ 2025-02-01 18:42     ` Mateusz Guzik
  2025-02-01 21:51       ` David Laight
  0 siblings, 1 reply; 33+ messages in thread
From: Mateusz Guzik @ 2025-02-01 18:42 UTC (permalink / raw)
  To: David Laight; +Cc: ebiederm, oleg, brauner, akpm, linux-mm, linux-kernel

On Sat, Feb 1, 2025 at 7:19 PM David Laight
<david.laight.linux@gmail.com> wrote:
>
> On Sat,  1 Feb 2025 17:31:06 +0100
> Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> > It no longer serves any purpose now that the tasklist_lock ->
> > pidmap_lock ordering got eliminated.
>
> Not disabling interrupts may make things worse.
> It is a trade off between 'interrupt latency' and 'lock hold time'.
>
> If interrupts are disabled then (clearly) they can get delayed because
> the lock is held.
> Provided the lock is only held for a short time it probably doesn't matter.
> Indeed, unless it is the worst one, it probably doesn't matter at all.
> After all spin locks shouldn't really be held for significant periods.
>
> OTOH if the lock doesn't disable interrupts then an interrupt will
> increase the length of time a lock is held for.
> This can be significant - and I mean upwards of 1ms.
> Network interrupts can take a while - and then the work that is deferred
> to 'softint' context happens as well (I don't think a spinlock stops
> the softint code).
>
> I've a feeling that unless a spin lock is held for 'far longer than one
> should ever be held for' then you really want to disable interrupts.
>

Note that taking the interrupt trip increases single-threaded overhead.

Per your own description, if the lock is contested and interrupts are
disabled, handling them also gets delayed by CPUs which are busy just
waiting (and which would otherwise take care of them).

So while this is indeed a tradeoff, as I understand the sane default
is to *not* disable interrupts unless necessary.

-- 
Mateusz Guzik <mjguzik gmail.com>



* Re: [PATCH v3 6/6] pid: drop irq disablement around pidmap_lock
  2025-02-01 18:42     ` Mateusz Guzik
@ 2025-02-01 21:51       ` David Laight
  2025-02-01 22:00         ` Matthew Wilcox
  0 siblings, 1 reply; 33+ messages in thread
From: David Laight @ 2025-02-01 21:51 UTC (permalink / raw)
  To: Mateusz Guzik; +Cc: ebiederm, oleg, brauner, akpm, linux-mm, linux-kernel

On Sat, 1 Feb 2025 19:42:32 +0100
Mateusz Guzik <mjguzik@gmail.com> wrote:

> On Sat, Feb 1, 2025 at 7:19 PM David Laight
> <david.laight.linux@gmail.com> wrote:
> >
> > On Sat,  1 Feb 2025 17:31:06 +0100
> > Mateusz Guzik <mjguzik@gmail.com> wrote:
> >  
> > > It no longer serves any purpose now that the tasklist_lock ->
> > > pidmap_lock ordering got eliminated.  
> >
> > Not disabling interrupts may make things worse.
> > It is a trade off between 'interrupt latency' and 'lock hold time'.
> >
> > If interrupts are disabled then (clearly) they can get delayed because
> > the lock is held.
> > Provided the lock is only held for a short time it probably doesn't matter.
> > Indeed, unless it is the worst one, it probably doesn't matter at all.
> > After all spin locks shouldn't really be held for significant periods.
> >
> > OTOH if the lock doesn't disable interrupts then an interrupt will
> > increase the length of time a lock is held for.
> > This can be significant - and I mean upwards of 1ms.
> > Network interrupts can take a while - and then the work that is deferred
> > to 'softint' context happens as well (I don't think a spinlock stops
> > the softint code).
> >
> > I've a feeling that unless a spin lock is held for 'far longer than one
> > should ever be held for' then you really want to disable interrupts.
> >  
> 
> Note that taking the interrupt trip increases single-threaded overhead.

I'm not sure what you mean.
Disabling interrupts isn't as cheap as it ought to be, but probably isn't
that bad.

> Per your own description, if the lock is contested and interrupts are
> disabled, handling them also gets delayed by CPUs which are busy just
> waiting (and which would otherwise take care of them).

The slow path for spin locks ought to have interrupts enabled.
But, in any case, the interrupt is only delayed for the short time the spin
lock is held for.

> So while this is indeed a tradeoff, as I understand the sane default
> is to *not* disable interrupts unless necessary.

I beg to differ.

If an interrupt is taken by a cpu that holds a spin lock then any other
cpu that attempts to acquire the lock spins for the additional time that
the interrupt takes.
If interrupts are disabled then an interrupt is delayed for the time
that the spin lock is held.

The execution time of an interrupt can easily be a lot longer than
most spin locks are held for.
Either because of the work that an ethernet interrupt does (even before
deferring to softint), or because even a single PCIe read (eg to an
Altera Cyclone V fpga) can easily take 1000s of clocks.

Now the execution cost of the interrupt has to happen some time.
But you really don't want multiple cpu spinning waiting for it to finish.

	David




* Re: [PATCH v3 6/6] pid: drop irq disablement around pidmap_lock
  2025-02-01 21:51       ` David Laight
@ 2025-02-01 22:00         ` Matthew Wilcox
  2025-02-02 13:55           ` David Laight
  0 siblings, 1 reply; 33+ messages in thread
From: Matthew Wilcox @ 2025-02-01 22:00 UTC (permalink / raw)
  To: David Laight
  Cc: Mateusz Guzik, ebiederm, oleg, brauner, akpm, linux-mm, linux-kernel

On Sat, Feb 01, 2025 at 09:51:05PM +0000, David Laight wrote:
> I'm not sure what you mean.
> Disabling interrupts isn't as cheap as it ought to be, but probably isn't
> that bad.

Time it.  You'll see.

> > So while this is indeed a tradeoff, as I understand the sane default
> > is to *not* disable interrupts unless necessary.
> 
> I beg to differ.

You're wrong.  It is utterly standard to take spinlocks without
disabling IRQs.  We do it all over the kernel.  If you think that needs
to change, then make your case, don't throw a driveby review.

And I don't mean by arguing.  Make a change, measure the difference.




* Re: [PATCH v3 6/6] pid: drop irq disablement around pidmap_lock
  2025-02-01 22:00         ` Matthew Wilcox
@ 2025-02-02 13:55           ` David Laight
  2025-02-02 19:34             ` Mateusz Guzik
  0 siblings, 1 reply; 33+ messages in thread
From: David Laight @ 2025-02-02 13:55 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Mateusz Guzik, ebiederm, oleg, brauner, akpm, linux-mm,
	linux-kernel, netdev

On Sat, 1 Feb 2025 22:00:06 +0000
Matthew Wilcox <willy@infradead.org> wrote:

> On Sat, Feb 01, 2025 at 09:51:05PM +0000, David Laight wrote:
> > I'm not sure what you mean.
> > Disabling interrupts isn't as cheap as it ought to be, but probably isn't
> > that bad.  
> 
> Time it.  You'll see.

The best scheme I've seen is to just increment a per-cpu value.
Let the interrupt happen, notice it isn't allowed and return with
interrupts disabled.
Then re-issue the interrupt when the count is decremented to zero.
Easy with level sensitive interrupts.
But I don't think Linux ever uses that scheme.


> > > So while this is indeed a tradeoff, as I understand the sane default
> > > is to *not* disable interrupts unless necessary.  
> > 
> > I beg to differ.
> 
> You're wrong.  It is utterly standard to take spinlocks without
> disabling IRQs.  We do it all over the kernel.  If you think that needs
> to change, then make your case, don't throw a driveby review.
> 
> And I don't mean by arguing.  Make a change, measure the difference.

The analysis was done on some userspace code that basically does:
	for (;;) {
		pthread_mutex_enter(lock);
		item = get_head(list);
		if (!item)
			break;
		pthread_mutex_exit(lock);
		process(item);
	}
For the test there were about 10000 items on the list and 30 threads
processing it (that was the target of the tests).
The entire list needs to be processed in 10ms (RTP audio).
There was a bit more code with the mutex held, but only 100 or so
instructions.
Mostly it works fine, some threads get delayed by interrupts (etc) but
the other threads carry on working and all the items get processed.

However sometimes an interrupt happens while the mutex is held.
In that case the other 29 threads get stuck waiting for the mutex.
No progress is made until the interrupt completes and it overruns
the 10ms period.

While this is a userspace test, the same thing will happen with
spin locks in the kernel.

In userspace you can't disable interrupts, but for kernel spinlocks
you can.

The problem is likely to show up as unexpected latency affecting
code with a hot mutex that is only held for short periods while
running a lot of network traffic.
That is also latency that affects all cpu at the same time.
The interrupt itself will always cause latency to one cpu.

Note that I also had to enable RFS, threaded NAPI and move the NAPI
threads to RT priorities to avoid lost packets.
The fix was to replace the linked list with an array and use atomic
increment to get the index of the item to process.

	David




* Re: [PATCH v3 6/6] pid: drop irq disablement around pidmap_lock
  2025-02-02 13:55           ` David Laight
@ 2025-02-02 19:34             ` Mateusz Guzik
  2025-02-02 20:44               ` David Laight
  0 siblings, 1 reply; 33+ messages in thread
From: Mateusz Guzik @ 2025-02-02 19:34 UTC (permalink / raw)
  To: David Laight
  Cc: Matthew Wilcox, ebiederm, oleg, brauner, akpm, linux-mm,
	linux-kernel, netdev

On Sun, Feb 2, 2025 at 2:55 PM David Laight
<david.laight.linux@gmail.com> wrote:
>
> On Sat, 1 Feb 2025 22:00:06 +0000
> Matthew Wilcox <willy@infradead.org> wrote:
>
> > On Sat, Feb 01, 2025 at 09:51:05PM +0000, David Laight wrote:
> > > I'm not sure what you mean.
> > > Disabling interrupts isn't as cheap as it ought to be, but probably isn't
> > > that bad.
> >
> > Time it.  You'll see.
>
> The best scheme I've seen is to just increment a per-cpu value.
> Let the interrupt happen, notice it isn't allowed and return with
> interrupts disabled.
> Then re-issue the interrupt when the count is decremented to zero.
> Easy with level sensitive interrupts.
> But I don't think Linux ever uses that scheme.
>

I presume you are talking about the splhigh/splx set of primitives
from Unix kernels.

While "entering" is indeed cheap, undoing the work still needs to be
atomic vs interrupts.

I see NetBSD uses local cmpxchg8b on the interrupt level and interrupt
mask, while the rest takes the irq trip.

The NetBSD solution is still going to be visibly slower than not
messing with any of it as spin_unlock on amd64 is merely a store of 0
and cmpxchg even without the lock prefix costs several clocks.

Maybe there is other hackery which could be done, but see below.

>
> > > > So while this is indeed a tradeoff, as I understand the sane default
> > > > is to *not* disable interrupts unless necessary.
> > >
> > > I beg to differ.
> >
> > You're wrong.  It is utterly standard to take spinlocks without
> > disabling IRQs.  We do it all over the kernel.  If you think that needs
> > to change, then make your case, don't throw a driveby review.
> >
> > And I don't mean by arguing.  Make a change, measure the difference.
>
> The analysis was done on some userspace code that basically does:
>         for (;;) {
>                 pthread_mutex_enter(lock);
>                 item = get_head(list);
>                 if (!item)
>                         break;
>                 pthread_mutex_exit(lock);
>                 process(item);
>         }
> For the test there were about 10000 items on the list and 30 threads
> processing it (that was the target of the tests).
> The entire list needs to be processed in 10ms (RTP audio).
> There was a bit more code with the mutex held, but only 100 or so
> instructions.
> Mostly it works fine, some threads get delayed by interrupts (etc) but
> the other threads carry on working and all the items get processed.
>
> However sometimes an interrupt happens while the mutex is held.
> In that case the other 29 threads get stuck waiting for the mutex.
> No progress is made until the interrupt completes and it overruns
> the 10ms period.
>
> While this is a userspace test, the same thing will happen with
> spin locks in the kernel.
>
> In userspace you can't disable interrupts, but for kernel spinlocks
> you can.
>
> The problem is likely to show up as unexpected latency affecting
> code with a hot mutex that is only held for short periods while
> running a lot of network traffic.
> That is also latency that affects all cpu at the same time.
> The interrupt itself will always cause latency to one cpu.
>

Nobody is denying there is potential that lock hold time will get
significantly extended if you get unlucky enough vs interrupts. It is
questioned whether defaulting to irqs off around lock-protected areas
is the right call.

As I noted in my previous e-mail the spin_lock_irq stuff disables
interrupts upfront and does not touch them afterwards even when
waiting for the lock to become free. Patching that up with queued
locks may be non-trivial, if at all possible. Thus contention on
irq-disabled locks *will* add latency to their handling unless this
gets addressed. Note maintaining forward progress guarantee in the
locking mechanism is non-negotiable, so punting to TAS or similar
unfair locks does not cut it.

This is on top of having to solve the overhead problem for taking the
trips (see earlier in the e-mail).

I would argue if the network stuff specifically is known to add
visible latency, then perhaps that's something to investigate.

Anyhow, as Willy said, you are welcome to code this up and demonstrate
it is better overall.

-- 
Mateusz Guzik <mjguzik gmail.com>



* Re: [PATCH v3 6/6] pid: drop irq disablement around pidmap_lock
  2025-02-02 19:34             ` Mateusz Guzik
@ 2025-02-02 20:44               ` David Laight
  2025-02-02 22:06                 ` Matthew Wilcox
  0 siblings, 1 reply; 33+ messages in thread
From: David Laight @ 2025-02-02 20:44 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: Matthew Wilcox, ebiederm, oleg, brauner, akpm, linux-mm,
	linux-kernel, netdev

On Sun, 2 Feb 2025 20:34:48 +0100
Mateusz Guzik <mjguzik@gmail.com> wrote:

> On Sun, Feb 2, 2025 at 2:55 PM David Laight
> <david.laight.linux@gmail.com> wrote:
> >
> > On Sat, 1 Feb 2025 22:00:06 +0000
> > Matthew Wilcox <willy@infradead.org> wrote:
> >  
> > > On Sat, Feb 01, 2025 at 09:51:05PM +0000, David Laight wrote:  
> > > > I'm not sure what you mean.
> > > > Disabling interrupts isn't as cheap as it ought to be, but probably isn't
> > > > that bad.  
> > >
> > > Time it.  You'll see.  
> >
> > The best scheme I've seen is to just increment a per-cpu value.
> > Let the interrupt happen, notice it isn't allowed and return with
> > interrupts disabled.
> > Then re-issue the interrupt when the count is decremented to zero.
> > Easy with level sensitive interrupts.
> > But I don't think Linux ever uses that scheme.
> >  
> 
> I presume you are talking about the splhigh/splx set of primitives
> from Unix kernels.
> 
> While "entering" is indeed cheap, undoing the work still needs to be
> atomic vs interrupts.
> 
> I see NetBSD uses local cmpxchg8b on the interrupt level and interrupt
> mask, while the rest takes the irq trip.
> 
> The NetBSD solution is still going to be visibly slower than not
> messing with any of it as spin_unlock on amd64 is merely a store of 0
> and cmpxchg even without the lock prefix costs several clocks.
> 
> Maybe there is other hackery which could be done, but see below.

I was thinking it might be possible to merge an 'interrupts disabled' count
with the existing 'pre-emption disabled' count.
IIRC (on x86 at least) this is just a per-cpu variable accessed from %fs/%gs.
So you add one to disable pre-emption and (say) 1<<16 to disable interrupts.
If an interrupt happens while the count is 'big' the count value is changed
so the last decrement of 1<<16 will set carry (or overflow), and a return
from interrupt is done that leaves interrupts disabled (traditionally easy).
The interrupt enable call just subtracts the 1<<16 and checks for carry (or
overflow); if not set all is fine, if set it needs to call something to
re-issue the interrupt - that is probably the hard bit.

> 
> >  
> > > > > So while this is indeed a tradeoff, as I understand the sane default
> > > > > is to *not* disable interrupts unless necessary.  
> > > >
> > > > I bet to differ.  
> > >
> > > You're wrong.  It is utterly standard to take spinlocks without
> > > disabling IRQs.  We do it all over the kernel.  If you think that needs
> > > to change, then make your case, don't throw a driveby review.
> > >
> > > And I don't mean by arguing.  Make a change, measure the difference.  
> >
> > The analysis was done on some userspace code that basically does:
> >         for (;;) {
> >                 pthread_mutex_enter(lock);
> >                 item = get_head(list);
> >                 if (!item)
> >                         break;
> >                 pthread_mutex_exit(lock);
> >                 process(item);
> >         }
> > For the test there were about 10000 items on the list and 30 threads
> > processing it (that was the target of the tests).
> > The entire list needs to be processed in 10ms (RTP audio).
> > There was a bit more code with the mutex held, but only 100 or so
> > instructions.
> > Mostly it works fine, some threads get delayed by interrupts (etc) but
> > the other threads carry on working and all the items get processed.
> >
> > However sometimes an interrupt happens while the mutex is held.
> > In that case the other 29 threads get stuck waiting for the mutex.
> > No progress is made until the interrupt completes and it overruns
> > the 10ms period.
> >
> > While this is a userspace test, the same thing will happen with
> > spin locks in the kernel.
> >
> > In userspace you can't disable interrupts, but for kernel spinlocks
> > you can.
> >
> > The problem is likely to show up as unexpected latency affecting
> > code with a hot mutex that is only held for short periods while
> > running a lot of network traffic.
> > That is also latency that affects all cpu at the same time.
> > The interrupt itself will always cause latency to one cpu.
> >  
> 
> Nobody is denying there is potential that lock hold time will get
> significantly extended if you get unlucky enough vs interrupts. It is
> questioned whether defaulting to irqs off around lock-protected areas
> is the right call.

I really commented because you were changing one lock which could
easily be 'hot' enough for there to be side effects, without even
a comment about any pitfalls.

	David

> 
> As I noted in my previous e-mail the spin_lock_irq stuff disables
> interrupts upfront and does not touch them afterwards even when
> waiting for the lock to become free. Patching that up with queued
> locks may be non-trivial, if at all possible. Thus contention on
> irq-disabled locks *will* add latency to their handling unless this
> gets addressed. Note maintaining forward progress guarantee in the
> locking mechanism is non-negotiable, so punting to TAS or similar
> unfair locks does not cut it.
> 
> This is on top of having to solve the overhead problem for taking the
> trips (see earlier in the e-mail).
> 
> I would argue if the network stuff specifically is known to add
> visible latency, then perhaps that's something to investigate.
> 
> Anyhow, as Willy said, you are welcome to code this up and demonstrate
> it is better overall.
> 




* Re: [PATCH v3 6/6] pid: drop irq disablement around pidmap_lock
  2025-02-02 20:44               ` David Laight
@ 2025-02-02 22:06                 ` Matthew Wilcox
  0 siblings, 0 replies; 33+ messages in thread
From: Matthew Wilcox @ 2025-02-02 22:06 UTC (permalink / raw)
  To: David Laight
  Cc: Mateusz Guzik, ebiederm, oleg, brauner, akpm, linux-mm,
	linux-kernel, netdev

On Sun, Feb 02, 2025 at 08:44:49PM +0000, David Laight wrote:
> I really commented because you were changing one lock which could
> easily be 'hot' enough for there to be side effects, without even
> a comment about any pitfalls.

No such comment is needed.  This is standard practice.  You're the one
arguing for changing stuff.  Have a different conversation separate from
this patch.



* Re: [PATCH v3 0/6] reduce tasklist_lock hold time on exit and do some
  2025-02-01 16:31 [PATCH v3 0/6] reduce tasklist_lock hold time on exit and do some Mateusz Guzik
                   ` (5 preceding siblings ...)
  2025-02-01 16:31 ` [PATCH v3 6/6] pid: drop irq disablement around pidmap_lock Mateusz Guzik
@ 2025-02-03 17:47 ` Oleg Nesterov
  6 siblings, 0 replies; 33+ messages in thread
From: Oleg Nesterov @ 2025-02-03 17:47 UTC (permalink / raw)
  To: Mateusz Guzik; +Cc: ebiederm, brauner, akpm, linux-mm, linux-kernel

On 02/01, Mateusz Guzik wrote:
>
> Mateusz Guzik (6):
>   exit: perform add_device_randomness() without tasklist_lock
>   exit: hoist get_pid() in release_task() outside of tasklist_lock
>   exit: postpone tty_kref_put() until after tasklist_lock is dropped
>   pid: sprinkle tasklist_lock asserts
>   pid: perform free_pid() calls outside of tasklist_lock
>   pid: drop irq disablement around pidmap_lock
>
>  include/linux/pid.h |  7 ++--
>  kernel/exit.c       | 45 +++++++++++++++----------
>  kernel/pid.c        | 82 +++++++++++++++++++++++++--------------------
>  kernel/sys.c        | 14 +++++---
>  4 files changed, 86 insertions(+), 62 deletions(-)

I see nothing wrong in this series, feel free to add

Reviewed-by: Oleg Nesterov <oleg@redhat.com>

I have a couple of very minor nits, will send in a minute...

Oleg.




* Re: [PATCH v3 1/6] exit: perform add_device_randomness() without tasklist_lock
  2025-02-01 16:31 ` [PATCH v3 1/6] exit: perform add_device_randomness() without tasklist_lock Mateusz Guzik
@ 2025-02-03 17:51   ` Oleg Nesterov
  2025-02-03 17:55     ` Mateusz Guzik
  0 siblings, 1 reply; 33+ messages in thread
From: Oleg Nesterov @ 2025-02-03 17:51 UTC (permalink / raw)
  To: Mateusz Guzik; +Cc: ebiederm, brauner, akpm, linux-mm, linux-kernel

On 02/01, Mateusz Guzik wrote:
>
> +	add_device_randomness((const void*) &p->se.sum_exec_runtime,
> +			      sizeof(unsigned long long));

I won't insist, but do we really need to keep this ugly "(const void*)"
typecast? and perhaps sizeof(p->se.sum_exec_runtime) will look a bit
better?

Oleg.




* Re: [PATCH v3 1/6] exit: perform add_device_randomness() without tasklist_lock
  2025-02-03 17:51   ` Oleg Nesterov
@ 2025-02-03 17:55     ` Mateusz Guzik
  0 siblings, 0 replies; 33+ messages in thread
From: Mateusz Guzik @ 2025-02-03 17:55 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: ebiederm, brauner, akpm, linux-mm, linux-kernel

On Mon, Feb 3, 2025 at 6:51 PM Oleg Nesterov <oleg@redhat.com> wrote:
>
> On 02/01, Mateusz Guzik wrote:
> >
> > +     add_device_randomness((const void*) &p->se.sum_exec_runtime,
> > +                           sizeof(unsigned long long));
>
> I won't insist, but do we really need to keep this ugly "(const void*)"
> typecast? and perhaps sizeof(p->se.sum_exec_runtime) will look a bit
> better?
>

Thanks for the review, I'm going to wait for Eric to chime in before
doing anything here.

As for the thing at hand, I tried to avoid messing with it in order to
reduce discussion. :->

However, we seem to be in agreement this should be augmented, so I'm
going to do it in v4.

-- 
Mateusz Guzik <mjguzik gmail.com>



* Re: [PATCH v3 3/6] exit: postpone tty_kref_put() until after tasklist_lock is dropped
  2025-02-01 16:31 ` [PATCH v3 3/6] exit: postpone tty_kref_put() until after tasklist_lock is dropped Mateusz Guzik
@ 2025-02-03 18:06   ` Oleg Nesterov
  2025-02-03 19:33     ` Mateusz Guzik
  0 siblings, 1 reply; 33+ messages in thread
From: Oleg Nesterov @ 2025-02-03 18:06 UTC (permalink / raw)
  To: Mateusz Guzik; +Cc: ebiederm, brauner, akpm, linux-mm, linux-kernel

On 02/01, Mateusz Guzik wrote:
>
> Instead of smuggling the tty pointer directly, use a struct so that more
> things can be added later.

I am not sure this particular change worth the effort, but I won't argue.
I'd like to know what Eric thinks.




OTOH, if we do this, then perhaps we can do more "call tty_kref_put()
lockless" changes later. And perhaps even add the new

	void tty_kref_put_sync(struct tty_struct *tty)
	{
		if (tty)
			kref_put(&tty->kref, release_one_tty);
	}

helper. With this change release_task() doesn't need to abuse
schedule_work(), and this helper can have more users.

Nevermind, this is almost off-topic.

Oleg.




* Re: [PATCH v3 5/6] pid: perform free_pid() calls outside of tasklist_lock
  2025-02-01 16:31 ` [PATCH v3 5/6] pid: perform free_pid() calls outside of tasklist_lock Mateusz Guzik
@ 2025-02-03 18:49   ` Oleg Nesterov
  2025-02-03 19:31     ` Mateusz Guzik
  0 siblings, 1 reply; 33+ messages in thread
From: Oleg Nesterov @ 2025-02-03 18:49 UTC (permalink / raw)
  To: Mateusz Guzik; +Cc: ebiederm, brauner, akpm, linux-mm, linux-kernel

To avoid the confusion, my question and a note are absolutely offtopic.

On 02/01, Mateusz Guzik wrote:
>
> @@ -1085,6 +1085,7 @@ SYSCALL_DEFINE2(setpgid, pid_t, pid, pid_t, pgid)
>  {
>  	struct task_struct *p;
>  	struct task_struct *group_leader = current->group_leader;
> +	struct pid *pids[PIDTYPE_MAX] = { 0 };

Could you remind me why

	struct pid *pids[PIDTYPE_MAX] = {};

is not right? I seem to knew it some time before, but can't recall...

And just for the record, a note for myself. I need to recheck, but it seems
that the comment above flush_sigqueue(&tsk->pending) in __exit_signal() is
obsolete, so with some minimal changes we can move more work outside of
tasklist.

Oleg.




* Re: [PATCH v3 2/6] exit: hoist get_pid() in release_task() outside of tasklist_lock
  2025-02-01 16:31 ` [PATCH v3 2/6] exit: hoist get_pid() in release_task() outside of tasklist_lock Mateusz Guzik
@ 2025-02-03 19:27   ` Liam R. Howlett
  2025-02-03 19:35     ` Mateusz Guzik
  0 siblings, 1 reply; 33+ messages in thread
From: Liam R. Howlett @ 2025-02-03 19:27 UTC (permalink / raw)
  To: Mateusz Guzik; +Cc: ebiederm, oleg, brauner, akpm, linux-mm, linux-kernel

* Mateusz Guzik <mjguzik@gmail.com> [250201 11:31]:
> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>

Change log is a bit sparse?  I get that the subject spells out what is
done, but anything to say here at all?  Reduces lock contention by
reducing lock time or something?  Maybe even some numbers you have in
the cover letter?

> ---
>  kernel/exit.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 1eb2e7d36ce4..257dd8ed45ea 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -248,9 +248,10 @@ void release_task(struct task_struct *p)
>  
>  	cgroup_release(p);
>  
> +	thread_pid = get_pid(p->thread_pid);
> +
>  	write_lock_irq(&tasklist_lock);
>  	ptrace_release_task(p);
> -	thread_pid = get_pid(p->thread_pid);
>  	__exit_signal(p);
>  
>  	/*
> -- 
> 2.43.0
> 
> 



* Re: [PATCH v3 5/6] pid: perform free_pid() calls outside of tasklist_lock
  2025-02-03 18:49   ` Oleg Nesterov
@ 2025-02-03 19:31     ` Mateusz Guzik
  2025-02-03 20:02       ` Oleg Nesterov
  0 siblings, 1 reply; 33+ messages in thread
From: Mateusz Guzik @ 2025-02-03 19:31 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: ebiederm, brauner, akpm, linux-mm, linux-kernel

On Mon, Feb 3, 2025 at 7:49 PM Oleg Nesterov <oleg@redhat.com> wrote:
>
> To avoid the confusion, my question and a note are absolutely offtopic.
>
> On 02/01, Mateusz Guzik wrote:
> >
> > @@ -1085,6 +1085,7 @@ SYSCALL_DEFINE2(setpgid, pid_t, pid, pid_t, pgid)
> >  {
> >       struct task_struct *p;
> >       struct task_struct *group_leader = current->group_leader;
> > +     struct pid *pids[PIDTYPE_MAX] = { 0 };
>
> Could you remind me why
>
>         struct pid *pids[PIDTYPE_MAX] = {};
>
> is not right? I seem to knew it some time before, but can't recall...
>

as far as I know this is merely a matter of convention, fwiw a naive grep:
$ git grep '= { }' | wc -l
1273
$ git grep '= { 0 }' | wc -l
2031
$ git grep '= {}' | wc -l
5626

that said I can change it to whatever you see fit

-- 
Mateusz Guzik <mjguzik gmail.com>



* Re: [PATCH v3 3/6] exit: postpone tty_kref_put() until after tasklist_lock is dropped
  2025-02-03 18:06   ` Oleg Nesterov
@ 2025-02-03 19:33     ` Mateusz Guzik
  2025-02-04 11:22       ` Oleg Nesterov
  0 siblings, 1 reply; 33+ messages in thread
From: Mateusz Guzik @ 2025-02-03 19:33 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: ebiederm, brauner, akpm, linux-mm, linux-kernel

On Mon, Feb 3, 2025 at 7:07 PM Oleg Nesterov <oleg@redhat.com> wrote:
>
> On 02/01, Mateusz Guzik wrote:
> >
> > Instead of smuggling the tty pointer directly, use a struct so that more
> > things can be added later.
>
> I am not sure this particular change worth the effort, but I won't argue.
> I'd like to know what Eric thinks.
>

it trivially whacks an atomic from an area protected by a global lock

> OTOH, if we do this, then perhaps we can do more "call tty_kref_put()
> lockless" changes later. And perhaps even add the new
>
>         void tty_kref_put_sync(struct tty_struct *tty)
>         {
>                 if (tty)
>                         kref_put(&tty->kref, release_one_tty);
>         }
>
> helper. With this change release_task() doesn't need to abuse
> schedule_work(), and this helper can have more users.
>
> Nevermind, this is almost off-topic.
>

I have no interest in messing with ttys, regardless I agree this is
definitely something to consider *after* this patch is sorted out.

-- 
Mateusz Guzik <mjguzik gmail.com>



* Re: [PATCH v3 2/6] exit: hoist get_pid() in release_task() outside of tasklist_lock
  2025-02-03 19:27   ` Liam R. Howlett
@ 2025-02-03 19:35     ` Mateusz Guzik
  2025-02-03 20:13       ` Liam R. Howlett
  0 siblings, 1 reply; 33+ messages in thread
From: Mateusz Guzik @ 2025-02-03 19:35 UTC (permalink / raw)
  To: Liam R. Howlett, Mateusz Guzik, ebiederm, oleg, brauner, akpm,
	linux-mm, linux-kernel

On Mon, Feb 3, 2025 at 8:27 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> * Mateusz Guzik <mjguzik@gmail.com> [250201 11:31]:
> > Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
>
> Change log is a bit sparse?  I get that the subject spells out what is
> done, but anything to say here at all?  Reduces lock contention by
> reducing lock time or something?  Maybe even some numbers you have in
> the cover letter?
>

I did not measure this bit *specifically*.

As one can expect get_pid issues an atomic and that's slow. And since
it can happen *prior* to taking the global lock it seems like an
obvious little nit to sort out.

I would argue the change is self-explanatory given the cover-letter.

-- 
Mateusz Guzik <mjguzik gmail.com>



* Re: [PATCH v3 5/6] pid: perform free_pid() calls outside of tasklist_lock
  2025-02-03 19:31     ` Mateusz Guzik
@ 2025-02-03 20:02       ` Oleg Nesterov
  0 siblings, 0 replies; 33+ messages in thread
From: Oleg Nesterov @ 2025-02-03 20:02 UTC (permalink / raw)
  To: Mateusz Guzik; +Cc: ebiederm, brauner, akpm, linux-mm, linux-kernel

On 02/03, Mateusz Guzik wrote:
>
> On Mon, Feb 3, 2025 at 7:49 PM Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > To avoid the confusion, my question and a note are absolutely offtopic.
> >
> > On 02/01, Mateusz Guzik wrote:
> > >
> > > @@ -1085,6 +1085,7 @@ SYSCALL_DEFINE2(setpgid, pid_t, pid, pid_t, pgid)
> > >  {
> > >       struct task_struct *p;
> > >       struct task_struct *group_leader = current->group_leader;
> > > +     struct pid *pids[PIDTYPE_MAX] = { 0 };
> >
> > Could you remind me why
> >
> >         struct pid *pids[PIDTYPE_MAX] = {};
> >
> > is not right? I seem to knew it some time before, but can't recall...
> >
> 
> as far as I know this is merely a matter of convention, fwiw a naive grep:
> $ git grep '= { }' | wc -l
> 1273
> $ git grep '= { 0 }' | wc -l
> 2031
> $ git grep '= {}' | wc -l
> 5626

OK... but I still have a vague feeling that it was explained somewhere
that '{ 0 }' should be preferred... Nevermind.

> that said i can change it to whatever you see fit

No, please do whatever looks better to you.

Oleg.
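[For reference, the two spellings under discussion behave identically here: C11 6.7.9p21 says that when a brace-enclosed list provides fewer initializers than the aggregate has elements, the remainder is zero-initialized, so "{ 0 }" and "{ NULL }" both null the whole array; "{}" does the same but was a GNU extension until C23. A minimal, compilable check — PIDTYPE_MAX below is a stand-in for the kernel's enum pid_type value:]

```c
#include <stddef.h>

#define PIDTYPE_MAX 4	/* illustrative stand-in, not the kernel's value */

/* Returns 1 if both partially-initialized arrays come out fully nulled,
 * demonstrating that "{ 0 }" and "{ NULL }" are equivalent here. */
static int initializers_equivalent(void)
{
	void *a[PIDTYPE_MAX] = { 0 };
	void *b[PIDTYPE_MAX] = { NULL };

	for (size_t i = 0; i < PIDTYPE_MAX; i++)
		if (a[i] != NULL || b[i] != NULL)
			return 0;
	return 1;
}
```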




* Re: [PATCH v3 2/6] exit: hoist get_pid() in release_task() outside of tasklist_lock
  2025-02-03 19:35     ` Mateusz Guzik
@ 2025-02-03 20:13       ` Liam R. Howlett
  2025-02-03 20:22         ` Mateusz Guzik
  0 siblings, 1 reply; 33+ messages in thread
From: Liam R. Howlett @ 2025-02-03 20:13 UTC (permalink / raw)
  To: Mateusz Guzik; +Cc: ebiederm, oleg, brauner, akpm, linux-mm, linux-kernel

* Mateusz Guzik <mjguzik@gmail.com> [250203 14:36]:
> On Mon, Feb 3, 2025 at 8:27 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> >
> > * Mateusz Guzik <mjguzik@gmail.com> [250201 11:31]:
> > > Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
> >
> > Change log is a bit sparse?  I get that the subject spells out what is
> > done, but anything to say here at all?  Reduces lock contention by
> > reducing lock time or something?  Maybe even some numbers you have in
> > the cover letter?
> >
> 
> I did not measure this bit *specifically*.
> 
> As one can expect get_pid issues an atomic and that's slow. And since
> it can happen *prior* to taking the global lock it seems like an
> obvious little nit to sort out.
> 
> I would argue the change is self-explanatory given the cover-letter.

But when you git blame on the file, you will not see that cover letter.

> 
> -- 
> Mateusz Guzik <mjguzik gmail.com>



* Re: [PATCH v3 2/6] exit: hoist get_pid() in release_task() outside of tasklist_lock
  2025-02-03 20:13       ` Liam R. Howlett
@ 2025-02-03 20:22         ` Mateusz Guzik
  2025-02-03 20:28           ` Liam R. Howlett
  2025-02-04  1:51           ` Andrew Morton
  0 siblings, 2 replies; 33+ messages in thread
From: Mateusz Guzik @ 2025-02-03 20:22 UTC (permalink / raw)
  To: Liam R. Howlett, Mateusz Guzik, ebiederm, oleg, brauner, akpm,
	linux-mm, linux-kernel

On Mon, Feb 3, 2025 at 9:14 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> * Mateusz Guzik <mjguzik@gmail.com> [250203 14:36]:
> > On Mon, Feb 3, 2025 at 8:27 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> > >
> > > * Mateusz Guzik <mjguzik@gmail.com> [250201 11:31]:
> > > > Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
> > >
> > > Change log is a bit sparse?  I get that the subject spells out what is
> > > done, but anything to say here at all?  Reduces lock contention by
> > > reducing lock time or something?  Maybe even some numbers you have in
> > > the cover letter?
> > >
> >
> > I did not measure this bit *specifically*.
> >
> > As one can expect get_pid issues an atomic and that's slow. And since
> > it can happen *prior* to taking the global lock it seems like an
> > obvious little nit to sort out.
> >
> > I would argue the change is self-explanatory given the cover-letter.
>
> But when you git blame on the file, you will not see that cover letter.

if this lands, I presume it is going to go through Andrew who uses
tooling pulling in the cover letter for each commit

but i'm not going to argue this bit, just provide with a commit
message which you think works and I'll use it

-- 
Mateusz Guzik <mjguzik gmail.com>



* Re: [PATCH v3 2/6] exit: hoist get_pid() in release_task() outside of tasklist_lock
  2025-02-03 20:22         ` Mateusz Guzik
@ 2025-02-03 20:28           ` Liam R. Howlett
  2025-02-04  1:51           ` Andrew Morton
  1 sibling, 0 replies; 33+ messages in thread
From: Liam R. Howlett @ 2025-02-03 20:28 UTC (permalink / raw)
  To: Mateusz Guzik; +Cc: ebiederm, oleg, brauner, akpm, linux-mm, linux-kernel

* Mateusz Guzik <mjguzik@gmail.com> [250203 15:22]:
> On Mon, Feb 3, 2025 at 9:14 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> >
> > * Mateusz Guzik <mjguzik@gmail.com> [250203 14:36]:
> > > On Mon, Feb 3, 2025 at 8:27 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> > > >
> > > > * Mateusz Guzik <mjguzik@gmail.com> [250201 11:31]:
> > > > > Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
> > > >
> > > > Change log is a bit sparse?  I get that the subject spells out what is
> > > > done, but anything to say here at all?  Reduces lock contention by
> > > > reducing lock time or something?  Maybe even some numbers you have in
> > > > the cover letter?
> > > >
> > >
> > > I did not measure this bit *specifically*.
> > >
> > > As one can expect get_pid issues an atomic and that's slow. And since
> > > it can happen *prior* to taking the global lock it seems like an
> > > obvious little nit to sort out.
> > >
> > > I would argue the change is self-explanatory given the cover-letter.
> >
> > But when you git blame on the file, you will not see that cover letter.
> 
> if this lands, I presume it is going to go through Andrew who uses
> tooling pulling in the cover letter for each commit
> 
> but i'm not going to argue this bit, just provide with a commit
> message which you think works and I'll use it

Your comment above states why this is beneficial.  Why not use a
variation of your statement of get_pid issuing an atomic which can be
removed from the locking area?  This seems like an important detail to
capture.

> 
> -- 
> Mateusz Guzik <mjguzik gmail.com>



* Re: [PATCH v3 2/6] exit: hoist get_pid() in release_task() outside of tasklist_lock
  2025-02-03 20:22         ` Mateusz Guzik
  2025-02-03 20:28           ` Liam R. Howlett
@ 2025-02-04  1:51           ` Andrew Morton
  1 sibling, 0 replies; 33+ messages in thread
From: Andrew Morton @ 2025-02-04  1:51 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: Liam R. Howlett, ebiederm, oleg, brauner, linux-mm, linux-kernel

On Mon, 3 Feb 2025 21:22:31 +0100 Mateusz Guzik <mjguzik@gmail.com> wrote:

> On Mon, Feb 3, 2025 at 9:14 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> >
> > * Mateusz Guzik <mjguzik@gmail.com> [250203 14:36]:
> > > On Mon, Feb 3, 2025 at 8:27 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> > > >
> > > > * Mateusz Guzik <mjguzik@gmail.com> [250201 11:31]:
> > > > > Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
> > > >
> > > > Change log is a bit sparse?  I get that the subject spells out what is
> > > > done, but anything to say here at all?  Reduces lock contention by
> > > > reducing lock time or something?  Maybe even some numbers you have in
> > > > the cover letter?
> > > >
> > >
> > > I did not measure this bit *specifically*.
> > >
> > > As one can expect get_pid issues an atomic and that's slow. And since
> > > it can happen *prior* to taking the global lock it seems like an
> > > obvious little nit to sort out.
> > >
> > > I would argue the change is self-explanatory given the cover-letter.
> >
> > But when you git blame on the file, you will not see that cover letter.
> 
> if this lands, I presume it is going to go through Andrew who uses
> tooling pulling in the cover letter for each commit

No, I copy the [0/n] description into the [1/n] changelog only.

> but i'm not going to argue this bit, just provide with a commit
> message which you think works and I'll use it

It's rather irregular to ask Liam to explain your patch!

The Subject: describes what the patch does (which was obvious from the
code anyway) but the changelog fails to explain *why* the change was
made.  You've explained "why" adequately within this discussion so I
suggest you condense those words into this patch's changelog.


General muse: if a reviewer asks questions regarding a patch then we
should treat those questions as bug reports against the changelogs and
code comments: required information is missing.  So please let's go
through reviewer questions as we prepare the next revision of a
patchset and make sure that all those questions are fully answered,



* Re: [PATCH v3 3/6] exit: postpone tty_kref_put() until after tasklist_lock is dropped
  2025-02-03 19:33     ` Mateusz Guzik
@ 2025-02-04 11:22       ` Oleg Nesterov
  2025-02-04 12:12         ` Mateusz Guzik
  0 siblings, 1 reply; 33+ messages in thread
From: Oleg Nesterov @ 2025-02-04 11:22 UTC (permalink / raw)
  To: Mateusz Guzik; +Cc: ebiederm, brauner, akpm, linux-mm, linux-kernel

On 02/03, Mateusz Guzik wrote:
>
> On Mon, Feb 3, 2025 at 7:07 PM Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > On 02/01, Mateusz Guzik wrote:
> > >
> > > Instead of smuggling the tty pointer directly, use a struct so that more
> > > things can be added later.
> >
> > I am not sure this particular change worth the effort, but I won't argue.
> > I'd like to know what Eric thinks.
> >
>
> it trivially whacks an atomic from an area protected by a global lock

Yes, but I am not sure this can make a noticeable difference... Let
me repeat, I am not really arguing, I leave this to you and Eric.
The patch looks correct and I have already acked it. Just it looks
"less obvious" to me than other changes in this series.

And even if we do this, I am not sure we need the new
"struct release_task_post", it seems we can simply move
tty_kref_put() from __exit_signal() to release_task(), see the
patch at the end.

Nobody can touch signal->tty after the group leader passes __exit_signal(),
if nothing else lock_task_sighand() can't succeed.

But again, feel free to ignore.

Oleg.

--- x/kernel/exit.c
+++ x/kernel/exit.c
@@ -146,7 +146,6 @@ static void __exit_signal(struct task_st
 	struct signal_struct *sig = tsk->signal;
 	bool group_dead = thread_group_leader(tsk);
 	struct sighand_struct *sighand;
-	struct tty_struct *tty;
 	u64 utime, stime;
 
 	sighand = rcu_dereference_check(tsk->sighand,
@@ -159,10 +158,7 @@ static void __exit_signal(struct task_st
 		posix_cpu_timers_exit_group(tsk);
 #endif
 
-	if (group_dead) {
-		tty = sig->tty;
-		sig->tty = NULL;
-	} else {
+	if (!group_dead) {
 		/*
 		 * If there is any task waiting for the group exit
 		 * then notify it:
@@ -210,10 +206,8 @@ static void __exit_signal(struct task_st
 
 	__cleanup_sighand(sighand);
 	clear_tsk_thread_flag(tsk, TIF_SIGPENDING);
-	if (group_dead) {
+	if (group_dead)
 		flush_sigqueue(&sig->shared_pending);
-		tty_kref_put(tty);
-	}
 }
 
 static void delayed_put_task_struct(struct rcu_head *rhp)
@@ -279,6 +273,10 @@ repeat:
 	proc_flush_pid(thread_pid);
 	put_pid(thread_pid);
 	release_thread(p);
+
+	if (p == leader)
+		tty_kref_put(p->signal->tty);
+
 	put_task_struct_rcu_user(p);
 
 	p = leader;




* Re: [PATCH v3 3/6] exit: postpone tty_kref_put() until after tasklist_lock is dropped
  2025-02-04 11:22       ` Oleg Nesterov
@ 2025-02-04 12:12         ` Mateusz Guzik
  0 siblings, 0 replies; 33+ messages in thread
From: Mateusz Guzik @ 2025-02-04 12:12 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: ebiederm, brauner, akpm, linux-mm, linux-kernel

On Tue, Feb 4, 2025 at 12:23 PM Oleg Nesterov <oleg@redhat.com> wrote:
>
> On 02/03, Mateusz Guzik wrote:
> >
> > On Mon, Feb 3, 2025 at 7:07 PM Oleg Nesterov <oleg@redhat.com> wrote:
> > >
> > > On 02/01, Mateusz Guzik wrote:
> > > >
> > > > Instead of smuggling the tty pointer directly, use a struct so that more
> > > > things can be added later.
> > >
> > > I am not sure this particular change worth the effort, but I won't argue.
> > > I'd like to know what Eric thinks.
> > >
> >
> > it trivially whacks an atomic from an area protected by a global lock
>
> Yes, but I am not sure this can make a noticeable difference... Let
> me repeat, I am not really arguing, I leave this to you and Eric.
> The patch looks correct and I have already acked it. Just it looks
> "less obvious" to me than other changes in this series.

Reducing lock hold time is basic multicore hygiene, doubly so for global locks.

Atomics are notoriously slow even if cache-hot.

I agree this change in isolation is not a big deal, but there are
several other "not a big deal"s in the area which will add up if
sorted out.

Given the triviality of the change as far as complexity goes, I would
argue that's a routine patch not warranting much of a discussion
(modulo the new struct, for that see below).

>
> And even if we do this, I am not sure we need the new
> "struct release_task_post", it seems we can simply move
> tty_kref_put() from __exit_signal() to release_task(), see the
> patch at the end.
>
> Nobody can touch signal->tty after the group leader passes __exit_signal(),
> if nothing else lock_task_sighand() can't succeed.
>

I wanted to keep the "whack the tty" logic intact in order to reduce any discussion. ;)

This brings me to the release_task_post struct.

Suppose the tty thing does not get patched at all or gets patched the
way you are proposing here.

I still need a way to handle the pidmap stuff. Suppose to that end a
pid array gets passed directly.

Some time later another thing might pop up which gets determined
within the lock and wants to return the state to clean up after the
unlock. Then someone will need to add a mechanism of the same sort I'm
adding here. iow I'm future-proofing this and tty just happens to be
the first (even if not necessary) consumer. iow preferably the struct
(or an equivalent) would still be there without tty stuff.
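[The pattern being argued for can be illustrated with a toy userspace version. All names here (release_post, tasklist_lock, NPIDS) merely echo the discussion and are not the real kernel definitions: the point is that work done under the lock only records what needs cleaning up, and the cleanup itself runs after the unlock.]

```c
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define NPIDS 4

struct pid { int nr; };

/* State gathered under the lock, consumed after it is dropped. */
struct release_post {
	struct pid *pids[NPIDS];
};

static pthread_mutex_t tasklist_lock = PTHREAD_MUTEX_INITIALIZER;

static void release_task(struct pid **attached, struct release_post *post)
{
	memset(post, 0, sizeof(*post));

	pthread_mutex_lock(&tasklist_lock);
	/* "detach": unhook the pids and stash them, do not free here */
	for (int i = 0; i < NPIDS; i++) {
		post->pids[i] = attached[i];
		attached[i] = NULL;
	}
	pthread_mutex_unlock(&tasklist_lock);

	/* lock dropped: now perform the comparatively slow frees */
	for (int i = 0; i < NPIDS; i++)
		free(post->pids[i]);
}
```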

Given the nature of the change there is a lot of opinionated
handwaving possible and "happy to change" vs "feel free to ignore" is
not a productive discussion, thus I would like to clarify: my primary
interest is to whack the pidmap thing out of tasklist_lock-protected
area and this bit I'm going to argue about. Everything else is minor
touch ups which I can drop without much resistance (I'll note though
there is more I can submit in the area later :P), but if you genuinely
want something changed, it would be best to explicitly say it.

As is, given the prior ack, I intend to submit v4 with minor touch ups.

> But again, feel free to ignore.
>
> Oleg.
>
> --- x/kernel/exit.c
> +++ x/kernel/exit.c
> @@ -146,7 +146,6 @@ static void __exit_signal(struct task_st
>         struct signal_struct *sig = tsk->signal;
>         bool group_dead = thread_group_leader(tsk);
>         struct sighand_struct *sighand;
> -       struct tty_struct *tty;
>         u64 utime, stime;
>
>         sighand = rcu_dereference_check(tsk->sighand,
> @@ -159,10 +158,7 @@ static void __exit_signal(struct task_st
>                 posix_cpu_timers_exit_group(tsk);
>  #endif
>
> -       if (group_dead) {
> -               tty = sig->tty;
> -               sig->tty = NULL;
> -       } else {
> +       if (!group_dead) {
>                 /*
>                  * If there is any task waiting for the group exit
>                  * then notify it:
> @@ -210,10 +206,8 @@ static void __exit_signal(struct task_st
>
>         __cleanup_sighand(sighand);
>         clear_tsk_thread_flag(tsk, TIF_SIGPENDING);
> -       if (group_dead) {
> +       if (group_dead)
>                 flush_sigqueue(&sig->shared_pending);
> -               tty_kref_put(tty);
> -       }
>  }
>
>  static void delayed_put_task_struct(struct rcu_head *rhp)
> @@ -279,6 +273,10 @@ repeat:
>         proc_flush_pid(thread_pid);
>         put_pid(thread_pid);
>         release_thread(p);
> +
> +       if (p == leader)
> +               tty_kref_put(p->signal->tty);
> +
>         put_task_struct_rcu_user(p);
>
>         p = leader;
>


-- 
Mateusz Guzik <mjguzik gmail.com>



end of thread, other threads:[~2025-02-04 12:13 UTC | newest]

Thread overview: 33+ messages
2025-02-01 16:31 [PATCH v3 0/6] reduce tasklist_lock hold time on exit and do some Mateusz Guzik
2025-02-01 16:31 ` [PATCH v3 1/6] exit: perform add_device_randomness() without tasklist_lock Mateusz Guzik
2025-02-03 17:51   ` Oleg Nesterov
2025-02-03 17:55     ` Mateusz Guzik
2025-02-01 16:31 ` [PATCH v3 2/6] exit: hoist get_pid() in release_task() outside of tasklist_lock Mateusz Guzik
2025-02-03 19:27   ` Liam R. Howlett
2025-02-03 19:35     ` Mateusz Guzik
2025-02-03 20:13       ` Liam R. Howlett
2025-02-03 20:22         ` Mateusz Guzik
2025-02-03 20:28           ` Liam R. Howlett
2025-02-04  1:51           ` Andrew Morton
2025-02-01 16:31 ` [PATCH v3 3/6] exit: postpone tty_kref_put() until after tasklist_lock is dropped Mateusz Guzik
2025-02-03 18:06   ` Oleg Nesterov
2025-02-03 19:33     ` Mateusz Guzik
2025-02-04 11:22       ` Oleg Nesterov
2025-02-04 12:12         ` Mateusz Guzik
2025-02-01 16:31 ` [PATCH v3 4/6] pid: sprinkle tasklist_lock asserts Mateusz Guzik
2025-02-01 16:31 ` [PATCH v3 5/6] pid: perform free_pid() calls outside of tasklist_lock Mateusz Guzik
2025-02-03 18:49   ` Oleg Nesterov
2025-02-03 19:31     ` Mateusz Guzik
2025-02-03 20:02       ` Oleg Nesterov
2025-02-01 16:31 ` [PATCH v3 6/6] pid: drop irq disablement around pidmap_lock Mateusz Guzik
2025-02-01 17:42   ` Oleg Nesterov
2025-02-01 17:45     ` Oleg Nesterov
2025-02-01 18:19   ` David Laight
2025-02-01 18:42     ` Mateusz Guzik
2025-02-01 21:51       ` David Laight
2025-02-01 22:00         ` Matthew Wilcox
2025-02-02 13:55           ` David Laight
2025-02-02 19:34             ` Mateusz Guzik
2025-02-02 20:44               ` David Laight
2025-02-02 22:06                 ` Matthew Wilcox
2025-02-03 17:47 ` [PATCH v3 0/6] reduce tasklist_lock hold time on exit and do some Oleg Nesterov
