* [PATCH v3 0/2] further damage-control lack of clone scalability
@ 2025-12-06 13:19 Mateusz Guzik
2025-12-06 13:19 ` [PATCH v3 1/2] ns: pad refcount Mateusz Guzik
2025-12-06 13:19 ` [PATCH v3 2/2] pid: only take pidmap_lock once on alloc Mateusz Guzik
0 siblings, 2 replies; 3+ messages in thread
From: Mateusz Guzik @ 2025-12-06 13:19 UTC (permalink / raw)
To: oleg; +Cc: brauner, linux-kernel, akpm, linux-mm, willy, Mateusz Guzik
When spawning and killing threads in separate processes in parallel, the
primary bottleneck on the stock kernel is pidmap_lock, largely because
of a back-to-back acquire in the common case.
Benchmark code at the end.
With this patchset alloc_pid() only takes the lock once and consequently
alleviates the problem. While scalability improves, the lock remains the
primary bottleneck by a large margin.
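For illustration, the pattern being fixed looks roughly as follows in the
common single-level case. This is a simplified sketch with the
idr_preload() and error handling elided, not a verbatim excerpt from
kernel/pid.c:

/* stock kernel, common case: two back-to-back acquisitions per alloc_pid() */
spin_lock(&pidmap_lock);
nr = idr_alloc_cyclic(&ns->idr, NULL, pid_min, pid_max, GFP_ATOMIC);
spin_unlock(&pidmap_lock);
/* ... initialize the pid struct ... */
spin_lock(&pidmap_lock);
idr_replace(&ns->idr, pid, nr);		/* make the pid visible */
spin_unlock(&pidmap_lock);

/* patched: one acquisition covers both the allocation and the visibility pass */
spin_lock(&pidmap_lock);
nr = idr_alloc_cyclic(&ns->idr, NULL, pid_min, pid_max, GFP_ATOMIC);
idr_replace(&ns->idr, pid, nr);
spin_unlock(&pidmap_lock);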
I believe idr is a poor choice for the task at hand to begin with, but
sorting that out is beyond the scope of this patchset. At the same time
any replacement would be best evaluated against a state where the
above relock problem is fixed.
Performance improvement varies between reboots. When benchmarking with
20 processes creating and killing threads in a loop, the unpatched
baseline hovers around 465k ops/s, while the patched kernel lands
anywhere between ~510k and ~560k ops/s depending on false sharing (which
I only minimally sanitized). So this is at least a 10% improvement if
you are unlucky.
bench from will-it-scale:
#include <assert.h>
#include <pthread.h>
char *testcase_description = "Thread creation and teardown";
static void *worker(void *arg)
{
return (NULL);
}
void testcase(unsigned long long *iterations, unsigned long nr)
{
pthread_t thread[1];
int error;
while (1) {
for (int i = 0; i < 1; i++) {
error = pthread_create(&thread[i], NULL, worker, NULL);
assert(error == 0);
}
for (int i = 0; i < 1; i++) {
error = pthread_join(thread[i], NULL);
assert(error == 0);
}
(*iterations)++;
}
}
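The will-it-scale harness normally drives testcase() itself and samples a
shared iteration counter to produce the ops/s numbers. For reference, a
minimal standalone driver along the following lines can generate the same
load; it is hypothetical and not part of will-it-scale, and it only spawns
the workers without doing any measurement:

#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

extern void testcase(unsigned long long *iterations, unsigned long nr);

int main(void)
{
	enum { NPROC = 20 };		/* matches the 20-process runs above */
	pid_t pids[NPROC];

	for (int i = 0; i < NPROC; i++) {
		pids[i] = fork();
		if (pids[i] < 0) {
			perror("fork");
			exit(1);
		}
		if (pids[i] == 0) {
			unsigned long long iterations = 0;

			testcase(&iterations, i);	/* loops forever */
			_exit(0);
		}
	}

	sleep(30);	/* let the workers hammer clone()/exit() for a while */

	for (int i = 0; i < NPROC; i++) {
		kill(pids[i], SIGTERM);
		waitpid(pids[i], NULL, 0);
	}
	return 0;
}

Built together with the testcase file and -pthread, e.g.
cc -O2 -pthread driver.c bench.c (file names here are placeholders).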
v3:
- fix some whitespace and one typo
- slightly reword the ENOMEM comment
- move i-- in the first loop towards the end for consistency with the
other loop
- 2 extra unlikely for initial error conditions
I retained Oleg's r-b as the changes don't affect behavior
v2:
- cosmetic fixes from Oleg
- drop idr_preload_many, relock pidmap + call idr_preload again instead
- write a commit message
Mateusz Guzik (2):
ns: pad refcount
pid: only take pidmap_lock once on alloc
include/linux/ns/ns_common_types.h | 4 +-
kernel/pid.c | 134 ++++++++++++++++++-----------
2 files changed, 89 insertions(+), 49 deletions(-)
--
2.48.1
* [PATCH v3 1/2] ns: pad refcount
2025-12-06 13:19 [PATCH v3 0/2] further damage-control lack of clone scalability Mateusz Guzik
@ 2025-12-06 13:19 ` Mateusz Guzik
2025-12-06 13:19 ` [PATCH v3 2/2] pid: only take pidmap_lock once on alloc Mateusz Guzik
1 sibling, 0 replies; 3+ messages in thread
From: Mateusz Guzik @ 2025-12-06 13:19 UTC (permalink / raw)
To: oleg; +Cc: brauner, linux-kernel, akpm, linux-mm, willy, Mateusz Guzik
Move the namespace refcount into its own cacheline-aligned slot to reduce
false sharing between refcount get/put traffic and the read-mostly fields
of ns_common.

Note no effort is made to make sure structs embedding the namespace are
themselves aligned, so this is not guaranteed to eliminate cacheline
bouncing due to refcount management.
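Purely as a userspace illustration of the layout idea (an analogy, not
kernel code: alignas(64) stands in for ____cacheline_aligned_in_smp and a
64-byte line size is assumed):

#include <stdalign.h>
#include <stddef.h>
#include <stdio.h>

/* Analogy for the new ns_common layout: the hot counter gets its own
 * aligned slot, the read-mostly fields follow. */
struct ns_like {
	struct {
		alignas(64) int refcount;	/* written on every get/put */
	} hot;
	const void *ops;			/* read-mostly */
	unsigned int inum;			/* read-mostly */
};

int main(void)
{
	/* Inspect where the read-mostly fields land relative to the counter.
	 * As the note above says, this only helps if the embedding object
	 * is itself suitably aligned. */
	printf("ops at %zu, inum at %zu, sizeof %zu\n",
	       offsetof(struct ns_like, ops),
	       offsetof(struct ns_like, inum),
	       sizeof(struct ns_like));
	return 0;
}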
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
---
include/linux/ns/ns_common_types.h | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/include/linux/ns/ns_common_types.h b/include/linux/ns/ns_common_types.h
index b332b019b29c..0014fbc1c626 100644
--- a/include/linux/ns/ns_common_types.h
+++ b/include/linux/ns/ns_common_types.h
@@ -108,11 +108,13 @@ extern const struct proc_ns_operations utsns_operations;
* @ns_tree: namespace tree nodes and active reference count
*/
struct ns_common {
+ struct {
+ refcount_t __ns_ref; /* do not use directly */
+ } ____cacheline_aligned_in_smp;
u32 ns_type;
struct dentry *stashed;
const struct proc_ns_operations *ops;
unsigned int inum;
- refcount_t __ns_ref; /* do not use directly */
union {
struct ns_tree;
struct rcu_head ns_rcu;
--
2.48.1
* [PATCH v3 2/2] pid: only take pidmap_lock once on alloc
2025-12-06 13:19 [PATCH v3 0/2] further damage-control lack of clone scalability Mateusz Guzik
2025-12-06 13:19 ` [PATCH v3 1/2] ns: pad refcount Mateusz Guzik
@ 2025-12-06 13:19 ` Mateusz Guzik
1 sibling, 0 replies; 3+ messages in thread
From: Mateusz Guzik @ 2025-12-06 13:19 UTC (permalink / raw)
To: oleg; +Cc: brauner, linux-kernel, akpm, linux-mm, willy, Mateusz Guzik
When spawning and killing threads in separate processes in parallel, the
primary bottleneck on the stock kernel is pidmap_lock, largely because
of a back-to-back acquire in the common case. This patch fixes that by
taking the lock only once in alloc_pid().
Performance improvement varies between reboots. When benchmarking with
20 processes creating and killing threads in a loop, the unpatched
baseline hovers around 465k ops/s, while the patched kernel lands
anywhere between ~510k and ~560k ops/s depending on false sharing (which
I only minimally sanitized). So this is at least a 10% improvement if
you are unlucky.
The restructuring also made room for some cosmetic cleanups.
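The one non-obvious part is memory preloading: idr_preload() only covers a
single allocation, while with nested pid namespaces several idr_alloc()
calls now happen with GFP_ATOMIC under one lock hold. Condensed from the
patch below (error paths and set_tid handling elided), the retry works
roughly like this:

	retried_preload = false;
	idr_preload(GFP_KERNEL);
	spin_lock(&pidmap_lock);
	for (tmp = ns, i = ns->level; i >= 0;) {
		nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min,
				      pid_max[ns->level - i], GFP_ATOMIC);
		if (nr == -ENOMEM && !retried_preload) {
			/* out of preloaded nodes: drop the lock, top up with
			 * GFP_KERNEL and retry this level once */
			spin_unlock(&pidmap_lock);
			idr_preload_end();
			retried_preload = true;
			idr_preload(GFP_KERNEL);
			spin_lock(&pidmap_lock);
			continue;
		}
		if (nr < 0)
			goto out_free;
		pid->numbers[i].nr = nr;
		pid->numbers[i].ns = tmp;
		tmp = tmp->parent;
		i--;
		retried_preload = false;
	}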
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
---
kernel/pid.c | 134 +++++++++++++++++++++++++++++++++------------------
1 file changed, 86 insertions(+), 48 deletions(-)
diff --git a/kernel/pid.c b/kernel/pid.c
index a31771bc89c1..ad4400a9f15f 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -159,65 +159,94 @@ void free_pids(struct pid **pids)
free_pid(pids[tmp]);
}
-struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
- size_t set_tid_size)
+struct pid *alloc_pid(struct pid_namespace *ns, pid_t *arg_set_tid,
+ size_t arg_set_tid_size)
{
+ int set_tid[MAX_PID_NS_LEVEL + 1] = {};
+ int pid_max[MAX_PID_NS_LEVEL + 1] = {};
struct pid *pid;
enum pid_type type;
int i, nr;
struct pid_namespace *tmp;
struct upid *upid;
int retval = -ENOMEM;
+ bool retried_preload;
/*
- * set_tid_size contains the size of the set_tid array. Starting at
+ * arg_set_tid_size contains the size of the arg_set_tid array. Starting at
* the most nested currently active PID namespace it tells alloc_pid()
* which PID to set for a process in that most nested PID namespace
- * up to set_tid_size PID namespaces. It does not have to set the PID
- * for a process in all nested PID namespaces but set_tid_size must
+ * up to arg_set_tid_size PID namespaces. It does not have to set the PID
+ * for a process in all nested PID namespaces but arg_set_tid_size must
* never be greater than the current ns->level + 1.
*/
- if (set_tid_size > ns->level + 1)
+ if (unlikely(arg_set_tid_size > ns->level + 1))
return ERR_PTR(-EINVAL);
+ /*
+ * Prep before we take locks:
+ *
+ * 1. allocate and fill in pid struct
+ */
pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL);
- if (!pid)
+ if (unlikely(!pid))
return ERR_PTR(retval);
- tmp = ns;
+ get_pid_ns(ns);
pid->level = ns->level;
+ refcount_set(&pid->count, 1);
+ spin_lock_init(&pid->lock);
+ for (type = 0; type < PIDTYPE_MAX; ++type)
+ INIT_HLIST_HEAD(&pid->tasks[type]);
+ init_waitqueue_head(&pid->wait_pidfd);
+ INIT_HLIST_HEAD(&pid->inodes);
- for (i = ns->level; i >= 0; i--) {
- int tid = 0;
- int pid_max = READ_ONCE(tmp->pid_max);
+ /*
+ * 2. perm check checkpoint_restore_ns_capable()
+ *
+ * This stores found pid_max to make sure the used value is the same should
+ * later code need it.
+ */
+ for (tmp = ns, i = ns->level; i >= 0;) {
+ pid_max[ns->level - i] = READ_ONCE(tmp->pid_max);
- if (set_tid_size) {
- tid = set_tid[ns->level - i];
+ if (arg_set_tid_size) {
+ int tid = set_tid[ns->level - i] = arg_set_tid[ns->level - i];
retval = -EINVAL;
- if (tid < 1 || tid >= pid_max)
- goto out_free;
+ if (tid < 1 || tid >= pid_max[ns->level - i])
+ goto out_abort;
/*
* Also fail if a PID != 1 is requested and
* no PID 1 exists.
*/
if (tid != 1 && !tmp->child_reaper)
- goto out_free;
+ goto out_abort;
retval = -EPERM;
if (!checkpoint_restore_ns_capable(tmp->user_ns))
- goto out_free;
- set_tid_size--;
+ goto out_abort;
+ arg_set_tid_size--;
}
- idr_preload(GFP_KERNEL);
- spin_lock(&pidmap_lock);
+ tmp = tmp->parent;
+ i--;
+ }
+
+ /*
+ * Prep is done, id allocation goes here:
+ */
+ retried_preload = false;
+ idr_preload(GFP_KERNEL);
+ spin_lock(&pidmap_lock);
+ for (tmp = ns, i = ns->level; i >= 0;) {
+ int tid = set_tid[ns->level - i];
if (tid) {
nr = idr_alloc(&tmp->idr, NULL, tid,
tid + 1, GFP_ATOMIC);
/*
* If ENOSPC is returned it means that the PID is
- * alreay in use. Return EEXIST in that case.
+ * already in use. Return EEXIST in that case.
*/
if (nr == -ENOSPC)
nr = -EEXIST;
@@ -235,19 +264,41 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
* a partially initialized PID (see below).
*/
nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min,
- pid_max, GFP_ATOMIC);
+ pid_max[ns->level - i], GFP_ATOMIC);
+ if (nr == -ENOSPC)
+ nr = -EAGAIN;
}
- spin_unlock(&pidmap_lock);
- idr_preload_end();
- if (nr < 0) {
- retval = (nr == -ENOSPC) ? -EAGAIN : nr;
+ if (unlikely(nr < 0)) {
+ /*
+ * Preload more memory if idr_alloc{,cyclic} failed with -ENOMEM.
+ *
+ * The IDR API only allows us to preload memory for one call, while we may end
+ * up doing several with GFP_ATOMIC. It may be the situation is salvageable with
+ * GFP_KERNEL. But make sure to not loop indefinitely if preload did not help
+ * (the routine unfortunately returns void, so we have no idea if it got anywhere).
+ *
+ * The pidmap lock can be safely dropped and picked up as historically pid allocation
+ * for different namespaces was *not* atomic -- we try to hold on to it the
+ * entire time only for performance reasons.
+ */
+ if (nr == -ENOMEM && !retried_preload) {
+ spin_unlock(&pidmap_lock);
+ idr_preload_end();
+ retried_preload = true;
+ idr_preload(GFP_KERNEL);
+ spin_lock(&pidmap_lock);
+ continue;
+ }
+ retval = nr;
goto out_free;
}
pid->numbers[i].nr = nr;
pid->numbers[i].ns = tmp;
tmp = tmp->parent;
+ i--;
+ retried_preload = false;
}
/*
@@ -257,25 +308,15 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
* is what we have exposed to userspace for a long time and it is
* documented behavior for pid namespaces. So we can't easily
* change it even if there were an error code better suited.
+ *
+ * This can't be done earlier because we need to preserve other
+ * error conditions.
*/
retval = -ENOMEM;
-
- get_pid_ns(ns);
- refcount_set(&pid->count, 1);
- spin_lock_init(&pid->lock);
- for (type = 0; type < PIDTYPE_MAX; ++type)
- INIT_HLIST_HEAD(&pid->tasks[type]);
-
- init_waitqueue_head(&pid->wait_pidfd);
- INIT_HLIST_HEAD(&pid->inodes);
-
- upid = pid->numbers + ns->level;
- idr_preload(GFP_KERNEL);
- spin_lock(&pidmap_lock);
- if (!(ns->pid_allocated & PIDNS_ADDING))
- goto out_unlock;
+ if (unlikely(!(ns->pid_allocated & PIDNS_ADDING)))
+ goto out_free;
pidfs_add_pid(pid);
- for ( ; upid >= pid->numbers; --upid) {
+ for (upid = pid->numbers + ns->level; upid >= pid->numbers; --upid) {
/* Make the PID visible to find_pid_ns. */
idr_replace(&upid->ns->idr, pid, upid->nr);
upid->ns->pid_allocated++;
@@ -286,13 +327,7 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
return pid;
-out_unlock:
- spin_unlock(&pidmap_lock);
- idr_preload_end();
- put_pid_ns(ns);
-
out_free:
- spin_lock(&pidmap_lock);
while (++i <= ns->level) {
upid = pid->numbers + i;
idr_remove(&upid->ns->idr, upid->nr);
@@ -303,7 +338,10 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
idr_set_cursor(&ns->idr, 0);
spin_unlock(&pidmap_lock);
+ idr_preload_end();
+out_abort:
+ put_pid_ns(ns);
kmem_cache_free(ns->pid_cachep, pid);
return ERR_PTR(retval);
}
--
2.48.1