* [PATCH v2 0/2] further damage-control lack of clone scalability
@ 2025-12-03 9:28 Mateusz Guzik
2025-12-03 9:28 ` [PATCH v2 1/2] ns: pad refcount Mateusz Guzik
2025-12-03 9:28 ` [PATCH v2 2/2] pid: only take pidmap_lock once on alloc Mateusz Guzik
0 siblings, 2 replies; 4+ messages in thread
From: Mateusz Guzik @ 2025-12-03 9:28 UTC (permalink / raw)
To: oleg; +Cc: brauner, linux-kernel, akpm, linux-mm, willy, Mateusz Guzik
When spawning and killing threads in separate processes in parallel, the
primary bottleneck on the stock kernel is pidmap_lock, largely because of a
back-to-back acquire in the common case.

Benchmark code is at the end.

With this patchset alloc_pid() takes the lock only once, which alleviates the
problem. While scalability improves, the lock remains the primary bottleneck
by a large margin.
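
For reference, the pre-patch pattern, heavily abridged from alloc_pid() with
only the locking-related calls kept. In the common case ns->level is 0, so
the loop body runs once and the lock is then immediately taken again to
publish the pid:

	for (i = ns->level; i >= 0; i--) {
		idr_preload(GFP_KERNEL);
		spin_lock(&pidmap_lock);
		nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min, pid_max,
				      GFP_ATOMIC);
		spin_unlock(&pidmap_lock);
		idr_preload_end();
		...
	}
	...
	idr_preload(GFP_KERNEL);
	spin_lock(&pidmap_lock);
	/* pidfs_add_pid() and idr_replace() make the pid visible */
	spin_unlock(&pidmap_lock);
	idr_preload_end();
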
I believe idr is a poor choice for the task at hand to begin with, but sorting
that out is beyond the scope of this patchset. At the same time, any
replacement would be best evaluated against a baseline where the above relock
problem is already fixed.

The performance improvement varies between reboots. When benchmarking with 20
processes creating and killing threads in a loop, the unpatched baseline
hovers around 465k ops/s, while the patched kernel lands anywhere between
~510k and ~560k ops/s depending on false sharing (which I only minimally
sanitized). So this is at least a 10% improvement even if you are unlucky.

bench from will-it-scale:

#include <assert.h>
#include <pthread.h>

char *testcase_description = "Thread creation and teardown";

static void *worker(void *arg)
{
	return NULL;
}

void testcase(unsigned long long *iterations, unsigned long nr)
{
	pthread_t thread[1];
	int error;

	while (1) {
		for (int i = 0; i < 1; i++) {
			error = pthread_create(&thread[i], NULL, worker, NULL);
			assert(error == 0);
		}
		for (int i = 0; i < 1; i++) {
			error = pthread_join(thread[i], NULL);
			assert(error == 0);
		}
		(*iterations)++;
	}
}
v2:
- cosmetic fixes from Oleg
- drop idr_preload_many, relock pidmap + call idr_preload again instead
- write a commit message for the alloc pid patch
Mateusz Guzik (2):
ns: pad refcount
pid: only take pidmap_lock once on alloc
include/linux/ns/ns_common_types.h | 4 +-
kernel/pid.c | 131 +++++++++++++++++++----------
2 files changed, 88 insertions(+), 47 deletions(-)
--
2.48.1
* [PATCH v2 1/2] ns: pad refcount
2025-12-03 9:28 [PATCH v2 0/2] further damage-control lack of clone scalability Mateusz Guzik
@ 2025-12-03 9:28 ` Mateusz Guzik
2025-12-03 9:28 ` [PATCH v2 2/2] pid: only take pidmap_lock once on alloc Mateusz Guzik
1 sibling, 0 replies; 4+ messages in thread
From: Mateusz Guzik @ 2025-12-03 9:28 UTC (permalink / raw)
To: oleg; +Cc: brauner, linux-kernel, akpm, linux-mm, willy, Mateusz Guzik
Give the namespace refcount its own cacheline so that refcount updates no
longer share a line with the other fields of struct ns_common.

Note that no effort is made to make sure structs embedding the namespace are
themselves cacheline-aligned, so this is not guaranteed to eliminate cacheline
bouncing due to refcount management.
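
To illustrate the caveat with a hypothetical embedder (not part of this
patch): the padding guarantees __ns_ref a cacheline-aligned offset within
the containing object, but whether that offset falls on an actual cacheline
boundary still depends on the embedding allocation being cacheline-aligned:

struct foo_namespace {			/* hypothetical */
	struct ns_common ns;		/* __ns_ref at a cacheline-aligned offset */
	unsigned long other_state;
};

static struct kmem_cache *foo_cachep;

static int __init foo_ns_cache_init(void)
{
	/* an aligned backing cache is what completes the guarantee */
	foo_cachep = kmem_cache_create("foo_namespace",
				       sizeof(struct foo_namespace),
				       SMP_CACHE_BYTES, SLAB_HWCACHE_ALIGN,
				       NULL);
	return foo_cachep ? 0 : -ENOMEM;
}
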
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
---
include/linux/ns/ns_common_types.h | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/include/linux/ns/ns_common_types.h b/include/linux/ns/ns_common_types.h
index b332b019b29c..0014fbc1c626 100644
--- a/include/linux/ns/ns_common_types.h
+++ b/include/linux/ns/ns_common_types.h
@@ -108,11 +108,13 @@ extern const struct proc_ns_operations utsns_operations;
* @ns_tree: namespace tree nodes and active reference count
*/
struct ns_common {
+ struct {
+ refcount_t __ns_ref; /* do not use directly */
+ } ____cacheline_aligned_in_smp;
u32 ns_type;
struct dentry *stashed;
const struct proc_ns_operations *ops;
unsigned int inum;
- refcount_t __ns_ref; /* do not use directly */
union {
struct ns_tree;
struct rcu_head ns_rcu;
--
2.48.1
* [PATCH v2 2/2] pid: only take pidmap_lock once on alloc
2025-12-03 9:28 [PATCH v2 0/2] further damage-control lack of clone scalability Mateusz Guzik
2025-12-03 9:28 ` [PATCH v2 1/2] ns: pad refcount Mateusz Guzik
@ 2025-12-03 9:28 ` Mateusz Guzik
2025-12-04 14:31 ` Oleg Nesterov
1 sibling, 1 reply; 4+ messages in thread
From: Mateusz Guzik @ 2025-12-03 9:28 UTC (permalink / raw)
To: oleg; +Cc: brauner, linux-kernel, akpm, linux-mm, willy, Mateusz Guzik
When spawning and killing threads in separate processes in parallel, the
primary bottleneck on the stock kernel is pidmap_lock, largely because of a
back-to-back acquire in the common case. This patch fixes that aspect by
taking the lock only once in alloc_pid().

The performance improvement varies between reboots. When benchmarking with 20
processes creating and killing threads in a loop, the unpatched baseline
hovers around 465k ops/s, while the patched kernel lands anywhere between
~510k and ~560k ops/s depending on false sharing (which I only minimally
sanitized). So this is at least a 10% improvement even if you are unlucky.

The change also facilitated some cosmetic fixes.
It has an unintentional side effect of no longer issuing spurious
idr_preload() around idr_replace().
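
For orientation, the resulting flow is roughly the following (abridged from
the patch below; the set_tid branch and error unwinding are elided):

	/*
	 * Prep outside the lock: allocate and initialize struct pid,
	 * validate set_tid, snapshot pid_max for every level.
	 */
	retried_preload = false;
	idr_preload(GFP_KERNEL);
	spin_lock(&pidmap_lock);
	for (tmp = ns, i = ns->level; i >= 0;) {
		nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min,
				      pid_max[ns->level - i], GFP_ATOMIC);
		if (nr == -ENOMEM && !retried_preload) {
			/* refill the preload pool outside the lock, retry once */
			spin_unlock(&pidmap_lock);
			idr_preload_end();
			retried_preload = true;
			idr_preload(GFP_KERNEL);
			spin_lock(&pidmap_lock);
			continue;
		}
		pid->numbers[i].nr = nr;
		pid->numbers[i].ns = tmp;
		tmp = tmp->parent;
		i--;
		retried_preload = false;
	}
	/* still under the lock: pidfs_add_pid() and idr_replace() per level */
	spin_unlock(&pidmap_lock);
	idr_preload_end();
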
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
---
kernel/pid.c | 131 +++++++++++++++++++++++++++++++++------------------
1 file changed, 85 insertions(+), 46 deletions(-)
diff --git a/kernel/pid.c b/kernel/pid.c
index a31771bc89c1..f45ae56db7da 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -159,58 +159,86 @@ void free_pids(struct pid **pids)
free_pid(pids[tmp]);
}
-struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
- size_t set_tid_size)
+struct pid *alloc_pid(struct pid_namespace *ns, pid_t *arg_set_tid,
+ size_t arg_set_tid_size)
{
+ int set_tid[MAX_PID_NS_LEVEL + 1] = {};
+ int pid_max[MAX_PID_NS_LEVEL + 1] = {};
struct pid *pid;
enum pid_type type;
int i, nr;
struct pid_namespace *tmp;
struct upid *upid;
int retval = -ENOMEM;
+ bool retried_preload;
/*
- * set_tid_size contains the size of the set_tid array. Starting at
+ * arg_set_tid_size contains the size of the arg_set_tid array. Starting at
* the most nested currently active PID namespace it tells alloc_pid()
* which PID to set for a process in that most nested PID namespace
- * up to set_tid_size PID namespaces. It does not have to set the PID
- * for a process in all nested PID namespaces but set_tid_size must
+ * up to arg_set_tid_size PID namespaces. It does not have to set the PID
+ * for a process in all nested PID namespaces but arg_set_tid_size must
* never be greater than the current ns->level + 1.
*/
- if (set_tid_size > ns->level + 1)
+ if (arg_set_tid_size > ns->level + 1)
return ERR_PTR(-EINVAL);
+ /*
+ * Prep before we take locks:
+ *
+ * 1. allocate and fill in pid struct
+ */
pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL);
if (!pid)
return ERR_PTR(retval);
- tmp = ns;
+ get_pid_ns(ns);
pid->level = ns->level;
+ refcount_set(&pid->count, 1);
+ spin_lock_init(&pid->lock);
+ for (type = 0; type < PIDTYPE_MAX; ++type)
+ INIT_HLIST_HEAD(&pid->tasks[type]);
+ init_waitqueue_head(&pid->wait_pidfd);
+ INIT_HLIST_HEAD(&pid->inodes);
- for (i = ns->level; i >= 0; i--) {
- int tid = 0;
- int pid_max = READ_ONCE(tmp->pid_max);
+ /*
+ * 2. perm check checkpoint_restore_ns_capable()
+ *
+ * This stores found pid_max to make sure the used value is the same should
+ * later code need it.
+ */
+ for (tmp = ns, i = ns->level; i >= 0; i--) {
+ pid_max[ns->level - i] = READ_ONCE(tmp->pid_max);
- if (set_tid_size) {
- tid = set_tid[ns->level - i];
+ if (arg_set_tid_size) {
+ int tid = set_tid[ns->level - i] = arg_set_tid[ns->level - i];
retval = -EINVAL;
- if (tid < 1 || tid >= pid_max)
- goto out_free;
+ if (tid < 1 || tid >= pid_max[ns->level - i])
+ goto out_abort;
/*
* Also fail if a PID != 1 is requested and
* no PID 1 exists.
*/
if (tid != 1 && !tmp->child_reaper)
- goto out_free;
+ goto out_abort;
retval = -EPERM;
if (!checkpoint_restore_ns_capable(tmp->user_ns))
- goto out_free;
- set_tid_size--;
+ goto out_abort;
+ arg_set_tid_size--;
}
- idr_preload(GFP_KERNEL);
- spin_lock(&pidmap_lock);
+ tmp = tmp->parent;
+ }
+
+ /*
+ * Prep is done, id allocation goes here:
+ */
+ retried_preload = false;
+ idr_preload(GFP_KERNEL);
+ spin_lock(&pidmap_lock);
+ for (tmp = ns, i = ns->level; i >= 0;) {
+ int tid = set_tid[ns->level - i];
if (tid) {
nr = idr_alloc(&tmp->idr, NULL, tid,
@@ -220,6 +248,7 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
* alreay in use. Return EEXIST in that case.
*/
if (nr == -ENOSPC)
+
nr = -EEXIST;
} else {
int pid_min = 1;
@@ -235,19 +264,42 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
* a partially initialized PID (see below).
*/
nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min,
- pid_max, GFP_ATOMIC);
+ pid_max[ns->level - i], GFP_ATOMIC);
+ if (nr == -ENOSPC)
+ nr = -EAGAIN;
}
- spin_unlock(&pidmap_lock);
- idr_preload_end();
- if (nr < 0) {
- retval = (nr == -ENOSPC) ? -EAGAIN : nr;
+ if (unlikely(nr < 0)) {
+ /*
+ * Preload more memory if idr_alloc{,cyclic} failed with -ENOMEM.
+ *
+ * The IDR API only allows us to preload memory for one call, while we may end
+ * up doing several under pidmap_lock with GFP_ATOMIC. The situation may be
+ * salvageable with GFP_KERNEL. But make sure to not loop indefinitely if preload
+ * did not help (the routine unfortunately returns void, so we have no idea
+ * if it got anywhere).
+ *
+ * The lock can be safely dropped and picked up as historically pid allocation
+ * for different namespaces was *not* atomic -- we try to hold on to it the
+ * entire time only for performance reasons.
+ */
+ if (nr == -ENOMEM && !retried_preload) {
+ spin_unlock(&pidmap_lock);
+ idr_preload_end();
+ retried_preload = true;
+ idr_preload(GFP_KERNEL);
+ spin_lock(&pidmap_lock);
+ continue;
+ }
+ retval = nr;
goto out_free;
}
pid->numbers[i].nr = nr;
pid->numbers[i].ns = tmp;
tmp = tmp->parent;
+ i--;
+ retried_preload = false;
}
/*
@@ -257,25 +309,15 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
* is what we have exposed to userspace for a long time and it is
* documented behavior for pid namespaces. So we can't easily
* change it even if there were an error code better suited.
+ *
+ * This can't be done earlier because we need to preserve other
+ * error conditions.
*/
retval = -ENOMEM;
-
- get_pid_ns(ns);
- refcount_set(&pid->count, 1);
- spin_lock_init(&pid->lock);
- for (type = 0; type < PIDTYPE_MAX; ++type)
- INIT_HLIST_HEAD(&pid->tasks[type]);
-
- init_waitqueue_head(&pid->wait_pidfd);
- INIT_HLIST_HEAD(&pid->inodes);
-
- upid = pid->numbers + ns->level;
- idr_preload(GFP_KERNEL);
- spin_lock(&pidmap_lock);
- if (!(ns->pid_allocated & PIDNS_ADDING))
- goto out_unlock;
+ if (unlikely(!(ns->pid_allocated & PIDNS_ADDING)))
+ goto out_free;
pidfs_add_pid(pid);
- for ( ; upid >= pid->numbers; --upid) {
+ for (upid = pid->numbers + ns->level; upid >= pid->numbers; --upid) {
/* Make the PID visible to find_pid_ns. */
idr_replace(&upid->ns->idr, pid, upid->nr);
upid->ns->pid_allocated++;
@@ -286,13 +328,7 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
return pid;
-out_unlock:
- spin_unlock(&pidmap_lock);
- idr_preload_end();
- put_pid_ns(ns);
-
out_free:
- spin_lock(&pidmap_lock);
while (++i <= ns->level) {
upid = pid->numbers + i;
idr_remove(&upid->ns->idr, upid->nr);
@@ -303,7 +339,10 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
idr_set_cursor(&ns->idr, 0);
spin_unlock(&pidmap_lock);
+ idr_preload_end();
+out_abort:
+ put_pid_ns(ns);
kmem_cache_free(ns->pid_cachep, pid);
return ERR_PTR(retval);
}
--
2.48.1
* Re: [PATCH v2 2/2] pid: only take pidmap_lock once on alloc
2025-12-03 9:28 ` [PATCH v2 2/2] pid: only take pidmap_lock once on alloc Mateusz Guzik
@ 2025-12-04 14:31 ` Oleg Nesterov
0 siblings, 0 replies; 4+ messages in thread
From: Oleg Nesterov @ 2025-12-04 14:31 UTC (permalink / raw)
To: Mateusz Guzik; +Cc: brauner, linux-kernel, akpm, linux-mm, willy
On 12/03, Mateusz Guzik wrote:
>
> It has an unintentional side effect of no longer issuing spurious
> idr_preload() around idr_replace().
^^^^^^^^^^^
Around pidfs_add_pid(). The extra idr_preload/end was added by commit
9698d5a483654 ("pidfs: rework inode number allocation") because
pidfs_add_pid() used to call idr_alloc_cyclic(GFP_ATOMIC) too.
The patch looks technically correct to me, so FWIW
Reviewed-by: Oleg Nesterov <oleg@redhat.com>