linux-mm.kvack.org archive mirror
* [PATCH 0/3] further damage-control lack of clone scalability
@ 2025-11-23  6:30 Mateusz Guzik
  2025-11-23  6:30 ` [PATCH 1/3] idr: add idr_prealloc_many Mateusz Guzik
                   ` (3 more replies)
  0 siblings, 4 replies; 16+ messages in thread
From: Mateusz Guzik @ 2025-11-23  6:30 UTC (permalink / raw)
  To: oleg; +Cc: brauner, linux-kernel, akpm, linux-mm, Mateusz Guzik

When spawning and killing threads in separate processes in parallel the
primary bottleneck on the stock kernel is pidmap_lock, largely because
of a back-to-back acquire in the common case.

Benchmark code at the end.

With this patchset alloc_pid() only takes the lock once and consequently
alleviates the problem. While scalability improves, the lock remains the
primary bottleneck by a large margin.
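
For reference, a rough sketch of the locking pattern in alloc_pid()
before and after (simplified, not the actual diff; error handling,
set_tid and the per-level pid_max handling are omitted):

	/* before: a preload + lock/unlock cycle per pid namespace level,
	 * plus one more acquire later to make the pid visible */
	for (i = ns->level; i >= 0; i--) {
		idr_preload(GFP_KERNEL);
		spin_lock(&pidmap_lock);
		nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min, pid_max, GFP_ATOMIC);
		spin_unlock(&pidmap_lock);
		idr_preload_end();
		tmp = tmp->parent;
	}

	/* after: preallocate for all levels up front, then do the
	 * allocations and the visibility pass under one lock hold */
	idr_preload_many(ns->level + 1, GFP_KERNEL);
	spin_lock(&pidmap_lock);
	for (i = ns->level; i >= 0; i--) {
		nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min, pid_max, GFP_ATOMIC);
		tmp = tmp->parent;
	}
	spin_unlock(&pidmap_lock);
	idr_preload_end();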

I believe idr is a poor choice for the task at hand to begin with, but
sorting that out is beyond the scope of this patchset. At the same time
any replacement would be best evaluated against a state where the
above relock problem is fixed.

Performance improvement varies between reboots. When benchmarking with
20 processes creating and killing threads in a loop, the unpatched
baseline hovers around 465k ops/s, while patched is anything between
~510k and ~560k ops/s depending on false sharing (which I only minimally
sanitized). So this is at least a ~10% improvement if you are unlucky.

bench from will-it-scale:

#include <assert.h>
#include <pthread.h>

char *testcase_description = "Thread creation and teardown";

static void *worker(void *arg)
{
        return (NULL);
}

void testcase(unsigned long long *iterations, unsigned long nr)
{
        pthread_t thread[1];
        int error;

        while (1) {
                for (int i = 0; i < 1; i++) {
                        error = pthread_create(&thread[i], NULL, worker, NULL);
                        assert(error == 0);
                }
                for (int i = 0; i < 1; i++) {
                        error = pthread_join(thread[i], NULL);
                        assert(error == 0);
                }
                (*iterations)++;
        }
}


Mateusz Guzik (3):
  idr: add idr_prealloc_many
  ns: pad refcount
  pid: only take pidmap_lock once on alloc

 include/linux/idr.h                |  1 +
 include/linux/ns/ns_common_types.h |  4 +-
 kernel/pid.c                       | 99 +++++++++++++++++-------------
 lib/radix-tree.c                   | 19 +++++-
 4 files changed, 76 insertions(+), 47 deletions(-)

-- 
2.48.1



^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH 1/3] idr: add idr_prealloc_many
  2025-11-23  6:30 [PATCH 0/3] further damage-control lack of clone scalability Mateusz Guzik
@ 2025-11-23  6:30 ` Mateusz Guzik
  2025-11-23  6:30 ` [PATCH 2/3] ns: pad refcount Mateusz Guzik
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 16+ messages in thread
From: Mateusz Guzik @ 2025-11-23  6:30 UTC (permalink / raw)
  To: oleg; +Cc: brauner, linux-kernel, akpm, linux-mm, Mateusz Guzik
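
Add idr_preload_many(), a variant of idr_preload() which preallocates
enough nodes for several subsequent idr_alloc() calls, so that a caller
allocating multiple IDs can perform all of them under a single lock
acquisition.

A minimal usage sketch (hypothetical caller, not taken from this
series; "my_lock", "my_idrs", "id" and "n" are placeholders):

	idr_preload_many(n, GFP_KERNEL);
	spin_lock(&my_lock);
	for (i = 0; i < n; i++) {
		/* GFP_ATOMIC is fine here, the nodes were preallocated above */
		id[i] = idr_alloc(&my_idrs[i], NULL, 1, 0, GFP_ATOMIC);
		if (id[i] < 0)
			break;
	}
	spin_unlock(&my_lock);
	idr_preload_end();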

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
---
 include/linux/idr.h |  1 +
 lib/radix-tree.c    | 19 +++++++++++++++++--
 2 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/include/linux/idr.h b/include/linux/idr.h
index 789e23e67444..c9aeae695442 100644
--- a/include/linux/idr.h
+++ b/include/linux/idr.h
@@ -110,6 +110,7 @@ static inline void idr_set_cursor(struct idr *idr, unsigned int val)
 #define idr_unlock_irqrestore(idr, flags) \
 				xa_unlock_irqrestore(&(idr)->idr_rt, flags)
 
+void idr_preload_many(int nr, gfp_t gfp_mask);
 void idr_preload(gfp_t gfp_mask);
 
 int idr_alloc(struct idr *, void *ptr, int start, int end, gfp_t);
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 976b9bd02a1b..2e71024e5323 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -1459,6 +1459,22 @@ int radix_tree_tagged(const struct radix_tree_root *root, unsigned int tag)
 }
 EXPORT_SYMBOL(radix_tree_tagged);
 
+/**
+ * idr_preload_many - preload for multiple calls to idr_alloc()
+ * @nr: number of idr_alloc() calls to preload for
+ * @gfp_mask: allocation mask to use for preloading
+ *
+ * Preallocate memory to use for @nr calls to idr_alloc().  This function
+ * returns with preemption disabled.  It will be enabled by idr_preload_end().
+ */
+void idr_preload_many(int nr, gfp_t gfp_mask)
+{
+	WARN_ON_ONCE(!nr);
+	if (__radix_tree_preload(gfp_mask, nr * IDR_PRELOAD_SIZE))
+		local_lock(&radix_tree_preloads.lock);
+}
+EXPORT_SYMBOL(idr_preload_many);
+
 /**
  * idr_preload - preload for idr_alloc()
  * @gfp_mask: allocation mask to use for preloading
@@ -1468,8 +1484,7 @@ EXPORT_SYMBOL(radix_tree_tagged);
  */
 void idr_preload(gfp_t gfp_mask)
 {
-	if (__radix_tree_preload(gfp_mask, IDR_PRELOAD_SIZE))
-		local_lock(&radix_tree_preloads.lock);
+	idr_preload_many(1, gfp_mask);
 }
 EXPORT_SYMBOL(idr_preload);
 
-- 
2.48.1



^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH 2/3] ns: pad refcount
  2025-11-23  6:30 [PATCH 0/3] further damage-control lack of clone scalability Mateusz Guzik
  2025-11-23  6:30 ` [PATCH 1/3] idr: add idr_prealloc_many Mateusz Guzik
@ 2025-11-23  6:30 ` Mateusz Guzik
  2025-11-23 18:58   ` Oleg Nesterov
  2025-11-23  6:30 ` [PATCH 3/3] pid: only take pidmap_lock once on alloc Mateusz Guzik
  2025-11-23 15:00 ` [PATCH 0/3] further damage-control lack of clone scalability Matthew Wilcox
  3 siblings, 1 reply; 16+ messages in thread
From: Mateusz Guzik @ 2025-11-23  6:30 UTC (permalink / raw)
  To: oleg; +Cc: brauner, linux-kernel, akpm, linux-mm, Mateusz Guzik

Pad the namespace refcount out to its own cacheline so that refcount
updates are less likely to bounce the cacheline holding the other
fields.

Note no effort is made to make sure structs embedding the namespace are
themselves cacheline-aligned, so this is not guaranteed to eliminate
cacheline bouncing due to refcount management.

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
---
 include/linux/ns/ns_common_types.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/ns/ns_common_types.h b/include/linux/ns/ns_common_types.h
index b332b019b29c..0014fbc1c626 100644
--- a/include/linux/ns/ns_common_types.h
+++ b/include/linux/ns/ns_common_types.h
@@ -108,11 +108,13 @@ extern const struct proc_ns_operations utsns_operations;
  * @ns_tree: namespace tree nodes and active reference count
  */
 struct ns_common {
+	struct {
+		refcount_t __ns_ref; /* do not use directly */
+	} ____cacheline_aligned_in_smp;
 	u32 ns_type;
 	struct dentry *stashed;
 	const struct proc_ns_operations *ops;
 	unsigned int inum;
-	refcount_t __ns_ref; /* do not use directly */
 	union {
 		struct ns_tree;
 		struct rcu_head ns_rcu;
-- 
2.48.1



^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH 3/3] pid: only take pidmap_lock once on alloc
  2025-11-23  6:30 [PATCH 0/3] further damage-control lack of clone scalability Mateusz Guzik
  2025-11-23  6:30 ` [PATCH 1/3] idr: add idr_prealloc_many Mateusz Guzik
  2025-11-23  6:30 ` [PATCH 2/3] ns: pad refcount Mateusz Guzik
@ 2025-11-23  6:30 ` Mateusz Guzik
  2025-11-23 20:09   ` Oleg Nesterov
  2025-11-23 15:00 ` [PATCH 0/3] further damage-control lack of clone scalability Matthew Wilcox
  3 siblings, 1 reply; 16+ messages in thread
From: Mateusz Guzik @ 2025-11-23  6:30 UTC (permalink / raw)
  To: oleg; +Cc: brauner, linux-kernel, akpm, linux-mm, Mateusz Guzik

Preallocate idr nodes for all namespace levels up front and perform the
entire pid allocation under a single pidmap_lock acquisition.

This reduces contention on the lock during parallel clone/exit.

It remains the primary bottleneck in such a case.

While here tidy up the code.

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
---
 kernel/pid.c | 99 +++++++++++++++++++++++++++++-----------------------
 1 file changed, 55 insertions(+), 44 deletions(-)

diff --git a/kernel/pid.c b/kernel/pid.c
index a31771bc89c1..6a87ce5a6040 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -159,9 +159,11 @@ void free_pids(struct pid **pids)
 			free_pid(pids[tmp]);
 }
 
-struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
-		      size_t set_tid_size)
+struct pid *alloc_pid(struct pid_namespace *ns, pid_t *arg_set_tid,
+		      size_t arg_set_tid_size)
 {
+	int set_tid[MAX_PID_NS_LEVEL + 1] = {};
+	int pid_max[MAX_PID_NS_LEVEL + 1] = {};
 	struct pid *pid;
 	enum pid_type type;
 	int i, nr;
@@ -170,47 +172,71 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
 	int retval = -ENOMEM;
 
 	/*
-	 * set_tid_size contains the size of the set_tid array. Starting at
+	 * arg_set_tid_size contains the size of the arg_set_tid array. Starting at
 	 * the most nested currently active PID namespace it tells alloc_pid()
 	 * which PID to set for a process in that most nested PID namespace
-	 * up to set_tid_size PID namespaces. It does not have to set the PID
-	 * for a process in all nested PID namespaces but set_tid_size must
+	 * up to arg_set_tid_size PID namespaces. It does not have to set the PID
+	 * for a process in all nested PID namespaces but arg_set_tid_size must
 	 * never be greater than the current ns->level + 1.
 	 */
-	if (set_tid_size > ns->level + 1)
+	if (arg_set_tid_size > ns->level + 1)
 		return ERR_PTR(-EINVAL);
 
+	/*
+	 * Prep before we take locks:
+	 *
+	 * 1. allocate and fill in pid struct
+	 */
 	pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL);
 	if (!pid)
 		return ERR_PTR(retval);
 
-	tmp = ns;
+	get_pid_ns(ns);
 	pid->level = ns->level;
+	refcount_set(&pid->count, 1);
+	spin_lock_init(&pid->lock);
+	for (type = 0; type < PIDTYPE_MAX; ++type)
+		INIT_HLIST_HEAD(&pid->tasks[type]);
+	init_waitqueue_head(&pid->wait_pidfd);
+	INIT_HLIST_HEAD(&pid->inodes);
 
-	for (i = ns->level; i >= 0; i--) {
-		int tid = 0;
-		int pid_max = READ_ONCE(tmp->pid_max);
+	/*
+	 * 2. perm check checkpoint_restore_ns_capable()
+	 *
+	 * This stores found pid_max to make sure the used value is the same should
+	 * later code need it.
+	 */
+	for (tmp = ns, i = ns->level; i >= 0; i--) {
+		pid_max[ns->level - i] = READ_ONCE(tmp->pid_max);
 
-		if (set_tid_size) {
-			tid = set_tid[ns->level - i];
+		if (arg_set_tid_size) {
+			int tid = set_tid[ns->level - i] = arg_set_tid[ns->level - i];
 
 			retval = -EINVAL;
-			if (tid < 1 || tid >= pid_max)
-				goto out_free;
+			if (tid < 1 || tid >= pid_max[ns->level - i])
+				goto out_abort;
 			/*
 			 * Also fail if a PID != 1 is requested and
 			 * no PID 1 exists.
 			 */
 			if (tid != 1 && !tmp->child_reaper)
-				goto out_free;
+				goto out_abort;
 			retval = -EPERM;
 			if (!checkpoint_restore_ns_capable(tmp->user_ns))
-				goto out_free;
-			set_tid_size--;
+				goto out_abort;
+			arg_set_tid_size--;
 		}
 
-		idr_preload(GFP_KERNEL);
-		spin_lock(&pidmap_lock);
+		tmp = tmp->parent;
+	}
+
+	/*
+	 * Prep is done, id allocation goes here:
+	 */
+	idr_preload_many(ns->level + 1, GFP_KERNEL);
+	spin_lock(&pidmap_lock);
+	for (tmp = ns, i = ns->level; i >= 0; i--) {
+		int tid = set_tid[ns->level - i];
 
 		if (tid) {
 			nr = idr_alloc(&tmp->idr, NULL, tid,
@@ -235,10 +261,8 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
 			 * a partially initialized PID (see below).
 			 */
 			nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min,
-					      pid_max, GFP_ATOMIC);
+					      pid_max[ns->level - i], GFP_ATOMIC);
 		}
-		spin_unlock(&pidmap_lock);
-		idr_preload_end();
 
 		if (nr < 0) {
 			retval = (nr == -ENOSPC) ? -EAGAIN : nr;
@@ -257,25 +281,15 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
 	 * is what we have exposed to userspace for a long time and it is
 	 * documented behavior for pid namespaces. So we can't easily
 	 * change it even if there were an error code better suited.
+	 *
+	 * This can't be done earlier because we need to preserve other
+	 * error conditions.
 	 */
 	retval = -ENOMEM;
-
-	get_pid_ns(ns);
-	refcount_set(&pid->count, 1);
-	spin_lock_init(&pid->lock);
-	for (type = 0; type < PIDTYPE_MAX; ++type)
-		INIT_HLIST_HEAD(&pid->tasks[type]);
-
-	init_waitqueue_head(&pid->wait_pidfd);
-	INIT_HLIST_HEAD(&pid->inodes);
-
-	upid = pid->numbers + ns->level;
-	idr_preload(GFP_KERNEL);
-	spin_lock(&pidmap_lock);
-	if (!(ns->pid_allocated & PIDNS_ADDING))
-		goto out_unlock;
+	if (unlikely(!(ns->pid_allocated & PIDNS_ADDING)))
+		goto out_free;
 	pidfs_add_pid(pid);
-	for ( ; upid >= pid->numbers; --upid) {
+	for (upid = pid->numbers + ns->level; upid >= pid->numbers; --upid) {
 		/* Make the PID visible to find_pid_ns. */
 		idr_replace(&upid->ns->idr, pid, upid->nr);
 		upid->ns->pid_allocated++;
@@ -286,13 +300,7 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
 
 	return pid;
 
-out_unlock:
-	spin_unlock(&pidmap_lock);
-	idr_preload_end();
-	put_pid_ns(ns);
-
 out_free:
-	spin_lock(&pidmap_lock);
 	while (++i <= ns->level) {
 		upid = pid->numbers + i;
 		idr_remove(&upid->ns->idr, upid->nr);
@@ -303,7 +311,10 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
 		idr_set_cursor(&ns->idr, 0);
 
 	spin_unlock(&pidmap_lock);
+	idr_preload_end();
 
+out_abort:
+	put_pid_ns(ns);
 	kmem_cache_free(ns->pid_cachep, pid);
 	return ERR_PTR(retval);
 }
-- 
2.48.1



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 0/3] further damage-control lack of clone scalability
  2025-11-23  6:30 [PATCH 0/3] further damage-control lack of clone scalability Mateusz Guzik
                   ` (2 preceding siblings ...)
  2025-11-23  6:30 ` [PATCH 3/3] pid: only take pidmap_lock once on alloc Mateusz Guzik
@ 2025-11-23 15:00 ` Matthew Wilcox
  2025-11-23 16:39   ` Mateusz Guzik
  2025-12-03  8:37   ` Mateusz Guzik
  3 siblings, 2 replies; 16+ messages in thread
From: Matthew Wilcox @ 2025-11-23 15:00 UTC (permalink / raw)
  To: Mateusz Guzik; +Cc: oleg, brauner, linux-kernel, akpm, linux-mm

On Sun, Nov 23, 2025 at 07:30:51AM +0100, Mateusz Guzik wrote:
> When spawning and killing threads in separate processes in parallel the
> primary bottleneck on the stock kernel is pidmap_lock, largely because
> of a back-to-back acquire in the common case.
> 
> Benchmark code at the end.
> 
> With this patchset alloc_pid() only takes the lock once and consequently
> alleviates the problem. While scalability improves, the lock remains the
> primary bottleneck by a large margin.
> 
> I believe idr is a poor choice for the task at hand to begin with, but
> sorting out that out beyond the scope of this patchset. At the same time
> any replacement would be best evaluated against a state where the
> above relock problem is fixed.

Good news!  The IDR is deprecated.  Bad news!  I'm not 100% sure that
the XArray is quite appropriate for this usecase.  I am opposed to
introducing more IDR APIs.  Have you looked at converting to the XArray?
Or do you have a better data structure in mind than the XArray?



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 0/3] further damage-control lack of clone scalability
  2025-11-23 15:00 ` [PATCH 0/3] further damage-control lack of clone scalability Matthew Wilcox
@ 2025-11-23 16:39   ` Mateusz Guzik
  2025-11-23 21:45     ` Matthew Wilcox
  2025-12-03  8:37   ` Mateusz Guzik
  1 sibling, 1 reply; 16+ messages in thread
From: Mateusz Guzik @ 2025-11-23 16:39 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: oleg, brauner, linux-kernel, akpm, linux-mm

On Sun, Nov 23, 2025 at 4:00 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Sun, Nov 23, 2025 at 07:30:51AM +0100, Mateusz Guzik wrote:
> > When spawning and killing threads in separate processes in parallel the
> > primary bottleneck on the stock kernel is pidmap_lock, largely because
> > of a back-to-back acquire in the common case.
> >
> > Benchmark code at the end.
> >
> > With this patchset alloc_pid() only takes the lock once and consequently
> > alleviates the problem. While scalability improves, the lock remains the
> > primary bottleneck by a large margin.
> >
> > I believe idr is a poor choice for the task at hand to begin with, but
> > sorting out that out beyond the scope of this patchset. At the same time
> > any replacement would be best evaluated against a state where the
> > above relock problem is fixed.
>
> Good news!  The IDR is deprecated.  Bad news!  I'm not 100% sure that
> the XArray is quite appropriate for this usecase.  I am opposed to
> introducing more IDR APIs.  Have you looked at converting to the XArray?
> Or do you have a better data structure in mind than the XArray?
>

I have some recollection we talked about this on irc a long time ago.

It is my *suspicion* this would be best served with a sparse bitmap +
a hash table.

Such a solution was already present, but it got replaced by
95846ecf9dac5089 ("pid: replace pid bitmap implementation with IDR
API").

Commit message cites the following bench results:
    The following are the stats for ps, pstree and calling readdir on /proc
    for 10,000 processes.

    ps:
            With IDR API    With bitmap
    real    0m1.479s        0m2.319s
    user    0m0.070s        0m0.060s
    sys     0m0.289s        0m0.516s

    pstree:
            With IDR API    With bitmap
    real    0m1.024s        0m1.794s
    user    0m0.348s        0m0.612s
    sys     0m0.184s        0m0.264s

    proc:
            With IDR API    With bitmap
    real    0m0.059s        0m0.074s
    user    0m0.000s        0m0.004s
    sys     0m0.016s        0m0.016s

Impact on clone was not benchmarked afaics.

It may be that a hash table is a bad pick here, but the commit does not
explain what (if anything) was done to try to make it work. As is, I
suspect both the sizing of the table itself and the hashing algo were
suboptimal for the above test.

I did find a patchset converting the thing to xarray here:
https://lore.kernel.org/all/20221202171620.509140-1-bfoster@redhat.com/
. It looks like it would need a mildly annoying rebase.

The patchset does not come with any numbers for forking either, and it
predates other unscrewing I did in the area (which notably moved the
tasklist lock away from being the primary bottleneck). If I'm reading
it correctly it does one lock trip per ns level. This adds overhead
from both a single-threaded and a multi-threaded standpoint, in
particular it still has multiple acquires of the same lock. If it is
fine to use xarray and retain pidmap_lock that's a different story; I
have not looked into it.

Regardless, in order to give whatever replacement a fair perf eval
against idr, at least the following 2 bits need to get sorted out:
- the self-induced repeat locking of pidmap_lock
- high cost of kmalloc (to my understanding waiting for sheaves4all)


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/3] ns: pad refcount
  2025-11-23  6:30 ` [PATCH 2/3] ns: pad refcount Mateusz Guzik
@ 2025-11-23 18:58   ` Oleg Nesterov
  2025-11-23 19:47     ` Mateusz Guzik
  0 siblings, 1 reply; 16+ messages in thread
From: Oleg Nesterov @ 2025-11-23 18:58 UTC (permalink / raw)
  To: Mateusz Guzik; +Cc: brauner, linux-kernel, akpm, linux-mm

On 11/23, Mateusz Guzik wrote:
>
>  struct ns_common {
> +	struct {
> +		refcount_t __ns_ref; /* do not use directly */
> +	} ____cacheline_aligned_in_smp;

Cough... stupid question. Why not just

	refcount_t __ns_ref ____cacheline_aligned_in_smp;

? why do we need the anonymous struct?

Oleg.



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/3] ns: pad refcount
  2025-11-23 18:58   ` Oleg Nesterov
@ 2025-11-23 19:47     ` Mateusz Guzik
  2025-11-24 18:25       ` Oleg Nesterov
  0 siblings, 1 reply; 16+ messages in thread
From: Mateusz Guzik @ 2025-11-23 19:47 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: brauner, linux-kernel, akpm, linux-mm

On Sun, Nov 23, 2025 at 7:58 PM Oleg Nesterov <oleg@redhat.com> wrote:
>
> On 11/23, Mateusz Guzik wrote:
> >
> >  struct ns_common {
> > +     struct {
> > +             refcount_t __ns_ref; /* do not use directly */
> > +     } ____cacheline_aligned_in_smp;
>
> Cough... stupid question. Why not just
>
>         refcount_t __ns_ref ____cacheline_aligned_in_smp;
>
> ? why do we need the anonymous struct?
>

This would merely align the offset of the field, with the rest
directly following it.
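
To illustrate with a simplified sketch (member names are made up): with
the attribute on the field only, the next member packs right behind it;
wrapping it in an aligned anonymous struct pads the unit out to a full
cacheline:

	/* "other" lands right after ref, still sharing its cacheline */
	struct a {
		refcount_t ref ____cacheline_aligned_in_smp;
		u32 other;
	};

	/* the anonymous struct's size is rounded up to a cacheline,
	 * so "other" starts on the next one */
	struct b {
		struct {
			refcount_t ref;
		} ____cacheline_aligned_in_smp;
		u32 other;
	};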


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 3/3] pid: only take pidmap_lock once on alloc
  2025-11-23  6:30 ` [PATCH 3/3] pid: only take pidmap_lock once on alloc Mateusz Guzik
@ 2025-11-23 20:09   ` Oleg Nesterov
  2025-11-23 22:48     ` Mateusz Guzik
  0 siblings, 1 reply; 16+ messages in thread
From: Oleg Nesterov @ 2025-11-23 20:09 UTC (permalink / raw)
  To: Mateusz Guzik; +Cc: brauner, linux-kernel, akpm, linux-mm

On 11/23, Mateusz Guzik wrote:
>
> This reduces contention on the lock during parallel clone/exit.
>
> It remains the primary bottleneck in such a case.
>
> While here tidy up the code.

Not sure I can review... But FWIW this patch looks good to me after the
very quick glance. I'll try to actually read it tomorrow.

But please find a couple of minor "can't resist" nits below.

> +	for (tmp = ns, i = ns->level; i >= 0; i--) {
> +		int tid = set_tid[ns->level - i];
>
>  		if (tid) {
>  			nr = idr_alloc(&tmp->idr, NULL, tid,
> @@ -235,10 +261,8 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
>  			 * a partially initialized PID (see below).
>  			 */
>  			nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min,
> -					      pid_max, GFP_ATOMIC);
> +					      pid_max[ns->level - i], GFP_ATOMIC);
>  		}
> -		spin_unlock(&pidmap_lock);
> -		idr_preload_end();
>
>  		if (nr < 0) {
>  			retval = (nr == -ENOSPC) ? -EAGAIN : nr;

So. With or without this patch we have

	if (tid) {
		nr = idr_alloc(...);

		if (nr == -ENOSPC)
			nr = -EEXIST;
	} else {
		nr = idr_alloc_cyclic(...);
	}

	if (nr < 0) {
		retval = (nr == -ENOSPC) ? -EAGAIN : nr;
		goto out_free;
	}

and somehow this looks annoying to me... Perhaps it makes sense to make this
code more symmetric (and imo more readable) ?

	if (tid) {
		nr = idr_alloc(...);

		if (nr == -ENOSPC)
			nr = -EEXIST;
	} else {
		nr = idr_alloc_cyclic(...);

		if (nr == -ENOSPC)
			nr = -EAGAIN;
	}

	if (nr < 0) {
		retval = nr;
		goto out_free;
	}

> -	idr_preload(GFP_KERNEL);
> -	spin_lock(&pidmap_lock);
> -	if (!(ns->pid_allocated & PIDNS_ADDING))
> -		goto out_unlock;
> +	if (unlikely(!(ns->pid_allocated & PIDNS_ADDING)))
> +		goto out_free;
>  	pidfs_add_pid(pid);
> -	for ( ; upid >= pid->numbers; --upid) {
> +	for (upid = pid->numbers + ns->level; upid >= pid->numbers; --upid) {
>  		/* Make the PID visible to find_pid_ns. */
>  		idr_replace(&upid->ns->idr, pid, upid->nr);
>  		upid->ns->pid_allocated++;

So.. unless I am totally confused the current code has another
idr_preload + idr_preload_end around pidfs_add_pid().

AFAICS, this makes no sense, and your patch removes it. But perhaps this
deserves a note in the changelog or even a separate patch?


And another stupid question... I don't understand fs/pidfs.c, but it looks
a bit strange to me that pidfs_add_pid() is called before the

	for (...)
		idr_replace(...);

loop. I don't see any problem, but to me it would look a bit better to do
pidfs_add_pid(pid) when this pid is fully initialized...

Oleg.



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 0/3] further damage-control lack of clone scalability
  2025-11-23 16:39   ` Mateusz Guzik
@ 2025-11-23 21:45     ` Matthew Wilcox
  2025-11-23 22:33       ` Mateusz Guzik
  0 siblings, 1 reply; 16+ messages in thread
From: Matthew Wilcox @ 2025-11-23 21:45 UTC (permalink / raw)
  To: Mateusz Guzik; +Cc: oleg, brauner, linux-kernel, akpm, linux-mm

On Sun, Nov 23, 2025 at 05:39:16PM +0100, Mateusz Guzik wrote:
> I have some recollection we talked about this on irc long time ago.
> 
> It is my *suspicion* this would be best served with a sparse bitmap +
> a hash table.

Maybe!  I've heard other people speculate that would be a better data
structure.  I know we switched away from a hash table for the page
cache, but that has a different usage pattern where it's common to go
from page N to page N+1, N+2, ...  Other than ps, I don't think we often
have that pattern for PIDs.

> Such a solution was already present, but it got replaced by
> 95846ecf9dac5089 ("pid: replace pid bitmap implementation with IDR
> API").
> 
> Commit message cites the following bench results:
>     The following are the stats for ps, pstree and calling readdir on /proc
>     for 10,000 processes.
> 
>     ps:
>             With IDR API    With bitmap
>     real    0m1.479s        0m2.319s
>     user    0m0.070s        0m0.060s
>     sys     0m0.289s        0m0.516s
> 
>     pstree:
>             With IDR API    With bitmap
>     real    0m1.024s        0m1.794s
>     user    0m0.348s        0m0.612s
>     sys     0m0.184s        0m0.264s
> 
>     proc:
>             With IDR API    With bitmap
>     real    0m0.059s        0m0.074s
>     user    0m0.000s        0m0.004s
>     sys     0m0.016s        0m0.016s
> 
> Impact on clone was not benchmarked afaics.

It shouldn't be too much effort for you to check out 95846ecf9dac5089
and 95846ecf9dac5089^ to run your benchmark on both?  That would seem
like the cheapest way of assessing the performance of hash+bitmap
vs IDR.

> Regardless, in order to give whatever replacement a fair perf eval
> against idr, at least the following 2 bits need to get sorted out:
> - the self-induced repeat locking of pidmap_lock
> - high cost of kmalloc (to my understanding waiting for sheaves4all)

The nice thing about XArray (compared to IDR) is that there's no
requirement to preallocate.  Only 1.6% of xa_alloc() calls result in
calling slab.  The downside is that means that XArray needs to know
where its lock is (ie xa_lock) so that it can drop the lock in order to
allocate without using GFP_ATOMIC.

At one point I kind of had a plan to create a multi-xarray where you had
multiple xarrays that shared a single lock.  Or maybe this sharding is
exactly what's needed; I haven't really analysed the pid locking to see
what's needed.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 0/3] further damage-control lack of clone scalability
  2025-11-23 21:45     ` Matthew Wilcox
@ 2025-11-23 22:33       ` Mateusz Guzik
  2025-11-24  4:03         ` Mateusz Guzik
  0 siblings, 1 reply; 16+ messages in thread
From: Mateusz Guzik @ 2025-11-23 22:33 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: oleg, brauner, linux-kernel, akpm, linux-mm

On Sun, Nov 23, 2025 at 10:45 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Sun, Nov 23, 2025 at 05:39:16PM +0100, Mateusz Guzik wrote:
> > I have some recollection we talked about this on irc long time ago.
> >
> > It is my *suspicion* this would be best served with a sparse bitmap +
> > a hash table.
>
> Maybe!  I've heard other people speculate that would be a better data
> structure.  I know we switched away from a hash table for the page
> cache, but that has a different usage pattern where it's common to go
> from page N to page N+1, N+2, ...  Other than ps, I don't think we often
> have that pattern for PIDs.
>

Well it does seem like the obvious choice to try, and like I said it
was even tried in Linux. It also happens to be what FreeBSD is doing
and it gets better scalability, despite other global serialization
points (that is to say both kernels *suck* on this front, it just so
happens FreeBSD sucks less).

> > Such a solution was already present, but it got replaced by
> > 95846ecf9dac5089 ("pid: replace pid bitmap implementation with IDR
> > API").
> >
> > Commit message cites the following bench results:
> >     The following are the stats for ps, pstree and calling readdir on /proc
> >     for 10,000 processes.
> >
> >     ps:
> >             With IDR API    With bitmap
> >     real    0m1.479s        0m2.319s
> >     user    0m0.070s        0m0.060s
> >     sys     0m0.289s        0m0.516s
> >
> >     pstree:
> >             With IDR API    With bitmap
> >     real    0m1.024s        0m1.794s
> >     user    0m0.348s        0m0.612s
> >     sys     0m0.184s        0m0.264s
> >
> >     proc:
> >             With IDR API    With bitmap
> >     real    0m0.059s        0m0.074s
> >     user    0m0.000s        0m0.004s
> >     sys     0m0.016s        0m0.016s
> >
> > Impact on clone was not benchmarked afaics.
>
> It shouldn't be too much effort for you to check out 95846ecf9dac5089
> and 95846ecf9dac5089^ to run your benchmark on both?  That would seem
> like the cheapest way of assessing the performance of hash+bitmap
> vs IDR.
>

The commit is 2017 vintage and that kernel has some problems I fixed
in the area. The commit is not easily revertable either.

Even then the hash as implemented at the time looks suspicious.

I think the best way forward is to slap together a PoC consisting of
the obvious vmalloced bitmap for the entire pid range + a hash table
and see if that has legs. If so, then one can spend time making it
memory-efficient.

> > Regardless, in order to give whatever replacement a fair perf eval
> > against idr, at least the following 2 bits need to get sorted out:
> > - the self-induced repeat locking of pidmap_lock
> > - high cost of kmalloc (to my understanding waiting for sheaves4all)
>
> The nice thing about XArray (compared to IDR) is that there's no
> requirement to preallocate.  Only 1.6% of xa_alloc() calls result in
> calling slab.

Technically there is no requirement in idr either, as it also kmallocs
with GFP_ATOMIC inside.

> At one point I kind of had a plan to create a multi-xarray where you had
> multiple xarrays that shared a single lock.  Or maybe this sharding is
> exactly what's needed; I haven't really analysed the pid locking to see
> what's needed.

So ignoring bottlenecks outside of pid management, there is the global
pidmap_lock. The xarray patchset I linked above technically speaking
provides per-namespace locking, but that does not cut it as you have
to lock all namespaces anyway, including the global one.

Ultimately, in order to make this scale CPUs need to stop sharing the
locks (in the common case anyway). To that end the PID space needs to
get partitioned, with ranges allocated to small sets of CPUs (say 8 or
10 or similar -- small enough for contention to not matter outside of a
microbenchmark). The ID space on Linux is enormous, so this should not
be a problem. The only potential issue is that PIDs will no longer be
sequential, which is userspace-visible (I mean, suppose a wanker knows
for a fact nobody is forking and insists on taking advantage of it (by
Hyrum's law), then his two forks in a row have predictable IDs).
Anyhow, then most real workloads should be able to allocate IDs
without depleting the range assigned to whatever CPUs they happen to be
running on (but deallocating may still need to look at other CPU ranges).

With something like a bitmap + hash this is trivially achievable --
just have a set of bitmaps and have each assigned a 'base' which you
add/subtract on the id obtained from the given bitmap. The hash is a
non-problem.

I don't know about xarray. Bottom line though, even if one was to pass
a lock around to xarray primitives but without sorting out the range
lock problem, that would merely be a stopgap.

All that said, I don't know if/when I'll get around to something like this.

I think my addition to the idr api is trivial and, absent someone
volunteering to do the needful(tm) sooner rather than later, it should
not be considered a blocker.

My worry is, should this patchset get stalled, someone is going to
swoop in and submit changes which make it infeasible to only take the
lock once.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 3/3] pid: only take pidmap_lock once on alloc
  2025-11-23 20:09   ` Oleg Nesterov
@ 2025-11-23 22:48     ` Mateusz Guzik
  0 siblings, 0 replies; 16+ messages in thread
From: Mateusz Guzik @ 2025-11-23 22:48 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: brauner, linux-kernel, akpm, linux-mm

On Sun, Nov 23, 2025 at 9:10 PM Oleg Nesterov <oleg@redhat.com> wrote:
>
> On 11/23, Mateusz Guzik wrote:
> >
> > This reduces contention on the lock during parallel clone/exit.
> >
> > It remains the primary bottleneck in such a case.
> >
> > While here tidy up the code.
>
> Not sure I can review... But FWIW this patch looks good to me after the
> very quick glance. I'll try to actually read it tomorrow.
>
> But please find a couple of minor "can't resist" nits below.
>
> > +     for (tmp = ns, i = ns->level; i >= 0; i--) {
> > +             int tid = set_tid[ns->level - i];
> >
> >               if (tid) {
> >                       nr = idr_alloc(&tmp->idr, NULL, tid,
> > @@ -235,10 +261,8 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
> >                        * a partially initialized PID (see below).
> >                        */
> >                       nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min,
> > -                                           pid_max, GFP_ATOMIC);
> > +                                           pid_max[ns->level - i], GFP_ATOMIC);
> >               }
> > -             spin_unlock(&pidmap_lock);
> > -             idr_preload_end();
> >
> >               if (nr < 0) {
> >                       retval = (nr == -ENOSPC) ? -EAGAIN : nr;
>
> So. With or without this patch we have
>
>         if (tid) {
>                 nr = idr_alloc(...);
>
>                 if (nr == -ENOSPC)
>                         nr = -EEXIST;
>         } else {
>                 nr = idr_alloc_cyclic(...);
>         }
>
>         if (nr < 0) {
>                 retval = (nr == -ENOSPC) ? -EAGAIN : nr;
>                 goto out_free;
>         }
>
> and somehow this looks annoying to me... Perhaps it makes sense to make this
> code more symmetric (and imo more readable) ?
>

I agree, but I also tried to avoid non-perf changes to make the patch
easier to review.

There is a ton of cleanup which can be done here; maybe I'll add some later.

>         if (tid) {
>                 nr = idr_alloc(...);
>
>                 if (nr == -ENOSPC)
>                         nr = -EEXIST;
>         } else {
>                 nr = idr_alloc_cyclic(...);
>
>                 if (nr == -ENOSPC)
>                         nr = -EAGAIN;
>         }
>
>         if (nr < 0)
>                 retval = nr;
>                 goto out_free;
>         }
>
> > -     idr_preload(GFP_KERNEL);
> > -     spin_lock(&pidmap_lock);
> > -     if (!(ns->pid_allocated & PIDNS_ADDING))
> > -             goto out_unlock;
> > +     if (unlikely(!(ns->pid_allocated & PIDNS_ADDING)))
> > +             goto out_free;
> >       pidfs_add_pid(pid);
> > -     for ( ; upid >= pid->numbers; --upid) {
> > +     for (upid = pid->numbers + ns->level; upid >= pid->numbers; --upid) {
> >               /* Make the PID visible to find_pid_ns. */
> >               idr_replace(&upid->ns->idr, pid, upid->nr);
> >               upid->ns->pid_allocated++;
>
> So.. unless I am totally confused the current code has another
> idr_preload + idr_preload_end around pidfs_add_pid().
>
> AFAICS, this makes no sense, and your patch removes it. But perhaps this
> deserves a note in the changelog or even a separate patch?
>

Oh heh, that was not intentional. After I was done massaging this I
just mindlessly removed the lock acquire + preload.

It does indeed look like it serves no purpose; I'm going to submit a
separate patch to remove it.

>
> And another stupid question... I don't understand fs/pidfs.c, but it looks
> a bit strange to me that pidfs_add_pid() is called before the
>
>         for (...)
>                 idr_replace(...);
>
> loop. I don't see any problem, but to me it would look a bit better to do
> pidfs_add_pid(pid) when this pid is fully initialized...
>

That's a question to Christian.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 0/3] further damage-control lack of clone scalability
  2025-11-23 22:33       ` Mateusz Guzik
@ 2025-11-24  4:03         ` Mateusz Guzik
  0 siblings, 0 replies; 16+ messages in thread
From: Mateusz Guzik @ 2025-11-24  4:03 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: oleg, brauner, linux-kernel, akpm, linux-mm

On Sun, Nov 23, 2025 at 11:33 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
> Ultimately in order to makes this scale CPUs need to stop sharing the
> locks (in the common case anyway). To that end PID space needs to get
> partitioned, with ranges allocated to small sets of CPUs (say 8 or 10
> or similar -- small enough for contention to not matter outside of a
> microbenchmark). The ID space on Linux is enormous, so this should not
> be a problem. The only potential issue is that PIDs no longer will be
> sequential which is userspace-visible (I mean, suppose a wanker knows
> for a fact nobody is forking and insists on taking advantage of it (by
> Hyrum's law), then his two forks in a row have predictable IDs).
> Anyhow, then most real workloads should be able to allocate IDs
> without depleting the range on whatever they happen to be running on
> (but deallocating may still need to look at other CPU ranges).
>
> With something like a bitmap + hash this is trivially achievable --
> just have a set of bitmaps and have each assigned a 'base' which you
> add/subtract on the id obtained from the given bitmap. The hash is a
> non-problem.
>

So I had a little bit of a think and came up with something; it boils
down to special-casing the last level of a (quasi)radix tree. It
provides the scalability I was looking for, albeit with some
uglification. This is a sketch.

Note that as part of the allocation policy the kernel must postpone pid
reuse, as in you can't just free/alloc in a tiny range and call it a
day.

Part of the issue for me is that there are 32 levels of allowed
namespaces. The stock code will relock pidmap_lock for every single
one(!) on alloc, which is just terrible. Suppose one introduces
per-namespace locks: that's still 32 lock trips to grab a pid, and that
still does not solve the scalability problem. For my taste that's
questionable at best, but at the same time this is what the kernel is
already doing, so let's pretend for a minute the relocks are not a
concern.

The solution is based on a (quasi)radix tree where the datum is a
pointer to a struct containing a spinlock, a bitmap and an array of
pids. It will likely be an xarray, but I'm going to keep typing radix
for the time being as this is the concept which matters.

The struct contains a carved-out id range and can fit -- say -- 250
entries or so, or whatever else fits in a size considered fine(tm).
The actual pid is obtained by adding up the radix id (which serves as
a prefix) and the offset into the array.

In order to alloc a pid for a given namespace, the calling CPU would
check if it has a range carved out. If so, it locks the thing and
looks for a free pid. Absent a free pid or a range in the first place,
it goes to the xarray to get space. This avoids synchronisation with
other CPUs for the 250 forks (modulo a thread with an id from this
range exiting on a different CPU), which sorts out the scalability
problem in practical settings.

Of course once someone notices that there are no IDs in use anymore
*and* the last id was handed out at some point, the range gets freed.

But you have to do the locking for every ns.

So let's say it is in fact a problem and it would be most preferred if
the CPU could take *one* lock and stick with it for all namespaces,
all while retaining scalability.

Instead, every CPU could have its own pidmap_lock. The struct with the
bitmap + pid array would have a pointer to a spinlock, which would
refer to the pidmap lock of whichever CPU allocated the range.

Et voila, allocs still get away with one lock acquire in the common
case where there are free ids in all namespaces which need to be
touched. Contention is only present on xarray locks if you ran out of
space *or* on the pidmap lock if someone is freeing an id. Hell, this
can probably be made lockless to further speed it up if need be.
However, lockless or not, the key point is that most allocs will *not*
be bouncing any cachelines.
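
A rough data-structure sketch of the above (all names are made up; the
xarray lookup and the freeing side are omitted):

	/* one carved-out range of ids, stored as the (quasi)radix datum */
	struct pid_chunk {
		spinlock_t *lock;	/* pidmap lock of the CPU which carved out the range */
		int base;		/* first id covered by this chunk */
		DECLARE_BITMAP(used, 250);
		struct pid *pids[250];
	};

	/* caller holds *c->lock */
	static int chunk_alloc_id(struct pid_chunk *c, struct pid *pid)
	{
		int off = find_first_zero_bit(c->used, 250);

		if (off >= 250)
			return -1;	/* range exhausted, fall back to the xarray */
		__set_bit(off, c->used);
		c->pids[off] = pid;
		return c->base + off;	/* actual pid = range prefix + offset */
	}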


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/3] ns: pad refcount
  2025-11-23 19:47     ` Mateusz Guzik
@ 2025-11-24 18:25       ` Oleg Nesterov
  0 siblings, 0 replies; 16+ messages in thread
From: Oleg Nesterov @ 2025-11-24 18:25 UTC (permalink / raw)
  To: Mateusz Guzik; +Cc: brauner, linux-kernel, akpm, linux-mm

On 11/23, Mateusz Guzik wrote:
>
> On Sun, Nov 23, 2025 at 7:58 PM Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > On 11/23, Mateusz Guzik wrote:
> > >
> > >  struct ns_common {
> > > +     struct {
> > > +             refcount_t __ns_ref; /* do not use directly */
> > > +     } ____cacheline_aligned_in_smp;
> >
> > Cough... stupid question. Why not just
> >
> >         refcount_t __ns_ref ____cacheline_aligned_in_smp;
> >
> > ? why do we need the anonymous struct?
> >
>
> This would merely align the offset of the field, with the rest
> directly following.

Ah. I didn't bother to read the changelog and misunderstood the intent.

OK, thanks. At least I have warned you that my question is stupid ;)

Oleg.



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 0/3] further damage-control lack of clone scalability
  2025-11-23 15:00 ` [PATCH 0/3] further damage-control lack of clone scalability Matthew Wilcox
  2025-11-23 16:39   ` Mateusz Guzik
@ 2025-12-03  8:37   ` Mateusz Guzik
  2025-12-03  9:18     ` Mateusz Guzik
  1 sibling, 1 reply; 16+ messages in thread
From: Mateusz Guzik @ 2025-12-03  8:37 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: oleg, brauner, linux-kernel, akpm, linux-mm

On Sun, Nov 23, 2025 at 4:00 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Sun, Nov 23, 2025 at 07:30:51AM +0100, Mateusz Guzik wrote:
> > When spawning and killing threads in separate processes in parallel the
> > primary bottleneck on the stock kernel is pidmap_lock, largely because
> > of a back-to-back acquire in the common case.
> >
> > Benchmark code at the end.
> >
> > With this patchset alloc_pid() only takes the lock once and consequently
> > alleviates the problem. While scalability improves, the lock remains the
> > primary bottleneck by a large margin.
> >
> > I believe idr is a poor choice for the task at hand to begin with, but
> > sorting out that out beyond the scope of this patchset. At the same time
> > any replacement would be best evaluated against a state where the
> > above relock problem is fixed.
>
> Good news!  The IDR is deprecated.  Bad news!  I'm not 100% sure that
> the XArray is quite appropriate for this usecase.  I am opposed to
> introducing more IDR APIs.  Have you looked at converting to the XArray?
> Or do you have a better data structure in mind than the XArray?
>

Hi Willy,

in other responses I outlined what I suspect would be a viable long
term solution, very much not idr-based.

However, that's not something I'm likely to implement anytime soon and
I doubt there is anyone willing to pick up the matter.

Whatever the long term solution and whoever/whenever implements it, the
current code avoidably loses out on some performance because of at
least two lock acquires on each fork instead of one.

At the moment it is structured in a way which makes it possible to
take the lock once with minor effort.

Leaving this in the current state results in a minor risk that someone
will make changes which turn fixing the scalability issue into a
massive problem.

With my patch as-is I can suffer some pain and avoid modifying
idr_prealloc, but at the same time I don't think the modification at
hand is particularly problematic. Notably it does not change any of
the internals.

So the question is whether by "opposed to introducing more IDR APIs"
you mean you are just not fond of it, or whether it is a straight up
NAK. Per the explanation above, I think the change is tolerable in its
own right and I provided reasoning why I'm adding it in the first
place.

If the latter, I'll see about massaging this to drop locks and retry
memory alloc.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 0/3] further damage-control lack of clone scalability
  2025-12-03  8:37   ` Mateusz Guzik
@ 2025-12-03  9:18     ` Mateusz Guzik
  0 siblings, 0 replies; 16+ messages in thread
From: Mateusz Guzik @ 2025-12-03  9:18 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: oleg, brauner, linux-kernel, akpm, linux-mm

On Wed, Dec 3, 2025 at 9:37 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> On Sun, Nov 23, 2025 at 4:00 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Sun, Nov 23, 2025 at 07:30:51AM +0100, Mateusz Guzik wrote:
> > > When spawning and killing threads in separate processes in parallel the
> > > primary bottleneck on the stock kernel is pidmap_lock, largely because
> > > of a back-to-back acquire in the common case.
> > >
> > > Benchmark code at the end.
> > >
> > > With this patchset alloc_pid() only takes the lock once and consequently
> > > alleviates the problem. While scalability improves, the lock remains the
> > > primary bottleneck by a large margin.
> > >
> > > I believe idr is a poor choice for the task at hand to begin with, but
> > > sorting out that out beyond the scope of this patchset. At the same time
> > > any replacement would be best evaluated against a state where the
> > > above relock problem is fixed.
> >
> > Good news!  The IDR is deprecated.  Bad news!  I'm not 100% sure that
> > the XArray is quite appropriate for this usecase.  I am opposed to
> > introducing more IDR APIs.  Have you looked at converting to the XArray?
> > Or do you have a better data structure in mind than the XArray?
> >
>
> Hi Willy,
>
> in other responses I outlined what I suspect would be a viable long
> term solution, very much not idr-based.
>
> However, that's not something I'm likely to implement anytime soon and
> I doubt there is someone willing to pick up the matter.
>
> Whatever the long term solution and who/when implements it, the
> current code avoidably loses out on some performance because of at
> least two lock acquires on each fork instead of one.
>
> At the moment it is structured in a way which makes it possible to
> take the lock once with minor effort.
>
> Leaving this in the current state results in a minor risk that someone
> will make changes which turn fixing the scalability issue into a
> massive problem.
>
> With my patch as-is I can suffer some pain and avoid modifying
> idr_prealloc, but at the same time I don't think the modification at
> hand is particularly problematic. Notably it does not change any of
> the internals.
>
> So the question is if by "opposed to introducing more IDR APIs" you
> mean you are just not fond of it, or is it a straight up NAK. Per the
> explanation above, I think the change is tolerable in its own right
> and I provided reasoning why I'm adding it in the first place.
>
> If the latter, I'll see about massaging this to drop locks and retry
> memory alloc.

Welp, retrying preload turned out to be significantly less painful
than I thought, so you can consider the above question moot. I'll be
posting v2 without idr changes.


^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2025-12-03  9:18 UTC | newest]

Thread overview: 16+ messages
2025-11-23  6:30 [PATCH 0/3] further damage-control lack of clone scalability Mateusz Guzik
2025-11-23  6:30 ` [PATCH 1/3] idr: add idr_prealloc_many Mateusz Guzik
2025-11-23  6:30 ` [PATCH 2/3] ns: pad refcount Mateusz Guzik
2025-11-23 18:58   ` Oleg Nesterov
2025-11-23 19:47     ` Mateusz Guzik
2025-11-24 18:25       ` Oleg Nesterov
2025-11-23  6:30 ` [PATCH 3/3] pid: only take pidmap_lock once on alloc Mateusz Guzik
2025-11-23 20:09   ` Oleg Nesterov
2025-11-23 22:48     ` Mateusz Guzik
2025-11-23 15:00 ` [PATCH 0/3] further damage-control lack of clone scalability Matthew Wilcox
2025-11-23 16:39   ` Mateusz Guzik
2025-11-23 21:45     ` Matthew Wilcox
2025-11-23 22:33       ` Mateusz Guzik
2025-11-24  4:03         ` Mateusz Guzik
2025-12-03  8:37   ` Mateusz Guzik
2025-12-03  9:18     ` Mateusz Guzik
