linux-mm.kvack.org archive mirror
* [PATCH v3 0/2] execve scalability issues, part 1
@ 2023-08-23  5:06 Mateusz Guzik
  2023-08-23  5:06 ` [PATCH v3 1/2] pcpcntr: add group allocation/free Mateusz Guzik
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Mateusz Guzik @ 2023-08-23  5:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: dennis, tj, cl, akpm, shakeelb, vegard.nossum, linux-mm, Mateusz Guzik

To start, I figured I would benchmark about as friendly a case as it
gets -- statically linked *separate* binaries, all doing execve in a loop.

I borrowed the bench from here:
http://apollo.backplane.com/DFlyMisc/doexec.c

$ cc -static -O2 -o static-doexec doexec.c
$ ./static-doexec $(nproc)

It prints a result every second.

My test box temporarily has only 26 cores, and even at that scale I run
into massive lock contention stemming from back-to-back calls to
percpu_counter_init (and percpu_counter_destroy later).

While not a panacea, one simple thing to do here is to batch these ops.
Since the term "batching" is already used in the file, I decided to
refer to it as "grouping" instead.

Even if this code could be patched to dodge these counters, I would
argue that a high-traffic alloc/free consumer is only a matter of time,
so it makes sense to facilitate one.

With the fix I get an ok win, to quote from the commit:
> Even at a very modest scale of 26 cores (ops/s):
> before: 133543.63
> after:  186061.81 (+39%)

Even with the patch these allocations remain a significant problem, but
the primary bottleneck shifts to:

    __pv_queued_spin_lock_slowpath+1
    _raw_spin_lock_irqsave+57
    folio_lruvec_lock_irqsave+91
    release_pages+590
    tlb_batch_pages_flush+61
    tlb_finish_mmu+101
    exit_mmap+327
    __mmput+61
    begin_new_exec+1245
    load_elf_binary+712
    bprm_execve+644
    do_execveat_common.isra.0+429
    __x64_sys_execve+50
    do_syscall_64+46
    entry_SYSCALL_64_after_hwframe+110

I intend to do more work in this area to mostly sort it out, but I would
not mind if someone else took the hammer to folio. :)

With this out of the way I'll be looking at some form of caching to
eliminate these allocs as a problem.

v3:
- fix !CONFIG_SMP build
- drop the backtrace from fork commit message

v2:
- force bigger alignment on alloc
- rename "counters" to "nr_counters" and pass prior to lock key
- drop {}'s for single-statement loops


Mateusz Guzik (2):
  pcpcntr: add group allocation/free
  fork: group allocation of per-cpu counters for mm struct

 include/linux/percpu_counter.h | 39 ++++++++++++++++++----
 kernel/fork.c                  | 14 ++------
 lib/percpu_counter.c           | 61 +++++++++++++++++++++++-----------
 3 files changed, 77 insertions(+), 37 deletions(-)

-- 
2.41.0




Thread overview: 8+ messages
2023-08-23  5:06 [PATCH v3 0/2] execve scalability issues, part 1 Mateusz Guzik
2023-08-23  5:06 ` [PATCH v3 1/2] pcpcntr: add group allocation/free Mateusz Guzik
2023-08-24  6:26   ` Dennis Zhou
2023-08-24 10:01     ` Vegard Nossum
2023-08-23  5:06 ` [PATCH v3 2/2] kernel/fork: group allocation/free of per-cpu counters for mm struct Mateusz Guzik
2023-08-24  6:28   ` Dennis Zhou
2023-09-06  8:25   ` kernel test robot
2023-08-25 15:14 ` [PATCH v3 0/2] execve scalability issues, part 1 Dennis Zhou
