hi, Shakeel,

On Tue, May 21, 2024 at 09:18:19PM -0700, Shakeel Butt wrote:
> On Tue, May 21, 2024 at 10:43:16AM +0800, Oliver Sang wrote:
> > hi, Shakeel,
> >
> [...]
> >
> > we reported regression on a 2-node Skylake server. so I found a 1-node
> > Skylake desktop (we don't have a 1-node server) to check.
> >
> Please try the following patch on both single node and dual node
> machines:

the regression is partially recovered by applying your patch (but one
case regressed even more, as shown below). details:

since you mentioned the whole patch-set behavior last time, I applied
the patch on top of
  a94032b35e5f9 memcg: use proper type for mod_memcg_state

below,
  fd2296741e2686ed6ecd05187e4 = a94032b35e5f9 + patch

for the regression in our original report, the test machine is:

model: Skylake
nr_node: 2
nr_cpu: 104
memory: 192G

the regression is partially recovered:

=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
  gcc-13/performance/x86_64-rhel-8.3/process/100%/debian-12-x86_64-20240206.cgz/lkp-skl-fpga01/page_fault2/will-it-scale

59142d87ab03b8ff a94032b35e5f97dc1023030d929 fd2296741e2686ed6ecd05187e4
---------------- --------------------------- ---------------------------
         %stddev     %change         %stddev     %change         %stddev
             \          |                \          |                \
     91713           -13.0%      79833            -4.5%      87614        will-it-scale.per_process_ops

detailed data is in part [1] of the attachment.

in later threads, we also reported similar regressions on other
platforms.

on:
model: Ice Lake
nr_node: 2
nr_cpu: 64
memory: 256G
brand: Intel(R) Xeon(R) Gold 6346 CPU @ 3.10GHz

the regression is partially recovered, though less clearly than above:

=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
  gcc-13/performance/x86_64-rhel-8.3/process/100%/debian-12-x86_64-20240206.cgz/lkp-icl-2sp9/page_fault2/will-it-scale

59142d87ab03b8ff a94032b35e5f97dc1023030d929 fd2296741e2686ed6ecd05187e4
---------------- --------------------------- ---------------------------
         %stddev     %change         %stddev     %change         %stddev
             \          |                \          |                \
    240373           -12.9%     209394           -10.1%     215996        will-it-scale.per_process_ops

detailed data is in part [2] of the attachment.

on:
model: Sapphire Rapids
nr_node: 2
nr_cpu: 224
memory: 512G
brand: Intel(R) Xeon(R) Platinum 8480CTDX

the regression is NOT recovered, and is even slightly worse:

=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
  gcc-13/performance/x86_64-rhel-8.3/process/100%/debian-12-x86_64-20240206.cgz/lkp-spr-2sp4/page_fault2/will-it-scale

59142d87ab03b8ff a94032b35e5f97dc1023030d929 fd2296741e2686ed6ecd05187e4
---------------- --------------------------- ---------------------------
         %stddev     %change         %stddev     %change         %stddev
             \          |                \          |                \
     78072            -5.6%      73683            -6.5%      72975        will-it-scale.per_process_ops

detailed data is in part [3] of the attachment.
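btw, in case it helps narrow down the remaining gap on Sapphire Rapids,
a cacheline-contention profile should show whether the hot cacheline
moved or actually went away. a rough sketch of how such a profile could
be collected with perf's c2c mode while page_fault2 is running (just a
suggestion for further analysis; no such data is attached here):

  # sample cache-to-cache transfers system-wide for 10 seconds while
  # the benchmark runs, then summarize the contended cachelines
  perf c2c record -a -- sleep 10
  perf c2c report --stats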
for single-node machines, we reported last time that there was no
regression on:

model: Skylake
nr_node: 1
nr_cpu: 36
memory: 32G
brand: Intel(R) Core(TM) i9-9980XE CPU @ 3.00GHz

we confirmed it's not impacted by this new patch, either:

=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
  gcc-13/performance/x86_64-rhel-8.3/process/100%/debian-12-x86_64-20240206.cgz/lkp-skl-d08/page_fault2/will-it-scale

59142d87ab03b8ff a94032b35e5f97dc1023030d929 fd2296741e2686ed6ecd05187e4
---------------- --------------------------- ---------------------------
         %stddev     %change         %stddev     %change         %stddev
             \          |                \          |                \
    136040            -0.1%     135881            -0.1%     135953        will-it-scale.per_process_ops

if you need detailed data for this comparison, please let us know.

BTW, after the last update, we found another single-node machine which
can reproduce the regression in our original report:

model: Cascade Lake
nr_node: 1
nr_cpu: 36
memory: 128G
brand: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz

the regression is also partially recovered now:

=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
  gcc-13/performance/x86_64-rhel-8.3/process/100%/debian-12-x86_64-20240206.cgz/lkp-csl-d02/page_fault2/will-it-scale

59142d87ab03b8ff a94032b35e5f97dc1023030d929 fd2296741e2686ed6ecd05187e4
---------------- --------------------------- ---------------------------
         %stddev     %change         %stddev     %change         %stddev
             \          |                \          |                \
    187483           -19.4%     151162           -12.1%     164714        will-it-scale.per_process_ops

detailed data is in part [4] of the attachment.

>
> From 00a84b489b9e18abd1b8ec575ea31afacaf0734b Mon Sep 17 00:00:00 2001
> From: Shakeel Butt
> Date: Tue, 21 May 2024 20:27:11 -0700
> Subject: [PATCH] memcg: rearrange fields of mem_cgroup_per_node
>
> At the moment the fields of mem_cgroup_per_node which get read on the
> performance-critical path share the cacheline with the fields which
> might get updated. This causes contention on that cacheline for
> concurrent readers. Let's move all the read-only pointers to the start
> of the struct, followed by the memcg-v1-only fields, with the fields
> which get updated often at the end.
>
> Signed-off-by: Shakeel Butt
> ---
>  include/linux/memcontrol.h | 18 ++++++++++--------
>  1 file changed, 10 insertions(+), 8 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 030d34e9d117..16efd9737be9 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -96,23 +96,25 @@ struct mem_cgroup_reclaim_iter {
>   * per-node information in memory controller.
>   */
>  struct mem_cgroup_per_node {
> -	struct lruvec		lruvec;
> +	/* Keep the read-only fields at the start */
> +	struct mem_cgroup	*memcg;		/* Back pointer, we cannot */
> +						/* use container_of	   */
>
>  	struct lruvec_stats_percpu __percpu	*lruvec_stats_percpu;
>  	struct lruvec_stats			*lruvec_stats;
> -
> -	unsigned long		lru_zone_size[MAX_NR_ZONES][NR_LRU_LISTS];
> -
> -	struct mem_cgroup_reclaim_iter	iter;
> -
>  	struct shrinker_info __rcu	*shrinker_info;
>
> +	/* memcg-v1 only stuff in middle */
> +
>  	struct rb_node		tree_node;	/* RB tree node */
>  	unsigned long		usage_in_excess;/* Set to the value by which */
>  						/* the soft limit is exceeded*/
>  	bool			on_tree;
> -	struct mem_cgroup	*memcg;		/* Back pointer, we cannot */
> -						/* use container_of	   */
> +
> +	/* Fields which get updated often at the end */
> +	struct lruvec		lruvec;
> +	unsigned long		lru_zone_size[MAX_NR_ZONES][NR_LRU_LISTS];
> +	struct mem_cgroup_reclaim_iter	iter;
>  };
>
>  struct mem_cgroup_threshold {
> --
> 2.43.0
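
one more note on the layout change itself: the patch relies on field
order alone, so whether the read-mostly pointers and the hot fields
really land on different cachelines still depends on the sizes and
alignment of everything in between (pahole -C mem_cgroup_per_node
vmlinux shows the actual boundaries). below is a minimal userspace
sketch of the same idea with the separation made explicit -- the field
names are hypothetical, and the kernel equivalent of the alignment
annotation would be ____cacheline_aligned_in_smp:

  /* build: cc -std=c11 -o layout_sketch layout_sketch.c */
  #include <stdalign.h>
  #include <stdatomic.h>
  #include <stddef.h>
  #include <stdio.h>

  #define CACHELINE 64    /* assumed cacheline size */

  struct pernode_sketch {
          /* read-mostly: written once at init, then only read on the
           * fast path (stand-ins for the memcg/lruvec_stats pointers) */
          const void *backing;
          const void *stats;

          /* frequently written: forced onto its own cacheline so that
           * stores here do not invalidate the line the readers above
           * are pulling the pointers from */
          alignas(CACHELINE) atomic_long hot_counter;
  };

  /* fails to build if the hot field shares a line with the pointers */
  _Static_assert(offsetof(struct pernode_sketch, hot_counter) % CACHELINE == 0,
                 "hot field not cacheline aligned");

  int main(void)
  {
          printf("size=%zu, hot_counter offset=%zu\n",
                 sizeof(struct pernode_sketch),
                 offsetof(struct pernode_sketch, hot_counter));
          return 0;
  }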