[PATCH v2 0/2] mm: memcontrol: cleanup and optimize for two accounting params


* [PATCH v2 0/2] mm: memcontrol: cleanup and optimize for two accounting params
@ 2022-09-19 18:06 Kairui Song
  2022-09-19 18:06 ` [PATCH v2 1/2] mm: memcontrol: use memcg_kmem_enabled in count_objcg_event Kairui Song
  2022-09-19 18:06 ` [PATCH v2 2/2] mm: memcontrol: make cgroup_memory_noswap a static key Kairui Song
  0 siblings, 2 replies; 9+ messages in thread
From: Kairui Song @ 2022-09-19 18:06 UTC (permalink / raw)
  To: cgroups, linux-mm
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

V2 of "[PATCH 0/2] mm: memcontrol: cleanup and optimize for accounting params".

Patch 1/2 is a minor cleanup for the kmem accounting path.

Patch 2/2 optimizes some hot paths by making cgroup_memory_noswap a
static key. A benchmark shows the swap paths now have ~5% lower overhead.

Changes since V1:
- Collect Acked-by for patch 2/2 from Michal Hocko, thanks!
- Simplify patch 1/2: don't drop mem_cgroup_kmem_disabled, and only
  change one call site as a minor cleanup. Update the commit message
  for patch 2/2. Both as suggested by Michal Hocko.

Kairui Song (2):
  mm: memcontrol: use memcg_kmem_enabled in count_objcg_event
  mm: memcontrol: make cgroup_memory_noswap a static key

 include/linux/memcontrol.h |  2 +-
 mm/memcontrol.c            | 27 +++++++++++++++++++--------
 2 files changed, 20 insertions(+), 9 deletions(-)

-- 
2.35.2




* [PATCH v2 1/2] mm: memcontrol: use memcg_kmem_enabled in count_objcg_event
  2022-09-19 18:06 [PATCH v2 0/2] mm: memcontrol: cleanup and optimize for two accounting params Kairui Song
@ 2022-09-19 18:06 ` Kairui Song
  2022-09-19 21:36   ` Shakeel Butt
                     ` (2 more replies)
  2022-09-19 18:06 ` [PATCH v2 2/2] mm: memcontrol: make cgroup_memory_noswap a static key Kairui Song
  1 sibling, 3 replies; 9+ messages in thread
From: Kairui Song @ 2022-09-19 18:06 UTC (permalink / raw)
  To: cgroups, linux-mm
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

There are currently two helpers for checking if cgroup kmem
accounting is enabled:

- mem_cgroup_kmem_disabled
- memcg_kmem_enabled

mem_cgroup_kmem_disabled is a simple helper that returns true
if cgroup.memory=nokmem is specified, and false otherwise.

memcg_kmem_enabled is a bit different: it returns true only if
cgroup.memory=nokmem is not specified and at least one non-root
cgroup with memory control enabled has ever been created. This helps
improve performance when kmem accounting was never actually activated,
and the check itself is optimized with a static branch.

mem_cgroup_kmem_disabled is meant for subsystems that need to
preallocate data for kmem accounting, since they could be initialized
before kmem accounting is activated. count_objcg_event doesn't
need that, so using memcg_kmem_enabled is better here.

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 include/linux/memcontrol.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 6257867fbf95..e6d3d5870d6f 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1779,7 +1779,7 @@ static inline void count_objcg_event(struct obj_cgroup *objcg,
 {
 	struct mem_cgroup *memcg;

-	if (mem_cgroup_kmem_disabled())
+	if (!memcg_kmem_enabled())
 		return;

 	rcu_read_lock();
-- 
2.35.2


* Re: [PATCH v2 1/2] mm: memcontrol: use memcg_kmem_enabled in count_objcg_event
  2022-09-19 18:06 ` [PATCH v2 1/2] mm: memcontrol: use memcg_kmem_enabled in count_objcg_event Kairui Song
@ 2022-09-19 21:36   ` Shakeel Butt
  2022-09-20  2:27   ` Roman Gushchin
  2022-09-20  2:41   ` Muchun Song
  2 siblings, 0 replies; 9+ messages in thread
From: Shakeel Butt @ 2022-09-19 21:36 UTC (permalink / raw)
  To: Kairui Song
  Cc: cgroups, linux-mm, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Muchun Song, Andrew Morton, linux-kernel

On Tue, Sep 20, 2022 at 02:06:33AM +0800, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
> 
> There are currently two helpers for checking if cgroup kmem
> accounting is enabled:
> 
> - mem_cgroup_kmem_disabled
> - memcg_kmem_enabled
> 
> mem_cgroup_kmem_disabled is a simple helper that returns true
> if cgroup.memory=nokmem is specified, otherwise returns false.
> 
> memcg_kmem_enabled is a bit different, it returns true if
> cgroup.memory=nokmem is not specified and there was at least one
> non-root memory control enabled cgroup ever created. This help improve
> performance when kmem accounting was not actually activated. And it's
> optimized with static branch.
> 
> The usage of mem_cgroup_kmem_disabled is for sub-systems that need to
> preallocate data for kmem accounting since they could be initialized
> before kmem accounting is activated. But count_objcg_event doesn't
> need that, so using memcg_kmem_enabled is better here.
> 
> Signed-off-by: Kairui Song <kasong@tencent.com>

Acked-by: Shakeel Butt <shakeelb@google.com>



* Re: [PATCH v2 1/2] mm: memcontrol: use memcg_kmem_enabled in count_objcg_event
  2022-09-19 18:06 ` [PATCH v2 1/2] mm: memcontrol: use memcg_kmem_enabled in count_objcg_event Kairui Song
  2022-09-19 21:36   ` Shakeel Butt
@ 2022-09-20  2:27   ` Roman Gushchin
  2022-09-20  2:41   ` Muchun Song
  2 siblings, 0 replies; 9+ messages in thread
From: Roman Gushchin @ 2022-09-20  2:27 UTC (permalink / raw)
  To: Kairui Song
  Cc: cgroups, linux-mm, Johannes Weiner, Michal Hocko, Shakeel Butt,
	Muchun Song, Andrew Morton, linux-kernel

On Tue, Sep 20, 2022 at 02:06:33AM +0800, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
> 
> There are currently two helpers for checking if cgroup kmem
> accounting is enabled:
> 
> - mem_cgroup_kmem_disabled
> - memcg_kmem_enabled
> 
> mem_cgroup_kmem_disabled is a simple helper that returns true
> if cgroup.memory=nokmem is specified, otherwise returns false.
> 
> memcg_kmem_enabled is a bit different, it returns true if
> cgroup.memory=nokmem is not specified and there was at least one
> non-root memory control enabled cgroup ever created. This help improve
> performance when kmem accounting was not actually activated. And it's
> optimized with static branch.
> 
> The usage of mem_cgroup_kmem_disabled is for sub-systems that need to
> preallocate data for kmem accounting since they could be initialized
> before kmem accounting is activated. But count_objcg_event doesn't
> need that, so using memcg_kmem_enabled is better here.
> 
> Signed-off-by: Kairui Song <kasong@tencent.com>

Acked-by: Roman Gushchin <roman.gushchin@linux.dev>

Thanks, Kairui!



* Re: [PATCH v2 1/2] mm: memcontrol: use memcg_kmem_enabled in count_objcg_event
  2022-09-19 18:06 ` [PATCH v2 1/2] mm: memcontrol: use memcg_kmem_enabled in count_objcg_event Kairui Song
  2022-09-19 21:36   ` Shakeel Butt
  2022-09-20  2:27   ` Roman Gushchin
@ 2022-09-20  2:41   ` Muchun Song
  2 siblings, 0 replies; 9+ messages in thread
From: Muchun Song @ 2022-09-20  2:41 UTC (permalink / raw)
  To: Kairui Song
  Cc: cgroups, Linux MM, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, Andrew Morton, linux-kernel



> On Sep 20, 2022, at 02:06, Kairui Song <ryncsn@gmail.com> wrote:
> 
> From: Kairui Song <kasong@tencent.com>
> 
> There are currently two helpers for checking if cgroup kmem
> accounting is enabled:
> 
> - mem_cgroup_kmem_disabled
> - memcg_kmem_enabled
> 
> mem_cgroup_kmem_disabled is a simple helper that returns true
> if cgroup.memory=nokmem is specified, otherwise returns false.
> 
> memcg_kmem_enabled is a bit different, it returns true if
> cgroup.memory=nokmem is not specified and there was at least one
> non-root memory control enabled cgroup ever created. This help improve
> performance when kmem accounting was not actually activated. And it's
> optimized with static branch.
> 
> The usage of mem_cgroup_kmem_disabled is for sub-systems that need to
> preallocate data for kmem accounting since they could be initialized
> before kmem accounting is activated. But count_objcg_event doesn't
> need that, so using memcg_kmem_enabled is better here.
> 
> Signed-off-by: Kairui Song <kasong@tencent.com>

Acked-by: Muchun Song <songmuchun@bytedance.com>

Thanks.




* [PATCH v2 2/2] mm: memcontrol: make cgroup_memory_noswap a static key
  2022-09-19 18:06 [PATCH v2 0/2] mm: memcontrol: cleanup and optimize for two accounting params Kairui Song
  2022-09-19 18:06 ` [PATCH v2 1/2] mm: memcontrol: use memcg_kmem_enabled in count_objcg_event Kairui Song
@ 2022-09-19 18:06 ` Kairui Song
  2022-09-19 21:37   ` Shakeel Butt
                     ` (2 more replies)
  1 sibling, 3 replies; 9+ messages in thread
From: Kairui Song @ 2022-09-19 18:06 UTC (permalink / raw)
  To: cgroups, linux-mm
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, linux-kernel, Kairui Song,
	Michal Hocko

From: Kairui Song <kasong@tencent.com>

cgroup_memory_noswap is used in many hot paths, so make it a static key
to lower the kernel overhead.

Using 8G of ZRAM as swap, benchmark using `perf stat -d -d -d --repeat 100`
with the following code snippet in a non-root cgroup:

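   #include <stdio.h>
   #include <string.h>
   #include <linux/mman.h>
   #include <sys/mman.h>
   #define MB (1024UL * 1024UL)
   int main(int argc, char **argv){
      /* Map 8000 MB of anonymous memory. */
      void *p = mmap(NULL, 8000 * MB, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      if (p == MAP_FAILED) {
         perror("mmap");
         return 1;
      }
      memset(p, 0xff, 8000 * MB);           /* fault every page in */
      madvise(p, 8000 * MB, MADV_PAGEOUT);  /* push the pages out to swap */
      memset(p, 0xff, 8000 * MB);           /* fault them back in from swap */
      return 0;
   }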

Before:
          7,021.43 msec task-clock                #    0.967 CPUs utilized            ( +-  0.03% )
             4,010      context-switches          #  573.853 /sec                     ( +-  0.01% )
                 0      cpu-migrations            #    0.000 /sec
         2,052,057      page-faults               #  293.661 K/sec                    ( +-  0.00% )
    12,616,546,027      cycles                    #    1.805 GHz                      ( +-  0.06% )  (39.92%)
       156,823,666      stalled-cycles-frontend   #    1.25% frontend cycles idle     ( +-  0.10% )  (40.25%)
       310,130,812      stalled-cycles-backend    #    2.47% backend cycles idle      ( +-  4.39% )  (40.73%)
    18,692,516,591      instructions              #    1.49  insn per cycle
                                                  #    0.01  stalled cycles per insn  ( +-  0.04% )  (40.75%)
     4,907,447,976      branches                  #  702.283 M/sec                    ( +-  0.05% )  (40.30%)
        13,002,578      branch-misses             #    0.26% of all branches          ( +-  0.08% )  (40.48%)
     7,069,786,296      L1-dcache-loads           #    1.012 G/sec                    ( +-  0.03% )  (40.32%)
       649,385,847      L1-dcache-load-misses     #    9.13% of all L1-dcache accesses  ( +-  0.07% )  (40.10%)
     1,485,448,688      L1-icache-loads           #  212.576 M/sec                    ( +-  0.15% )  (39.49%)
        31,628,457      L1-icache-load-misses     #    2.13% of all L1-icache accesses  ( +-  0.40% )  (39.57%)
         6,667,311      dTLB-loads                #  954.129 K/sec                    ( +-  0.21% )  (39.50%)
         5,668,555      dTLB-load-misses          #   86.40% of all dTLB cache accesses  ( +-  0.12% )  (39.03%)
               765      iTLB-loads                #  109.476 /sec                     ( +- 21.81% )  (39.44%)
         4,370,351      iTLB-load-misses          # 214320.09% of all iTLB cache accesses  ( +-  1.44% )  (39.86%)
       149,207,254      L1-dcache-prefetches      #   21.352 M/sec                    ( +-  0.13% )  (40.27%)

           7.25869 +- 0.00203 seconds time elapsed  ( +-  0.03% )

After:
          6,576.16 msec task-clock                #    0.953 CPUs utilized            ( +-  0.10% )
             4,020      context-switches          #  605.595 /sec                     ( +-  0.01% )
                 0      cpu-migrations            #    0.000 /sec
         2,052,056      page-faults               #  309.133 K/sec                    ( +-  0.00% )
    11,967,619,180      cycles                    #    1.803 GHz                      ( +-  0.36% )  (38.76%)
       161,259,240      stalled-cycles-frontend   #    1.38% frontend cycles idle     ( +-  0.27% )  (36.58%)
       253,605,302      stalled-cycles-backend    #    2.16% backend cycles idle      ( +-  4.45% )  (34.78%)
    19,328,171,892      instructions              #    1.65  insn per cycle
                                                  #    0.01  stalled cycles per insn  ( +-  0.10% )  (31.46%)
     5,213,967,902      branches                  #  785.461 M/sec                    ( +-  0.18% )  (30.68%)
        12,385,170      branch-misses             #    0.24% of all branches          ( +-  0.26% )  (34.13%)
     7,271,687,822      L1-dcache-loads           #    1.095 G/sec                    ( +-  0.12% )  (35.29%)
       649,873,045      L1-dcache-load-misses     #    8.93% of all L1-dcache accesses  ( +-  0.11% )  (41.41%)
     1,950,037,608      L1-icache-loads           #  293.764 M/sec                    ( +-  0.33% )  (43.11%)
        31,365,566      L1-icache-load-misses     #    1.62% of all L1-icache accesses  ( +-  0.39% )  (45.89%)
         6,767,809      dTLB-loads                #    1.020 M/sec                    ( +-  0.47% )  (48.42%)
         6,339,590      dTLB-load-misses          #   95.43% of all dTLB cache accesses  ( +-  0.50% )  (46.60%)
               736      iTLB-loads                #  110.875 /sec                     ( +-  1.79% )  (48.60%)
         4,314,836      iTLB-load-misses          # 518653.73% of all iTLB cache accesses  ( +-  0.63% )  (42.91%)
       144,950,156      L1-dcache-prefetches      #   21.836 M/sec                    ( +-  0.37% )  (41.39%)

           6.89935 +- 0.00703 seconds time elapsed  ( +-  0.10% )

Elapsed time drops from 7.259s to 6.899s, about 5% lower. perf report
shows no single significantly improved hotspot, because there are quite
a few callers of memcg_swap_enabled and do_memsw_account (which calls
memcg_swap_enabled). Many small optimizations add up to lower overhead
for the branch predictor and better performance.

Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/memcontrol.c | 27 +++++++++++++++++++--------
 1 file changed, 19 insertions(+), 8 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b69979c9ced5..5bb89c745233 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -90,9 +90,18 @@ static bool cgroup_memory_nokmem __ro_after_init;
 
 /* Whether the swap controller is active */
 #ifdef CONFIG_MEMCG_SWAP
-static bool cgroup_memory_noswap __ro_after_init;
+static bool cgroup_memory_noswap __initdata;
+
+static DEFINE_STATIC_KEY_FALSE(memcg_swap_enabled_key);
+static inline bool memcg_swap_enabled(void)
+{
+	return static_branch_likely(&memcg_swap_enabled_key);
+}
 #else
-#define cgroup_memory_noswap		1
+static inline bool memcg_swap_enabled(void)
+{
+	return false;
+}
 #endif
 
 #ifdef CONFIG_CGROUP_WRITEBACK
@@ -102,7 +111,7 @@ static DECLARE_WAIT_QUEUE_HEAD(memcg_cgwb_frn_waitq);
 /* Whether legacy memory+swap accounting is active */
 static bool do_memsw_account(void)
 {
-	return !cgroup_subsys_on_dfl(memory_cgrp_subsys) && !cgroup_memory_noswap;
+	return !cgroup_subsys_on_dfl(memory_cgrp_subsys) && memcg_swap_enabled();
 }
 
 #define THRESHOLDS_EVENTS_TARGET 128
@@ -7267,7 +7276,7 @@ void mem_cgroup_swapout(struct folio *folio, swp_entry_t entry)
 	if (!mem_cgroup_is_root(memcg))
 		page_counter_uncharge(&memcg->memory, nr_entries);
 
-	if (!cgroup_memory_noswap && memcg != swap_memcg) {
+	if (memcg_swap_enabled() && memcg != swap_memcg) {
 		if (!mem_cgroup_is_root(swap_memcg))
 			page_counter_charge(&swap_memcg->memsw, nr_entries);
 		page_counter_uncharge(&memcg->memsw, nr_entries);
@@ -7319,7 +7328,7 @@ int __mem_cgroup_try_charge_swap(struct folio *folio, swp_entry_t entry)
 
 	memcg = mem_cgroup_id_get_online(memcg);
 
-	if (!cgroup_memory_noswap && !mem_cgroup_is_root(memcg) &&
+	if (memcg_swap_enabled() && !mem_cgroup_is_root(memcg) &&
 	    !page_counter_try_charge(&memcg->swap, nr_pages, &counter)) {
 		memcg_memory_event(memcg, MEMCG_SWAP_MAX);
 		memcg_memory_event(memcg, MEMCG_SWAP_FAIL);
@@ -7351,7 +7360,7 @@ void __mem_cgroup_uncharge_swap(swp_entry_t entry, unsigned int nr_pages)
 	rcu_read_lock();
 	memcg = mem_cgroup_from_id(id);
 	if (memcg) {
-		if (!cgroup_memory_noswap && !mem_cgroup_is_root(memcg)) {
+		if (memcg_swap_enabled() && !mem_cgroup_is_root(memcg)) {
 			if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
 				page_counter_uncharge(&memcg->swap, nr_pages);
 			else
@@ -7367,7 +7376,7 @@ long mem_cgroup_get_nr_swap_pages(struct mem_cgroup *memcg)
 {
 	long nr_swap_pages = get_nr_swap_pages();
 
-	if (cgroup_memory_noswap || !cgroup_subsys_on_dfl(memory_cgrp_subsys))
+	if (!memcg_swap_enabled() || !cgroup_subsys_on_dfl(memory_cgrp_subsys))
 		return nr_swap_pages;
 	for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg))
 		nr_swap_pages = min_t(long, nr_swap_pages,
@@ -7384,7 +7393,7 @@ bool mem_cgroup_swap_full(struct page *page)
 
 	if (vm_swap_full())
 		return true;
-	if (cgroup_memory_noswap || !cgroup_subsys_on_dfl(memory_cgrp_subsys))
+	if (!memcg_swap_enabled() || !cgroup_subsys_on_dfl(memory_cgrp_subsys))
 		return false;
 
 	memcg = page_memcg(page);
@@ -7692,6 +7701,8 @@ static int __init mem_cgroup_swap_init(void)
 	if (cgroup_memory_noswap)
 		return 0;
 
+	static_branch_enable(&memcg_swap_enabled_key);
+
 	WARN_ON(cgroup_add_dfl_cftypes(&memory_cgrp_subsys, swap_files));
 	WARN_ON(cgroup_add_legacy_cftypes(&memory_cgrp_subsys, memsw_files));
 #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP)
-- 
2.35.2




* Re: [PATCH v2 2/2] mm: memcontrol: make cgroup_memory_noswap a static key
  2022-09-19 18:06 ` [PATCH v2 2/2] mm: memcontrol: make cgroup_memory_noswap a static key Kairui Song
@ 2022-09-19 21:37   ` Shakeel Butt
  2022-09-20  2:29   ` Roman Gushchin
  2022-09-20  2:43   ` Muchun Song
  2 siblings, 0 replies; 9+ messages in thread
From: Shakeel Butt @ 2022-09-19 21:37 UTC (permalink / raw)
  To: Kairui Song
  Cc: cgroups, linux-mm, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Muchun Song, Andrew Morton, linux-kernel, Michal Hocko

On Tue, Sep 20, 2022 at 02:06:34AM +0800, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
> 
> cgroup_memory_noswap is used in many hot path, so make it a static key
> to lower the kernel overhead.
> 
> Using 8G of ZRAM as SWAP, benchmark using `perf stat -d -d -d --repeat 100`
> with the following code snip in a non-root cgroup:
> 
>    #include <stdio.h>
>    #include <string.h>
>    #include <linux/mman.h>
>    #include <sys/mman.h>
>    #define MB 1024UL * 1024UL
>    int main(int argc, char **argv){
>       void *p = mmap(NULL, 8000 * MB, PROT_READ | PROT_WRITE,
>                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>       memset(p, 0xff, 8000 * MB);
>       madvise(p, 8000 * MB, MADV_PAGEOUT);
>       memset(p, 0xff, 8000 * MB);
>       return 0;
>    }
> 
> Before:
>           7,021.43 msec task-clock                #    0.967 CPUs utilized            ( +-  0.03% )
>              4,010      context-switches          #  573.853 /sec                     ( +-  0.01% )
>                  0      cpu-migrations            #    0.000 /sec
>          2,052,057      page-faults               #  293.661 K/sec                    ( +-  0.00% )
>     12,616,546,027      cycles                    #    1.805 GHz                      ( +-  0.06% )  (39.92%)
>        156,823,666      stalled-cycles-frontend   #    1.25% frontend cycles idle     ( +-  0.10% )  (40.25%)
>        310,130,812      stalled-cycles-backend    #    2.47% backend cycles idle      ( +-  4.39% )  (40.73%)
>     18,692,516,591      instructions              #    1.49  insn per cycle
>                                                   #    0.01  stalled cycles per insn  ( +-  0.04% )  (40.75%)
>      4,907,447,976      branches                  #  702.283 M/sec                    ( +-  0.05% )  (40.30%)
>         13,002,578      branch-misses             #    0.26% of all branches          ( +-  0.08% )  (40.48%)
>      7,069,786,296      L1-dcache-loads           #    1.012 G/sec                    ( +-  0.03% )  (40.32%)
>        649,385,847      L1-dcache-load-misses     #    9.13% of all L1-dcache accesses  ( +-  0.07% )  (40.10%)
>      1,485,448,688      L1-icache-loads           #  212.576 M/sec                    ( +-  0.15% )  (39.49%)
>         31,628,457      L1-icache-load-misses     #    2.13% of all L1-icache accesses  ( +-  0.40% )  (39.57%)
>          6,667,311      dTLB-loads                #  954.129 K/sec                    ( +-  0.21% )  (39.50%)
>          5,668,555      dTLB-load-misses          #   86.40% of all dTLB cache accesses  ( +-  0.12% )  (39.03%)
>                765      iTLB-loads                #  109.476 /sec                     ( +- 21.81% )  (39.44%)
>          4,370,351      iTLB-load-misses          # 214320.09% of all iTLB cache accesses  ( +-  1.44% )  (39.86%)
>        149,207,254      L1-dcache-prefetches      #   21.352 M/sec                    ( +-  0.13% )  (40.27%)
> 
>            7.25869 +- 0.00203 seconds time elapsed  ( +-  0.03% )
> 
> After:
>           6,576.16 msec task-clock                #    0.953 CPUs utilized            ( +-  0.10% )
>              4,020      context-switches          #  605.595 /sec                     ( +-  0.01% )
>                  0      cpu-migrations            #    0.000 /sec
>          2,052,056      page-faults               #  309.133 K/sec                    ( +-  0.00% )
>     11,967,619,180      cycles                    #    1.803 GHz                      ( +-  0.36% )  (38.76%)
>        161,259,240      stalled-cycles-frontend   #    1.38% frontend cycles idle     ( +-  0.27% )  (36.58%)
>        253,605,302      stalled-cycles-backend    #    2.16% backend cycles idle      ( +-  4.45% )  (34.78%)
>     19,328,171,892      instructions              #    1.65  insn per cycle
>                                                   #    0.01  stalled cycles per insn  ( +-  0.10% )  (31.46%)
>      5,213,967,902      branches                  #  785.461 M/sec                    ( +-  0.18% )  (30.68%)
>         12,385,170      branch-misses             #    0.24% of all branches          ( +-  0.26% )  (34.13%)
>      7,271,687,822      L1-dcache-loads           #    1.095 G/sec                    ( +-  0.12% )  (35.29%)
>        649,873,045      L1-dcache-load-misses     #    8.93% of all L1-dcache accesses  ( +-  0.11% )  (41.41%)
>      1,950,037,608      L1-icache-loads           #  293.764 M/sec                    ( +-  0.33% )  (43.11%)
>         31,365,566      L1-icache-load-misses     #    1.62% of all L1-icache accesses  ( +-  0.39% )  (45.89%)
>          6,767,809      dTLB-loads                #    1.020 M/sec                    ( +-  0.47% )  (48.42%)
>          6,339,590      dTLB-load-misses          #   95.43% of all dTLB cache accesses  ( +-  0.50% )  (46.60%)
>                736      iTLB-loads                #  110.875 /sec                     ( +-  1.79% )  (48.60%)
>          4,314,836      iTLB-load-misses          # 518653.73% of all iTLB cache accesses  ( +-  0.63% )  (42.91%)
>        144,950,156      L1-dcache-prefetches      #   21.836 M/sec                    ( +-  0.37% )  (41.39%)
> 
>            6.89935 +- 0.00703 seconds time elapsed  ( +-  0.10% )
> 
> The performance is clearly better. There is no significant hotspot
> improvement according to perf report, as there are quite a few
> callers of memcg_swap_enabled and do_memsw_account (which calls
> memcg_swap_enabled). Many pieces of minor optimizations resulted
> in lower overhead for the branch predictor, and bettter performance.
> 
> Acked-by: Michal Hocko <mhocko@suse.com>
> Signed-off-by: Kairui Song <kasong@tencent.com>

Acked-by: Shakeel Butt <shakeelb@google.com>



* Re: [PATCH v2 2/2] mm: memcontrol: make cgroup_memory_noswap a static key
  2022-09-19 18:06 ` [PATCH v2 2/2] mm: memcontrol: make cgroup_memory_noswap a static key Kairui Song
  2022-09-19 21:37   ` Shakeel Butt
@ 2022-09-20  2:29   ` Roman Gushchin
  2022-09-20  2:43   ` Muchun Song
  2 siblings, 0 replies; 9+ messages in thread
From: Roman Gushchin @ 2022-09-20  2:29 UTC (permalink / raw)
  To: Kairui Song
  Cc: cgroups, linux-mm, Johannes Weiner, Michal Hocko, Shakeel Butt,
	Muchun Song, Andrew Morton, linux-kernel, Michal Hocko

On Tue, Sep 20, 2022 at 02:06:34AM +0800, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
> 
> cgroup_memory_noswap is used in many hot path, so make it a static key
> to lower the kernel overhead.
> 
> Using 8G of ZRAM as SWAP, benchmark using `perf stat -d -d -d --repeat 100`
> with the following code snip in a non-root cgroup:
> 
>    #include <stdio.h>
>    #include <string.h>
>    #include <linux/mman.h>
>    #include <sys/mman.h>
>    #define MB 1024UL * 1024UL
>    int main(int argc, char **argv){
>       void *p = mmap(NULL, 8000 * MB, PROT_READ | PROT_WRITE,
>                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>       memset(p, 0xff, 8000 * MB);
>       madvise(p, 8000 * MB, MADV_PAGEOUT);
>       memset(p, 0xff, 8000 * MB);
>       return 0;
>    }
> 
> Before:
>           7,021.43 msec task-clock                #    0.967 CPUs utilized            ( +-  0.03% )
>              4,010      context-switches          #  573.853 /sec                     ( +-  0.01% )
>                  0      cpu-migrations            #    0.000 /sec
>          2,052,057      page-faults               #  293.661 K/sec                    ( +-  0.00% )
>     12,616,546,027      cycles                    #    1.805 GHz                      ( +-  0.06% )  (39.92%)
>        156,823,666      stalled-cycles-frontend   #    1.25% frontend cycles idle     ( +-  0.10% )  (40.25%)
>        310,130,812      stalled-cycles-backend    #    2.47% backend cycles idle      ( +-  4.39% )  (40.73%)
>     18,692,516,591      instructions              #    1.49  insn per cycle
>                                                   #    0.01  stalled cycles per insn  ( +-  0.04% )  (40.75%)
>      4,907,447,976      branches                  #  702.283 M/sec                    ( +-  0.05% )  (40.30%)
>         13,002,578      branch-misses             #    0.26% of all branches          ( +-  0.08% )  (40.48%)
>      7,069,786,296      L1-dcache-loads           #    1.012 G/sec                    ( +-  0.03% )  (40.32%)
>        649,385,847      L1-dcache-load-misses     #    9.13% of all L1-dcache accesses  ( +-  0.07% )  (40.10%)
>      1,485,448,688      L1-icache-loads           #  212.576 M/sec                    ( +-  0.15% )  (39.49%)
>         31,628,457      L1-icache-load-misses     #    2.13% of all L1-icache accesses  ( +-  0.40% )  (39.57%)
>          6,667,311      dTLB-loads                #  954.129 K/sec                    ( +-  0.21% )  (39.50%)
>          5,668,555      dTLB-load-misses          #   86.40% of all dTLB cache accesses  ( +-  0.12% )  (39.03%)
>                765      iTLB-loads                #  109.476 /sec                     ( +- 21.81% )  (39.44%)
>          4,370,351      iTLB-load-misses          # 214320.09% of all iTLB cache accesses  ( +-  1.44% )  (39.86%)
>        149,207,254      L1-dcache-prefetches      #   21.352 M/sec                    ( +-  0.13% )  (40.27%)
> 
>            7.25869 +- 0.00203 seconds time elapsed  ( +-  0.03% )
> 
> After:
>           6,576.16 msec task-clock                #    0.953 CPUs utilized            ( +-  0.10% )
>              4,020      context-switches          #  605.595 /sec                     ( +-  0.01% )
>                  0      cpu-migrations            #    0.000 /sec
>          2,052,056      page-faults               #  309.133 K/sec                    ( +-  0.00% )
>     11,967,619,180      cycles                    #    1.803 GHz                      ( +-  0.36% )  (38.76%)
>        161,259,240      stalled-cycles-frontend   #    1.38% frontend cycles idle     ( +-  0.27% )  (36.58%)
>        253,605,302      stalled-cycles-backend    #    2.16% backend cycles idle      ( +-  4.45% )  (34.78%)
>     19,328,171,892      instructions              #    1.65  insn per cycle
>                                                   #    0.01  stalled cycles per insn  ( +-  0.10% )  (31.46%)
>      5,213,967,902      branches                  #  785.461 M/sec                    ( +-  0.18% )  (30.68%)
>         12,385,170      branch-misses             #    0.24% of all branches          ( +-  0.26% )  (34.13%)
>      7,271,687,822      L1-dcache-loads           #    1.095 G/sec                    ( +-  0.12% )  (35.29%)
>        649,873,045      L1-dcache-load-misses     #    8.93% of all L1-dcache accesses  ( +-  0.11% )  (41.41%)
>      1,950,037,608      L1-icache-loads           #  293.764 M/sec                    ( +-  0.33% )  (43.11%)
>         31,365,566      L1-icache-load-misses     #    1.62% of all L1-icache accesses  ( +-  0.39% )  (45.89%)
>          6,767,809      dTLB-loads                #    1.020 M/sec                    ( +-  0.47% )  (48.42%)
>          6,339,590      dTLB-load-misses          #   95.43% of all dTLB cache accesses  ( +-  0.50% )  (46.60%)
>                736      iTLB-loads                #  110.875 /sec                     ( +-  1.79% )  (48.60%)
>          4,314,836      iTLB-load-misses          # 518653.73% of all iTLB cache accesses  ( +-  0.63% )  (42.91%)
>        144,950,156      L1-dcache-prefetches      #   21.836 M/sec                    ( +-  0.37% )  (41.39%)
> 
>            6.89935 +- 0.00703 seconds time elapsed  ( +-  0.10% )
> 
> The performance is clearly better. There is no significant hotspot
> improvement according to perf report, as there are quite a few
> callers of memcg_swap_enabled and do_memsw_account (which calls
> memcg_swap_enabled). Many pieces of minor optimizations resulted
> in lower overhead for the branch predictor, and bettter performance.
> 
> Acked-by: Michal Hocko <mhocko@suse.com>
> Signed-off-by: Kairui Song <kasong@tencent.com>

Acked-by: Roman Gushchin <roman.gushchin@linux.dev>

Thanks!



* Re: [PATCH v2 2/2] mm: memcontrol: make cgroup_memory_noswap a static key
  2022-09-19 18:06 ` [PATCH v2 2/2] mm: memcontrol: make cgroup_memory_noswap a static key Kairui Song
  2022-09-19 21:37   ` Shakeel Butt
  2022-09-20  2:29   ` Roman Gushchin
@ 2022-09-20  2:43   ` Muchun Song
  2 siblings, 0 replies; 9+ messages in thread
From: Muchun Song @ 2022-09-20  2:43 UTC (permalink / raw)
  To: Kairui Song
  Cc: cgroups, Linux MM, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, Andrew Morton, linux-kernel,
	Michal Hocko



> On Sep 20, 2022, at 02:06, Kairui Song <ryncsn@gmail.com> wrote:
> 
> From: Kairui Song <kasong@tencent.com>
> 
> cgroup_memory_noswap is used in many hot paths, so make it a static key
> to lower the kernel overhead.
> 
> Using 8G of ZRAM as swap, benchmark using `perf stat -d -d -d --repeat 100`
> with the following code snippet in a non-root cgroup:
> 
>   #include <stdio.h>
>   #include <string.h>
>   #include <linux/mman.h>
>   #include <sys/mman.h>
>   #define MB (1024UL * 1024UL)
>   int main(int argc, char **argv){
>      void *p = mmap(NULL, 8000 * MB, PROT_READ | PROT_WRITE,
>                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>      memset(p, 0xff, 8000 * MB);
>      madvise(p, 8000 * MB, MADV_PAGEOUT);
>      memset(p, 0xff, 8000 * MB);
>      return 0;
>   }
> 
> Before:
>          7,021.43 msec task-clock                #    0.967 CPUs utilized            ( +-  0.03% )
>             4,010      context-switches          #  573.853 /sec                     ( +-  0.01% )
>                 0      cpu-migrations            #    0.000 /sec
>         2,052,057      page-faults               #  293.661 K/sec                    ( +-  0.00% )
>    12,616,546,027      cycles                    #    1.805 GHz                      ( +-  0.06% )  (39.92%)
>       156,823,666      stalled-cycles-frontend   #    1.25% frontend cycles idle     ( +-  0.10% )  (40.25%)
>       310,130,812      stalled-cycles-backend    #    2.47% backend cycles idle      ( +-  4.39% )  (40.73%)
>    18,692,516,591      instructions              #    1.49  insn per cycle
>                                                  #    0.01  stalled cycles per insn  ( +-  0.04% )  (40.75%)
>     4,907,447,976      branches                  #  702.283 M/sec                    ( +-  0.05% )  (40.30%)
>        13,002,578      branch-misses             #    0.26% of all branches          ( +-  0.08% )  (40.48%)
>     7,069,786,296      L1-dcache-loads           #    1.012 G/sec                    ( +-  0.03% )  (40.32%)
>       649,385,847      L1-dcache-load-misses     #    9.13% of all L1-dcache accesses  ( +-  0.07% )  (40.10%)
>     1,485,448,688      L1-icache-loads           #  212.576 M/sec                    ( +-  0.15% )  (39.49%)
>        31,628,457      L1-icache-load-misses     #    2.13% of all L1-icache accesses  ( +-  0.40% )  (39.57%)
>         6,667,311      dTLB-loads                #  954.129 K/sec                    ( +-  0.21% )  (39.50%)
>         5,668,555      dTLB-load-misses          #   86.40% of all dTLB cache accesses  ( +-  0.12% )  (39.03%)
>               765      iTLB-loads                #  109.476 /sec                     ( +- 21.81% )  (39.44%)
>         4,370,351      iTLB-load-misses          # 214320.09% of all iTLB cache accesses  ( +-  1.44% )  (39.86%)
>       149,207,254      L1-dcache-prefetches      #   21.352 M/sec                    ( +-  0.13% )  (40.27%)
> 
>           7.25869 +- 0.00203 seconds time elapsed  ( +-  0.03% )
> 
> After:
>          6,576.16 msec task-clock                #    0.953 CPUs utilized            ( +-  0.10% )
>             4,020      context-switches          #  605.595 /sec                     ( +-  0.01% )
>                 0      cpu-migrations            #    0.000 /sec
>         2,052,056      page-faults               #  309.133 K/sec                    ( +-  0.00% )
>    11,967,619,180      cycles                    #    1.803 GHz                      ( +-  0.36% )  (38.76%)
>       161,259,240      stalled-cycles-frontend   #    1.38% frontend cycles idle     ( +-  0.27% )  (36.58%)
>       253,605,302      stalled-cycles-backend    #    2.16% backend cycles idle      ( +-  4.45% )  (34.78%)
>    19,328,171,892      instructions              #    1.65  insn per cycle
>                                                  #    0.01  stalled cycles per insn  ( +-  0.10% )  (31.46%)
>     5,213,967,902      branches                  #  785.461 M/sec                    ( +-  0.18% )  (30.68%)
>        12,385,170      branch-misses             #    0.24% of all branches          ( +-  0.26% )  (34.13%)
>     7,271,687,822      L1-dcache-loads           #    1.095 G/sec                    ( +-  0.12% )  (35.29%)
>       649,873,045      L1-dcache-load-misses     #    8.93% of all L1-dcache accesses  ( +-  0.11% )  (41.41%)
>     1,950,037,608      L1-icache-loads           #  293.764 M/sec                    ( +-  0.33% )  (43.11%)
>        31,365,566      L1-icache-load-misses     #    1.62% of all L1-icache accesses  ( +-  0.39% )  (45.89%)
>         6,767,809      dTLB-loads                #    1.020 M/sec                    ( +-  0.47% )  (48.42%)
>         6,339,590      dTLB-load-misses          #   95.43% of all dTLB cache accesses  ( +-  0.50% )  (46.60%)
>               736      iTLB-loads                #  110.875 /sec                     ( +-  1.79% )  (48.60%)
>         4,314,836      iTLB-load-misses          # 518653.73% of all iTLB cache accesses  ( +-  0.63% )  (42.91%)
>       144,950,156      L1-dcache-prefetches      #   21.836 M/sec                    ( +-  0.37% )  (41.39%)
> 
>           6.89935 +- 0.00703 seconds time elapsed  ( +-  0.10% )
> 
> The performance is clearly better. perf report shows no single
> significant hotspot improvement, because memcg_swap_enabled and
> do_memsw_account (which calls memcg_swap_enabled) have quite a few
> callers. Many small optimizations added up to lower overhead for the
> branch predictor and better performance.
> 
> Acked-by: Michal Hocko <mhocko@suse.com>
> Signed-off-by: Kairui Song <kasong@tencent.com>

Acked-by: Muchun Song <songmuchun@bytedance.com>

Thanks.




Thread overview: 9+ messages
2022-09-19 18:06 [PATCH v2 0/2] mm: memcontrol: cleanup and optimize for two accounting params Kairui Song
2022-09-19 18:06 ` [PATCH v2 1/2] mm: memcontrol: use memcg_kmem_enabled in count_objcg_event Kairui Song
2022-09-19 21:36   ` Shakeel Butt
2022-09-20  2:27   ` Roman Gushchin
2022-09-20  2:41   ` Muchun Song
2022-09-19 18:06 ` [PATCH v2 2/2] mm: memcontrol: make cgroup_memory_noswap a static key Kairui Song
2022-09-19 21:37   ` Shakeel Butt
2022-09-20  2:29   ` Roman Gushchin
2022-09-20  2:43   ` Muchun Song
