* [PATCH v2 0/2] mm: memcontrol: cleanup and optimize for two accounting params
From: Kairui Song @ 2022-09-19 18:06 UTC
  To: cgroups, linux-mm
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

V2 of "[PATCH 0/2] mm: memcontrol: cleanup and optimize for accounting params".

Patch 1/2 is a minor cleanup for the kmem accounting path.

Patch 2/2 optimizes some hot paths by making cgroup_memory_noswap a static
key. Benchmarks show that the swap paths now have ~5% lower overhead.

Update from V1:
- Collect Acked-by for patch 2/2 from Michal Hocko, thanks!
- Simplify patch 1/2: don't drop mem_cgroup_kmem_disabled and only change
  one call site as a minor cleanup. Update the commit message for patch 2/2,
  as suggested by Michal Hocko.

Kairui Song (2):
  mm: memcontrol: use memcg_kmem_enabled in count_objcg_event
  mm: memcontrol: make cgroup_memory_noswap a static key

 include/linux/memcontrol.h |  2 +-
 mm/memcontrol.c            | 27 +++++++++++++++++++--------
 2 files changed, 20 insertions(+), 9 deletions(-)

-- 
2.35.2
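As a rough cross-check of the ~5% figure, the sketch below computes the
relative reduction for three counters copied from the perf stat output in
patch 2/2 of this series (task-clock, cycles, elapsed time). The helper names
percent_reduction, report and report_all are hypothetical, and the thread does
not say which counter the estimate is based on, so treat this as an
illustration only.

```c
#include <stdio.h>

/* Hypothetical helper: relative reduction from `before` to `after`, in percent. */
static double percent_reduction(double before, double after)
{
	return (before - after) / before * 100.0;
}

/* Print one counter's before/after values and its relative reduction. */
static void report(const char *name, double before, double after)
{
	printf("%-14s %16.2f -> %16.2f  (%.2f%% lower)\n",
	       name, before, after, percent_reduction(before, after));
}

/* Values copied from the "Before" and "After" perf stat runs in patch 2/2. */
static void report_all(void)
{
	report("task-clock ms", 7021.43, 6576.16);
	report("cycles", 12616546027.0, 11967619180.0);
	report("elapsed s", 7.25869, 6.89935);
}
```

Calling report_all() from a main() function gives reductions of roughly 6.3%,
5.1% and 5.0% respectively, which is consistent with the estimate above.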
* [PATCH v2 1/2] mm: memcontrol: use memcg_kmem_enabled in count_objcg_event
From: Kairui Song @ 2022-09-19 18:06 UTC
  To: cgroups, linux-mm
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

There are currently two helpers for checking if cgroup kmem
accounting is enabled:

- mem_cgroup_kmem_disabled
- memcg_kmem_enabled

mem_cgroup_kmem_disabled is a simple helper that returns true
if cgroup.memory=nokmem is specified, otherwise returns false.

memcg_kmem_enabled is a bit different: it returns true if
cgroup.memory=nokmem is not specified and at least one non-root
memory-control-enabled cgroup has ever been created. This helps improve
performance when kmem accounting was not actually activated, and it is
optimized with a static branch.

mem_cgroup_kmem_disabled is intended for subsystems that need to
preallocate data for kmem accounting, since they could be initialized
before kmem accounting is activated. count_objcg_event doesn't
need that, so using memcg_kmem_enabled is better here.

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 include/linux/memcontrol.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 6257867fbf95..e6d3d5870d6f 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1779,7 +1779,7 @@ static inline void count_objcg_event(struct obj_cgroup *objcg,
 {
 	struct mem_cgroup *memcg;
 
-	if (mem_cgroup_kmem_disabled())
+	if (!memcg_kmem_enabled())
 		return;
 
 	rcu_read_lock();
-- 
2.35.2
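The difference between the two helpers described in the commit message can be
pictured with a small userspace model. This is only a sketch under stated
assumptions: every model_* name is hypothetical, plain booleans stand in for
the boot parameter and the static branch, and nothing here is kernel code.

```c
#include <stdbool.h>

/*
 * Hypothetical userspace model; none of these names are kernel symbols.
 */
static bool model_nokmem_param;	/* models booting with cgroup.memory=nokmem */
static bool model_kmem_key;	/* models the static branch behind memcg_kmem_enabled */

/* Models mem_cgroup_kmem_disabled(): depends on the boot parameter only. */
static bool model_kmem_disabled(void)
{
	return model_nokmem_param;
}

/* Models memcg_kmem_enabled(): true only once accounting has been activated. */
static bool model_kmem_enabled(void)
{
	return model_kmem_key;
}

/* Models creating a non-root memory cgroup; activates accounting unless nokmem is set. */
static void model_create_nonroot_memcg(void)
{
	if (!model_nokmem_param)
		model_kmem_key = true;
}
```

Before model_create_nonroot_memcg() is called, model_kmem_disabled() and
model_kmem_enabled() both return false: accounting is permitted but not yet
active. In the model, that is the window in which a check based on the second
helper returns early while a check based on the first one does not.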
* Re: [PATCH v2 1/2] mm: memcontrol: use memcg_kmem_enabled in count_objcg_event
From: Shakeel Butt @ 2022-09-19 21:36 UTC
  To: Kairui Song
  Cc: cgroups, linux-mm, Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song, Andrew Morton, linux-kernel

On Tue, Sep 20, 2022 at 02:06:33AM +0800, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
>
> There are currently two helpers for checking if cgroup kmem
> accounting is enabled:
>
> - mem_cgroup_kmem_disabled
> - memcg_kmem_enabled
>
> mem_cgroup_kmem_disabled is a simple helper that returns true
> if cgroup.memory=nokmem is specified, otherwise returns false.
>
> memcg_kmem_enabled is a bit different: it returns true if
> cgroup.memory=nokmem is not specified and at least one non-root
> memory-control-enabled cgroup has ever been created. This helps improve
> performance when kmem accounting was not actually activated, and it is
> optimized with a static branch.
>
> mem_cgroup_kmem_disabled is intended for subsystems that need to
> preallocate data for kmem accounting, since they could be initialized
> before kmem accounting is activated. count_objcg_event doesn't
> need that, so using memcg_kmem_enabled is better here.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>

Acked-by: Shakeel Butt <shakeelb@google.com>
* Re: [PATCH v2 1/2] mm: memcontrol: use memcg_kmem_enabled in count_objcg_event
From: Roman Gushchin @ 2022-09-20 2:27 UTC
  To: Kairui Song
  Cc: cgroups, linux-mm, Johannes Weiner, Michal Hocko, Shakeel Butt, Muchun Song, Andrew Morton, linux-kernel

On Tue, Sep 20, 2022 at 02:06:33AM +0800, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
>
> There are currently two helpers for checking if cgroup kmem
> accounting is enabled:
>
> - mem_cgroup_kmem_disabled
> - memcg_kmem_enabled
>
> mem_cgroup_kmem_disabled is a simple helper that returns true
> if cgroup.memory=nokmem is specified, otherwise returns false.
>
> memcg_kmem_enabled is a bit different: it returns true if
> cgroup.memory=nokmem is not specified and at least one non-root
> memory-control-enabled cgroup has ever been created. This helps improve
> performance when kmem accounting was not actually activated, and it is
> optimized with a static branch.
>
> mem_cgroup_kmem_disabled is intended for subsystems that need to
> preallocate data for kmem accounting, since they could be initialized
> before kmem accounting is activated. count_objcg_event doesn't
> need that, so using memcg_kmem_enabled is better here.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>

Acked-by: Roman Gushchin <roman.gushchin@linux.dev>

Thanks, Kairui!
* Re: [PATCH v2 1/2] mm: memcontrol: use memcg_kmem_enabled in count_objcg_event
From: Muchun Song @ 2022-09-20 2:41 UTC
  To: Kairui Song
  Cc: cgroups, Linux MM, Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton, linux-kernel

> On Sep 20, 2022, at 02:06, Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> There are currently two helpers for checking if cgroup kmem
> accounting is enabled:
>
> - mem_cgroup_kmem_disabled
> - memcg_kmem_enabled
>
> mem_cgroup_kmem_disabled is a simple helper that returns true
> if cgroup.memory=nokmem is specified, otherwise returns false.
>
> memcg_kmem_enabled is a bit different: it returns true if
> cgroup.memory=nokmem is not specified and at least one non-root
> memory-control-enabled cgroup has ever been created. This helps improve
> performance when kmem accounting was not actually activated, and it is
> optimized with a static branch.
>
> mem_cgroup_kmem_disabled is intended for subsystems that need to
> preallocate data for kmem accounting, since they could be initialized
> before kmem accounting is activated. count_objcg_event doesn't
> need that, so using memcg_kmem_enabled is better here.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>

Acked-by: Muchun Song <songmuchun@bytedance.com>

Thanks.
* [PATCH v2 2/2] mm: memcontrol: make cgroup_memory_noswap a static key
From: Kairui Song @ 2022-09-19 18:06 UTC
  To: cgroups, linux-mm
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton, linux-kernel, Kairui Song, Michal Hocko

From: Kairui Song <kasong@tencent.com>

cgroup_memory_noswap is used in many hot paths, so make it a static key
to lower the kernel overhead.

Using 8G of ZRAM as SWAP, benchmark using `perf stat -d -d -d --repeat 100`
with the following code snippet in a non-root cgroup:

   #include <stdio.h>
   #include <string.h>
   #include <linux/mman.h>
   #include <sys/mman.h>
   #define MB (1024UL * 1024UL)
   int main(int argc, char **argv){
      void *p = mmap(NULL, 8000 * MB, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      memset(p, 0xff, 8000 * MB);
      madvise(p, 8000 * MB, MADV_PAGEOUT);
      memset(p, 0xff, 8000 * MB);
      return 0;
   }

Before:
          7,021.43 msec task-clock                #    0.967 CPUs utilized                ( +-  0.03% )
             4,010      context-switches          #  573.853 /sec                         ( +-  0.01% )
                 0      cpu-migrations            #    0.000 /sec
         2,052,057      page-faults               #  293.661 K/sec                        ( +-  0.00% )
    12,616,546,027      cycles                    #    1.805 GHz                          ( +-  0.06% )  (39.92%)
       156,823,666      stalled-cycles-frontend   #    1.25% frontend cycles idle         ( +-  0.10% )  (40.25%)
       310,130,812      stalled-cycles-backend    #    2.47% backend cycles idle          ( +-  4.39% )  (40.73%)
    18,692,516,591      instructions              #    1.49  insn per cycle
                                                  #    0.01  stalled cycles per insn      ( +-  0.04% )  (40.75%)
     4,907,447,976      branches                  #  702.283 M/sec                        ( +-  0.05% )  (40.30%)
        13,002,578      branch-misses             #    0.26% of all branches              ( +-  0.08% )  (40.48%)
     7,069,786,296      L1-dcache-loads           #    1.012 G/sec                        ( +-  0.03% )  (40.32%)
       649,385,847      L1-dcache-load-misses     #    9.13% of all L1-dcache accesses    ( +-  0.07% )  (40.10%)
     1,485,448,688      L1-icache-loads           #  212.576 M/sec                        ( +-  0.15% )  (39.49%)
        31,628,457      L1-icache-load-misses     #    2.13% of all L1-icache accesses    ( +-  0.40% )  (39.57%)
         6,667,311      dTLB-loads                #  954.129 K/sec                        ( +-  0.21% )  (39.50%)
         5,668,555      dTLB-load-misses          #   86.40% of all dTLB cache accesses   ( +-  0.12% )  (39.03%)
               765      iTLB-loads                #  109.476 /sec                         ( +- 21.81% )  (39.44%)
         4,370,351      iTLB-load-misses          # 214320.09% of all iTLB cache accesses ( +-  1.44% )  (39.86%)
       149,207,254      L1-dcache-prefetches      #   21.352 M/sec                        ( +-  0.13% )  (40.27%)

           7.25869 +- 0.00203 seconds time elapsed  ( +-  0.03% )

After:
          6,576.16 msec task-clock                #    0.953 CPUs utilized                ( +-  0.10% )
             4,020      context-switches          #  605.595 /sec                         ( +-  0.01% )
                 0      cpu-migrations            #    0.000 /sec
         2,052,056      page-faults               #  309.133 K/sec                        ( +-  0.00% )
    11,967,619,180      cycles                    #    1.803 GHz                          ( +-  0.36% )  (38.76%)
       161,259,240      stalled-cycles-frontend   #    1.38% frontend cycles idle         ( +-  0.27% )  (36.58%)
       253,605,302      stalled-cycles-backend    #    2.16% backend cycles idle          ( +-  4.45% )  (34.78%)
    19,328,171,892      instructions              #    1.65  insn per cycle
                                                  #    0.01  stalled cycles per insn      ( +-  0.10% )  (31.46%)
     5,213,967,902      branches                  #  785.461 M/sec                        ( +-  0.18% )  (30.68%)
        12,385,170      branch-misses             #    0.24% of all branches              ( +-  0.26% )  (34.13%)
     7,271,687,822      L1-dcache-loads           #    1.095 G/sec                        ( +-  0.12% )  (35.29%)
       649,873,045      L1-dcache-load-misses     #    8.93% of all L1-dcache accesses    ( +-  0.11% )  (41.41%)
     1,950,037,608      L1-icache-loads           #  293.764 M/sec                        ( +-  0.33% )  (43.11%)
        31,365,566      L1-icache-load-misses     #    1.62% of all L1-icache accesses    ( +-  0.39% )  (45.89%)
         6,767,809      dTLB-loads                #    1.020 M/sec                        ( +-  0.47% )  (48.42%)
         6,339,590      dTLB-load-misses          #   95.43% of all dTLB cache accesses   ( +-  0.50% )  (46.60%)
               736      iTLB-loads                #  110.875 /sec                         ( +-  1.79% )  (48.60%)
         4,314,836      iTLB-load-misses          # 518653.73% of all iTLB cache accesses ( +-  0.63% )  (42.91%)
       144,950,156      L1-dcache-prefetches      #   21.836 M/sec                        ( +-  0.37% )  (41.39%)

           6.89935 +- 0.00703 seconds time elapsed  ( +-  0.10% )

The performance is clearly better. There is no significant hotspot
improvement according to perf report, as there are quite a few
callers of memcg_swap_enabled and do_memsw_account (which calls
memcg_swap_enabled). Many minor optimizations together resulted
in lower overhead for the branch predictor and better performance.

Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/memcontrol.c | 27 +++++++++++++++++++--------
 1 file changed, 19 insertions(+), 8 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b69979c9ced5..5bb89c745233 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -90,9 +90,18 @@ static bool cgroup_memory_nokmem __ro_after_init;
 
 /* Whether the swap controller is active */
 #ifdef CONFIG_MEMCG_SWAP
-static bool cgroup_memory_noswap __ro_after_init;
+static bool cgroup_memory_noswap __initdata;
+
+static DEFINE_STATIC_KEY_FALSE(memcg_swap_enabled_key);
+static inline bool memcg_swap_enabled(void)
+{
+	return static_branch_likely(&memcg_swap_enabled_key);
+}
 #else
-#define cgroup_memory_noswap		1
+static inline bool memcg_swap_enabled(void)
+{
+	return false;
+}
 #endif
 
 #ifdef CONFIG_CGROUP_WRITEBACK
@@ -102,7 +111,7 @@ static DECLARE_WAIT_QUEUE_HEAD(memcg_cgwb_frn_waitq);
 /* Whether legacy memory+swap accounting is active */
 static bool do_memsw_account(void)
 {
-	return !cgroup_subsys_on_dfl(memory_cgrp_subsys) && !cgroup_memory_noswap;
+	return !cgroup_subsys_on_dfl(memory_cgrp_subsys) && memcg_swap_enabled();
 }
 
 #define THRESHOLDS_EVENTS_TARGET 128
@@ -7267,7 +7276,7 @@ void mem_cgroup_swapout(struct folio *folio, swp_entry_t entry)
 	if (!mem_cgroup_is_root(memcg))
 		page_counter_uncharge(&memcg->memory, nr_entries);
 
-	if (!cgroup_memory_noswap && memcg != swap_memcg) {
+	if (memcg_swap_enabled() && memcg != swap_memcg) {
 		if (!mem_cgroup_is_root(swap_memcg))
 			page_counter_charge(&swap_memcg->memsw, nr_entries);
 		page_counter_uncharge(&memcg->memsw, nr_entries);
@@ -7319,7 +7328,7 @@ int __mem_cgroup_try_charge_swap(struct folio *folio, swp_entry_t entry)
 
 	memcg = mem_cgroup_id_get_online(memcg);
 
-	if (!cgroup_memory_noswap && !mem_cgroup_is_root(memcg) &&
+	if (memcg_swap_enabled() && !mem_cgroup_is_root(memcg) &&
 	    !page_counter_try_charge(&memcg->swap, nr_pages, &counter)) {
 		memcg_memory_event(memcg, MEMCG_SWAP_MAX);
 		memcg_memory_event(memcg, MEMCG_SWAP_FAIL);
@@ -7351,7 +7360,7 @@ void __mem_cgroup_uncharge_swap(swp_entry_t entry, unsigned int nr_pages)
 	rcu_read_lock();
 	memcg = mem_cgroup_from_id(id);
 	if (memcg) {
-		if (!cgroup_memory_noswap && !mem_cgroup_is_root(memcg)) {
+		if (memcg_swap_enabled() && !mem_cgroup_is_root(memcg)) {
 			if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
 				page_counter_uncharge(&memcg->swap, nr_pages);
 			else
@@ -7367,7 +7376,7 @@ long mem_cgroup_get_nr_swap_pages(struct mem_cgroup *memcg)
 {
 	long nr_swap_pages = get_nr_swap_pages();
 
-	if (cgroup_memory_noswap || !cgroup_subsys_on_dfl(memory_cgrp_subsys))
+	if (!memcg_swap_enabled() || !cgroup_subsys_on_dfl(memory_cgrp_subsys))
 		return nr_swap_pages;
 	for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg))
 		nr_swap_pages = min_t(long, nr_swap_pages,
@@ -7384,7 +7393,7 @@ bool mem_cgroup_swap_full(struct page *page)
 
 	if (vm_swap_full())
 		return true;
-	if (cgroup_memory_noswap || !cgroup_subsys_on_dfl(memory_cgrp_subsys))
+	if (!memcg_swap_enabled() || !cgroup_subsys_on_dfl(memory_cgrp_subsys))
 		return false;
 
 	memcg = page_memcg(page);
@@ -7692,6 +7701,8 @@ static int __init mem_cgroup_swap_init(void)
 	if (cgroup_memory_noswap)
 		return 0;
 
+	static_branch_enable(&memcg_swap_enabled_key);
+
 	WARN_ON(cgroup_add_dfl_cftypes(&memory_cgrp_subsys, swap_files));
 	WARN_ON(cgroup_add_legacy_cftypes(&memory_cgrp_subsys, memsw_files));
 #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP)
-- 
2.35.2
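The control flow introduced by this diff (a key that defaults to false, is
flipped once during init unless swap accounting is disabled, and is then only
read on hot paths) can be sketched in userspace as follows. All model_* names
are hypothetical, and a plain bool with __builtin_expect (a GCC/Clang builtin)
stands in for the static key. A real static key patches the branch in the
instruction stream instead of loading a variable, which this model does not
reproduce.

```c
#include <stdbool.h>

/* Hypothetical userspace stand-ins; these are not kernel APIs. */
static bool model_noswap_flag;		/* models the boot-time cgroup_memory_noswap flag */
static bool model_swap_enabled_key;	/* starts false, like DEFINE_STATIC_KEY_FALSE */

/* Models the hunk in mem_cgroup_swap_init(): enable the key unless swap is disabled. */
static void model_swap_init(void)
{
	if (model_noswap_flag)
		return;
	model_swap_enabled_key = true;
}

/* Models memcg_swap_enabled(): a read-only check hinted as likely true. */
static bool model_swap_enabled(void)
{
	return __builtin_expect(model_swap_enabled_key, 1);
}

/* Models do_memsw_account(): legacy hierarchy and swap accounting enabled. */
static bool model_do_memsw_account(bool on_default_hierarchy)
{
	return !on_default_hierarchy && model_swap_enabled();
}
```

In the model, model_swap_enabled() returns false until model_swap_init() has
run, and model_do_memsw_account() is true only for the legacy hierarchy once
the key is enabled.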
* Re: [PATCH v2 2/2] mm: memcontrol: make cgroup_memory_noswap a static key
From: Shakeel Butt @ 2022-09-19 21:37 UTC
  To: Kairui Song
  Cc: cgroups, linux-mm, Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song, Andrew Morton, linux-kernel, Michal Hocko

On Tue, Sep 20, 2022 at 02:06:34AM +0800, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
>
> cgroup_memory_noswap is used in many hot paths, so make it a static key
> to lower the kernel overhead.
>
> [...]
>
> The performance is clearly better. There is no significant hotspot
> improvement according to perf report, as there are quite a few
> callers of memcg_swap_enabled and do_memsw_account (which calls
> memcg_swap_enabled). Many minor optimizations together resulted
> in lower overhead for the branch predictor and better performance.
>
> Acked-by: Michal Hocko <mhocko@suse.com>
> Signed-off-by: Kairui Song <kasong@tencent.com>

Acked-by: Shakeel Butt <shakeelb@google.com>
* Re: [PATCH v2 2/2] mm: memcontrol: make cgroup_memory_noswap a static key
From: Roman Gushchin @ 2022-09-20 2:29 UTC
  To: Kairui Song
  Cc: cgroups, linux-mm, Johannes Weiner, Michal Hocko, Shakeel Butt, Muchun Song, Andrew Morton, linux-kernel, Michal Hocko

On Tue, Sep 20, 2022 at 02:06:34AM +0800, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
>
> cgroup_memory_noswap is used in many hot paths, so make it a static key
> to lower the kernel overhead.
>
> [...]
>
> The performance is clearly better. There is no significant hotspot
> improvement according to perf report, as there are quite a few
> callers of memcg_swap_enabled and do_memsw_account (which calls
> memcg_swap_enabled). Many minor optimizations together resulted
> in lower overhead for the branch predictor and better performance.
>
> Acked-by: Michal Hocko <mhocko@suse.com>
> Signed-off-by: Kairui Song <kasong@tencent.com>

Acked-by: Roman Gushchin <roman.gushchin@linux.dev>

Thanks!
* Re: [PATCH v2 2/2] mm: memcontrol: make cgroup_memory_noswap a static key
From: Muchun Song @ 2022-09-20 2:43 UTC
  To: Kairui Song
  Cc: cgroups, Linux MM, Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton, linux-kernel, Michal Hocko

> On Sep 20, 2022, at 02:06, Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> cgroup_memory_noswap is used in many hot paths, so make it a static key
> to lower the kernel overhead.
>
> [...]
>
> The performance is clearly better. There is no significant hotspot
> improvement according to perf report, as there are quite a few
> callers of memcg_swap_enabled and do_memsw_account (which calls
> memcg_swap_enabled). Many minor optimizations together resulted
> in lower overhead for the branch predictor and better performance.
>
> Acked-by: Michal Hocko <mhocko@suse.com>
> Signed-off-by: Kairui Song <kasong@tencent.com>

Acked-by: Muchun Song <songmuchun@bytedance.com>

Thanks.