Hi,

On 2024-03-21 17:36, Suren Baghdasaryan wrote:
> Overview:
> Low overhead [1] per-callsite memory allocation profiling. Not just for
> debug kernels, overhead low enough to be deployed in production.
>
> Example output:
>   root@moria-kvm:~# sort -rn /proc/allocinfo
>    127664128    31168 mm/page_ext.c:270 func:alloc_page_ext
>     56373248     4737 mm/slub.c:2259 func:alloc_slab_page
>     14880768     3633 mm/readahead.c:247 func:page_cache_ra_unbounded
>     14417920     3520 mm/mm_init.c:2530 func:alloc_large_system_hash
>     13377536      234 block/blk-mq.c:3421 func:blk_mq_alloc_rqs
>     11718656     2861 mm/filemap.c:1919 func:__filemap_get_folio
>      9192960     2800 kernel/fork.c:307 func:alloc_thread_stack_node
>      4206592        4 net/netfilter/nf_conntrack_core.c:2567 func:nf_ct_alloc_hashtable
>      4136960     1010 drivers/staging/ctagmod/ctagmod.c:20 [ctagmod] func:ctagmod_start
>      3940352      962 mm/memory.c:4214 func:alloc_anon_folio
>      2894464    22613 fs/kernfs/dir.c:615 func:__kernfs_new_node
>      ...
>
> Since v5 [2]:
> - Added Reviewed-by and Acked-by, per Vlastimil Babka and Miguel Ojeda
> - Changed pgalloc_tag_{add|sub} to use number of pages instead of order, per Matthew Wilcox
> - Changed pgalloc_tag_sub_bytes to pgalloc_tag_sub_pages and adjusted the usage, per Matthew Wilcox
> - Moved static key check before prepare_slab_obj_exts_hook(), per Vlastimil Babka
> - Fixed RUST helper, per Miguel Ojeda
> - Fixed documentation, per Randy Dunlap
> - Rebased over mm-unstable
>
> Usage:
> kconfig options:
>  - CONFIG_MEM_ALLOC_PROFILING
>  - CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT
>  - CONFIG_MEM_ALLOC_PROFILING_DEBUG
>    adds warnings for allocations that weren't accounted because of a
>    missing annotation
>
> sysctl:
>   /proc/sys/vm/mem_profiling
>
> Runtime info:
>   /proc/allocinfo
>
> Notes:
>
> [1]: Overhead
> To measure the overhead we are comparing the following configurations:
> (1) Baseline with CONFIG_MEMCG_KMEM=n
> (2) Disabled by default (CONFIG_MEM_ALLOC_PROFILING=y &&
>     CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=n)
> (3) Enabled by default (CONFIG_MEM_ALLOC_PROFILING=y &&
>     CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=y)
> (4) Enabled at runtime (CONFIG_MEM_ALLOC_PROFILING=y &&
>     CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=n && /proc/sys/vm/mem_profiling=1)
> (5) Baseline with CONFIG_MEMCG_KMEM=y && allocating with __GFP_ACCOUNT
> (6) Disabled by default (CONFIG_MEM_ALLOC_PROFILING=y &&
>     CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=n) && CONFIG_MEMCG_KMEM=y
> (7) Enabled by default (CONFIG_MEM_ALLOC_PROFILING=y &&
>     CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=y) && CONFIG_MEMCG_KMEM=y
>
> Performance overhead:
> To evaluate performance we implemented an in-kernel test executing
> multiple get_free_page/free_page and kmalloc/kfree calls with allocation
> sizes growing from 8 to 240 bytes with CPU frequency set to max and CPU
> affinity set to a specific CPU to minimize the noise. Below are results
> from running the test on Ubuntu 22.04.2 LTS with 6.8.0-rc1 kernel on
> 56 core Intel Xeon:
>
>                         kmalloc                 pgalloc
> (1 baseline)            6.764s                  16.902s
> (2 default disabled)    6.793s (+0.43%)         17.007s (+0.62%)
> (3 default enabled)     7.197s (+6.40%)         23.666s (+40.02%)
> (4 runtime enabled)     7.405s (+9.48%)         23.901s (+41.41%)
> (5 memcg)               13.388s (+97.94%)       48.460s (+186.71%)
> (6 def disabled+memcg)  13.332s (+97.10%)       48.105s (+184.61%)
> (7 def enabled+memcg)   13.446s (+98.78%)       54.963s (+225.18%)
>
> Memory overhead:
> Kernel size:
>
>         text      data      bss       dec       diff
> (1)     26515311  18890222  17018880  62424413
> (2)     26524728  19423818  16740352  62688898  264485
> (3)     26524724  19423818  16740352  62688894  264481
> (4)     26524728  19423818  16740352  62688898  264485
> (5)     26541782  18964374  16957440  62463596  39183
>
> Memory consumption on a 56 core Intel CPU with 125GB of memory:
> Code tags:           192 kB
> PageExts:         262144 kB (256MB)
> SlabExts:           9876 kB (9.6MB)
> PcpuExts:            512 kB (0.5MB)
>
> Total overhead is 0.2% of total memory.
>
> Benchmarks:
>
> Hackbench tests run 100 times:
> hackbench -s 512 -l 200 -g 15 -f 25 -P
>        baseline       disabled profiling      enabled profiling
> avg    0.3543         0.3559 (+0.0016)        0.3566 (+0.0023)
> stdev  0.0137         0.0188                  0.0077
>
>
> hackbench -l 10000
>        baseline       disabled profiling      enabled profiling
> avg    6.4218         6.4306 (+0.0088)        6.5077 (+0.0859)
> stdev  0.0933         0.0286                  0.0489
>
> stress-ng tests:
> stress-ng --class memory --seq 4 -t 60
> stress-ng --class cpu --seq 4 -t 60
> Results posted at: https://evilpiepirate.org/~kent/memalloc_prof_v4_stress-ng/
>
> [2] https://lore.kernel.org/all/20240306182440.2003814-1-surenb@google.com/

If I enable this, I consistently get percpu allocation failures. I can
occasionally reproduce it in qemu. I've attached the logs and my config;
please let me know if there's anything else that could be relevant.

Kind regards,
Klara Modin
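
For comparing /proc/allocinfo snapshots between a working and a failing
boot, here is a minimal Python sketch that parses lines shaped like the
example output quoted above and ranks call sites by bytes. The line
layout (bytes, calls, file:line, optional [module], func:name) is an
assumption taken only from that example; the names parse_allocinfo and
top_by_bytes are hypothetical helpers, not part of the patch set. Lines
that do not match the assumed layout are skipped.

```python
import re

# Assumed line layout, inferred from the quoted example output only:
#   <bytes> <calls> <file>:<line> [<module>] func:<function>
# The "[<module>]" field is optional.
LINE_RE = re.compile(
    r"^\s*(\d+)\s+(\d+)\s+(\S+):(\d+)\s+(?:\[(\S+)\]\s+)?func:(\S+)\s*$"
)


def parse_allocinfo(text):
    """Return one dict per line that matches the assumed layout."""
    entries = []
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m is None:
            continue  # header, blank, or unknown line: skip it
        size, calls, path, lineno, module, func = m.groups()
        entries.append({
            "bytes": int(size),
            "calls": int(calls),
            "file": path,
            "line": int(lineno),
            "module": module,  # None for built-in code
            "func": func,
        })
    return entries


def top_by_bytes(entries, n=5):
    """Return the n entries with the largest byte counts."""
    return sorted(entries, key=lambda e: e["bytes"], reverse=True)[:n]


# Three lines copied from the example output, plus a non-matching line.
SAMPLE = """\
 4136960 1010 drivers/staging/ctagmod/ctagmod.c:20 [ctagmod] func:ctagmod_start
 56373248 4737 mm/slub.c:2259 func:alloc_slab_page
 2894464 22613 fs/kernfs/dir.c:615 func:__kernfs_new_node
 ...
"""

if __name__ == "__main__":
    for e in top_by_bytes(parse_allocinfo(SAMPLE)):
        print(e["bytes"], e["calls"], f'{e["file"]}:{e["line"]}', e["func"])
```

On a kernel with profiling enabled, the same functions could be fed the
contents of /proc/allocinfo instead of SAMPLE, assuming the file follows
the layout above.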