Hi, Huang, Ying,

You are right. Using high_min is better when the workload is confined to a
single CCX.

By the way, I'm wondering why the free_high heuristic is applied only to
high-order pages. Would there also be cache misses if cache-hot order-0
pages are not reused?

Best,
Jiacheng

> Huang, Ying writes:
>
> Hi, Jiacheng,
>
> 史嘉成
> writes:
>
>> Hi,
>>
>> I ran the bw_unix benchmark in lmbench on my test machine (EPYC-7T83,
>> 32 vCPUs, 64 GB of memory):
>>   bin/x86_64-linux-gnu/bw_unix -P 16
>> The bandwidth result was 30511.63 MB/s when
>> percpu_pagelist_high_fraction was set to 8; however, the result drops
>> to 21595.98 MB/s when percpu_pagelist_high_fraction is set to 0
>> (enabling PCP high auto-tuning).
>>
>> I first inspected the auto-tuning code, but the root cause of the
>> performance degradation lies in the triggering threshold of the
>> free_high heuristic:
>>   pcp->free_count >= (batch + pcp->high_min / 2)
>
> The free_high heuristic is used to increase last-level (shared) cache
> hotness by letting one core allocate cache-hot pages just freed by
> another core. The target use case is network workloads.
>
> It appears that the free_high heuristic hurts your performance. One
> possible reason may be that the last-level cache isn't always shared on
> AMD CPUs. Can you try to bind the workload to one CCX and verify
> whether this is the root cause?
>
>> I noticed that commit c544a95 increases this threshold, but
>> pcp->high_min is relatively small when auto-tuning is enabled, and the
>> resulting PCP draining leads to the performance degradation.
>>
>> The problem was fixed when I increased the threshold to
>> (batch + pcp->high / 2). Is it intended to use high_min instead of
>> high in the threshold? Would it be more adaptive to introduce some new
>> tunables for the free_high threshold?
>
> In general, a new knob isn't welcomed in the community, because it's
> already hard for users to tune the many knobs that exist.
>
> ---
> Best Regards,
> Huang, Ying