Hi, Huang, Ying,

You are right. Using high_min is better when the workload is bound to a single CCX.

By the way, I'm wondering why the free_high heuristic is only applied to
high-order pages. Would there also be cache misses if cache-hot order-0 pages
are not reused?
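
For reference, the gate I'm referring to can be modeled like this (a minimal
standalone sketch; order_can_trigger_free_high() is a name I made up for
illustration, and the real check lives in free_unref_page_commit() in
mm/page_alloc.c, where details vary across kernel versions):

   #include <stdbool.h>

   /* PAGE_ALLOC_COSTLY_ORDER is 3 in include/linux/mmzone.h. */
   #define PAGE_ALLOC_COSTLY_ORDER 3

   /*
    * Only high-order frees (order 1..PAGE_ALLOC_COSTLY_ORDER) are
    * considered by the free_high detection; order-0 frees never
    * participate, which is what prompts my question above.
    */
   static bool order_can_trigger_free_high(unsigned int order)
   {
           return order && order <= PAGE_ALLOC_COSTLY_ORDER;
   }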

Best,
Jiacheng


Huang, Ying <ying.huang@linux.alibaba.com> writes:

Hi, Jiacheng,

史嘉成 <billsjc@sjtu.edu.cn> writes:

Hi,

I ran the bw_unix benchmark from lmbench on my test machine (EPYC-7T83, 32 vCPUs,
64 GB of memory):
   bin/x86_64-linux-gnu/bw_unix -P 16
The bandwidth was 30511.63 MB/s with percpu_pagelist_high_fraction (the
/proc/sys/vm/percpu_pagelist_high_fraction sysctl) set to 8; it dropped to
21595.98 MB/s with percpu_pagelist_high_fraction set to 0 (i.e., with PCP high
auto-tuning enabled).

I first inspected the auto-tuning code, but the root cause of the performance
degradation lies in the triggering threshold of the free_high heuristic:
   pcp->free_count >= (batch + pcp->high_min / 2)
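
As a standalone model of how I read this check (the names pcp_model and
free_high_triggered() are mine for illustration; the real logic sits in
free_unref_page_commit() in mm/page_alloc.c and differs in detail across
kernel versions):

   #include <stdbool.h>

   /* Toy mirror of the fields involved in the check; in the kernel
    * they live in struct per_cpu_pages. */
   struct pcp_model {
           int free_count;   /* recent frees not offset by allocations */
           int batch;        /* pcp->batch */
           int high_min;     /* floor of the auto-tuned pcp->high */
   };

   /* free_high fires once a burst of frees crosses this threshold,
    * after which the PCP list is drained back to the buddy allocator. */
   static bool free_high_triggered(const struct pcp_model *pcp)
   {
           return pcp->free_count >= (pcp->batch + pcp->high_min / 2);
   }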

The free_high heuristic is used to increase last-level (shared) cache
hotness by letting one core allocate cache-hot pages just freed by
another core.  The target use case is network workloads.

It appears that the free_high heuristic hurts your performance.  One
possible reason may be that the last-level cache isn't always shared on
AMD CPUs.  Can you try to bind the workload to one CCX and verify whether
this is the root cause?

I noticed that commit c544a95 increases this threshold, but pcp->high_min is
relatively small when auto-tuning is enabled, so free_high still triggers
easily and the resulting PCP draining leads to the performance degradation.

The problem went away when I increased the threshold to (batch + pcp->high / 2).
Is using high_min instead of high in the threshold intentional? Would it be
more adaptive to introduce a new tunable for the free_high threshold?
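
For concreteness, the change I tested amounts to this one-line adjustment of
the check quoted above (shown as an illustrative hunk, not a formal patch):

   -    pcp->free_count >= (batch + pcp->high_min / 2)
   +    pcp->free_count >= (batch + pcp->high / 2)

The idea is that auto-tuning grows pcp->high under load while high_min stays
at its floor, so scaling the trigger with pcp->high keeps the free_high
detection window proportional to the current PCP size.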

In general, a new knob isn't welcome in the community, because it's already
hard for users to tune the many existing knobs.

---
Best Regards,
Huang, Ying