On 17/07/2024 02.35, Yosry Ahmed wrote:
> [..]
>>
>>
>> This is a clean (meaning no cadvisor interference) example of kswapd
>> starting simultaneously on many NUMA nodes, that in 27 out of 98 cases
>> hit the race (which is handled in V6 and V7).
>>
>> The BPF "cnt" maps are getting cleared every second, so this
>> approximates per sec numbers. This patch reduces pressure on the lock,
>> but we are still seeing (kfunc:vmlinux:cgroup_rstat_flush_locked) full
>> flushes approx 37 per sec (every 27 ms). On the positive side, the
>> ongoing_flusher mitigation stopped 98 per sec of these.
>>
>> In this clean kswapd case the patch removes the lock contention issue
>> for kswapd. The 27 lock_contended cases seem to all be related to the
>> 27 handled_race cases.
>>
>> The remaining high flush rate should also be addressed, and we should
>> also work on approaches to limit this, like my earlier proposal[1].
>
> I honestly don't think a high number of flushes is a problem on its
> own as long as we are not spending too much time flushing, especially
> when we have magnitude-based thresholding so we know there is
> something to flush (although it may not be relevant to what we are
> doing).
>

We are "spending too much time flushing", see below.

> If we keep observing a lot of lock contention, one thing that I
> thought about is to have a variant of spin_lock with a timeout. This
> limits the flushing latency, instead of limiting the number of flushes
> (which I believe is the wrong metric to optimize).
>
> It also seems to me that we are doing a flush each 27ms, and your
> proposed threshold was once per 50ms. It doesn't seem like a
> fundamental difference.
>

Looking at the production numbers for the time the lock is held for
level 0:

@locked_time_level[0]:
[4M, 8M)     623 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@               |
[8M, 16M)    860 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16M, 32M)   295 |@@@@@@@@@@@@@@@@@                                   |
[32M, 64M)   275 |@@@@@@@@@@@@@@@@                                    |

The time is in nanoseconds, so M corresponds to ms (milliseconds).

With 36 flushes per second (as shown earlier) this is a flush every
27.7 ms. It is not unreasonable (from the above data) to say that a
flush also takes around 27 ms, which means we spend close to a full
CPU second flushing every second (36 flushes/sec x ~27 ms/flush).
That is spending too much time flushing.

This roughly 1 second of CPU usage for kswapd is also quite clear in
the attached grafana graph, from when the server was rebooted into
this V7 kernel.

I chose 50 ms because at the time I saw flushes taking around 30 ms,
and I view the flush time as the queue service-time. When the
arrival-rate is faster than the service-time, a queue will form. So
choosing 50 ms as the arrival interval gave me some headroom. As I
mentioned earlier, ideally this threshold should be measured
dynamically.

--Jesper
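
For reference, a histogram like @locked_time_level above can be
collected with a small bpftrace script along these lines (a minimal
sketch, assuming fentry/fexit probes on cgroup_rstat_flush_locked and
that its 'cgrp' argument is the flush target; note it also counts any
time where the flush code temporarily yields the lock):

kfunc:vmlinux:cgroup_rstat_flush_locked
{
	@start[tid] = nsecs;
}

kretfunc:vmlinux:cgroup_rstat_flush_locked
/@start[tid]/
{
	// Key by cgroup level, so level 0 (root) flushes can be
	// separated out from sub-tree flushes.
	@locked_time_level[args->cgrp->level] = hist(nsecs - @start[tid]);
	delete(@start[tid]);
}

Printing and clearing the maps every second then gives the per-second
distribution of how long each flush holds the lock.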