On Mon, 20 Apr 2009, Dave Hansen wrote:

> On Mon, 2009-04-20 at 09:15 -0700, Dave Hansen wrote:
> > On Mon, 2009-04-20 at 10:09 +0100, Eric B Munson wrote:
> > > 1. Change NR_CPUS to min(64, NR_CPUS)
> > > This will limit the amount of possible skew on kernels compiled for very
> > > large SMP machines.  64 is an arbitrary number selected to limit the worst
> > > of the skew without using more cache lines.  min(64, NR_CPUS) is used
> > > instead of nr_online_cpus() because nr_online_cpus() requires a shared
> > > cache line and a call to hweight to make the calculation.  Its runtime
> > > overhead and the cost of keeping this counter accurate showed up in
> > > profiles, and it's possible that nr_online_cpus() would as well.
>
> Wow, that empty reply was really informative, wasn't it? :)
>
> My worry with this min(64, NR_CPUS) approach is that you effectively
> ensure that you're going to be doing a lot more cacheline bouncing, but
> it isn't quite as explicit.

Unfortunately this is a choice we have to make: do we want to avoid
cache line bouncing in fork-heavy workloads that touch more than 64
pages, or to avoid bad information being used for overcommit decisions?

> Now, every time there's a mapping (or set of them) created or destroyed
> that nets greater than 64 pages, you've got to go get a r/w cacheline to
> a possibly highly contended atomic.  With a number this low, you're
> almost guaranteed to hit it at fork() and exec().  Could you
> double-check that this doesn't hurt any of the fork() AIM tests?

It is unlikely that the aim9 benchmarks would show whether this patch is
a problem, because they fork in a tight loop in a process that is not
necessarily big enough to hit ACCT_THRESHOLD, likely on a single CPU.
To expose any problems here we would need a fork-heavy workload with
many threads on many CPUs.

> Another thought is that, instead of trying to fix this up in meminfo, we
> could do this in a way that is guaranteed to never skew the global
> counter negative: we always keep the *percpu* skew negative.  This
> should be the same as what's in the kernel now:
>
> void vm_acct_memory(long pages)
> {
> 	long *local;
> 	long local_min = -ACCT_THRESHOLD;
> 	long local_max = ACCT_THRESHOLD;
> 	long local_goal = 0;
>
> 	preempt_disable();
> 	local = &__get_cpu_var(committed_space);
> 	*local += pages;
> 	if (*local > local_max || *local < local_min) {
> 		atomic_long_add(*local - local_goal, &vm_committed_space);
> 		*local = local_goal;
> 	}
> 	preempt_enable();
> }
>
> But now consider if we changed the local_* variables a bit:
>
> 	long local_min = -(ACCT_THRESHOLD * 2);
> 	long local_max = 0;
> 	long local_goal = -ACCT_THRESHOLD;
>
> We'll get some possibly *large* numbers in meminfo, but it will at least
> never underflow.
>
> -- Dave

-- 
Eric B Munson
IBM Linux Technology Center
ebmunson@us.ibm.com
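
To put numbers on the skew-versus-bouncing tradeoff discussed above, here
is a back-of-the-envelope userspace sketch.  It assumes, purely for
illustration, that the per-CPU fold threshold is the NR_CPUS-derived
value being patched (the real ACCT_THRESHOLD in mm/swap.c of that era
was scaled from NR_CPUS with an extra max()); NR_CPUS and the 4 KB page
size are likewise invented:

/*
 * Illustrative arithmetic only.  Each CPU can sit on up to one
 * threshold's worth of unfolded pages, so the worst-case skew an
 * overcommit decision can see is NR_CPUS * threshold.
 */
#include <stdio.h>

#define NR_CPUS		1024	/* a large SMP build, assumed */
#define MIN(a, b)	((a) < (b) ? (a) : (b))

int main(void)
{
	long old_thresh = NR_CPUS;		/* before the patch */
	long new_thresh = MIN(64, NR_CPUS);	/* after the patch  */

	printf("worst-case skew before: %ld pages (%ld MB)\n",
	       NR_CPUS * old_thresh, NR_CPUS * old_thresh * 4 / 1024);
	printf("worst-case skew after:  %ld pages (%ld MB)\n",
	       NR_CPUS * new_thresh, NR_CPUS * new_thresh * 4 / 1024);
	return 0;
}

On this assumed 1024-CPU build the cap shrinks the worst-case skew from
roughly 4 GB to 256 MB of pages, at the price of folding into the shared
atomic more often, which is exactly the cacheline bouncing Dave is
worried about.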
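
Similarly, a minimal userspace simulation of Dave's negative-skew
variant (kernel helpers such as preempt_disable() and __get_cpu_var()
are replaced with a plain array, and ACCT_THRESHOLD, NCPUS, and the
random workload are invented for illustration) shows why the global
counter can over-report but never dips below the true commitment:

/*
 * Userspace sketch, NOT kernel code: per-CPU counters become a plain
 * array and the accounting runs single-threaded.
 */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#define ACCT_THRESHOLD	64	/* pages per CPU before folding, assumed */
#define NCPUS		8

static long global;		/* stands in for vm_committed_space */
static long local[NCPUS];	/* stands in for the per-CPU counters */

/* Dave's proposed variant: keep each per-CPU skew within
 * [-2*ACCT_THRESHOLD, 0] by folding toward -ACCT_THRESHOLD. */
static void vm_acct_memory(int cpu, long pages)
{
	long local_min = -(ACCT_THRESHOLD * 2);
	long local_max = 0;
	long local_goal = -ACCT_THRESHOLD;

	local[cpu] += pages;
	if (local[cpu] > local_max || local[cpu] < local_min) {
		global += local[cpu] - local_goal;
		local[cpu] = local_goal;
	}
}

int main(void)
{
	long true_committed = 0;

	srand(1);
	for (int i = 0; i < 1000000; i++) {
		int cpu = rand() % NCPUS;
		long pages = (rand() % 96) - 32;  /* biased toward mapping */

		if (true_committed + pages < 0)	  /* keep the truth >= 0 */
			pages = -pages;
		true_committed += pages;
		vm_acct_memory(cpu, pages);

		/* Every local skew is <= 0, so the global counter (what
		 * meminfo would report) may over-report the commitment
		 * but never dips below it, hence never underflows. */
		assert(global >= true_committed);
	}
	printf("true=%ld global=%ld (over-report bounded by %d pages)\n",
	       true_committed, global, NCPUS * 2 * ACCT_THRESHOLD);
	return 0;
}

Because every per-CPU delta stays in [-2*ACCT_THRESHOLD, 0], the folded
global value is always at least the true total; the over-report is
bounded by NCPUS * 2 * ACCT_THRESHOLD, which is where the "possibly
*large* numbers in meminfo" come from.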