On Mon, 20 Apr 2009, Dave Hansen wrote:

> On Mon, 2009-04-20 at 09:15 -0700, Dave Hansen wrote:
> > On Mon, 2009-04-20 at 10:09 +0100, Eric B Munson wrote:
> > > 1. Change NR_CPUS to min(64, NR_CPUS)
> > > This will limit the amount of possible skew on kernels compiled for very
> > > large SMP machines.  64 is an arbitrary number selected to limit the worst
> > > of the skew without using more cache lines.  min(64, NR_CPUS) is used
> > > instead of nr_online_cpus() because nr_online_cpus() requires a shared
> > > cache line and a call to hweight to make the calculation.  Its runtime
> > > overhead and the cost of keeping this counter accurate showed up in
> > > profiles, and it's possible that nr_online_cpus() would as well.
>
> Wow, that empty reply was really informative, wasn't it? :)
>
> My worry with this min(64, NR_CPUS) approach is that you effectively
> ensure that you're going to be doing a lot more cacheline bouncing, but
> it isn't quite as explicit.

Unfortunately this is a choice we have to make: do we want to avoid
cache line bouncing in fork-heavy workloads that touch more than 64
pages, or to avoid bad information being used for overcommit decisions?

> Now, every time there's a mapping (or set of them) created or destroyed
> that nets greater than 64 pages, you've got to go get a r/w cacheline to
> a possibly highly contended atomic.  With a number this low, you're
> almost guaranteed to hit it at fork() and exec().  Could you
> double-check that this doesn't hurt any of the fork() AIM tests?

It is unlikely that the aim9 benchmarks would show whether this patch is
a problem, because they fork in a tight loop in a process that is not
necessarily big enough to hit ACCT_THRESHOLD, likely on a single CPU.
To expose any problems here we would need a fork-heavy workload with
many threads on many CPUs.

> Another thought is that, instead of trying to fix this up in meminfo, we
> could do this in a way that is guaranteed to never skew the global
> counter negative: we always keep the *percpu* skew negative.  This
> should be the same as what's in the kernel now:
>
> void vm_acct_memory(long pages)
> {
> 	long *local;
> 	long local_min = -ACCT_THRESHOLD;
> 	long local_max = ACCT_THRESHOLD;
> 	long local_goal = 0;
>
> 	preempt_disable();
> 	local = &__get_cpu_var(committed_space);
> 	*local += pages;
> 	if (*local > local_max || *local < local_min) {
> 		atomic_long_add(*local - local_goal, &vm_committed_space);
> 		*local = local_goal;
> 	}
> 	preempt_enable();
> }
>
> But now consider if we changed the local_* variables a bit:
>
> 	long local_min = -(ACCT_THRESHOLD * 2);
> 	long local_max = 0;
> 	long local_goal = -ACCT_THRESHOLD;
>
> We'll get some possibly *large* numbers in meminfo, but it will at least
> never underflow.
>
> -- Dave

-- 
Eric B Munson
IBM Linux Technology Center
ebmunson@us.ibm.com
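
To put numbers on the skew-versus-bouncing tradeoff discussed above, here
is a back-of-the-envelope userspace sketch.  It assumes, purely for
illustration, that the per-CPU fold threshold is the NR_CPUS-derived
value being patched (the real ACCT_THRESHOLD in mm/swap.c of that era
was scaled from NR_CPUS with an extra max()); NR_CPUS and the 4 KB page
size are likewise invented:

/*
 * Illustrative arithmetic only.  Each CPU can sit on up to one
 * threshold's worth of unfolded pages, so the worst-case skew an
 * overcommit decision can see is NR_CPUS * threshold.
 */
#include <stdio.h>

#define NR_CPUS		1024	/* a large SMP build, assumed */
#define MIN(a, b)	((a) < (b) ? (a) : (b))

int main(void)
{
	long old_thresh = NR_CPUS;		/* before the patch */
	long new_thresh = MIN(64, NR_CPUS);	/* after the patch  */

	printf("worst-case skew before: %ld pages (%ld MB)\n",
	       NR_CPUS * old_thresh, NR_CPUS * old_thresh * 4 / 1024);
	printf("worst-case skew after:  %ld pages (%ld MB)\n",
	       NR_CPUS * new_thresh, NR_CPUS * new_thresh * 4 / 1024);
	return 0;
}

On this assumed 1024-CPU build the cap shrinks the worst-case skew from
roughly 4 GB to 256 MB of pages, at the price of folding into the shared
atomic more often, which is exactly the cacheline bouncing Dave is
worried about.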
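
Similarly, a minimal userspace simulation of Dave's negative-skew
variant (kernel helpers such as preempt_disable() and __get_cpu_var()
are replaced with a plain array, and ACCT_THRESHOLD, NCPUS, and the
random workload are invented for illustration) shows why the global
counter can over-report but never dips below the true commitment:

/*
 * Userspace sketch, NOT kernel code: per-CPU counters become a plain
 * array and the accounting runs single-threaded.
 */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#define ACCT_THRESHOLD	64	/* pages per CPU before folding, assumed */
#define NCPUS		8

static long global;		/* stands in for vm_committed_space */
static long local[NCPUS];	/* stands in for the per-CPU counters */

/* Dave's proposed variant: keep each per-CPU skew within
 * [-2*ACCT_THRESHOLD, 0] by folding toward -ACCT_THRESHOLD. */
static void vm_acct_memory(int cpu, long pages)
{
	long local_min = -(ACCT_THRESHOLD * 2);
	long local_max = 0;
	long local_goal = -ACCT_THRESHOLD;

	local[cpu] += pages;
	if (local[cpu] > local_max || local[cpu] < local_min) {
		global += local[cpu] - local_goal;
		local[cpu] = local_goal;
	}
}

int main(void)
{
	long true_committed = 0;

	srand(1);
	for (int i = 0; i < 1000000; i++) {
		int cpu = rand() % NCPUS;
		long pages = (rand() % 96) - 32;  /* biased toward mapping */

		if (true_committed + pages < 0)	  /* keep the truth >= 0 */
			pages = -pages;
		true_committed += pages;
		vm_acct_memory(cpu, pages);

		/* Every local skew is <= 0, so the global counter (what
		 * meminfo would report) may over-report the commitment
		 * but never dips below it, hence never underflows. */
		assert(global >= true_committed);
	}
	printf("true=%ld global=%ld (over-report bounded by %d pages)\n",
	       true_committed, global, NCPUS * 2 * ACCT_THRESHOLD);
	return 0;
}

Because every per-CPU delta stays in [-2*ACCT_THRESHOLD, 0], the folded
global value is always at least the true total; the over-report is
bounded by NCPUS * 2 * ACCT_THRESHOLD, which is where the "possibly
*large* numbers in meminfo" come from.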