On Thu, 15 Jul 2004, Hugh Dickins wrote:
> I'm as likely to find a 512P machine as a basilisk, so scalability
> testing I leave to you.

OK, I managed to grab some time on the machine today.  Parallel page
faulting for /dev/zero and SysV shared memory has definitely improved
in the first few test cases I have.

The test we have is a program which specifically targets page faulting.
This test program was written after observing some of these issues on
MPI and OpenMP applications.  The test program does this:

1. Forks N child processes, or creates N Pthreads.
2. Each child/thread creates a memory object via malloc, mmap of
   /dev/zero, or shmget.
3. Each child/thread touches each page of the memory object by writing
   a single byte to the page.
4. The time to perform step 3 is measured.
5. The results are aggregated by the main process/thread and a report
   is generated, including statistics such as page faults per CPU per
   wall-clock second.

Another variant has the main thread/process create the memory object
and assign the range to be touched to each child/thread, which then
omits the object-creation stage and skips straight to step 3.  We call
this the "preallocate" option.

In our case we typically run with 100MB per child/thread, and run a
sequence of power-of-2 CPU counts, up to 512.

All of the work to this point has been on the fork-without-preallocation
variants.  I'm now looking at the fork-with-preallocation variants, and
find that we're hammering *VERY* hard on the shmem_inode_info i_lock,
mostly in the shmem_getpage code.  In fact, performance drops off
significantly even at 4P, and gets positively horrible by 32P (you
don't even want to know about >32P -- but things get 2-4x worse with
each doubling of CPUs).

Just so you can see it, I've attached the most recent run output from
the program.  Take a look at the next-to-last column of numbers.  In
the past few days the last few rows of the second two test cases have
gone from 2-3 digit numbers to 5-digit numbers -- that's what we've
been concentrating on.  Note that due to hardware failures the machine
is only running 510 CPUs in the attached output, and that things got so
miserably slow that I didn't even let the runs finish.  The last column
is meaningless, and always 0.  Also, the label "shared" means
"preallocate" in the discussion above.

Oh, and this is a 2.6.7-based kernel -- I'll change to 2.6.8 sometime
soon.

Anyway, the i_lock is my next vic^H^H^Hsubject of investigation.

Cheers, and have a great weekend,
Brent

--
Brent Casavant                  bcasavan@sgi.com        Forget bright-eyed and
Operating System Engineer       http://www.sgi.com/     bushy-tailed; I'm red-
Silicon Graphics, Inc.          44.8562N 93.1355W 860F  eyed and bushy-haired.
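
P.S. For anyone who wants to play along without the full harness, here
is a minimal sketch of the basic fork variant (steps 1-4 above).  This
is NOT our actual test program -- the real one also handles Pthreads,
malloc, shmget, the preallocate option, and proper aggregated
statistics; the argument handling, timing, and per-child reporting
below are simplified stand-ins:

/*
 * Minimal sketch of the fault-rate test described above.  Each child
 * mmaps /dev/zero privately and writes one byte per page, timing only
 * the touch loop (step 3).
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/time.h>
#include <sys/wait.h>

#define CHILD_BYTES (100UL << 20)       /* 100MB per child, as in our runs */

static double now(void)
{
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(int argc, char **argv)
{
        int nchildren = (argc > 1) ? atoi(argv[1]) : 1;
        long pagesize = sysconf(_SC_PAGESIZE);
        int i;

        for (i = 0; i < nchildren; i++) {
                if (fork() == 0) {
                        int fd = open("/dev/zero", O_RDWR);
                        char *mem;
                        unsigned long off;
                        double t0, t1;

                        if (fd < 0) {
                                perror("open");
                                _exit(1);
                        }
                        mem = mmap(NULL, CHILD_BYTES, PROT_READ|PROT_WRITE,
                                   MAP_PRIVATE, fd, 0);
                        if (mem == MAP_FAILED) {
                                perror("mmap");
                                _exit(1);
                        }

                        t0 = now();
                        for (off = 0; off < CHILD_BYTES; off += pagesize)
                                mem[off] = 1;   /* one write per page */
                        t1 = now();

                        printf("child %d: %.0f faults/sec\n",
                               i, (CHILD_BYTES / pagesize) / (t1 - t0));
                        _exit(0);
                }
        }
        while (wait(NULL) > 0)
                ;
        return 0;
}

Compile with something like "cc -O2 ptest.c" (the file name is just an
example) and pass the number of children as the first argument.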