On Thu, 15 Jul 2004, Hugh Dickins wrote:
> I'm as likely to find a 512P machine as a basilisk, so scalability
> testing I leave to you.

OK, I managed to grab some time on the machine today.  Parallel page
faulting for /dev/zero and SysV shared memory has definitely improved
in the first few test cases I have.

The test we have is a program which specifically targets page faulting.
This test program was written after observing some of these issues on
MPI and OpenMP applications.  The test program does this:

1. Forks N child processes, or creates N Pthreads.
2. Each child/thread creates a memory object via malloc, mmap of
   /dev/zero, or shmget.
3. Each child/thread touches each page of the memory object by writing
   a single byte to the page.
4. The time to perform step 3 is measured.
5. The results are aggregated by the main process/thread and a report
   is generated, including statistics such as page faults per CPU per
   wall-clock second.

Another variant has the main thread/process create the memory object
and assign the range to be touched to each child/thread, which then
omits the object-creation stage and skips straight to step 3.  We call
this the "preallocate" option.

In our case we typically run with 100MB per child/thread, and run a
sequence of power-of-2 CPU counts, up to 512.

All of the work to this point has been on the fork-without-preallocation
variants.  I'm now looking at the fork-with-preallocation variants, and
find that we're hammering *VERY* hard on the shmem_inode_info i_lock,
mostly in the shmem_getpage code.  In fact, performance drops off
significantly even at 4P, and gets positively horrible by 32P (you
don't even want to know about >32P -- but things get 2-4x worse with
each doubling of CPUs).

Just so you can see it, I've attached the most recent run output from
the program.  Take a look at the next-to-last column of numbers.  In
the past few days the last few rows of the second two test cases have
gone from 2-3 digit numbers to 5-digit numbers -- that's what we've
been concentrating on.  Note that due to hardware failures the machine
is only running 510 CPUs in the attached output, and that things got so
miserably slow that I didn't even let the runs finish.  The last column
is meaningless, and always 0.  Also, the label "shared" means
"preallocate" in the discussion above.

Oh, and this is a 2.6.7-based kernel -- I'll change to 2.6.8 sometime
soon.

Anyway, the i_lock is my next vic^H^H^Hsubject of investigation.

Cheers, and have a great weekend,
Brent

--
Brent Casavant                  bcasavan@sgi.com        Forget bright-eyed and
Operating System Engineer       http://www.sgi.com/     bushy-tailed; I'm red-
Silicon Graphics, Inc.          44.8562N 93.1355W 860F  eyed and bushy-haired.
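
P.S. For anyone who wants to play along without the full harness, here
is a minimal sketch of the basic fork variant (steps 1-4 above).  This
is NOT our actual test program -- the real one also handles Pthreads,
malloc, shmget, the preallocate option, and proper aggregated
statistics; the argument handling, timing, and per-child reporting
below are simplified stand-ins:

/*
 * Minimal sketch of the fault-rate test described above.  Each child
 * mmaps /dev/zero privately and writes one byte per page, timing only
 * the touch loop (step 3).
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/time.h>
#include <sys/wait.h>

#define CHILD_BYTES (100UL << 20)       /* 100MB per child, as in our runs */

static double now(void)
{
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(int argc, char **argv)
{
        int nchildren = (argc > 1) ? atoi(argv[1]) : 1;
        long pagesize = sysconf(_SC_PAGESIZE);
        int i;

        for (i = 0; i < nchildren; i++) {
                if (fork() == 0) {
                        int fd = open("/dev/zero", O_RDWR);
                        char *mem;
                        unsigned long off;
                        double t0, t1;

                        if (fd < 0) {
                                perror("open");
                                _exit(1);
                        }
                        mem = mmap(NULL, CHILD_BYTES, PROT_READ|PROT_WRITE,
                                   MAP_PRIVATE, fd, 0);
                        if (mem == MAP_FAILED) {
                                perror("mmap");
                                _exit(1);
                        }

                        t0 = now();
                        for (off = 0; off < CHILD_BYTES; off += pagesize)
                                mem[off] = 1;   /* one write per page */
                        t1 = now();

                        printf("child %d: %.0f faults/sec\n",
                               i, (CHILD_BYTES / pagesize) / (t1 - t0));
                        _exit(0);
                }
        }
        while (wait(NULL) > 0)
                ;
        return 0;
}

Compile with something like "cc -O2 ptest.c" (the file name is just an
example) and pass the number of children as the first argument.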