On Wed, Jun 08, 2016 at 02:43:44PM -0400, neha agarwal wrote:
> On Mon, Jun 6, 2016 at 9:51 AM, Kirill A. Shutemov wrote:
> > On Wed, May 25, 2016 at 03:11:55PM -0400, neha agarwal wrote:
> > > Hi All,
> > >
> > > I have been testing Hugh's and Kirill's huge tmpfs patch sets with
> > > Cassandra (NoSQL database). I am seeing a significant performance gap
> > > between these two implementations (~30%): Hugh's implementation
> > > performs better than Kirill's. I am surprised to see this performance
> > > gap. Following is my test setup.
> > >
> > > Patchsets
> > > ========
> > > - For Hugh's:
> > > I checked out 4.6-rc3, applied Hugh's preliminary patches (01 to 10)
> > > from here: https://lkml.org/lkml/2016/4/5/792 and then applied the
> > > THP patches posted on April 16 (01 to 29).
> > >
> > > - For Kirill's:
> > > I am using his branch
> > > "git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git hugetmpfs/v8",
> > > which is based off of 4.6-rc3, posted on May 12.
> > >
> > > Khugepaged settings
> > > ================
> > > cd /sys/kernel/mm/transparent_hugepage
> > > echo 10 >khugepaged/alloc_sleep_millisecs
> > > echo 10 >khugepaged/scan_sleep_millisecs
> > > echo 511 >khugepaged/max_ptes_none
> > >
> > > Mount options
> > > ===========
> > > - For Hugh's:
> > > sudo sysctl -w vm/shmem_huge=2
> > > sudo mount -o remount,huge=1 /hugetmpfs
> > >
> > > - For Kirill's:
> > > sudo mount -o remount,huge=always /hugetmpfs
> > > echo force > /sys/kernel/mm/transparent_hugepage/shmem_enabled
> > > echo 511 >khugepaged/max_ptes_swap
> > >
> > > Workload Setting
> > > =============
> > > Please look at the attached setup document for Cassandra (NoSQL
> > > database): cassandra-setup.txt
> > >
> > > Machine setup
> > > ===========
> > > 36-core (72 hardware thread) dual-socket x86 server with 512 GB RAM
> > > running Ubuntu. I use control groups for resource isolation. Server
> > > and client threads run on different sockets. The frequency governor is
> > > set to "performance" to remove any performance fluctuations due to
> > > frequency variation.
> > >
> > > Throughput numbers
> > > ================
> > > Hugh's implementation: 74522.08 ops/sec
> > > Kirill's implementation: 54919.10 ops/sec
> >
> > In my setup I don't see the difference:
> >
> > v4.7-rc1 + my implementation:
> > [OVERALL], RunTime(ms), 822862.0
> > [OVERALL], Throughput(ops/sec), 60763.53021527304
> > ShmemPmdMapped: 4999168 kB
> >
> > v4.6-rc2 + Hugh's implementation:
> > [OVERALL], RunTime(ms), 833157.0
> > [OVERALL], Throughput(ops/sec), 60012.698687042175
> > ShmemPmdMapped: 5021696 kB
> >
> > It's basically within measurement error. 'ShmemPmdMapped' indicates how
> > much memory is mapped with huge pages by the end of the test.
> >
> > It's on a dual-socket 24-core machine with 64G of RAM.
> >
> > I guess we have some configuration difference or something, but so far
> > I don't see the drastic performance difference you've pointed to.
> >
> > Maybe my implementation behaves slower on bigger machines, I don't
> > know. There's no architectural reason for this.
> >
> > I'll post my updated patchset today.
> >
> > --
> > Kirill A. Shutemov
>
> Thanks a lot, Kirill, for the testing. It is interesting that you don't
> see any significant performance difference. Also, your absolute
> throughput numbers are different from mine, more so for Hugh's
> implementation.
>
> Can you please share your kernel config file?

Attached.
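
By the way, the ShmemPmdMapped numbers above come straight from
/proc/meminfo. If you want to watch how much shmem ends up mapped with
huge pages while the benchmark is running, something like this is enough
(the 1-second interval is just an example):

  watch -n1 'grep ShmemPmdMapped /proc/meminfo'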
> I will try to look if I have some different config settings. Also, I am
> assuming that you had turned off DVFS.

DVFS? I'm not sure what you're talking about. I guess it's not "dynamic
voltage and frequency scaling". :)

> One thing I forgot to mention in my previous setup email: I use 8 cores
> for running the Cassandra server threads. Can you please tell me how many
> cores you used? As Cassandra is CPU-bound, that can make a difference in
> the throughput numbers we are seeing.

I have a 24-core machine, and I didn't limit CPU usage in any way. I can
see the load average easily over 15.

--
Kirill A. Shutemov
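
P.S. In case it helps to compare against your 8-core setup: one way to pin
the Cassandra server to a fixed set of cores is a cpuset cgroup, roughly
along these lines (a sketch assuming cgroup v1 with cpuset mounted at
/sys/fs/cgroup/cpuset; the group name "cassandra" and the pid are
placeholders):

  sudo mkdir /sys/fs/cgroup/cpuset/cassandra
  # 8 cores on one socket, plus that socket's memory node
  echo 0-7 | sudo tee /sys/fs/cgroup/cpuset/cassandra/cpuset.cpus
  echo 0 | sudo tee /sys/fs/cgroup/cpuset/cassandra/cpuset.mems
  # move the Cassandra JVM into the group
  echo <cassandra-pid> | sudo tee /sys/fs/cgroup/cpuset/cassandra/tasks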