On Wed, May 25, 2016 at 4:03 PM, Kirill A. Shutemov wrote:
> On Wed, May 25, 2016 at 03:11:55PM -0400, neha agarwal wrote:
> > Hi All,
> >
> > I have been testing Hugh's and Kirill's huge tmpfs patch sets with
> > Cassandra (NoSQL database). I am seeing a significant performance gap
> > between these two implementations (~30%). Hugh's implementation performs
> > better than Kirill's implementation. I am surprised to see this
> > performance gap. Following is my test setup.
>
> Thanks for the report. I'll look into it.

Thanks, Kirill, for looking into it.

> > Patchsets
> > ========
> > - For Hugh's:
> > I checked out 4.6-rc3, applied Hugh's preliminary patches (01 to 10)
> > from here: https://lkml.org/lkml/2016/4/5/792 and then applied the
> > THP patches posted on April 16 (01 to 29).
> >
> > - For Kirill's:
> > I am using his branch
> > "git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git hugetmpfs/v8",
> > which is based off of 4.6-rc3, posted on May 12.
> >
> >
> > Khugepaged settings
> > ================
> > cd /sys/kernel/mm/transparent_hugepage
> > echo 10 >khugepaged/alloc_sleep_millisecs
> > echo 10 >khugepaged/scan_sleep_millisecs
> > echo 511 >khugepaged/max_ptes_none
>
> Did you do this for both setups?
>
> It's not really necessary for Hugh's, but it makes sense to have this
> identical for testing.

Yeah, right: Hugh's will not be impacted by these settings, but for identical testing I did that.

> Do you have swap in the system? Is it in use during testing?

I do not have swap in the system.
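[Editorial note: a small sketch of how the "identical settings on both setups" point above could be scripted. `THP_DIR` and `set_knob` are names invented for this sketch; here `THP_DIR` is a scratch stand-in for /sys/kernel/mm/transparent_hugepage so it can run unprivileged, and the scratch tree mimics Hugh's kernel, which does not expose max_ptes_swap. On the real server, point `THP_DIR` at the sysfs path and run as root.]

```shell
# Sketch only: THP_DIR and set_knob are invented for illustration.
# THP_DIR is a scratch stand-in for /sys/kernel/mm/transparent_hugepage.
THP_DIR=$(mktemp -d)
mkdir -p "$THP_DIR/khugepaged"
# Simulate Hugh's tree: the three common knobs exist, max_ptes_swap does not.
for k in alloc_sleep_millisecs scan_sleep_millisecs max_ptes_none; do
    echo 0 > "$THP_DIR/khugepaged/$k"
done

set_knob() {
    # Write a knob only if this kernel exposes it, then read it back.
    knob="$THP_DIR/khugepaged/$1"
    if [ -e "$knob" ]; then
        echo "$2" > "$knob"
        echo "$1=$(cat "$knob")"
    else
        echo "$1: not present in this tree, skipped"
    fi
}

set_knob alloc_sleep_millisecs 10
set_knob scan_sleep_millisecs 10
set_knob max_ptes_none 511
set_knob max_ptes_swap 511
```

Run this once per boot on each kernel under test; knobs the tree does not expose are reported as skipped rather than failing the script.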
> > Mount options
> > ===========
> > - For Hugh's:
> > sudo sysctl -w vm/shmem_huge=2
> > sudo mount -o remount,huge=1 /hugetmpfs
> >
> > - For Kirill's:
> > sudo mount -o remount,huge=always /hugetmpfs
> > echo force > /sys/kernel/mm/transparent_hugepage/shmem_enabled
> > echo 511 >khugepaged/max_ptes_swap
> >
> >
> > Workload Setting
> > =============
> > Please look at the attached setup document for Cassandra (NoSQL
> > database): cassandra-setup.txt
> >
> >
> > Machine setup
> > ===========
> > 36-core (72 hardware thread) dual-socket x86 server with 512 GB RAM
> > running Ubuntu. I use control groups for resource isolation. Server and
> > client threads run on different sockets. The frequency governor is set
> > to "performance" to remove any performance fluctuations due to frequency
> > variation.
> >
> >
> > Throughput numbers
> > ================
> > Hugh's implementation: 74522.08 ops/sec
> > Kirill's implementation: 54919.10 ops/sec
> >
> >
> > I am not sure if something is fishy with my test environment or if
> > there is actually a performance gap between the two implementations. I
> > have run this test 5-6 times, so I am certain that this experiment is
> > repeatable. I would appreciate it if someone could help me understand
> > the reason for this performance gap.
> >
> > On Thu, May 12, 2016 at 11:40 AM, Kirill A. Shutemov <
> > kirill.shutemov@linux.intel.com> wrote:
> >
> > > This update is aimed at addressing my todo list from the lsf/mm summit:
> > >
> > > - we are now able to recover memory by splitting huge pages partly
> > > beyond i_size. This should address the concern about small files.
> > >
> > > - a bunch of bug fixes for khugepaged, including a fix for the data
> > > corruption reported by Hugh.
> > >
> > > - Disabled for Power, as it requires a deposited page table to get
> > > THP mapped and we don't do deposit/withdraw for file THP.
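[Editorial note: the size of the "~30%" gap reported above depends on which implementation is taken as the baseline. From the two reported throughput figures:]

```shell
# Quantify the reported gap (74522.08 vs 54919.10 ops/sec).
hugh=74522.08
kirill=54919.10
awk -v h="$hugh" -v k="$kirill" 'BEGIN {
    printf "Kirill relative to Hugh: %.1f%% lower\n", (h - k) / h * 100
    printf "Hugh relative to Kirill: %.1f%% higher\n", (h - k) / k * 100
}'
```

So the "~30%" in the report sits between 26.3% (gap relative to Hugh's figure) and 35.7% (Hugh's advantage relative to Kirill's figure).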
> > > The main part of the patchset (up to the khugepaged stuff) is
> > > relatively stable -- I fixed a few minor bugs there, but nothing major.
> > >
> > > I would appreciate rigorous review of khugepaged and of the code to
> > > split huge pages under memory pressure.
> > >
> > > The patchset is on top of v4.6-rc3 plus Hugh's "easy preliminaries to
> > > THPagecache" and Ebru's khugepaged swapin patches from the -mm tree.
> > >
> > > Git tree:
> > >
> > > git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git hugetmpfs/v8
> > >
> > > == Changelog ==
> > >
> > > v8:
> > > - khugepaged updates:
> > > + mark the collapsed page dirty, otherwise vmscan would discard it;
> > > + account pages to mapping->nrpages on shmem_charge;
> > > + fix a situation when not all tail pages are put on the radix tree
> > > on collapse;
> > > + fix off-by-one in the loop-exit condition in khugepaged_scan_shmem();
> > > + use radix_tree_iter_next/radix_tree_iter_retry instead of gotos;
> > > + fix build without CONFIG_SHMEM (again);
> > > - split huge pages beyond i_size under memory pressure;
> > > - disable huge tmpfs on Power, as it makes use of deposited page
> > > tables, which we don't have;
> > > - fix filesystem size limit accounting;
> > > - mark the page referenced on split_huge_pmd() if the pmd is young;
> > > - uncharge pages from shmem that are removed during split_huge_page();
> > > - make shmem_inode_info::lock irq-safe -- required by khugepaged;
> > >
> > > v7:
> > > - khugepaged updates:
> > > + fix page leak/page cache corruption on collapse fail;
> > > + filter out VMAs not suitable for huge pages due to misaligned vm_pgoff;
> > > + fix build without CONFIG_SHMEM;
> > > + drop a few over-protective checks;
> > > - fix bogus VM_BUG_ON() in __delete_from_page_cache();
> > >
> > > v6:
> > > - experimental collapse support;
> > > - fix swapout of mapped huge pages;
> > > - fix page leak in faultaround code;
> > > - fix excessive huge page allocation with huge=within_size;
> > > - rename VM_NO_THP to VM_NO_KHUGEPAGED;
> > > - fix condition in hugepage_madvise();
> > > - accounting reworked again;
> > >
> > > v5:
> > > - add FileHugeMapped to /proc/PID/smaps;
> > > - make FileHugeMapped in meminfo aligned with other fields;
> > > - Documentation/vm/transhuge.txt updated;
> > >
> > > v4:
> > > - first four patches were applied to the -mm tree;
> > > - drop pages beyond i_size on split_huge_pages;
> > > - a few small random bugfixes;
> > >
> > > v3:
> > > - the huge= mount option can now have the values always, within_size,
> > > advise and never;
> > > - the sysctl handle is replaced with a sysfs knob;
> > > - MADV_HUGEPAGE/MADV_NOHUGEPAGE are now respected on page allocation
> > > via page fault;
> > > - mlock() handling has been fixed;
> > > - a bunch of smaller bugfixes and cleanups.
> > >
> > > == Design overview ==
> > >
> > > Huge pages are allocated by shmem when that is allowed (by mount
> > > option) and there are no entries for the range in the radix tree. A
> > > huge page is represented by HPAGE_PMD_NR entries in the radix tree.
> > >
> > > The MM core maps a page with a PMD if ->fault() returns a huge page
> > > and the VMA is suitable for huge pages (size, alignment). There's no
> > > need for two requests to the filesystem: the filesystem returns a huge
> > > page if it can, with graceful fallback to small pages otherwise.
> > >
> > > As with DAX, split_huge_pmd() is implemented by unmapping the PMD: we
> > > can re-fault the page with PTEs later.
> > >
> > > The basic scheme for split_huge_page() is the same as for anon-THP.
> > > A few differences:
> > >
> > > - File pages are in the radix tree, so we have head->_count offset by
> > > HPAGE_PMD_NR. The count gets distributed to small pages during split.
> > >
> > > - mapping->tree_lock prevents non-lockless access to pages under
> > > split via the radix tree;
> > >
> > > - Lockless access is prevented by setting head->_count to 0 during
> > > split, so get_page_unless_zero() will fail;
> > >
> > > - After split, some pages can be beyond i_size.
> > > We drop them from the radix tree.
> > >
> > > - We don't set up migration entries; we just unmap the pages. This
> > > helps handle cases when i_size is in the middle of a page: there is
> > > no need to handle unmapping of pages beyond i_size manually.
> > >
> > > COW mappings are handled at PTE level. It's not clear how beneficial
> > > allocating huge pages on COW faults would be, and it would require
> > > some code to make them work.
> > >
> > > I think at some point we can consider teaching khugepaged to collapse
> > > pages in COW mappings, but allocating huge pages on fault is probably
> > > overkill.
> > >
> > > As with anon THP, we mlock a file huge page only if it is mapped with
> > > a PMD. PTE-mapped THPs are never mlocked. This way we can avoid all
> > > sorts of scenarios where we could leak an mlocked page.
> > >
> > > As with anon THP, we split huge pages on swap out.
> > >
> > > Truncation and hole punching that cover only part of a THP range are
> > > implemented by zeroing out that part of the THP.
> > >
> > > This has a visible effect on fallocate(FALLOC_FL_PUNCH_HOLE)
> > > behaviour. As we don't really create a hole in this case,
> > > lseek(SEEK_HOLE) may give inconsistent results depending on what
> > > pages happened to be allocated. I don't think this will be a problem.
> > >
> > > We track a per-super_block list of inodes which potentially have a
> > > huge page partly beyond i_size. Under memory pressure, or if we hit
> > > -ENOSPC, we split such pages in order to recover memory.
> > >
> > > The list is per-sb, as we need to split a page from our own
> > > filesystem if we hit -ENOSPC (the -o size= limit) during
> > > shmem_getpage_gfp() to free some space.
> > >
> > > Hugh Dickins (1):
> > > shmem: get_unmapped_area align huge page
> > >
> > > Kirill A. Shutemov (31):
> > > thp, mlock: update unevictable-lru.txt
> > > mm: do not pass mm_struct into handle_mm_fault
> > > mm: introduce fault_env
> > > mm: postpone page table allocation until we have page to map
> > > rmap: support file thp
> > > mm: introduce do_set_pmd()
> > > thp, vmstats: add counters for huge file pages
> > > thp: support file pages in zap_huge_pmd()
> > > thp: handle file pages in split_huge_pmd()
> > > thp: handle file COW faults
> > > thp: skip file huge pmd on copy_huge_pmd()
> > > thp: prepare change_huge_pmd() for file thp
> > > thp: run vma_adjust_trans_huge() outside i_mmap_rwsem
> > > thp: file pages support for split_huge_page()
> > > thp, mlock: do not mlock PTE-mapped file huge pages
> > > vmscan: split file huge pages before paging them out
> > > page-flags: relax policy for PG_mappedtodisk and PG_reclaim
> > > radix-tree: implement radix_tree_maybe_preload_order()
> > > filemap: prepare find and delete operations for huge pages
> > > truncate: handle file thp
> > > mm, rmap: account shmem thp pages
> > > shmem: prepare huge= mount option and sysfs knob
> > > shmem: add huge pages support
> > > shmem, thp: respect MADV_{NO,}HUGEPAGE for file mappings
> > > thp: update Documentation/vm/transhuge.txt
> > > thp: extract khugepaged from mm/huge_memory.c
> > > khugepaged: move up_read(mmap_sem) out of khugepaged_alloc_page()
> > > shmem: make shmem_inode_info::lock irq-safe
> > > khugepaged: add support of collapse for tmpfs/shmem pages
> > > thp: introduce CONFIG_TRANSPARENT_HUGE_PAGECACHE
> > > shmem: split huge pages beyond i_size under memory pressure
> > >
> > > Documentation/filesystems/Locking | 10 +-
> > > Documentation/vm/transhuge.txt | 130 ++-
> > > Documentation/vm/unevictable-lru.txt | 21 +
> > > arch/alpha/mm/fault.c | 2 +-
> > > arch/arc/mm/fault.c | 2 +-
> > > arch/arm/mm/fault.c | 2 +-
> > > arch/arm64/mm/fault.c | 2 +-
> > > arch/avr32/mm/fault.c | 2 +-
> > > arch/cris/mm/fault.c | 2 +-
> > > arch/frv/mm/fault.c | 2 +-
> > > arch/hexagon/mm/vm_fault.c | 2 +-
> > > arch/ia64/mm/fault.c | 2 +-
> > > arch/m32r/mm/fault.c | 2 +-
> > > arch/m68k/mm/fault.c | 2 +-
> > > arch/metag/mm/fault.c | 2 +-
> > > arch/microblaze/mm/fault.c | 2 +-
> > > arch/mips/mm/fault.c | 2 +-
> > > arch/mn10300/mm/fault.c | 2 +-
> > > arch/nios2/mm/fault.c | 2 +-
> > > arch/openrisc/mm/fault.c | 2 +-
> > > arch/parisc/mm/fault.c | 2 +-
> > > arch/powerpc/mm/copro_fault.c | 2 +-
> > > arch/powerpc/mm/fault.c | 2 +-
> > > arch/s390/mm/fault.c | 2 +-
> > > arch/score/mm/fault.c | 2 +-
> > > arch/sh/mm/fault.c | 2 +-
> > > arch/sparc/mm/fault_32.c | 4 +-
> > > arch/sparc/mm/fault_64.c | 2 +-
> > > arch/tile/mm/fault.c | 2 +-
> > > arch/um/kernel/trap.c | 2 +-
> > > arch/unicore32/mm/fault.c | 2 +-
> > > arch/x86/mm/fault.c | 2 +-
> > > arch/xtensa/mm/fault.c | 2 +-
> > > drivers/base/node.c | 13 +-
> > > drivers/char/mem.c | 24 +
> > > drivers/iommu/amd_iommu_v2.c | 3 +-
> > > drivers/iommu/intel-svm.c | 2 +-
> > > fs/proc/meminfo.c | 7 +-
> > > fs/proc/task_mmu.c | 10 +-
> > > fs/userfaultfd.c | 22 +-
> > > include/linux/huge_mm.h | 36 +-
> > > include/linux/khugepaged.h | 6 +
> > > include/linux/mm.h | 51 +-
> > > include/linux/mmzone.h | 4 +-
> > > include/linux/page-flags.h | 19 +-
> > > include/linux/radix-tree.h | 1 +
> > > include/linux/rmap.h | 2 +-
> > > include/linux/shmem_fs.h | 45 +-
> > > include/linux/userfaultfd_k.h | 8 +-
> > > include/linux/vm_event_item.h | 7 +
> > > include/trace/events/huge_memory.h | 3 +-
> > > ipc/shm.c | 10 +-
> > > lib/radix-tree.c | 68 +-
> > > mm/Kconfig | 8 +
> > > mm/Makefile | 2 +-
> > > mm/filemap.c | 226 ++--
> > > mm/gup.c | 7 +-
> > > mm/huge_memory.c | 2032 ++++++----------------------------
> > > mm/internal.h | 4 +-
> > > mm/khugepaged.c | 1851 +++++++++++++++++++++++++++++++
> > > mm/ksm.c | 5 +-
> > > mm/memory.c | 860 +++++++-------
> > > mm/mempolicy.c | 4 +-
> > > mm/migrate.c | 5 +-
> > > mm/mmap.c | 26 +-
> > > mm/nommu.c | 3 +-
> > > mm/page-writeback.c | 1 +
> > > mm/page_alloc.c | 21 +
> > > mm/rmap.c | 78 +-
> > > mm/shmem.c | 918 +++++++++++++--
> > > mm/swap.c | 2 +
> > > mm/truncate.c | 22 +-
> > > mm/util.c | 6 +
> > > mm/vmscan.c | 6 +
> > > mm/vmstat.c | 4 +
> > > 75 files changed, 4240 insertions(+), 2415 deletions(-)
> > > create mode 100644 mm/khugepaged.c
> > >
> > > --
> > > 2.8.1
> >
> > --
> > Thanks and Regards,
> > Neha Agarwal
> > University of Michigan
> >
> > 1. Download and extract Cassandra
> > http://archive.apache.org/dist/cassandra/2.0.16/apache-cassandra-2.0.16-bin.tar.gz
> >
> > Note that my test version is Cassandra-2.0.16.
> > We will denote the path to which the file is extracted as CASSANDRA_BIN.
> >
> > 2. Set up the environment for cassandra
> > mkdir -p run_cassandra/cassandra_conf/triggers
> >
> > - Download cassandra-env.sh, cassandra.yaml, log4j-server.properties
> > from my mail attachment and copy those files into
> > run_cassandra/cassandra_conf
> > - Search for /home/nehaag/hugetmpfs in these files and change it to a
> > local directory mounted as tmpfs. Let's say that is CASSANDRA_DATA. A
> > folder named "cassandra" will be created automatically (for example:
> > CASSANDRA_DATA/cassandra) when running Cassandra.
> > - Please note that these scripts will need modifications if you use a
> > Cassandra version other than 2.0.16
> >
> > - Download create-ycsb-table.cql.j2 from my email attachment and copy
> > it into run_cassandra/
> >
> > 3. JAVA setup, get JRE: openjdk v1.7.0_101 (sudo apt-get install
> > openjdk-7-jre for Ubuntu)
> >
> > 4. Set up the YCSB load generator:
> > - Clone ycsb from: https://github.com/brianfrankcooper/YCSB.git.
> > Let's say this is downloaded to YCSB_ROOT
> > - You need to have maven 3 installed (`sudo apt-get install maven' in
> > ubuntu)
> > - Create a script (say run-cassandra.sh) in run_cassandra as follows:
> >
> > input_file=run_cassandra/create-ycsb-table.cql.j2
> > cassandra_cli=${CASSANDRA_BIN}/bin/cassandra-cli
> > host="127.0.0.1" # IP address of the machine running the cassandra server
> > $cassandra_cli -h $host --jmxport 7199 -f $input_file
> > cd ${YCSB_ROOT}
> >
> > # Load dataset
> > ${YCSB_ROOT}/bin/ycsb -cp ${YCSB_ROOT}/cassandra/target/dependency/slf4j-simple-1.7.12.jar load cassandra-10 -p hosts=$host -threads 20 -p fieldcount=20 -p recordcount=5000000 -P ${YCSB_ROOT}/workloads/workloadb -s
> >
> > # Run benchmark
> > ${YCSB_ROOT}/bin/ycsb -cp ${YCSB_ROOT}/cassandra/target/dependency/slf4j-simple-1.7.12.jar run cassandra-10 -p hosts=$host -threads 20 -p fieldcount=20 -p operationcount=50000000 -p recordcount=5000000 -p readproportion=0.05 -p updateproportion=0.95 -P ${YCSB_ROOT}/workloads/workloadb -s
> >
> > 5. Run the cassandra server on the host machine:
> > rm -r ${CASSANDRA_DATA}/cassandra && CASSANDRA_CONF=run_cassandra/cassandra_conf JRE_HOME=/usr/lib/jvm/java-7-openjdk-amd64/jre ${CASSANDRA_BIN}/bin/cassandra -f
> >
> > 6. Run the load generator on the same or some other machine:
> > ./run-cassandra.sh
> >
> > YCSB periodically prints out throughput and latency numbers.
> > At the end, overall throughput and latency will be printed out.
>
> --
> Kirill A. Shutemov

--
Thanks and Regards,
Neha
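[Editorial note: a sanity check worth running alongside the benchmark is confirming that pages are actually mapped huge on each kernel. The counter names are kernel- and patchset-dependent (the v5 changelog above mentions a FileHugeMapped field in smaps/meminfo). The helper below uses invented names and a canned meminfo-style sample so it can run anywhere; on the server, feed it /proc/meminfo or the Cassandra process's smaps instead.]

```shell
# parse_huge is a name invented for this sketch; it just filters
# huge-page-related fields out of meminfo-style input. Field names vary
# by kernel (e.g. FileHugeMapped in these patchsets).
parse_huge() {
    awk '/HugePages|HugeMapped|PmdMapped/'
}

# Canned meminfo-style sample standing in for /proc/meminfo:
parse_huge <<'EOF'
MemTotal:       528234996 kB
AnonHugePages:     204800 kB
ShmemHugePages:   8388608 kB
EOF
```

On the test machine, `parse_huge < /proc/meminfo` during a run should show a large huge-page figure under whichever field the booted tree exports; a near-zero value would suggest the mount options did not take effect.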