From: Ning Qu <quning@google.com>
To: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>,
Andrew Morton <akpm@linux-foundation.org>,
Al Viro <viro@zeniv.linux.org.uk>,
Hugh Dickins <hughd@google.com>,
Wu Fengguang <fengguang.wu@intel.com>, Jan Kara <jack@suse.cz>,
Mel Gorman <mgorman@suse.de>,
linux-mm@kvack.org, Andi Kleen <ak@linux.intel.com>,
Matthew Wilcox <willy@linux.intel.com>,
"Kirill A. Shutemov" <kirill@shutemov.name>,
Hillf Danton <dhillf@gmail.com>, Dave Hansen <dave@sr71.net>,
Alexander Shishkin <alexander.shishkin@linux.intel.com>,
linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()
Date: Tue, 24 Sep 2013 17:12:33 -0700 [thread overview]
Message-ID: <CACz4_2drFs5LsM8mTFNOWGHAs0QbsNfHAhiBXJ7jM3qkGerd5w@mail.gmail.com> (raw)
In-Reply-To: <1379937950-8411-1-git-send-email-kirill.shutemov@linux.intel.com>
Hi, Kirill,
It seems you dropped one patch from v5 -- was that intentional? Just wondering ...
thp, mm: handle tail pages in page_cache_get_speculative()
Thanks!
Best wishes,
--
Ning Qu (曲宁) | Software Engineer | quning@google.com | +1-408-418-6066
On Mon, Sep 23, 2013 at 5:05 AM, Kirill A. Shutemov <kirill.shutemov@linux.intel.com> wrote:
> It brings thp support for ramfs, but without mmap() -- it will be posted
> separately.
>
> Please review and consider applying.
>
> Intro
> -----
>
> The goal of the project is to prepare kernel infrastructure to handle
> huge pages in the page cache.
>
> To prove that the proposed changes are functional, we enable the feature
> for the simplest file system -- ramfs. ramfs is not that useful by
> itself, but it's a good pilot project.
>
> Design overview
> ---------------
>
> Every huge page is represented in the page cache radix tree by
> HPAGE_PMD_NR (512 on x86-64) entries. All entries point to the head
> page -- refcounting for tail pages is pretty expensive.
>
> Radix tree manipulations are implemented in a batched way: we add and
> remove a whole huge page at once, under one tree_lock. To make this
> possible, we extended the radix-tree interface to be able to
> pre-allocate enough memory to insert a number of *contiguous* elements
> (kudos to Matthew Wilcox).
>
> Huge pages can be added to the page cache in three ways:
>  - write(2) to a file or page;
>  - read(2) from a sparse file;
>  - a page fault on a sparse file.
>
> Potentially, one more way is collapsing small pages, but that's outside
> the initial implementation.
>
> For now we still write/read at most PAGE_CACHE_SIZE bytes at a time.
> There's some room for speedup later.
>
> Since mmap() isn't targeted by this patchset, we just split the huge
> page on page fault.
>
> To minimize memory overhead for small files, we avoid write-allocation
> in the first huge page area (2M on x86-64) of the file.
>
> truncate_inode_pages_range() drops a whole huge page at once if it's
> fully inside the range. If a huge page is only partly in the range, we
> zero out that part, exactly as we do for partial small pages.
>
> split_huge_page() for file pages works similarly to anon pages, but we
> walk mapping->i_mmap rather than anon_vma->rb_root. At the end we call
> truncate_inode_pages() to drop small pages beyond i_size, if any.
>
> inode->i_split_sem taken for read protects huge pages in the inode's
> page cache against splitting; we take it for write during the split.
>
> Changes since v5
> ----------------
> - change how a huge page is stored in the page cache: the head page is
>   used for all relevant indexes;
> - introduce i_split_sem;
> - do not create huge pages on write(2) into the first huge page area;
> - compile-time disabled by default;
> - fix transparent_hugepage_pagecache();
>
> Benchmarks
> ----------
>
> Since the patchset doesn't include mmap() support, we shouldn't expect
> much change in performance. We just need to check that we don't
> introduce any major regression.
>
> On average, read/write on ramfs with thp is a bit slower, but I don't
> think it's a stopper -- ramfs is a toy anyway; on real-world filesystems
> I expect the difference to be smaller.
>
> postmark
> ========
>
> workload1:
> chmod +x postmark
> mount -t ramfs none /mnt
> cat >/root/workload1 <<EOF
> set transactions 250000
> set size 5120 524288
> set number 500
> run
> quit
> EOF
>
> workload2:
> set transactions 10000
> set size 2097152 10485760
> set number 100
> run
> quit
>
> throughput (transactions/sec)
>
>            workload1  workload2
> baseline        8333        416
> patched         8333        454
>
> FS-Mark
> =======
>
> throughput (files/sec)
>
>           2000 files by 1M  200 files by 10M
> baseline            5326.1             548.1
> patched             5192.8             528.4
>
> tiobench
> ========
>
> baseline:
> Tiotest results for 16 concurrent io threads:
> ,----------------------------------------------------------------------.
> | Item | Time | Rate | Usr CPU | Sys CPU |
> +-----------------------+----------+--------------+----------+---------+
> | Write 2048 MBs | 0.2 s | 8667.792 MB/s | 445.2 % | 5535.9 % |
> | Random Write 62 MBs | 0.0 s | 8341.118 MB/s | 0.0 % | 2615.8 % |
> | Read 2048 MBs | 0.2 s | 11680.431 MB/s | 339.9 % | 5470.6 % |
> | Random Read 62 MBs | 0.0 s | 9451.081 MB/s | 786.3 % | 1451.7 % |
> `----------------------------------------------------------------------'
> Tiotest latency results:
> ,-------------------------------------------------------------------------.
> | Item | Average latency | Maximum latency | % >2 sec | % >10 sec |
> +--------------+-----------------+-----------------+----------+-----------+
> | Write | 0.006 ms | 28.019 ms | 0.00000 | 0.00000 |
> | Random Write | 0.002 ms | 5.574 ms | 0.00000 | 0.00000 |
> | Read | 0.005 ms | 28.018 ms | 0.00000 | 0.00000 |
> | Random Read | 0.002 ms | 4.852 ms | 0.00000 | 0.00000 |
> |--------------+-----------------+-----------------+----------+-----------|
> | Total | 0.005 ms | 28.019 ms | 0.00000 | 0.00000 |
> `--------------+-----------------+-----------------+----------+-----------'
>
> patched:
> Tiotest results for 16 concurrent io threads:
> ,----------------------------------------------------------------------.
> | Item | Time | Rate | Usr CPU | Sys CPU |
> +-----------------------+----------+--------------+----------+---------+
> | Write 2048 MBs | 0.3 s | 7942.818 MB/s | 442.1 % | 5533.6 % |
> | Random Write 62 MBs | 0.0 s | 9425.426 MB/s | 723.9 % | 965.2 % |
> | Read 2048 MBs | 0.2 s | 11998.008 MB/s | 374.9 % | 5485.8 % |
> | Random Read 62 MBs | 0.0 s | 9823.955 MB/s | 251.5 % | 2011.9 % |
> `----------------------------------------------------------------------'
> Tiotest latency results:
> ,-------------------------------------------------------------------------.
> | Item | Average latency | Maximum latency | % >2 sec | % >10 sec |
> +--------------+-----------------+-----------------+----------+-----------+
> | Write | 0.007 ms | 28.020 ms | 0.00000 | 0.00000 |
> | Random Write | 0.001 ms | 0.022 ms | 0.00000 | 0.00000 |
> | Read | 0.004 ms | 24.011 ms | 0.00000 | 0.00000 |
> | Random Read | 0.001 ms | 0.019 ms | 0.00000 | 0.00000 |
> |--------------+-----------------+-----------------+----------+-----------|
> | Total | 0.005 ms | 28.020 ms | 0.00000 | 0.00000 |
> `--------------+-----------------+-----------------+----------+-----------'
>
> IOZone
> ======
>
> Syscalls, not mmap.
>
> ** Initial writers **
> threads:                1        2        4        8       10       20       30       40       50       60       70       80
> baseline:         4741691  7986408  9149064  9898695  9868597  9629383  9469202 11605064  9507802 10641869 11360701 11040376
> patched:          4682864  7275535  8691034  8872887  8712492  8771912  8397216  7701346  7366853  8839736  8299893 10788439
> speed-up(times):     0.99     0.91     0.95     0.90     0.88     0.91     0.89     0.66     0.77     0.83     0.73     0.98
>
> ** Rewriters **
> threads:                1        2        4        8       10       20       30       40       50       60       70       80
> baseline:         5807891  9554869 12101083 13113533 12989751 14359910 16998236 16833861 24735659 17502634 17396706 20448655
> patched:          6161690  9981294 12285789 13428846 13610058 13669153 20060182 17328347 24109999 19247934 24225103 34686574
> speed-up(times):     1.06     1.04     1.02     1.02     1.05     0.95     1.18     1.03     0.97     1.10     1.39     1.70
>
> ** Readers **
> threads:                1        2        4        8       10       20       30       40       50       60       70       80
> baseline:         7978066 11825735 13808941 14049598 14765175 14422642 17322681 23209831 21386483 20060744 22032935 31166663
> patched:          7723293 11481500 13796383 14363808 14353966 14979865 17648225 18701258 29192810 23973723 22163317 23104638
> speed-up(times):     0.97     0.97     1.00     1.02     0.97     1.04     1.02     0.81     1.37     1.20     1.01     0.74
>
> ** Re-readers **
> threads:                1        2        4        8       10       20       30       40       50       60       70       80
> baseline:         7966269 11878323 14000782 14678206 14154235 14271991 15170829 20924052 27393344 19114990 12509316 18495597
> patched:          7719350 11410937 13710233 13232756 14040928 15895021 16279330 17256068 26023572 18364678 27834483 23288680
> speed-up(times):     0.97     0.96     0.98     0.90     0.99     1.11     1.07     0.82     0.95     0.96     2.23     1.26
>
> ** Reverse readers **
> threads:                1        2        4        8       10       20       30       40       50       60       70       80
> baseline:         6630795 10331013 12839501 13157433 12783323 13580283 15753068 15434572 21928982 17636994 14737489 19470679
> patched:          6502341  9887711 12639278 12979232 13212825 12928255 13961195 14695786 21370667 19873807 20902582 21892899
> speed-up(times):     0.98     0.96     0.98     0.99     1.03     0.95     0.89     0.95     0.97     1.13     1.42     1.12
>
> ** Random_readers **
> threads:                1        2        4        8       10       20       30       40       50       60       70       80
> baseline:         5152935  9043813 11752615 11996078 12283579 12484039 14588004 15781507 23847538 15748906 13698335 27195847
> patched:          5009089  8438137 11266015 11631218 12093650 12779308 17768691 13640378 30468890 19269033 23444358 22775908
> speed-up(times):     0.97     0.93     0.96     0.97     0.98     1.02     1.22     0.86     1.28     1.22     1.71     0.84
>
> ** Random_writers **
> threads:                1        2        4        8       10       20       30       40       50       60       70       80
> baseline:         3886268  7405345 10531192 10858984 10994693 12758450 10729531  9656825 10370144 13139452  4528331 12615812
> patched:          4335323  7916132 10978892 11423247 11790932 11424525 11798171 11413452 12230616 13075887 11165314 16925679
> speed-up(times):     1.12     1.07     1.04     1.05     1.07     0.90     1.10     1.18     1.18     1.00     2.47     1.34
>
> Kirill A. Shutemov (22):
> mm: implement zero_huge_user_segment and friends
> radix-tree: implement preload for multiple contiguous elements
> memcg, thp: charge huge cache pages
> thp: compile-time and sysfs knob for thp pagecache
> thp, mm: introduce mapping_can_have_hugepages() predicate
> thp: represent file thp pages in meminfo and friends
> thp, mm: rewrite add_to_page_cache_locked() to support huge pages
> mm: trace filemap: dump page order
> block: implement add_bdi_stat()
> thp, mm: rewrite delete_from_page_cache() to support huge pages
> thp, mm: warn if we try to use replace_page_cache_page() with THP
> thp, mm: add event counters for huge page alloc on file write or read
> mm, vfs: introduce i_split_sem
> thp, mm: allocate huge pages in grab_cache_page_write_begin()
> thp, mm: naive support of thp in generic_perform_write
> thp, mm: handle transhuge pages in do_generic_file_read()
> thp, libfs: initial thp support
> truncate: support huge pages
> thp: handle file pages in split_huge_page()
> thp: wait_split_huge_page(): serialize over i_mmap_mutex too
> thp, mm: split huge page on mmap file page
> ramfs: enable transparent huge page cache
>
> Documentation/vm/transhuge.txt | 16 ++++
> drivers/base/node.c | 4 +
> fs/inode.c | 3 +
> fs/libfs.c | 58 +++++++++++-
> fs/proc/meminfo.c | 3 +
> fs/ramfs/file-mmu.c | 2 +-
> fs/ramfs/inode.c | 6 +-
> include/linux/backing-dev.h | 10 +++
> include/linux/fs.h | 11 +++
> include/linux/huge_mm.h | 68 +++++++++++++-
> include/linux/mm.h | 18 ++++
> include/linux/mmzone.h | 1 +
> include/linux/page-flags.h | 13 +++
> include/linux/pagemap.h | 31 +++++++
> include/linux/radix-tree.h | 11 +++
> include/linux/vm_event_item.h | 4 +
> include/trace/events/filemap.h | 7 +-
> lib/radix-tree.c | 94 ++++++++++++++++++--
> mm/Kconfig | 11 +++
> mm/filemap.c | 196 ++++++++++++++++++++++++++++++++---------
> mm/huge_memory.c | 147 +++++++++++++++++++++++++++----
> mm/memcontrol.c | 3 +-
> mm/memory.c | 40 ++++++++-
> mm/truncate.c | 125 ++++++++++++++++++++------
> mm/vmstat.c | 5 ++
> 25 files changed, 779 insertions(+), 108 deletions(-)
>
> --
> 1.8.4.rc3
>
>