linux-mm.kvack.org archive mirror
From: Ning Qu <quning@google.com>
To: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Al Viro <viro@zeniv.linux.org.uk>,
	Hugh Dickins <hughd@google.com>,
	Wu Fengguang <fengguang.wu@intel.com>, Jan Kara <jack@suse.cz>,
	Mel Gorman <mgorman@suse.de>,
	linux-mm@kvack.org, Andi Kleen <ak@linux.intel.com>,
	Matthew Wilcox <willy@linux.intel.com>,
	"Kirill A. Shutemov" <kirill@shutemov.name>,
	Hillf Danton <dhillf@gmail.com>, Dave Hansen <dave@sr71.net>,
	Alexander Shishkin <alexander.shishkin@linux.intel.com>,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()
Date: Tue, 24 Sep 2013 17:12:33 -0700	[thread overview]
Message-ID: <CACz4_2drFs5LsM8mTFNOWGHAs0QbsNfHAhiBXJ7jM3qkGerd5w@mail.gmail.com> (raw)
In-Reply-To: <1379937950-8411-1-git-send-email-kirill.shutemov@linux.intel.com>


Hi, Kirill,

Seems you dropped one patch from v5. Is that intentional? Just wondering ...

  thp, mm: handle tail pages in page_cache_get_speculative()

Thanks!

Best wishes,
-- 
Ning Qu (曲宁) | Software Engineer | quning@google.com | +1-408-418-6066


On Mon, Sep 23, 2013 at 5:05 AM, Kirill A. Shutemov <kirill.shutemov@linux.intel.com> wrote:

> It brings thp support for ramfs, but without mmap() -- it will be posted
> separately.
>
> Please review and consider applying.
>
> Intro
> -----
>
> The goal of the project is to prepare kernel infrastructure to handle huge
> pages in the page cache.
>
> To prove that the proposed changes are functional, we enable the feature
> for the simplest file system -- ramfs. ramfs is not that useful by
> itself, but it's a good pilot project.
>
> Design overview
> ---------------
>
> Every huge page is represented in the page cache radix-tree by HPAGE_PMD_NR
> (512 on x86-64) entries. All entries point to the head page -- refcounting
> for tail pages is pretty expensive.
>
> Radix-tree manipulations are implemented in a batched way: we add and remove
> a whole huge page at once, under one tree_lock. To make this possible, we
> extended the radix-tree interface to pre-allocate enough memory to
> insert a number of *contiguous* elements (kudos to Matthew Wilcox).
>
> Huge pages can be added to the page cache in three ways:
>  - write(2) to file or page;
>  - read(2) from sparse file;
>  - fault sparse file.
>
> Potentially, one more way is collapsing small pages, but that's outside the
> initial implementation.
>
> For now we still write/read at most PAGE_CACHE_SIZE bytes at a time. There's
> some room for speed-up later.
>
> Since mmap() isn't targeted for this patchset, we just split huge page on
> page fault.
>
> To minimize memory overhead for small files, we avoid write-allocation in
> the first huge page area (2M on x86-64) of the file.
>
> truncate_inode_pages_range() drops a whole huge page at once if it's fully
> inside the range. If a huge page is only partly in the range, we zero out
> that part, exactly like we do for partial small pages.
>
> split_huge_page() for file pages works similarly to anon pages, but we
> walk mapping->i_mmap rather than anon_vma->rb_root. At the end we call
> truncate_inode_pages() to drop small pages beyond i_size, if any.
>
> inode->i_split_sem taken for read protects huge pages in the inode's page
> cache against splitting; we take it for write during splitting.
>
> Changes since v5
> ----------------
>  - change how a huge page is stored in the page cache: the head page for
>    all relevant indexes;
>  - introduce i_split_sem;
>  - do not create huge pages on write(2) into first hugepage area;
>  - compile-disabled by default;
>  - fix transparent_hugepage_pagecache();
>
> Benchmarks
> ----------
>
> Since the patchset doesn't include mmap() support, we shouldn't expect much
> change in performance. We just need to check that we don't introduce any
> major regression.
>
> On average, read/write on ramfs with thp is a bit slower, but I don't think
> it's a stopper -- ramfs is a toy anyway, and on real-world filesystems I
> expect the difference to be smaller.
>
> postmark
> ========
>
> workload1:
> chmod +x postmark
> mount -t ramfs none /mnt
> cat >/root/workload1 <<EOF
> set transactions 250000
> set size 5120 524288
> set number 500
> run
> quit
> EOF
>
> workload2:
> set transactions 10000
> set size 2097152 10485760
> set number 100
> run
> quit
>
> throughput (transactions/sec)
>                 workload1       workload2
> baseline        8333            416
> patched         8333            454
>
> FS-Mark
> =======
>
> throughput (files/sec)
>
>                 2000 files by 1M        200 files by 10M
> baseline        5326.1                  548.1
> patched         5192.8                  528.4
>
> tiobench
> ========
>
> baseline:
> Tiotest results for 16 concurrent io threads:
> ,----------------------------------------------------------------------.
> | Item                  | Time     | Rate         | Usr CPU  | Sys CPU |
> +-----------------------+----------+--------------+----------+---------+
> | Write        2048 MBs |    0.2 s | 8667.792 MB/s | 445.2 %  | 5535.9 % |
> | Random Write   62 MBs |    0.0 s | 8341.118 MB/s |   0.0 %  | 2615.8 % |
> | Read         2048 MBs |    0.2 s | 11680.431 MB/s | 339.9 %  | 5470.6 % |
> | Random Read    62 MBs |    0.0 s | 9451.081 MB/s | 786.3 %  | 1451.7 % |
> `----------------------------------------------------------------------'
> Tiotest latency results:
> ,-------------------------------------------------------------------------.
> | Item         | Average latency | Maximum latency | % >2 sec | % >10 sec |
> +--------------+-----------------+-----------------+----------+-----------+
> | Write        |        0.006 ms |       28.019 ms |  0.00000 |   0.00000 |
> | Random Write |        0.002 ms |        5.574 ms |  0.00000 |   0.00000 |
> | Read         |        0.005 ms |       28.018 ms |  0.00000 |   0.00000 |
> | Random Read  |        0.002 ms |        4.852 ms |  0.00000 |   0.00000 |
> |--------------+-----------------+-----------------+----------+-----------|
> | Total        |        0.005 ms |       28.019 ms |  0.00000 |   0.00000 |
> `--------------+-----------------+-----------------+----------+-----------'
>
> patched:
> Tiotest results for 16 concurrent io threads:
> ,----------------------------------------------------------------------.
> | Item                  | Time     | Rate         | Usr CPU  | Sys CPU |
> +-----------------------+----------+--------------+----------+---------+
> | Write        2048 MBs |    0.3 s | 7942.818 MB/s | 442.1 %  | 5533.6 % |
> | Random Write   62 MBs |    0.0 s | 9425.426 MB/s | 723.9 %  | 965.2 % |
> | Read         2048 MBs |    0.2 s | 11998.008 MB/s | 374.9 %  | 5485.8 % |
> | Random Read    62 MBs |    0.0 s | 9823.955 MB/s | 251.5 %  | 2011.9 % |
> `----------------------------------------------------------------------'
> Tiotest latency results:
> ,-------------------------------------------------------------------------.
> | Item         | Average latency | Maximum latency | % >2 sec | % >10 sec |
> +--------------+-----------------+-----------------+----------+-----------+
> | Write        |        0.007 ms |       28.020 ms |  0.00000 |   0.00000 |
> | Random Write |        0.001 ms |        0.022 ms |  0.00000 |   0.00000 |
> | Read         |        0.004 ms |       24.011 ms |  0.00000 |   0.00000 |
> | Random Read  |        0.001 ms |        0.019 ms |  0.00000 |   0.00000 |
> |--------------+-----------------+-----------------+----------+-----------|
> | Total        |        0.005 ms |       28.020 ms |  0.00000 |   0.00000 |
> `--------------+-----------------+-----------------+----------+-----------'
>
> IOZone
> ======
>
> Syscalls, not mmap.
>
> ** Initial writers **
> threads:               1        2        4        8       10       20       30       40       50       60       70       80
> baseline:        4741691  7986408  9149064  9898695  9868597  9629383  9469202 11605064  9507802 10641869 11360701 11040376
> patched:         4682864  7275535  8691034  8872887  8712492  8771912  8397216  7701346  7366853  8839736  8299893 10788439
> speed-up(times):    0.99     0.91     0.95     0.90     0.88     0.91     0.89     0.66     0.77     0.83     0.73     0.98
>
> ** Rewriters **
> threads:               1        2        4        8       10       20       30       40       50       60       70       80
> baseline:        5807891  9554869 12101083 13113533 12989751 14359910 16998236 16833861 24735659 17502634 17396706 20448655
> patched:         6161690  9981294 12285789 13428846 13610058 13669153 20060182 17328347 24109999 19247934 24225103 34686574
> speed-up(times):    1.06     1.04     1.02     1.02     1.05     0.95     1.18     1.03     0.97     1.10     1.39     1.70
>
> ** Readers **
> threads:               1        2        4        8       10       20       30       40       50       60       70       80
> baseline:        7978066 11825735 13808941 14049598 14765175 14422642 17322681 23209831 21386483 20060744 22032935 31166663
> patched:         7723293 11481500 13796383 14363808 14353966 14979865 17648225 18701258 29192810 23973723 22163317 23104638
> speed-up(times):    0.97     0.97     1.00     1.02     0.97     1.04     1.02     0.81     1.37     1.20     1.01     0.74
>
> ** Re-readers **
> threads:               1        2        4        8       10       20       30       40       50       60       70       80
> baseline:        7966269 11878323 14000782 14678206 14154235 14271991 15170829 20924052 27393344 19114990 12509316 18495597
> patched:         7719350 11410937 13710233 13232756 14040928 15895021 16279330 17256068 26023572 18364678 27834483 23288680
> speed-up(times):    0.97     0.96     0.98     0.90     0.99     1.11     1.07     0.82     0.95     0.96     2.23     1.26
>
> ** Reverse readers **
> threads:               1        2        4        8       10       20       30       40       50       60       70       80
> baseline:        6630795 10331013 12839501 13157433 12783323 13580283 15753068 15434572 21928982 17636994 14737489 19470679
> patched:         6502341  9887711 12639278 12979232 13212825 12928255 13961195 14695786 21370667 19873807 20902582 21892899
> speed-up(times):    0.98     0.96     0.98     0.99     1.03     0.95     0.89     0.95     0.97     1.13     1.42     1.12
>
> ** Random_readers **
> threads:               1        2        4        8       10       20       30       40       50       60       70       80
> baseline:        5152935  9043813 11752615 11996078 12283579 12484039 14588004 15781507 23847538 15748906 13698335 27195847
> patched:         5009089  8438137 11266015 11631218 12093650 12779308 17768691 13640378 30468890 19269033 23444358 22775908
> speed-up(times):    0.97     0.93     0.96     0.97     0.98     1.02     1.22     0.86     1.28     1.22     1.71     0.84
>
> ** Random_writers **
> threads:               1        2        4        8       10       20       30       40       50       60       70       80
> baseline:        3886268  7405345 10531192 10858984 10994693 12758450 10729531  9656825 10370144 13139452  4528331 12615812
> patched:         4335323  7916132 10978892 11423247 11790932 11424525 11798171 11413452 12230616 13075887 11165314 16925679
> speed-up(times):    1.12     1.07     1.04     1.05     1.07     0.90     1.10     1.18     1.18     1.00     2.47     1.34
>
> Kirill A. Shutemov (22):
>   mm: implement zero_huge_user_segment and friends
>   radix-tree: implement preload for multiple contiguous elements
>   memcg, thp: charge huge cache pages
>   thp: compile-time and sysfs knob for thp pagecache
>   thp, mm: introduce mapping_can_have_hugepages() predicate
>   thp: represent file thp pages in meminfo and friends
>   thp, mm: rewrite add_to_page_cache_locked() to support huge pages
>   mm: trace filemap: dump page order
>   block: implement add_bdi_stat()
>   thp, mm: rewrite delete_from_page_cache() to support huge pages
>   thp, mm: warn if we try to use replace_page_cache_page() with THP
>   thp, mm: add event counters for huge page alloc on file write or read
>   mm, vfs: introduce i_split_sem
>   thp, mm: allocate huge pages in grab_cache_page_write_begin()
>   thp, mm: naive support of thp in generic_perform_write
>   thp, mm: handle transhuge pages in do_generic_file_read()
>   thp, libfs: initial thp support
>   truncate: support huge pages
>   thp: handle file pages in split_huge_page()
>   thp: wait_split_huge_page(): serialize over i_mmap_mutex too
>   thp, mm: split huge page on mmap file page
>   ramfs: enable transparent huge page cache
>
>  Documentation/vm/transhuge.txt |  16 ++++
>  drivers/base/node.c            |   4 +
>  fs/inode.c                     |   3 +
>  fs/libfs.c                     |  58 +++++++++++-
>  fs/proc/meminfo.c              |   3 +
>  fs/ramfs/file-mmu.c            |   2 +-
>  fs/ramfs/inode.c               |   6 +-
>  include/linux/backing-dev.h    |  10 +++
>  include/linux/fs.h             |  11 +++
>  include/linux/huge_mm.h        |  68 +++++++++++++-
>  include/linux/mm.h             |  18 ++++
>  include/linux/mmzone.h         |   1 +
>  include/linux/page-flags.h     |  13 +++
>  include/linux/pagemap.h        |  31 +++++++
>  include/linux/radix-tree.h     |  11 +++
>  include/linux/vm_event_item.h  |   4 +
>  include/trace/events/filemap.h |   7 +-
>  lib/radix-tree.c               |  94 ++++++++++++++++++--
>  mm/Kconfig                     |  11 +++
>  mm/filemap.c                   | 196 ++++++++++++++++++++++++++++++++---------
>  mm/huge_memory.c               | 147 +++++++++++++++++++++++++++----
>  mm/memcontrol.c                |   3 +-
>  mm/memory.c                    |  40 ++++++++-
>  mm/truncate.c                  | 125 ++++++++++++++++++++------
>  mm/vmstat.c                    |   5 ++
>  25 files changed, 779 insertions(+), 108 deletions(-)
>
> --
> 1.8.4.rc3
>
>


Thread overview: 50+ messages
2013-09-23 12:05 Kirill A. Shutemov
2013-09-23 12:05 ` [PATCHv6 01/22] mm: implement zero_huge_user_segment and friends Kirill A. Shutemov
2013-09-23 12:05 ` [PATCHv6 02/22] radix-tree: implement preload for multiple contiguous elements Kirill A. Shutemov
2013-09-23 12:05 ` [PATCHv6 03/22] memcg, thp: charge huge cache pages Kirill A. Shutemov
2013-09-23 12:05 ` [PATCHv6 04/22] thp: compile-time and sysfs knob for thp pagecache Kirill A. Shutemov
2013-09-23 12:05 ` [PATCHv6 05/22] thp, mm: introduce mapping_can_have_hugepages() predicate Kirill A. Shutemov
2013-09-23 12:05 ` [PATCHv6 06/22] thp: represent file thp pages in meminfo and friends Kirill A. Shutemov
2013-09-23 12:05 ` [PATCHv6 07/22] thp, mm: rewrite add_to_page_cache_locked() to support huge pages Kirill A. Shutemov
2013-09-23 12:05 ` [PATCHv6 08/22] mm: trace filemap: dump page order Kirill A. Shutemov
2013-09-23 12:05 ` [PATCHv6 09/22] block: implement add_bdi_stat() Kirill A. Shutemov
2013-09-23 12:05 ` [PATCHv6 10/22] thp, mm: rewrite delete_from_page_cache() to support huge pages Kirill A. Shutemov
2013-09-25 20:02   ` Ning Qu
2013-09-23 12:05 ` [PATCHv6 11/22] thp, mm: warn if we try to use replace_page_cache_page() with THP Kirill A. Shutemov
2013-09-23 12:05 ` [PATCHv6 12/22] thp, mm: add event counters for huge page alloc on file write or read Kirill A. Shutemov
2013-09-23 12:05 ` [PATCHv6 13/22] mm, vfs: introduce i_split_sem Kirill A. Shutemov
2013-09-23 12:05 ` [PATCHv6 14/22] thp, mm: allocate huge pages in grab_cache_page_write_begin() Kirill A. Shutemov
2013-09-23 12:05 ` [PATCHv6 15/22] thp, mm: naive support of thp in generic_perform_write Kirill A. Shutemov
2013-09-23 12:05 ` [PATCHv6 16/22] thp, mm: handle transhuge pages in do_generic_file_read() Kirill A. Shutemov
2013-09-23 12:05 ` [PATCHv6 17/22] thp, libfs: initial thp support Kirill A. Shutemov
2013-09-23 12:05 ` [PATCHv6 18/22] truncate: support huge pages Kirill A. Shutemov
2013-09-23 12:05 ` [PATCHv6 19/22] thp: handle file pages in split_huge_page() Kirill A. Shutemov
2013-09-23 12:05 ` [PATCHv6 20/22] thp: wait_split_huge_page(): serialize over i_mmap_mutex too Kirill A. Shutemov
2013-09-23 12:05 ` [PATCHv6 21/22] thp, mm: split huge page on mmap file page Kirill A. Shutemov
2013-09-23 12:05 ` [PATCHv6 22/22] ramfs: enable transparent huge page cache Kirill A. Shutemov
2013-09-24 23:37 ` [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap() Andrew Morton
2013-09-24 23:48   ` Ning Qu
2013-09-24 23:49   ` Andi Kleen
2013-09-24 23:58     ` Andrew Morton
2013-09-25 11:15       ` Kirill A. Shutemov
2013-09-25 15:05         ` Andi Kleen
2013-09-26 18:30     ` Zach Brown
2013-09-26 19:05       ` Andi Kleen
2013-09-30 10:13     ` Mel Gorman
2013-09-30 16:05       ` Andi Kleen
2013-09-25  9:51   ` Kirill A. Shutemov
2013-09-25 23:29     ` Dave Chinner
2013-10-14 13:56       ` Kirill A. Shutemov
2013-09-30 10:02   ` Mel Gorman
2013-09-30 10:10     ` Mel Gorman
2013-09-30 18:07       ` Ning Qu
2013-09-30 18:51       ` Andi Kleen
2013-10-01  8:38         ` Mel Gorman
2013-10-01 17:11           ` Ning Qu
2013-10-14 14:27           ` Kirill A. Shutemov
2013-09-30 15:27     ` Dave Hansen
2013-09-30 18:05       ` Ning Qu
2013-09-25  0:12 ` Ning Qu [this message]
2013-09-25  9:23   ` Kirill A. Shutemov
2013-09-26 21:13 ` Dave Hansen
2013-09-25 18:11 Ning Qu
