* Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()
@ 2013-09-25 18:11 Ning Qu
0 siblings, 0 replies; 27+ messages in thread
From: Ning Qu @ 2013-09-25 18:11 UTC (permalink / raw)
To: Kirill A. Shutemov
Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, Dave Hansen,
Alexander Shishkin, linux-fsdevel, linux-kernel
Got you. Thanks!
Best wishes,
--
Ning Qu (曲宁) | Software Engineer | quning@google.com | +1-408-418-6066
On Wed, Sep 25, 2013 at 2:23 AM, Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
> Ning Qu wrote:
>> Hi, Kirill,
>>
>> Seems you dropped one patch in v5, is that intentional? Just wondering ...
>>
>> thp, mm: handle tail pages in page_cache_get_speculative()
>
> It's not needed anymore, since we don't have tail pages in radix tree.
>
> --
> Kirill A. Shutemov
>
* [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()
@ 2013-09-23 12:05 Kirill A. Shutemov
2013-09-24 23:37 ` Andrew Morton
` (2 more replies)
0 siblings, 3 replies; 27+ messages in thread
From: Kirill A. Shutemov @ 2013-09-23 12:05 UTC (permalink / raw)
To: Andrea Arcangeli, Andrew Morton
Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
Hillf Danton, Dave Hansen, Ning Qu, Alexander Shishkin,
linux-fsdevel, linux-kernel, Kirill A. Shutemov
It brings thp support for ramfs, but without mmap() -- it will be posted
separately.
Please review and consider applying.
Intro
-----
The goal of the project is preparing kernel infrastructure to handle huge
pages in page cache.
To prove that the proposed changes are functional we enable the feature
for the simplest file system -- ramfs. ramfs is not that useful by
itself, but it's a good pilot project.
Design overview
---------------
Every huge page is represented in the page cache radix tree by HPAGE_PMD_NR
(512 on x86-64) entries. All entries point to the head page -- refcounting for
tail pages is pretty expensive.
Radix-tree manipulations are implemented in a batched way: we add and remove
a whole huge page at once, under one tree_lock. To make this possible, we
extended the radix-tree interface to pre-allocate enough memory to insert a
number of *contiguous* elements (kudos to Matthew Wilcox).
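To illustrate the idea, here is a minimal sketch (not the patch code itself;
radix_tree_preload_contig() is a placeholder name for the extended preload
interface described above):

#include <linux/fs.h>
#include <linux/pagemap.h>
#include <linux/radix-tree.h>
#include <linux/huge_mm.h>

/*
 * Illustrative only: insert one huge page as HPAGE_PMD_NR radix-tree
 * entries, all pointing at the head page, under a single tree_lock.
 */
static int add_huge_page_to_cache(struct address_space *mapping,
                                  struct page *head, pgoff_t index, gfp_t gfp)
{
        int i, err;

        /* pre-allocate nodes for HPAGE_PMD_NR contiguous slots */
        err = radix_tree_preload_contig(HPAGE_PMD_NR, gfp & ~__GFP_HIGHMEM);
        if (err)
                return err;

        spin_lock_irq(&mapping->tree_lock);
        for (i = 0; i < HPAGE_PMD_NR; i++) {
                /* every slot points to the head page; tails are never in the tree */
                err = radix_tree_insert(&mapping->page_tree, index + i, head);
                if (err)
                        goto undo;
        }
        mapping->nrpages += HPAGE_PMD_NR;
        spin_unlock_irq(&mapping->tree_lock);
        radix_tree_preload_end();
        return 0;
undo:
        while (i--)
                radix_tree_delete(&mapping->page_tree, index + i);
        spin_unlock_irq(&mapping->tree_lock);
        radix_tree_preload_end();
        return err;
}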
Huge pages can be added to the page cache in three ways:
- write(2) to a file or page;
- read(2) from a sparse file;
- fault on a sparse file.
Potentially, one more way is collapsing small pages, but that's outside the
initial implementation.
For now we still write/read at most PAGE_CACHE_SIZE bytes at a time. There's
some room for speedup later.
Since mmap() isn't targeted by this patchset, we just split the huge page on
page fault.
To minimize memory overhead for small files, we avoid write-allocation in the
first huge page area (2M on x86-64) of the file.
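Roughly speaking, the write path only considers a huge page when something
like the check below holds (mapping_can_have_hugepages() comes from this
series; the helper itself is just an illustration):

/*
 * Illustrative only: skip THP allocation for writes landing in the
 * first huge page area of the file, so small files stay small.
 */
static inline bool use_hugepage_for_write(struct address_space *mapping,
                                          pgoff_t index)
{
        if (!mapping_can_have_hugepages(mapping))
                return false;
        /* indexes 0..HPAGE_PMD_NR-1 cover the first 2M on x86-64 */
        return index >= HPAGE_PMD_NR;
}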
truncate_inode_pages_range() drops a whole huge page at once if it's fully
inside the range. If a huge page is only partly in the range, we zero out
that part, exactly like we do for partial small pages.
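Schematically, for a single huge page overlapping the truncated range
(zero_huge_user_segment() is added by this series; the helper below and its
exact arguments are only an illustration):

/*
 * Illustrative only: handle one huge page while truncating the byte
 * range [lstart, lend) of a mapping.
 */
static void truncate_one_huge_page(struct page *head, loff_t lstart, loff_t lend)
{
        loff_t pos = page_offset(head);

        if (lstart <= pos && pos + HPAGE_PMD_SIZE <= lend) {
                /* fully inside the range: drop the whole huge page at once */
                delete_from_page_cache(head);
        } else {
                /* partly inside: zero only the affected part, exactly like
                 * partial small pages are handled */
                unsigned int start = max(lstart, pos) - pos;
                unsigned int end = min(lend, pos + (loff_t)HPAGE_PMD_SIZE) - pos;

                zero_huge_user_segment(head, start, end);
        }
}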
split_huge_page() for file pages works similarly to anon pages, but we
walk mapping->i_mmap rather than anon_vma->rb_root. At the end we call
truncate_inode_pages() to drop small pages beyond i_size, if any.
inode->i_split_sem taken for read protects huge pages in the inode's page
cache against splitting. We take it for write during splitting.
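Assuming i_split_sem ends up as a plain rw_semaphore in struct inode, the
protocol is roughly (function names below are illustrative, not the patch
code):

/* Paths that rely on the page staying huge take the semaphore for read. */
static void hugepage_user(struct inode *inode)
{
        down_read(&inode->i_split_sem);
        /* ... look up the (possibly huge) page and operate on it ... */
        up_read(&inode->i_split_sem);
}

/* split_huge_page() on a file page takes it for write, excluding all users. */
static void hugepage_splitter(struct inode *inode)
{
        down_write(&inode->i_split_sem);
        /* ... replace the compound page with small pages ... */
        up_write(&inode->i_split_sem);
}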
Changes since v5
----------------
- change how huge pages are stored in the page cache: the head page is used
for all relevant indexes;
- introduce i_split_sem;
- do not create huge pages on write(2) into the first huge page area;
- compile-time disabled by default;
- fix transparent_hugepage_pagecache();
Benchmarks
----------
Since the patchset doesn't include mmap() support, we shouldn't expect much
change in performance. We just need to check that we don't introduce any
major regression.
On average, read/write on ramfs with thp is a bit slower, but I don't think
it's a stopper -- ramfs is a toy anyway; on real-world filesystems I
expect the difference to be smaller.
postmark
========
workload1:
chmod +x postmark
mount -t ramfs none /mnt
cat >/root/workload1 <<EOF
set transactions 250000
set size 5120 524288
set number 500
run
quit
workload2:
set transactions 10000
set size 2097152 10485760
set number 100
run
quit
throughput (transactions/sec)
workload1 workload2
baseline 8333 416
patched 8333 454
FS-Mark
=======
throughput (files/sec)
2000 files by 1M 200 files by 10M
baseline 5326.1 548.1
patched 5192.8 528.4
tiobench
========
baseline:
Tiotest results for 16 concurrent io threads:
,----------------------------------------------------------------------.
| Item | Time | Rate | Usr CPU | Sys CPU |
+-----------------------+----------+--------------+----------+---------+
| Write 2048 MBs | 0.2 s | 8667.792 MB/s | 445.2 % | 5535.9 % |
| Random Write 62 MBs | 0.0 s | 8341.118 MB/s | 0.0 % | 2615.8 % |
| Read 2048 MBs | 0.2 s | 11680.431 MB/s | 339.9 % | 5470.6 % |
| Random Read 62 MBs | 0.0 s | 9451.081 MB/s | 786.3 % | 1451.7 % |
`----------------------------------------------------------------------'
Tiotest latency results:
,-------------------------------------------------------------------------.
| Item | Average latency | Maximum latency | % >2 sec | % >10 sec |
+--------------+-----------------+-----------------+----------+-----------+
| Write | 0.006 ms | 28.019 ms | 0.00000 | 0.00000 |
| Random Write | 0.002 ms | 5.574 ms | 0.00000 | 0.00000 |
| Read | 0.005 ms | 28.018 ms | 0.00000 | 0.00000 |
| Random Read | 0.002 ms | 4.852 ms | 0.00000 | 0.00000 |
|--------------+-----------------+-----------------+----------+-----------|
| Total | 0.005 ms | 28.019 ms | 0.00000 | 0.00000 |
`--------------+-----------------+-----------------+----------+-----------'
patched:
Tiotest results for 16 concurrent io threads:
,----------------------------------------------------------------------.
| Item | Time | Rate | Usr CPU | Sys CPU |
+-----------------------+----------+--------------+----------+---------+
| Write 2048 MBs | 0.3 s | 7942.818 MB/s | 442.1 % | 5533.6 % |
| Random Write 62 MBs | 0.0 s | 9425.426 MB/s | 723.9 % | 965.2 % |
| Read 2048 MBs | 0.2 s | 11998.008 MB/s | 374.9 % | 5485.8 % |
| Random Read 62 MBs | 0.0 s | 9823.955 MB/s | 251.5 % | 2011.9 % |
`----------------------------------------------------------------------'
Tiotest latency results:
,-------------------------------------------------------------------------.
| Item | Average latency | Maximum latency | % >2 sec | % >10 sec |
+--------------+-----------------+-----------------+----------+-----------+
| Write | 0.007 ms | 28.020 ms | 0.00000 | 0.00000 |
| Random Write | 0.001 ms | 0.022 ms | 0.00000 | 0.00000 |
| Read | 0.004 ms | 24.011 ms | 0.00000 | 0.00000 |
| Random Read | 0.001 ms | 0.019 ms | 0.00000 | 0.00000 |
|--------------+-----------------+-----------------+----------+-----------|
| Total | 0.005 ms | 28.020 ms | 0.00000 | 0.00000 |
`--------------+-----------------+-----------------+----------+-----------'
IOZone
======
Syscalls, not mmap.
** Initial writers **
threads: 1 2 4 8 10 20 30 40 50 60 70 80
baseline: 4741691 7986408 9149064 9898695 9868597 9629383 9469202 11605064 9507802 10641869 11360701 11040376
patched: 4682864 7275535 8691034 8872887 8712492 8771912 8397216 7701346 7366853 8839736 8299893 10788439
speed-up(times): 0.99 0.91 0.95 0.90 0.88 0.91 0.89 0.66 0.77 0.83 0.73 0.98
** Rewriters **
threads: 1 2 4 8 10 20 30 40 50 60 70 80
baseline: 5807891 9554869 12101083 13113533 12989751 14359910 16998236 16833861 24735659 17502634 17396706 20448655
patched: 6161690 9981294 12285789 13428846 13610058 13669153 20060182 17328347 24109999 19247934 24225103 34686574
speed-up(times): 1.06 1.04 1.02 1.02 1.05 0.95 1.18 1.03 0.97 1.10 1.39 1.70
** Readers **
threads: 1 2 4 8 10 20 30 40 50 60 70 80
baseline: 7978066 11825735 13808941 14049598 14765175 14422642 17322681 23209831 21386483 20060744 22032935 31166663
patched: 7723293 11481500 13796383 14363808 14353966 14979865 17648225 18701258 29192810 23973723 22163317 23104638
speed-up(times): 0.97 0.97 1.00 1.02 0.97 1.04 1.02 0.81 1.37 1.20 1.01 0.74
** Re-readers **
threads: 1 2 4 8 10 20 30 40 50 60 70 80
baseline: 7966269 11878323 14000782 14678206 14154235 14271991 15170829 20924052 27393344 19114990 12509316 18495597
patched: 7719350 11410937 13710233 13232756 14040928 15895021 16279330 17256068 26023572 18364678 27834483 23288680
speed-up(times): 0.97 0.96 0.98 0.90 0.99 1.11 1.07 0.82 0.95 0.96 2.23 1.26
** Reverse readers **
threads: 1 2 4 8 10 20 30 40 50 60 70 80
baseline: 6630795 10331013 12839501 13157433 12783323 13580283 15753068 15434572 21928982 17636994 14737489 19470679
patched: 6502341 9887711 12639278 12979232 13212825 12928255 13961195 14695786 21370667 19873807 20902582 21892899
speed-up(times): 0.98 0.96 0.98 0.99 1.03 0.95 0.89 0.95 0.97 1.13 1.42 1.12
** Random_readers **
threads: 1 2 4 8 10 20 30 40 50 60 70 80
baseline: 5152935 9043813 11752615 11996078 12283579 12484039 14588004 15781507 23847538 15748906 13698335 27195847
patched: 5009089 8438137 11266015 11631218 12093650 12779308 17768691 13640378 30468890 19269033 23444358 22775908
speed-up(times): 0.97 0.93 0.96 0.97 0.98 1.02 1.22 0.86 1.28 1.22 1.71 0.84
** Random_writers **
threads: 1 2 4 8 10 20 30 40 50 60 70 80
baseline: 3886268 7405345 10531192 10858984 10994693 12758450 10729531 9656825 10370144 13139452 4528331 12615812
patched: 4335323 7916132 10978892 11423247 11790932 11424525 11798171 11413452 12230616 13075887 11165314 16925679
speed-up(times): 1.12 1.07 1.04 1.05 1.07 0.90 1.10 1.18 1.18 1.00 2.47 1.34
Kirill A. Shutemov (22):
mm: implement zero_huge_user_segment and friends
radix-tree: implement preload for multiple contiguous elements
memcg, thp: charge huge cache pages
thp: compile-time and sysfs knob for thp pagecache
thp, mm: introduce mapping_can_have_hugepages() predicate
thp: represent file thp pages in meminfo and friends
thp, mm: rewrite add_to_page_cache_locked() to support huge pages
mm: trace filemap: dump page order
block: implement add_bdi_stat()
thp, mm: rewrite delete_from_page_cache() to support huge pages
thp, mm: warn if we try to use replace_page_cache_page() with THP
thp, mm: add event counters for huge page alloc on file write or read
mm, vfs: introduce i_split_sem
thp, mm: allocate huge pages in grab_cache_page_write_begin()
thp, mm: naive support of thp in generic_perform_write
thp, mm: handle transhuge pages in do_generic_file_read()
thp, libfs: initial thp support
truncate: support huge pages
thp: handle file pages in split_huge_page()
thp: wait_split_huge_page(): serialize over i_mmap_mutex too
thp, mm: split huge page on mmap file page
ramfs: enable transparent huge page cache
Documentation/vm/transhuge.txt | 16 ++++
drivers/base/node.c | 4 +
fs/inode.c | 3 +
fs/libfs.c | 58 +++++++++++-
fs/proc/meminfo.c | 3 +
fs/ramfs/file-mmu.c | 2 +-
fs/ramfs/inode.c | 6 +-
include/linux/backing-dev.h | 10 +++
include/linux/fs.h | 11 +++
include/linux/huge_mm.h | 68 +++++++++++++-
include/linux/mm.h | 18 ++++
include/linux/mmzone.h | 1 +
include/linux/page-flags.h | 13 +++
include/linux/pagemap.h | 31 +++++++
include/linux/radix-tree.h | 11 +++
include/linux/vm_event_item.h | 4 +
include/trace/events/filemap.h | 7 +-
lib/radix-tree.c | 94 ++++++++++++++++++--
mm/Kconfig | 11 +++
mm/filemap.c | 196 ++++++++++++++++++++++++++++++++---------
mm/huge_memory.c | 147 +++++++++++++++++++++++++++----
mm/memcontrol.c | 3 +-
mm/memory.c | 40 ++++++++-
mm/truncate.c | 125 ++++++++++++++++++++------
mm/vmstat.c | 5 ++
25 files changed, 779 insertions(+), 108 deletions(-)
--
1.8.4.rc3
* Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()
2013-09-23 12:05 Kirill A. Shutemov
@ 2013-09-24 23:37 ` Andrew Morton
2013-09-24 23:48 ` Ning Qu
` (3 more replies)
2013-09-25 0:12 ` Ning Qu
2013-09-26 21:13 ` Dave Hansen
2 siblings, 4 replies; 27+ messages in thread
From: Andrew Morton @ 2013-09-24 23:37 UTC (permalink / raw)
To: Kirill A. Shutemov
Cc: Andrea Arcangeli, Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara,
Mel Gorman, linux-mm, Andi Kleen, Matthew Wilcox,
Kirill A. Shutemov, Hillf Danton, Dave Hansen, Ning Qu,
Alexander Shishkin, linux-fsdevel, linux-kernel
On Mon, 23 Sep 2013 15:05:28 +0300 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
> It brings thp support for ramfs, but without mmap() -- it will be posted
> separately.
We were never going to do this :(
Has anyone reviewed these patches much yet?
> Please review and consider applying.
It appears rather too immature at this stage.
> Intro
> -----
>
> The goal of the project is preparing kernel infrastructure to handle huge
> pages in page cache.
>
> To proof that the proposed changes are functional we enable the feature
> for the most simple file system -- ramfs. ramfs is not that useful by
> itself, but it's good pilot project.
At the very least we should get this done for a real filesystem to see
how intrusive the changes are and to evaluate the performance changes.
Sigh. A pox on whoever thought up huge pages. Words cannot express
how much of a godawful mess they have made of Linux MM. And it hasn't
ended yet :( My take is that we'd need to see some very attractive and
convincing real-world performance numbers before even thinking of
taking this on.
* Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()
2013-09-24 23:37 ` Andrew Morton
@ 2013-09-24 23:48 ` Ning Qu
2013-09-24 23:49 ` Andi Kleen
` (2 subsequent siblings)
3 siblings, 0 replies; 27+ messages in thread
From: Ning Qu @ 2013-09-24 23:48 UTC (permalink / raw)
To: Andrew Morton
Cc: Kirill A. Shutemov, Andrea Arcangeli, Al Viro, Hugh Dickins,
Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, Dave Hansen,
Alexander Shishkin, linux-fsdevel, linux-kernel
I am working on the tmpfs side on top of this patchset, which I assume has
broader application usage than ramfs.
However, I have been working against 3.3 so far and will probably get my patches
ported to upstream pretty soon. I believe my patchset is also at an early stage,
but it does help to get some solid numbers in our own projects, which is very
convincing. However, I think it does depend on the characteristics of the
job .....
Best wishes,
--
Ning Qu (曲宁) | Software Engineer | quning@google.com | +1-408-418-6066
On Tue, Sep 24, 2013 at 4:37 PM, Andrew Morton <akpm@linux-foundation.org>wrote:
> On Mon, 23 Sep 2013 15:05:28 +0300 "Kirill A. Shutemov" <
> kirill.shutemov@linux.intel.com> wrote:
>
> > It brings thp support for ramfs, but without mmap() -- it will be posted
> > separately.
>
> We were never going to do this :(
>
> Has anyone reviewed these patches much yet?
>
> > Please review and consider applying.
>
> It appears rather too immature at this stage.
>
> > Intro
> > -----
> >
> > The goal of the project is preparing kernel infrastructure to handle huge
> > pages in page cache.
> >
> > To proof that the proposed changes are functional we enable the feature
> > for the most simple file system -- ramfs. ramfs is not that useful by
> > itself, but it's good pilot project.
>
> At the very least we should get this done for a real filesystem to see
> how intrusive the changes are and to evaluate the performance changes.
>
>
> Sigh. A pox on whoever thought up huge pages. Words cannot express
> how much of a godawful mess they have made of Linux MM. And it hasn't
> ended yet :( My take is that we'd need to see some very attractive and
> convincing real-world performance numbers before even thinking of
> taking this on.
>
>
>
>
* Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()
2013-09-24 23:37 ` Andrew Morton
2013-09-24 23:48 ` Ning Qu
@ 2013-09-24 23:49 ` Andi Kleen
2013-09-24 23:58 ` Andrew Morton
` (2 more replies)
2013-09-25 9:51 ` Kirill A. Shutemov
2013-09-30 10:02 ` Mel Gorman
3 siblings, 3 replies; 27+ messages in thread
From: Andi Kleen @ 2013-09-24 23:49 UTC (permalink / raw)
To: Andrew Morton
Cc: Kirill A. Shutemov, Andrea Arcangeli, Al Viro, Hugh Dickins,
Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Matthew Wilcox,
Kirill A. Shutemov, Hillf Danton, Dave Hansen, Ning Qu,
Alexander Shishkin, linux-fsdevel, linux-kernel
On Tue, Sep 24, 2013 at 04:37:40PM -0700, Andrew Morton wrote:
> On Mon, 23 Sep 2013 15:05:28 +0300 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
>
> > It brings thp support for ramfs, but without mmap() -- it will be posted
> > separately.
>
> We were never going to do this :(
>
> Has anyone reviewed these patches much yet?
There already was a lot of review by various people.
This is not the first post, just the latest refactoring.
> > Intro
> > -----
> >
> > The goal of the project is preparing kernel infrastructure to handle huge
> > pages in page cache.
> >
> > To proof that the proposed changes are functional we enable the feature
> > for the most simple file system -- ramfs. ramfs is not that useful by
> > itself, but it's good pilot project.
>
> At the very least we should get this done for a real filesystem to see
> how intrusive the changes are and to evaluate the performance changes.
That would give even larger patches, and people already complain
the patchkit is too large.
The only good way to handle this is baby steps, and you
have to start somewhere.
> Sigh. A pox on whoever thought up huge pages.
managing 1TB+ of memory in 4K chunks is just insane.
The question of larger pages is not "if", but only "when".
-Andi
--
ak@linux.intel.com -- Speaking for myself only
* Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()
2013-09-24 23:49 ` Andi Kleen
@ 2013-09-24 23:58 ` Andrew Morton
2013-09-25 11:15 ` Kirill A. Shutemov
2013-09-26 18:30 ` Zach Brown
2013-09-30 10:13 ` Mel Gorman
2 siblings, 1 reply; 27+ messages in thread
From: Andrew Morton @ 2013-09-24 23:58 UTC (permalink / raw)
To: Andi Kleen
Cc: Kirill A. Shutemov, Andrea Arcangeli, Al Viro, Hugh Dickins,
Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Matthew Wilcox,
Kirill A. Shutemov, Hillf Danton, Dave Hansen, Ning Qu,
Alexander Shishkin, linux-fsdevel, linux-kernel
On Tue, 24 Sep 2013 16:49:50 -0700 Andi Kleen <ak@linux.intel.com> wrote:
> > At the very least we should get this done for a real filesystem to see
> > how intrusive the changes are and to evaluate the performance changes.
>
> That would give even larger patches, and people already complain
> the patchkit is too large.
The thing is that merging an implementation for ramfs commits us to
doing it for the major real filesystems. Before making that commitment
we should at least have a pretty good understanding of what those
changes will look like.
Plus I don't see how we can realistically performance-test it without
having real physical backing store in the picture?
* Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()
2013-09-24 23:58 ` Andrew Morton
@ 2013-09-25 11:15 ` Kirill A. Shutemov
2013-09-25 15:05 ` Andi Kleen
0 siblings, 1 reply; 27+ messages in thread
From: Kirill A. Shutemov @ 2013-09-25 11:15 UTC (permalink / raw)
To: Andrew Morton
Cc: Andi Kleen, Kirill A. Shutemov, Andrea Arcangeli, Al Viro,
Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, Dave Hansen,
Ning Qu, Alexander Shishkin, linux-fsdevel, linux-kernel
Andrew Morton wrote:
> On Tue, 24 Sep 2013 16:49:50 -0700 Andi Kleen <ak@linux.intel.com> wrote:
>
> > > At the very least we should get this done for a real filesystem to see
> > > how intrusive the changes are and to evaluate the performance changes.
> >
> > That would give even larger patches, and people already complain
> > the patchkit is too large.
>
> The thing is that merging an implementation for ramfs commits us to
> doing it for the major real filesystems. Before making that commitment
> we should at least have a pretty good understanding of what those
> changes will look like.
>
> Plus I don't see how we can realistically performance-test it without
> having real physical backing store in the picture?
My plan for real filesystems is to first make this beneficial for read-mostly
files:
- allocate huge pages on read (or collapse small pages) only if nobody
has the inode opened for write;
- split the huge page on write, to avoid dealing with the writeback path at
first and to dirty only 4k pages.
This will get most ELF executables and libraries mapped with huge
pages (it may require a dynamic linker change to align lengths to a huge page
boundary), which is not bad for a start.
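For illustration only, the first point could boil down to a check along these
lines (an assumption about the eventual form, not code from the series):

/*
 * Illustrative only: allow huge page allocation/collapse on read just
 * while nobody has the inode open for write.
 */
static bool thp_read_alloc_allowed(struct inode *inode)
{
        /* <= 0 also covers the deny-write (ETXTBSY) case */
        return atomic_read(&inode->i_writecount) <= 0;
}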
--
Kirill A. Shutemov
* Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()
2013-09-25 11:15 ` Kirill A. Shutemov
@ 2013-09-25 15:05 ` Andi Kleen
0 siblings, 0 replies; 27+ messages in thread
From: Andi Kleen @ 2013-09-25 15:05 UTC (permalink / raw)
To: Kirill A. Shutemov
Cc: Andrew Morton, Andrea Arcangeli, Al Viro, Hugh Dickins,
Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Matthew Wilcox,
Kirill A. Shutemov, Hillf Danton, Dave Hansen, Ning Qu,
Alexander Shishkin, linux-fsdevel, linux-kernel
> (it may require dynamic linker change to align length to huge page
> boundary)
x86-64 binaries should already be padded for this.
-Andi
* Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()
2013-09-24 23:49 ` Andi Kleen
2013-09-24 23:58 ` Andrew Morton
@ 2013-09-26 18:30 ` Zach Brown
2013-09-26 19:05 ` Andi Kleen
2013-09-30 10:13 ` Mel Gorman
2 siblings, 1 reply; 27+ messages in thread
From: Zach Brown @ 2013-09-26 18:30 UTC (permalink / raw)
To: Andi Kleen
Cc: Andrew Morton, Kirill A. Shutemov, Andrea Arcangeli, Al Viro,
Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, Dave Hansen,
Ning Qu, Alexander Shishkin, linux-fsdevel, linux-kernel
> > Sigh. A pox on whoever thought up huge pages.
>
> managing 1TB+ of memory in 4K chunks is just insane.
> The question of larger pages is not "if", but only "when".
And "how"!
Sprinkling a bunch of magical if (thp) {} else {} throughout the code
looks like a stunningly bad idea to me. It'd take real work to
restructure the code such that the current paths are a degenerate case
of the larger thp page case, but that's the work that needs doing in my
estimation.
- z
* Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()
2013-09-26 18:30 ` Zach Brown
@ 2013-09-26 19:05 ` Andi Kleen
0 siblings, 0 replies; 27+ messages in thread
From: Andi Kleen @ 2013-09-26 19:05 UTC (permalink / raw)
To: Zach Brown
Cc: Andrew Morton, Kirill A. Shutemov, Andrea Arcangeli, Al Viro,
Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, Dave Hansen,
Ning Qu, Alexander Shishkin, linux-fsdevel, linux-kernel
On Thu, Sep 26, 2013 at 11:30:22AM -0700, Zach Brown wrote:
> > > Sigh. A pox on whoever thought up huge pages.
> >
> > managing 1TB+ of memory in 4K chunks is just insane.
> > The question of larger pages is not "if", but only "when".
>
> And "how"!
>
> Sprinking a bunch of magical if (thp) {} else {} throughtout the code
> looks like a stunningly bad idea to me. It'd take real work to
> restructure the code such that the current paths are a degenerate case
> of the larger thp page case, but that's the work that needs doing in my
> estimation.
Sorry, but that is how all large pages in the Linux VM work
(both THP and hugetlbfs).
Yes, it would be nice if small pages and large pages all ran
in a unified VM. But that's not how Linux is designed today.
Yes, having a pony would be nice too.
Back when huge pages were originally proposed, Linus came
up with the "separate hugetlbfs VM" design and that is what we're
stuck with today.
Asking for a wholesale VM redesign is just not realistic.
The VM always changes in baby steps. And the only
known way to do that is to have if (thp) and if (hugetlbfs).
-Andi
--
ak@linux.intel.com -- Speaking for myself only
* Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()
2013-09-24 23:49 ` Andi Kleen
2013-09-24 23:58 ` Andrew Morton
2013-09-26 18:30 ` Zach Brown
@ 2013-09-30 10:13 ` Mel Gorman
2013-09-30 16:05 ` Andi Kleen
2 siblings, 1 reply; 27+ messages in thread
From: Mel Gorman @ 2013-09-30 10:13 UTC (permalink / raw)
To: Andi Kleen
Cc: Andrew Morton, Kirill A. Shutemov, Andrea Arcangeli, Al Viro,
Hugh Dickins, Wu Fengguang, Jan Kara, linux-mm, Matthew Wilcox,
Kirill A. Shutemov, Hillf Danton, Dave Hansen, Ning Qu,
Alexander Shishkin, linux-fsdevel, linux-kernel
On Tue, Sep 24, 2013 at 04:49:50PM -0700, Andi Kleen wrote:
> > Sigh. A pox on whoever thought up huge pages.
>
> managing 1TB+ of memory in 4K chunks is just insane.
> The question of larger pages is not "if", but only "when".
>
Remember that there are at least two separate issues there. One is
handling data in larger granularities than a 4K page and the second is
the TLB, page table etc. handling. They are not necessarily the same problem.
--
Mel Gorman
SUSE Labs
* Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()
2013-09-30 10:13 ` Mel Gorman
@ 2013-09-30 16:05 ` Andi Kleen
0 siblings, 0 replies; 27+ messages in thread
From: Andi Kleen @ 2013-09-30 16:05 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Kirill A. Shutemov, Andrea Arcangeli, Al Viro,
Hugh Dickins, Wu Fengguang, Jan Kara, linux-mm, Matthew Wilcox,
Kirill A. Shutemov, Hillf Danton, Dave Hansen, Ning Qu,
Alexander Shishkin, linux-fsdevel, linux-kernel
On Mon, Sep 30, 2013 at 11:13:00AM +0100, Mel Gorman wrote:
> On Tue, Sep 24, 2013 at 04:49:50PM -0700, Andi Kleen wrote:
> > > Sigh. A pox on whoever thought up huge pages.
> >
> > managing 1TB+ of memory in 4K chunks is just insane.
> > The question of larger pages is not "if", but only "when".
> >
>
> Remember that there are at least two separate issues there. One is the
> handling data in larger granularities than a 4K page and the second is
> the TLB, pagetable etc handling. They are not necessarily the same problem.
It's the same problem in the end.
The hardware is struggling with 4K pages too (both i and d).
I expect longer-term TLB/page optimization to be far more important
than all this NUMA placement work that people spend so much
time on.
-Andi
--
ak@linux.intel.com -- Speaking for myself only
* Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()
2013-09-24 23:37 ` Andrew Morton
2013-09-24 23:48 ` Ning Qu
2013-09-24 23:49 ` Andi Kleen
@ 2013-09-25 9:51 ` Kirill A. Shutemov
2013-09-25 23:29 ` Dave Chinner
2013-09-30 10:02 ` Mel Gorman
3 siblings, 1 reply; 27+ messages in thread
From: Kirill A. Shutemov @ 2013-09-25 9:51 UTC (permalink / raw)
To: Andrew Morton
Cc: Kirill A. Shutemov, Andrea Arcangeli, Al Viro, Hugh Dickins,
Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, Dave Hansen,
Ning Qu, Alexander Shishkin, linux-fsdevel, linux-kernel
Andrew Morton wrote:
> On Mon, 23 Sep 2013 15:05:28 +0300 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
>
> > It brings thp support for ramfs, but without mmap() -- it will be posted
> > separately.
>
> We were never going to do this :(
>
> Has anyone reviewed these patches much yet?
Dave did a very good review. A few other people looked at individual patches;
see the Reviewed-by/Acked-by tags in the patches.
It looks like most mm experts are busy with numa balancing nowadays, so
it's hard to get more review.
The patchset was mostly ignored for a few rounds and Dave suggested splitting
it to get a less scary patch count.
> > Please review and consider applying.
>
> It appears rather too immature at this stage.
More review is always welcome and I'm committed to addressing issues.
--
Kirill A. Shutemov
* Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()
2013-09-25 9:51 ` Kirill A. Shutemov
@ 2013-09-25 23:29 ` Dave Chinner
2013-10-14 13:56 ` Kirill A. Shutemov
0 siblings, 1 reply; 27+ messages in thread
From: Dave Chinner @ 2013-09-25 23:29 UTC (permalink / raw)
To: Kirill A. Shutemov
Cc: Andrew Morton, Andrea Arcangeli, Al Viro, Hugh Dickins,
Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, Dave Hansen,
Ning Qu, Alexander Shishkin, linux-fsdevel, linux-kernel
On Wed, Sep 25, 2013 at 12:51:04PM +0300, Kirill A. Shutemov wrote:
> Andrew Morton wrote:
> > On Mon, 23 Sep 2013 15:05:28 +0300 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
> >
> > > It brings thp support for ramfs, but without mmap() -- it will be posted
> > > separately.
> >
> > We were never going to do this :(
> >
> > Has anyone reviewed these patches much yet?
>
> Dave did very good review. Few other people looked to separate patches.
> See Reviewed-by/Acked-by tags in patches.
>
> It looks like most mm experts are busy with numa balancing nowadays, so
> it's hard to get more review.
Nobody has reviewed it from the filesystem side, though.
The changes that require special code paths for huge pages in the
write_begin/write_end paths are nasty. You're adding conditional
code that depends on the page size and then having to add checks to
ensure that large page operations don't step over small page
boundaries and other such corner cases. It's an extremely fragile
design, IMO.
In general, I don't like all the if (thp) {} else {}; code that this
series introduces - they are code paths that simply won't get tested
with any sort of regularity and make the code more complex for those
that aren't using THP to understand and debug...
Then there is a new per-inode lock that is used in
generic_perform_write() which is held across page faults and calls
to filesystem block mapping callbacks. This inserts into the middle
of an existing locking chain that needs to be strictly ordered, and
as such will lead to the same type of lock inversion problems that
the mmap_sem had. We do not want to introduce a new lock that has
this same problem just as we are getting rid of that long standing
nastiness from the page fault path...
I also note that you didn't convert invalidate_inode_pages2_range()
to support huge pages which is needed by real filesystems that
support direct IO. There are other truncate/invalidate interfaces
that you didn't convert, either, and some of them will present you
with interesting locking challenges as a result of adding that new
lock...
> The patchset was mostly ignored for few rounds and Dave suggested to split
> to have less scary patch number.
It's still being ignored by filesystem people because you haven't
actually tried to implement support into a real filesystem.....
> > > Please review and consider applying.
> >
> > It appears rather too immature at this stage.
>
> More review is always welcome and I'm committed to address issues.
IMO, supporting a real block based filesystem like ext4 or XFS and
demonstrating that everything works is necessary before we go any
further...
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()
2013-09-25 23:29 ` Dave Chinner
@ 2013-10-14 13:56 ` Kirill A. Shutemov
0 siblings, 0 replies; 27+ messages in thread
From: Kirill A. Shutemov @ 2013-10-14 13:56 UTC (permalink / raw)
To: Dave Chinner
Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli, Al Viro,
Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, Hillf Danton,
Dave Hansen, Ning Qu, Alexander Shishkin, linux-fsdevel,
linux-kernel
Dave Chinner wrote:
> On Wed, Sep 25, 2013 at 12:51:04PM +0300, Kirill A. Shutemov wrote:
> > Andrew Morton wrote:
> > > On Mon, 23 Sep 2013 15:05:28 +0300 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
> > >
> > > > It brings thp support for ramfs, but without mmap() -- it will be posted
> > > > separately.
> > >
> > > We were never going to do this :(
> > >
> > > Has anyone reviewed these patches much yet?
> >
> > Dave did very good review. Few other people looked to separate patches.
> > See Reviewed-by/Acked-by tags in patches.
> >
> > It looks like most mm experts are busy with numa balancing nowadays, so
> > it's hard to get more review.
>
> Nobody has reviewed it from the filesystem side, though.
>
> The changes that require special code paths for huge pages in the
> write_begin/write_end paths are nasty. You're adding conditional
> code that depends on the page size and then having to add checks to
> ensure that large page operations don't step over small page
> boundaries and other such corner cases. It's an extremely fragile
> design, IMO.
>
> In general, I don't like all the if (thp) {} else {}; code that this
> series introduces - they are code paths that simply won't get tested
> with any sort of regularity and make the code more complex for those
> that aren't using THP to understand and debug...
Okay, I'll try to get rid of special cases where it's possible.
> Then there is a new per-inode lock that is used in
> generic_perform_write() which is held across page faults and calls
> to filesystem block mapping callbacks. This inserts into the middle
> of an existing locking chain that needs to be strictly ordered, and
> as such will lead to the same type of lock inversion problems that
> the mmap_sem had. We do not want to introduce a new lock that has
> this same problem just as we are getting rid of that long standing
> nastiness from the page fault path...
I don't see how we can protect against splitting with existing locks,
but I'll try to find a way.
> I also note that you didn't convert invalidate_inode_pages2_range()
> to support huge pages which is needed by real filesystems that
> support direct IO. There are other truncate/invalidate interfaces
> that you didn't convert, either, and some of them will present you
> with interesting locking challenges as a result of adding that new
> lock...
Thanks. I'll take a look at these code paths.
> > The patchset was mostly ignored for few rounds and Dave suggested to split
> > to have less scary patch number.
>
> It's still being ignored by filesystem people because you haven't
> actually tried to implement support into a real filesystem.....
If it supported a real filesystem, wouldn't it be ignored due to the
patch count? ;)
> > > > Please review and consider applying.
> > >
> > > It appears rather too immature at this stage.
> >
> > More review is always welcome and I'm committed to address issues.
>
> IMO, supporting a real block based filesystem like ext4 or XFS and
> demonstrating that everything works is necessary before we go any
> further...
Will see what numbers I can bring in the next iterations.
Thanks for your feedback, and sorry for the late answer.
--
Kirill A. Shutemov
* Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()
2013-09-24 23:37 ` Andrew Morton
` (2 preceding siblings ...)
2013-09-25 9:51 ` Kirill A. Shutemov
@ 2013-09-30 10:02 ` Mel Gorman
2013-09-30 10:10 ` Mel Gorman
2013-09-30 15:27 ` Dave Hansen
3 siblings, 2 replies; 27+ messages in thread
From: Mel Gorman @ 2013-09-30 10:02 UTC (permalink / raw)
To: Andrew Morton
Cc: Kirill A. Shutemov, Andrea Arcangeli, Al Viro, Hugh Dickins,
Wu Fengguang, Jan Kara, linux-mm, Andi Kleen, Matthew Wilcox,
Kirill A. Shutemov, Hillf Danton, Dave Hansen, Ning Qu,
Alexander Shishkin, linux-fsdevel, linux-kernel
On Tue, Sep 24, 2013 at 04:37:40PM -0700, Andrew Morton wrote:
> On Mon, 23 Sep 2013 15:05:28 +0300 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
>
> > It brings thp support for ramfs, but without mmap() -- it will be posted
> > separately.
>
> We were never going to do this :(
>
> Has anyone reviewed these patches much yet?
>
I am afraid I never looked too closely once I learned that the primary
motivation for this was relieving iTLB pressure in a very specific
case. AFAIK, this is not a problem in the vast majority of modern CPUs
and I found it very hard to be motivated to review the series as a result.
I suspected that in many cases the cost of IO would continue to dominate
performance instead of TLB pressure. I also found it unlikely that there
was a workload that was tmpfs based that used enough memory to be hurt
by TLB pressure. My feedback was that a much more compelling case for the
series was needed but this discussion all happened on IRC unfortunately.
--
Mel Gorman
SUSE Labs
* Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()
2013-09-30 10:02 ` Mel Gorman
@ 2013-09-30 10:10 ` Mel Gorman
2013-09-30 18:07 ` Ning Qu
2013-09-30 18:51 ` Andi Kleen
2013-09-30 15:27 ` Dave Hansen
1 sibling, 2 replies; 27+ messages in thread
From: Mel Gorman @ 2013-09-30 10:10 UTC (permalink / raw)
To: Andrew Morton
Cc: Kirill A. Shutemov, Andrea Arcangeli, Al Viro, Hugh Dickins,
Wu Fengguang, Jan Kara, linux-mm, Andi Kleen, Matthew Wilcox,
Kirill A. Shutemov, Hillf Danton, Dave Hansen, Ning Qu,
Alexander Shishkin, linux-fsdevel, linux-kernel
On Mon, Sep 30, 2013 at 11:02:49AM +0100, Mel Gorman wrote:
> On Tue, Sep 24, 2013 at 04:37:40PM -0700, Andrew Morton wrote:
> > On Mon, 23 Sep 2013 15:05:28 +0300 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
> >
> > > It brings thp support for ramfs, but without mmap() -- it will be posted
> > > separately.
> >
> > We were never going to do this :(
> >
> > Has anyone reviewed these patches much yet?
> >
>
> I am afraid I never looked too closely once I learned that the primary
> motivation for this was relieving iTLB pressure in a very specific
> case. AFAIK, this is not a problem in the vast majority of modern CPUs
> and I found it very hard to be motivated to review the series as a result.
> I suspected that in many cases that the cost of IO would continue to dominate
> performance instead of TLB pressure. I also found it unlikely that there
> was a workload that was tmpfs based that used enough memory to be hurt
> by TLB pressure. My feedback was that a much more compelling case for the
> series was needed but this discussion all happened on IRC unfortunately.
>
Oh, one last thing I forgot. While tmpfs-based workloads were not likely to
benefit, I would expect that sysV shared memory workloads would potentially
benefit from this. hugetlbfs is still required for shared memory areas,
but that is not a problem addressed by this series.
--
Mel Gorman
SUSE Labs
* Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()
2013-09-30 10:10 ` Mel Gorman
@ 2013-09-30 18:07 ` Ning Qu
2013-09-30 18:51 ` Andi Kleen
1 sibling, 0 replies; 27+ messages in thread
From: Ning Qu @ 2013-09-30 18:07 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Kirill A. Shutemov, Andrea Arcangeli, Al Viro,
Hugh Dickins, Wu Fengguang, Jan Kara, linux-mm, Andi Kleen,
Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, Dave Hansen,
Alexander Shishkin, linux-fsdevel, linux-kernel
I suppose sysv shm and tmpfs share the same code base now, so both of
them will benefit from the thp page cache?
Also, Kirill's previous patchset (up to v4) contained mmap support
as well. I suppose the patchset got split into smaller pieces so
it's easier to review ....
Best wishes,
--
Ning Qu (曲宁) | Software Engineer | quning@google.com | +1-408-418-6066
On Mon, Sep 30, 2013 at 3:10 AM, Mel Gorman <mgorman@suse.de> wrote:
> On Mon, Sep 30, 2013 at 11:02:49AM +0100, Mel Gorman wrote:
>> On Tue, Sep 24, 2013 at 04:37:40PM -0700, Andrew Morton wrote:
>> > On Mon, 23 Sep 2013 15:05:28 +0300 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
>> >
>> > > It brings thp support for ramfs, but without mmap() -- it will be posted
>> > > separately.
>> >
>> > We were never going to do this :(
>> >
>> > Has anyone reviewed these patches much yet?
>> >
>>
>> I am afraid I never looked too closely once I learned that the primary
>> motivation for this was relieving iTLB pressure in a very specific
>> case. AFAIK, this is not a problem in the vast majority of modern CPUs
>> and I found it very hard to be motivated to review the series as a result.
>> I suspected that in many cases that the cost of IO would continue to dominate
>> performance instead of TLB pressure. I also found it unlikely that there
>> was a workload that was tmpfs based that used enough memory to be hurt
>> by TLB pressure. My feedback was that a much more compelling case for the
>> series was needed but this discussion all happened on IRC unfortunately.
>>
>
> Oh, one last thing I forgot. While tmpfs-based workloads were not likely to
> benefit I would expect that sysV shared memory workloads would potentially
> benefit from this. hugetlbfs is still required for shared memory areas
> but it is not a problem that is addressed by this series.
>
> --
> Mel Gorman
> SUSE Labs
* Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()
2013-09-30 10:10 ` Mel Gorman
2013-09-30 18:07 ` Ning Qu
@ 2013-09-30 18:51 ` Andi Kleen
2013-10-01 8:38 ` Mel Gorman
1 sibling, 1 reply; 27+ messages in thread
From: Andi Kleen @ 2013-09-30 18:51 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Kirill A. Shutemov, Andrea Arcangeli, Al Viro,
Hugh Dickins, Wu Fengguang, Jan Kara, linux-mm, Matthew Wilcox,
Kirill A. Shutemov, Hillf Danton, Dave Hansen, Ning Qu,
Alexander Shishkin, linux-fsdevel, linux-kernel
> AFAIK, this is not a problem in the vast majority of modern CPUs
Let's do some simple math: e.g. a Sandy Bridge system has 512 4K iTLB L2 entries.
That's around 2MB of reach (512 * 4K). There's more and more code whose
footprint exceeds that.
Besides, the iTLB is not the only target. It is also useful for
data, of course.
> > and I found it very hard to be motivated to review the series as a result.
> > I suspected that in many cases that the cost of IO would continue to dominate
> > performance instead of TLB pressure
The trend is to larger and larger memories, keeping things in memory.
In fact there's a good argument that memory sizes are growing faster
than TLB capacities. And without large TLBs we're even further off
the curve.
> Oh, one last thing I forgot. While tmpfs-based workloads were not likely to
> benefit I would expect that sysV shared memory workloads would potentially
> benefit from this. hugetlbfs is still required for shared memory areas
> but it is not a problem that is addressed by this series.
Of course it's only the first step. But if no one does the baby steps,
then the other usages will never materialize, either.
I expect that once ramfs works, extending it to tmpfs etc. should be
straightforward.
-Andi
--
ak@linux.intel.com -- Speaking for myself only
* Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()
2013-09-30 18:51 ` Andi Kleen
@ 2013-10-01 8:38 ` Mel Gorman
2013-10-01 17:11 ` Ning Qu
2013-10-14 14:27 ` Kirill A. Shutemov
0 siblings, 2 replies; 27+ messages in thread
From: Mel Gorman @ 2013-10-01 8:38 UTC (permalink / raw)
To: Andi Kleen
Cc: Andrew Morton, Kirill A. Shutemov, Andrea Arcangeli, Al Viro,
Hugh Dickins, Wu Fengguang, Jan Kara, linux-mm, Matthew Wilcox,
Kirill A. Shutemov, Hillf Danton, Dave Hansen, Ning Qu,
Alexander Shishkin, linux-fsdevel, linux-kernel
On Mon, Sep 30, 2013 at 11:51:06AM -0700, Andi Kleen wrote:
> > AFAIK, this is not a problem in the vast majority of modern CPUs
>
> Let's do some simple math: e.g. a Sandy Bridge system has 512 4K iTLB L2 entries.
> That's around 2MB. There's more and more code whose footprint exceeds
> that.
>
With an expectation that it is read-mostly data, replicated between the
caches accessing it and TLB refills taking very little time. This is not
universally true and there are exceptions but even recent papers on TLB
behaviour have tended to dismiss the iTLB refill overhead as a negligible
portion of the overall workload of interest.
> Besides iTLB is not the only target. It is also useful for
> data of course.
>
True, but how useful? I have not seen an example of a workload showing that
dTLB pressure on file-backed data was a major component of the workload. I
would expect that sysV shared memory is an exception but does that require
generic support for all filesystems or can tmpfs be special cased when
it's used for shared memory?
For normal data, if it's read-only data then there would be some benefit to
using huge pages once the data is in page cache. How common are workloads
that mmap() large amounts of read-only data? Possibly some databases
depending on the workload although there I would expect that the data is
placed in shared memory.
If the mmap()s data is being written then the cost of IO is likely to
dominate, not TLB pressure. For write-mostly workloads there are greater
concerns because dirty tracking can only be done at the huge page boundary
potentially leading to greater amounts of IO and degraded performance
overall.
I could be completely wrong here but these were the concerns I had when
I first glanced through the patches. The changelogs had no information
to convince me otherwise so I never dedicated the time to reviewing the
patches in detail. I raised my concerns and then dropped it.
> > > and I found it very hard to be motivated to review the series as a result.
> > > I suspected that in many cases that the cost of IO would continue to dominate
> > > performance instead of TLB pressure
>
> The trend is to larger and larger memories, keeping things in memory.
>
Yes, but using huge pages is not *necessarily* the answer. For fault
scalability it probably would be a lot easier to batch handle faults if
readahead indicates accesses are sequential. Background zeroing of pages
could be revisited for fault-intensive workloads. A potential alternative
is that a contiguous page is allocated, zeroed as one lump, split into small
pages and put onto a local per-task list, although the details get messy. Reclaim
scanning could be heavily modified to use collections of pages instead of
single pages (although I'm not aware of the proper design of such a thing).
Again, this could be completely off the mark but if it was me that was
working on this problem, I would have some profile data from some workloads
to make sure the part I'm optimising was a noticeable percentage of the
workload and included that in the patch leader. I would hope that the data
was compelling enough to convince reviewers to pay close attention to the
series as the complexity would then be justified. Based on how complex THP
was for anonymous pages, I would be tempted to treat THP for file-backed
data as a last resort.
> In fact there's a good argument that memory sizes are growing faster
> than TLB capacities. And without large TLBs we're even further off
> the curve.
>
I'll admit this is also true. It was considered to be true in the 90's
when huge pages were first being thrown around as a possible solution to
the problem. One paper recently suggested using segmentation for large
memory segments but the workloads they examined looked like they would
be dominated by anonymous access, not file-backed data with one exception
where the workload frequently accessed compile-time constants.
--
Mel Gorman
SUSE Labs
* Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()
2013-10-01 8:38 ` Mel Gorman
@ 2013-10-01 17:11 ` Ning Qu
2013-10-14 14:27 ` Kirill A. Shutemov
1 sibling, 0 replies; 27+ messages in thread
From: Ning Qu @ 2013-10-01 17:11 UTC (permalink / raw)
To: Mel Gorman
Cc: Andi Kleen, Andrew Morton, Kirill A. Shutemov, Andrea Arcangeli,
Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, linux-mm,
Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, Dave Hansen,
Alexander Shishkin, linux-fsdevel, linux-kernel
I can throw in some numbers for one of the test cases I am working on.
One of the workloads uses sysv shm to load GB-scale files into
memory, which is shared with other worker processes for the long term. We
could load as many files as fit in the physical memory available.
Also, the heap is pretty big (GB scale as well) to handle that
data.
For the workload I just mentioned, with thp we see about an 8%
performance improvement: 5% from thp anonymous memory and 3% from the thp
page cache. It might not look like much, but it's pretty good without
changing one line of code in the application, which is the beauty of thp.
Before that, we had been using hugetlbfs, and then we had to reserve a
huge amount of memory at boot time, whether that memory would be
used or not. It worked, but no other major services could ever
share the server resources anymore.
Best wishes,
--
Ning Qu (曲宁) | Software Engineer | quning@google.com | +1-408-418-6066
On Tue, Oct 1, 2013 at 1:38 AM, Mel Gorman <mgorman@suse.de> wrote:
> On Mon, Sep 30, 2013 at 11:51:06AM -0700, Andi Kleen wrote:
>> > AFAIK, this is not a problem in the vast majority of modern CPUs
>>
>> Let's do some simple math: e.g. a Sandy Bridge system has 512 4K iTLB L2 entries.
>> That's around 2MB. There's more and more code whose footprint exceeds
>> that.
>>
>
> With an expectation that it is read-mostly data, replicated between the
> caches accessing it and TLB refills taking very little time. This is not
> universally true and there are exceptions but even recent papers on TLB
> behaviour have tended to dismiss the iTLB refill overhead as a negligible
> portion of the overall workload of interest.
>
>> Besides iTLB is not the only target. It is also useful for
>> data of course.
>>
>
> True, but how useful? I have not seen an example of a workload showing that
> dTLB pressure on file-backed data was a major component of the workload. I
> would expect that sysV shared memory is an exception but does that require
> generic support for all filesystems or can tmpfs be special cased when
> it's used for shared memory?
>
> For normal data, if it's read-only data then there would be some benefit to
> using huge pages once the data is in page cache. How common are workloads
> that mmap() large amounts of read-only data? Possibly some databases
> depending on the workload although there I would expect that the data is
> placed in shared memory.
>
> If the mmap()s data is being written then the cost of IO is likely to
> dominate, not TLB pressure. For write-mostly workloads there are greater
> concerns because dirty tracking can only be done at the huge page boundary
> potentially leading to greater amounts of IO and degraded performance
> overall.
>
> I could be completely wrong here but these were the concerns I had when
> I first glanced through the patches. The changelogs had no information
> to convince me otherwise so I never dedicated the time to reviewing the
> patches in detail. I raised my concerns and then dropped it.
>
>> > > and I found it very hard to be motivated to review the series as a result.
>> > > I suspected that in many cases that the cost of IO would continue to dominate
>> > > performance instead of TLB pressure
>>
>> The trend is to larger and larger memories, keeping things in memory.
>>
>
> Yes, but using huge pages is not *necessarily* the answer. For fault
> scalability it probably would be a lot easier to batch handle faults if
> readahead indicates accesses are sequential. Background zeroing of pages
> could be revisited for fault-intensive workloads. A potential alternative
> is to allocate a contiguous page, zero it as one lump, then split it and put
> the pages onto a local per-task list, although the details get messy. Reclaim
> scanning could be heavily modified to use collections of pages instead of
> single pages (although I'm not aware of the proper design of such a thing).
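>
> A rough sketch of the allocate-zero-split idea above (illustrative only,
> not code from this series, just the usual kernel helpers):
>
>     #include <linux/errno.h>
>     #include <linux/gfp.h>
>     #include <linux/highmem.h>
>     #include <linux/list.h>
>     #include <linux/mm.h>
>
>     static int prefill_local_pages(struct list_head *local, unsigned int order)
>     {
>             struct page *page = alloc_pages(GFP_HIGHUSER_MOVABLE, order);
>             int i;
>
>             if (!page)
>                     return -ENOMEM;
>
>             /* Zero the whole run while it is still one contiguous block. */
>             for (i = 0; i < (1 << order); i++)
>                     clear_highpage(page + i);
>
>             /* Turn it into 2^order independent base pages... */
>             split_page(page, order);
>
>             /* ...and park them on a per-task reserve for later faults. */
>             for (i = 0; i < (1 << order); i++)
>                     list_add(&page[i].lru, local);
>
>             return 0;
>     }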
>
> Again, this could be completely off the mark but if it was me that was
> working on this problem, I would have some profile data from some workloads
> to make sure the part I'm optimising was a noticeable percentage of the
> workload and included that in the patch leader. I would hope that the data
> was compelling enough to convince reviewers to pay close attention to the
> series as the complexity would then be justified. Based on how complex THP
> was for anonymous pages, I would be tempted to treat THP for file-backed
> data as a last resort.
>
>> In fact there's a good argument that memory sizes are growing faster
>> than TLB capacities. And without large TLBs we're even further off
>> the curve.
>>
>
> I'll admit this is also true. It was considered to be true in the 90's
> when huge pages were first being thrown around as a possible solution to
> the problem. One paper recently suggested using segmentation for large
> memory segments but the workloads they examined looked like they would
> be dominated by anonymous access, not file-backed data with one exception
> where the workload frequently accessed compile-time constants.
>
> --
> Mel Gorman
> SUSE Labs
* Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()
2013-10-01 8:38 ` Mel Gorman
2013-10-01 17:11 ` Ning Qu
@ 2013-10-14 14:27 ` Kirill A. Shutemov
1 sibling, 0 replies; 27+ messages in thread
From: Kirill A. Shutemov @ 2013-10-14 14:27 UTC (permalink / raw)
To: Mel Gorman
Cc: Andi Kleen, Andrew Morton, Kirill A. Shutemov, Andrea Arcangeli,
Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, linux-mm,
Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, Dave Hansen,
Ning Qu, Alexander Shishkin, linux-fsdevel, linux-kernel
Mel Gorman wrote:
> I could be completely wrong here but these were the concerns I had when
> I first glanced through the patches. The changelogs had no information
> to convince me otherwise so I never dedicated the time to reviewing the
> patches in detail. I raised my concerns and then dropped it.
Okay. I got your point: more data from real-world workloads. I'll try to
bring some in the next iteration.
--
Kirill A. Shutemov
* Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()
2013-09-30 10:02 ` Mel Gorman
2013-09-30 10:10 ` Mel Gorman
@ 2013-09-30 15:27 ` Dave Hansen
2013-09-30 18:05 ` Ning Qu
1 sibling, 1 reply; 27+ messages in thread
From: Dave Hansen @ 2013-09-30 15:27 UTC (permalink / raw)
To: Mel Gorman, Andrew Morton
Cc: Kirill A. Shutemov, Andrea Arcangeli, Al Viro, Hugh Dickins,
Wu Fengguang, Jan Kara, linux-mm, Andi Kleen, Matthew Wilcox,
Kirill A. Shutemov, Hillf Danton, Ning Qu, Alexander Shishkin,
linux-fsdevel, linux-kernel
On 09/30/2013 03:02 AM, Mel Gorman wrote:
> I am afraid I never looked too closely once I learned that the primary
> motivation for this was relieving iTLB pressure in a very specific
> case. AFAIK, this is not a problem in the vast majority of modern CPUs
> and I found it very hard to be motivated to review the series as a result.
> I suspected that in many cases that the cost of IO would continue to dominate
> performance instead of TLB pressure. I also found it unlikely that there
> was a workload that was tmpfs based that used enough memory to be hurt
> by TLB pressure. My feedback was that a much more compelling case for the
> series was needed but this discussion all happened on IRC unfortunately.
FWIW, I'm mostly intrigued by the possibilities of how this can speed up
_software_, and I'm rather uninterested in what it can do for the TLB.
Page cache is particularly painful today, precisely because hugetlbfs
and anonymous-thp aren't available there. If you have an app with
hundreds of GB of files that it wants to mmap(), even if it's in the
page cache, it takes _minutes_ to just fault in. One example:
https://lkml.org/lkml/2013/6/27/698
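(Back-of-the-envelope, not numbers from that report: 200GB of mapped
file data is roughly 50 million 4k faults; at even a couple of
microseconds per minor fault that is on the order of a couple of
minutes of wall-clock time before any IO, and 2M mappings would cut
the fault count by 512x.)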
* Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()
2013-09-30 15:27 ` Dave Hansen
@ 2013-09-30 18:05 ` Ning Qu
0 siblings, 0 replies; 27+ messages in thread
From: Ning Qu @ 2013-09-30 18:05 UTC (permalink / raw)
To: Dave Hansen
Cc: Mel Gorman, Andrew Morton, Kirill A. Shutemov, Andrea Arcangeli,
Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, linux-mm,
Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, Hillf Danton,
Alexander Shishkin, linux-fsdevel, linux-kernel
Yes, I agree. In our case we have tens of GB of files, and thp in the
page cache does improve the numbers as expected.
And compared to hugetlbfs (static huge pages), it's more flexible and
beneficial to the system as a whole.
Best wishes,
--
Ning Qu (曲宁) | Software Engineer | quning@google.com | +1-408-418-6066
On Mon, Sep 30, 2013 at 8:27 AM, Dave Hansen <dave@sr71.net> wrote:
> On 09/30/2013 03:02 AM, Mel Gorman wrote:
>> I am afraid I never looked too closely once I learned that the primary
>> motivation for this was relieving iTLB pressure in a very specific
>> case. AFAIK, this is not a problem in the vast majority of modern CPUs
>> and I found it very hard to be motivated to review the series as a result.
>> I suspected that in many cases that the cost of IO would continue to dominate
>> performance instead of TLB pressure. I also found it unlikely that there
>> was a workload that was tmpfs based that used enough memory to be hurt
>> by TLB pressure. My feedback was that a much more compelling case for the
>> series was needed but this discussion all happened on IRC unfortunately.
>
> FWIW, I'm mostly intrigued by the possibilities of how this can speed up
> _software_, and I'm rather uninterested in what it can do for the TLB.
> Page cache is particularly painful today, precisely because hugetlbfs
> and anonymous-thp aren't available there. If you have an app with
> hundreds of GB of files that it wants to mmap(), even if it's in the
> page cache, it takes _minutes_ to just fault in. One example:
>
> https://lkml.org/lkml/2013/6/27/698
* Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()
2013-09-23 12:05 Kirill A. Shutemov
2013-09-24 23:37 ` Andrew Morton
@ 2013-09-25 0:12 ` Ning Qu
2013-09-25 9:23 ` Kirill A. Shutemov
2013-09-26 21:13 ` Dave Hansen
2 siblings, 1 reply; 27+ messages in thread
From: Ning Qu @ 2013-09-25 0:12 UTC (permalink / raw)
To: Kirill A. Shutemov
Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, Dave Hansen,
Alexander Shishkin, linux-fsdevel, linux-kernel
Hi, Kirill,
Seems you dropped one patch in v5, is that intentional? Just wondering ...
thp, mm: handle tail pages in page_cache_get_speculative()
Thanks!
Best wishes,
--
Ning Qu (曲宁) | Software Engineer | quning@google.com | +1-408-418-6066
On Mon, Sep 23, 2013 at 5:05 AM, Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
> It brings thp support for ramfs, but without mmap() -- it will be posted
> separately.
>
> Please review and consider applying.
>
> Intro
> -----
>
> The goal of the project is preparing kernel infrastructure to handle huge
> pages in page cache.
>
> To prove that the proposed changes are functional we enable the feature
> for the simplest file system -- ramfs. ramfs is not that useful by
> itself, but it's a good pilot project.
>
> Design overview
> ---------------
>
> Every huge page is represented in the page cache radix-tree by HPAGE_PMD_NR
> (512 on x86-64) entries. All entries point to the head page -- refcounting for
> tail pages is pretty expensive.
>
> Radix tree manipulations are implemented in a batched way: we add and remove
> whole huge page at once, under one tree_lock. To make it possible, we
> extended radix-tree interface to be able to pre-allocate memory enough to
> insert a number of *contiguous* elements (kudos to Matthew Wilcox).
>
> Huge pages can be added to page cache three ways:
> - write(2) to file or page;
> - read(2) from sparse file;
> - fault sparse file.
>
> Potentially, one more way is collapsing small pages, but it's outside the
> initial implementation.
>
> For now we still write/read at most PAGE_CACHE_SIZE bytes at a time. There's
> some room for speed up later.
>
> Since mmap() isn't targeted for this patchset, we just split huge page on
> page fault.
>
> To minimize memory overhead for small files we avoid write-allocation in
> first huge page area (2M on x86-64) of the file.
>
> truncate_inode_pages_range() drops the whole huge page at once if it's fully
> inside the range. If a huge page is only partly in the range we zero out
> the part, exactly like we do for partial small pages.
>
> split_huge_page() for file pages works similarly to anon pages, but we
> walk mapping->i_mmap rather than anon_vma->rb_root. At the end we call
> truncate_inode_pages() to drop small pages beyond i_size, if any.
>
> inode->i_split_sem taken for read protects hugepages in the inode's
> pagecache against splitting. We take it for write during splitting.
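>
> A rough illustration of that locking pattern (illustrative only, not the
> actual patch code; it assumes i_split_sem is an rw_semaphore, as the
> read/write wording above suggests):
>
>     #include <linux/fs.h>
>     #include <linux/huge_mm.h>
>     #include <linux/rwsem.h>
>
>     static void use_hugepages_in_mapping(struct inode *inode)
>     {
>             down_read(&inode->i_split_sem);   /* hugepages stay intact here */
>             /* ... find and use huge pages in inode->i_mapping ... */
>             up_read(&inode->i_split_sem);
>     }
>
>     static void split_hugepage_in_mapping(struct inode *inode, struct page *page)
>     {
>             down_write(&inode->i_split_sem);  /* exclude readers during the split */
>             split_huge_page(page);
>             up_write(&inode->i_split_sem);
>     }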
>
> Changes since v5
> ----------------
> - change how a hugepage is stored in pagecache: the head page is used for
> all relevant indexes;
> - introduce i_split_sem;
> - do not create huge pages on write(2) into first hugepage area;
> - compile-disabled by default;
> - fix transparent_hugepage_pagecache();
>
> Benchmarks
> ----------
>
> Since the patchset doesn't include mmap() support, we shouldn't expect much
> change in performance. We just need to check that we don't introduce any
> major regression.
>
> On average, read/write on ramfs with thp is a bit slower, but I don't think
> it's a stopper -- ramfs is a toy anyway; on real-world filesystems I
> expect the difference to be smaller.
>
> postmark
> ========
>
> workload1:
> chmod +x postmark
> mount -t ramfs none /mnt
> cat >/root/workload1 <<EOF
> set transactions 250000
> set size 5120 524288
> set number 500
> run
> quit
> EOF
>
> workload2:
> set transactions 10000
> set size 2097152 10485760
> set number 100
> run
> quit
>
> throughput (transactions/sec)
>
>            workload1   workload2
> baseline        8333         416
> patched         8333         454
>
> FS-Mark
> =======
>
> throughput (files/sec)
>
>           2000 files by 1M   200 files by 10M
> baseline            5326.1              548.1
> patched             5192.8              528.4
>
> tiobench
> ========
>
> baseline:
> Tiotest results for 16 concurrent io threads:
> ,----------------------------------------------------------------------.
> | Item | Time | Rate | Usr CPU | Sys CPU |
> +-----------------------+----------+--------------+----------+---------+
> | Write 2048 MBs | 0.2 s | 8667.792 MB/s | 445.2 % | 5535.9 % |
> | Random Write 62 MBs | 0.0 s | 8341.118 MB/s | 0.0 % | 2615.8 % |
> | Read 2048 MBs | 0.2 s | 11680.431 MB/s | 339.9 % | 5470.6 % |
> | Random Read 62 MBs | 0.0 s | 9451.081 MB/s | 786.3 % | 1451.7 % |
> `----------------------------------------------------------------------'
> Tiotest latency results:
> ,-------------------------------------------------------------------------.
> | Item | Average latency | Maximum latency | % >2 sec | % >10 sec |
> +--------------+-----------------+-----------------+----------+-----------+
> | Write | 0.006 ms | 28.019 ms | 0.00000 | 0.00000 |
> | Random Write | 0.002 ms | 5.574 ms | 0.00000 | 0.00000 |
> | Read | 0.005 ms | 28.018 ms | 0.00000 | 0.00000 |
> | Random Read | 0.002 ms | 4.852 ms | 0.00000 | 0.00000 |
> |--------------+-----------------+-----------------+----------+-----------|
> | Total | 0.005 ms | 28.019 ms | 0.00000 | 0.00000 |
> `--------------+-----------------+-----------------+----------+-----------'
>
> patched:
> Tiotest results for 16 concurrent io threads:
> ,----------------------------------------------------------------------.
> | Item | Time | Rate | Usr CPU | Sys CPU |
> +-----------------------+----------+--------------+----------+---------+
> | Write 2048 MBs | 0.3 s | 7942.818 MB/s | 442.1 % | 5533.6 % |
> | Random Write 62 MBs | 0.0 s | 9425.426 MB/s | 723.9 % | 965.2 % |
> | Read 2048 MBs | 0.2 s | 11998.008 MB/s | 374.9 % | 5485.8 % |
> | Random Read 62 MBs | 0.0 s | 9823.955 MB/s | 251.5 % | 2011.9 % |
> `----------------------------------------------------------------------'
> Tiotest latency results:
> ,-------------------------------------------------------------------------.
> | Item | Average latency | Maximum latency | % >2 sec | % >10 sec |
> +--------------+-----------------+-----------------+----------+-----------+
> | Write | 0.007 ms | 28.020 ms | 0.00000 | 0.00000 |
> | Random Write | 0.001 ms | 0.022 ms | 0.00000 | 0.00000 |
> | Read | 0.004 ms | 24.011 ms | 0.00000 | 0.00000 |
> | Random Read | 0.001 ms | 0.019 ms | 0.00000 | 0.00000 |
> |--------------+-----------------+-----------------+----------+-----------|
> | Total | 0.005 ms | 28.020 ms | 0.00000 | 0.00000 |
> `--------------+-----------------+-----------------+----------+-----------'
>
> IOZone
> ======
>
> Syscalls, not mmap.
>
> ** Initial writers **
> threads:           1         2         4         8        10
> baseline:    4741691   7986408   9149064   9898695   9868597
> patched:     4682864   7275535   8691034   8872887   8712492
> speed-up:       0.99      0.91      0.95      0.90      0.88
>
> threads:          20        30        40        50        60        70        80
> baseline:    9629383   9469202  11605064   9507802  10641869  11360701  11040376
> patched:     8771912   8397216   7701346   7366853   8839736   8299893  10788439
> speed-up:       0.91      0.89      0.66      0.77      0.83      0.73      0.98
>
> ** Rewriters **
> threads:           1         2         4         8        10
> baseline:    5807891   9554869  12101083  13113533  12989751
> patched:     6161690   9981294  12285789  13428846  13610058
> speed-up:       1.06      1.04      1.02      1.02      1.05
>
> threads:          20        30        40        50        60        70        80
> baseline:   14359910  16998236  16833861  24735659  17502634  17396706  20448655
> patched:    13669153  20060182  17328347  24109999  19247934  24225103  34686574
> speed-up:       0.95      1.18      1.03      0.97      1.10      1.39      1.70
>
> ** Readers **
> threads:           1         2         4         8        10
> baseline:    7978066  11825735  13808941  14049598  14765175
> patched:     7723293  11481500  13796383  14363808  14353966
> speed-up:       0.97      0.97      1.00      1.02      0.97
>
> threads:          20        30        40        50        60        70        80
> baseline:   14422642  17322681  23209831  21386483  20060744  22032935  31166663
> patched:    14979865  17648225  18701258  29192810  23973723  22163317  23104638
> speed-up:       1.04      1.02      0.81      1.37      1.20      1.01      0.74
>
> ** Re-readers **
> threads:           1         2         4         8        10
> baseline:    7966269  11878323  14000782  14678206  14154235
> patched:     7719350  11410937  13710233  13232756  14040928
> speed-up:       0.97      0.96      0.98      0.90      0.99
>
> threads:          20        30        40        50        60        70        80
> baseline:   14271991  15170829  20924052  27393344  19114990  12509316  18495597
> patched:    15895021  16279330  17256068  26023572  18364678  27834483  23288680
> speed-up:       1.11      1.07      0.82      0.95      0.96      2.23      1.26
>
> ** Reverse readers **
> threads:           1         2         4         8        10
> baseline:    6630795  10331013  12839501  13157433  12783323
> patched:     6502341   9887711  12639278  12979232  13212825
> speed-up:       0.98      0.96      0.98      0.99      1.03
>
> threads:          20        30        40        50        60        70        80
> baseline:   13580283  15753068  15434572  21928982  17636994  14737489  19470679
> patched:    12928255  13961195  14695786  21370667  19873807  20902582  21892899
> speed-up:       0.95      0.89      0.95      0.97      1.13      1.42      1.12
>
> ** Random_readers **
> threads:           1         2         4         8        10
> baseline:    5152935   9043813  11752615  11996078  12283579
> patched:     5009089   8438137  11266015  11631218  12093650
> speed-up:       0.97      0.93      0.96      0.97      0.98
>
> threads:          20        30        40        50        60        70        80
> baseline:   12484039  14588004  15781507  23847538  15748906  13698335  27195847
> patched:    12779308  17768691  13640378  30468890  19269033  23444358  22775908
> speed-up:       1.02      1.22      0.86      1.28      1.22      1.71      0.84
>
> ** Random_writers **
> threads:           1         2         4         8        10
> baseline:    3886268   7405345  10531192  10858984  10994693
> patched:     4335323   7916132  10978892  11423247  11790932
> speed-up:       1.12      1.07      1.04      1.05      1.07
>
> threads:          20        30        40        50        60        70        80
> baseline:   12758450  10729531   9656825  10370144  13139452   4528331  12615812
> patched:    11424525  11798171  11413452  12230616  13075887  11165314  16925679
> speed-up:       0.90      1.10      1.18      1.18      1.00      2.47      1.34
>
> Kirill A. Shutemov (22):
> mm: implement zero_huge_user_segment and friends
> radix-tree: implement preload for multiple contiguous elements
> memcg, thp: charge huge cache pages
> thp: compile-time and sysfs knob for thp pagecache
> thp, mm: introduce mapping_can_have_hugepages() predicate
> thp: represent file thp pages in meminfo and friends
> thp, mm: rewrite add_to_page_cache_locked() to support huge pages
> mm: trace filemap: dump page order
> block: implement add_bdi_stat()
> thp, mm: rewrite delete_from_page_cache() to support huge pages
> thp, mm: warn if we try to use replace_page_cache_page() with THP
> thp, mm: add event counters for huge page alloc on file write or read
> mm, vfs: introduce i_split_sem
> thp, mm: allocate huge pages in grab_cache_page_write_begin()
> thp, mm: naive support of thp in generic_perform_write
> thp, mm: handle transhuge pages in do_generic_file_read()
> thp, libfs: initial thp support
> truncate: support huge pages
> thp: handle file pages in split_huge_page()
> thp: wait_split_huge_page(): serialize over i_mmap_mutex too
> thp, mm: split huge page on mmap file page
> ramfs: enable transparent huge page cache
>
> Documentation/vm/transhuge.txt | 16 ++++
> drivers/base/node.c | 4 +
> fs/inode.c | 3 +
> fs/libfs.c | 58 +++++++++++-
> fs/proc/meminfo.c | 3 +
> fs/ramfs/file-mmu.c | 2 +-
> fs/ramfs/inode.c | 6 +-
> include/linux/backing-dev.h | 10 +++
> include/linux/fs.h | 11 +++
> include/linux/huge_mm.h | 68 +++++++++++++-
> include/linux/mm.h | 18 ++++
> include/linux/mmzone.h | 1 +
> include/linux/page-flags.h | 13 +++
> include/linux/pagemap.h | 31 +++++++
> include/linux/radix-tree.h | 11 +++
> include/linux/vm_event_item.h | 4 +
> include/trace/events/filemap.h | 7 +-
> lib/radix-tree.c | 94 ++++++++++++++++++--
> mm/Kconfig | 11 +++
> mm/filemap.c | 196 ++++++++++++++++++++++++++++++++---------
> mm/huge_memory.c | 147 +++++++++++++++++++++++++++----
> mm/memcontrol.c | 3 +-
> mm/memory.c | 40 ++++++++-
> mm/truncate.c | 125 ++++++++++++++++++++------
> mm/vmstat.c | 5 ++
> 25 files changed, 779 insertions(+), 108 deletions(-)
>
> --
> 1.8.4.rc3
>
>
* Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()
2013-09-25 0:12 ` Ning Qu
@ 2013-09-25 9:23 ` Kirill A. Shutemov
0 siblings, 0 replies; 27+ messages in thread
From: Kirill A. Shutemov @ 2013-09-25 9:23 UTC (permalink / raw)
To: Ning Qu
Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, Hillf Danton,
Dave Hansen, Alexander Shishkin, linux-fsdevel, linux-kernel
Ning Qu wrote:
> Hi, Kirill,
>
> Seems you dropped one patch in v5, is that intentional? Just wondering ...
>
> thp, mm: handle tail pages in page_cache_get_speculative()
It's not needed anymore, since we don't have tail pages in radix tree.
--
Kirill A. Shutemov
* Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()
2013-09-23 12:05 Kirill A. Shutemov
2013-09-24 23:37 ` Andrew Morton
2013-09-25 0:12 ` Ning Qu
@ 2013-09-26 21:13 ` Dave Hansen
2 siblings, 0 replies; 27+ messages in thread
From: Dave Hansen @ 2013-09-26 21:13 UTC (permalink / raw)
To: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton
Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
Hillf Danton, Ning Qu, Alexander Shishkin, linux-fsdevel,
linux-kernel, Luck, Tony
On 09/23/2013 05:05 AM, Kirill A. Shutemov wrote:
> To prove that the proposed changes are functional we enable the feature
> for the simplest file system -- ramfs. ramfs is not that useful by
> itself, but it's a good pilot project.
This does, at the least, give us a shared memory mechanism that can move
between large and small pages. We don't have anything which can do that
today.
Tony Luck was just mentioning that if we have a small (say 1-bit) memory
failure in a hugetlbfs page, then we end up tossing out the entire 2MB.
The app gets a chance to recover the contents, but it has to do it for
the entire 2MB. Ideally, we'd like to break the 2M down into 4k pages,
which lets us continue using the remaining 2M-4k, and leaves the app to
rebuild 4k of its data instead of 2M.
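(That is 1/512th of the mapping -- a 2M huge page is 512 4k pages -- so
the app rebuilds roughly 0.2% of its data instead of all of it.)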
If you look at the diffstat, it's also pretty obvious that virtually
none of this code is actually specific to ramfs. It'll all get used as
the foundation for the "real" filesystems too. I'm very interested in
how those end up looking, too, but I think Kirill is selling his patches
a bit short calling this a toy.
Thread overview: 27+ messages
2013-09-25 18:11 [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap() Ning Qu
2013-09-23 12:05 Kirill A. Shutemov
2013-09-24 23:37 ` Andrew Morton
2013-09-24 23:48 ` Ning Qu
2013-09-24 23:49 ` Andi Kleen
2013-09-24 23:58 ` Andrew Morton
2013-09-25 11:15 ` Kirill A. Shutemov
2013-09-25 15:05 ` Andi Kleen
2013-09-26 18:30 ` Zach Brown
2013-09-26 19:05 ` Andi Kleen
2013-09-30 10:13 ` Mel Gorman
2013-09-30 16:05 ` Andi Kleen
2013-09-25 9:51 ` Kirill A. Shutemov
2013-09-25 23:29 ` Dave Chinner
2013-10-14 13:56 ` Kirill A. Shutemov
2013-09-30 10:02 ` Mel Gorman
2013-09-30 10:10 ` Mel Gorman
2013-09-30 18:07 ` Ning Qu
2013-09-30 18:51 ` Andi Kleen
2013-10-01 8:38 ` Mel Gorman
2013-10-01 17:11 ` Ning Qu
2013-10-14 14:27 ` Kirill A. Shutemov
2013-09-30 15:27 ` Dave Hansen
2013-09-30 18:05 ` Ning Qu
2013-09-25 0:12 ` Ning Qu
2013-09-25 9:23 ` Kirill A. Shutemov
2013-09-26 21:13 ` Dave Hansen