linux-mm.kvack.org archive mirror
* Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()
@ 2013-09-25 18:11 Ning Qu
  0 siblings, 0 replies; 27+ messages in thread
From: Ning Qu @ 2013-09-25 18:11 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, Dave Hansen,
	Alexander Shishkin, linux-fsdevel, linux-kernel

Got you. Thanks!

Best wishes,
-- 
Ning Qu (曲宁) | Software Engineer | quning@google.com | +1-408-418-6066


On Wed, Sep 25, 2013 at 2:23 AM, Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
> Ning Qu wrote:
>> Hi, Kirill,
>>
>> Seems you dropped one patch in v5, is that intentional? Just wondering ...
>>
>>   thp, mm: handle tail pages in page_cache_get_speculative()
>
> It's not needed anymore, since we don't have tail pages in radix tree.
>
> --
>  Kirill A. Shutemov


* Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()
  2013-10-01  8:38         ` Mel Gorman
  2013-10-01 17:11           ` Ning Qu
@ 2013-10-14 14:27           ` Kirill A. Shutemov
  1 sibling, 0 replies; 27+ messages in thread
From: Kirill A. Shutemov @ 2013-10-14 14:27 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andi Kleen, Andrew Morton, Kirill A. Shutemov, Andrea Arcangeli,
	Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, linux-mm,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, Dave Hansen,
	Ning Qu, Alexander Shishkin, linux-fsdevel, linux-kernel

Mel Gorman wrote:
> I could be completely wrong here but these were the concerns I had when
> I first glanced through the patches. The changelogs had no information
> to convince me otherwise so I never dedicated the time to reviewing the
> patches in detail. I raised my concerns and then dropped it.

Okay. I got your point: more data from real-world workloads. I'll try to
bring some in the next iteration.

-- 
 Kirill A. Shutemov


* Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()
  2013-09-25 23:29     ` Dave Chinner
@ 2013-10-14 13:56       ` Kirill A. Shutemov
  0 siblings, 0 replies; 27+ messages in thread
From: Kirill A. Shutemov @ 2013-10-14 13:56 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, Hillf Danton,
	Dave Hansen, Ning Qu, Alexander Shishkin, linux-fsdevel,
	linux-kernel

Dave Chinner wrote:
> On Wed, Sep 25, 2013 at 12:51:04PM +0300, Kirill A. Shutemov wrote:
> > Andrew Morton wrote:
> > > On Mon, 23 Sep 2013 15:05:28 +0300 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
> > > 
> > > > It brings thp support for ramfs, but without mmap() -- it will be posted
> > > > separately.
> > > 
> > > We were never going to do this :(
> > > 
> > > Has anyone reviewed these patches much yet?
> > 
> > Dave did a very good review. A few other people looked at separate patches.
> > See Reviewed-by/Acked-by tags in patches.
> > 
> > It looks like most mm experts are busy with numa balancing nowadays, so
> > it's hard to get more review.
> 
> Nobody has reviewed it from the filesystem side, though.
> 
> The changes that require special code paths for huge pages in the
> write_begin/write_end paths are nasty. You're adding conditional
> code that depends on the page size and then having to add checks to
> ensure that large page operations don't step over small page
> boundaries and other such corner cases. It's an extremely fragile
> design, IMO.
> 
> In general, I don't like all the if (thp) {} else {}; code that this
> series introduces - they are code paths that simply won't get tested
> with any sort of regularity and make the code more complex for those
> that aren't using THP to understand and debug...

Okay, I'll try to get rid of the special cases where possible.

> Then there is a new per-inode lock that is used in
> generic_perform_write() which is held across page faults and calls
> to filesystem block mapping callbacks. This inserts into the middle
> of an existing locking chain that needs to be strictly ordered, and
> as such will lead to the same type of lock inversion problems that
> the mmap_sem had.  We do not want to introduce a new lock that has
> this same problem just as we are getting rid of that long standing
> nastiness from the page fault path...

I don't see how we can protect against splitting with the existing locks,
but I'll try to find a way.

> I also note that you didn't convert invalidate_inode_pages2_range()
> to support huge pages which is needed by real filesystems that
> support direct IO. There are other truncate/invalidate interfaces
> that you didn't convert, either, and some of them will present you
> with interesting locking challenges as a result of adding that new
> lock...

Thanks. I'll take a look at these code paths.

> > The patchset was mostly ignored for a few rounds, and Dave suggested
> > splitting it to get a less scary patch count.
> 
> It's still being ignored by filesystem people because you haven't
> actually tried to implement support into a real filesystem.....

If it supported a real filesystem, wouldn't it be ignored due to the
patch count? ;)

> > > > Please review and consider applying.
> > > 
> > > It appears rather too immature at this stage.
> > 
> > More review is always welcome and I'm committed to address issues.
> 
> IMO, supporting a real block based filesystem like ext4 or XFS and
> demonstrating that everything works is necessary before we go any
> further...

I'll see what numbers I can bring in the next iterations.

Thanks for your feedback, and sorry for the late answer.

-- 
 Kirill A. Shutemov


* Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()
  2013-10-01  8:38         ` Mel Gorman
@ 2013-10-01 17:11           ` Ning Qu
  2013-10-14 14:27           ` Kirill A. Shutemov
  1 sibling, 0 replies; 27+ messages in thread
From: Ning Qu @ 2013-10-01 17:11 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andi Kleen, Andrew Morton, Kirill A. Shutemov, Andrea Arcangeli,
	Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, linux-mm,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, Dave Hansen,
	Alexander Shishkin, linux-fsdevel, linux-kernel

I can throw in some numbers for one of the test cases I am working on.

One of the workloads uses sysv shm to load GB-scale files into
memory, which are then shared with other worker processes for the long
term. We load as many files as fit in the available physical memory.
The heap is also pretty big (GB level) to handle that data.

For the workload I just mentioned, thp gives us about an 8%
performance improvement: 5% from thp anonymous memory and 3% from thp
page cache. It might not look like much, but it's pretty good for not
changing a single line of application code, which is the beauty of thp.

Before that, we had been using hugetlbfs, which meant reserving a
huge amount of memory at boot time, whether that memory would be used
or not. It worked, but no other major service could ever share the
server's resources.
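
[Ed. note: a minimal userspace sketch of the hugetlbfs-era approach being
described, for contrast. It assumes a huge page pool reserved at boot (e.g.
hugepages=N on the kernel command line); pages in that pool are unusable by
anything else, which is exactly the inflexibility pointed out above. Error
handling is trimmed.]

#include <stdio.h>
#include <sys/ipc.h>
#include <sys/shm.h>

#define GB (1024UL * 1024 * 1024)

int main(void)
{
	/* Huge-page-backed sysv shm segment; draws from the boot-time
	 * reserved hugetlbfs pool and fails if the pool is too small. */
	int id = shmget(IPC_PRIVATE, 4 * GB,
			IPC_CREAT | SHM_HUGETLB | 0600);
	if (id < 0) {
		perror("shmget(SHM_HUGETLB)");
		return 1;
	}
	/* With thp page cache, the same shmget() without SHM_HUGETLB
	 * (and without any boot-time reservation) could be backed by
	 * huge pages transparently. */
	void *p = shmat(id, NULL, 0);
	shmctl(id, IPC_RMID, NULL);	/* destroy on last detach */
	return p == (void *)-1;
}
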
Best wishes,
-- 
Ning Qu (曲宁) | Software Engineer | quning@google.com | +1-408-418-6066


On Tue, Oct 1, 2013 at 1:38 AM, Mel Gorman <mgorman@suse.de> wrote:
> On Mon, Sep 30, 2013 at 11:51:06AM -0700, Andi Kleen wrote:
>> > AFAIK, this is not a problem in the vast majority of modern CPUs
>>
>> Let's do some simple math: e.g. a Sandy Bridge system has 512 4K iTLB L2 entries.
>> That's around 2MB. There's more and more code whose footprint exceeds
>> that.
>>
>
> With an expectation that it is read-mostly data, replicated between the
> caches accessing it and TLB refills taking very little time. This is not
> universally true and there are exceptions but even recent papers on TLB
> behaviour have tended to dismiss the iTLB refill overhead as a negligible
> portion of the overall workload of interest.
>
>> Besides iTLB is not the only target. It is also useful for
>> data of course.
>>
>
> True, but how useful? I have not seen an example of a workload showing that
> dTLB pressure on file-backed data was a major component of the workload. I
> would expect that sysV shared memory is an exception but does that require
> generic support for all filesystems or can tmpfs be special cased when
> it's used for shared memory?
>
> For normal data, if it's read-only data then there would be some benefit to
> using huge pages once the data is in page cache. How common are workloads
> that mmap() large amounts of read-only data? Possibly some databases
> depending on the workload although there I would expect that the data is
> placed in shared memory.
>
> If the mmap()s data is being written then the cost of IO is likely to
> dominate, not TLB pressure. For write-mostly workloads there are greater
> concerns because dirty tracking can only be done at the huge page boundary
> potentially leading to greater amounts of IO and degraded performance
> overall.
>
> I could be completely wrong here but these were the concerns I had when
> I first glanced through the patches. The changelogs had no information
> to convince me otherwise so I never dedicated the time to reviewing the
> patches in detail. I raised my concerns and then dropped it.
>
>> > > and I found it very hard to be motivated to review the series as a result.
>> > > I suspected that in many cases that the cost of IO would continue to dominate
>> > > performance instead of TLB pressure
>>
>> The trend is to larger and larger memories, keeping things in memory.
>>
>
> Yes, but using huge pages is not *necessarily* the answer. For fault
> scalability it probably would be a lot easier to batch handle faults if
> readahead indicates accesses are sequential. Background zeroing of pages
> could be revisited for fault intensive workloads. A potential alternative
> is that a contiguous page is allocated, zeroed as one lump, split the pages
> and put onto a local per-task list although the details get messy. Reclaim
> scanning could be heavily modified to use collections of pages instead of
> single pages (although I'm not aware of the proper design of such a thing).
>
> Again, this could be completely off the mark but if it was me that was
> working on this problem, I would have some profile data from some workloads
> to make sure the part I'm optimising was a noticeable percentage of the
> workload and included that in the patch leader. I would hope that the data
> was compelling enough to convince reviewers to pay close attention to the
> series as the complexity would then be justified. Based on how complex THP
> was for anonymous pages, I would be tempted to treat THP for file-backed
> data as a last resort.
>
>> In fact there's a good argument that memory sizes are growing faster
>> than TLB capacities. And without large TLBs we're even further off
>> the curve.
>>
>
> I'll admit this is also true. It was considered to be true in the 90's
> when huge pages were first being thrown around as a possible solution to
> the problem. One paper recently suggested using segmentation for large
> memory segments but the workloads they examined looked like they would
> be dominated by anonymous access, not file-backed data with one exception
> where the workload frequently accessed compile-time constants.
>
> --
> Mel Gorman
> SUSE Labs


* Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()
  2013-09-30 18:51       ` Andi Kleen
@ 2013-10-01  8:38         ` Mel Gorman
  2013-10-01 17:11           ` Ning Qu
  2013-10-14 14:27           ` Kirill A. Shutemov
  0 siblings, 2 replies; 27+ messages in thread
From: Mel Gorman @ 2013-10-01  8:38 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, Kirill A. Shutemov, Andrea Arcangeli, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, linux-mm, Matthew Wilcox,
	Kirill A. Shutemov, Hillf Danton, Dave Hansen, Ning Qu,
	Alexander Shishkin, linux-fsdevel, linux-kernel

On Mon, Sep 30, 2013 at 11:51:06AM -0700, Andi Kleen wrote:
> > AFAIK, this is not a problem in the vast majority of modern CPUs
> 
> Let's do some simple math: e.g. a Sandy Bridge system has 512 4K iTLB L2 entries.
> That's around 2MB. There's more and more code whose footprint exceeds
> that.
> 

With an expectation that it is read-mostly data, replicated between the
caches accessing it and TLB refills taking very little time. This is not
universally true and there are exceptions but even recent papers on TLB
behaviour have tended to dismiss the iTLB refill overhead as a negligible
portion of the overall workload of interest.

> Besides iTLB is not the only target. It is also useful for 
> data of course.
> 

True, but how useful? I have not seen an example of a workload showing that
dTLB pressure on file-backed data was a major component of the workload. I
would expect that sysV shared memory is an exception but does that require
generic support for all filesystems or can tmpfs be special cased when
it's used for shared memory?

For normal data, if it's read-only data then there would be some benefit to
using huge pages once the data is in page cache. How common are workloads
that mmap() large amounts of read-only data? Possibly some databases
depending on the workload although there I would expect that the data is
placed in shared memory.

If the mmap()s data is being written then the cost of IO is likely to
dominate, not TLB pressure. For write-mostly workloads there are greater
concerns because dirty tracking can only be done at the huge page boundary
potentially leading to greater amounts of IO and degraded performance
overall.

I could be completely wrong here but these were the concerns I had when
I first glanced through the patches. The changelogs had no information
to convince me otherwise so I never dedicated the time to reviewing the
patches in detail. I raised my concerns and then dropped it.

> > > and I found it very hard to be motivated to review the series as a result.
> > > I suspected that in many cases that the cost of IO would continue to dominate
> > > performance instead of TLB pressure
> 
> The trend is to larger and larger memories, keeping things in memory.
> 

Yes, but using huge pages is not *necessarily* the answer. For fault
scalability it probably would be a lot easier to batch handle faults if
readahead indicates accesses are sequential. Background zeroing of pages
could be revisited for fault intensive workloads. A potential alternative
is that a contiguous page is allocated, zeroed as one lump, split the pages
and put onto a local per-task list although the details get messy. Reclaim
scanning could be heavily modified to use collections of pages instead of
single pages (although I'm not aware of the proper design of such a thing).
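
[Ed. note: a rough sketch of that last alternative, under the stated caveat
that the details get messy. alloc_pages() with __GFP_ZERO zeroes the whole
chunk in one go, and split_page() turns a non-compound higher-order page
into independent order-0 pages.]

static int refill_local_zeroed_pages(struct list_head *list)
{
	/* Allocate a contiguous chunk and zero it as one lump. */
	struct page *page = alloc_pages(GFP_KERNEL | __GFP_ZERO,
					HPAGE_PMD_ORDER);
	int i;

	if (!page)
		return -ENOMEM;

	/* Make the subpages independently refcounted order-0 pages. */
	split_page(page, HPAGE_PMD_ORDER);

	/* Park them on a local per-task list for later faults. */
	for (i = 0; i < (1 << HPAGE_PMD_ORDER); i++)
		list_add(&page[i].lru, list);
	return 0;
}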

Again, this could be completely off the mark but if it was me that was
working on this problem, I would have some profile data from some workloads
to make sure the part I'm optimising was a noticeable percentage of the
workload and included that in the patch leader. I would hope that the data
was compelling enough to convince reviewers to pay close attention to the
series as the complexity would then be justified. Based on how complex THP
was for anonymous pages, I would be tempted to treat THP for file-backed
data as a last resort.

> In fact there's a good argument that memory sizes are growing faster
> than TLB capacities. And without large TLBs we're even further off
> the curve.
> 

I'll admit this is also true. It was considered to be true in the 90's
when huge pages were first being thrown around as a possible solution to
the problem. One paper recently suggested using segmentation for large
memory segments but the workloads they examined looked like they would
be dominated by anonymous access, not file-backed data with one exception
where the workload frequently accessed compile-time constants.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()
  2013-09-30 10:10     ` Mel Gorman
  2013-09-30 18:07       ` Ning Qu
@ 2013-09-30 18:51       ` Andi Kleen
  2013-10-01  8:38         ` Mel Gorman
  1 sibling, 1 reply; 27+ messages in thread
From: Andi Kleen @ 2013-09-30 18:51 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Kirill A. Shutemov, Andrea Arcangeli, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, linux-mm, Matthew Wilcox,
	Kirill A. Shutemov, Hillf Danton, Dave Hansen, Ning Qu,
	Alexander Shishkin, linux-fsdevel, linux-kernel

> AFAIK, this is not a problem in the vast majority of modern CPUs

Let's do some simple math: e.g. a Sandy Bridge system has 512 4K iTLB L2 entries.
That's around 2MB. There's more and more code whose footprint exceeds
that.

Besides iTLB is not the only target. It is also useful for 
data of course.

> > and I found it very hard to be motivated to review the series as a result.
> > I suspected that in many cases that the cost of IO would continue to dominate
> > performance instead of TLB pressure

The trend is to larger and larger memories, keeping things in memory.

In fact there's a good argument that memory sizes are growing faster
than TLB capacities. And without large TLBs we're even further off
the curve.

> Oh, one last thing I forgot. While tmpfs-based workloads were not likely to
> benefit I would expect that sysV shared memory workloads would potentially
> benefit from this.  hugetlbfs is still required for shared memory areas
> but it is not a problem that is addressed by this series.

Of course it's only the first step. But if no one does the baby steps,
then the other usages will never materialize.

I expect that once ramfs works, extending it to tmpfs etc. should be
straightforward.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only


* Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()
  2013-09-30 10:10     ` Mel Gorman
@ 2013-09-30 18:07       ` Ning Qu
  2013-09-30 18:51       ` Andi Kleen
  1 sibling, 0 replies; 27+ messages in thread
From: Ning Qu @ 2013-09-30 18:07 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Kirill A. Shutemov, Andrea Arcangeli, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, Dave Hansen,
	Alexander Shishkin, linux-fsdevel, linux-kernel

I suppose sysv shm and tmpfs share the same code base now, so both of
them will benefit from thp page cache?

And Kirill's previous patchset (up to v4) contained mmap support
as well. I suppose the patchset got split into smaller groups so
it's easier to review ....

Best wishes,
-- 
Ning Qu (曲宁) | Software Engineer | quning@google.com | +1-408-418-6066


On Mon, Sep 30, 2013 at 3:10 AM, Mel Gorman <mgorman@suse.de> wrote:
> On Mon, Sep 30, 2013 at 11:02:49AM +0100, Mel Gorman wrote:
>> On Tue, Sep 24, 2013 at 04:37:40PM -0700, Andrew Morton wrote:
>> > On Mon, 23 Sep 2013 15:05:28 +0300 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
>> >
>> > > It brings thp support for ramfs, but without mmap() -- it will be posted
>> > > separately.
>> >
>> > We were never going to do this :(
>> >
>> > Has anyone reviewed these patches much yet?
>> >
>>
>> I am afraid I never looked too closely once I learned that the primary
>> motivation for this was relieving iTLB pressure in a very specific
>> case. AFAIK, this is not a problem in the vast majority of modern CPUs
>> and I found it very hard to be motivated to review the series as a result.
>> I suspected that in many cases that the cost of IO would continue to dominate
>> performance instead of TLB pressure. I also found it unlikely that there
>> was a workload that was tmpfs based that used enough memory to be hurt
>> by TLB pressure. My feedback was that a much more compelling case for the
>> series was needed but this discussion all happened on IRC unfortunately.
>>
>
> Oh, one last thing I forgot. While tmpfs-based workloads were not likely to
> benefit I would expect that sysV shared memory workloads would potentially
> benefit from this.  hugetlbfs is still required for shared memory areas
> but it is not a problem that is addressed by this series.
>
> --
> Mel Gorman
> SUSE Labs


* Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()
  2013-09-30 15:27     ` Dave Hansen
@ 2013-09-30 18:05       ` Ning Qu
  0 siblings, 0 replies; 27+ messages in thread
From: Ning Qu @ 2013-09-30 18:05 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Mel Gorman, Andrew Morton, Kirill A. Shutemov, Andrea Arcangeli,
	Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, Hillf Danton,
	Alexander Shishkin, linux-fsdevel, linux-kernel

Yes, I agree. In our case, we have tens-of-GB files, and thp page
cache does improve the numbers as expected.

And compared to hugetlbfs (static huge pages), it's more flexible and
beneficial system-wide ....


Best wishes,
-- 
Ning Qu (曲宁) | Software Engineer | quning@google.com | +1-408-418-6066


On Mon, Sep 30, 2013 at 8:27 AM, Dave Hansen <dave@sr71.net> wrote:
> On 09/30/2013 03:02 AM, Mel Gorman wrote:
>> I am afraid I never looked too closely once I learned that the primary
>> motivation for this was relieving iTLB pressure in a very specific
>> case. AFAIK, this is not a problem in the vast majority of modern CPUs
>> and I found it very hard to be motivated to review the series as a result.
>> I suspected that in many cases that the cost of IO would continue to dominate
>> performance instead of TLB pressure. I also found it unlikely that there
>> was a workload that was tmpfs based that used enough memory to be hurt
>> by TLB pressure. My feedback was that a much more compelling case for the
>> series was needed but this discussion all happened on IRC unfortunately.
>
> FWIW, I'm mostly intrigued by the possibilities of how this can speed up
> _software_, and I'm rather uninterested in what it can do for the TLB.
> Page cache is particularly painful today, precisely because hugetlbfs
> and anonymous-thp aren't available there.  If you have an app with
> hundreds of GB of files that it wants to mmap(), even if it's in the
> page cache, it takes _minutes_ to just fault in.  One example:
>
>         https://lkml.org/lkml/2013/6/27/698


* Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()
  2013-09-30 10:13     ` Mel Gorman
@ 2013-09-30 16:05       ` Andi Kleen
  0 siblings, 0 replies; 27+ messages in thread
From: Andi Kleen @ 2013-09-30 16:05 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Kirill A. Shutemov, Andrea Arcangeli, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, linux-mm, Matthew Wilcox,
	Kirill A. Shutemov, Hillf Danton, Dave Hansen, Ning Qu,
	Alexander Shishkin, linux-fsdevel, linux-kernel

On Mon, Sep 30, 2013 at 11:13:00AM +0100, Mel Gorman wrote:
> On Tue, Sep 24, 2013 at 04:49:50PM -0700, Andi Kleen wrote:
> > > Sigh.  A pox on whoever thought up huge pages. 
> > 
> > managing 1TB+ of memory in 4K chunks is just insane.
> > The question of larger pages is not "if", but only "when".
> > 
> 
> Remember that there are at least two separate issues there. One is the
> handling data in larger granularities than a 4K page and the second is
> the TLB, pagetable etc handling. They are not necessarily the same problem.

It's the same problem in the end.

The hardware is struggling with 4K pages too (both i and d).

I expect longer-term TLB/page optimization to be far more important
than all this NUMA placement work that people spend so much
time on.


-Andi
-- 
ak@linux.intel.com -- Speaking for myself only


* Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()
  2013-09-30 10:02   ` Mel Gorman
  2013-09-30 10:10     ` Mel Gorman
@ 2013-09-30 15:27     ` Dave Hansen
  2013-09-30 18:05       ` Ning Qu
  1 sibling, 1 reply; 27+ messages in thread
From: Dave Hansen @ 2013-09-30 15:27 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, linux-mm, Andi Kleen, Matthew Wilcox,
	Kirill A. Shutemov, Hillf Danton, Ning Qu, Alexander Shishkin,
	linux-fsdevel, linux-kernel

On 09/30/2013 03:02 AM, Mel Gorman wrote:
> I am afraid I never looked too closely once I learned that the primary
> motivation for this was relieving iTLB pressure in a very specific
> case. AFAIK, this is not a problem in the vast majority of modern CPUs
> and I found it very hard to be motivated to review the series as a result.
> I suspected that in many cases that the cost of IO would continue to dominate
> performance instead of TLB pressure. I also found it unlikely that there
> was a workload that was tmpfs based that used enough memory to be hurt
> by TLB pressure. My feedback was that a much more compelling case for the
> series was needed but this discussion all happened on IRC unfortunately.

FWIW, I'm mostly intrigued by the possibilities of how this can speed up
_software_, and I'm rather uninterested in what it can do for the TLB.
Page cache is particularly painful today, precisely because hugetlbfs
and anonymous-thp aren't available there.  If you have an app with
hundreds of GB of files that it wants to mmap(), even if it's in the
page cache, it takes _minutes_ to just fault in.  One example:

	https://lkml.org/lkml/2013/6/27/698


* Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()
  2013-09-24 23:49   ` Andi Kleen
  2013-09-24 23:58     ` Andrew Morton
  2013-09-26 18:30     ` Zach Brown
@ 2013-09-30 10:13     ` Mel Gorman
  2013-09-30 16:05       ` Andi Kleen
  2 siblings, 1 reply; 27+ messages in thread
From: Mel Gorman @ 2013-09-30 10:13 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, Kirill A. Shutemov, Andrea Arcangeli, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, linux-mm, Matthew Wilcox,
	Kirill A. Shutemov, Hillf Danton, Dave Hansen, Ning Qu,
	Alexander Shishkin, linux-fsdevel, linux-kernel

On Tue, Sep 24, 2013 at 04:49:50PM -0700, Andi Kleen wrote:
> > Sigh.  A pox on whoever thought up huge pages. 
> 
> managing 1TB+ of memory in 4K chunks is just insane.
> The question of larger pages is not "if", but only "when".
> 

Remember that there are at least two separate issues there. One is the
handling data in larger granularities than a 4K page and the second is
the TLB, pagetable etc handling. They are not necessarily the same problem.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()
  2013-09-30 10:02   ` Mel Gorman
@ 2013-09-30 10:10     ` Mel Gorman
  2013-09-30 18:07       ` Ning Qu
  2013-09-30 18:51       ` Andi Kleen
  2013-09-30 15:27     ` Dave Hansen
  1 sibling, 2 replies; 27+ messages in thread
From: Mel Gorman @ 2013-09-30 10:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, linux-mm, Andi Kleen, Matthew Wilcox,
	Kirill A. Shutemov, Hillf Danton, Dave Hansen, Ning Qu,
	Alexander Shishkin, linux-fsdevel, linux-kernel

On Mon, Sep 30, 2013 at 11:02:49AM +0100, Mel Gorman wrote:
> On Tue, Sep 24, 2013 at 04:37:40PM -0700, Andrew Morton wrote:
> > On Mon, 23 Sep 2013 15:05:28 +0300 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
> > 
> > > It brings thp support for ramfs, but without mmap() -- it will be posted
> > > separately.
> > 
> > We were never going to do this :(
> > 
> > Has anyone reviewed these patches much yet?
> > 
> 
> I am afraid I never looked too closely once I learned that the primary
> motivation for this was relieving iTLB pressure in a very specific
> case. AFAIK, this is not a problem in the vast majority of modern CPUs
> and I found it very hard to be motivated to review the series as a result.
> I suspected that in many cases that the cost of IO would continue to dominate
> performance instead of TLB pressure. I also found it unlikely that there
> was a workload that was tmpfs based that used enough memory to be hurt
> by TLB pressure. My feedback was that a much more compelling case for the
> series was needed but this discussion all happened on IRC unfortunately.
> 

Oh, one last thing I forgot. While tmpfs-based workloads were not likely to
benefit I would expect that sysV shared memory workloads would potentially
benefit from this.  hugetlbfs is still required for shared memory areas
but it is not a problem that is addressed by this series.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()
  2013-09-24 23:37 ` Andrew Morton
                     ` (2 preceding siblings ...)
  2013-09-25  9:51   ` Kirill A. Shutemov
@ 2013-09-30 10:02   ` Mel Gorman
  2013-09-30 10:10     ` Mel Gorman
  2013-09-30 15:27     ` Dave Hansen
  3 siblings, 2 replies; 27+ messages in thread
From: Mel Gorman @ 2013-09-30 10:02 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, linux-mm, Andi Kleen, Matthew Wilcox,
	Kirill A. Shutemov, Hillf Danton, Dave Hansen, Ning Qu,
	Alexander Shishkin, linux-fsdevel, linux-kernel

On Tue, Sep 24, 2013 at 04:37:40PM -0700, Andrew Morton wrote:
> On Mon, 23 Sep 2013 15:05:28 +0300 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
> 
> > It brings thp support for ramfs, but without mmap() -- it will be posted
> > separately.
> 
> We were never going to do this :(
> 
> Has anyone reviewed these patches much yet?
> 

I am afraid I never looked too closely once I learned that the primary
motivation for this was relieving iTLB pressure in a very specific
case. AFAIK, this is not a problem in the vast majority of modern CPUs
and I found it very hard to be motivated to review the series as a result.
I suspected that in many cases that the cost of IO would continue to dominate
performance instead of TLB pressure. I also found it unlikely that there
was a workload that was tmpfs based that used enough memory to be hurt
by TLB pressure. My feedback was that a much more compelling case for the
series was needed but this discussion all happened on IRC unfortunately.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()
  2013-09-23 12:05 Kirill A. Shutemov
  2013-09-24 23:37 ` Andrew Morton
  2013-09-25  0:12 ` Ning Qu
@ 2013-09-26 21:13 ` Dave Hansen
  2 siblings, 0 replies; 27+ messages in thread
From: Dave Hansen @ 2013-09-26 21:13 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Ning Qu, Alexander Shishkin, linux-fsdevel,
	linux-kernel, Luck, Tony

On 09/23/2013 05:05 AM, Kirill A. Shutemov wrote:
> To prove that the proposed changes are functional, we enable the feature
> for the simplest file system -- ramfs. ramfs is not that useful by
> itself, but it's a good pilot project.

This does, at the least, give us a shared memory mechanism that can move
between large and small pages.  We don't have anything which can do that
today.

Tony Luck was just mentioning that if we have a small (say 1-bit) memory
failure in a hugetlbfs page, then we end up tossing out the entire 2MB.
 The app gets a chance to recover the contents, but it has to do it for
the entire 2MB.  Ideally, we'd like to break the 2M down in to 4k pages,
which lets us continue using the remaining 2M-4k, and leaves the app to
rebuild 4k of its data instead of 2M.
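
[Ed. note: a sketch of the recovery idea Tony describes; this is not the
actual hwpoison code path, just an illustration of splitting so that only
the affected subpage is lost. split_huge_page() returns 0 on success.]

/* On a small failure inside a huge page, try to split it so only the
 * affected 4k page is lost instead of the whole 2M. */
static int recover_subpage(struct page *head, struct page *bad)
{
	if (split_huge_page(head))
		return -EBUSY;		/* can't split: lose all 2M */

	SetPageHWPoison(bad);		/* quarantine just this 4k page */
	/* The app rebuilds 4k of data; the other 2M-4k stays usable. */
	return 0;
}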

If you look at the diffstat, it's also pretty obvious that virtually
none of this code is actually specific to ramfs.  It'll all get used as
the foundation for the "real" filesystems too.  I'm very interested in
how those end up looking, too, but I think Kirill is selling his patches
a bit short calling this a toy.


* Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()
  2013-09-26 18:30     ` Zach Brown
@ 2013-09-26 19:05       ` Andi Kleen
  0 siblings, 0 replies; 27+ messages in thread
From: Andi Kleen @ 2013-09-26 19:05 UTC (permalink / raw)
  To: Zach Brown
  Cc: Andrew Morton, Kirill A. Shutemov, Andrea Arcangeli, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, Dave Hansen,
	Ning Qu, Alexander Shishkin, linux-fsdevel, linux-kernel

On Thu, Sep 26, 2013 at 11:30:22AM -0700, Zach Brown wrote:
> > > Sigh.  A pox on whoever thought up huge pages. 
> > 
> > managing 1TB+ of memory in 4K chunks is just insane.
> > The question of larger pages is not "if", but only "when".
> 
> And "how"!
> 
> > Sprinkling a bunch of magical if (thp) {} else {} throughout the code
> looks like a stunningly bad idea to me.  It'd take real work to
> restructure the code such that the current paths are a degenerate case
> of the larger thp page case, but that's the work that needs doing in my
> estimation.

Sorry, but that is how all large page support in the Linux VM works
(both THP and hugetlbfs).

Yes it would be nice if small pages and large pages all ran
in a unified VM. But that's not how Linux is designed today.

Yes having a Pony would be nice too.

Back when huge pages were originally proposed, Linus came
up with the "separate hugetlbfs VM" design, and that is what we're
stuck with today.

Asking for a wholesale VM redesign is just not realistic.

The VM is always changing in baby steps. And the only
known way to do that is to have if (thp) and if (hugetlbfs).

-Andi 

-- 
ak@linux.intel.com -- Speaking for myself only


* Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()
  2013-09-24 23:49   ` Andi Kleen
  2013-09-24 23:58     ` Andrew Morton
@ 2013-09-26 18:30     ` Zach Brown
  2013-09-26 19:05       ` Andi Kleen
  2013-09-30 10:13     ` Mel Gorman
  2 siblings, 1 reply; 27+ messages in thread
From: Zach Brown @ 2013-09-26 18:30 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, Kirill A. Shutemov, Andrea Arcangeli, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, Dave Hansen,
	Ning Qu, Alexander Shishkin, linux-fsdevel, linux-kernel

> > Sigh.  A pox on whoever thought up huge pages. 
> 
> managing 1TB+ of memory in 4K chunks is just insane.
> The question of larger pages is not "if", but only "when".

And "how"!

Sprinkling a bunch of magical if (thp) {} else {} throughout the code
looks like a stunningly bad idea to me.  It'd take real work to
restructure the code such that the current paths are a degenerate case
of the larger thp page case, but that's the work that needs doing in my
estimation.
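
[Ed. note: one way to read this suggestion is to make the 4k path the
degenerate case of a single generic path, e.g. (illustrative only):]

/* Instead of:  if (PageTransHuge(page)) { 2M case } else { 4k case }
 * write one path where a small page is simply nr_pages == 1: */
static void zero_cached_page(struct page *page)
{
	int i, nr = hpage_nr_pages(page);	/* 1 for small pages */

	for (i = 0; i < nr; i++)
		clear_highpage(page + i);
}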

- z


* Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()
  2013-09-25  9:51   ` Kirill A. Shutemov
@ 2013-09-25 23:29     ` Dave Chinner
  2013-10-14 13:56       ` Kirill A. Shutemov
  0 siblings, 1 reply; 27+ messages in thread
From: Dave Chinner @ 2013-09-25 23:29 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, Andrea Arcangeli, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, Dave Hansen,
	Ning Qu, Alexander Shishkin, linux-fsdevel, linux-kernel

On Wed, Sep 25, 2013 at 12:51:04PM +0300, Kirill A. Shutemov wrote:
> Andrew Morton wrote:
> > On Mon, 23 Sep 2013 15:05:28 +0300 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
> > 
> > > It brings thp support for ramfs, but without mmap() -- it will be posted
> > > separately.
> > 
> > We were never going to do this :(
> > 
> > Has anyone reviewed these patches much yet?
> 
Dave did a very good review. A few other people looked at separate patches.
> See Reviewed-by/Acked-by tags in patches.
> 
> It looks like most mm experts are busy with numa balancing nowadays, so
> it's hard to get more review.

Nobody has reviewed it from the filesystem side, though.

The changes that require special code paths for huge pages in the
write_begin/write_end paths are nasty. You're adding conditional
code that depends on the page size and then having to add checks to
ensure that large page operations don't step over small page
boundaries and other such corner cases. It's an extremely fragile
design, IMO.
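
[Ed. note: an illustration of the kind of page-size-conditional code being
criticized here, in a write_begin-style helper. The helper is hypothetical,
not taken from the series:]

static size_t clamp_write_len(struct page *page, loff_t pos, size_t len)
{
	if (PageTransHuge(page)) {
		/* a large page op must not step over its own boundary */
		size_t off = pos & ~HPAGE_PMD_MASK;
		return min(len, (size_t)HPAGE_PMD_SIZE - off);
	}
	return min(len, (size_t)PAGE_CACHE_SIZE -
			(size_t)(pos & ~PAGE_CACHE_MASK));
}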

In general, I don't like all the if (thp) {} else {}; code that this
series introduces - they are code paths that simply won't get tested
with any sort of regularity and make the code more complex for those
that aren't using THP to understand and debug...

Then there is a new per-inode lock that is used in
generic_perform_write() which is held across page faults and calls
to filesystem block mapping callbacks. This inserts into the middle
of an existing locking chain that needs to be strictly ordered, and
as such will lead to the same type of lock inversion problems that
the mmap_sem had.  We do not want to introduce a new lock that has
this same problem just as we are getting rid of that long standing
nastiness from the page fault path...
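
[Ed. note: spelling out the inversion being described, as a sketch;
i_split_sem stands in for the new per-inode lock from the series:]

/* Path A: write(2)
 *   generic_perform_write()
 *     down_read(&inode->i_split_sem);	<- new lock first
 *     iov_iter_fault_in_readable()	-> user page fault
 *       down_read(&mm->mmap_sem);	<- mmap_sem second
 *
 * Path B: page fault on a mapping of the same file
 *   down_read(&mm->mmap_sem);		<- mmap_sem first
 *   ->fault() / block mapping callback
 *     down_read(&inode->i_split_sem);	<- new lock second
 *
 * A and B take the two locks in opposite order: a classic ABBA
 * inversion, the same shape as the long-standing mmap_sem problem.
 */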

I also note that you didn't convert invalidate_inode_pages2_range()
to support huge pages which is needed by real filesystems that
support direct IO. There are other truncate/invalidate interfaces
that you didn't convert, either, and some of them will present you
with interesting locking challenges as a result of adding that new
lock...

> The patchset was mostly ignored for a few rounds, and Dave suggested
> splitting it to get a less scary patch count.

It's still being ignored by filesystem people because you haven't
actually tried to implement support in a real filesystem.....

> > > Please review and consider applying.
> > 
> > It appears rather too immature at this stage.
> 
> More review is always welcome and I'm committed to address issues.

IMO, supporting a real block based filesystem like ext4 or XFS and
demonstrating that everything works is necessary before we go any
further...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()
  2013-09-25 11:15       ` Kirill A. Shutemov
@ 2013-09-25 15:05         ` Andi Kleen
  0 siblings, 0 replies; 27+ messages in thread
From: Andi Kleen @ 2013-09-25 15:05 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, Andrea Arcangeli, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Matthew Wilcox,
	Kirill A. Shutemov, Hillf Danton, Dave Hansen, Ning Qu,
	Alexander Shishkin, linux-fsdevel, linux-kernel

> (it may require a dynamic linker change to align the length to the
> huge page boundary)

x86-64 binaries should already be padded for this.

-Andi


* Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()
  2013-09-24 23:58     ` Andrew Morton
@ 2013-09-25 11:15       ` Kirill A. Shutemov
  2013-09-25 15:05         ` Andi Kleen
  0 siblings, 1 reply; 27+ messages in thread
From: Kirill A. Shutemov @ 2013-09-25 11:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andi Kleen, Kirill A. Shutemov, Andrea Arcangeli, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, Dave Hansen,
	Ning Qu, Alexander Shishkin, linux-fsdevel, linux-kernel

Andrew Morton wrote:
> On Tue, 24 Sep 2013 16:49:50 -0700 Andi Kleen <ak@linux.intel.com> wrote:
> 
> > > At the very least we should get this done for a real filesystem to see
> > > how intrusive the changes are and to evaluate the performance changes.
> > 
> > That would give even larger patches, and people already complain
> > the patchkit is too large.
> 
> The thing is that merging an implementation for ramfs commits us to
> doing it for the major real filesystems.  Before making that commitment
> we should at least have a pretty good understanding of what those
> changes will look like.
> 
> Plus I don't see how we can realistically performance-test it without
> having real physical backing store in the picture?

My plan for real filesystems is to first make it beneficial for read-mostly
files:
 - allocate huge pages on read (or collapse small pages) only if nobody
   has the inode opened for write;
 - split huge pages on write, to avoid dealing with the writeback path at
   first and to dirty only 4k pages;

This will get most elf executables and libraries mapped with huge
pages (it may require a dynamic linker change to align the length to the
huge page boundary), which is not bad for a start.
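
[Ed. note: a sketch of what that gating policy could look like. The names
are illustrative rather than taken from the patches (the changelog above
only mentions transparent_hugepage_pagecache()), and the "no writers" test
just reads i_writecount:]

static bool hugepage_cache_allowed(struct inode *inode)
{
	if (!transparent_hugepage_pagecache())
		return false;
	/* read-mostly heuristic: nobody has the file open for write */
	return atomic_read(&inode->i_writecount) <= 0;
}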

-- 
 Kirill A. Shutemov


* Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()
  2013-09-24 23:37 ` Andrew Morton
  2013-09-24 23:48   ` Ning Qu
  2013-09-24 23:49   ` Andi Kleen
@ 2013-09-25  9:51   ` Kirill A. Shutemov
  2013-09-25 23:29     ` Dave Chinner
  2013-09-30 10:02   ` Mel Gorman
  3 siblings, 1 reply; 27+ messages in thread
From: Kirill A. Shutemov @ 2013-09-25  9:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, Dave Hansen,
	Ning Qu, Alexander Shishkin, linux-fsdevel, linux-kernel

Andrew Morton wrote:
> On Mon, 23 Sep 2013 15:05:28 +0300 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
> 
> > It brings thp support for ramfs, but without mmap() -- it will be posted
> > separately.
> 
> We were never going to do this :(
> 
> Has anyone reviewed these patches much yet?

Dave did a very good review. A few other people looked at separate patches.
See Reviewed-by/Acked-by tags in patches.

It looks like most mm experts are busy with numa balancing nowadays, so
it's hard to get more review.

The patchset was mostly ignored for a few rounds, and Dave suggested
splitting it to get a less scary patch count.

> > Please review and consider applying.
> 
> It appears rather too immature at this stage.

More review is always welcome and I'm committed to address issues.

-- 
 Kirill A. Shutemov


* Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()
  2013-09-25  0:12 ` Ning Qu
@ 2013-09-25  9:23   ` Kirill A. Shutemov
  0 siblings, 0 replies; 27+ messages in thread
From: Kirill A. Shutemov @ 2013-09-25  9:23 UTC (permalink / raw)
  To: Ning Qu
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, Hillf Danton,
	Dave Hansen, Alexander Shishkin, linux-fsdevel, linux-kernel

Ning Qu wrote:
> Hi, Kirill,
> 
> Seems you dropped one patch in v5, is that intentional? Just wondering ...
> 
>   thp, mm: handle tail pages in page_cache_get_speculative()

It's not needed anymore, since we don't have tail pages in radix tree.

-- 
 Kirill A. Shutemov


* Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()
  2013-09-23 12:05 Kirill A. Shutemov
  2013-09-24 23:37 ` Andrew Morton
@ 2013-09-25  0:12 ` Ning Qu
  2013-09-25  9:23   ` Kirill A. Shutemov
  2013-09-26 21:13 ` Dave Hansen
  2 siblings, 1 reply; 27+ messages in thread
From: Ning Qu @ 2013-09-25  0:12 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, Dave Hansen,
	Alexander Shishkin, linux-fsdevel, linux-kernel


Hi, Kirill,

Seems you dropped one patch in v5, is that intentional? Just wondering ...

  thp, mm: handle tail pages in page_cache_get_speculative()

Thanks!

Best wishes,
-- 
Ning Qu (曲宁) | Software Engineer | quning@google.com | +1-408-418-6066


On Mon, Sep 23, 2013 at 5:05 AM, Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:

> It brings thp support for ramfs, but without mmap() -- it will be posted
> separately.
>
> Please review and consider applying.
>
> Intro
> -----
>
> The goal of the project is to prepare the kernel infrastructure to handle
> huge pages in the page cache.
>
> To prove that the proposed changes are functional, we enable the feature
> for the simplest file system -- ramfs. ramfs is not that useful by
> itself, but it's a good pilot project.
>
> Design overview
> ---------------
>
> Every huge page is represented in the page cache radix tree by HPAGE_PMD_NR
> (512 on x86-64) entries. All entries point to the head page -- refcounting
> for tail pages is pretty expensive.
>
> Radix tree manipulations are implemented in a batched way: we add and remove
> a whole huge page at once, under one tree_lock. To make this possible, we
> extended the radix-tree interface to be able to pre-allocate enough memory
> to insert a number of *contiguous* elements (kudos to Matthew Wilcox).
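
[Ed. note: a rough sketch of the batched insert described above. The preload
helper name is assumed from the description, not checked against the patches,
and error unwinding is omitted:]

int add_huge_page_to_cache(struct address_space *mapping,
			   struct page *head, pgoff_t index)
{
	int i, err;

	/* pre-allocate nodes for HPAGE_PMD_NR contiguous slots */
	err = radix_tree_maybe_preload_contig(HPAGE_PMD_NR, GFP_KERNEL);
	if (err)
		return err;

	spin_lock_irq(&mapping->tree_lock);
	for (i = 0; i < HPAGE_PMD_NR; i++) {
		/* every slot points at the head page */
		err = radix_tree_insert(&mapping->page_tree,
					index + i, head);
		if (err)
			break;	/* real code would unwind earlier slots */
	}
	spin_unlock_irq(&mapping->tree_lock);
	radix_tree_preload_end();
	return err;
}
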
>
> Huge pages can be added to the page cache in three ways:
>  - write(2) to a file or page;
>  - read(2) from a sparse file;
>  - fault in a sparse file.
>
> Potentially, one more way is collapsing small pages, but it's outside the
> initial implementation.
>
> For now we still write/read at most PAGE_CACHE_SIZE bytes at a time. There's
> some room for speed-up later.
>
> Since mmap() isn't targeted by this patchset, we just split huge pages on
> page fault.
>
> To minimize memory overhead for small files, we avoid write-allocation in
> the first huge page area (2M on x86-64) of the file.
>
> truncate_inode_pages_range() drops a whole huge page at once if it's fully
> inside the range. If a huge page is only partly in the range, we zero out
> that part, exactly like we do for partial small pages.
>
> split_huge_page() for file pages works similarly to anon pages, but we
> walk mapping->i_mmap rather than anon_vma->rb_root. At the end we call
> truncate_inode_pages() to drop small pages beyond i_size, if any.
>
> inode->i_split_sem taken for read protects hugepages in the inode's
> pagecache against splitting. We take it for write during splitting.
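
[Ed. note: the i_split_sem protocol above, sketched as code (illustrative):]

/* Reader side: pin hugepages in the mapping against splitting. */
static void walk_hugepages_stable(struct inode *inode)
{
	down_read(&inode->i_split_sem);
	/* ... compound pages in inode->i_mapping cannot split here ... */
	up_read(&inode->i_split_sem);
}

/* Split side: exclude all such readers for the duration. */
static int split_file_hugepage(struct inode *inode, struct page *head)
{
	int ret;

	down_write(&inode->i_split_sem);
	ret = split_huge_page(head);
	up_write(&inode->i_split_sem);
	return ret;
}
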
>
> Changes since v5
> ----------------
>  - change how hugepage stored in pagecache: head page for all relevant
>    indexes;
>  - introduce i_split_sem;
>  - do not create huge pages on write(2) into first hugepage area;
>  - compile-disabled by default;
>  - fix transparent_hugepage_pagecache();
>
> Benchmarks
> ----------
>
> Since the patchset doesn't include mmap() support, we shouldn't expect much
> change in performance. We just need to check that we don't introduce any
> major regression.
>
> On average, read/write on ramfs with thp is a bit slower, but I don't think
> it's a stopper -- ramfs is a toy anyway; on real-world filesystems I
> expect the difference to be smaller.
>
> postmark
> ========
>
> workload1:
> chmod +x postmark
> mount -t ramfs none /mnt
> cat >/root/workload1 <<EOF
> set transactions 250000
> set size 5120 524288
> set number 500
> run
> quit
> EOF
>
> workload2:
> set transactions 10000
> set size 2097152 10485760
> set number 100
> run
> quit
>
> throughput (transactions/sec)
>                 workload1       workload2
> baseline        8333            416
> patched         8333            454
>
> FS-Mark
> =======
>
> throughput (files/sec)
>
>                 2000 files by 1M        200 files by 10M
> baseline        5326.1                  548.1
> patched         5192.8                  528.4
>
> tiobench
> ========
>
> baseline:
> Tiotest results for 16 concurrent io threads:
> ,----------------------------------------------------------------------.
> | Item                  | Time     | Rate         | Usr CPU  | Sys CPU |
> +-----------------------+----------+--------------+----------+---------+
> | Write        2048 MBs |    0.2 s | 8667.792 MB/s | 445.2 %  | 5535.9 % |
> | Random Write   62 MBs |    0.0 s | 8341.118 MB/s |   0.0 %  | 2615.8 % |
> | Read         2048 MBs |    0.2 s | 11680.431 MB/s | 339.9 %  | 5470.6 % |
> | Random Read    62 MBs |    0.0 s | 9451.081 MB/s | 786.3 %  | 1451.7 % |
> `----------------------------------------------------------------------'
> Tiotest latency results:
> ,-------------------------------------------------------------------------.
> | Item         | Average latency | Maximum latency | % >2 sec | % >10 sec |
> +--------------+-----------------+-----------------+----------+-----------+
> | Write        |        0.006 ms |       28.019 ms |  0.00000 |   0.00000 |
> | Random Write |        0.002 ms |        5.574 ms |  0.00000 |   0.00000 |
> | Read         |        0.005 ms |       28.018 ms |  0.00000 |   0.00000 |
> | Random Read  |        0.002 ms |        4.852 ms |  0.00000 |   0.00000 |
> |--------------+-----------------+-----------------+----------+-----------|
> | Total        |        0.005 ms |       28.019 ms |  0.00000 |   0.00000 |
> `--------------+-----------------+-----------------+----------+-----------'
>
> patched:
> Tiotest results for 16 concurrent io threads:
> ,----------------------------------------------------------------------.
> | Item                  | Time     | Rate         | Usr CPU  | Sys CPU |
> +-----------------------+----------+--------------+----------+---------+
> | Write        2048 MBs |    0.3 s | 7942.818 MB/s | 442.1 %  | 5533.6 % |
> | Random Write   62 MBs |    0.0 s | 9425.426 MB/s | 723.9 %  | 965.2 % |
> | Read         2048 MBs |    0.2 s | 11998.008 MB/s | 374.9 %  | 5485.8 % |
> | Random Read    62 MBs |    0.0 s | 9823.955 MB/s | 251.5 %  | 2011.9 % |
> `----------------------------------------------------------------------'
> Tiotest latency results:
> ,-------------------------------------------------------------------------.
> | Item         | Average latency | Maximum latency | % >2 sec | % >10 sec |
> +--------------+-----------------+-----------------+----------+-----------+
> | Write        |        0.007 ms |       28.020 ms |  0.00000 |   0.00000 |
> | Random Write |        0.001 ms |        0.022 ms |  0.00000 |   0.00000 |
> | Read         |        0.004 ms |       24.011 ms |  0.00000 |   0.00000 |
> | Random Read  |        0.001 ms |        0.019 ms |  0.00000 |   0.00000 |
> |--------------+-----------------+-----------------+----------+-----------|
> | Total        |        0.005 ms |       28.020 ms |  0.00000 |   0.00000 |
> `--------------+-----------------+-----------------+----------+-----------'
>
> IOZone
> ======
>
> Syscalls, not mmap.
>
> ** Initial writers **
> threads:	          1          2          4          8         10         20         30         40         50         60         70         80
> baseline:	    4741691    7986408    9149064    9898695    9868597    9629383    9469202   11605064    9507802   10641869   11360701   11040376
> patched:	    4682864    7275535    8691034    8872887    8712492    8771912    8397216    7701346    7366853    8839736    8299893   10788439
> speed-up(times):       0.99       0.91       0.95       0.90       0.88       0.91       0.89       0.66       0.77       0.83       0.73       0.98
>
> ** Rewriters **
> threads:	          1          2          4          8         10         20         30         40         50         60         70         80
> baseline:	    5807891    9554869   12101083   13113533   12989751   14359910   16998236   16833861   24735659   17502634   17396706   20448655
> patched:	    6161690    9981294   12285789   13428846   13610058   13669153   20060182   17328347   24109999   19247934   24225103   34686574
> speed-up(times):       1.06       1.04       1.02       1.02       1.05       0.95       1.18       1.03       0.97       1.10       1.39       1.70
>
> ** Readers **
> threads:	          1          2          4          8         10         20         30         40         50         60         70         80
> baseline:	    7978066   11825735   13808941   14049598   14765175   14422642   17322681   23209831   21386483   20060744   22032935   31166663
> patched:	    7723293   11481500   13796383   14363808   14353966   14979865   17648225   18701258   29192810   23973723   22163317   23104638
> speed-up(times):       0.97       0.97       1.00       1.02       0.97       1.04       1.02       0.81       1.37       1.20       1.01       0.74
>
> ** Re-readers **
> threads:	          1          2          4          8         10         20         30         40         50         60         70         80
> baseline:	    7966269   11878323   14000782   14678206   14154235   14271991   15170829   20924052   27393344   19114990   12509316   18495597
> patched:	    7719350   11410937   13710233   13232756   14040928   15895021   16279330   17256068   26023572   18364678   27834483   23288680
> speed-up(times):       0.97       0.96       0.98       0.90       0.99       1.11       1.07       0.82       0.95       0.96       2.23       1.26
>
> ** Reverse readers **
> threads:	          1          2          4          8         10         20         30         40         50         60         70         80
> baseline:	    6630795   10331013   12839501   13157433   12783323   13580283   15753068   15434572   21928982   17636994   14737489   19470679
> patched:	    6502341    9887711   12639278   12979232   13212825   12928255   13961195   14695786   21370667   19873807   20902582   21892899
> speed-up(times):       0.98       0.96       0.98       0.99       1.03       0.95       0.89       0.95       0.97       1.13       1.42       1.12
>
> ** Random_readers **
> threads:	          1          2          4          8         10         20         30         40         50         60         70         80
> baseline:	    5152935    9043813   11752615   11996078   12283579   12484039   14588004   15781507   23847538   15748906   13698335   27195847
> patched:	    5009089    8438137   11266015   11631218   12093650   12779308   17768691   13640378   30468890   19269033   23444358   22775908
> speed-up(times):       0.97       0.93       0.96       0.97       0.98       1.02       1.22       0.86       1.28       1.22       1.71       0.84
>
> ** Random_writers **
> threads:	          1          2          4          8         10         20         30         40         50         60         70         80
> baseline:	    3886268    7405345   10531192   10858984   10994693   12758450   10729531    9656825   10370144   13139452    4528331   12615812
> patched:	    4335323    7916132   10978892   11423247   11790932   11424525   11798171   11413452   12230616   13075887   11165314   16925679
> speed-up(times):       1.12       1.07       1.04       1.05       1.07       0.90       1.10       1.18       1.18       1.00       2.47       1.34
>
> Kirill A. Shutemov (22):
>   mm: implement zero_huge_user_segment and friends
>   radix-tree: implement preload for multiple contiguous elements
>   memcg, thp: charge huge cache pages
>   thp: compile-time and sysfs knob for thp pagecache
>   thp, mm: introduce mapping_can_have_hugepages() predicate
>   thp: represent file thp pages in meminfo and friends
>   thp, mm: rewrite add_to_page_cache_locked() to support huge pages
>   mm: trace filemap: dump page order
>   block: implement add_bdi_stat()
>   thp, mm: rewrite delete_from_page_cache() to support huge pages
>   thp, mm: warn if we try to use replace_page_cache_page() with THP
>   thp, mm: add event counters for huge page alloc on file write or read
>   mm, vfs: introduce i_split_sem
>   thp, mm: allocate huge pages in grab_cache_page_write_begin()
>   thp, mm: naive support of thp in generic_perform_write
>   thp, mm: handle transhuge pages in do_generic_file_read()
>   thp, libfs: initial thp support
>   truncate: support huge pages
>   thp: handle file pages in split_huge_page()
>   thp: wait_split_huge_page(): serialize over i_mmap_mutex too
>   thp, mm: split huge page on mmap file page
>   ramfs: enable transparent huge page cache
>
>  Documentation/vm/transhuge.txt |  16 ++++
>  drivers/base/node.c            |   4 +
>  fs/inode.c                     |   3 +
>  fs/libfs.c                     |  58 +++++++++++-
>  fs/proc/meminfo.c              |   3 +
>  fs/ramfs/file-mmu.c            |   2 +-
>  fs/ramfs/inode.c               |   6 +-
>  include/linux/backing-dev.h    |  10 +++
>  include/linux/fs.h             |  11 +++
>  include/linux/huge_mm.h        |  68 +++++++++++++-
>  include/linux/mm.h             |  18 ++++
>  include/linux/mmzone.h         |   1 +
>  include/linux/page-flags.h     |  13 +++
>  include/linux/pagemap.h        |  31 +++++++
>  include/linux/radix-tree.h     |  11 +++
>  include/linux/vm_event_item.h  |   4 +
>  include/trace/events/filemap.h |   7 +-
>  lib/radix-tree.c               |  94 ++++++++++++++++++--
>  mm/Kconfig                     |  11 +++
>  mm/filemap.c                   | 196 ++++++++++++++++++++++++++++++++---------
>  mm/huge_memory.c               | 147 +++++++++++++++++++++++++++----
>  mm/memcontrol.c                |   3 +-
>  mm/memory.c                    |  40 ++++++++-
>  mm/truncate.c                  | 125 ++++++++++++++++++++------
>  mm/vmstat.c                    |   5 ++
>  25 files changed, 779 insertions(+), 108 deletions(-)
>
> --
> 1.8.4.rc3
>
>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()
  2013-09-24 23:49   ` Andi Kleen
@ 2013-09-24 23:58     ` Andrew Morton
  2013-09-25 11:15       ` Kirill A. Shutemov
  2013-09-26 18:30     ` Zach Brown
  2013-09-30 10:13     ` Mel Gorman
  2 siblings, 1 reply; 27+ messages in thread
From: Andrew Morton @ 2013-09-24 23:58 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Matthew Wilcox,
	Kirill A. Shutemov, Hillf Danton, Dave Hansen, Ning Qu,
	Alexander Shishkin, linux-fsdevel, linux-kernel

On Tue, 24 Sep 2013 16:49:50 -0700 Andi Kleen <ak@linux.intel.com> wrote:

> > At the very least we should get this done for a real filesystem to see
> > how intrusive the changes are and to evaluate the performance changes.
> 
> That would give even larger patches, and people already complain
> the patchkit is too large.

The thing is that merging an implementation for ramfs commits us to
doing it for the major real filesystems.  Before making that commitment
we should at least have a pretty good understanding of what those
changes will look like.

Plus I don't see how we can realistically performance-test it without
having real physical backing store in the picture?

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()
  2013-09-24 23:37 ` Andrew Morton
  2013-09-24 23:48   ` Ning Qu
@ 2013-09-24 23:49   ` Andi Kleen
  2013-09-24 23:58     ` Andrew Morton
                       ` (2 more replies)
  2013-09-25  9:51   ` Kirill A. Shutemov
  2013-09-30 10:02   ` Mel Gorman
  3 siblings, 3 replies; 27+ messages in thread
From: Andi Kleen @ 2013-09-24 23:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Matthew Wilcox,
	Kirill A. Shutemov, Hillf Danton, Dave Hansen, Ning Qu,
	Alexander Shishkin, linux-fsdevel, linux-kernel

On Tue, Sep 24, 2013 at 04:37:40PM -0700, Andrew Morton wrote:
> On Mon, 23 Sep 2013 15:05:28 +0300 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
> 
> > It brings thp support for ramfs, but without mmap() -- it will be posted
> > separately.
> 
> We were never going to do this :(
> 
> Has anyone reviewed these patches much yet?

There already was a lot of review by various people.

This is not the first post, just the latest refactoring.

> > Intro
> > -----
> > 
> > The goal of the project is to prepare kernel infrastructure to handle huge
> > pages in page cache.
> > 
> > To prove that the proposed changes are functional, we enable the feature
> > for the simplest file system -- ramfs. ramfs is not that useful by
> > itself, but it's a good pilot project.
> 
> At the very least we should get this done for a real filesystem to see
> how intrusive the changes are and to evaluate the performance changes.

That would give even larger patches, and people already complain
the patchkit is too large.

The only good way to handle this is baby steps, and you 
have to start somewhere.

> Sigh.  A pox on whoever thought up huge pages. 

Managing 1TB+ of memory in 4K chunks is just insane.
The question of larger pages is not "if", but only "when".

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()
  2013-09-24 23:37 ` Andrew Morton
@ 2013-09-24 23:48   ` Ning Qu
  2013-09-24 23:49   ` Andi Kleen
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 27+ messages in thread
From: Ning Qu @ 2013-09-24 23:48 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, Dave Hansen,
	Alexander Shishkin, linux-fsdevel, linux-kernel

I am working on the tmpfs side on top of this patchset, which I assume has
broader application usage than ramfs.

So far I am working on 3.3, and will probably get my patches ported to
upstream pretty soon. My patchset is also at an early stage, but it does
help us get some solid numbers in our own projects, which is very
convincing. However, I think the benefit does depend on the characteristics
of the job ...



Best wishes,
-- 
Ning Qu (曲宁) | Software Engineer | quning@google.com | +1-408-418-6066


On Tue, Sep 24, 2013 at 4:37 PM, Andrew Morton <akpm@linux-foundation.org> wrote:

> On Mon, 23 Sep 2013 15:05:28 +0300 "Kirill A. Shutemov" <
> kirill.shutemov@linux.intel.com> wrote:
>
> > It brings thp support for ramfs, but without mmap() -- it will be posted
> > separately.
>
> We were never going to do this :(
>
> Has anyone reviewed these patches much yet?
>
> > Please review and consider applying.
>
> It appears rather too immature at this stage.
>
> > Intro
> > -----
> >
> > The goal of the project is to prepare kernel infrastructure to handle huge
> > pages in page cache.
> >
> > To prove that the proposed changes are functional, we enable the feature
> > for the simplest file system -- ramfs. ramfs is not that useful by
> > itself, but it's a good pilot project.
>
> At the very least we should get this done for a real filesystem to see
> how intrusive the changes are and to evaluate the performance changes.
>
>
> Sigh.  A pox on whoever thought up huge pages.  Words cannot express
> how much of a godawful mess they have made of Linux MM.  And it hasn't
> ended yet :( My take is that we'd need to see some very attractive and
> convincing real-world performance numbers before even thinking of
> taking this on.
>
>
>
>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()
  2013-09-23 12:05 Kirill A. Shutemov
@ 2013-09-24 23:37 ` Andrew Morton
  2013-09-24 23:48   ` Ning Qu
                     ` (3 more replies)
  2013-09-25  0:12 ` Ning Qu
  2013-09-26 21:13 ` Dave Hansen
  2 siblings, 4 replies; 27+ messages in thread
From: Andrew Morton @ 2013-09-24 23:37 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara,
	Mel Gorman, linux-mm, Andi Kleen, Matthew Wilcox,
	Kirill A. Shutemov, Hillf Danton, Dave Hansen, Ning Qu,
	Alexander Shishkin, linux-fsdevel, linux-kernel

On Mon, 23 Sep 2013 15:05:28 +0300 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:

> It brings thp support for ramfs, but without mmap() -- it will be posted
> separately.

We were never going to do this :(

Has anyone reviewed these patches much yet?

> Please review and consider applying.

It appears rather too immature at this stage.

> Intro
> -----
> 
> The goal of the project is to prepare kernel infrastructure to handle huge
> pages in page cache.
> 
> To prove that the proposed changes are functional, we enable the feature
> for the simplest file system -- ramfs. ramfs is not that useful by
> itself, but it's a good pilot project.

At the very least we should get this done for a real filesystem to see
how intrusive the changes are and to evaluate the performance changes.


Sigh.  A pox on whoever thought up huge pages.  Words cannot express
how much of a godawful mess they have made of Linux MM.  And it hasn't
ended yet :( My take is that we'd need to see some very attractive and
convincing real-world performance numbers before even thinking of
taking this on.



^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()
@ 2013-09-23 12:05 Kirill A. Shutemov
  2013-09-24 23:37 ` Andrew Morton
                   ` (2 more replies)
  0 siblings, 3 replies; 27+ messages in thread
From: Kirill A. Shutemov @ 2013-09-23 12:05 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, Ning Qu, Alexander Shishkin,
	linux-fsdevel, linux-kernel, Kirill A. Shutemov

It brings thp support for ramfs, but without mmap() -- it will be posted
separately.

Please review and consider applying.

Intro
-----

The goal of the project is to prepare kernel infrastructure to handle huge
pages in page cache.

To prove that the proposed changes are functional, we enable the feature
for the simplest file system -- ramfs. ramfs is not that useful by
itself, but it's a good pilot project.

Design overview
---------------

Every huge page is represented in the page cache radix tree by HPAGE_PMD_NR
(512 on x86-64) entries. All entries point to the head page -- refcounting
for tail pages is pretty expensive.

Radix tree manipulations are implemented in a batched way: we add and remove
a whole huge page at once, under one tree_lock. To make this possible, we
extended the radix-tree interface to pre-allocate enough memory to insert a
number of *contiguous* elements (kudos to Matthew Wilcox).
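
A minimal sketch of what such a batched insert looks like (illustrative
only, not the patch code: radix_tree_maybe_preload_contig() is a stand-in
for the contiguous-preload helper this series adds, and page refcounting
plus full error unwinding are elided):

	static int hugepage_add_to_cache(struct address_space *mapping,
					 struct page *head, pgoff_t index)
	{
		int i, err;

		/* reserve nodes for HPAGE_PMD_NR contiguous slots up front */
		err = radix_tree_maybe_preload_contig(HPAGE_PMD_NR, GFP_KERNEL);
		if (err)
			return err;

		spin_lock_irq(&mapping->tree_lock);
		for (i = 0; i < HPAGE_PMD_NR; i++) {
			/* every slot points to the head page */
			err = radix_tree_insert(&mapping->page_tree,
						index + i, head);
			if (err)
				break;	/* real code unwinds slots 0..i-1 */
		}
		spin_unlock_irq(&mapping->tree_lock);
		radix_tree_preload_end();
		return err;
	}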

Huge pages can be added to the page cache in three ways:
 - write(2) to a file or page;
 - read(2) from a sparse file;
 - fault on a sparse file.

Potentially, one more way is collapsing small pages, but that's outside the
initial implementation.

For now we still write/read at most PAGE_CACHE_SIZE bytes at a time. There's
some room for speedup later.

Since mmap() isn't targeted by this patchset, we just split the huge page on
page fault.

To minimize memory overhead for small files, we avoid write-allocation in
the first huge page area (2M on x86-64) of the file.

truncate_inode_pages_range() drops a whole huge page at once if it's fully
inside the range. If a huge page is only partly in the range, we zero out
that part, exactly as we do for partial small pages.
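
Roughly (a sketch of the policy above, not the patch code; it assumes
zero_huge_user_segment() from the first patch in this series takes
(page, start, end) offsets within the huge page, and that
hpage_start/hpage_end/lstart/lend are byte offsets computed by the caller):

	if (hpage_start >= lstart && hpage_end <= lend) {
		/* huge page fully inside the range: drop it as one unit */
		delete_from_page_cache(page);
	} else {
		/* partial overlap: zero only the covered segment, just
		 * like we do for partially truncated small pages */
		zero_huge_user_segment(page, partial_start, partial_end);
	}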

split_huge_page() for file pages works similarly to anon pages, but we walk
mapping->i_mmap rather than anon_vma->rb_root. At the end we call
truncate_inode_pages() to drop small pages beyond i_size, if any.

inode->i_split_sem taken for read protects hugepages in the inode's
pagecache against splitting. We take it for write during splitting.
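
In other words (a sketch, assuming i_split_sem is an rw_semaphore in
struct inode, as the i_split_sem patch in this series adds):

	/* reader side: huge pages in inode->i_mapping are stable here */
	down_read(&inode->i_split_sem);
	/* ... operate on a possibly-huge page cache page ... */
	up_read(&inode->i_split_sem);

	/* writer side: split_huge_page() */
	down_write(&inode->i_split_sem);
	/* ... split the page and fix up the radix-tree entries ... */
	up_write(&inode->i_split_sem);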

Changes since v5
----------------
 - change how hugepage stored in pagecache: head page for all relevant
   indexes;
 - introduce i_split_sem;
 - do not create huge pages on write(2) into first hugepage area;
 - compile-disabled by default;
 - fix transparent_hugepage_pagecache();

Benchmarks
----------

Since the patchset doesn't include mmap() support, we shouldn't expect much
change in performance. We just need to check that we don't introduce any
major regression.

On average, read/write on ramfs with thp is a bit slower, but I don't think
it's a stopper -- ramfs is a toy anyway; on real-world filesystems I expect
the difference to be smaller.

postmark
========

workload1:
chmod +x postmark
mount -t ramfs none /mnt
cat >/root/workload1 <<EOF
set transactions 250000
set size 5120 524288
set number 500
run
quit
EOF

workload2:
set transactions 10000
set size 2097152 10485760
set number 100
run
quit

throughput (transactions/sec)
                workload1       workload2
baseline        8333            416
patched         8333            454

FS-Mark
=======

throughput (files/sec)

                2000 files by 1M        200 files by 10M
baseline        5326.1                  548.1
patched         5192.8                  528.4

tiobench
========

baseline:
Tiotest results for 16 concurrent io threads:
,----------------------------------------------------------------------.
| Item                  | Time     | Rate         | Usr CPU  | Sys CPU |
+-----------------------+----------+--------------+----------+---------+
| Write        2048 MBs |    0.2 s | 8667.792 MB/s | 445.2 %  | 5535.9 % |
| Random Write   62 MBs |    0.0 s | 8341.118 MB/s |   0.0 %  | 2615.8 % |
| Read         2048 MBs |    0.2 s | 11680.431 MB/s | 339.9 %  | 5470.6 % |
| Random Read    62 MBs |    0.0 s | 9451.081 MB/s | 786.3 %  | 1451.7 % |
`----------------------------------------------------------------------'
Tiotest latency results:
,-------------------------------------------------------------------------.
| Item         | Average latency | Maximum latency | % >2 sec | % >10 sec |
+--------------+-----------------+-----------------+----------+-----------+
| Write        |        0.006 ms |       28.019 ms |  0.00000 |   0.00000 |
| Random Write |        0.002 ms |        5.574 ms |  0.00000 |   0.00000 |
| Read         |        0.005 ms |       28.018 ms |  0.00000 |   0.00000 |
| Random Read  |        0.002 ms |        4.852 ms |  0.00000 |   0.00000 |
|--------------+-----------------+-----------------+----------+-----------|
| Total        |        0.005 ms |       28.019 ms |  0.00000 |   0.00000 |
`--------------+-----------------+-----------------+----------+-----------'

patched:
Tiotest results for 16 concurrent io threads:
,----------------------------------------------------------------------.
| Item                  | Time     | Rate         | Usr CPU  | Sys CPU |
+-----------------------+----------+--------------+----------+---------+
| Write        2048 MBs |    0.3 s | 7942.818 MB/s | 442.1 %  | 5533.6 % |
| Random Write   62 MBs |    0.0 s | 9425.426 MB/s | 723.9 %  | 965.2 % |
| Read         2048 MBs |    0.2 s | 11998.008 MB/s | 374.9 %  | 5485.8 % |
| Random Read    62 MBs |    0.0 s | 9823.955 MB/s | 251.5 %  | 2011.9 % |
`----------------------------------------------------------------------'
Tiotest latency results:
,-------------------------------------------------------------------------.
| Item         | Average latency | Maximum latency | % >2 sec | % >10 sec |
+--------------+-----------------+-----------------+----------+-----------+
| Write        |        0.007 ms |       28.020 ms |  0.00000 |   0.00000 |
| Random Write |        0.001 ms |        0.022 ms |  0.00000 |   0.00000 |
| Read         |        0.004 ms |       24.011 ms |  0.00000 |   0.00000 |
| Random Read  |        0.001 ms |        0.019 ms |  0.00000 |   0.00000 |
|--------------+-----------------+-----------------+----------+-----------|
| Total        |        0.005 ms |       28.020 ms |  0.00000 |   0.00000 |
`--------------+-----------------+-----------------+----------+-----------'

IOZone
======

Syscalls, not mmap.

** Initial writers **
threads:	          1          2          4          8         10         20         30         40         50         60         70         80
baseline:	    4741691    7986408    9149064    9898695    9868597    9629383    9469202   11605064    9507802   10641869   11360701   11040376
patched:	    4682864    7275535    8691034    8872887    8712492    8771912    8397216    7701346    7366853    8839736    8299893   10788439
speed-up(times):       0.99       0.91       0.95       0.90       0.88       0.91       0.89       0.66       0.77       0.83       0.73       0.98

** Rewriters **
threads:	          1          2          4          8         10         20         30         40         50         60         70         80
baseline:	    5807891    9554869   12101083   13113533   12989751   14359910   16998236   16833861   24735659   17502634   17396706   20448655
patched:	    6161690    9981294   12285789   13428846   13610058   13669153   20060182   17328347   24109999   19247934   24225103   34686574
speed-up(times):       1.06       1.04       1.02       1.02       1.05       0.95       1.18       1.03       0.97       1.10       1.39       1.70

** Readers **
threads:	          1          2          4          8         10         20         30         40         50         60         70         80
baseline:	    7978066   11825735   13808941   14049598   14765175   14422642   17322681   23209831   21386483   20060744   22032935   31166663
patched:	    7723293   11481500   13796383   14363808   14353966   14979865   17648225   18701258   29192810   23973723   22163317   23104638
speed-up(times):       0.97       0.97       1.00       1.02       0.97       1.04       1.02       0.81       1.37       1.20       1.01       0.74

** Re-readers **
threads:	          1          2          4          8         10         20         30         40         50         60         70         80
baseline:	    7966269   11878323   14000782   14678206   14154235   14271991   15170829   20924052   27393344   19114990   12509316   18495597
patched:	    7719350   11410937   13710233   13232756   14040928   15895021   16279330   17256068   26023572   18364678   27834483   23288680
speed-up(times):       0.97       0.96       0.98       0.90       0.99       1.11       1.07       0.82       0.95       0.96       2.23       1.26

** Reverse readers **
threads:	          1          2          4          8         10         20         30         40         50         60         70         80
baseline:	    6630795   10331013   12839501   13157433   12783323   13580283   15753068   15434572   21928982   17636994   14737489   19470679
patched:	    6502341    9887711   12639278   12979232   13212825   12928255   13961195   14695786   21370667   19873807   20902582   21892899
speed-up(times):       0.98       0.96       0.98       0.99       1.03       0.95       0.89       0.95       0.97       1.13       1.42       1.12

** Random_readers **
threads:	          1          2          4          8         10         20         30         40         50         60         70         80
baseline:	    5152935    9043813   11752615   11996078   12283579   12484039   14588004   15781507   23847538   15748906   13698335   27195847
patched:	    5009089    8438137   11266015   11631218   12093650   12779308   17768691   13640378   30468890   19269033   23444358   22775908
speed-up(times):       0.97       0.93       0.96       0.97       0.98       1.02       1.22       0.86       1.28       1.22       1.71       0.84

** Random_writers **
threads:	          1          2          4          8         10         20         30         40         50         60         70         80
baseline:	    3886268    7405345   10531192   10858984   10994693   12758450   10729531    9656825   10370144   13139452    4528331   12615812
patched:	    4335323    7916132   10978892   11423247   11790932   11424525   11798171   11413452   12230616   13075887   11165314   16925679
speed-up(times):       1.12       1.07       1.04       1.05       1.07       0.90       1.10       1.18       1.18       1.00       2.47       1.34

Kirill A. Shutemov (22):
  mm: implement zero_huge_user_segment and friends
  radix-tree: implement preload for multiple contiguous elements
  memcg, thp: charge huge cache pages
  thp: compile-time and sysfs knob for thp pagecache
  thp, mm: introduce mapping_can_have_hugepages() predicate
  thp: represent file thp pages in meminfo and friends
  thp, mm: rewrite add_to_page_cache_locked() to support huge pages
  mm: trace filemap: dump page order
  block: implement add_bdi_stat()
  thp, mm: rewrite delete_from_page_cache() to support huge pages
  thp, mm: warn if we try to use replace_page_cache_page() with THP
  thp, mm: add event counters for huge page alloc on file write or read
  mm, vfs: introduce i_split_sem
  thp, mm: allocate huge pages in grab_cache_page_write_begin()
  thp, mm: naive support of thp in generic_perform_write
  thp, mm: handle transhuge pages in do_generic_file_read()
  thp, libfs: initial thp support
  truncate: support huge pages
  thp: handle file pages in split_huge_page()
  thp: wait_split_huge_page(): serialize over i_mmap_mutex too
  thp, mm: split huge page on mmap file page
  ramfs: enable transparent huge page cache

 Documentation/vm/transhuge.txt |  16 ++++
 drivers/base/node.c            |   4 +
 fs/inode.c                     |   3 +
 fs/libfs.c                     |  58 +++++++++++-
 fs/proc/meminfo.c              |   3 +
 fs/ramfs/file-mmu.c            |   2 +-
 fs/ramfs/inode.c               |   6 +-
 include/linux/backing-dev.h    |  10 +++
 include/linux/fs.h             |  11 +++
 include/linux/huge_mm.h        |  68 +++++++++++++-
 include/linux/mm.h             |  18 ++++
 include/linux/mmzone.h         |   1 +
 include/linux/page-flags.h     |  13 +++
 include/linux/pagemap.h        |  31 +++++++
 include/linux/radix-tree.h     |  11 +++
 include/linux/vm_event_item.h  |   4 +
 include/trace/events/filemap.h |   7 +-
 lib/radix-tree.c               |  94 ++++++++++++++++++--
 mm/Kconfig                     |  11 +++
 mm/filemap.c                   | 196 ++++++++++++++++++++++++++++++++---------
 mm/huge_memory.c               | 147 +++++++++++++++++++++++++++----
 mm/memcontrol.c                |   3 +-
 mm/memory.c                    |  40 ++++++++-
 mm/truncate.c                  | 125 ++++++++++++++++++++------
 mm/vmstat.c                    |   5 ++
 25 files changed, 779 insertions(+), 108 deletions(-)

-- 
1.8.4.rc3

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2013-10-14 14:27 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-09-25 18:11 [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap() Ning Qu
  -- strict thread matches above, loose matches on Subject: below --
2013-09-23 12:05 Kirill A. Shutemov
2013-09-24 23:37 ` Andrew Morton
2013-09-24 23:48   ` Ning Qu
2013-09-24 23:49   ` Andi Kleen
2013-09-24 23:58     ` Andrew Morton
2013-09-25 11:15       ` Kirill A. Shutemov
2013-09-25 15:05         ` Andi Kleen
2013-09-26 18:30     ` Zach Brown
2013-09-26 19:05       ` Andi Kleen
2013-09-30 10:13     ` Mel Gorman
2013-09-30 16:05       ` Andi Kleen
2013-09-25  9:51   ` Kirill A. Shutemov
2013-09-25 23:29     ` Dave Chinner
2013-10-14 13:56       ` Kirill A. Shutemov
2013-09-30 10:02   ` Mel Gorman
2013-09-30 10:10     ` Mel Gorman
2013-09-30 18:07       ` Ning Qu
2013-09-30 18:51       ` Andi Kleen
2013-10-01  8:38         ` Mel Gorman
2013-10-01 17:11           ` Ning Qu
2013-10-14 14:27           ` Kirill A. Shutemov
2013-09-30 15:27     ` Dave Hansen
2013-09-30 18:05       ` Ning Qu
2013-09-25  0:12 ` Ning Qu
2013-09-25  9:23   ` Kirill A. Shutemov
2013-09-26 21:13 ` Dave Hansen
