From: Avi Kivity <avi@redhat.com>
To: Nick Piggin <npiggin@suse.de>
Cc: Ingo Molnar <mingo@elte.hu>, Mike Galbraith <efault@gmx.de>,
Jason Garrett-Glaser <darkshikari@gmail.com>,
Andrea Arcangeli <aarcange@redhat.com>,
Linus Torvalds <torvalds@linux-foundation.org>,
Pekka Enberg <penberg@cs.helsinki.fi>,
Andrew Morton <akpm@linux-foundation.org>,
linux-mm@kvack.org, Marcelo Tosatti <mtosatti@redhat.com>,
Adam Litke <agl@us.ibm.com>, Izik Eidus <ieidus@redhat.com>,
Hugh Dickins <hugh.dickins@tiscali.co.uk>,
Rik van Riel <riel@redhat.com>, Mel Gorman <mel@csn.ul.ie>,
Dave Hansen <dave@linux.vnet.ibm.com>,
Benjamin Herrenschmidt <benh@kernel.crashing.org>,
Mike Travis <travis@sgi.com>,
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
Christoph Lameter <cl@linux-foundation.org>,
Chris Wright <chrisw@sous-sol.org>,
bpicco@redhat.com,
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,
Balbir Singh <balbir@linux.vnet.ibm.com>,
Arnd Bergmann <arnd@arndb.de>,
"Michael S. Tsirkin" <mst@redhat.com>,
Peter Zijlstra <peterz@infradead.org>,
Johannes Weiner <hannes@cmpxchg.org>,
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17
Date: Mon, 12 Apr 2010 13:02:34 +0300
Message-ID: <4BC2EFBA.5080404@redhat.com>
In-Reply-To: <20100412092615.GY5683@laptop>
On 04/12/2010 12:26 PM, Nick Piggin wrote:
> On Mon, Apr 12, 2010 at 12:03:18PM +0300, Avi Kivity wrote:
>
>> On 04/12/2010 11:28 AM, Nick Piggin wrote:
>>
>>>
>>>> We use the "try" tactic extensively. So long as there's a
>>>> reasonable chance of success, and a reasonable fallback on failure,
>>>> it's fine.
>>>>
>>>> Do you think we won't have reasonable success rates? Why?
>>>>
>>> After the memory is fragmented? It's more or less irreversible. So
>>> success rates (to fill a specific number of huge pages) will be fine
>>> up to a point. Then it will be a continual failure.
>>>
>> So we get just a part of the win, not all of it.
>>
> It can degrade over time. This is the difference. Two identical workloads
> may have performance X and Y depending on whether uptime is 1 day or 20
> days.
>
I don't see why it will degrade. Antifrag will prefer to allocate
dcache near existing dcache.
The only scenario I can see where it degrades is a dcache load that
spills over into all of memory and then recedes, leaving a pinned
page in every huge frame. It can happen, but I don't see it as a likely
scenario. But maybe I'm missing something.
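
To put rough numbers on why that worst case would hurt so badly if it
did happen (a back-of-envelope sketch, assuming 2MiB huge frames, 4KiB
base pages and 16GiB of RAM): one pinned base page per huge frame is
enough to block every frame while pinning only about 0.2% of memory.

/* Back-of-envelope illustration (mine, not from the patch set): how
 * little pinned memory it takes to block every huge frame if one
 * unmovable 4KiB page ends up in each 2MiB frame.
 */
#include <stdio.h>

int main(void)
{
	const long long huge_frame = 2LL << 20;   /* 2 MiB huge frame */
	const long long base_page  = 4LL << 10;   /* 4 KiB base page  */
	const long long mem        = 16LL << 30;  /* 16 GiB of RAM, say */

	long long frames = mem / huge_frame;
	/* worst case: exactly one pinned base page lands in each frame */
	long long pinned = frames * base_page;

	printf("huge frames:   %lld\n", frames);
	printf("pinned memory: %lld MiB (%.2f%% of RAM)\n",
	       pinned >> 20, 100.0 * (double)pinned / (double)mem);
	return 0;
}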
>>> Sure, some workloads simply won't trigger fragmentation problems.
>>> Others will.
>>>
>> Some workloads benefit from readahead. Some don't. In fact,
>> readahead has a higher potential to reduce performance.
>>
>> Same as with many other optimizations.
>>
> Do you see any difference with your examples and this issue?
>
Memory layout is more persistent. Well, disk layout is even more
persistent. Still we do extents, and if our disk is fragmented, we take
the hit.
>> Well, I'll accept what you say since I'm nowhere near as familiar
>> with the code. But maybe someone insane will come along and do it.
>>
> And it'll get nacked :) And it's not only dcache that can cause a
> problem. This is part of the whole reason it is insane. It is insane
> to only fix the dcache, because if you accept the dcache is a problem
> that needs such complexity to fix, then you must accept the same for
> the inode caches, the buffer head caches, vmas, radix tree nodes, files
> etc. no?
>
inodes come with dcache, yes. I thought buffer heads are now a much
smaller load. vmas usually don't scale up with memory. If you have a
lot of radix tree nodes, then you also have a lot of pagecache, so the
radix tree nodes can be contained. Open files also don't scale with memory.
>> Yet your effective cache size can be reduced by unhappy aliasing of
>> physical pages in your working set. It's unlikely but it can
>> happen.
>>
>> For a statistical mix of workloads, huge pages will also work just
>> fine. Perhaps not all of them, but most (those that don't fill
>> _all_ of memory with dentries).
>>
> Like I said, you don't need to fill all memory with dentries, you
> just need to be allocating higher order kernel memory and end up
> fragmenting your reclaimable pools.
>
Allocate those higher order pages from the same huge frame.
> And it's not a statistical mix that is the problem. The problem is
> that the workloads that do cause fragmentation problems will run well
> for 1 day or 5 days and then degrade. And it is impossible to know
> what will degrade and what won't and by how much.
>
> I'm not saying this is a showstopper, but it does really suck.
>
>
Can you suggest a real life test workload so we can investigate it?
>> These are all anonymous/pagecache loads, which we deal with well.
>>
> Huh? They also involve sockets, files, and involve all of the above
> data structures I listed and many more.
>
A few thousand sockets and open files is chicken feed for a server.
They'll kill a few huge frames but won't significantly affect the rest
of memory.
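
As a rough sizing sketch (the per-object cost is an assumption, not a
measurement): a few thousand sockets and files add up to a handful of
megabytes of kernel objects, i.e. a handful of 2MiB frames if slab
packing keeps them together.

/* Rough sizing sketch (assumed per-object cost, not measured): kernel
 * footprint of a few thousand open files/sockets versus 2MiB frames.
 */
#include <stdio.h>

int main(void)
{
	const long nr_objects = 5000;      /* "a few thousand" sockets + files */
	const long obj_bytes  = 1024;      /* ~1KiB each incl. file, inode, dentry (assumed) */
	const long huge_frame = 2L << 20;  /* 2 MiB */

	long total  = nr_objects * obj_bytes;
	long frames = (total + huge_frame - 1) / huge_frame;

	printf("kernel object memory: ~%ld KiB\n", total >> 10);
	printf("huge frames consumed if slab keeps them packed: ~%ld\n", frames);
	return 0;
}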
>
>
>>> And yes, Linux works pretty well for a multi-workload platform. You
>>> might be thinking too much about virtualization where you put things
>>> in sterile little boxes and take the performance hit.
>>>
>>>
>> People do it for a reason.
>>
> The reasoning is not always sound though. And also people do other
> things. Including increasingly better containers and workload
> management in the single kernel.
>
Containers are wonderful but still a future thing, and even when fully
implemented they won't offer the same isolation as virtualization. For
example, the owner of workload A might want to upgrade the kernel to
fix a bug he's hitting, while the owner of workload B needs three
months to test it.
>> The whole point behind kvm is to reuse the Linux core. If we have
>> to reimplement Linux memory management and scheduling, then it's a
>> failure.
>>
> And if you need to add complexity to the Linux core for it, it's
> also a failure.
>
Well, we need to add complexity, and we already have. If the acceptance
criterion for a feature were 'no new complexity', the kernel would be a
lot smaller than it is now.
Everything has to be evaluated on the basis of its generality, the
benefit, the importance of the subsystem that needs it, and the impact
on the code. Huge pages are already used in server loads, so they're
not specific to kvm. The benefit, 5-15%, is significant. You and Linus
might not be interested in virtualization, but a significant and growing
fraction of hosts are virtualized; it's up to us whether they run Linux
or something else. And I trust Andrea and the reviewers here to keep the
code impact sane.
> I'm not saying to reimplement things, but if you had a little bit
> more support perhaps. Anyway it's just ideas, I'm not saying that
> transparent hugepages is wrong simply because KVM is a big user and it
> could be implemented in another way.
>
What do you mean by 'more support'?
> But if it is possible for KVM to use libhugetlb with just a bit of
> support from the kernel, then it goes some way to reducing the
> need for transparent hugepages.
>
kvm already works with hugetlbfs, but it's brittle: it means we have to
choose between performance and overcommit.
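
For context, a minimal sketch (not the actual kvm/qemu code; the
hugetlbfs mount point and the MADV_HUGEPAGE value are assumptions) of
the two ways to back guest memory with huge pages: hugetlbfs memory is
reserved up front and never swapped, while the madvise() hint from this
patch series leaves the mapping as ordinary, overcommittable anonymous
memory that gets huge pages opportunistically.

/* Sketch only: compare hugetlbfs-backed guest RAM with a THP hint.
 * "/hugepages" is an assumed hugetlbfs mount point; MADV_HUGEPAGE's
 * value is taken from this patch series and may differ.
 */
#define _GNU_SOURCE
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

#ifndef MADV_HUGEPAGE
#define MADV_HUGEPAGE 14
#endif

#define GUEST_RAM (1UL << 30)		/* 1 GiB of guest memory, say */

/* hugetlbfs: huge pages reserved up front, never swapped, so we pick
 * either performance (huge pages) or overcommit, not both. */
static void *alloc_hugetlbfs(void)
{
	int fd = open("/hugepages/guest-ram", O_CREAT | O_RDWR, 0600);
	void *p;

	if (fd < 0)
		return MAP_FAILED;
	if (ftruncate(fd, GUEST_RAM) < 0) {
		close(fd);
		return MAP_FAILED;
	}
	p = mmap(NULL, GUEST_RAM, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd);
	return p;
}

/* transparent hugepages: plain anonymous memory plus a hint; it stays
 * swappable and overcommittable, huge pages are used when available. */
static void *alloc_thp(void)
{
	void *p = mmap(NULL, GUEST_RAM, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p != MAP_FAILED)
		madvise(p, GUEST_RAM, MADV_HUGEPAGE);
	return p;
}

int main(void)
{
	printf("hugetlbfs mapping:  %p\n", alloc_hugetlbfs());
	printf("thp-hinted mapping: %p\n", alloc_thp());
	return 0;
}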
>> Not everything, just the major users that can scale with the amount
>> of memory in the machine.
>>
> Well you need to audit, to determine if it is going to be a problem or
> not, and it is more than only dentries. (but even dentries would be a
> nightmare considering how widely they're used and how much they're
> passed around the vfs and filesystems).
>
Pages are passed around everywhere as well. When something is locked or
its reference count doesn't match the reachable pointer count, you give
up. Only a small number of objects are in active use at any one time.
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.