From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail137.messagelabs.com (mail137.messagelabs.com [216.82.249.19])
	by kanga.kvack.org (Postfix) with SMTP id A4E726B01EF
	for ; Mon, 12 Apr 2010 05:26:25 -0400 (EDT)
Date: Mon, 12 Apr 2010 19:26:15 +1000
From: Nick Piggin
Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17
Message-ID: <20100412092615.GY5683@laptop>
References: <4BC0DE84.3090305@redhat.com>
	<20100411104608.GA12828@elte.hu>
	<4BC1B2CA.8050208@redhat.com>
	<20100411120800.GC10952@elte.hu>
	<20100412060931.GP5683@laptop>
	<4BC2BF67.80903@redhat.com>
	<20100412071525.GR5683@laptop>
	<4BC2CF8C.5090108@redhat.com>
	<20100412082844.GU5683@laptop>
	<4BC2E1D6.9040702@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <4BC2E1D6.9040702@redhat.com>
Sender: owner-linux-mm@kvack.org
To: Avi Kivity
Cc: Ingo Molnar, Mike Galbraith, Jason Garrett-Glaser, Andrea Arcangeli,
	Linus Torvalds, Pekka Enberg, Andrew Morton, linux-mm@kvack.org,
	Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins, Rik van Riel,
	Mel Gorman, Dave Hansen, Benjamin Herrenschmidt, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco@redhat.com,
	KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, "Michael S. Tsirkin",
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura
List-ID:

On Mon, Apr 12, 2010 at 12:03:18PM +0300, Avi Kivity wrote:
> On 04/12/2010 11:28 AM, Nick Piggin wrote:
> >
> >>We use the "try" tactic extensively. So long as there's a
> >>reasonable chance of success, and a reasonable fallback on failure,
> >>it's fine.
> >>
> >>Do you think we won't have reasonable success rates? Why?
> >After the memory is fragmented? It's more or less irreversible. So
> >success rates (to fill a specific number of huge pages) will be fine
> >up to a point. Then it will be a continual failure.
>
> So we get just a part of the win, not all of it.

It can degrade over time.
This is the difference. Two identical workloads may have performance X
and Y depending on whether uptime is 1 day or 20 days.

> >Sure, some workloads simply won't trigger fragmentation problems.
> >Others will.
>
> Some workloads benefit from readahead. Some don't. In fact,
> readahead has a higher potential to reduce performance.
>
> Same as with many other optimizations.

Do you see any difference between your examples and this issue?

> >>Why? If you can isolate all the pointers into the dentry, allocate
> >>the new dentry, make the old one point into the new one, hash it,
> >>move the pointers, drop the old dentry.
> >>
> >>Difficult, yes, but insane?
> >Yes.
>
> Well, I'll accept what you say since I'm nowhere near as familiar
> with the code. But maybe someone insane will come along and do it.

And it'll get nacked :) And it's not only dcache that can cause a
problem. This is part of the whole reason it is insane. It is insane
to only fix the dcache, because if you accept the dcache is a problem
that needs such complexity to fix, then you must accept the same for
the inode caches, the buffer head caches, vmas, radix tree nodes, files
etc. no?

> >>Caches have statistical performance. In the long run they average
> >>out. In the short run they can behave badly. Same thing with large
> >>pages, except the runs are longer and the wins are smaller.
> >You don't understand. Caches don't suddenly or slowly stop working.
> >For a particular pattern of workload, they statistically pretty much
> >work the same all the time.
>
> Yet your effective cache size can be reduced by unhappy aliasing of
> physical pages in your working set. It's unlikely but it can
> happen.
>
> For a statistical mix of workloads, huge pages will also work just
> fine. Perhaps not all of them, but most (those that don't fill
> _all_ of memory with dentries).
Like I said, you don't need to fill all memory with dentries, you just
need to be allocating higher order kernel memory and end up fragmenting
your reclaimable pools.

And it's not a statistical mix that is the problem. The problem is that
the workloads that do cause fragmentation problems will run well for 1
day or 5 days and then degrade. And it is impossible to know what will
degrade and what won't and by how much.

I'm not saying this is a showstopper, but it does really suck.

> >>Databases are the easiest case, they allocate memory up front and
> >>don't give it up. We'll coalesce their memory immediately and
> >>they'll run happily ever after.
> >Again, you're thinking about a benchmark setup. If you've got various
> >admin things, backups, scripts running, probably web servers,
> >application servers etc. Then it's not all that simple.
>
> These are all anonymous/pagecache loads, which we deal with well.

Huh? They also involve sockets, files, and involve all of the above
data structures I listed and many more.

> >And yes, Linux works pretty well for a multi-workload platform. You
> >might be thinking too much about virtualization where you put things
> >in sterile little boxes and take the performance hit.
> >
>
> People do it for a reason.

The reasoning is not always sound though. And also people do other
things. Including increasingly better containers and workload
management in the single kernel.

> >>Virtualization will fragment on overcommit, but the load is all
> >>anonymous memory, so it's easy to defragment. Very little dcache on
> >>the host.
> >If virtualization is the main worry (which it seems that it is
> >seeing as your TLB misses cost like 6 times more cachelines),
>
> (just 2x)
>
> >then complexity should be pushed into the hypervisor, not the
> >core kernel.
>
> The whole point behind kvm is to reuse the Linux core. If we have
> to reimplement Linux memory management and scheduling, then it's a
> failure.
And if you need to add complexity to the Linux core for it, it's
also a failure.

I'm not saying to reimplement things, but if you had a little bit
more support perhaps. Anyway it's just ideas, I'm not saying that
transparent hugepages is wrong simply because KVM is a big user and it
could be implemented in another way.

But if it is possible for KVM to use libhugetlbfs with just a bit of
support from the kernel, then it goes some way to reducing the need for
transparent hugepages.

> >>Well, I'm not against it, but that would be a much more intrusive
> >>change than what this thread is about. Also, you'd need 4K dentries
> >>etc, no?
> >No. You'd just be defragmenting 4K worth of dentries at a time.
> >Dentries (and anything that doesn't care about untranslated KVA)
> >are trivial. Zero change for users of the code.
>
> I see.
>
> >This is going off-topic though, I don't want to hijack the thread
> >with talk of nonlinear kernel.
>
> Too bad, it's interesting.

It sure is, we can start another thread.

> >>Mostly we need a way of identifying pointers into a data structure,
> >>like rmap (after all that's what makes transparent hugepages work).
> >And that involves auditing and rewriting anything that allocates
> >and pins kernel memory. It's not only dentries.
>
> Not everything, just the major users that can scale with the amount
> of memory in the machine.

Well you need to audit, to determine if it is going to be a problem or
not, and it is more than only dentries. (but even dentries would be a
nightmare considering how widely they're used and how much they're
passed around the vfs and filesystems).