From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with SMTP id E6A7D6B01F1 for ; Mon, 12 Apr 2010 05:04:01 -0400 (EDT) Message-ID: <4BC2E1D6.9040702@redhat.com> Date: Mon, 12 Apr 2010 12:03:18 +0300 From: Avi Kivity MIME-Version: 1.0 Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 References: <4BC0CFF4.5000207@redhat.com> <20100410194751.GA23751@elte.hu> <4BC0DE84.3090305@redhat.com> <20100411104608.GA12828@elte.hu> <4BC1B2CA.8050208@redhat.com> <20100411120800.GC10952@elte.hu> <20100412060931.GP5683@laptop> <4BC2BF67.80903@redhat.com> <20100412071525.GR5683@laptop> <4BC2CF8C.5090108@redhat.com> <20100412082844.GU5683@laptop> In-Reply-To: <20100412082844.GU5683@laptop> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Nick Piggin Cc: Ingo Molnar , Mike Galbraith , Jason Garrett-Glaser , Andrea Arcangeli , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: On 04/12/2010 11:28 AM, Nick Piggin wrote: > >> We use the "try" tactic extensively. So long as there's a >> reasonable chance of success, and a reasonable fallback on failure, >> it's fine. >> >> Do you think we won't have reasonable success rates? Why? >> > After the memory is fragmented? It's more or less irriversable. So > success rates (to fill a specific number of huges pages) will be fine > up to a point. Then it will be a continual failure. > So we get just a part of the win, not all of it. > Sure, some workloads simply won't trigger fragmentation problems. > Others will. > Some workloads benefit from readahead. Some don't. In fact, readahead has a higher potential to reduce performance. Same as with many other optimizations. >> Why? If you can isolate all the pointers into the dentry, allocate >> the new dentry, make the old one point into the new one, hash it, >> move the pointers, drop the old dentry. >> >> Difficult, yes, but insane? >> > Yes. > Well, I'll accept what you say since I'm nowhere near as familiar with the code. But maybe someone insane will come along and do it. >> Caches have statistical performance. In the long run they average >> out. In the short run they can behave badly. Same thing with large >> pages, except the runs are longer and the wins are smaller. >> > You don't understand. Caches don't suddenly or slowly stop working. > For a particular pattern of workload, they statistically pretty much > work the same all the time. > Yet your effective cache size can be reduced by unhappy aliasing of physical pages in your working set. It's unlikely but it can happen. For a statistical mix of workloads, huge pages will also work just fine. Perhaps not all of them, but most (those that don't fill _all_ of memory with dentries). >> Database are the easiest case, they allocate memory up front and >> don't give it up. We'll coalesce their memory immediately and >> they'll run happily ever after. >> > Again, you're thinking about a benchmark setup. If you've got various > admin things, backups, scripts running, probably web servers, > application servers etc. Then it's not all that simple. > These are all anonymous/pagecache loads, which we deal with well. > And yes, Linux works pretty well for a multi-workload platform. You > might be thinking too much about virtualization where you put things > in sterile little boxes and take the performance hit. > > People do it for a reason. >> Virtualization will fragment on overcommit, but the load is all >> anonymous memory, so it's easy to defragment. Very little dcache on >> the host. >> > If virtualization is the main worry (which it seems that it is > seeing as your TLB misses cost like 6 times more cachelines), > (just 2x) > then complexity should be pushed into the hypervisor, not the > core kernel. > The whole point behind kvm is to reuse the Linux core. If we have to reimplement Linux memory management and scheduling, then it's a failure. >> Well, I'm not against it, but that would be a much more intrusive >> change than what this thread is about. Also, you'd need 4K dentries >> etc, no? >> > No. You'd just be defragmenting 4K worth of dentries at a time. > Dentries (and anything that doesn't care about untranslated KVA) > are trivial. Zero change for users of the code. > I see. > This is going off-topic though, I don't want to hijack the thread > with talk of nonlinear kernel. > Too bad, it's interesting. >> Mostly we need a way of identifying pointers into a data structure, >> like rmap (after all that's what makes transparent hugepages work). >> > And that involves auditing and rewriting anything that allocates > and pins kernel memory. It's not only dentries. > Not everything, just the major users that can scale with the amount of memory in the machine. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org