Date: Tue, 3 Nov 2009 12:18:29 +0100
From: Andrea Arcangeli
Subject: Re: RFC: Transparent Hugepage support
Message-ID: <20091103111829.GJ11981@random.random>
In-Reply-To: <1257024567.7907.17.camel@pasglop>
References: <20091026185130.GC4868@random.random> <1257024567.7907.17.camel@pasglop>
To: Benjamin Herrenschmidt
Cc: linux-mm@kvack.org, Marcelo Tosatti, Adam Litke, Avi Kivity,
    Izik Eidus, Hugh Dickins, Nick Piggin, Andrew Morton

On Sun, Nov 01, 2009 at 08:29:27AM +1100, Benjamin Herrenschmidt wrote:
> This isn't possible on all architectures. Some archs have "segment"
> constraints which mean only one page size per such "segment". Server
> ppc's for example (segment size being either 256M or 1T depending on
> the CPU).

Hmm, 256M is already too large for a transparent allocation. It will
require reservation, and hugetlbfs actually seems to me a perfect fit
for this hardware limitation. The software limits of hugetlbfs match
the hardware limit perfectly, and it already provides all the
permission and reservation features needed to deal with extremely
large page sizes that probabilistically would never be found in the
buddy allocator (even if we were to extend it to make that not
impossible), and that are hugely expensive to defrag dynamically even
if we could [and we can't hope to defrag many of those because of
slab].

Just in case it's not obvious: the probability that we can defrag
degrades exponentially as the hugepage size increases (which also
means 256M is already orders of magnitude more realistic than 1G).
Clearly if we changed slab to allocate through a front allocator in
256M chunks our probability would increase substantially, but to make
something realistic there has to be at minimum a factor on the order
of 10000 between the hugepage size and the total ram size. I.e. if a
2M page makes some probabilistic sense with slab front-allocating 2M
pages on a 64G system, then for 256M pages to make equivalent sense
the system would require a minimum of 8 terabytes of ram, and with 1G
pages it would require 32 terabytes of ram (not to mention the bigger
overhead and trouble we would have, considering some allocations
would still happen in 4k ptes, and the fixed overhead of relocating
those 4k ranges would be much bigger if the hugepage size is a lot
larger than 2M while the regular page size is still 4k).

> > The most important design choice is: always fallback to 4k allocation
> > if the hugepage allocation fails! This is the _very_ opposite of some
> > large pagecache patches that failed with -EIO back then if a 64k (or
> > similar) allocation failed...
>
> Precisely because the approach cannot work on all architectures ?

I thought the main reason for those patches was to allow a fs
blocksize bigger than PAGE_SIZE; a PAGE_CACHE_SIZE of 64k would allow
a 64k fs blocksize without many fs changes. But yes, if the mmu can't
fall back, then software can't fall back either, and so it impedes
the transparent design on those architectures... To me hugetlbfs
looks like the best you can get on those mmus.
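[A quick worked check of the constant factor discussed above: all
three system-size examples reduce to the same ratio, which the text
rounds down to "an order of 10000".]

    64 GB /   2 MB = 32768
     8 TB / 256 MB = 32768
    32 TB /   1 GB = 32768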
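[For readers following the "always fall back to 4k" discussion, here
is a minimal userspace sketch of the idea of trying a hugepage first
and silently degrading to normal 4k-backed pages. It is only an
analogy for the design point, not the kernel-side fault path of the
patch; the helper name and the 2M length are made up for the example,
and MAP_HUGETLB requires a kernel of roughly the same vintage as this
thread.]

/*
 * Illustrative userspace analogy only (not the kernel-side THP fault
 * path): ask for hugepage-backed memory first and, if the kernel
 * cannot provide it, silently fall back to a normal 4k-backed mapping
 * instead of failing with an error.
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

#ifndef MAP_HUGETLB
#define MAP_HUGETLB 0x40000	/* x86 value; assumed if the headers lack it */
#endif

/* Hypothetical helper name, made up for this example. */
static void *map_preferring_hugepages(size_t len)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (p != MAP_FAILED)
		return p;	/* got hugepage backing */

	/* Fallback: the same mapping backed by regular 4k pages. */
	return mmap(NULL, len, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}

int main(void)
{
	size_t len = 2UL << 20;	/* one 2M hugepage worth of memory */
	void *p = map_preferring_hugepages(len);

	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	printf("mapped %zu bytes at %p\n", len, p);
	munmap(p, len);
	return 0;
}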