Re: RFC: Transparent Hugepage support - Benjamin Herrenschmidt

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: linux-mm@kvack.org, Marcelo Tosatti <mtosatti@redhat.com>,
	Adam Litke <agl@us.ibm.com>, Avi Kivity <avi@redhat.com>,
	Izik Eidus <ieidus@redhat.com>,
	Hugh Dickins <hugh.dickins@tiscali.co.uk>,
	Nick Piggin <npiggin@suse.de>,
	Andrew Morton <akpm@linux-foundation.org>
Subject: Re: RFC: Transparent Hugepage support
Date: Wed, 04 Nov 2009 15:10:26 +1100	[thread overview]
Message-ID: <1257307826.13611.45.camel@pasglop> (raw)
In-Reply-To: <20091103111829.GJ11981@random.random>

On Tue, 2009-11-03 at 12:18 +0100, Andrea Arcangeli wrote:
> On Sun, Nov 01, 2009 at 08:29:27AM +1100, Benjamin Herrenschmidt wrote:
> > This isn't possible on all architectures. Some archs have "segment"
> > constraints which mean only one page size per such "segment". Server
> > ppc's for example (segment size being either 256M or 1T depending on the
> > CPU).
> 
> Hmm 256M is already too large for a transparent allocation. 

Right.

> It will
> require reservation and hugetlbfs to me actually seems a perfect fit
> for this hardware limitation. The software limits of hugetlbfs matches
> the hardware limit perfectly and it already provides all necessary
> permission and reservation features needed to deal with extremely huge
> page sizes that probabilistically would never be found in the buddy
> (even if we were to extend it to make it not impossible). 

Yes. Note that powerpc -embedded- processors don't have that limitation
though (in large part because they are mostly SW loaded TLBs and they
support a wider collection of page sizes). So it would be possible to
implement your transparent scheme on those.

> That are
> hugely expensive to defrag dynamically even if we could [and we can't
> hope to defrag many of those because of slab]. Just in case it's not
> obvious the probability we can defrag degrades exponentially with the
> increase of the hugepagesize (which also means 256M is already orders
> of magnitude more realistic to function than than 1G).

True. 256M might even be worth toying with as an experiment on huge
machines with TBs of memory in fact :-)

>  Clearly if we
> increase slab to allocate with a front allocator in 256M chunk then
> our probability increases substantially, but to make something
> realistic there's at minimum an order of 10000 times between
> hugepagesize and total ram size. I.e. if 2M page makes some
> probabilistic sense with slab front-allocating 2M pages on a 64G
> system, for 256M pages to make an equivalent sense, system would
> require minimum 8Terabyte of ram.

Well... such systems aren't that far around the corner, so as I said, it
might still make sense to toy a bit with it. That would definitely -not-
include my G5 workstation though :-)

>  If pages were 1G sized system would
> require 32 Terabyte of ram (and the bigger overhead and trouble we
> would have considering some allocation would still happen in 4k ptes
> and the fixed overhead of relocating those 4k ranges would be much
> bigger if the hugepage size is a lot bigger than 2M and the regular
> page size is still 4k).
> 
> > > The most important design choice is: always fallback to 4k allocation
> > > if the hugepage allocation fails! This is the _very_ opposite of some
> > > large pagecache patches that failed with -EIO back then if a 64k (or
> > > similar) allocation failed...
> > 
> > Precisely because the approach cannot work on all architectures ?
> 
> I thought the main reason for those patches was to allow a fs
> blocksize bigger than PAGE_SIZE, a PAGE_CACHE_SIZE of 64k would allow
> for a 64k fs blocksize without much fs changes. But yes, if the mmu
> can't fallback, then software can't fallback either and so it impedes
> the transparent design on those architectures... To me hugetlbfs looks
> as best as you can get on those mmu.

Right.

I need to look whether your patch would work "better" for us with our
embedded processors though.

Cheers,
Ben.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

     prev parent reply	other threads:[~2009-11-04  4:10 UTC|newest]

Thread overview: 37+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-10-26 18:51 Andrea Arcangeli
2009-10-27 15:41 ` Rik van Riel
2009-10-27 18:18 ` Andi Kleen
2009-10-27 19:30   ` Andrea Arcangeli
2009-10-28  4:28     ` Andi Kleen
2009-10-28 12:00       ` Andrea Arcangeli
2009-10-28 14:18         ` Andi Kleen
2009-10-28 14:54           ` Adam Litke
2009-10-28 15:13             ` Andi Kleen
2009-10-28 15:30               ` Andrea Arcangeli
2009-10-29 15:59             ` Dave Hansen
2009-10-31 21:32             ` Benjamin Herrenschmidt
2009-10-28 15:48           ` Andrea Arcangeli
2009-10-28 16:03             ` Andi Kleen
2009-10-28 16:22               ` Andrea Arcangeli
2009-10-28 16:34                 ` Andi Kleen
2009-10-28 16:56                   ` Adam Litke
2009-10-28 17:18                     ` Andi Kleen
2009-10-28 19:04                   ` Andrea Arcangeli
2009-10-28 19:22                     ` Andrea Arcangeli
2009-10-29  9:43       ` Ingo Molnar
2009-10-29 10:36         ` Andrea Arcangeli
2009-10-29 16:50           ` Mike Travis
2009-10-30  0:40           ` KAMEZAWA Hiroyuki
2009-11-03 10:55             ` Andrea Arcangeli
2009-11-04  0:36               ` KAMEZAWA Hiroyuki
2009-10-29 12:54     ` Andrea Arcangeli
2009-10-27 20:42 ` Christoph Lameter
2009-10-27 18:21   ` Andrea Arcangeli
2009-10-27 20:25     ` Chris Wright
2009-10-29 18:51       ` Christoph Lameter
2009-11-01 10:56         ` Andrea Arcangeli
2009-10-29 18:55     ` Christoph Lameter
2009-10-31 21:29 ` Benjamin Herrenschmidt
2009-11-03 11:18   ` Andrea Arcangeli
2009-11-03 19:10     ` Dave Hansen
2009-11-04  4:10     ` Benjamin Herrenschmidt [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1257307826.13611.45.camel@pasglop \
    --to=benh@kernel.crashing.org \
    --cc=aarcange@redhat.com \
    --cc=agl@us.ibm.com \
    --cc=akpm@linux-foundation.org \
    --cc=avi@redhat.com \
    --cc=hugh.dickins@tiscali.co.uk \
    --cc=ieidus@redhat.com \
    --cc=linux-mm@kvack.org \
    --cc=mtosatti@redhat.com \
    --cc=npiggin@suse.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox