Re: RFC: Transparent Hugepage support

From: Andrea Arcangeli <aarcange@redhat.com>
To: Andi Kleen <andi@firstfloor.org>
Cc: linux-mm@kvack.org, Marcelo Tosatti <mtosatti@redhat.com>,
	Adam Litke <agl@us.ibm.com>, Avi Kivity <avi@redhat.com>,
	Izik Eidus <ieidus@redhat.com>,
	Hugh Dickins <hugh.dickins@tiscali.co.uk>,
	Nick Piggin <npiggin@suse.de>,
	Andrew Morton <akpm@linux-foundation.org>
Subject: Re: RFC: Transparent Hugepage support
Date: Wed, 28 Oct 2009 16:48:27 +0100
Message-ID: <20091028154827.GF9640@random.random>
In-Reply-To: <20091028141803.GQ7744@basil.fritz.box>

On Wed, Oct 28, 2009 at 03:18:03PM +0100, Andi Kleen wrote:
> Why glibc?
> Yes, there are quite a few workloads that benefit.

That's what I meant: I said glibc to mean not just KVM (as Chris
pointed out before ;)

> Even without automatic allocation and the need to prereserve,
> having the same application interface for 1GB pages is still useful.
> Otherwise people who want to use the 1GB pages have to do the
> special hacks again.

They will have to do the special hacks for reservation... Not many
other hacks after that, provided they accept that reserved memory
becomes unswappable. Then it depends on how you want to grant
permission to use the reserved areas. It's all reservation logic that
you need in order to use 1G pages with this.

> What I was thinking of was to have a relatively easy to use
> flag that allows an application to use prereserved GB pages
> transparently. e.g. could be done with a special command
> 
> hugepagehint 1GB app
> 
> Yes I realize that this is possible to some extent with libhugetlbfs
> LD_PRELOAD, but integrating it in the kernel is much saner.
> 
> So even if there are some restrictions it would be good to not
> ignore the 1GB pages completely.

I think we should ignore them in the first round of patches, knowing
this model can fit them later if we just add reservation logic and
pud_trans_huge everywhere. I don't think we need to provide this
immediately as it'd grow the size of the patch, but we can do it soon
after. I'm frightened of growing the patch even more; I'd rather try to
get 2M pages optimal and only later worry about 1G pages. I think it's
higher priority to remove a couple of split_huge_page calls than to
support transparent gigapages, given they won't be really transparent
anyway.

> Agreed, prereservation is still the way to go for 1GB.

Supporting gigapages would require deciding on a reservation API
now. After that, the kernel would map a 1G page if one is available and
we'd add pud_trans_huge all over the place. There are more urgent
things, like the collapse daemon and removing a couple of
split_huge_page calls, before I can worry about reservation APIs and
bloating the patch further with pud_trans_huge all over the place.
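
As a rough illustration of why 2M pages live at the pmd level and 1G
pages would need pud_trans_huge, here is a small sketch of the
arithmetic. It assumes x86-64 with 4k base pages and 512 entries per
page table; the function name mapping_size is made up for this example
and is not a kernel symbol.

```python
# Assumption: x86-64 layout, 4k base pages, 512 entries per table level.
BASE_PAGE = 4 * 1024
ENTRIES_PER_TABLE = 512


def mapping_size(level):
    """Bytes mapped by one entry at a page table level.

    level 0 = pte (4k page), level 1 = pmd (2M huge page),
    level 2 = pud (1G huge page).
    """
    return BASE_PAGE * ENTRIES_PER_TABLE ** level


for name, level in (("pte", 0), ("pmd", 1), ("pud", 2)):
    print(name, mapping_size(level) // 1024, "KiB")
```

One pud entry covers as much memory as 512 pmd entries, which is part
of why a free, aligned 1G region is much harder to come by without
reserving it up front.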

> Agreed on not doing it unconditionally, but the advice could be per
> process or per cgroup.

It gets more and more complicated, and this "hint" is all about
reservation, which is not something we want to deal with for 2M pages.

> Even on 2MB pages this problem exists to some point: if you explicitly
> preallocate 2MB pages to make sure some application can use them
> with hugetlbfs you don't want random applications to steal the
> "guaranteed" huge pages.

This is what the sysctl is about. You can turn off the transparency,
and then the kernel will keep mapping hugepages only inside
madvise(MADV_HUGEPAGE) regions. There is no need to reserve anything
here.
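
For illustration, this is roughly what the madvise(MADV_HUGEPAGE) opt-in
looks like from userland. It is only a sketch under assumptions: it
relies on Linux and on Python 3.8+, whose mmap module exposes madvise()
and MADV_HUGEPAGE when the platform defines it. The helper name
hint_hugepages is hypothetical, and a True return only means the kernel
accepted the hint, not that a huge page was actually mapped.

```python
import mmap

HUGE_PAGE_SIZE = 2 * 1024 * 1024  # assumption: 2M huge pages (x86-64)


def hint_hugepages(length=4 * HUGE_PAGE_SIZE):
    """Map anonymous memory and ask for it to be backed by huge pages.

    Returns True if the kernel accepted the hint, False if the hint is
    unavailable or rejected on this system.
    """
    advice = getattr(mmap, "MADV_HUGEPAGE", None)
    # Private anonymous mapping (fileno -1 means no backing file).
    region = mmap.mmap(-1, length,
                       flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    try:
        if advice is None or not hasattr(region, "madvise"):
            return False
        try:
            region.madvise(advice)  # a hint only, nothing is reserved
        except OSError:
            return False
        region[:1] = b"\x01"  # touch the mapping so a page is faulted in
        return True
    finally:
        region.close()


print("hint accepted:", hint_hugepages())
```

If the hint is refused or no 2M page is free, the region is simply
backed by 4k pages and the application keeps working unchanged.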

> So some policy here would be likely needed anyways and the same
> could be used for the 1GB pages.

1GB pages can't use the same logic, but again I don't think we will be
doing any additional work if we make 2M pages transparent now and
leave the reservation required for 1G pages for later.

What I mean by ignore is not to add a requirement for merging that
1G pages are also supported, or we'd have to add even more logic that
is absolutely useless for 2M pages.

> I'm still uneasy about this, it's a very clear "glass jaw"
> that might well cause serious problems in practice. Anything that requires
> regular reboots is bad.

Here nothing requires a reboot. If you get 2M pages, good; otherwise
you stick to 4k pages transparently, and userland can't know. When some
task quits and a 2M page becomes available, we'll just collapse the 4k
pages into the newly available 2M page with a background daemon. Over
time we can add more logic to try to minimize fragmentation (obviously
slab needs a front-allocator that always tries a 2M page allocation
first; there are many other things we have to do on the defrag front
before we can worry about the effect of swap calling
split_huge_page). The other syscalls that call split_huge_page, as
said, won't fragment anything physically (with the exception of munmap
and madvise_dontneed, if used to truncate a hugepage).
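
To make the collapse idea concrete, here is a toy userland model of one
pass of such a background daemon. It is not kernel code and not the
patch's algorithm: the name collapse_pass, the dict-based bookkeeping,
and the rule that a region must be fully populated before it is
collapsed are all simplifying assumptions made for this sketch.

```python
PAGE_4K = 4 * 1024
PAGE_2M = 2 * 1024 * 1024
SMALL_PER_HUGE = PAGE_2M // PAGE_4K  # 512 4k pages per 2M page


def collapse_pass(regions, free_huge_pages):
    """Run one pass of a toy collapse daemon.

    regions maps a 2M-aligned region id to either the string "huge" or
    the number of resident 4k pages in that region. free_huge_pages is
    the number of free contiguous 2M pages. The dict is updated in
    place and the number of 2M pages still free is returned.
    """
    for region_id in sorted(regions):
        if free_huge_pages == 0:
            break  # nothing to collapse into, stay on 4k pages
        if regions[region_id] == SMALL_PER_HUGE:
            regions[region_id] = "huge"  # 512 small pages -> one 2M page
            free_huge_pages -= 1
    return free_huge_pages


regions = {0: 512, 1: 100, 2: 512}
left = collapse_pass(regions, free_huge_pages=1)
print(regions, left)
```

With a single free 2M page, only region 0 is collapsed; region 2 stays
on 4k pages until a later pass finds another free 2M page, which mirrors
the "no reboot, just wait for memory to free up" argument.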
