From: Christoph Lameter <cl@linux-foundation.org>
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: linux-mm@kvack.org, Marcelo Tosatti <mtosatti@redhat.com>,
	Adam Litke <agl@us.ibm.com>, Avi Kivity <avi@redhat.com>,
	Izik Eidus <ieidus@redhat.com>,
	Hugh Dickins <hugh.dickins@tiscali.co.uk>,
	Nick Piggin <npiggin@suse.de>, Rik van Riel <riel@redhat.com>,
	Mel Gorman <mel@csn.ul.ie>, Dave Hansen <dave@linux.vnet.ibm.com>,
	Benjamin Herrenschmidt <benh@kernel.crashing.org>,
	Ingo Molnar <mingo@elte.hu>, Mike Travis <travis@sgi.com>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
	Chris Wright <chrisw@sous-sol.org>,
	bpicco@redhat.com,
	KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,
	Balbir Singh <balbir@linux.vnet.ibm.com>,
	Arnd Bergmann <arnd@arndb.de>,
	"Michael S. Tsirkin" <mst@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>
Subject: Re: [PATCH 00 of 34] Transparent Hugepage support #14
Date: Mon, 22 Mar 2010 10:38:23 -0500 (CDT)
Message-ID: <alpine.DEB.2.00.1003221027590.16606@router.home>
In-Reply-To: <20100319144101.GB29874@random.random>

On Fri, 19 Mar 2010, Andrea Arcangeli wrote:

> > Look at the patches. They add synchronization to pte operations.
>
> The whole point is that it's fundamentally unavoidable and not
> specific to any of split_huge_page, compound_lock or
> get_page/put_page!

Really? Why did no one else run into this before?

> Problem with O_DIRECT is that I couldn't use mmu notifier to prevent
> it to take the pin on the page, because there is no way to interrupt
> DMA synchronously before mmu_notifier_invalidate_* returns... So I had
> to add compound_lock and keep gup API backwards compatible and have
> the proper serialization happen _only_ for PageTail inside put_page.

You can take a refcount *before* breaking up a 2M page; then you don't have
to fear the put_page.

> > What is wrong with gup and gup_fast? They mimic the traversal of the page
> > tables by the MMU. If you update in the right sequence then there won't be
> > an issue.
>
> gup and gup_fast (if we don't split_huge_page during the follow_page
> pagetable walk, which is what I did initially to start but it's
> unacceptable as it makes O_DIRECT with -drive cache=off split guest
> NPT hugepages on the I/O memory source/destination) will prevent
> split_huge_page from being able to adjust the refcount of the subpages,
> unless we serialize put_page against split_huge_page and we keep track
> of the individual gup refcounts on the subpages. Which is what
> the compound_lock and the get_page/put_page changes achieve in a self
> contained manner without spreading all over the drivers and the VM.

Keep a reference count in the head page and a pointer to the subpage? The
page can only be broken up if all page references of the 2M page can be
accounted for. This implies no "atomic" breakup, but this way it does not
require changes to synchronization.
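
The idea in the paragraph above can be pictured with a small user-space
model. This is only a sketch under stated assumptions: `struct toy_huge_page`,
`struct toy_ref`, `toy_get`, `toy_put` and `toy_try_split` are hypothetical
names invented for illustration, not kernel APIs, and the model is
single-threaded. Every reference is counted on the head page, each holder
remembers which subpage it wanted, and the split simply refuses to proceed
while there are references the caller cannot account for.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_SUBPAGES 512 /* 2M huge page / 4k base page */

/* Hypothetical model of a huge page: one counter, kept on the head. */
struct toy_huge_page {
	int head_refs; /* every reference to any part of the 2M page */
	bool is_split; /* set once the page has been broken up */
};

/* What a holder keeps: the head it pinned plus the subpage it wanted. */
struct toy_ref {
	struct toy_huge_page *head;
	size_t subpage;
};

/* Take a reference: count it on the head, remember the subpage. */
struct toy_ref toy_get(struct toy_huge_page *hp, size_t subpage)
{
	struct toy_ref ref = { hp, subpage };

	assert(subpage < TOY_SUBPAGES);
	hp->head_refs++;
	return ref;
}

/* Drop a reference: only the head counter is touched. */
void toy_put(struct toy_ref ref)
{
	assert(ref.head->head_refs > 0);
	ref.head->head_refs--;
}

/*
 * Try to break up the page. The caller passes the number of references
 * it can account for; any extra reference makes the split fail, and the
 * caller is expected to retry later or give up.
 */
bool toy_try_split(struct toy_huge_page *hp, int accounted_refs)
{
	if (hp->is_split)
		return true;
	if (hp->head_refs != accounted_refs)
		return false;
	hp->is_split = true;
	return true;
}
```

In this model a split attempted while a gup-style reference is outstanding
returns false, and succeeds once that reference has been dropped. That is
the non-atomic breakup described above: the split may fail, but no new
locking is introduced.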

> > It's pretty bold to call this patchset non-intrusive. Not sure why you
> > think you have to break gup. Certainly works fine for page migration.
>
> Page migration won't convert a compound page to a non-compound page in
> the page structure in place. This is not page migration, this is about
> converting a page structure from PageCompound to non-compound.
> It's all trivial on the pagetable side, what is not trivial
> is the refcounting created by GUP. The gup caller, at any later time,
> will call put_page on a random subpage, so we have to adjust the
> refcount for subpages inside __split_huge_page_refcount, depending on
> which tailpage was returned by gup.

It's the same principle: Account for all the references, stop new
references from being established and then replace the page / convert
references.
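
The three steps named above (account for all references, stop new ones,
then convert) can be sketched in user-space C11 atomics. This is an
illustrative model only: `struct toy_page` and the `toy_page_*` helpers are
hypothetical names, and the choice to "freeze" the count at zero is an
assumption made for the sketch, not a description of any particular kernel
code path.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical page with a single atomic reference count. */
struct toy_page {
	atomic_int refs;
};

/* Take a new reference, unless the page is frozen (count is zero). */
bool toy_page_try_get(struct toy_page *p)
{
	int old = atomic_load(&p->refs);

	while (old != 0) {
		/* On failure 'old' is reloaded and the loop re-checks it. */
		if (atomic_compare_exchange_weak(&p->refs, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference that was taken earlier. */
void toy_page_put(struct toy_page *p)
{
	int old = atomic_fetch_sub(&p->refs, 1);

	assert(old > 0);
}

/*
 * Steps 1 and 2 in one atomic operation: succeed only if the count is
 * exactly what the caller can account for, and drop it to zero so that
 * toy_page_try_get() refuses new references from then on.
 */
bool toy_page_freeze(struct toy_page *p, int expected)
{
	int want = expected;

	return atomic_compare_exchange_strong(&p->refs, &want, 0);
}

/* Step 3 is done by the caller while frozen; then references resume. */
void toy_page_unfreeze(struct toy_page *p, int count)
{
	atomic_store(&p->refs, count);
}
```

If the freeze fails because of an unexpected extra reference, the caller
backs off and may retry, which is the "can fail" behaviour argued for in
this message.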

> Migration will bail out if gup is running. split_huge_page basic
> design is that it can't fail. So it can't bail out. And we don't want
> to call split_huge_page during follow_page before gup returns the
> page. That would also solve it, I did it initially but it's
> unacceptable to split hugepages across GUP.

That is the basic crux here. Do not require that it cannot fail. It's a bad
move and results in a mess.

> It's definitely zero risk if compared to what you're proposing.

No, it's not. I am proposing to *keep* the existing synchronization methods.
Use what is there. Do not invent new synchronization.

> > You can convert a 2M page to 4k pages without messing up the
> > basic refcounting and synchronization by following the way things are done
> > in other parts of the kernel.
>
> No other place of the kernel does anything remotely comparable to
> split_huge_page.

Page migration does a comparable thing.

> defrag, migration they all can fail, split_huge_page cannot. The very
> simple reason split_huge_page cannot fail is this: if I have to do
> anything more than a one liner to make mremap, mprotect and all the
> rest, then I prefer to take your non-practical more risky design. The
> moment you have to alter some very inner pagetable walking function to
> handle a split_huge_page error return, you'll already have to recraft
> the code in a big enough way, that you better make it hugepage
> aware. Making it hugepage aware is like 10 times more difficult and
> error prone and hard to test, than handling a split_huge_page error
> retval, but still in 10 files fixed for the error retval, will be
> worth 1 file converted not to call split_huge_page at all. That
> explains very clearly my decision to make split_huge_page not fail,
> and make sure all next efforts will be spent in removing
> split_huge_page and not in handling an error retval for a function
> that shouldn't have been called in the first place!

We already have 2M pmd handling in the kernel and can consider huge pmd
entries while walking the page tables! Proceed incrementally and use what
is there.
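
As a rough picture of what "considering huge pmd entries while walking the
page tables" means, here is a toy user-space address translation. It is a
sketch under assumptions: `struct toy_pmd`, its fields and `toy_translate`
are hypothetical, the layout (4k base pages, 2M huge pages, 512 entries per
table) follows the sizes discussed in this thread, and a frame value of 0
is used to mean "not present".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT   12  /* 4k base page */
#define TOY_PMD_SHIFT    21  /* 2M huge page */
#define TOY_PTRS_PER_PTE 512

/* Hypothetical pmd entry: either a 2M leaf or a pointer to a pte table. */
struct toy_pmd {
	bool present;
	bool huge;                 /* entry maps a 2M page directly */
	uint64_t huge_base;        /* physical base when 'huge' is set */
	const uint64_t *pte_table; /* 512 frame bases otherwise, 0 = none */
};

/* Translate vaddr through one pmd entry; false means "not mapped". */
bool toy_translate(const struct toy_pmd *pmd, uint64_t vaddr, uint64_t *paddr)
{
	if (!pmd->present)
		return false;

	if (pmd->huge) {
		/* Huge leaf: stop here, keep the low 21 bits as the offset. */
		*paddr = pmd->huge_base + (vaddr & ((1ULL << TOY_PMD_SHIFT) - 1));
		return true;
	}

	/* Regular case: descend one more level into the pte table. */
	size_t idx = (vaddr >> TOY_PAGE_SHIFT) & (TOY_PTRS_PER_PTE - 1);
	uint64_t frame = pmd->pte_table[idx];

	if (frame == 0)
		return false;
	*paddr = frame + (vaddr & ((1ULL << TOY_PAGE_SHIFT) - 1));
	return true;
}
```

The walker needs only one extra branch to handle the huge entry; no split
of the 2M mapping is required just to look an address up.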

