From: 'David Gibson' <david@gibson.dropbear.id.au>
To: Andrew Morton <akpm@osdl.org>
Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com>,
'Christoph Lameter' <christoph@schroedinger.engr.sgi.com>,
Hugh Dickins <hugh@veritas.com>,
bill.irwin@oracle.com, Adam Litke <agl@us.ibm.com>,
linux-mm@kvack.org
Subject: Re: [RFC] reduce hugetlb_instantiation_mutex usage
Date: Fri, 27 Oct 2006 14:06:26 +1000
Message-ID: <20061027040626.GI11733@localhost.localdomain>
In-Reply-To: <20061026203522.d8b3e248.akpm@osdl.org>

On Thu, Oct 26, 2006 at 08:35:22PM -0700, Andrew Morton wrote:
> On Fri, 27 Oct 2006 13:11:56 +1000
> "'David Gibson'" <david@gibson.dropbear.id.au> wrote:
>
> > On Thu, Oct 26, 2006 at 05:04:15PM -0700, Andrew Morton wrote:
> > > On Fri, 27 Oct 2006 09:31:37 +1000
> > > "'David Gibson'" <david@gibson.dropbear.id.au> wrote:
> > >
> > > > On Thu, Oct 26, 2006 at 03:44:51PM -0700, Andrew Morton wrote:
> > > > > On Thu, 26 Oct 2006 15:17:20 -0700
> > > > > "Chen, Kenneth W" <kenneth.w.chen@intel.com> wrote:
> > > > >
> > > > > > First rev of patch to allow hugetlb page fault to scale.
> > > > > >
> > > > > > > hugetlb_instantiation_mutex was introduced to prevent spurious allocation
> > > > > > > failure in a corner case: two threads race to instantiate the same page with
> > > > > > > only one free page left in the global pool. However, this global
> > > > > > > serialization hurts fault performance badly, as noted by Christoph Lameter.
> > > > > > > This patch attempts to restrict use of the mutex to cases where free pages
> > > > > > > are scarce, allowing faults to scale in the most common cases.
> > > > > >
> > > > >
> > > > > ug.
> > > > >
> > > > > How about we kill that instantiation_mutex thing altogether and fix
> > > > > the original bug in a better fashion? Like...
> > > > >
> > > > > In hugetlb_no_page():
> > > > >
> > > > > > retry:
> > > > > > 	page = find_lock_page(...)
> > > > > > 	if (!page) {
> > > > > > 		write_lock_irq(&mapping->tree_lock);
> > > > > > 		if (radix_tree_lookup(...)) {
> > > > > > 			write_unlock_irq(tree_lock);
> > > > > > 			goto retry;
> > > > > > 		}
> > > > > > 		page = alloc_huge_page(...);
> > > > > > 		if (!page)
> > > > > > 			bail;
> > > > > > 		radix_tree_insert(...);
> > > > > > 		SetPageLocked(page);
> > > > > > 		write_unlock_irq(tree_lock);
> > > > > > 		clear_huge_page(...);
> > > > > > 	}
> > > > > >
> > > > > > 	<stick it in page tables>
> > > > > >
> > > > > > 	unlock_page(page);
> > > > >
> > > > > The key points:
> > > > >
> > > > > - Use tree_lock to prevent the race
> > > > >
> > > > > > - allocate the hugepage inside tree_lock so we never get into this
> > > > > >   two-threads-tried-to-allocate-the-final-page problem.
> > > > > >
> > > > > > - The hugepage is zeroed without locks held, under lock_page()
> > > > > >
> > > > > > - lock_page() is used to make the other thread(s) sleep while the winner
> > > > > >   thread is zeroing out the page.
> > > > >
> > > > > It means that rather a lot of add_to_page_cache() will need to be copied
> > > > > into hugetlb_no_page().
> > > >
> > > > This handles the case of processes racing on a shared mapping, but not
> > > > the case of threads racing on a private mapping. In the latter case
> > > > the race ends at the set_pte() rather than the add_to_page_cache()
> > > > (well, strictly with the whole page_table_lock atomic lump). And we
> > > > can't move the clear after the set_pte() :(.
> > > >
> > >
> > > I expect we can do a similar thing, using page_table_lock to prevent the
> > > race.
> > >
> > > The key is to be able to make racing threads still block on the page lock.
> > > Perhaps install a temp pte which is !pte_present() and also !pte_none().
> > > So the racing thread can use that pte to locate and wait upon the
> > > presently-locked page while it is being COWed by another CPU.
> >
> > Um.. yes, that might work. Though I'd need to think hard about a more
> > specific scheme. I've been through a lot of approaches lately that
> > looked ok at first glance, but weren't :-/
> >
> > And obviously we'd need to make sure such "tentative" PTEs are
> > constructible and won't confuse other code on each relevant architecture.
>
> There's various cross-arch infrastructure for this which is used for
> encoding swap offsets within pte's which could perhaps be
> ab^W^Wreused.

Yes, but the encoding and assumptions about ptes aren't always exactly
the same for hugeptes as normal ptes.
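
To make the "!pte_present() and also !pte_none()" idea concrete, here is
a minimal userspace sketch of a swap-entry-style marker. The bit layout,
the 12-bit shift and every sim_* name are made up for illustration; no
architecture's real pte or hugepte format is implied, which is exactly
the portability worry.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 64-bit entry; NOT any real architecture's pte layout. */
typedef uint64_t sim_pte_t;

#define SIM_PTE_PRESENT   ((sim_pte_t)1 << 0)  /* assumed "hardware valid" bit */
#define SIM_PTE_TENTATIVE ((sim_pte_t)1 << 1)  /* assumed software marker bit */
#define SIM_PFN_SHIFT     12                   /* assumed room for flag bits */

/* An all-zero entry means nothing has been installed yet. */
static int sim_pte_none(sim_pte_t pte)
{
	return pte == 0;
}

/* Present entries are the only ones the (imaginary) MMU would walk. */
static int sim_pte_present(sim_pte_t pte)
{
	return (pte & SIM_PTE_PRESENT) != 0;
}

/* Tentative: not present, not none, and carrying the marker bit. */
static int sim_pte_tentative(sim_pte_t pte)
{
	return !sim_pte_present(pte) && (pte & SIM_PTE_TENTATIVE) != 0;
}

/* Build a tentative entry that records which page is being set up. */
static sim_pte_t sim_mk_tentative(uint64_t pfn)
{
	return ((sim_pte_t)pfn << SIM_PFN_SHIFT) | SIM_PTE_TENTATIVE;
}

/* Recover the page frame number so a racing faulter can find the page. */
static uint64_t sim_tentative_pfn(sim_pte_t pte)
{
	return pte >> SIM_PFN_SHIFT;
}
```

A racing faulter that reads such an entry would see neither "none" nor
"present", and could use sim_tentative_pfn() to find the page to wait on.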

> Alternatively, we could put the page into pagecache whether or not the
> mapping is MAP_SHARED. Then pull it out again prior to unlocking it if
> it's MAP_PRIVATE. So we're using pagecache just as a way for the
> concurrent faulter to locate the page.

Hrm.. interesting if we can make it work. I'd be worried about cases
with concurrent PRIVATE and SHARED pages on the same file offset.
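
For reference, the shared-mapping half of the scheme quoted earlier
(lookup, allocate and insert under one lock, zero outside it, losers
sleep on the page lock) can be modelled in userspace with pthreads.
This is only a sketch: everything named sim_* is hypothetical, pthread
mutexes stand in for tree_lock and the page lock, and the model ignores
both the private-mapping race and the question of whether allocating
under a spinlock is acceptable in the kernel.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define SIM_MAX_THREADS 64

/* Hypothetical model of one file offset backed by a one-page pool. */
struct sim_pool {
	pthread_mutex_t tree_lock;  /* stands in for mapping->tree_lock */
	pthread_mutex_t page_lock;  /* stands in for the page lock */
	int free_pages;             /* huge pages left in the global pool */
	int cached;                 /* 1 once the page is in the "radix tree" */
	int zeroed;                 /* 1 once the winner has cleared the page */
	int failures;               /* faults that failed or saw an unzeroed page */
	int mappings;               /* faults that mapped the page */
};

struct sim_result {
	int failures;
	int mappings;
	int free_pages;
};

static void sim_fault(struct sim_pool *p)
{
	int winner = 0;

	/* Lookup, allocation and insertion form one critical section. */
	pthread_mutex_lock(&p->tree_lock);
	if (!p->cached) {
		if (p->free_pages == 0) {
			p->failures++;          /* the "bail" path */
			pthread_mutex_unlock(&p->tree_lock);
			return;
		}
		p->free_pages--;                /* alloc_huge_page() */
		p->cached = 1;                  /* radix_tree_insert() */
		pthread_mutex_lock(&p->page_lock);  /* SetPageLocked() */
		winner = 1;
	}
	pthread_mutex_unlock(&p->tree_lock);

	if (winner)
		p->zeroed = 1;                  /* clear_huge_page(), tree_lock dropped */
	else
		pthread_mutex_lock(&p->page_lock);  /* sleep while the winner zeroes */

	/* <stick it in page tables>, still under the page lock */
	if (p->zeroed)
		p->mappings++;
	else
		p->failures++;
	pthread_mutex_unlock(&p->page_lock);    /* unlock_page() */
}

static void *sim_thread(void *arg)
{
	sim_fault(arg);
	return NULL;
}

/* Race nthreads faults against a pool holding exactly one free page. */
static struct sim_result sim_race(int nthreads)
{
	struct sim_pool pool = { .free_pages = 1 };
	struct sim_result res;
	pthread_t tids[SIM_MAX_THREADS];
	int i, rc;

	if (nthreads > SIM_MAX_THREADS)
		nthreads = SIM_MAX_THREADS;
	pthread_mutex_init(&pool.tree_lock, NULL);
	pthread_mutex_init(&pool.page_lock, NULL);
	for (i = 0; i < nthreads; i++) {
		rc = pthread_create(&tids[i], NULL, sim_thread, &pool);
		assert(rc == 0);
		(void)rc;
	}
	for (i = 0; i < nthreads; i++)
		pthread_join(tids[i], NULL);
	res.failures = pool.failures;
	res.mappings = pool.mappings;
	res.free_pages = pool.free_pages;
	pthread_mutex_destroy(&pool.page_lock);
	pthread_mutex_destroy(&pool.tree_lock);
	return res;
}
```

With one free page and eight racing threads, sim_race(8) should report
no failed faults and eight mappings of the same page, since only the
first thread into the critical section ever touches the pool.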
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson