From: Jeff Garzik <jgarzik@pobox.com>
To: Jes Sorensen <jes@wildopensource.com>
Cc: Christoph Hellwig <hch@infradead.org>,
Andrew Morton <akpm@osdl.org>,
linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: returning non-ram via ->nopage, was Re: [patch] mspec driver for 2.6.12-rc2-mm3
Date: Wed, 27 Apr 2005 11:53:15 -0400
Message-ID: <426FB56B.5000006@pobox.com>
In-Reply-To: <yq0ll75rxsl.fsf@jaguar.mkp.net>
Jes Sorensen wrote:
>>>>>>"Christoph" == Christoph Hellwig <hch@infradead.org> writes:
>
>
> Christoph> http://marc.theaimsgroup.com/?l=linux-kernel&m=111416930927092&w=2),
> Christoph> which has a nopage routine that calls remap_pfn_range from
> Christoph> ->nopage for uncached memory that's not part of the mem
> Christoph> map. Because ->nopage wants to return a struct page * he's
> Christoph> allocating a normal kernel page and actually returns that
> Christoph> one - to get the page he wants into the pagetables he does
> Christoph> all the pagetable manipulation himself beforehand (see the
> Christoph> gory details of pagetable walks and modification inside a
> Christoph> driver in the patch above).
>
> Christoph> I don't think these hacks are acceptable for a driver,
> Christoph> especially as the problem can easily be solved by calling
> Christoph> remap_pfn_range in ->mmap - except SGI also wants node
> Christoph> locality..
>
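(For reference, the straightforward variant Christoph is suggesting is a
one-shot remap at mmap time. A minimal sketch of that approach in rough
2.6.12-era style; my_dev_mmap and base_pfn are made-up names for
illustration, not the actual mspec code:

#include <linux/fs.h>
#include <linux/mm.h>

/* Illustrative only: the real driver would get its pfn from
 * wherever it obtained the uncached memory. */
static unsigned long base_pfn;

static int my_dev_mmap(struct file *file, struct vm_area_struct *vma)
{
	unsigned long size = vma->vm_end - vma->vm_start;

	/* The entire region is populated here, up front, in the
	 * context of whoever calls mmap() - no faults later. */
	if (remap_pfn_range(vma, vma->vm_start,
			    base_pfn + vma->vm_pgoff, size,
			    pgprot_noncached(vma->vm_page_prot)))
		return -EAGAIN;
	return 0;
}

The up-front population in the mmap() caller's context is exactly what
drives the node-placement problem Jes describes below.)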
> Christoph,
>
> Let me try and provide some more background then.
>
> Simply doing remap_pfn_range in the mmap call doesn't work for large
> systems.
>
> Take the example of a 2048 CPU system (512 CPUs per partition/machine
> - each machine running its own OS) running an MPI application
> across all 2048 CPUs using cross coherency domain traffic.
>
> A standard application will allocate 56 DDQs per thread (the DDQs are
> used for synchronization and allocated through the mspec driver),
> which translates to 126976 uncached cache lines reserved, or 992
> pages, per worker thread. The controlling thread on each partition
> will mmap the entire DDQ space up front and then fork off the
> workers, who then go and touch their pages. With the driver's current
> approach, this means that if you have two threads per node you end up
> with ~32MB of uncached memory allocated per node.
>
> Alternatively, doing this at mmap time with 512 worker threads per
> partition, the result is ~8GB (992 * 16K * 512) of uncached memory,
> all allocated by the master thread on each machine.
>
> A typical system configuration is 4GB or 8GB of RAM per node. This
> means that with the remap_pfn_range-at-mmap-time approach, plus the
> kernel's standard overhead, you end up completely starving the first
> couple of nodes on each partition of memory.
>
> Combine this with the effect of all synchronization traffic hitting
> the same node, and you effectively end up with 512 CPUs constantly
> hammering the same memory controller to death.
>
> FWIW, an initial implementation of the driver was done by someone
> within SGI, prior to me having anything to do with it. It used the
> remap_pfn_range-at-mmap-time approach, and it was noticed then that
> 16 worker threads were pretty much enough to overwhelm a node.
>
> Having the page allocations and drop-ins happen on a first-touch
> basis is consistent with what is done for cached memory and seems a
> pretty reasonable approach to me. Sure, the ->nopage approach isn't
> particularly pretty, nobody disagrees with you there, but what is
> the alternative?
I don't see anything wrong with a ->nopage approach.
At Linus's suggestion, I used ->nopage in the implementation of
sound/oss/via82cxxx_audio.c.
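
For the record, a first-touch ->nopage handler in the 2.6 style looks
roughly like the sketch below. This is from memory, not the actual
via82cxxx_audio.c or mspec code; struct my_dev and its fields are
invented for illustration, and a real driver would need locking around
the first-touch allocation:

#include <linux/mm.h>
#include <linux/gfp.h>
#include <linux/topology.h>

struct my_dev {				/* illustrative only */
	struct page **pages;		/* one slot per device page */
	unsigned long npages;
};

static struct page *my_dev_nopage(struct vm_area_struct *vma,
				  unsigned long address, int *type)
{
	struct my_dev *dev = vma->vm_private_data;
	unsigned long pgoff = vma->vm_pgoff +
		((address - vma->vm_start) >> PAGE_SHIFT);
	struct page *page;

	if (pgoff >= dev->npages)
		return NOPAGE_SIGBUS;

	/* First touch: allocate on the faulting CPU's node so each
	 * worker's pages end up local to it. */
	if (!dev->pages[pgoff])
		dev->pages[pgoff] = alloc_pages_node(numa_node_id(),
						     GFP_KERNEL, 0);
	page = dev->pages[pgoff];
	if (!page)
		return NOPAGE_OOM;

	get_page(page);			/* reference for the new mapping */
	if (type)
		*type = VM_FAULT_MINOR;
	return page;
}

Of course this only works because the sketch hands back ordinary
struct-page-backed memory; the wrinkle Christoph points out is that
mspec's uncached memory has no struct page to return, which is what
pushes that driver into hand-rolled pagetable manipulation.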
Jeff