From: Jeff Garzik <jgarzik@pobox.com>
To: Jes Sorensen <jes@wildopensource.com>
Cc: Christoph Hellwig <hch@infradead.org>,
	Andrew Morton <akpm@osdl.org>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: returning non-ram via ->nopage, was Re: [patch] mspec driver for 2.6.12-rc2-mm3
Date: Wed, 27 Apr 2005 11:53:15 -0400
Message-ID: <426FB56B.5000006@pobox.com>
In-Reply-To: <yq0ll75rxsl.fsf@jaguar.mkp.net>

Jes Sorensen wrote:
>>>>>>"Christoph" == Christoph Hellwig <hch@infradead.org> writes:
> 
> 
> Christoph> http://marc.theaimsgroup.com/?l=linux-kernel&m=111416930927092&w=2),
> Christoph> which has a nopage routine that calls remap_pfn_range from
> Christoph> ->nopage for uncached memory that's not part of the mem
> Christoph> map.  Because ->nopage wants to return a struct page *, he's
> Christoph> allocating a normal kernel page and actually returning that
> Christoph> one - to get the page he wants into the pagetables he does
> Christoph> all the pagetable manipulation himself beforehand (see the
> Christoph> gory details of pagetable walks and modification inside a
> Christoph> driver in the patch above).
> 
> Christoph> I don't think these hacks are acceptable for a driver,
> Christoph> especially as the problem can easily be solved by calling
> Christoph> remap_pfn_range in ->mmap - except SGI also wants node
> Christoph> locality.
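
[For readers who haven't seen the pattern: the mmap-time alternative
Christoph refers to looks roughly like the following - a minimal sketch
for a hypothetical character driver that simply remaps a caller-chosen
physical region uncached, not the mspec code. example_mmap and the use
of vm_pgoff as the starting pfn are made up for illustration.

#include <linux/fs.h>
#include <linux/mm.h>

/*
 * Minimal sketch of an ->mmap method that maps a physical region not
 * covered by the mem map in one go with remap_pfn_range().  Every page
 * table entry is set up in the context of the task calling mmap(),
 * which is exactly why this approach cannot give first-touch NUMA
 * placement.
 */
static int example_mmap(struct file *file, struct vm_area_struct *vma)
{
	unsigned long size = vma->vm_end - vma->vm_start;

	/* Map uncached; vm_pgoff selects the starting physical page. */
	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);

	return remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
			       size, vma->vm_page_prot);
}

Jes's objection below is that setting everything up front like this
concentrates all of the uncached memory on the mmap()ing node.]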
> 
> Christoph,
> 
> Let me try and provide some more background then.
> 
> Simply doing remap_pfn_range in the mmap call doesn't work for large
> systems.
> 
> Take the example of a 2048-CPU system (512 CPUs per partition/machine
> - each machine running its own OS) running an MPI application
> across all 2048 CPUs using cross-coherency-domain traffic.
> 
> A standard application will allocate 56 DDQs per thread (the DDQs are
> used for synchronization and allocated through the mspec driver),
> which translates to 126976 uncached cache lines reserved, or 992
> pages, per worker thread. The controlling thread on each partition
> will mmap the entire DDQ space up front and then fork off the
> workers, who then go and touch their pages. With the driver's current
> approach this means that if you have two threads per node you end up
> with ~32MB of uncached memory allocated per node.
> 
> Alternatively, doing this at mmap time with 512 worker threads per
> partition, the result is ~8GB (992 * 16K * 512) of uncached memory,
> all allocated by the master thread on each machine.
> 
> A typical system configuration is 4GB or 8GB of RAM per node. This
> means that with the remap_pfn_range-at-mmap-time approach, on top of
> the kernel's standard overhead, you end up completely starving the
> first couple of nodes of memory on each partition.
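
[To spell the footprint arithmetic out - a back-of-the-envelope
userspace check, assuming 16KB pages and the 992 DDQ pages per worker
quoted above:

#include <stdio.h>

int main(void)
{
	const long long page_size = 16 * 1024;	/* ia64 16KB pages */
	const long long pages_per_worker = 992;
	const long long workers_per_partition = 512;
	const long long workers_per_node = 2;

	printf("per worker:  %lld KB\n",
	       pages_per_worker * page_size / 1024);
	printf("first touch: %lld MB per node\n",
	       workers_per_node * pages_per_worker * page_size >> 20);
	printf("mmap time:   %lld MB on the mmap()ing node\n",
	       workers_per_partition * pages_per_worker * page_size >> 20);
	return 0;
}

That works out to ~15.5MB per worker, ~31MB per node with two workers,
and ~7.75GB on the master's node - consistent with the ~32MB and ~8GB
figures above.]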
> 
> Combine this with the effect of all synchronization traffic hitting
> the same node, and you effectively end up with 512 CPUs constantly
> hammering the same memory controller to death.
> 
> FWIW, an initial implementation of the driver was done by someone
> within SGI before I had anything to do with it. It used the
> remap_pfn_range-at-mmap-time approach, and it was noticed then that
> 16 worker threads were pretty much enough to overwhelm a node.
> 
> Having the page allocations and drop-ins happen on a first-touch
> basis is consistent with what is done for cached memory and seems a
> pretty reasonable approach to me. Sure, it isn't particularly pretty
> to use the ->nopage approach - nobody disagrees with you there - but
> what is the alternative?

I don't see anything wrong with a ->nopage approach.

At Linus's suggestion, I used ->nopage in the implementation of 
sound/oss/via82cxxx_audio.c.
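
The shape of such a handler, for anyone not familiar with the
interface, is roughly the following - a minimal sketch against the
2.6.12-era ->nopage prototype, not the actual mspec or via82cxxx code.
example_nopage is a made-up name; a real driver would also record
which page backs which offset so that later faults and other sharers
of the mapping see the same page, and the mspec case is harder still,
since its uncached memory has no struct page to return - which is the
hack Christoph objects to.

#include <linux/mm.h>
#include <linux/gfp.h>

/*
 * First-touch allocation: the page is allocated when a worker thread
 * first faults on it, so it lands on that thread's local node rather
 * than on the node that called mmap().
 */
static struct page *example_nopage(struct vm_area_struct *vma,
				   unsigned long address, int *type)
{
	struct page *page;

	/* With the default policy this allocates from the faulting
	 * CPU's local node. */
	page = alloc_page(GFP_KERNEL);
	if (!page)
		return NOPAGE_OOM;

	clear_page(page_address(page));

	if (type)
		*type = VM_FAULT_MINOR;

	/* The core fault handler inserts the page into the page tables. */
	return page;
}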

	Jeff



