From: Dave Hansen <haveblue@us.ibm.com>
To: Andrew Morton <akpm@osdl.org>
Cc: linux-mm <linux-mm@kvack.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Matthew E Tolentino <matthew.e.tolentino@intel.com>,
	Jesse Barnes <jbarnes@engr.sgi.com>,
	Mike Kravetz <kravetz@us.ibm.com>, Bob Picco <bob.picco@hp.com>,
	Joel Schopp <jschopp@austin.ibm.com>,
	Andy Whitcroft <apw@shadowen.org>
Subject: Re: [PATCH 0/4] sparsemem intro patches
Date: Mon, 14 Mar 2005 19:53:42 -0800
Message-ID: <1110858822.19340.127.camel@localhost>
In-Reply-To: <20050314183042.7e7087a2.akpm@osdl.org>

On Mon, 2005-03-14 at 18:30 -0800, Andrew Morton wrote:
> Dave Hansen <haveblue@us.ibm.com> wrote:
> >
> >  The following four patches provide the last needed changes before the
> >  introduction of sparsemem.  For a more complete description of what this
> >  will do, please see this patch:
> > 
> >  http://www.sr71.net/patches/2.6.11/2.6.11-bk7-mhp1/broken-out/B-sparse-150-sparsemem.patch
> 
> I don't know what to think about this.  Can you describe sparsemem a little
> further, differentiate it from discontigmem and tell us why we want one?
>
> Is it for memory hotplug?  If so, how does it support hotplug?

Sparsemem is more flexible than discontig, and not tied to any existing
NUMA or MM structures like zones or pgdats.  That makes it ideal for
hotplug where those structures are going to be coming and going, sliced
and diced.

Another advantage is that sparse doesn't require each NUMA node's ranges
to be contiguous.  It can handle overlapping ranges between nodes with
no problems, where DISCONTIGMEM currently throws away that memory.
DISCONTIGMEM also requires that memory *inside* of a node be contiguous,
and have mem_map for all of it.  Without sparsemem, a node that once
held 64GB and has had 63GB removed wouldn't have much space left for
anything but its mem_map.
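
To put rough numbers on that, here is a back-of-the-envelope sketch.
The 4KB page size and the 32-byte struct page are assumptions picked for
illustration, not figures from the patches, and mem_map_bytes() is a
hypothetical helper, not a kernel function.

```c
#include <stdint.h>

/*
 * Bytes of mem_map needed to cover a physical span when every page in
 * the span must have a struct page, populated or not.
 * All three parameters are illustrative assumptions.
 */
static uint64_t mem_map_bytes(uint64_t span_bytes, uint64_t page_size,
                              uint64_t struct_page_size)
{
	return span_bytes / page_size * struct_page_size;
}
```

Under those assumptions, mem_map_bytes(64ULL << 30, 4096, 32) comes to
512MB, which would be half of the 1GB still left in the node.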

> To which architectures is this useful, and what is the attitude of the
> relevant maintenance teams?

We have implementations for NUMAQ, x86 Summit, flat x86, flat x86-64,
flat and NUMA ppc64, and some ia64 configurations.  All of those can
do simulated, virtualized, or actual hardware memory hotplug of some
kind based on the sparsemem implementations.

Not to put words in their mouths, but there hasn't been anything
negative that I can recall in a while from the architecture maintainers.
The negative feedback we did get was months ago, and has been resolved.
We've been talking about this to most of them for quite a while now, and
I think they've grown accustomed to the idea. :)

I've cc'd all of the guilty parties.  Perhaps they can fill in my vague
statements with actual facts.  But, here are the vague statements
anyway:

  i386 - Martin Bligh seems happy with it, he helped design it.
x86-64 - Matt Tolentino has approached Andi Kleen with the necessary
         cleanups, and I believe the reaction has been positive.  I
         think Andi had some other non-hotplug plans for sparsemem, too.
 ppc64 - I can bribe Anton and Paul's employer.  Mike Kravetz and Joel
         Schopp have been working on this port, and I believe they've
         kept the maintainers informed and calm.
  ia64 - Quote from Jesse Barnes (November 19, 2004):

>         CONFIG_NONLINEAR (SPARSE's old name) should be the *only*
>         memory init code on ia64  when this is done.  That means
>         getting rid of both discontig and contig and virtual memmap...

         I believe Jesse's been keeping up with the development as well.


> Quoting from the above patch:
> 
> > Sparsemem replaces DISCONTIGMEM when enabled, and it is hoped that
> > it can eventually become a complete replacement.
> > ...
> > This patch introduces CONFIG_FLATMEM.  It is used in almost all
> > cases where there used to be an #ifndef DISCONTIG, because
> > SPARSEMEM and DISCONTIGMEM often have to compile out the same areas
> > of code.
> 
> Would I be right to worry about increasing complexity, decreased
> maintainability and generally increasing mayhem?

You certainly would be.  For the time being, this increases the number
of config options and places for us to screw up.  However, I am
confident at this point that we're doing the right thing.  We had a more
complicated version of sparsemem at first.  We stripped it down to the
bare bones, and that's what we would like to submit soon.  It has the
capability to replace discontig, and will eventually _reduce_
complexity.

One of my favorite ways to demonstrate why I think it's *simple* is the
architecture ports.  The longest added function that I can find in the
ports is 17 lines including whitespace.

139 insertions(+), 36 deletions(-) for ia64:
http://www.sr71.net/patches/2.6.11/2.6.11-bk7-mhp1/broken-out/B-sparse-180-sparsemem-ia64.patch

75 insertions(+), 17 deletions(-) for ppc64:
http://www.sr71.net/patches/2.6.11/2.6.11-bk7-mhp1/broken-out/B-sparse-170-sparsemem-ppc64.patch

x86_64 is broken up a little more, but it's probably smaller than the
ppc64 one.

> If a competent kernel developer who is not familiar with how all this code
> hangs together wishes to acquaint himself with it, what steps should he
> take?

Dan Phillips spelled out the basic concepts of chopping things up into
sections a few years ago:

	http://lwn.net/2002/0411/a/discontig.php3

However, we haven't yet implemented the phys_to_virt() translations that
he envisioned.  We don't need those unless we need some advanced
hot-remove features, which are many, many months away.

Where should a competent kernel developer look to understand the code
more?

The sparsemem implementation isn't horribly deep.  At the implementation
level, it replaces pfn_to_page() and page_to_pfn().  It does that with
an array lookup and some bits from page->flags.  I'd check out a few
architectures' current implementations of those functions as well as the
one in the patch referenced at the beginning of the mail:
B-sparse-150-sparsemem.patch .
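
As a rough illustration of "an array lookup and some bits from
page->flags", here is a minimal userspace sketch.  It is not the code
from the patch: the section size, the position of the section number in
flags, the struct layouts and the sketch_add_section() helper are all
assumptions made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical constants; real values would be per-architecture. */
#define PAGE_SHIFT         12	/* 4KB pages (assumed) */
#define SECTION_SHIFT      28	/* 256MB sections (assumed) */
#define PFN_SECTION_SHIFT  (SECTION_SHIFT - PAGE_SHIFT)
#define PAGES_PER_SECTION  (1UL << PFN_SECTION_SHIFT)
#define NR_SECTIONS        64
#define SECTION_FLAG_SHIFT 24	/* assumed home of the section nr in flags */

struct page {
	unsigned long flags;	/* upper bits hold the section number */
};

struct mem_section {
	struct page *mem_map;	/* NULL if the section has no memory */
};

static struct mem_section mem_section[NR_SECTIONS];

/* pfn -> page: index the section array, then offset into its mem_map. */
static struct page *pfn_to_page(unsigned long pfn)
{
	unsigned long nr = pfn >> PFN_SECTION_SHIFT;

	assert(nr < NR_SECTIONS && mem_section[nr].mem_map);
	return mem_section[nr].mem_map + (pfn & (PAGES_PER_SECTION - 1));
}

/* page -> pfn: recover the section from flags, add the offset in its map. */
static unsigned long page_to_pfn(const struct page *page)
{
	unsigned long nr = page->flags >> SECTION_FLAG_SHIFT;

	return (nr << PFN_SECTION_SHIFT) +
	       (unsigned long)(page - mem_section[nr].mem_map);
}

/* Hypothetical helper: attach a mem_map to a section and stamp each page. */
static void sketch_add_section(unsigned long nr, struct page *map)
{
	unsigned long i;

	assert(nr < NR_SECTIONS);
	mem_section[nr].mem_map = map;
	for (i = 0; i < PAGES_PER_SECTION; i++)
		map[i].flags = nr << SECTION_FLAG_SHIFT;
}
```

The point of the sketch is that neither direction needs to know about
zones or pgdats; the only shared state is the mem_section[] array.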

Next, see how the memory_present() abstraction allows the memory layout
of the system to be either encoded in arch-specific discontig structures
or fed into the arch-independent structures that sparse_init() uses to
set up the mem_section[] array.
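
Here is a toy userspace sketch of that flow, assuming a simplified
interface.  The function names match the ones above, but the signatures,
the "present" field, and the use of calloc() in place of boot-time
allocation are assumptions for illustration only, not what the patch
does.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical constants for the sketch. */
#define PFN_SECTION_SHIFT 16
#define PAGES_PER_SECTION (1UL << PFN_SECTION_SHIFT)
#define NR_SECTIONS       64

struct page {
	unsigned long flags;
};

struct mem_section {
	int present;		/* set by memory_present() */
	struct page *mem_map;	/* filled in by sparse_init() */
};

static struct mem_section mem_section[NR_SECTIONS];

/*
 * Arch code reports a pfn range [start, end); every section that the
 * range touches gets marked present.  (Simplified signature.)
 */
static void memory_present(unsigned long start, unsigned long end)
{
	unsigned long pfn;

	start &= ~(PAGES_PER_SECTION - 1);	/* round down to a section */
	for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION) {
		unsigned long nr = pfn >> PFN_SECTION_SHIFT;

		assert(nr < NR_SECTIONS);
		mem_section[nr].present = 1;
	}
}

/*
 * Arch-independent setup: allocate a mem_map only for present sections.
 * Returns the number of sections set up, or -1 on allocation failure.
 */
static int sparse_init(void)
{
	unsigned long nr;
	int count = 0;

	for (nr = 0; nr < NR_SECTIONS; nr++) {
		if (!mem_section[nr].present)
			continue;
		mem_section[nr].mem_map =
			calloc(PAGES_PER_SECTION, sizeof(struct page));
		if (!mem_section[nr].mem_map)
			return -1;
		count++;
	}
	return count;
}
```

In this sketch the holes between ranges cost nothing: a section that was
never reported keeps a NULL mem_map.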

You could also go look at some of the hotplug code, but this email is
getting long enough as it is :)

-- Dave
