From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Fri, 05 Dec 2003 00:44:06 +0900
From: IWAMOTO Toshihiro
Subject: Re: memory hotremove prototype, take 3
In-Reply-To: <152440000.1070516333@[10.10.2.4]>
References: <20031201034155.11B387007A@sv1.valinux.co.jp>
	<187360000.1070480461@flay>
	<20031204035842.72C9A7007A@sv1.valinux.co.jp>
	<152440000.1070516333@10.10.2.4>
MIME-Version: 1.0 (generated by SEMI 1.14.3 - "Ushinoya")
Content-Type: text/plain; charset=US-ASCII
Message-Id: <20031204154406.7FC587007A@sv1.valinux.co.jp>
Sender: owner-linux-mm@kvack.org
Return-Path:
To: "Martin J. Bligh"
Cc: IWAMOTO Toshihiro ,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org
List-ID:

At Wed, 03 Dec 2003 21:38:54 -0800,
Martin J. Bligh wrote:
>
> > My target is somewhat NUMA-ish and fairly large.  So I'm not sure if
> > CONFIG_NONLINEAR fits, but CONFIG_NUMA isn't perfect either.
>
> If your target is NUMA, then you really, really need CONFIG_NONLINEAR.
> We don't support multiple pgdats per node, nor do I wish to, as it'll
> make an unholy mess ;-).  With CONFIG_NONLINEAR, the discontiguities
> within a node are buried down further, so we have much less complexity
> to deal with from the main VM.  The abstraction also keeps the poor
> VM engineers trying to read / write the code saner via simplicity ;-)

IIRC, memory is contiguous within a NUMA node.  I think Goto-san will
clarify this issue when his code gets ready. :-)

> WRT generic discontigmem support (not NUMA), doing that via pgdats
> should really go away, as there's no real difference between the
> chunks of physical memory as far as the page allocator is concerned.
> The plan is to use Daniel's nonlinear stuff to replace that, and keep
> the pgdats strictly for NUMA.  Same would apply to hotpluggable zones -
> I'd hate to end up with 512 pgdats of stuff that are really all the
> same memory types underneath.

Yes.  Unnecessary zone rebalancing would suck.
> The real issue you have is the mapping of the struct pages - if we can
> achieve a non-contig mapping of the mem_map / lmem_map array, we should
> be able to take memory on and offline reasonably easily.  If you're
> willing for a first implementation to pre-allocate the struct page array
> for every possible virtual address, it makes life a lot easier.

Preallocating the struct page array isn't feasible for the target
system because its max memory / min memory ratio is large.
Our plan is to use the beginning (or the end) of the memory block
being hotplugged.  If a 2GB memory block is added, the first ~20MB of
it is used for the struct page array covering the rest of the block.

> >> PS. What's this bit of the patch for?
> >>
> >>  void *vmalloc(unsigned long size)
> >>  {
> >> +#ifdef CONFIG_MEMHOTPLUGTEST
> >> +       return __vmalloc(size, GFP_KERNEL, PAGE_KERNEL);
> >> +#else
> >>         return __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM, PAGE_KERNEL);
> >> +#endif
> >>  }
> >
> > This is necessary because kernel memory cannot be swapped out.
> > Only highmem can be hot removed, though it doesn't need to be highmem.
> > We can define another zone attribute such as GFP_HOTPLUGGABLE.
>
> You could just lock the pages, I'd think?  I don't see at a glance
> exactly what you were using this for, but would that work?

I haven't seriously considered implementing hot removal of vmalloc'd
memory, but I guess that would be too complicated, if not impossible.
Making kernel threads or interrupt handlers block on memory access
sounds very difficult to me.

--
IWAMOTO Toshihiro

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: aart@kvack.org