From: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
To: Jiang Liu <jiang.liu@huawei.com>
Cc: Mel Gorman <mgorman@suse.de>, "H. Peter Anvin" <hpa@zytor.com>,
	"Luck, Tony" <tony.luck@intel.com>,
	Tang Chen <tangchen@cn.fujitsu.com>,
	"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
	"rob@landley.net" <rob@landley.net>,
	"laijs@cn.fujitsu.com" <laijs@cn.fujitsu.com>,
	"wency@cn.fujitsu.com" <wency@cn.fujitsu.com>,
	"linfeng@cn.fujitsu.com" <linfeng@cn.fujitsu.com>,
	"yinghai@kernel.org" <yinghai@kernel.org>,
	"kosaki.motohiro@jp.fujitsu.com" <kosaki.motohiro@jp.fujitsu.com>,
	"minchan.kim@gmail.com" <minchan.kim@gmail.com>,
	"rientjes@google.com" <rientjes@google.com>,
	"rusty@rustcorp.com.au" <rusty@rustcorp.com.au>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"linux-doc@vger.kernel.org" <linux-doc@vger.kernel.org>,
	Len Brown <lenb@kernel.org>, "Wang, Frank" <frank.wang@intel.com>
Subject: Re: [PATCH v2 0/5] Add movablecore_map boot option
Date: Fri, 30 Nov 2012 12:15:42 +0900
Message-ID: <50B824DE.40702@jp.fujitsu.com>
In-Reply-To: <50B82064.9000405@huawei.com>

Hi Jiang,

2012/11/30 11:56, Jiang Liu wrote:
> Hi Mel,
> 	Thanks for your great comments!
>
> On 2012-11-29 19:00, Mel Gorman wrote:
>> On Wed, Nov 28, 2012 at 01:38:47PM -0800, H. Peter Anvin wrote:
>>> On 11/28/2012 01:34 PM, Luck, Tony wrote:
>>>>>
>>>>> 2. use boot option
>>>>>    This is our proposal. New boot option can specify memory range to use
>>>>>    as movable memory.
>>>>
>>>> Isn't this just moving the work to the user? To pick good values for the
>>>> movable areas, they need to know how the memory lines up across
>>>> node boundaries ... because they need to make sure to allow some
>>>> non-movable memory allocations on each node so that the kernel can
>>>> take advantage of node locality.
>>>>
>>>> So the user would have to read at least the SRAT table, and perhaps
>>>> more, to figure out what to provide as arguments.
>>>>
>>>> Since this is going to be used on a dynamic system where nodes might
>>>> be added and removed - the right values for these arguments might
>>>> change from one boot to the next. So even if the user gets them right
>>>> on day 1, a month later when a new node has been added, or a broken
>>>> node removed, the values would be stale.
>>>>
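
To make the concern above concrete, here is a minimal sketch, in Python, of
the check a user would effectively have to do by hand before picking movable
ranges. The node layout and the movable ranges are made-up values (nothing is
read from a real SRAT), and the helper names are hypothetical. It is a toy
model under those assumptions, not kernel code.

```python
GIB = 1 << 30

def merge_ranges(ranges):
    """Merge overlapping or adjacent half-open (start, end) ranges."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def unmovable_bytes_per_node(nodes, movable):
    """Return {node_id: bytes NOT covered by any movable range}.

    nodes:   {node_id: (start, end)} physical ranges, half-open
    movable: list of (start, end) ranges the user wants to be movable
    """
    merged = merge_ranges(movable)
    left = {}
    for nid, (n_start, n_end) in nodes.items():
        covered = 0
        for m_start, m_end in merged:
            lo, hi = max(n_start, m_start), min(n_end, m_end)
            if lo < hi:
                covered += hi - lo
        left[nid] = (n_end - n_start) - covered
    return left

# Hypothetical two-node machine and a user-supplied movable layout.
nodes = {0: (0, 4 * GIB), 1: (4 * GIB, 8 * GIB)}
movable = [(2 * GIB, 4 * GIB), (3 * GIB, 8 * GIB)]

left = unmovable_bytes_per_node(nodes, movable)
starved = [nid for nid, size in left.items() if size == 0]
```

With these example values node 1 ends up with no non-movable memory at all,
which is the case described above where kernel allocations can no longer be
node-local. If the node layout changes between boots, the same movable list
has to be re-checked against the new layout.
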
>>>
>>> I gave this feedback in person at LCE: I consider the kernel
>>> configuration option to be useless for anything other than debugging.
>>> Trying to promote it as an actual solution, to be used by end users in
>>> the field, is ridiculous at best.
>>>
>>
>> I've not been paying a whole pile of attention to this because it's not an
>> area I'm active in but I agree that configuring ZONE_MOVABLE like
>> this at boot-time is going to be problematic. As awkward as it is, it
>> would probably work out better to only boot with one node by default and
>> then hot-add the nodes at runtime using either an online sysfs file or
>> an online-reserved file that hot-adds the memory to ZONE_MOVABLE. Still
>> clumsy but better than specifying addresses on the command line.
>>
>> That said, I also find using ZONE_MOVABLE to be a problem in itself that
>> will cause problems down the road. Maybe this was discussed already but
>> just in case I'll describe the problems I see.
>>
>> If any significant percentage of memory is in ZONE_MOVABLE then the memory
>> hotplug people will have to deal with all the lowmem/highmem problems
>> that used to be faced by 32-bit x86 with PAE enabled. As a simple example,
>> metadata intensive workloads will not be able to use all of memory because
>> the kernel allocations will be confined to a subset of memory. A more
>> complex example is that page table page allocations are also restricted
>> meaning it's possible that a process will not even be able to mmap() a high
>> percentage of memory simply because it cannot allocate the page tables to
>> store the mappings. ZONE_MOVABLE works up to a *point*, but it's a hack. It
>> was a hack when it was introduced but at least then the expectation was
>> that ZONE_MOVABLE was going to be used for huge pages and there was at
>> least an expectation that it would not be available for normal usage.
>>
>> Fundamentally the reason one would want to use ZONE_MOVABLE is because
>> we cannot migrate a lot of kernel memory -- slab pages, page table pages,
>> device-allocated buffers etc.  My understanding is that other OSes get around
>> this by requiring that subsystems and drivers have callbacks that allow the
>> core VM to force certain memory to be released but that may be impractical
>> for Linux. I don't know for sure though, this is just what I heard.
> As far as I know, one other OS confines immovable pages to the low end of
> memory, and raises the limit on demand. But the drawback of this solution
> is a serious performance drop (about 10% on average) because it essentially
> disables NUMA optimization for kernel/DMA memory allocations.
>
>> For Linux, the hotplug people need to start thinking about how to get
>> around this migration problem. The first problem faced is the memory model
>> and how it maps virt->phys addresses. We have a 1:1 mapping because it's
>> fast but not because it's a fundamental requirement. Start considering
>> what happens if the memory model is changed to allow some sections to have
>> fast lookup for virt_to_phys and other sections to have slow lookups. On
>> hotplug, try and empty all the sections. If the section cannot be emptied
>> because of kernel pages then the section gets marked as "offline-migrated"
>> or something. Stop the whole machine (yes, I mean stop_machine), copy
>> those unmovable pages to another location, update the kernel virt->phys
>> mapping for the section being offlined so the virt addresses point to the
>> new physical addresses and resume.  Virt->phys lookups are going to be
>> a lot slower because a full section lookup will be necessary every time
>> effectively breaking SPARSE_VMEMMAP and there will be a performance penalty
>> but it should work. This will cover some slab pages where the data is only
>> accessed via the virtual address -- inode caches, dcache etc.
>>
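
The mixed fast/slow lookup described above can be sketched as a toy model in
Python. Everything here is an assumption made for illustration: the section
size, the PAGE_OFFSET value, and the class and method names are hypothetical,
and a plain dict stands in for whatever per-section structure a real
implementation would need. It only shows the shape of the idea: sections keep
the linear mapping until they are "offline-migrated", after which lookups for
that section go through a table.

```python
SECTION_SIZE = 1 << 27               # 128 MiB per section (assumed value)
PAGE_OFFSET = 0xFFFF880000000000     # base of the linear mapping (illustrative)

class ToyMemoryModel:
    def __init__(self):
        # section index -> new physical base, only for migrated sections
        self.remapped = {}

    def virt_to_phys(self, vaddr):
        linear = vaddr - PAGE_OFFSET
        section, offset = divmod(linear, SECTION_SIZE)
        new_base = self.remapped.get(section)
        if new_base is None:
            return linear            # fast case: plain 1:1 offset
        return new_base + offset     # slow case: per-section lookup

    def offline_migrate(self, section, new_phys_base):
        """Record that a section's contents now live at new_phys_base.

        In the proposal the copy and this update would happen while the
        machine is stopped; the toy model only records the new mapping.
        """
        self.remapped[section] = new_phys_base

model = ToyMemoryModel()
vaddr = PAGE_OFFSET + 3 * SECTION_SIZE + 0x1000
before = model.virt_to_phys(vaddr)
model.offline_migrate(3, 10 * SECTION_SIZE)
after = model.virt_to_phys(vaddr)
```

Note that even the fast case now pays for the section check on every lookup,
which is the performance penalty mentioned above. The virtual address held by
the kernel stays the same; only the physical address behind it changes.
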
>> It will not work where the physical address is used. The obvious example
>> is page table pages. For page tables, during stop machine you will have to
>> walk all processes' page tables looking for references to the page you're
>> trying to move and update them. It is possible to just plain migrate
>> page table pages but when it was last implemented years ago there was a
>> constant performance penalty for everybody and it was not popular.  Taking a
>> heavy-handed approach just during memory hot-remove might be more palatable.
>>
>> For the remaining pages such as those that have been handed to devices
>> or are pinned for DMA then your options become more limited. You may
>> still have to restrict allocating these pages (where possible) to a
>> region that cannot be hot-removed but at least this will be relatively
>> few pages.
>>
>> The big downside of this proposal is that it's unproven, not designed,
>> would be extremely intrusive and I expect it would be a *massive* amount
>> of development effort that will be difficult to get right. The upside is
>> configuring it will be a lot easier because all you'll need is a variation
>> of kernelcore= to reserve a percentage of memory for allocations we *really*
>> cannot migrate because the physical pages are owned by a device that cannot
>> release them, potentially forever. The other upside is that it does not
>> hit crazy lowmem/highmem style problems.
>>
>> ZONE_MOVABLE at least will allow a node to be removed very quickly but
>> because it will paint you into a corner there should be a plan on what
>> you're going to replace it with.
>
> I have some thoughts here. The basic idea is that a flexible memory
> hotplug solution needs cooperation between the OS, BIOS and hardware.
>
> As you have mentioned, ZONE_MOVABLE is a quick but slightly dirty
> solution. It's quick because we can rely on the existing mechanism
> to configure the movable zone and no changes to the memory model are
> needed. It's a little dirty because:
> 1) We need to handle cases of running out of immovable pages. The hotplug
> implementation shouldn't cause extra service interruption when normal zones
> are under pressure. Otherwise it would be a joke for service interruptions
> to be caused by a feature that is meant to improve service availability.
> 2) We still can't handle normal kernel pages used by the kernel, devices,
> etc.
> 3) It may cause a serious performance drop if we configure all memory
> on a NUMA node as ZONE_MOVABLE.
>
> For the first issue, I think we could automatically convert pages
> from movable zones into normal zones. Congyang from Fujitsu has provided
> a patchset to manually convert pages from movable zones into normal zones.
> I think we could extend that mechanism to convert pages automatically when
> normal zones are under pressure, by hooking into the slow page allocation
> path.
>
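
The idea above of converting movable pages on demand can be sketched as a toy
model in Python. This is only an illustration under stated assumptions: the
class, its fields and the block size are hypothetical, zones are reduced to
free-page counters, and the model pretends that a movable block can be
converted as soon as it is free. A real implementation would first have to
migrate the pages out of the block.

```python
class ToyZones:
    def __init__(self, normal_free, movable_free, block_pages=512):
        self.normal_free = normal_free      # free pages in the normal zone
        self.movable_free = movable_free    # free pages in the movable zone
        self.block_pages = block_pages      # conversion granularity (assumed)

    def alloc_unmovable(self, pages):
        """Allocate unmovable pages, converting movable blocks if needed."""
        if pages > self.normal_free:
            # "Slow path": the normal zone is under pressure.
            if not self._convert(pages - self.normal_free):
                return False
        self.normal_free -= pages
        return True

    def _convert(self, shortfall):
        """Move whole blocks from the movable zone to the normal zone."""
        blocks = -(-shortfall // self.block_pages)   # ceiling division
        moved = blocks * self.block_pages
        if moved > self.movable_free:
            return False                             # nothing is converted
        self.movable_free -= moved
        self.normal_free += moved
        return True

zones = ToyZones(normal_free=100, movable_free=2048)
ok = zones.alloc_unmovable(600)      # needs one 512-page block converted
state = (zones.normal_free, zones.movable_free)
```

Each conversion shrinks the part of the node that can still be removed
quickly, so a policy like this trades hot-remove ability for availability
under pressure.
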
> We rely on hardware features to solve the second and third issues.
> Some new platforms provide a new RAS feature called "hardware memory
> migration", which transparently migrates memory from one memory device
> to another. With hardware memory migration, we could configure one
> memory device on a NUMA node to host the normal zone, and the other memory
> devices to host the movable zone. With this configuration there is no
> performance drop, because each NUMA node still has a local normal zone.
> When trying to remove a memory device hosting the normal zone, we just
> need to find another spare memory device and use hardware memory migration
> to transparently migrate the memory content to the spare one. The drawback
> is that we have a strong dependency on hardware features, so it's not a
> general solution for all architectures.

I agree with you. If the BIOS and hardware support memory hotplug, the OS
should use them. But if the OS cannot use them, we need to solve the problem
in the OS. I think that our proposal, which uses ZONE_MOVABLE, is a first
step toward supporting memory hotplug.

Thanks,
Yasuaki Ishimatsu

>
> Regards!
> Gerry
>
>


