From: David Hildenbrand <david@redhat.com>
To: Oscar Salvador <osalvador@suse.de>, akpm@linux-foundation.org
Cc: mhocko@suse.com, dan.j.williams@intel.com,
Jonathan.Cameron@huawei.com, anshuman.khandual@arm.com,
linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH 0/4] mm,memory_hotplug: allocate memmap from hotadded memory
Date: Thu, 28 Mar 2019 16:09:06 +0100
Message-ID: <cc68ec6d-3ad2-a998-73dc-cb90f3563899@redhat.com>
In-Reply-To: <20190328134320.13232-1-osalvador@suse.de>
On 28.03.19 14:43, Oscar Salvador wrote:
> Hi,
>
> since the last two RFCs went almost unnoticed (thanks, David, for the feedback),
> I decided to rework some parts to make them simpler, give the series more
> testing, and drop the RFC tag, to see if it gets more attention.
> I also incorporated David's feedback, so now all users of add_memory/__add_memory/
> add_memory_resource can specify whether they want to use this feature or not.
Terrific, I will also definitely try to make use of that in the next
virtio-mem prototype (looks like I'll finally have time to look into it
again).
> I also fixed some compilation issues when CONFIG_SPARSEMEM_VMEMMAP is not set.
>
> [Testing]
>
> Testing has been carried out on the following platforms:
>
> - x86_64 (small and big memblocks)
> - powerpc
> - arm64 (tested by colleagues at Huawei)
>
> I plan to test it on Xen and Hyper-V, but for now those two will not be
> using this feature, and neither will DAX/pmem.
I think doing it step by step is the right approach. Less likely to
break stuff.
>
> Of course, if this does not run into any strong objections, my next step is to
> work on enabling this on Xen/Hyper-V.
>
> [Coverletter]
>
> This is another step to make memory hotplug more usable. The primary
> goal of this patchset is to reduce the memory overhead of hot-added
> memory (at least for the SPARSEMEM_VMEMMAP memory model). The way we currently
> populate the memmap (the struct page array) has two main drawbacks:
>
> a) it consumes additional memory until the hot-added memory itself is
> onlined, and
> b) the memmap might end up on a different NUMA node, which is especially true
> for the movable_node configuration.
>
> a) is a problem especially for memory-hotplug-based memory "ballooning"
> solutions, where the delay between the physical memory hotplug and the
> onlining can lead to OOM; that led to the introduction of hacks like auto
> onlining (see 31bc3858ea3e ("memory-hotplug: add automatic onlining
> policy for the newly added memory")).
>
> b) can have performance drawbacks.
>
> I have also seen hot-add operations fail on some architectures because they
> ran out of order-x pages.
> E.g. on powerpc, in certain configurations we use order-8 pages,
> and given the 64KB base page size, that is 16MB.
> If we run out of those, we just fail the operation and cannot add
> more memory.
> We could fall back to base pages as x86_64 does, but we can do better.
>
> One way to mitigate all these issues is to simply allocate the memmap array
> (which is the largest memory footprint of physical memory hotplug)
> from the hot-added memory itself. The VMEMMAP memory model allows us to map
> any pfn range, so the memory doesn't need to be online to be usable
> for the array. See patch 3 for more details. In short, I am reusing the
> existing vmem_altmap mechanism, which achieves the same thing for nvdimm
> device memory.
>
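For anyone who has not looked at vmem_altmap before, the idea boils down
to something like the sketch below. base_pfn and nr_vmemmap_pages are
made-up variables, and how patch 3 actually wires this up may well differ;
only struct vmem_altmap and its fields (include/linux/memremap.h) are real:

#include <linux/memremap.h>

/*
 * Illustration only: describe a hot-added range whose first
 * nr_vmemmap_pages pfns are set aside to back the memmap of that
 * very range.
 */
struct vmem_altmap altmap = {
        .base_pfn = base_pfn,         /* first pfn of the hot-added range */
        .free     = nr_vmemmap_pages, /* pfns the memmap may be carved from */
};

The vmemmap population code then takes its backing pages out of that
window instead of asking the page allocator, which is what keeps the
memmap inside the hot-added range.
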
> There is also one potential drawback, though. If somebody uses memory
> hotplug for 1G (gigantic) hugetlb pages, then this scheme will obviously not
> work for them, because each memory block will contain a reserved
> area. Large x86 machines will use 2G memory blocks, so at least one 1G page
> will be available, but this is still not 2G...
>
> If that is a problem, we can always configure a fallback strategy to
> use the current scheme.
>
> Since this only works when CONFIG_SPARSEMEM_VMEMMAP is set,
> we check for it before setting the flag that allows us
> to use the feature, no matter whether the user asked for it.
>
> [Overall design]:
>
> Let us say we hot-add 2GB of memory on x86_64 (memory block size = 128M).
> That is:
>
> - 16 sections
> - 524288 pages
> - 8192 vmemmap pages (out of those 524288; we spend 512 pages per section)
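The numbers work out if you assume 4 KiB base pages and a 64-byte
struct page (which is where the 512 pages per section come from); a
quick userspace check, nothing more:

#include <stdio.h>

int main(void)
{
        unsigned long long hotadd   = 2ULL << 30;   /* 2 GB hot-added      */
        unsigned long long section  = 128ULL << 20; /* 128 MB sections     */
        unsigned long long pagesz   = 4096;         /* 4 KiB base pages    */
        unsigned long long structpg = 64;           /* sizeof(struct page) */

        unsigned long long sections      = hotadd / section;          /* 16     */
        unsigned long long pages         = hotadd / pagesz;           /* 524288 */
        unsigned long long vmemmap_pages = pages * structpg / pagesz; /* 8192   */

        printf("%llu sections, %llu pages, %llu vmemmap pages (%llu per section)\n",
               sections, pages, vmemmap_pages, vmemmap_pages / sections);
        return 0;
}
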
>
> The range of pages is: 0xffffea0004000000 - 0xffffea0006000000
> The vmemmap range is: 0xffffea0004000000 - 0xffffea0004080000
>
> 0xffffea0004000000 is the head vmemmap page (first page), while all the others
> are "tails".
>
> We keep the following information in those pages:
>
> - Head page:
> - head->_refcount: number of sections
> - head->private : number of vmemmap pages
> - Tail page:
> - tail->freelist : pointer to the head
>
> This is done because it eases the work in cases where we have to compute the
> number of vmemmap pages to know how much we have to skip, etc., and to keep
> the accounting of present_pages right.
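To make that concrete, a kernel-flavoured sketch of what recording the
metadata could look like; mark_vmemmap_pages() is a made-up name and the
real patch may do this differently, but set_page_count(), set_page_private()
and page->freelist are the stock helpers/fields:

#include <linux/mm.h>

/*
 * Illustration only: stash the metadata on the first vmemmap page
 * and make every tail point back at it.
 */
static void mark_vmemmap_pages(unsigned long start_pfn,
                               unsigned long nr_vmemmap_pages,
                               int nr_sections)
{
        struct page *head = pfn_to_page(start_pfn);
        unsigned long i;

        set_page_count(head, nr_sections);        /* head->_refcount */
        set_page_private(head, nr_vmemmap_pages); /* head->private   */

        for (i = 1; i < nr_vmemmap_pages; i++)
                pfn_to_page(start_pfn + i)->freelist = head; /* tail->freelist */
}

That way any tail can find the head, and the head alone carries the counts
needed for the accounting and for the deferred free described below.
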
>
> When we want to hot-remove the range, we need to be careful, because the first
> pages of that range are used for the memmap mapping, so if we removed those
> first, we would blow up while accessing the others later on.
> For that reason we keep the number of sections in head->_refcount, so we know
> how long we have to defer the freeing.
>
> Since sections are removed sequentially in a hot-remove operation, the
> approach taken here is that every time we hit free_section_memmap(), we decrease
> the refcount of the head.
> When it reaches 0, we know that we have hit the last section, so we call
> vmemmap_free() for the whole memory range backwards, making sure that
> the pages used for the mapping are the last to be freed.
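Roughly, I picture the teardown side like this (again only a sketch;
free_section_memmap_deferred() is a made-up name, the altmap parameter is
my own simplification, and whether the backwards walk lives in the caller
or inside vmemmap_free() is up to the actual patch):

#include <linux/mm.h>
#include <linux/memremap.h>

/*
 * Illustration only: each removed section drops the head's refcount;
 * only the removal of the last section tears down the mapping for the
 * whole range, so the vmemmap pages at the start of the range (which
 * back the memmap itself) go away last.
 */
static void free_section_memmap_deferred(struct page *head,
                                         unsigned long start,
                                         unsigned long end,
                                         struct vmem_altmap *altmap)
{
        if (!page_ref_dec_and_test(head))
                return; /* other sections still use this memmap */

        /* last section gone: now it is safe to unmap the whole range */
        vmemmap_free(start, end, altmap);
}
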
>
> Vmemmap pages are charged to spanned_pages/present_pages, but not to managed_pages.
>
I guess one important thing to mention is that it is no longer possible
to remove memory in a different granularity than it was added in. I vaguely
remember that the ACPI code sometimes "reuses" parts of already added
memory. We would have to validate that this can indeed not be an issue.
drivers/acpi/acpi_memhotplug.c:
        result = __add_memory(node, info->start_addr, info->length);
        if (result && result != -EEXIST)
                continue;
What would happen when removing this DIMM (->remove_memory())?
Also have a look at
arch/powerpc/platforms/powernv/memtrace.c
I consider it evil code. It will simply try to offline+unplug *some*
memory it finds in *some* granularity. Not sure if this might be
problematic.
Would there be any "safety net" for adding/removing memory in different
granularities?
--
Thanks,
David / dhildenb