Re: [PATCH v9 4/8] mm,memory_hotplug: Allocate memmap from the added memory range

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Oscar Salvador <osalvador@suse.de>
To: Michal Hocko <mhocko@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	David Hildenbrand <david@redhat.com>,
	Anshuman Khandual <anshuman.khandual@arm.com>,
	Pavel Tatashin <pasha.tatashin@soleen.com>,
	Vlastimil Babka <vbabka@suse.cz>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v9 4/8] mm,memory_hotplug: Allocate memmap from the added memory range
Date: Wed, 21 Apr 2021 10:15:46 +0200	[thread overview]
Message-ID: <20210421081546.GD22456@linux> (raw)
In-Reply-To: <YH6zQ1Dty9kJFkuk@dhcp22.suse.cz>

On Tue, Apr 20, 2021 at 12:56:03PM +0200, Michal Hocko wrote:
> On Fri 16-04-21 13:24:07, Oscar Salvador wrote:
> > Physical memory hotadd has to allocate a memmap (struct page array) for
> > the newly added memory section. Currently, alloc_pages_node() is used
> > for those allocations.
> > 
> > This has some disadvantages:
> >  a) an existing memory is consumed for that purpose
> >     (eg: ~2MB per 128MB memory section on x86_64)
> 
> I would extend this slightly. This can even lead to extreme cases where
> system goes OOM because the physically hotplugged memory depletes the
> available memory before it is onlined.

Ok.

> > Vmemap page tables can map arbitrary memory.
> > That means that we can simply use the beginning of each memory section and
> > map struct pages there.
> 
> Again this can be confusing because this is not what is really happening
> in practice because we are going to have a multisection memory block
> where all sections will be backed by a common reserved space rather than
> per section sparse space. I would go with
> 
> "
> Vmemap page tables can map arbitrary memory. That means that we can
> reserve a part of the physically hotadded memory to back vmemmap page
> tables. This implementation uses the beggining of the hotplugged memory
> for that purpose.
> "

Yeah, I thought I fixed that, it should have been "That means that we can simply
use the beginning of each memory block...", but I am ok with your rewording.

> There is quite a large leap from __populate_section_memmap to the
> memory_block that deserves explaining to not lose all the subtle things
> discussed in the past. I think it should be made clear why all the fuzz.
> I would structure it as follows:
> "
> There are some non-obiously things to consider though.  Vmemmap
> pages are allocated/freed during the memory hotplug events
> (add_memory_resource, try_remove_memory) when the memory is
> added/removed. This means that the reserved physical range is not online
> yet it is used. The most obvious side effect is that pfn_to_online_page
> returns NULL for those pfns. The current design expects that this
> should be OK as the hotplugged memory is considered a garbage until it
> is onlined. For example hibernation wouldn't save the content of those
> vmmemmaps into the image so it wouldn't be restored on resume but this
> should be OK as there no real content to recover anyway while metadata
> is reachable from other data structures (e.g. vmemmap page tables).
> 
> The reserved space is therefore (de)initialized during the {on,off}line
> events (mhp_{de}init_memmap_on_memory). That is done by extracting page
> allocator independent initialization from the regular onlining path.
> The primary reason to handle the reserved space outside of {on,off}line_pages
> is to make each initialization specific to the purpose rather than
> special case them in a single function.

Ok, that definitely adds a valuable information.

> > diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> > index f209925a5d4e..2e2b2f654f0a 100644
> > --- a/drivers/base/memory.c
> > +++ b/drivers/base/memory.c
> > @@ -173,16 +173,72 @@ static int memory_block_online(struct memory_block *mem)
> >  {
> >  	unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
> >  	unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
> > +	unsigned long nr_vmemmap_pages = mem->nr_vmemmap_pages;
> > +	struct zone *zone;
> > +	int ret;
> > +
> > +	zone = zone_for_pfn_range(mem->online_type, mem->nid, start_pfn, nr_pages);
> > +
> > +	/*
> > +	 * Although vmemmap pages have a different lifecycle than the pages
> > +	 * they describe (they remain until the memory is unplugged), doing
> > +	 * their initialization and accounting at memory onlining/offlining
> > +	 * stage simplifies things a lot.
> 
> "simplify things a lot" is not really helpful to people reading the
> code. It would be much better to state reasons here. I would go with
> 	 * stage helps to keep accounting easier to follow - e.g.
> 	 * vmemmaps belong to the same zone as the onlined memory.

Ok

> >  static void online_pages_range(unsigned long start_pfn, unsigned long nr_pages)
> >  {
> >  	const unsigned long end_pfn = start_pfn + nr_pages;
> > -	unsigned long pfn;
> > +	unsigned long pfn = start_pfn;
> > +
> > +	while (!IS_ALIGNED(pfn, MAX_ORDER_NR_PAGES)) {
> > +		(*online_page_callback)(pfn_to_page(pfn), pageblock_order);
> > +		pfn += pageblock_nr_pages;
> > +	}
> 
> I believe we do not need to check for nr_pages as the actual operation
> will never run out of range in practice but the code is more subtle than

If you mean that IS_ALIGNED(pfn, MAX_ORDER_NR_PAGES) can go, that is not
right.
Of course, with your changes below it would not be necesary.

> necessary. Using two different iteration styles is also hurting the code
> readability. I would go with the following
> 	for (pfn = start_pfn; pfn < end_pfn; ) {
> 		unsigned long order = min(MAX_ORDER - 1UL, __ffs(pfn));
> 
> 		while (start + (1UL << order) > end_pfn)
>                         order--;
> 		(*online_page_callback)(pfn_to_page(pfn), pageblock_order);
> 		pfn += 1 << order;
> 	}
> 
> which is what __free_pages_memory does already.

this is kinda what I used to have in the early versions, but it was agreed
with David to split it in two loops to make it explicit.
I can go back to that if it is preferred.

> > +	if (memmap_on_memory) {
> > +		nr_vmemmap_pages = walk_memory_blocks(start, size, NULL,
> > +						      get_nr_vmemmap_pages_cb);
> > +		if (nr_vmemmap_pages) {
> > +			if (size != memory_block_size_bytes()) {
> > +				pr_warn("Refuse to remove %#llx - %#llx,"
> > +					"wrong granularity\n",
> > +					start, start + size);
> > +				return -EINVAL;
> > +			}
> > +
> > +			/*
> > +			 * Let remove_pmd_table->free_hugepage_table do the
> > +			 * right thing if we used vmem_altmap when hot-adding
> > +			 * the range.
> > +			 */
> > +			mhp_altmap.alloc = nr_vmemmap_pages;
> > +			altmap = &mhp_altmap;
> > +		}
> > +	}
> > +
> >  	/* remove memmap entry */
> >  	firmware_map_remove(start, start + size, "System RAM");
> 
> I have to say I still dislike this and I would just wrap it inside out
> and do the operation from within walk_memory_blocks but I will not
> insist.

I have to confess I forgot about the details of that dicussion, as we were
quite focused on decoupling vmemmap pages from {online,offline} interface.
Would you mind elaborating a bit more?


-- 
Oscar Salvador
SUSE L3

next prev parent reply	other threads:[~2021-04-21  8:15 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <20210416112411.9826-1-osalvador@suse.de>
2021-04-16 11:24 ` [PATCH v9 1/8] drivers/base/memory: Introduce memory_block_{online,offline} Oscar Salvador
2021-04-20  9:23   ` Michal Hocko
2021-04-16 11:24 ` [PATCH v9 2/8] mm,memory_hotplug: Relax fully spanned sections check Oscar Salvador
2021-04-20  9:40   ` Michal Hocko
2021-04-21  7:37     ` Oscar Salvador
2021-04-16 11:24 ` [PATCH v9 3/8] mm,memory_hotplug: Factor out adjusting present pages into adjust_present_page_count() Oscar Salvador
2021-04-20  9:45   ` Michal Hocko
2021-04-21  8:00     ` Oscar Salvador
2021-04-21  8:06       ` David Hildenbrand
2021-04-21  8:31       ` Michal Hocko
2021-04-21  8:35         ` Oscar Salvador
2021-04-16 11:24 ` [PATCH v9 4/8] mm,memory_hotplug: Allocate memmap from the added memory range Oscar Salvador
2021-04-20 10:56   ` Michal Hocko
2021-04-21  8:15     ` Oscar Salvador [this message]
2021-04-21  8:39       ` Michal Hocko
2021-04-21  8:44         ` David Hildenbrand
2021-04-21  8:49           ` Michal Hocko
2021-04-21  8:52             ` David Hildenbrand
2021-04-21  8:46         ` Oscar Salvador
2021-04-16 11:24 ` [PATCH v9 5/8] acpi,memhotplug: Enable MHP_MEMMAP_ON_MEMORY when supported Oscar Salvador
2021-04-16 11:24 ` [PATCH v9 6/8] mm,memory_hotplug: Add kernel boot option to enable memmap_on_memory Oscar Salvador
2021-04-16 11:24 ` [PATCH v9 7/8] x86/Kconfig: Introduce ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE Oscar Salvador
2021-04-20 10:56   ` Michal Hocko
2021-04-16 11:24 ` [PATCH v9 8/8] arm64/Kconfig: " Oscar Salvador

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210421081546.GD22456@linux \
    --to=osalvador@suse.de \
    --cc=akpm@linux-foundation.org \
    --cc=anshuman.khandual@arm.com \
    --cc=david@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@suse.com \
    --cc=pasha.tatashin@soleen.com \
    --cc=vbabka@suse.cz \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox