From: David Hildenbrand <david@redhat.com>
To: Alexander Duyck <alexander.duyck@gmail.com>,
Dan Williams <dan.j.williams@intel.com>
Cc: LKML <linux-kernel@vger.kernel.org>,
linux-mm <linux-mm@kvack.org>,
Andrew Morton <akpm@linux-foundation.org>,
Oscar Salvador <osalvador@suse.com>,
Michal Hocko <mhocko@suse.com>,
Pavel Tatashin <pasha.tatashin@soleen.com>,
Wei Yang <richard.weiyang@gmail.com>, Qian Cai <cai@lca.pw>,
"Matthew Wilcox (Oracle)" <willy@infradead.org>,
"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
Stephen Rothwell <sfr@canb.auug.org.au>,
Dave Airlie <airlied@redhat.com>,
Andrey Konovalov <andreyknvl@google.com>,
Alexander Duyck <alexander.h.duyck@linux.intel.com>,
Ira Weiny <ira.weiny@intel.com>,
John Hubbard <jhubbard@nvidia.com>,
Arun KS <arunks@codeaurora.org>,
Souptick Joarder <jrdr.linux@gmail.com>,
Robin Murphy <robin.murphy@arm.com>,
Yang Shi <yang.shi@linux.alibaba.com>,
Jason Gunthorpe <jgg@ziepe.ca>,
Logan Gunthorpe <logang@deltatee.com>,
Vlastimil Babka <vbabka@suse.cz>,
Mel Gorman <mgorman@techsingularity.net>,
Mike Rapoport <rppt@linux.vnet.ibm.com>,
Alexander Potapenko <glider@google.com>
Subject: Re: [PATCH v3 01/11] mm/memremap: Get rid of memmap_init_zone_device()
Date: Thu, 29 Aug 2019 18:55:42 +0200 [thread overview]
Message-ID: <46b8e215-97d4-9097-9941-58e050c5efcc@redhat.com> (raw)
In-Reply-To: <CAKgT0UdKwYhF++=b=vN_Kw_SORtgJ3ehCAn8a9kp8tb79HknyQ@mail.gmail.com>
On 29.08.19 18:39, Alexander Duyck wrote:
> On Thu, Aug 29, 2019 at 12:00 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> As far as I can see, the original split was meant to speed things up
>> by avoiding duplicate initialization. We now only initialize once - and
>> we really should initialize under the lock when resizing the zones.
>
> What do you mean by duplicate initialization? Basically the issue was
> that we can have systems with massive memory footprints and I was just
> trying to get the initialization time under a minute. The compromise I
> made was to split the initialization so that we only initialized the
> pages in the altmap and defer the rest so that they can be initialized
> in parallel.
>
> What this patch does is serialize the initialization so it will likely
> take 2 to 4 minutes or more to initialize memory on a system where I
> had brought the init time under about 30 seconds.
>
>> As soon as we drop the lock we might have memory unplug running, trying
>> to shrink the zone again. In case the memmap was not properly initialized,
>> the zone/node shrinking code might get false negatives when searching for
>> the new zone/node boundaries - bad. We suddenly could no longer span the
>> memory we just added.
>
> The issue as I see it is that we can have systems with multiple nodes
> and each node can populate a fairly significant amount of data. By
> pushing this all back under the hotplug lock you are serializing the
> initialization for each node resulting in a 4x or 8x increase in
> initialization time on some of these bigger systems.
>
>> Also, once we want to fix set_zone_contiguous(zone) inside
>> move_pfn_range_to_zone() to actually work with ZONE_DEVICE (instead of
>> always immediately stopping and never setting zone->contiguous) we have
>> to have the whole memmap initialized at that point. (not sure if we
>> want that in the future, just a note)
>>
>> Let's just keep things simple and initialize the memmap when resizing
>> the zones under the lock.
>>
>> If this is a real performance issue, we have to watch out for
>> alternatives.
>
> My concern is that this is going to become a performance issue, but I
> don't have access to the hardware at the moment to test how much of
> one. I'll have to check to see if I can get access to a system to test
> this patch set.
>
Thanks for having a look - I already dropped this patch again. We will
instead stop shrinking ZONE_DEVICE zones, so assume this patch is gone.
[...]
> So if you are moving this all under the lock then this is going to
> serialize initialization and it is going to be quite expensive on big
> systems. I was only ever able to get the init time down to something
> like ~20s with the optimized function. Since that has been torn apart
> and you are back to doing things with memmap_init_zone we are probably
> looking at more like 25-30s for each node, and on a 4 node system we
> are looking at 2 minutes or so which may lead to issues if people are
> mounting this.
>
> Instead of forcing this all to be done under the hotplug lock is there
> some way we could do this under the zone span_seqlock instead to
> achieve the same result?
I guess the right approach really is, as Michal suggests, to not shrink
at all (at least for ZONE_DEVICE) :)
>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index c5d62f1c2851..44038665fe8e 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -5845,7 +5845,7 @@ overlap_memmap_init(unsigned long zone, unsigned long *pfn)
>> */
>> void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
>> unsigned long start_pfn, enum memmap_context context,
>> - struct vmem_altmap *altmap)
>> + struct dev_pagemap *pgmap)
>> {
>> unsigned long pfn, end_pfn = start_pfn + size;
>> struct page *page;
>> @@ -5853,24 +5853,6 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
>> if (highest_memmap_pfn < end_pfn - 1)
>> highest_memmap_pfn = end_pfn - 1;
>>
>> -#ifdef CONFIG_ZONE_DEVICE
>> - /*
>> - * Honor reservation requested by the driver for this ZONE_DEVICE
>> - * memory. We limit the total number of pages to initialize to just
>> - * those that might contain the memory mapping. We will defer the
>> - * ZONE_DEVICE page initialization until after we have released
>> - * the hotplug lock.
>> - */
>> - if (zone == ZONE_DEVICE) {
>> - if (!altmap)
>> - return;
>> -
>> - if (start_pfn == altmap->base_pfn)
>> - start_pfn += altmap->reserve;
>> - end_pfn = altmap->base_pfn + vmem_altmap_offset(altmap);
>> - }
>> -#endif
>> -
>> for (pfn = start_pfn; pfn < end_pfn; pfn++) {
>> /*
>> * There can be holes in boot-time mem_map[]s handed to this
>> @@ -5892,6 +5874,20 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
>> if (context == MEMMAP_HOTPLUG)
>> __SetPageReserved(page);
>>
>> +#ifdef CONFIG_ZONE_DEVICE
>> + if (zone == ZONE_DEVICE) {
>> + WARN_ON_ONCE(!pgmap);
>> + /*
>> + * ZONE_DEVICE pages union ->lru with a ->pgmap back
>> + * pointer and zone_device_data. It is a bug if a
>> + * ZONE_DEVICE page is ever freed or placed on a driver
>> + * private list.
>> + */
>> + page->pgmap = pgmap;
>> + page->zone_device_data = NULL;
>> + }
>> +#endif
>> +
>
> So I am not sure this is right. Part of the reason for the split is
> that the pages that were a part of the altmap had an LRU setup, not
> the pgmap/zone_device_data setup. This is changing that.
Yeah, you might be right; we would have to handle the altmap part
separately.
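For reference, the range clamping that the deleted ZONE_DEVICE hunk performed for altmap-backed memory can be sketched roughly as follows. This is a standalone illustration with mock types, not kernel code; the struct layout and the `clamp_to_altmap` helper name are simplifications for the example:

```c
#include <assert.h>

/* Simplified stand-in for the kernel's struct vmem_altmap: a device
 * memory range where the first "reserve" pfns are set aside by the
 * driver and the next "free" pfns hold the memmap itself. */
struct vmem_altmap {
	unsigned long base_pfn;
	unsigned long reserve;
	unsigned long free;
};

/* Mirrors vmem_altmap_offset(): the number of pfns at the start of
 * the range that do not back usable memory. */
static unsigned long vmem_altmap_offset(const struct vmem_altmap *altmap)
{
	return altmap->reserve + altmap->free;
}

/* What the deleted hunk did: clamp [*start_pfn, *end_pfn) to just the
 * pfns that might contain the memory map, leaving the rest of the
 * range for deferred, parallel initialization after the hotplug lock
 * is dropped. */
static void clamp_to_altmap(const struct vmem_altmap *altmap,
			    unsigned long *start_pfn, unsigned long *end_pfn)
{
	if (*start_pfn == altmap->base_pfn)
		*start_pfn += altmap->reserve;
	*end_pfn = altmap->base_pfn + vmem_altmap_offset(altmap);
}
```

With, say, 16 reserved pfns and 512 memmap-backing pfns, only that small prefix of the range is initialized under the lock; the bulk of the device memory is skipped here and initialized later.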
>
> Also, I am more a fan of just testing pgmap and if it is present then
> assign page->pgmap and reset zone_device_data. Then you can avoid the
> test for zone on every iteration and the WARN_ON_ONCE check, or at
> least you could move it outside the loop so we don't incur the cost
> with each page. Are there situations where you would have pgmap but
> not a ZONE_DEVICE page?
Don't think so. But I am definitely not a ZONE_DEVICE expert.
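A rough standalone sketch of the restructuring Alexander suggests (mock types, not actual kernel code; the `init_range` helper and field layout are simplified for illustration), keying off pgmap instead of re-testing the zone on every iteration:

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-ins for the kernel types; in the real struct page,
 * pgmap and zone_device_data live in a union with the lru list head. */
struct dev_pagemap { int id; };

struct page {
	struct dev_pagemap *pgmap;
	void *zone_device_data;
};

/* The suggested shape: decide once per call whether this is a
 * ZONE_DEVICE init (pgmap != NULL), so any sanity check can be hoisted
 * outside the loop instead of warning on every page. */
static void init_range(struct page *pages, unsigned long npages,
		       struct dev_pagemap *pgmap)
{
	unsigned long i;

	for (i = 0; i < npages; i++) {
		/* per-page init common to all zones would go here */
		if (pgmap) {
			pages[i].pgmap = pgmap;
			pages[i].zone_device_data = NULL;
		}
	}
}
```

The per-iteration cost then reduces to a single predictable branch, which matters when npages runs into the hundreds of millions on the large systems discussed above.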
>
>> /*
>> * Mark the block movable so that blocks are reserved for
>> * movable at startup. This will force kernel allocations
>> @@ -5951,14 +5947,6 @@ void __ref memmap_init_zone_device(struct zone *zone,
>> */
>> __SetPageReserved(page);
>>
>> - /*
>> - * ZONE_DEVICE pages union ->lru with a ->pgmap back pointer
>> - * and zone_device_data. It is a bug if a ZONE_DEVICE page is
>> - * ever freed or placed on a driver-private list.
>> - */
>> - page->pgmap = pgmap;
>> - page->zone_device_data = NULL;
>> -
>> /*
>> * Mark the block movable so that blocks are reserved for
>> * movable at startup. This will force kernel allocations
>
> Shouldn't you be removing this function instead of just gutting it?
> I'm kind of surprised you aren't getting warnings about unused code
> since you also pulled the declaration for it from the header.
>
Don't ask me what went wrong here, I guess I messed this up while rebasing.
--
Thanks,
David / dhildenb
Thread overview: 22+ messages
2019-08-29 7:00 [PATCH v3 00/11] mm/memory_hotplug: Shrink zones before removing memory David Hildenbrand
2019-08-29 7:00 ` [PATCH v3 01/11] mm/memremap: Get rid of memmap_init_zone_device() David Hildenbrand
2019-08-29 16:39 ` Alexander Duyck
2019-08-29 16:55 ` David Hildenbrand [this message]
2019-08-29 7:00 ` [PATCH v3 02/11] mm/memory_hotplug: Simplify shrink_pgdat_span() David Hildenbrand
2019-08-29 7:00 ` [PATCH v3 03/11] mm/memory_hotplug: We always have a zone in find_(smallest|biggest)_section_pfn David Hildenbrand
2019-08-29 7:00 ` [PATCH v3 04/11] mm/memory_hotplug: Drop local variables in shrink_zone_span() David Hildenbrand
2019-08-29 7:00 ` [PATCH v3 05/11] mm/memory_hotplug: Optimize zone shrinking code when checking for holes David Hildenbrand
2019-08-29 7:00 ` [PATCH v3 06/11] mm/memory_hotplug: Fix crashes in shrink_zone_span() David Hildenbrand
2019-08-29 7:00 ` [PATCH v3 07/11] mm/memory_hotplug: Exit early in __remove_pages() on BUGs David Hildenbrand
2019-08-29 7:00 ` [PATCH v3 08/11] mm: Exit early in set_zone_contiguous() if already contiguous David Hildenbrand
2019-08-29 7:00 ` [PATCH v3 09/11] mm/memory_hotplug: Remove pages from a zone before removing memory David Hildenbrand
2019-08-29 7:00 ` [PATCH v3 10/11] mm/memory_hotplug: Remove zone parameter from __remove_pages() David Hildenbrand
2019-08-29 7:00 ` [PATCH v3 11/11] mm/memory_hotplug: Cleanup __remove_pages() David Hildenbrand
2019-08-29 8:23 ` [PATCH v3 00/11] mm/memory_hotplug: Shrink zones before removing memory Michal Hocko
2019-08-29 11:33 ` David Hildenbrand
2019-08-29 11:43 ` David Hildenbrand
2019-08-29 12:08 ` David Hildenbrand
2019-08-29 12:15 ` Michal Hocko
2019-08-29 12:29 ` David Hildenbrand
2019-08-29 15:19 ` Michal Hocko
2019-08-29 15:28 ` David Hildenbrand