From: Yajun Deng <yajun.deng@linux.dev>
To: David Hildenbrand <david@redhat.com>, Mike Rapoport <rppt@kernel.org>
Cc: akpm@linux-foundation.org, mike.kravetz@oracle.com,
muchun.song@linux.dev, willy@infradead.org, linux-mm@kvack.org,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH v4 2/2] mm: Init page count in reserve_bootmem_region when MEMINIT_EARLY
Date: Thu, 12 Oct 2023 17:53:22 +0800 [thread overview]
Message-ID: <38cd0cb9-efe9-b98a-2768-ccb48da8b812@linux.dev> (raw)
In-Reply-To: <bf7143f4-9d50-cfc4-0ef6-d312a2cc896b@redhat.com>
On 2023/10/12 17:23, David Hildenbrand wrote:
> On 10.10.23 04:31, Yajun Deng wrote:
>>
>> On 2023/10/8 16:57, Yajun Deng wrote:
>>>
>>> On 2023/10/2 16:30, David Hildenbrand wrote:
>>>> On 29.09.23 10:30, Mike Rapoport wrote:
>>>>> On Thu, Sep 28, 2023 at 04:33:02PM +0800, Yajun Deng wrote:
>>>>>> memmap_init_range() would init page count of all pages, but the free
>>>>>> pages count would be reset in __free_pages_core(). There are
>>>>>> opposite
>>>>>> operations. It's unnecessary and time-consuming when it's
>>>>>> MEMINIT_EARLY
>>>>>> context.
>>>>>>
>>>>>> Init page count in reserve_bootmem_region when in MEMINIT_EARLY
>>>>>> context,
>>>>>> and check the page count before reset it.
>>>>>>
>>>>>> At the same time, the INIT_LIST_HEAD in reserve_bootmem_region isn't
>>>>>> need, as it already done in __init_single_page.
>>>>>>
>>>>>> The following data was tested on an x86 machine with 190GB of RAM.
>>>>>>
>>>>>> before:
>>>>>> free_low_memory_core_early() 341ms
>>>>>>
>>>>>> after:
>>>>>> free_low_memory_core_early() 285ms
>>>>>>
>>>>>> Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
>>>>>> ---
>>>>>> v4: same with v2.
>>>>>> v3: same with v2.
>>>>>> v2: check page count instead of check context before reset it.
>>>>>> v1:
>>>>>> https://lore.kernel.org/all/20230922070923.355656-1-yajun.deng@linux.dev/
>>>>>>
>>>>>> ---
>>>>>> mm/mm_init.c | 18 +++++++++++++-----
>>>>>> mm/page_alloc.c | 20 ++++++++++++--------
>>>>>> 2 files changed, 25 insertions(+), 13 deletions(-)
>>>>>>
>>>>>> diff --git a/mm/mm_init.c b/mm/mm_init.c
>>>>>> index 9716c8a7ade9..3ab8861e1ef3 100644
>>>>>> --- a/mm/mm_init.c
>>>>>> +++ b/mm/mm_init.c
>>>>>> @@ -718,7 +718,7 @@ static void __meminit
>>>>>> init_reserved_page(unsigned long pfn, int nid)
>>>>>> if (zone_spans_pfn(zone, pfn))
>>>>>> break;
>>>>>> }
>>>>>> - __init_single_page(pfn_to_page(pfn), pfn, zid, nid,
>>>>>> INIT_PAGE_COUNT);
>>>>>> + __init_single_page(pfn_to_page(pfn), pfn, zid, nid, 0);
>>>>>> }
>>>>>> #else
>>>>>> static inline void pgdat_set_deferred_range(pg_data_t *pgdat) {}
>>>>>> @@ -756,8 +756,8 @@ void __meminit
>>>>>> reserve_bootmem_region(phys_addr_t start,
>>>>>> init_reserved_page(start_pfn, nid);
>>>>>> - /* Avoid false-positive PageTail() */
>>>>>> - INIT_LIST_HEAD(&page->lru);
>>>>>> + /* Init page count for reserved region */
>>>>>
>>>>> Please add a comment that describes _why_ we initialize the page
>>>>> count here.
>>>>>
>>>>>> + init_page_count(page);
>>>>>> /*
>>>>>> * no need for atomic set_bit because the struct
>>>>>> @@ -888,9 +888,17 @@ void __meminit memmap_init_range(unsigned long
>>>>>> size, int nid, unsigned long zone
>>>>>> }
>>>>>> page = pfn_to_page(pfn);
>>>>>> - __init_single_page(page, pfn, zone, nid, INIT_PAGE_COUNT);
>>>>>> - if (context == MEMINIT_HOTPLUG)
>>>>>> +
>>>>>> + /* If the context is MEMINIT_EARLY, we will init page
>>>>>> count and
>>>>>> + * mark page reserved in reserve_bootmem_region, the free
>>>>>> region
>>>>>> + * wouldn't have page count and we will check the pages
>>>>>> count
>>>>>> + * in __free_pages_core.
>>>>>> + */
>>>>>> + __init_single_page(page, pfn, zone, nid, 0);
>>>>>> + if (context == MEMINIT_HOTPLUG) {
>>>>>> + init_page_count(page);
>>>>>> __SetPageReserved(page);
>>>>>
>>>>> Rather than calling init_page_count() and __SetPageReserved() for
>>>>> MEMINIT_HOTPLUG you can set flags to INIT_PAGE_COUNT |
>>>>> INIT_PAGE_RESERVED
>>>>> an call __init_single_page() after the check for MEMINIT_HOTPLUG.
>>>>>
>>>>> But more generally, I wonder if we have to differentiate HOTPLUG
>>>>> here at all.
>>>>> @David, can you comment please?
>>>>
>>>> There are a lot of details to that, and I'll share some I can briefly
>>>> think of.
>>>>
>>>> 1) __SetPageReserved()
>>>>
>>>> I tried removing that a while ago, but there was a blocker (IIRC
>>>> something about
>>>> ZONE_DEVICE). I still have the patches at [1] and I could probably
>>>> take a look
>>>> if that blocker still exists (I recall that something changed at some
>>>> point, but
>>>> I never had the time to follow up).
>>>>
>>>> But once we stop setting the pages reserved, we might run into issues
>>>> with ...
>>>>
>>>>
>>>> 2) init_page_count()
>>>>
>>>> virtio-mem, XEN balloon and HV-balloon add memory blocks that can
>>>> contain holes.
>>>> set_online_page_callback() is used to intercept memory onlining and
>>>> to expose
>>>> only the pages that are not holes to the buddy: calling
>>>> generic_online_page() on !hole.
>>>>
>>>> Holes are PageReserved but with an initialized page count. Memory
>>>> offlining will fail on
>>>> PageReserved pages -- has_unmovable_pages().
>>>>
>>>>
>>>> At least virtio-mem clears the PageReserved flag of holes when
>>>> onlining memory,
>>>> and currently relies in the page count to be reasonable (so memory
>>>> offlining can work).
>>>>
>>>> static void virtio_mem_set_fake_offline(unsigned long pfn,
>>>> unsigned long nr_pages, bool onlined)
>>>> {
>>>> page_offline_begin();
>>>> for (; nr_pages--; pfn++) {
>>>> struct page *page = pfn_to_page(pfn);
>>>>
>>>> __SetPageOffline(page);
>>>> if (!onlined) {
>>>> SetPageDirty(page);
>>>> /* FIXME: remove after cleanups */
>>>> ClearPageReserved(page);
>>>> }
>>>> }
>>>> page_offline_end();
>>>> }
>>>>
>>>>
>>>> For virtio-mem, we could initialize the page count there instead. The
>>>> other PV drivers
>>>> might require a bit more thought.
>>>>
>>>>
>>>> [1]
>>>> https://github.com/davidhildenbrand/linux/tree/online_reserved_cleanup
>>>>
>>>>>
>>>>>> + }
>>>>>> /*
>>>>>> * Usually, we want to mark the pageblock
>>>>>> MIGRATE_MOVABLE,
>>>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>>>> index 06be8821d833..b868caabe8dc 100644
>>>>>> --- a/mm/page_alloc.c
>>>>>> +++ b/mm/page_alloc.c
>>>>>> @@ -1285,18 +1285,22 @@ void __free_pages_core(struct page *page,
>>>>>> unsigned int order)
>>>>>> unsigned int loop;
>>>>>> /*
>>>>>> - * When initializing the memmap, __init_single_page() sets the
>>>>>> refcount
>>>>>> - * of all pages to 1 ("allocated"/"not free"). We have to
>>>>>> set the
>>>>>> - * refcount of all involved pages to 0.
>>>>>> + * When initializing the memmap, memmap_init_range sets the
>>>>>> refcount
>>>>>> + * of all pages to 1 ("reserved" and "free") in hotplug
>>>>>> context. We
>>>>>> + * have to set the refcount of all involved pages to 0.
>>>>>> Otherwise,
>>>>>> + * we don't do it, as reserve_bootmem_region only set the
>>>>>> refcount on
>>>>>> + * reserve region ("reserved") in early context.
>>>>>> */
>>>>>
>>>>> Again, why hotplug and early init should be different?
>>>>>
>>>>>> - prefetchw(p);
>>>>>> - for (loop = 0; loop < (nr_pages - 1); loop++, p++) {
>>>>>> - prefetchw(p + 1);
>>>>>> + if (page_count(page)) {
>>>>>> + prefetchw(p);
>>>>>> + for (loop = 0; loop < (nr_pages - 1); loop++, p++) {
>>>>>> + prefetchw(p + 1);
>>>>>> + __ClearPageReserved(p);
>>>>>> + set_page_count(p, 0);
>>>>>> + }
>>>>>> __ClearPageReserved(p);
>>>>>> set_page_count(p, 0);
>>>>
>>>> That looks wrong. if the page count would by pure luck be 0 already
>>>> for hotplugged memory,
>>>> you wouldn't clear the reserved flag.
>>>>
>>>> These changes make me a bit nervous.
>>>
>>>
>>> Is 'if (page_count(page) || PageReserved(page))' be safer? Or do I
>>> need to do something else?
>>>
>>
>> How about the following if statement? But it needs to add more patch
>> like v1 ([PATCH 2/4] mm: Introduce MEMINIT_LATE context).
>>
>> It'll be safer, but more complex. Please comment...
>>
>> if (context != MEMINIT_EARLY || (page_count(page) ||
>> PageReserved(page)) {
>>
>
> Ideally we could make initialization only depend on the context, and
> not check for count or the reserved flag.
>
This link is v1,
https://lore.kernel.org/all/20230922070923.355656-1-yajun.deng@linux.dev/
If we could make initialization only depend on the context, I'll modify
it based on v1.
@Mike, By the way, this code will cost more time:
if (context == MEMINIT_HOTPLUG)
flags = INIT_PAGE_COUNT | INIT_PAGE_RESERVED;
__init_single_page(page, pfn, zone, nid, flags);
[ 0.014999] On node 0, zone DMA32: 31679 pages in unavailable ranges
[ 0.311560] ACPI: PM-Timer IO Port: 0x508
This code will cost less time:
__init_single_page(page, pfn, zone, nid, 0);
if (context == MEMINIT_HOTPLUG) {
init_page_count(page);
__SetPageReserved(page);
[ 0.014299] On node 0, zone DMA32: 31679 pages in unavailable ranges
[ 0.250223] ACPI: PM-Timer IO Port: 0x508
next prev parent reply other threads:[~2023-10-12 9:53 UTC|newest]
Thread overview: 34+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-09-28 8:33 [PATCH v4 0/2] mm: Don't set and reset page count in MEMINIT_EARLY Yajun Deng
2023-09-28 8:33 ` [PATCH v4 1/2] mm: pass page count and reserved to __init_single_page Yajun Deng
2023-09-29 8:19 ` Mike Rapoport
2023-09-29 9:37 ` Yajun Deng
2023-09-28 8:33 ` [PATCH v4 2/2] mm: Init page count in reserve_bootmem_region when MEMINIT_EARLY Yajun Deng
2023-09-29 8:30 ` Mike Rapoport
2023-09-29 9:50 ` Yajun Deng
2023-09-29 10:02 ` Mike Rapoport
2023-09-29 10:27 ` Yajun Deng
2023-10-01 18:59 ` Mike Rapoport
2023-10-02 7:03 ` Yajun Deng
2023-10-02 8:47 ` Mike Rapoport
2023-10-02 8:56 ` David Hildenbrand
2023-10-02 11:10 ` Mike Rapoport
2023-10-02 11:25 ` David Hildenbrand
2023-10-03 14:38 ` Yajun Deng
2023-10-05 5:06 ` Mike Rapoport
2023-10-05 14:04 ` Yajun Deng
2023-10-12 9:19 ` Mike Rapoport
2023-10-12 9:36 ` Yajun Deng
2023-10-02 8:30 ` David Hildenbrand
2023-10-08 8:57 ` Yajun Deng
2023-10-10 2:31 ` Yajun Deng
2023-10-12 9:23 ` David Hildenbrand
2023-10-12 9:53 ` Yajun Deng [this message]
2023-10-13 8:48 ` Mike Rapoport
2023-10-13 9:29 ` Yajun Deng
2023-10-16 6:33 ` Mike Rapoport
2023-10-16 8:10 ` Yajun Deng
2023-10-16 8:16 ` David Hildenbrand
2023-10-16 8:32 ` Yajun Deng
2023-10-16 8:36 ` David Hildenbrand
2023-10-16 10:17 ` Yajun Deng
2023-10-17 9:58 ` Yajun Deng
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=38cd0cb9-efe9-b98a-2768-ccb48da8b812@linux.dev \
--to=yajun.deng@linux.dev \
--cc=akpm@linux-foundation.org \
--cc=david@redhat.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mike.kravetz@oracle.com \
--cc=muchun.song@linux.dev \
--cc=rppt@kernel.org \
--cc=willy@infradead.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox