From: David Hildenbrand
Organization: Red Hat GmbH
To: Oscar Salvador
Cc: Andrew Morton, Michal Hocko, Anshuman Khandual, Vlastimil Babka,
 Pavel Tatashin, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v4 1/5] mm,memory_hotplug: Allocate memmap from the added memory range
Date: Tue, 16 Mar 2021 17:46:05 +0100
In-Reply-To: <20210315102224.GA24699@linux>
References: <20210309175546.5877-1-osalvador@suse.de>
 <20210309175546.5877-2-osalvador@suse.de>
 <20210315102224.GA24699@linux>

On 15.03.21 11:22, Oscar Salvador wrote:
> On Thu, Mar 11, 2021 at 08:06:53PM +0100, David Hildenbrand wrote:
>> This looks essentially good to me, except some parts in
>> mhp_supports_memmap_on_memory()
>>
>>> +bool mhp_supports_memmap_on_memory(unsigned long size)
>>> +{
>>> +	unsigned long pageblock_size = PFN_PHYS(pageblock_nr_pages);
>>> +	unsigned long remaining_mem = size - PMD_SIZE;
>
> Hi David, thanks for the review!
>
>> This looks weird. I think what you want to test is that
>>
>> a) "nr_vmemmap_pages * sizeof(struct page)" spans complete PMDs (IOW, we
>> won't map too much via the altmap when populating a PMD in the vmemmap)
>>
>> b) "remaining = size - nr_vmemmap_pages * sizeof(struct page)" spans
>> complete pageblocks.
>
> We do not know the nr_vmemmap_pages at this point in time, although it is
> easy to pre-compute.
>
> For every section we populate, we use PMD_SIZE. So, PMD_SIZE / PAGE_SIZE gives
> the nr_vmemmap_pages that are used for populating a single section.

I find that cross-reference to the vmemmap code a little hard to digest.
I would have assumed that we don't have to care about PMDs in this code
here at all; the vmemmap population code should handle that.

I think I already mentioned it somewhere; I think it should be like this:

a) vmemmap code should *never* populate more memory than requested for a
single memory section when we are populating from the altmap. If that
cannot be guaranteed for PMDs, then we have to fall back to populating
base pages. Populating PMDs from an altmap with
sizeof(struct page) != 64 is highly dangerous.

Assume we have sizeof(struct page) == 56. A 128 MiB section spans
32768 pages - we need 32768 * sizeof(struct page) of space for the
vmemmap. With a 64-byte struct page we *can* use exactly one PMD; with a
56-byte struct page we need 448 individual (full!) base pages for the
vmemmap.

IOW, we don't care how vmemmap code will do the mapping. vmemmap code
has to get it right. IMHO, asserting it in this code is wrong.

b) In this code, we really should only care about what memory
onlining/offlining code can or can't do. We really only care that

1) size == memory_block_size_bytes()
2) remaining_size
3) IS_ALIGNED(remaining_size, pageblock_size);

I think a) is a logical consequence of b) on x86-64, where the pageblock
size corresponds to a PMD, so we might not have to care about a) right
now. See below for my take.
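
To make the struct-page arithmetic above easy to verify, here is a minimal
user-space sketch (not kernel code; the 128 MiB section size, 4 KiB base
page and 2 MiB PMD are assumed x86-64 values) that prints the vmemmap
footprint of one section for a 64-byte and a 56-byte struct page:

#include <stdio.h>

#define SECTION_SIZE	(128UL << 20)	/* assumed 128 MiB memory section */
#define BASE_PAGE_SIZE	(4UL << 10)	/* assumed 4 KiB base page */
#define PMD_MAP_SIZE	(2UL << 20)	/* assumed 2 MiB PMD mapping */

static void vmemmap_footprint(unsigned long sizeof_struct_page)
{
	/* one struct page per base page in the section */
	unsigned long nr_pages = SECTION_SIZE / BASE_PAGE_SIZE;	/* 32768 */
	unsigned long vmemmap_size = nr_pages * sizeof_struct_page;

	printf("struct page == %lu bytes: vmemmap needs %lu KiB (%lu base pages), %s\n",
	       sizeof_struct_page, vmemmap_size >> 10,
	       vmemmap_size / BASE_PAGE_SIZE,
	       (vmemmap_size % PMD_MAP_SIZE) ? "does NOT fill whole PMDs"
					     : "fills whole PMDs exactly");
}

int main(void)
{
	vmemmap_footprint(64);	/* 2048 KiB -> exactly one 2 MiB PMD */
	vmemmap_footprint(56);	/* 1792 KiB -> 448 base pages, no full PMD */
	return 0;
}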
>
> But let me explain the reasoning I use in the current code:
>
> I will enumerate the assumptions that must hold true in order to support the
> feature, together with their checks:
>
> - We span a single memory block
>
>   size == memory_block_size_bytes()
>
> - The vmemmap pages span a complete PMD and no more than a PMD.
>
>   !(PMD_SIZE % sizeof(struct page))
>
> - The vmemmap pages and the pages exposed to the buddy have to cover full
>   pageblocks
>
>   remaining_mem = size - PMD_SIZE;
>   IS_ALIGNED(remaining_mem, pageblock_size)
>
>   Although this check only covers the range without the vmemmap pages, one could
>   argue that since we use only a PMD_SIZE at a time, we know that PMD_SIZE is
>   pageblock aligned, so the vmemmap range is PMD_SIZE as well.
>
> Now, I see how this might be confusing and rather incomplete.
> So I guess a better and more clear way to write it would be:
>
> bool mhp_supports_memmap_on_memory(unsigned long size)
> {
>         unsigned long nr_vmemmap_pages = PMD_SIZE / PAGE_SIZE;
>         unsigned long vmemmap_size = nr_vmemmap_pages * sizeof(struct page);
>         unsigned long remaining_size = size - vmemmap_size;
>
>         return memmap_on_memory &&
>                IS_ENABLED(CONFIG_MHP_MEMMAP_ON_MEMORY) &&
>                size == memory_block_size_bytes() &&
>                !(PMD_SIZE % vmemmap_size) &&
>                IS_ALIGNED(vmemmap_size, pageblock_size) &&
>                remaining_size &&
>                IS_ALIGNED(remaining_size, pageblock_size);
> }
>
> Note that the above check is only for a single section, but if the assumptions
> hold true for a single section, they will for many as well.

Okay, please document the statement about single sections, that's
important to understand what's happening.

My take would be

bool mhp_supports_memmap_on_memory(unsigned long size)
{
	/*
	 * Note: We calculate for a single memory section. The calculation
	 * implicitly covers memory blocks that span multiple sections.
	 */
	unsigned long nr_vmemmap_pages = SECTION_SIZE / PAGE_SIZE;
	unsigned long vmemmap_size = nr_vmemmap_pages * sizeof(struct page);
	unsigned long remaining_size = size - vmemmap_size;

	/*
	 * Note: vmemmap code has to make sure to not populate more memory
	 * via the altmap than absolutely necessary for a single section.
	 * This always holds when we allocate pageblock-sized chunks from
	 * the altmap, as we require pageblock alignment here.
	 *
	 * TODO, other doc
	 */
	return memmap_on_memory &&
	       IS_ENABLED(CONFIG_MHP_MEMMAP_ON_MEMORY) &&
	       size == memory_block_size_bytes() &&
	       remaining_size &&
	       IS_ALIGNED(remaining_size, pageblock_size);
}

For arm64 with 64k base pages it might not hold, and we might have to fix
the vmemmap code to not over-populate PMDs (512 MiB). I did not have a
look at the code, though.

>
>> I suggest a restructuring, compressing the information like:
>>
>> "
>> Besides having arch support and the feature enabled at runtime, we need a
>> few more assumptions to hold true:
>>
>> a) We span a single memory block: memory onlining/offlining happens in
>> memory block granularity. We don't want the vmemmap of online memory blocks
>> to reside on offline memory blocks. In the future, we might want to support
>> variable-sized memory blocks to make the feature more versatile.
>>
>> b) The vmemmap pages span complete PMDs: We don't want vmemmap code to
>> populate memory from the altmap for unrelated parts (i.e., other memory
>> blocks).
>>
>> c) The vmemmap pages (and thereby the pages that will be exposed to the
>> buddy) have to cover full pageblocks: memory onlining/offlining code
>> requires applicable ranges to be pageblock-aligned, for example, to set
>> the migratetypes properly.
>> "
>
> I am fine with the above, I already added it, thanks.
>
>> Do we have to special case / protect against the vmemmap optimization for
>> hugetlb pages? Or is that already blocked somehow and I missed it?
>
> Yes, the hugetlb-vmemmap feature disables vmemmap PMD population [1]
>
> [1] https://patchwork.kernel.org/project/linux-mm/patch/20210308102807.59745-7-songmuchun@bytedance.com/

Sorry, I might be missing something important. How does that make both
features run mutually exclusive?

hugetlb-vmemmap only instructs vmemmap code to populate base pages in the
!altmap case. I assume that if hugetlb-vmemmap code stumbles over a PMD it
will skip the optimization.

But what if your machine does not have boot_cpu_has(X86_FEATURE_PSE) --
IOW, you always populate base pages? Either hugetlb-vmemmap code would
have to identify that these pages come from an altmap (how?
PageReserved() is also used for boot memory), or both features have to
somehow run mutually exclusive.

-- 
Thanks,

David / dhildenb