From: David Hildenbrand <david@redhat.com>
Organization: Red Hat
To: Zi Yan, Michal Hocko
Cc: Oscar Salvador, Michael Ellerman, Benjamin Herrenschmidt,
 Thomas Gleixner, x86@kernel.org, Andy Lutomirski, "Rafael J. Wysocki",
 Andrew Morton, Mike Rapoport, Anshuman Khandual, Dan Williams,
 Wei Yang, linux-ia64@vger.kernel.org, linux-kernel@vger.kernel.org,
 linuxppc-dev@lists.ozlabs.org, linux-mm@kvack.org
Subject: Re: [RFC PATCH 0/7] Memory hotplug/hotremove at subsection size
Date: Mon, 14 Jun 2021 13:32:25 +0200
Message-ID: <640bd1da-4bcb-cfda-18c0-da0ddb90b661@redhat.com>
In-Reply-To: <289DA3C0-9AE5-4992-A35A-C13FCE4D8544@nvidia.com>

On 02.06.21 17:56, Zi Yan wrote:
> On 10 May 2021, at 10:36, Zi Yan wrote:
>
>> On 7 May 2021, at 10:00, David Hildenbrand wrote:
>>
>>> On 07.05.21 13:55, Michal Hocko wrote:
>>>> [I haven't read through the respective patches due to lack of time,
>>>> but let me comment on the general idea and the underlying
>>>> justification]
>>>>
>>>> On Thu 06-05-21 17:31:09, David Hildenbrand wrote:
>>>>> On 06.05.21 17:26, Zi Yan wrote:
>>>>>> From: Zi Yan
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> This patchset tries to remove the restriction on memory
>>>>>> hotplug/hotremove granularity, which is always greater than or
>>>>>> equal to the memory section size [1]. With the patchset, the kernel
>>>>>> is able to online/offline memory at a size independent of the
>>>>>> memory section size, as small as 2MB (the subsection size).
>>>>>
>>>>> ... which doesn't make any sense, as we can only online/offline
>>>>> whole memory block devices.
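
(For context: the granularity user space can actually online/offline is 
the memory block size exposed via sysfs, which is at least the section 
size. A minimal userspace sketch that reads it -- assuming a kernel with 
memory hotplug enabled; error handling kept simple:

	/* Print the memory block size -- the real online/offline
	 * granularity user space sees. */
	#include <stdio.h>

	int main(void)
	{
		FILE *f = fopen("/sys/devices/system/memory/block_size_bytes", "r");
		unsigned long long block_size;

		if (!f) {
			perror("fopen");	/* kernel without memory hotplug */
			return 1;
		}
		if (fscanf(f, "%llx", &block_size) != 1) {	/* value is in hex */
			fclose(f);
			return 1;
		}
		fclose(f);
		printf("memory block size: %llu MiB\n", block_size >> 20);
		/* Individual blocks are onlined/offlined through
		 * /sys/devices/system/memory/memoryN/state. */
		return 0;
	}

On x86-64 this typically reports 128 MiB or more, i.e. well above both 
the section size and the 2MB subsection size.)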
>>>>
>>>> Agreed. The subsection thingy is just a hack to work around pmem
>>>> alignment problems. For real memory hotplug it is quite hard to
>>>> argue for reasonable hotplug scenarios for very small physical
>>>> memory ranges wrt. the existing sparsemem memory model.
>>>>
>>>>>> The motivation is to increase MAX_ORDER of the buddy allocator and
>>>>>> pageblock size without increasing memory hotplug/hotremove
>>>>>> granularity at the same time,
>>>>>
>>>>> Gah, no. Please no. No.
>>>>
>>>> Agreed. Those are completely independent concepts. MAX_ORDER can be
>>>> really arbitrary irrespective of the section size with the vmemmap
>>>> sparse model. The existing restriction is due to the old sparse
>>>> model not being able to do page pointer arithmetic across memory
>>>> sections. Is there any reason to stick with that memory model for an
>>>> advanced feature you are working on?
>>
>> No. I just want to increase MAX_ORDER. If the existing restriction can
>> be removed, that will be great.
>>
>>>
>>> I gave it some more thought yesterday. I guess the first thing we
>>> should look into is increasing MAX_ORDER and leaving pageblock_order
>>> and section size as is -- finding out what we have to tweak to get
>>> that up and running. Once we have that in place, we can actually look
>>> into better fragmentation avoidance etc. One step at a time.
>>
>> It makes sense to me.
>>
>>>
>>> Because that change itself might require some thought. Requiring that
>>> a bigger MAX_ORDER depends on SPARSE_VMEMMAP is something reasonable
>>> to do.
>>
>> OK, if with SPARSE_VMEMMAP MAX_ORDER can be set bigger than
>> SECTION_SIZE, that is perfectly OK with me. 1GB THP support, which I
>> ultimately want to add, will require SPARSE_VMEMMAP too (otherwise,
>> every page++ would need to be changed to nth_page(page, 1)).
>>
>>>
>>> As stated somewhere here already, we'll have to look into making
>>> alloc_contig_range() (and its main users CMA and virtio-mem)
>>> independent of MAX_ORDER and mainly rely on pageblock_order. The
>>> current handling in alloc_contig_range() is far from optimal, as we
>>> have to isolate a whole MAX_ORDER - 1 page -- and on ZONE_NORMAL
>>> we'll fail easily if any part contains something unmovable, although
>>> we don't even want to allocate that part. I actually have that on my
>>> list (to be able to fully support pageblock_order instead of
>>> MAX_ORDER - 1 chunks in virtio-mem), however I didn't have time to
>>> look into it.
>>
>> So in your mind, for gigantic page allocation (> MAX_ORDER),
>> alloc_contig_range() should be used instead of the buddy allocator,
>> while pageblock_order is kept at a small granularity like 2MB. Is that
>> the case? Isn't it going to have a high failure rate when any of the
>> pageblocks within a gigantic page range (like 1GB) becomes unmovable?
>> Are you thinking of an additional mechanism/policy to prevent that
>> from happening, as an additional step for gigantic page allocation?
>> Like your ZONE_PREFER_MOVABLE idea?
>>
>>>
>>> Further, page onlining / offlining code and early init code most
>>> probably also need care if MAX_ORDER - 1 crosses sections. Memory
>>> holes we might suddenly have in MAX_ORDER - 1 pages might become a
>>> problem and will have to be handled. Not sure which other code has to
>>> be tweaked (compaction? page isolation?).
>>
>> Can you elaborate a little more? From what I understand, memory holes
>> mean valid PFNs are not contiguous before and after a hole, so pfn++
>> will not work, but struct pages are still virtually contiguous
>> assuming SPARSE_VMEMMAP, meaning page++ would still work. So when
>> MAX_ORDER - 1 crosses sections, additional code would be needed
>> instead of a simple pfn++. Is there anything I am missing?
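
(A toy userspace model of that struct page layout difference -- purely 
illustrative, not actual kernel code; the sizes and names below are made 
up:

	/* Why page++ works with SPARSE_VMEMMAP but not with classic
	 * sparsemem. */
	#include <stdio.h>
	#include <stdlib.h>

	struct page { unsigned long flags; };

	#define PAGES_PER_SECTION 4	/* tiny, for illustration only */
	#define NR_SECTIONS       3

	int main(void)
	{
		/* Classic sparsemem: one separately allocated mem_map per
		 * section. struct pages of adjacent sections are NOT
		 * virtually contiguous, so "page + n" is only valid within
		 * a single section; crossing a section boundary requires
		 * going through pfn arithmetic. */
		struct page *section_mem_map[NR_SECTIONS];
		int i;

		for (i = 0; i < NR_SECTIONS; i++)
			section_mem_map[i] = calloc(PAGES_PER_SECTION,
						    sizeof(struct page));

		/* SPARSE_VMEMMAP: one virtually contiguous array covering
		 * all sections, so plain pointer arithmetic stays valid
		 * across section boundaries -- which is what MAX_ORDER - 1
		 * pages spanning sections rely on. */
		struct page *vmemmap = calloc(NR_SECTIONS * PAGES_PER_SECTION,
					      sizeof(struct page));

		unsigned long pfn = PAGES_PER_SECTION - 1; /* last pfn of section 0 */
		struct page *p = &vmemmap[pfn];

		p++;	/* fine: first page of section 1 */
		printf("vmemmap: pfn %lu + 1 -> struct page index %ld\n",
		       pfn, (long)(p - vmemmap));

		/* With per-section maps, the same step has to recompute
		 * which map to use:
		 * p = &section_mem_map[(pfn + 1) / PAGES_PER_SECTION]
		 *                     [(pfn + 1) % PAGES_PER_SECTION]; */
		for (i = 0; i < NR_SECTIONS; i++)
			free(section_mem_map[i]);
		free(vmemmap);
		return 0;
	}

This is also essentially what nth_page() hides: it converts to a pfn and 
back instead of doing raw pointer arithmetic.)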
>>
>> BTW, to test a system with memory holes, do you know if there is an
>> easy way of adding random memory holes to an x86_64 VM, which could
>> help reveal potential missing pieces in the code? Changing the
>> BIOS-e820 table might be one way, but I have no idea how to do that
>> on QEMU.
>>
>>>
>>> Figuring out what needs care itself might take quite some effort.
>>>
>>> One thing I was thinking about as well: the bigger our MAX_ORDER, the
>>> slower it could be to allocate smaller pages. If we have 1G pages,
>>> splitting them down to 4k takes 8 additional steps if I'm not wrong
>>> (a 1G page is order 18, so 18 splits to reach 4k instead of the 10
>>> needed from order MAX_ORDER - 1 = 10 today). Of course, that's the
>>> worst case. Would be interesting to evaluate.
>>
>> Sure. I am planning to check it too. As a simple start, I am going to
>> run will-it-scale benchmarks to see if there is any performance
>> difference between different MAX_ORDERs.
>
> I ran vm-scalability and memory-related will-it-scale on a server with
> 256GB memory to see the impact of increasing MAX_ORDER and didn't see
> much difference for most of the workloads, like page_fault1,
> page_fault2, and page_fault3 from will-it-scale. But feel free to check
> the attached complete results and let me know what should be looked
> into. Thanks.

Right, for will-it-scale it looks like there are mostly minor 
differences, although I am not sure if the results are really stable 
(ranging from -6% to +6%). For vm-scalability the numbers seem to vary 
even more (e.g., a stddev of ±63%), so I have no idea how meaningful 
they are. But I guess for these benchmarks, the net change won't really 
be significant. Thanks!

-- 
Thanks,

David / dhildenb