From: David Hildenbrand <david@redhat.com>
Organization: Red Hat
To: Zi Yan
Cc: Oscar Salvador, Michael Ellerman, Benjamin Herrenschmidt,
 Thomas Gleixner, x86@kernel.org, Andy Lutomirski, "Rafael J. Wysocki",
 Andrew Morton, Mike Rapoport, Anshuman Khandual, Michal Hocko,
 Dan Williams, Wei Yang, linux-ia64@vger.kernel.org,
 linux-kernel@vger.kernel.org, linuxppc-dev@lists.ozlabs.org,
 linux-mm@kvack.org
Subject: Re: [RFC PATCH 0/7] Memory hotplug/hotremove at subsection size
Date: Thu, 6 May 2021 21:10:52 +0200
Message-ID: <0e850dcb-c69a-188b-7ab9-09e6644af3ab@redhat.com>
In-Reply-To: <3A6D54CF-76F4-4401-A434-84BEB813A65A@nvidia.com>
References: <20210506152623.178731-1-zi.yan@sent.com>
 <9D7FD316-988E-4B11-AC1C-64FF790BA79E@nvidia.com>
 <3a51f564-f3d1-c21f-93b5-1b91639523ec@redhat.com>
 <16962E62-7D1E-4E06-B832-EC91F54CC359@nvidia.com>
 <3A6D54CF-76F4-4401-A434-84BEB813A65A@nvidia.com>
>>
>> 1. Pageblock size
>>
>> There are a couple of features that rely on the pageblock size being
>> reasonably small to work as expected. One example is virtio-balloon
>> free page reporting, then there is virtio-mem (still also glued to
>> MAX_ORDER), and we have CMA (still also glued to MAX_ORDER). Most
>> probably there are more. We track movability/page isolation per
>> pageblock; it's the smallest granularity at which you can effectively
>> isolate pages or mark them as CMA (MIGRATE_ISOLATE, MIGRATE_CMA).
>> Well, and there are "ordinary" THP / huge pages most of our
>> applications use and will use, especially on smallish systems.
>>
>> Assume you bump up the pageblock order to 1G. Small VMs won't be able
>> to report any free pages to the hypervisor. You'll take the
>> "fine-grained" out of virtio-mem. Each CMA area will have to be at
>> least 1G big, which makes CMA essentially useless on smallish systems
>> (like we have on arm64 with 64k base pages -- pageblock_size is 512MB
>> and I hate it).
> 
> I understand the issue of having large pageblocks in small systems. My
> plan for this issue is to make MAX_ORDER a variable (pageblock size
> would be set according to MAX_ORDER) that can be adjusted based on
> total memory and via a boot time parameter. My apologies, since I did
> not state this clearly in my cover letter and it confused you. When we
> have a boot-time adjustable MAX_ORDER, a large pageblock like 1GB
> would only appear on systems with large memory.
> For small VMs, pageblock size would stay at 2MB, so all your concerns
> on smallish systems should go away.

I have to admit that I am not really a friend of that. I still think our
target goal should be to have gigantic THP *in addition to* ordinary
THP: use gigantic THP where enabled and possible, and just use ordinary
THP everywhere else. Having a single pageblock granularity is a real
limitation IMHO and requires us to hack the system to support it to some
degree.

> 
>> Then, imagine systems that have like 4G of main memory. By stopping
>> grouping at 2M and instead grouping at 1G, you can very easily find
>> yourself in a system where all your 4 pageblocks are unmovable and you
>> essentially don't optimize for huge pages in that environment any
>> more.
>>
>> Long story short: we need a different mechanism on top and shall leave
>> the pageblock size untouched; it's too tightly integrated with page
>> isolation, ordinary THP, and CMA.
> 
> I think it is better to make the pageblock size adjustable based on the
> total memory of a system. It is not reasonable to have the same
> pageblock size across systems with memory sizes from <1GB to several
> TBs. Do you agree?

I suggest an additional mechanism on top. Please bear in mind that
ordinary THP will most probably still be the default for 99.9% of all
application/library cases, even when you have gigantic THP around.

>> 2. Section size
>>
>> I assume the only reason you want to touch that is because
>> pageblock_size <= section_size, and I guess that's one of the reasons
>> I dislike it so much. Messing with the section size really only makes
>> sense when we want to manage metadata for larger granularity within a
>> section.
> 
> Perhaps it is worth checking if it is feasible to make pageblock_size
> > section_size, so we can still have small sections when
> pageblock_size is large.
> One potential issue with that is that when PFNs are discontinuous at a
> section boundary, we might end up with a partial pageblock when
> pageblock_size is big. I guess supporting partial pageblocks (or
> different pageblock sizes, like you mentioned below) would be the
> right solution.
> 
>> We allocate metadata per section. We mark whole sections
>> early/online/present/.... Yes, in case of vmemmap, we manage the
>> memmap in smaller granularity using the sub-section map, some kind of
>> hack to support some ZONE_DEVICE cases better.
>>
>> Let's assume we introduce something new like "gigapage_order",
>> corresponding to 1G. We could either decide to squeeze the metadata
>> into sections, having to increase the section size, or manage that
>> metadata differently.
>>
>> Managing it differently certainly makes the necessary changes easier.
>> Instead of adding more hacks into sections, rather manage that
>> metadata at a different place / in a different way.
> 
> Can you elaborate on managing it differently?

Let's keep it simple. Assume you track MOVABLE vs. !movable per 1G
gigapageblock, in addition to existing pageblocks. A 64 TB system would
have 64*1024 gigapageblocks. One bit per gigapageblock would require 8k,
a.k.a. 2 pages. If you need more states, it would maybe double. No need
to manage that using sparse memory sections IMHO. Just allocate 2/4
pages during boot for the bitmap.

> 
>> See [1] for an alternative. Not necessarily what I would dream of,
>> but just to showcase that there might be alternatives for grouping
>> pages.
> 
> I saw this patch too. It is an interesting idea to separate different
> allocation orders into different regions, but it would not work for
> gigantic page allocations unless we have a large pageblock size to
> utilize existing anti-fragmentation mechanisms.

Right, any anti-fragmentation mechanism on top.

>> 3. Grouping pages > pageblock_order
>>
>> There are other approaches that would benefit from grouping
>> at > pageblock_order and having a bigger MAX_ORDER. And that doesn't
>> necessarily mean forming gigantic pages only; we might want to group
>> in multiple granularities on a single system. Memory hot(un)plug is
>> one example, but also optimizing memory consumption by powering down
>> DIMM banks. Also, some architectures support differing huge page
>> sizes (aarch64) that could be improved without CMA. Why not have more
>> than 2 THP sizes on these systems?
>>
>> Ideally, we'd have a mechanism that tries grouping on different
>> granularities, like for every order in pageblock_order ...
>> max_pageblock_order (e.g., 1 GiB), and not only add one new level of
>> grouping (or increase the single grouping size).
> 
> I agree. In some sense, supporting partial pageblocks and increasing
> the pageblock size (e.g., to 1GB) is, at a high level, quite similar
> to having multiple pageblock sizes. But I am not sure we really want
> to support multiple pageblock sizes, since it creates pageblock
> fragmentation when we want to change the migratetype for part of a
> pageblock. This means we would break a large pageblock into small ones
> if we just want to steal a subset of pages from MOVABLE for UNMOVABLE
> allocations. Then pageblock loses its most useful anti-fragmentation
> feature. Also, it seems to be a replication of buddy allocator
> functionality when it comes to pageblock split and merge.

Let's assume for simplicity that you have a 4G machine, so at most 4
gigantic pages. The first gigantic page will be impossible either way
due to the kernel, boot time allocations, etc. So you're left with 3
gigantic pages you could use at best.

Obviously, you want to make sure that the remaining parts of the first
gigantic page are used as well as possible for ordinary huge pages, so
you would actually want to group them in 2 MiB chunks and avoid
fragmentation there.
Obviously, supporting two pageblock types natively would require core
modifications. (I'm not pushing for the idea of two pageblock orders,
just motivating why we actually want to keep grouping for ordinary
THP.)

> 
> 
> The above is really a nice discussion with you on pageblocks,
> sections, and memory hotplug/hotremove, which also helps me better
> understand the issues with increasing MAX_ORDER to enable 1GB page
> allocation.
> 
> In sum, if I get it correctly, the issues I need to address are:
> 
> 1. A large pageblock size (which is needed when we bump MAX_ORDER for
> gigantic page allocation from the buddy allocator) is not good for
> machines with small memory.
> 
> 2. The pageblock size is currently tied to the section size (which
> made me want to bump the section size).
> 
> For 1, I think making MAX_ORDER a variable that can be set based on
> total memory size and adjusted via a boot time parameter should solve
> the problem. For small machines, we will keep MAX_ORDER as small as it
> is now, like 4MB, whereas for large machines, we can increase
> MAX_ORDER to utilize gigantic pages.
> 
> For 2, supporting partial pageblocks and allowing a pageblock to cross
> multiple sections would break the tie between pageblock size and
> section size to solve the issue.
> 
> I am going to look into them. What do you think?

I am not sure that's really the right direction, as stated above.

-- 
Thanks,

David / dhildenb