From: David Hildenbrand <david@redhat.com>
Organization: Red Hat
Subject: Re: [RFC PATCH 0/7] Memory hotplug/hotremove at subsection size
Date: Wed, 12 May 2021 18:14:06 +0200
To: Zi Yan, Michal Hocko
Cc: Oscar Salvador, Michael Ellerman, Benjamin Herrenschmidt,
 Thomas Gleixner, x86@kernel.org, Andy Lutomirski, "Rafael J. Wysocki",
 Andrew Morton, Mike Rapoport, Anshuman Khandual, Dan Williams,
 Wei Yang, linux-ia64@vger.kernel.org, linux-kernel@vger.kernel.org,
 linuxppc-dev@lists.ozlabs.org, linux-mm@kvack.org
In-Reply-To: <746780E5-0288-494D-8B19-538049F1B891@nvidia.com>

>>
>> As stated somewhere here already, we'll have to look into making
>> alloc_contig_range() (and its main users, CMA and virtio-mem)
>> independent of MAX_ORDER and mainly rely on pageblock_order. The
>> current handling in alloc_contig_range() is far from optimal, as we
>> have to isolate a whole MAX_ORDER - 1 page -- and on ZONE_NORMAL
>> we'll fail easily if any part contains something unmovable, although
>> we don't even want to allocate that part. I actually have that on my
>> list (to be able to fully support pageblock_order instead of
>> MAX_ORDER - 1 chunks in virtio-mem), however I didn't have time to
>> look into it.
>
> So in your mind, for gigantic page allocation (> MAX_ORDER),
> alloc_contig_range() should be used instead of the buddy allocator,
> while pageblock_order is kept at a small granularity like 2MB. Is
> that the case? Isn't it going to have a high failure rate when any of
> the pageblocks within a gigantic page range (like 1GB) becomes
> unmovable? Are you thinking of an additional mechanism/policy to
> prevent such a thing from happening, as an additional step for
> gigantic page allocation? Like your ZONE_PREFER_MOVABLE idea?
>

I am not fully sure yet where the journey will go, I guess nobody
knows. Ultimately, having buddy support for >= current MAX_ORDER (IOW,
increasing MAX_ORDER) will most probably happen, so it would be worth
investigating what has to be done to get that running as a first step.

Of course, we could temporarily think about wiring it up in the buddy
like

	if (order < MAX_ORDER)
		__alloc_pages()...
	else
		alloc_contig_pages()

but it doesn't really improve the situation IMHO, just an API change.
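Just to spell that fallback out a bit (a rough sketch only -- the
helper name is made up, this is not actual kernel code; note that
alloc_contig_pages() only exists with CONFIG_CONTIG_ALLOC):

	/*
	 * Hypothetical helper illustrating the API-level fallback
	 * mentioned above. alloc_contig_pages() takes a number of
	 * pages (not an order) and is only built with
	 * CONFIG_CONTIG_ALLOC.
	 */
	static struct page *alloc_giant_pages(gfp_t gfp,
					      unsigned int order, int nid)
	{
		if (order < MAX_ORDER)
			return alloc_pages_node(nid, gfp, order);
		return alloc_contig_pages(1UL << order, gfp, nid, NULL);
	}

As said, that merely papers over the API difference; the underlying
isolation/migration behavior, and thus the failure mode, stays the
same.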
So I think we should look into increasing MAX_ORDER, seeing what needs
to be done to have that part running while keeping the section size
and the pageblock order as is. I know that at least memory
onlining/offlining, CMA, alloc_contig_range(), ... need tweaking,
especially when we don't increase the section size (but also if we
would, due to the way page isolation is currently handled). Having a
MAX_ORDER - 1 page partially in different nodes might be another thing
to look into (I heard that it can already happen right now, but I
don't remember the details).

The next step after that would then be better fragmentation avoidance
for larger granularity like 1G THP.

>>
>> Further, page onlining/offlining code and early init code most
>> probably also need care if MAX_ORDER - 1 crosses sections. Memory
>> holes we might suddenly have in MAX_ORDER - 1 pages might become a
>> problem and will have to be handled. Not sure which other code has
>> to be tweaked (compaction? page isolation?).
>
> Can you elaborate on it a little more? From what I understand, memory
> holes mean valid PFNs are not contiguous before and after a hole, so
> pfn++ will not work, but struct pages are still virtually contiguous
> assuming SPARSE_VMEMMAP, meaning page++ would still work. So when
> MAX_ORDER - 1 crosses sections, additional code would be needed
> instead of a simple pfn++. Is there anything I am missing?

I think there are two cases when talking about MAX_ORDER and memory
holes:

1. Hole with a valid memmap: the memmap is initialized to
PageReserved() and the pages are not given to the buddy. pfn_valid()
and pfn_to_page() work as expected.

2. Hole without a valid memmap: we have that CONFIG_HOLES_IN_ZONE
thing already, see include/linux/mmzone.h. pfn_valid_within() checks
are required. Doesn't win a beauty contest, but gets the job done in
existing setups that seem to care.

"If it is possible to have holes within a MAX_ORDER_NR_PAGES, then we
need to check pfn validity within that MAX_ORDER_NR_PAGES block.
pfn_valid_within() should be used in this case; we optimise this away
when we have no holes within a MAX_ORDER_NR_PAGES block."

CONFIG_HOLES_IN_ZONE is just a bad name for this.

(Increasing the section size implies that we waste more memory for the
memmap in holes. Increasing MAX_ORDER means that we might have to deal
with holes within MAX_ORDER chunks.)

We don't have too many pfn_valid_within() checks. I wonder if we could
add something that is optimized for "holes are a power of two and
properly aligned", because pfn_valid_within() right now deals with
holes of any kind, which makes it somewhat inefficient IIRC.
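Something along these lines, maybe (a pure sketch with made-up names,
assuming holes are naturally aligned to their power-of-two size, so
that within any aligned chunk either all pfns are valid or none are):

	/*
	 * Hypothetical optimization, not real kernel code;
	 * HOLE_CHUNK_ORDER is made up. With power-of-two sized,
	 * naturally aligned holes, one pfn_valid() check on the first
	 * pfn of an aligned chunk covers the whole chunk.
	 */
	#define HOLE_CHUNK_NR_PAGES	(1UL << HOLE_CHUNK_ORDER)

	static inline bool pfn_valid_within_aligned(unsigned long pfn)
	{
		return pfn_valid(ALIGN_DOWN(pfn, HOLE_CHUNK_NR_PAGES));
	}

A scan over a MAX_ORDER - 1 block could then skip forward a whole
chunk at once when it hits a hole, instead of testing every single
pfn.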
>
> BTW, to test a system with memory holes, do you know if there is an
> easy way of adding random memory holes to an x86_64 VM, which could
> help reveal potential missing pieces in the code? Changing the
> BIOS-e820 table might be one way, but I have no idea how to do it in
> QEMU.

It might not be very easy that way. But I heard that some arm64
systems have crazy memory layouts -- maybe there, it's easier to get
something nasty running? :)

https://lkml.kernel.org/r/YJpEwF2cGjS5mKma@kernel.org

I remember there was a way to define the e820 completely on the kernel
cmdline, but I might be wrong ...
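If it still exists, I think it was the memmap= family of parameters
(x86 only; see Documentation/admin-guide/kernel-parameters.txt --
better double-check the exact syntax before relying on it). Roughly,
with sizes/offsets made up for illustration:

	# punch a hole into the existing e820 map by marking 512M
	# starting at the 1G boundary as reserved:
	memmap=512M$0x40000000

	# or take over the e820 map completely and leave a gap
	# between two usable regions:
	memmap=exactmap memmap=1G@0 memmap=1G@4G

Note that '$' usually has to be escaped in bootloader configs (e.g.,
"memmap=512M\$0x40000000" for GRUB).

-- 
Thanks,

David / dhildenb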