From: David Hildenbrand
Organization: Red Hat
To: Vlastimil Babka, Zi Yan, linux-mm@kvack.org
Cc: Matthew Wilcox, "Kirill A. Shutemov", Mike Kravetz, Michal Hocko,
 John Hubbard, linux-kernel@vger.kernel.org, Mike Rapoport
Subject: Re: [RFC PATCH 00/15] Make MAX_ORDER adjustable as a kernel boot time parameter.
Date: Fri, 6 Aug 2021 19:08:10 +0200
Message-ID: <59c59a77-cf93-40a8-2ad5-b72d87b8815a@redhat.com>
In-Reply-To: <3417eb98-36c8-5459-c83e-52f90e42a146@suse.cz>
References: <20210805190253.2795604-1-zi.yan@sent.com>
 <40982106-0eee-4e62-7ce0-c4787b0afac4@suse.cz>
 <72b317e5-c78a-f0bc-fe69-f82261ec252e@redhat.com>
 <3417eb98-36c8-5459-c83e-52f90e42a146@suse.cz>

On 06.08.21 18:54, Vlastimil Babka wrote:
> On 8/6/21 6:16 PM, David Hildenbrand wrote:
>> On 06.08.21 17:36, Vlastimil Babka wrote:
>>> On 8/5/21 9:02 PM, Zi Yan wrote:
>>>> From: Zi Yan
>>>>
>>>> Patch 3 restores the pfn_valid_within() check when the buddy allocator can
>>>> merge pages across memory sections. The check was removed when ARM64 got
>>>> rid of holes in zones, but holes can appear in zones again after this
>>>> patchset.
>>>
>>> To me that's a most unwelcome resurrection. I kinda missed it was going away
>>> and now I can't even rejoice? I assume the systems that will be bumping
>>> max_order have a lot of memory. Are they going to have many holes? What if
>>> we just sacrificed the memory that would have a hole and didn't add it to
>>> buddy at all?
>>
>> I think the old implementation was just horrible and the description we have
>> here still suffers from that old crap: "but holes can appear in zones again".
>> No, it's not related to holes in zones at all. We can have MAX_ORDER - 1 pages
>> that are partially a hole.
>>
>> And to be precise, "hole" here means "there is no memmap" and not "there is a
>> hole but it has a valid memmap".
>
> Yes.
>
>> But IIRC, under SPARSEMEM we now always have a complete memmap for complete
>> memory sections (when talking about system RAM; ZONE_DEVICE is different, but
>> we don't really care for now I think).
>>
>> So instead of reintroducing what we had before, I think we should look into
>> something that doesn't confuse each person that stumbles over it out there.
>> What does pfn_valid_within() even mean in the new context? pfn_valid() is most
>> probably no longer what we really want, as we're dealing with multiple sections
>> that might be online or offline; in the old world, this was different, as a
>> MAX_ORDER - 1 page was completely contained in a memory section that was either
>> online or offline.
>>
>> I'd imagine something that expresses something different in the context of
>> sparsemem:
>>
>> "Some page orders, such as MAX_ORDER - 1, might span multiple memory sections.
>> Each memory section has a completely valid memmap if online. Memory sections
>> might either be completely online or completely offline. pfn_to_online_page()
>> might succeed on one part of a MAX_ORDER - 1 page, but not on another part. But
>> it will certainly be consistent within one memory section."
>>
>> Further, as we know that MAX_ORDER - 1 pages and memory sections are powers of
>> two, we can actually do a binary search to identify boundaries, instead of
>> having to check each and every page in the range.
>>
>> Is what I describe the actual reason why we introduce pfn_valid_within()? (and
>> might we better introduce something new, with a better fitting name?)
>
> What I don't like is mainly the re-addition of pfn_valid_within() (or whatever
> we'd call it) into __free_one_page() for performance reasons, and also to
> various pfn scanners (compaction) for performance and "I must not forget to
> check this, or do I?" confusion reasons. It would be really great if we could
> keep a guarantee that memmap exists for MAX_ORDER blocks. I see two ways to
> achieve that:
>
> 1. we create memmap for MAX_ORDER blocks; pages in sections that are not online
>    are marked as reserved or some other state that allows us to do checks such
>    as "is there a buddy? no" without accessing a missing memmap
> 2. smaller blocks than MAX_ORDER are not released to the buddy allocator
>
> I think 1 would be more work, but less wasteful in the end?

It will end up seriously messing with memory hot(un)plug. It's not
sufficient if there is a memmap (pfn_valid()); it has to be online
(pfn_to_online_page()) to actually have a meaning.

So you'd have to allocate a memmap for all such memory sections,
initialize it to all PG_Reserved ("memory hole"), and mark these memory
sections online. Further, you need memory block devices that are
initialized and online.

So far so good, although wasteful. What happens if someone hotplugs a
memory block that doesn't span a complete MAX_ORDER - 1 page? Broken.

The only "workaround" would be requiring that MAX_ORDER - 1 pages cannot be
bigger than memory blocks (memory_block_size_bytes()). The memory block
size determines our hot(un)plug granularity and can (on some archs
already) be determined at runtime. As both (MAX_ORDER and
memory_block_size_bytes()) would be determined at runtime, for example, by
an admin explicitly requesting it, this might be feasible.
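Roughly, I'd imagine such a check to look something like the following (only
a sketch, assuming a runtime-configurable max_order as this series proposes;
the function name and the exact hook are made up):

/*
 * Sketch only: refuse a boot-time max_order that would make
 * MAX_ORDER - 1 pages bigger than the hot(un)plug granularity.
 */
static int __init validate_boot_max_order(unsigned int max_order)
{
	unsigned long block_bytes = PAGE_SIZE << (max_order - 1);

	if (block_bytes > memory_block_size_bytes()) {
		pr_warn("max_order=%u exceeds the memory block size (%lu bytes), ignoring\n",
			max_order, memory_block_size_bytes());
		return -EINVAL;
	}
	return 0;
}

That way a (naturally aligned) MAX_ORDER - 1 page never crosses a memory
block, and hot(un)plug keeps operating on whole blocks as before.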
Memory hot(un)plug / onlining / offlining would most probably work
naturally (although the hot(un)plug granularity is then limited to, e.g.,
1 GiB memory blocks). But if that's what an admin requests on the command
line, so be it.

What might need some thought, though, is having overlapping
sections/such memory blocks with devmem. Sub-section hotadd has to
continue working unless we want to break some PMEM devices seriously.

-- 
Thanks,

David / dhildenb