Message-ID: <5ee3550d-5bbe-4223-722b-9a388f86fc21@redhat.com>
Date: Mon, 24 Jul 2023 18:24:01 +0200
From: David Hildenbrand <david@redhat.com>
To: Aneesh Kumar K V, linux-mm@kvack.org, akpm@linux-foundation.org,
 mpe@ellerman.id.au, linuxppc-dev@lists.ozlabs.org, npiggin@gmail.com,
 christophe.leroy@csgroup.eu
Cc: Oscar Salvador, Michal Hocko, Vishal Verma
Subject: Re: [PATCH v4 4/6] mm/hotplug: Allow pageblock alignment via altmap
 reservation
In-Reply-To: <3f22b23a-701a-548b-9d84-8ecad695c313@linux.ibm.com>
References: <20230718024409.95742-1-aneesh.kumar@linux.ibm.com>
 <20230718024409.95742-5-aneesh.kumar@linux.ibm.com>
 <29eb32f0-fb0b-c8f9-ba23-8295147808ea@redhat.com>
 <3f22b23a-701a-548b-9d84-8ecad695c313@linux.ibm.com>
On 24.07.23 18:02, Aneesh Kumar K V wrote:
> On 7/24/23 9:11 PM, David Hildenbrand wrote:
>> On 24.07.23 17:16, Aneesh Kumar K V wrote:
>>
>>>> /*
>>>>  * In "forced" memmap_on_memory mode, we always align the vmemmap size up to cover
>>>>  * full pageblocks. That way, we can add memory even if the vmemmap size is not properly
>>>>  * aligned, however, we might waste memory.
>>>>  */
>>>
>>> I am finding that confusing. We do want things to be pageblock_nr_pages
>>> aligned both ways. With MEMMAP_ON_MEMORY_FORCE, we do that by allocating
>>> more space for the memmap, and in the default case we do that by making
>>> sure only memory blocks of specific sizes that support that alignment
>>> can use the MEMMAP_ON_MEMORY feature.
>>
>> Seeing the usage in mhp_supports_memmap_on_memory(), I guess that makes
>> sense then.
>>
>> But if you have any ideas on how to clarify that (terminology), I'm all ears!
>
> I updated the commit message:
>
> mm/hotplug: Support memmap_on_memory when memmap is not aligned to pageblocks
>
> Currently, the memmap_on_memory feature is only supported with memory block
> sizes that result in vmemmap pages covering full page blocks. This is
> because the memory onlining/offlining code requires applicable ranges to be
> pageblock-aligned, for example, to set the migratetypes properly.
>
> This patch helps to lift that restriction by reserving more pages than
> required for vmemmap space. This allows the start address to be
> pageblock-aligned with different memory block sizes, which implies the
> kernel will reserve some pages for every memory block. It also makes the
> memmap_on_memory feature widely useful with different memory block size
> values.
>
> For example, with a 64K page size and a 256MiB memory block size, we
> require 4 pages to map the vmemmap pages; to align things correctly we end
> up adding a reserve of 28 pages. That is, for every 4096 pages, 28 pages
> get reserved.

Much better.
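
Just to spell out the arithmetic behind those numbers (a back-of-the-envelope
sketch, not code from the patch; the 64-byte struct page size and
pageblock_nr_pages == 32 for this 64K configuration are assumptions of the
example):

	unsigned long base_pages    = (256UL << 20) >> 16;          /* 256MiB block / 64K pages = 4096  */
	unsigned long memmap_bytes  = base_pages * 64;              /* 4096 * 64B struct page = 256KiB  */
	unsigned long memmap_pages  = memmap_bytes >> 16;           /* 256KiB / 64K = 4 vmemmap pages   */
	unsigned long aligned_pages = (memmap_pages + 31) & ~31UL;  /* round up to one 32-page pageblock */
	unsigned long reserve_pages = aligned_pages - memmap_pages; /* 28 pages of padding              */

So 4 + 28 = 32 pages, one full pageblock set aside per 256MiB block.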
> Also while implementing your suggestion to use
> memory_block_memmap_on_memory_size(), I am finding it not really useful,
> because in mhp_supports_memmap_on_memory() we are checking whether
> remaining_size is pageblock_nr_pages aligned (dax_kmem may want to use
> that helper later).

Let's focus on this patchset here first. Factoring out how many memmap
pages we actually need vs. how many pages we need when aligning up sounds
very reasonable to me.

Can you elaborate what the problem is?

> Also I still think altmap.reserve is easier because of the start_pfn
> calculation. (more on this below)

Can you elaborate? Do you mean the try_remove_memory() change?

>> [...]
>>
>>>>> +    return arch_supports_memmap_on_memory(size);
>>>>>    }
>>>>>
>>>>>    /*
>>>>> @@ -1311,7 +1391,11 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
>>>>>    {
>>>>>        struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) };
>>>>>        enum memblock_flags memblock_flags = MEMBLOCK_NONE;
>>>>> -    struct vmem_altmap mhp_altmap = {};
>>>>> +    struct vmem_altmap mhp_altmap = {
>>>>> +        .base_pfn =  PHYS_PFN(res->start),
>>>>> +        .end_pfn  =  PHYS_PFN(res->end),
>>>>> +        .reserve  = memory_block_align_base(resource_size(res)),
>>>>
>>>> Can you remind me why we have to set reserve here at all?
>>>>
>>>> IOW, can't we simply set
>>>>
>>>>     .free = memory_block_memmap_on_memory_size();
>>>>
>>>> and then pass
>>>>
>>>>     mhp_altmap.alloc + mhp_altmap.free
>>>>
>>>> to create_memory_block_devices() instead?
>>>
>>> But with the dax usage of altmap, altmap->reserve is what we use to
>>> reserve things to get the required alignment. One difference is where
>>> we allocate the struct pages; for this specific case it should not
>>> matter:
>>>
>>> static unsigned long __meminit vmem_altmap_next_pfn(struct vmem_altmap *altmap)
>>> {
>>>     return altmap->base_pfn + altmap->reserve + altmap->alloc
>>>         + altmap->align;
>>> }
>>>
>>> The other is where we online a memory block: we find the start pfn
>>> using mem->altmap->alloc + mem->altmap->reserve.
>>>
>>> Considering altmap->reserve is what dax pfn_dev uses, is there a reason
>>> you want to use altmap->free for this?
>>
>> "Reserve" is all about "reserving that much memory for driver usage".
>>
>> We don't care about that. We simply want vmemmap allocations coming from
>> the pageblock(s) we set aside. Where exactly, we don't care.
>>
>>> I find it confusing to update free when we haven't allocated any altmap
>>> blocks yet.
>>
>> "
>> @reserve: pages mapped, but reserved for driver use (relative to @base)
>> @free: free pages set aside in the mapping for memmap storage
>> @alloc: track pages consumed, private to vmemmap_populate()
>> "
>>
>> To me, that implies that we can ignore "reserve". We set @free to the
>> aligned value and let the vmemmap get allocated from anything in there.
>>
>> free + alloc should always sum up to our set-aside pageblock(s), no?
>
> The difference is
>
> mhp_altmap.free = PHYS_PFN(size) - reserved blocks;
>
> That is, with a 256MiB memory block size and 64K pages, we need 4 memmap
> pages and we reserve 28 pages for alignment:
>
> mhp_altmap.free = PHYS_PFN(size) - 28;
>
> So the 4 pages from which we are allocating the memmap pages are still
> counted as free pages.
>
> We could also make it work by doing
>
> mhp_altmap.free = PHYS_PFN(size) - (memory_block_memmap_on_memory_size() - memory_block_memmap_size());
>
> But is that any better than what we have now?
>
> I understand the term "reserved for driver use" is confusing for this use
> case. But it is really reserving things for the required alignment.

Let's take a step back.

altmap->alloc tells us how much was already allocated.

altmap->free tells us how much memory we can allocate at max (confusing,
but see vmem_altmap_nr_free()).
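
For reference, vmem_altmap_nr_free() in mm/sparse-vmemmap.c boils down to
the following (paraphrased from memory, so double-check the actual source):

static unsigned long __meminit vmem_altmap_nr_free(struct vmem_altmap *altmap)
{
	/* pages already handed out by the altmap allocator */
	unsigned long allocated = altmap->alloc + altmap->align;

	/* @free is the budget; @reserve never enters the calculation */
	if (altmap->free > allocated)
		return altmap->free - allocated;
	return 0;
}

So @free is really the allocation budget, while @reserve (per
vmem_altmap_next_pfn() above) only shifts where allocations start.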
altmap->free should actually have been called differently. I think it's
currently even *wrong* to set free = PHYS_PFN(size): we don't want to
allocate beyond the first pageblock(s) we selected.

Can't we set:

1) add_memory_resource():

	.base_pfn = PHYS_PFN(start);
	.free = PHYS_PFN(memory_block_memmap_on_memory_size());

2) try_remove_memory():

	.base_pfn = PHYS_PFN(start);
	.alloc = PHYS_PFN(memory_block_memmap_on_memory_size());

Faking that all was allocated and avoiding any reservation terminology?

-- 
Cheers,

David / dhildenb