Message-ID: <3f22b23a-701a-548b-9d84-8ecad695c313@linux.ibm.com>
Date: Mon, 24 Jul 2023 21:32:20 +0530
From: Aneesh Kumar K V
Subject: Re: [PATCH v4 4/6] mm/hotplug: Allow pageblock alignment via altmap reservation
To: David Hildenbrand, linux-mm@kvack.org, akpm@linux-foundation.org, mpe@ellerman.id.au, linuxppc-dev@lists.ozlabs.org, npiggin@gmail.com, christophe.leroy@csgroup.eu
Cc: Oscar Salvador, Michal Hocko, Vishal Verma
In-Reply-To: <29eb32f0-fb0b-c8f9-ba23-8295147808ea@redhat.com>

On 7/24/23 9:11 PM, David Hildenbrand wrote:
> On 24.07.23 17:16, Aneesh Kumar K V wrote:
> 
>>>
>>> /*
>>>   * In "forced" memmap_on_memory mode, we always align the vmemmap size up to cover
>>>   * full pageblocks. That way, we can add memory even if the vmemmap size is not properly
>>>   * aligned, however, we might waste memory.
>>>   */
>>
>> I am finding that confusing. We do want things to be pageblock_nr_pages aligned both ways.
>> With MEMMAP_ON_MEMORY_FORCE, we do that by allocating more space for memmap and
>> in the default case we do that by making sure only memory blocks of specific size supporting
>> that alignment can use MEMMAP_ON_MEMORY feature.
> 
> See the usage in mhp_supports_memmap_on_memory(), I guess that makes sense then.
> 
> But if you have any ideas on how to clarify that (terminology), I'm all ears!
> 

I updated the commit message:

mm/hotplug: Support memmap_on_memory when memmap is not aligned to pageblocks

Currently, the memmap_on_memory feature is only supported with memory block
sizes that result in vmemmap pages covering full pageblocks. This is
because the memory onlining/offlining code requires applicable ranges to be
pageblock-aligned, for example, to set the migratetypes properly.

This patch lifts that restriction by reserving more pages than required
for the vmemmap space, so that the start address of the usable memory is
pageblock-aligned for different memory block sizes. This implies that the
kernel reserves some pages in every memory block. It also makes the
memmap_on_memory feature useful with a wider range of memory block sizes.

For example, with a 64K page size and a 256MiB memory block size, we need
4 pages to hold the vmemmap. To align things correctly, we end up adding a
reserve of 28 pages, i.e., for every 4096 pages, 28 pages get reserved.

Also, while implementing your suggestion to use memory_block_memmap_on_memory_size(),
I am finding it not really useful, because in mhp_supports_memmap_on_memory() we are
checking whether remaining_size is pageblock_nr_pages aligned (dax_kmem may want to use
that helper later). I also still think altmap.reserve is easier because of the start_pfn
calculation (more on this below).

> [...]
> 
>>>> +    return arch_supports_memmap_on_memory(size);
>>>>    }
>>>>      /*
>>>> @@ -1311,7 +1391,11 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
>>>>    {
>>>>        struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) };
>>>>        enum memblock_flags memblock_flags = MEMBLOCK_NONE;
>>>> -    struct vmem_altmap mhp_altmap = {};
>>>> +    struct vmem_altmap mhp_altmap = {
>>>> +        .base_pfn =  PHYS_PFN(res->start),
>>>> +        .end_pfn  =  PHYS_PFN(res->end),
>>>> +        .reserve  = memory_block_align_base(resource_size(res)),
>>>
>>> Can you remind me why we have to set reserve here at all?
>>>
>>> IOW, can't we simply set
>>>
>>> .free = memory_block_memmap_on_memory_size();
>>>
>>> and then pass
>>>
>>> mhp_altmap.alloc + mhp_altmap.free
>>>
>>> to create_memory_block_devices() instead?
>>>
>>
>> But with the dax usage of altmap, altmap->reserve is what we use to reserve things to get
>> the required alignment. One difference is where we allocate the struct page at. For this specific
>> case it should not matter.
>>
>> static unsigned long __meminit vmem_altmap_next_pfn(struct vmem_altmap *altmap)
>> {
>>     return altmap->base_pfn + altmap->reserve + altmap->alloc
>>         + altmap->align;
>> }
>>
>> The other is where we online a memory block:
>>
>> We find the start pfn using mem->altmap->alloc + mem->altmap->reserve;
>>
>> Considering altmap->reserve is what dax pfn_dev uses, is there a reason you want to use altmap->free for this?
> 
> "Reserve" is all about "reserving that much memory for driver usage".
> 
> We don't care about that. We simply want vmemmap allocations coming from the pageblock(s) we set aside. Where exactly, we don't care.
> 
>> I find it confusing to update free when we haven't allocated any altmap blocks yet.
> 
> "
> @reserve: pages mapped, but reserved for driver use (relative to @base)"
> @free: free pages set aside in the mapping for memmap storage
> @alloc: track pages consumed, private to vmemmap_populate()
> "
> 
> To me, that implies that we can ignore "reserve". We set @free to the aligned value and let the vmemmap get allocated from anything in there.
> 
> free + alloc should always sum up to our set-aside pageblock(s), no?
> 

The difference is

mhp_altmap.free = PHYS_PFN(size) - reserved blocks;

i.e., with a 256MiB memory block size and 64K pages, we need 4 memmap pages and we reserve 28 pages for alignment:

mhp_altmap.free = PHYS_PFN(size) - 28.

So the 4 pages from which we allocate the memmap pages are still counted as free pages.

We could make it all work by doing

mhp_altmap.free = PHYS_PFN(size) - (memory_block_memmap_on_memory_size() - memory_block_memmap_size())

But is that any better than what we have now? I understand the term "reserved for driver use" is confusing for this use case,
but it really is reserving things for the required alignment.

-aneesh