Date: Mon, 17 Feb 2025 15:05:44 -0500
From: Gregory Price
To: lsf-pc@lists.linux-foundation.org
Cc: linux-mm@kvack.org, linux-cxl@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: CXL Boot to Bash - Section 3: Memory (block) Hotplug

The story up to now
-------------------
When we left the driver arena, we had created a dax device - which
connects a Soft Reserved iomem resource to one or more `memory blocks`
via the kmem driver. We also discussed a bit about ZONE selection
and default online behavior.

In this section we'll discuss what actually goes into memory block
creation, how those memory blocks are exposed to kernel allocators
(tl;dr: sparsemem / memmap / struct page), and the implications of
the selected memory zones.

-------------------------------------
Step 7: Hot-(un)plug Memory (Blocks).
-------------------------------------
Memory hotplug refers to surfacing physical memory to kernel
allocators (page, slab, cache, etc) - as opposed to the action of
"physically hotplugging" a device into a system (e.g. USB).

Physical memory is exposed to allocators in the form of memory blocks.

A `memory block` is an abstraction to describe a physically
contiguous region of memory, or more explicitly a collection of
physically contiguous page frames which is described by a physically
contiguous set of `struct page` structures in the system memory-map.

The system memmap is what is used for pfn-to-page (struct) and
page(struct)-to-pfn conversions. The system memmap has `flat` and
`sparse` modes (configured at build-time).

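As a rough illustration of the pfn-to-page and page-to-pfn conversions
in a sparse memmap, here is a toy Python model. This is a sketch, not
kernel code: the class and its methods are hypothetical, a "page" is
just a (section, index) tuple, and the 128MB section size
(SECTION_SIZE_BITS = 27) is the x86_64 value, assumed here for
concreteness.

```python
PAGE_SHIFT = 12                 # 4KB pages
SECTION_SIZE_BITS = 27          # assumption: 128MB sections (x86_64)
PFN_SECTION_SHIFT = SECTION_SIZE_BITS - PAGE_SHIFT
PAGES_PER_SECTION = 1 << PFN_SECTION_SHIFT


class ToySparseMemmap:
    """Hypothetical model: only present sections have page descriptors."""

    def __init__(self):
        self.present = set()    # section numbers that have a memmap

    def add_section(self, section_nr):
        # Stand-in for hot-add: the section gains a memmap.
        self.present.add(section_nr)

    def remove_section(self, section_nr):
        # Stand-in for hot-remove: the memmap goes away again.
        self.present.discard(section_nr)

    def pfn_to_page(self, pfn):
        section_nr = pfn >> PFN_SECTION_SHIFT
        if section_nr not in self.present:
            raise LookupError(f"pfn {pfn:#x} is in a hole (no memmap)")
        return (section_nr, pfn & (PAGES_PER_SECTION - 1))

    def page_to_pfn(self, page):
        section_nr, index = page
        return (section_nr << PFN_SECTION_SHIFT) | index


memmap = ToySparseMemmap()
memmap.add_section(8)                        # covers pfns 0x40000..0x47fff
page = memmap.pfn_to_page(0x40005)
print(page, hex(memmap.page_to_pfn(page)))   # (8, 5) 0x40005
```

A flat memmap would instead be a single array indexed by pfn, which
leaves no obvious way for ranges to appear or disappear after boot.
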
Memory hotplug requires the use of `sparsemem`, which aptly makes the
memory map sparse.

Hot *remove* (un-plug) is distinct from Hot add (plug). To hot-remove
an active memory block, the pages in-use must have their data (and
therefore mappings) migrated to another memory block. Hot-remove must
be specifically enabled separate from hotplug.

Build configurations affecting memory block hot(un)plug
  CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG
  CONFIG_SPARSEMEM
  CONFIG_64BIT
  CONFIG_MEMORY_HOTPLUG
  CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE
  CONFIG_MHP_DEFAULT_ONLINE_TYPE_OFFLINE
  CONFIG_MHP_DEFAULT_ONLINE_TYPE_AUTO
  CONFIG_MHP_DEFAULT_ONLINE_TYPE_ONLINE_KERNEL
  CONFIG_MHP_DEFAULT_ONLINE_TYPE_ONLINE_MOVABLE
  CONFIG_MHP_MEMMAP_ON_MEMORY
  CONFIG_ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
  CONFIG_MIGRATION
  CONFIG_MEMORY_HOTREMOVE

During early-boot, the kernel finds all SystemRAM memory regions NOT
marked "Special Purpose" and will create memory blocks for these
regions by default. These blocks are defaulted into ZONE_NORMAL
(more on zones shortly).

Memory regions present at boot marked `EFI_MEMORY_SP` have memory
blocks created and hot-plugged by drivers. The same mechanism is used
to hot-add memory physically hotplugged after system boot (i.e. not
present in the EFI Memory Map at boot time).

The DAX/KMEM driver hotplugs memory blocks via the
`add_memory_driver_managed()` function.

-------------------------------
Step 8: Page Struct allocation.
-------------------------------
A `memory block` is made up of a collection of physical memory pages,
which must have entries in the system Memory Map - which is managed by
sparsemem on systems with memory (block) hotplug. Sparsemem fills the
memory map with `struct page` for hot-plugged memory.

Here is a rough trace through the (current) stack on how page structs
are populated into the system Memory Map on hotplug.

```
add_memory_driver_managed
  add_memory_resource
    memblock_add_node
    arch_add_memory
      init_memory_mapping
      add_pages
        __add_pages
          sparse_add_section
            section_activate
              populate_section_memmap
                __populate_section_memmap
                  memmap_alloc
                    memblock_alloc_try_nid_raw
                      memblock_alloc_internal
                        memblock_alloc_range_nid
                          kzalloc_node(..., GFP_KERNEL, ...)
```

All allocatable memory requires `struct page` resources to describe the
physical page state. On a system with regular 4KB size pages and 256GB
of memory - 4GB is required just to describe/manage the memory.

This is ~1.5% of the new capacity to just surface it (4/256).

This becomes an issue if the memory is not intended for kernel-use,
as `struct page` memory must be allocated in non-movable, kernel memory
`zones`. If hot-plugged capacity is designated for a non-kernel zone
(ZONE_MOVABLE, ZONE_DEVICE, etc), then there must be sufficient
ZONE_NORMAL (or similar kernel-compatible zone) to allocate from.

Matthew Wilcox has a plan to reduce this cost, some details of his plan:
https://fosdem.org/2025/schedule/event/fosdem-2025-4860-shrinking-memmap/
https://lore.kernel.org/all/Z37pxbkHPbLYnDKn@casper.infradead.org/

---------------------
Step 9: Memory Zones.
---------------------
We've alluded to "Memory Zones" in prior sections, with really the only
detail about these concepts being that there are "Kernel-allocation
compatible" and "Movable" zones, as well as some relationship between
memory blocks and memory zones.

The two zones we really care about are `ZONE_NORMAL` and `ZONE_MOVABLE`.

For the purpose of this reading we'll consider two basic use-cases:
- memory block hot-unplug
- kernel resource allocation

You can (for the most part) consider these cases incompatible. If the
kernel allocates `struct page` memory from a block, then that block
cannot be hot-unplugged. This memory is typically unmovable (cannot be
migrated), and its pages are unlikely to be removed from the memory map.

There are other scenarios, such as page pinning, that can block
hot-unplug. The individual mechanisms preventing hot-unplug are less
important than their relationship to memory zones.

ZONE_NORMAL basically allows any allocations, including things like
page tables, struct pages, and pinned memory.

ZONE_MOVABLE, under normal conditions, disallows most kernel
allocations.

ZONE_MOVABLE does NOT make a *strong* guarantee of hot-unplug-ability.
The kernel and privileged users can cause long-term pinning to occur -
even in ZONE_MOVABLE. It should be seen as a best-attempt at providing
hot-unplug-ability under normal conditions.

Here's the take-away:

Any capacity marked SystemRAM but not Special Purpose during early boot
will be onlined into ZONE_NORMAL by default - making it available for
kernel-use during boot. There is no guarantee of being hot-unpluggable.

Any capacity marked Special Purpose at boot, or hot-added (physically),
will be onlined into a user-selected zone (Normal or Movable).

There are (at least) 4 ways to select what zone to online memory blocks.

  Build Time:    CONFIG_MHP_DEFAULT_ONLINE_TYPE_*
  Boot Time:     memhp_default_state (boot parameter)
  udev / daxctl: user policy explicitly requesting the zone
  memory sysfs:  echo online_movable > /sys/bus/memory/devices/memoryN/state

------------------------------------------
Nuance: memmap_on_memory and ZONE_MOVABLE.
------------------------------------------
As alluded to in the prior sections - hot-added ZONE_MOVABLE capacity
will consume ZONE_NORMAL capacity for its kernel resources. This can be
problematic if vast amounts of ZONE_MOVABLE are added on a system with
limited ZONE_NORMAL capacity.

For example, consider a system with 4GB of ZONE_NORMAL and 256GB of
ZONE_MOVABLE. This wouldn't work, as the entirety of ZONE_NORMAL would
be consumed to allocate `struct page` resources for the ZONE_MOVABLE
capacity - leaving no working memory for the rest of the kernel.

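The arithmetic behind that example, as a minimal Python sketch. It
assumes 4KB pages and 64 bytes per `struct page`; `memmap_bytes` is a
name made up for illustration, not a kernel interface.

```python
GiB = 1 << 30
PAGE_SIZE = 4 * 1024        # assumption: 4KB base pages
STRUCT_PAGE_SIZE = 64       # assumption: bytes per `struct page`


def memmap_bytes(capacity_bytes):
    """Bytes of `struct page` needed to describe capacity_bytes."""
    nr_pages = capacity_bytes // PAGE_SIZE
    return nr_pages * STRUCT_PAGE_SIZE


zone_normal = 4 * GiB
zone_movable = 256 * GiB

overhead = memmap_bytes(zone_movable)
print(overhead // GiB)              # 4 (GB of memmap)
print(overhead / zone_movable)      # 0.015625, i.e. ~1.5%
print(zone_normal - overhead)       # 0 bytes of ZONE_NORMAL left over
```

The last line is the problem in a nutshell: the memmap for the movable
capacity alone is as large as all of ZONE_NORMAL in this configuration.
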
The `memmap_on_memory` configuration option allows for hotplugged
memory blocks to host their own `struct page` allocations...

if they're placed in ZONE_NORMAL.

To enable, use the boot param: `memory_hotplug.memmap_on_memory=1`.

Sparsemem allocation of memory map resources ultimately uses a
`kzalloc_node` call, which simply allocates memory from ZONE_NORMAL
with a *suggested* node.

```
memmap_alloc
  memblock_alloc_try_nid_raw
    memblock_alloc_internal
      memblock_alloc_range_nid
        kzalloc_node(..., GFP_KERNEL, ...)
```

The node ID passed in as an argument is a "preferred node", which means
that if insufficient space exists on that node to service the
GFP_KERNEL allocation, it will fall back to another node.

If all hot-plugged memory is added to ZONE_MOVABLE, two things occur:

1) A portion of the memory block is carved out to allocate memmap data
   (reducing usable size by 64b*nr_pages)

2) The memory is allocated on ZONE_NORMAL on another node.

Result: Lost capacity due to the unused carve-out area for no value.

--------------------------------
The Complexity Story up til now.
--------------------------------
Platform and BIOS:
  May configure all the devices prior to kernel hand-off.
  May or may not support reconfiguring / hotplug.

BIOS and EFI:
  EFI_MEMORY_SP - used to defer management to drivers

Kernel Build and Boot:
  CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG
  CONFIG_SPARSEMEM
  CONFIG_64BIT
  CONFIG_MEMORY_HOTPLUG
  CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE
  CONFIG_MHP_DEFAULT_ONLINE_TYPE_OFFLINE
  CONFIG_MHP_DEFAULT_ONLINE_TYPE_AUTO
  CONFIG_MHP_DEFAULT_ONLINE_TYPE_ONLINE_KERNEL
  CONFIG_MHP_DEFAULT_ONLINE_TYPE_ONLINE_MOVABLE
  CONFIG_MHP_MEMMAP_ON_MEMORY
  CONFIG_ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
  CONFIG_MIGRATION
  CONFIG_MEMORY_HOTREMOVE
  CONFIG_EFI_SOFT_RESERVE=n - Will always result in CXL as SystemRAM
  nosoftreserve - Will always result in CXL as SystemRAM
  kexec - SystemRAM configs carry over to target
  memory_hotplug.memmap_on_memory

Driver Build Options Required
  CONFIG_CXL_ACPI
  CONFIG_CXL_BUS
  CONFIG_CXL_MEM
  CONFIG_CXL_PCI
  CONFIG_CXL_PORT
  CONFIG_CXL_REGION
  CONFIG_DEV_DAX
  CONFIG_DEV_DAX_CXL
  CONFIG_DEV_DAX_KMEM

User Policy
  CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE (<=v6.13)
  CONFIG_MHP_DEFAULT_ONLINE_TYPE (>=v6.14)
  memhp_default_state (boot param)
  daxctl online-memory daxN.Y (userland)

Nuances
  Early-boot resource re-use
  Memory Block Alignment
  memmap_on_memory + ZONE_MOVABLE

----------------------------------------------------

Next up:
  RAS - Poison, MCE, and why you probably want CXL=ZONE_MOVABLE
  Interleave - RAS and Region Management

~Gregory