From: Yang Shi
Date: Tue, 18 Feb 2025 09:49:28 -0800
Subject: Re: CXL Boot to Bash - Section 3: Memory (block) Hotplug
To: Gregory Price
Cc: lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, linux-cxl@vger.kernel.org, linux-kernel@vger.kernel.org

On Mon, Feb 17, 2025 at 12:05 PM Gregory Price wrote:
>
>
> The story up to now
> -------------------
> When we left the driver arena, we had created a dax device - which
> connects a Soft Reserved iomem resource to one or more `memory blocks`
> via the kmem driver.
> We also discussed a bit about ZONE selection
> and default online behavior.
>
> In this section we'll discuss what actually goes into memory block
> creation, how those memory blocks are exposed to kernel allocators
> (tl;dr: sparsemem / memmap / struct page), and the implications of
> the selected memory zones.
>
>
> -------------------------------------
> Step 7: Hot-(un)plug Memory (Blocks).
> -------------------------------------
> Memory hotplug refers to surfacing physical memory to kernel
> allocators (page, slab, cache, etc) - as opposed to the action of
> "physically hotplugging" a device into a system (e.g. USB).
>
> Physical memory is exposed to allocators in the form of memory blocks.
>
> A `memory block` is an abstraction to describe a physically
> contiguous region of memory, or more explicitly a collection of
> physically contiguous page frames which is described by a physically
> contiguous set of `struct page` structures in the system memory-map.
>
> The system memmap is what is used for pfn-to-page (struct) and
> page(struct)-to-pfn conversions. The system memmap has `flat` and
> `sparse` modes (configured at build-time). Memory hotplug requires the
> use of `sparsemem`, which aptly makes the memory map sparse.
>
> Hot *remove* (un-plug) is distinct from Hot add (plug). To hot-remove
> an active memory block, the pages in-use must have their data (and
> therefore mappings) migrated to another memory block. Hot-remove must
> be specifically enabled separately from hotplug.
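To make the relationship between a physical address and its memory
block concrete, here is a minimal sketch. It assumes the memory block
size is read from /sys/devices/system/memory/block_size_bytes (reported
in hex), and that block N covers the N-th block-sized slice of the
physical address space. The helper names are made up for illustration
and are not kernel or library APIs.

```python
from pathlib import Path


def read_block_size(sysfs_root="/sys/devices/system/memory"):
    """Return the memory block size in bytes (sysfs reports it in hex)."""
    text = Path(sysfs_root, "block_size_bytes").read_text().strip()
    return int(text, 16)


def block_index(phys_addr, block_size):
    """Index N of the memoryN block assumed to contain phys_addr."""
    return phys_addr // block_size


def block_range(index, block_size):
    """Inclusive physical address range covered by memory block `index`."""
    start = index * block_size
    return start, start + block_size - 1


if __name__ == "__main__":
    # Assumed 128MB block size; real systems may use a different value.
    size = 128 * 1024 * 1024
    idx = block_index(0x1_0000_0000, size)
    start, end = block_range(idx, size)
    print(f"memory{idx}: {start:#x}-{end:#x}")
```

On a real system, read_block_size() would replace the hard-coded size.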
>
>
> Build configurations affecting memory block hot(un)plug
>   CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG
>   CONFIG_SPARSEMEM
>   CONFIG_64BIT
>   CONFIG_MEMORY_HOTPLUG
>   CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE
>   CONFIG_MHP_DEFAULT_ONLINE_TYPE_OFFLINE
>   CONFIG_MHP_DEFAULT_ONLINE_TYPE_AUTO
>   CONFIG_MHP_DEFAULT_ONLINE_TYPE_ONLINE_KERNEL
>   CONFIG_MHP_DEFAULT_ONLINE_TYPE_ONLINE_MOVABLE
>   CONFIG_MHP_MEMMAP_ON_MEMORY
>   CONFIG_ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
>   CONFIG_MIGRATION
>   CONFIG_MEMORY_HOTREMOVE
>
> During early-boot, the kernel finds all SystemRAM memory regions NOT
> marked "Special Purpose" and will create memory blocks for these
> regions by default. These blocks are defaulted into ZONE_NORMAL
> (more on zones shortly).
>
> Memory regions present at boot marked `EFI_MEMORY_SP` have memory blocks
> created and hot-plugged by drivers. The same mechanism is used to
> hot-add memory physically hotplugged after system boot (i.e. not present
> in the EFI Memory Map at boot time).
>
> The DAX/KMEM driver hotplugs memory blocks via the
> `add_memory_driver_managed()` function.
>
>
> -------------------------------
> Step 8: Page Struct allocation.
> -------------------------------
> A `memory block` is made up of a collection of physical memory pages,
> which must have entries in the system Memory Map - which is managed by
> sparsemem on systems with memory (block) hotplug. Sparsemem fills the
> memory map with `struct page` for hot-plugged memory.
>
> Here is a rough trace through the (current) stack on how page structs
> are populated into the system Memory Map on hotplug.
>
> ```
> add_memory_driver_managed
> add_memory_resource
> memblock_add_node
> arch_add_memory
> init_memory_mapping
> add_pages
> __add_pages
> sparse_add_section
> section_activate
> populate_section_memmap
> __populate_section_memmap
> memmap_alloc
> memblock_alloc_try_nid_raw
> memblock_alloc_internal
> memblock_alloc_range_nid
> kzalloc_node(..., GFP_KERNEL, ...)
> ```
>
> All allocatable-memory requires `struct page` resources to describe the
> physical page state. On a system with regular 4KB pages and 256GB
> of memory - 4GB is required just to describe/manage the memory.
>
> This is ~1.5% of the new capacity to just surface it (4/256).
>
> This becomes an issue if the memory is not intended for kernel-use,
> as `struct page` memory must be allocated in non-movable, kernel memory
> `zones`. If hot-plugged capacity is designated for a non-kernel zone
> (ZONE_MOVABLE, ZONE_DEVICE, etc), then there must be sufficient
> ZONE_NORMAL (or similar kernel-compatible zone) to allocate from.
>
> Matthew Wilcox has a plan to reduce this cost, some details of his plan:
> https://fosdem.org/2025/schedule/event/fosdem-2025-4860-shrinking-memmap/
> https://lore.kernel.org/all/Z37pxbkHPbLYnDKn@casper.infradead.org/
>
>
> ---------------------
> Step 9: Memory Zones.
> ---------------------
> We've alluded to "Memory Zones" in prior sections, with really the only
> detail about these concepts being that there are "Kernel-allocation
> compatible" and "Movable" zones, as well as some relationship between
> memory blocks and memory zones.
>
> The two zones we really care about are `ZONE_NORMAL` and `ZONE_MOVABLE`.
>
> For the purpose of this reading we'll consider two basic use-cases:
> - memory block hot-unplug
> - kernel resource allocation
>
> You can (for the most part) consider these cases incompatible. If the
> kernel allocates `struct page` memory from a block, then that block cannot
> be hot-unplugged. This memory is typically unmovable (cannot be migrated),
> and its pages are unlikely to be removed from the memory map.
>
> There are other scenarios, such as page pinning, that can block hot-unplug.
> The individual mechanisms preventing hot-unplug are less important than
> their relationship to memory zones.
>
> ZONE_NORMAL basically allows any allocations, including things like page
> tables, struct pages, and pinned memory.
>
> ZONE_MOVABLE, under normal conditions, disallows most kernel allocations.
>
> ZONE_MOVABLE does NOT make a *strong* guarantee of hot-unplug-ability.
> The kernel and privileged users can cause long-term pinning to occur -
> even in ZONE_MOVABLE. It should be seen as a best-attempt at providing
> hot-unplug-ability under normal conditions.
>
>
> Here's the take-away:
>
> Any capacity marked SystemRAM but not Special Purpose during early boot
> will be onlined into ZONE_NORMAL by default - making it available for
> kernel-use during boot. There is no guarantee of being hot-unpluggable.
>
> Any capacity marked Special Purpose at boot, or hot-added (physically),
> will be onlined into a user-selected zone (Normal or Movable).
>
> There are (at least) 4 ways to select which zone memory blocks are
> onlined into.
>
> Build Time:
>   CONFIG_MHP_DEFAULT_ONLINE_TYPE_*
> Boot Time:
>   memhp_default_state (boot parameter)
> udev / daxctl:
>   user policy explicitly requesting the zone
> memory sysfs
>   echo online_movable > /sys/bus/memory/devices/memoryN/state
>
>
> ------------------------------------------
> Nuance: memmap_on_memory and ZONE_MOVABLE.
> ------------------------------------------
> As alluded to in the prior sections - hot-added ZONE_MOVABLE capacity
> will consume ZONE_NORMAL capacity for its kernel resources. This can
> be problematic if vast amounts of ZONE_MOVABLE are added on a system
> with limited ZONE_NORMAL capacity.
>
> For example, consider a system with 4GB of ZONE_NORMAL and 256GB of
> ZONE_MOVABLE. This wouldn't work, as the entirety of ZONE_NORMAL would
> be consumed to allocate `struct page` resources for the ZONE_MOVABLE
> capacity - leaving no working memory for the rest of the kernel.
>
> The `memmap_on_memory` configuration option allows for hotplugged memory
> blocks to host their own `struct page` allocations...
>
> if they're placed in ZONE_NORMAL.
>
> To enable, use the boot param: `memory_hotplug.memmap_on_memory=1`.
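The 4GB / 256GB example can be checked with a few lines of arithmetic.
This is only a sketch, assuming 4KB base pages and a 64-byte
`struct page` (the figures used in the write-up); the 128MB memory
block size in the last line is an assumption for illustration, and the
function name is made up.

```python
KIB = 1 << 10
MIB = 1 << 20
GIB = 1 << 30

PAGE_SIZE = 4 * KIB      # 4KB base pages, as in the example
STRUCT_PAGE_SIZE = 64    # assumed bytes per struct page


def memmap_bytes(capacity, page_size=PAGE_SIZE, struct_page=STRUCT_PAGE_SIZE):
    """Bytes of struct page metadata needed to describe `capacity` bytes."""
    nr_pages = capacity // page_size
    return nr_pages * struct_page


if __name__ == "__main__":
    capacity = 256 * GIB
    cost = memmap_bytes(capacity)
    print(f"memmap for 256GB: {cost // GIB}GB")
    print(f"share of capacity: {100 * cost / capacity:.2f}%")
    # Per-block cost, assuming a 128MB memory block size.
    print(f"memmap per 128MB block: {memmap_bytes(128 * MIB) // MIB}MB")
```

With these assumptions the memmap for 256GB comes to exactly 4GB, which
is the whole of ZONE_NORMAL in the example above.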
>
> Sparsemem allocation of memory map resources ultimately uses a
> `kzalloc_node` call, which simply allocates memory from ZONE_NORMAL with
> a *suggested* node.
>
> ```
> memmap_alloc
> memblock_alloc_try_nid_raw
> memblock_alloc_internal
> memblock_alloc_range_nid
> kzalloc_node(..., GFP_KERNEL, ...)
> ```
>
> The node ID passed in as an argument is a "preferred node", which means
> if insufficient space on that node exists to service the GFP_KERNEL
> allocation, it will fall back to another node.
>
> If all hot-plugged memory is added to ZONE_MOVABLE, two things occur:
>
> 1) A portion of the memory block is carved out to allocate memmap
>    data (reducing usable size by 64B * nr_pages)
>
> 2) The memory is allocated on ZONE_NORMAL on another node.

Nice write-up, thanks for putting everything together. A follow-up
question on this: do you mean the memmap memory will show up as a new
node with only ZONE_NORMAL, alongside the other hot-plugged memory
blocks? So will we actually see two nodes hot-plugged?

Thanks,
Yang

>
> Result: Lost capacity due to the unused carve-out area for no value.
>
> ---------------------------------
> The Complexity Story up till now.
> ---------------------------------
> Platform and BIOS:
>   May configure all the devices prior to kernel hand-off.
>   May or may not support reconfiguring / hotplug.
>
> BIOS and EFI:
>   EFI_MEMORY_SP - used to defer management to drivers
>
> Kernel Build and Boot:
>   CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG
>   CONFIG_SPARSEMEM
>   CONFIG_64BIT
>   CONFIG_MEMORY_HOTPLUG
>   CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE
>   CONFIG_MHP_DEFAULT_ONLINE_TYPE_OFFLINE
>   CONFIG_MHP_DEFAULT_ONLINE_TYPE_AUTO
>   CONFIG_MHP_DEFAULT_ONLINE_TYPE_ONLINE_KERNEL
>   CONFIG_MHP_DEFAULT_ONLINE_TYPE_ONLINE_MOVABLE
>   CONFIG_MHP_MEMMAP_ON_MEMORY
>   CONFIG_ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
>   CONFIG_MIGRATION
>   CONFIG_MEMORY_HOTREMOVE
>   CONFIG_EFI_SOFT_RESERVE=n - Will always result in CXL as SystemRAM
>   nosoftreserve - Will always result in CXL as SystemRAM
>   kexec - SystemRAM configs carry over to target
>   memory_hotplug.memmap_on_memory
>
> Driver Build Options Required
>   CONFIG_CXL_ACPI
>   CONFIG_CXL_BUS
>   CONFIG_CXL_MEM
>   CONFIG_CXL_PCI
>   CONFIG_CXL_PORT
>   CONFIG_CXL_REGION
>   CONFIG_DEV_DAX
>   CONFIG_DEV_DAX_CXL
>   CONFIG_DEV_DAX_KMEM
>
> User Policy
>   CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE (<=v6.13)
>   CONFIG_MHP_DEFAULT_ONLINE_TYPE (>=v6.14)
>   memhp_default_state (boot param)
>   daxctl online-memory daxN.Y (userland)
>
> Nuances
>   Early-boot resource re-use
>   Memory Block Alignment
>   memmap_on_memory + ZONE_MOVABLE
>
> ----------------------------------------------------
> Next up:
>   RAS - Poison, MCE, and why you probably want CXL=ZONE_MOVABLE
>   Interleave - RAS and Region Management
>
> ~Gregory
>
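Since the list of build options is long, a quick way to see which ones
a given kernel was built with is to scan its config file. The sketch
below only parses the standard `CONFIG_X=y` / `# CONFIG_X is not set`
text format. Where the config text comes from (for example a
distro-provided file under /boot) is an assumption that varies by
system, and the choice of which options to treat as required is mine,
taken from the lists above. The function names are hypothetical.

```python
def parse_kconfig(text):
    """Map CONFIG_* names to values; options marked 'is not set' map to 'n'."""
    suffix = " is not set"
    options = {}
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("# CONFIG_") and line.endswith(suffix):
            options[line[2:-len(suffix)]] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options


# Assumed subset of the options listed in this thread.
REQUIRED = (
    "CONFIG_SPARSEMEM",
    "CONFIG_MEMORY_HOTPLUG",
    "CONFIG_MEMORY_HOTREMOVE",
    "CONFIG_MIGRATION",
    "CONFIG_CXL_REGION",
    "CONFIG_DEV_DAX",
    "CONFIG_DEV_DAX_CXL",
    "CONFIG_DEV_DAX_KMEM",
)


def missing_options(options, required=REQUIRED):
    """Return required options that are neither built-in (y) nor modular (m)."""
    return [name for name in required if options.get(name, "n") not in ("y", "m")]


if __name__ == "__main__":
    sample = "CONFIG_SPARSEMEM=y\n# CONFIG_MEMORY_HOTREMOVE is not set\n"
    print(missing_options(parse_kconfig(sample)))
```

An empty result only says the options are present; it says nothing
about boot parameters or user policy, which still have to be checked
separately.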