Date: Wed, 5 Feb 2025 11:06:08 -0500
From: Gregory Price
To: lsf-pc@lists.linux-foundation.org
Cc: linux-mm@kvack.org, linux-cxl@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: CXL Boot to Bash - Section 2: The Drivers

(background reading as we build up complexity)

Driver Management - Decoders, HPA/SPA, DAX, and RAS.

The Drivers
===========

----------------------
The Story Up 'til Now.
----------------------

When we left the Platform arena, assuming we've configured with special
purpose memory, we are left with an entry in the memory map like so:

BIOS-e820: [mem 0x000000c050000000-0x000000fcefffffff] soft reserved

/proc/iomem:
    c050000000-fcefffffff : Soft Reserved

This resource (see kernel/resource.c) is left unused until a driver comes
along to actually surface it to allocators (or some other interface).

In our case, the drivers involved (or at least the ones we'll reference)
are:

drivers/base/     : device probing, memory (block) hotplug
drivers/acpi/     : device hotplug
drivers/acpi/numa : NUMA ACPI table info (SRAT, CEDT, HMAT, ...)
drivers/pci/      : PCI device probing
drivers/cxl/      : CXL device probing
drivers/dax/      : cxl device to memory resource association

We don't necessarily care about the specifics of each driver; we'll
focus on just the aspects that ultimately affect memory management.

-------------------------------
Step 4: Basic build complexity.
-------------------------------
To make a long story short:

CXL Build Configurations:
    CONFIG_CXL_ACPI
    CONFIG_CXL_BUS
    CONFIG_CXL_MEM
    CONFIG_CXL_PCI
    CONFIG_CXL_PORT
    CONFIG_CXL_REGION

DAX Build Configurations:
    CONFIG_DEV_DAX
    CONFIG_DEV_DAX_CXL
    CONFIG_DEV_DAX_KMEM

Without all of these enabled, your journey will be cut short because
some piece of the probe process will stop progressing.

The most common misconfiguration I run into is CONFIG_DEV_DAX_CXL not
being enabled. You end up with memory regions without dax devices.

[/sys/bus/cxl/devices]# ls
dax_region0  decoder0.0  decoder1.0  decoder2.0  .....
dax_region1  decoder0.1  decoder1.1  decoder3.0  .....

^^^ These dax regions require `CONFIG_DEV_DAX_CXL` enabled to fully
surface as dax devices, which can then be converted to system ram.

---------------------------------------------------------------
Step 5: The CXL driver associating devices and iomem resources.
---------------------------------------------------------------

The CXL driver wires up the following devices:
    root       : CXL root
    portN      : An intermediate or endpoint destination for accesses
    memN       : memory devices

Each device in the hierarchy may have one or more decoders
    decoderN.M : Address routing and translation devices

The driver will also create additional objects and associations
    regionN     : device-to-iomem resource mapping
    dax_regionN : region-to-dax device mapping

Most associations built by the driver are done by validating decoders
against each other at each point in the hierarchy.

Root decoders describe memory regions and route DMA to ports.
Intermediate decoders route DMA through CXL fabric.
Endpoint decoders translate addresses (Host to device).
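
As a rough illustration of the naming scheme above, the following Python
sketch sorts the entries of /sys/bus/cxl/devices into the object kinds
listed in this step. The helper names (classify, decoders_by_owner) and
the kind labels are illustrative, not anything the driver provides, and
treating the N in decoderN.M as the id of the owning root/port/endpoint
is an assumption based on the listing and diagram in this section.

```python
import os
import re
from collections import defaultdict

# Name patterns taken from the object list in this step. The kind labels
# are illustrative only. "endpointN" is included because the diagram in
# this section uses that name.
PATTERNS = [
    ("root", re.compile(r"^root\d*$")),
    ("port", re.compile(r"^port\d+$")),
    ("endpoint", re.compile(r"^endpoint\d+$")),
    ("mem", re.compile(r"^mem\d+$")),
    ("decoder", re.compile(r"^decoder\d+\.\d+$")),
    ("dax_region", re.compile(r"^dax_region\d+$")),
    ("region", re.compile(r"^region\d+$")),
]

DECODER = re.compile(r"^decoder(\d+)\.(\d+)$")


def classify(names):
    """Group device names by kind; unknown names go under 'other'."""
    groups = defaultdict(list)
    for name in names:
        for kind, pattern in PATTERNS:
            if pattern.match(name):
                groups[kind].append(name)
                break
        else:
            groups["other"].append(name)
    return dict(groups)


def decoders_by_owner(names):
    """Map the N of decoderN.M to the decoder names sharing that N.

    Assumption: N identifies the device that owns the decoder.
    """
    owners = defaultdict(list)
    for name in names:
        match = DECODER.match(name)
        if match:
            owners[int(match.group(1))].append(name)
    return dict(owners)


if __name__ == "__main__":
    path = "/sys/bus/cxl/devices"
    # On a machine without the CXL bus this simply prints nothing.
    names = sorted(os.listdir(path)) if os.path.isdir(path) else []
    for kind, members in classify(names).items():
        print(f"{kind:12s} {' '.join(members)}")
```

A dax_region group that stays empty while region entries exist would
match the CONFIG_DEV_DAX_CXL misconfiguration described in Step 4.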

A Root port has 1 decoder per associated CFMW in the CEDT
    decoder0.0 -> `c050000000-fcefffffff : Soft Reserved`

A region (iomem resource mapping) can be created for these decoders
[/sys/bus/cxl/devices/region0]# cat resource size target0
0xc050000000
0x3ca0000000
decoder5.0

A dax_region surfaces these regions as a dax device
[/sys/bus/cxl/devices/dax_region0/dax0.0]# cat resource
0xc050000000

So in a simple environment with 1 device, we end up with a mapping
that looks something like this.

root      --- decoder0.0 --- region0 -- dax_region0 -- dax0
 |               |              |
port1     --- decoder1.0        |
 |               |              |
endpoint0 --- decoder3.0--------/

Much of the complexity in region creation stems from validating decoder
programming and associating regions with targets (endpoint decoders).

The take-away from this section is the existence of "decoders", of which
there may be an arbitrary number between the root and endpoint.

This will be relevant when we talk about RAS (Poison) and Interleave.

---------------------------------------------------------------
Step 6: DAX surfacing Memory Blocks - First bit of User Policy.
---------------------------------------------------------------

The last step in surfacing memory to allocators is to convert a dax
device into memory blocks. On most default kernel builds, dax devices
are not automatically converted to SystemRAM.
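
Whether hot-added memory blocks come online without help from userland
can be checked at runtime. Below is a minimal Python sketch, assuming
the memory-hotplug sysfs file /sys/devices/system/memory/auto_online_blocks
is present on the running kernel. The function name and the description
strings are illustrative.

```python
from pathlib import Path

# sysfs file that reports the default-online policy for hot-added
# memory blocks (assumed present; the function returns None if not).
AUTO_ONLINE = Path("/sys/devices/system/memory/auto_online_blocks")

# Informal descriptions of the values this file can hold.
POLICY = {
    "offline": "blocks stay offline until userland onlines them",
    "online": "blocks online automatically, kernel picks the zone",
    "online_kernel": "blocks online automatically into a kernel zone",
    "online_movable": "blocks online automatically into ZONE_MOVABLE",
}


def default_online_policy(path=AUTO_ONLINE):
    """Return (value, description), or None if the file is unreadable."""
    try:
        value = Path(path).read_text().strip()
    except OSError:
        return None
    return value, POLICY.get(value, "unrecognised value")


if __name__ == "__main__":
    result = default_online_policy()
    if result is None:
        print("auto_online_blocks not available on this system")
    else:
        print(f"{result[0]}: {result[1]}")
```

If this reports "offline", a dax device's memory blocks will stay
unused until something in userland onlines them.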

Policy Choices
    userland policy: daxctl
    default-online : CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE
                     or
                     CONFIG_MHP_DEFAULT_ONLINE_TYPE_*
                     or
                     memhp_default_state=*

To convert a dax device to SystemRAM utilizing daxctl:

    daxctl online-memory dax0.0 [--no-movable]

    By default the memory will online into ZONE_MOVABLE
    The --no-movable option will online the memory in ZONE_NORMAL

Alternatively, this can be done at Build or Boot time using
    CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE (v6.13 or below)
    CONFIG_MHP_DEFAULT_ONLINE_TYPE_*     (v6.14 or above)
    memhp_default_state=*                (boot param predating cxl)

I will save the discussion of ZONE selection to the next section,
which will cover more memory-hotplug specifics.

At this point, the memory blocks are exposed to the kernel mm allocators
and may be used as normal System RAM.

---------------------------------------------------------
Second bit of nuanced complexity: Memory Block Alignment.
---------------------------------------------------------
In section 1, we introduced CEDT / CFMW and how they map to iomem
resources. In this section we discussed how we surface memory blocks
to the kernel allocators.

However, at no time did platform, arch code, and driver communicate
about the expected size of a memory block. In most cases, the size
of a memory block is defined by the architecture - unaware of CXL.

On x86, for example, the heuristic for memory block size is:
    1) user boot-arg value
    2) Maximize size (up to 2GB) if operating on bare metal
    3) Use largest value that aligns with the end of memory

The problem is that [SOFT RESERVED] memory is not considered in the
alignment calculation - and not all [SOFT RESERVED] memory *should*
be considered for alignment.

In the case of our working example (real system, btw):

    Subtable Type       : 01 [CXL Fixed Memory Window Structure]
    Window base address : 000000C050000000
    Window size         : 0000003CA0000000

The base is 256MB aligned (the minimum for the CXL Spec), and the
window size is 512MB aligned.
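
To see what that alignment costs, here is a minimal Python sketch that
trims a window to memory-block boundaries. It assumes a 2GB block size
(the bare-metal maximum from the heuristic above) and that only whole,
block-aligned blocks are usable. The function name is illustrative.

```python
MiB = 1 << 20
GiB = 1 << 30


def trim_to_blocks(base, size, block_size=2 * GiB):
    """Trim the window [base, base + size) to block boundaries.

    Returns (usable_bytes, front_loss, back_loss), all in bytes.
    """
    end = base + size
    start = -(-base // block_size) * block_size  # round base up
    stop = end // block_size * block_size        # round end down
    if stop <= start:
        # The window holds no whole block: everything is lost, and
        # this sketch reports the whole size as front loss.
        return 0, size, 0
    return stop - start, start - base, end - stop


if __name__ == "__main__":
    # Window taken from the CFMW entry in the working example.
    usable, front, back = trim_to_blocks(0xC050000000, 0x3CA0000000)
    print(f"front loss: {front // MiB} MiB")
    print(f"back loss:  {back // MiB} MiB")
    print(f"usable:     {usable / GiB} GiB")
```

Passing a smaller block_size shows how much of the loss comes purely
from the choice of block size rather than from the window itself.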

With 2GB memory blocks, trimming this window to block boundaries loses
more than a full memory block worth of memory (~768MB on the front, and
~1792MB on the back).

This is a loss of ~1% of capacity (2.5GB) for that region (242.5GB).

[1] has been proposed to allow for drivers (specifically ACPI) to advise
the memory hotplug system on the suggested alignment, and for arch code
to choose how to utilize this advisement.

[1] https://lore.kernel.org/linux-mm/20250127153405.3379117-1-gourry@gourry.net/

--------------------------------------------------------------------
The Complexity story up til now (what's likely to show up in slides)
--------------------------------------------------------------------
Platform and BIOS:
    May configure all the devices prior to kernel hand-off.
    May or may not support reconfiguring / hotplug.

BIOS and EFI:
    EFI_MEMORY_SP - used to defer management to drivers

Kernel Build and Boot:
    CONFIG_EFI_SOFT_RESERVE=n - Will always result in CXL as SystemRAM
    efi=nosoftreserve         - Will always result in CXL as SystemRAM
    kexec                     - SystemRAM configs carry over to target

Driver Build Options Required
    CONFIG_CXL_ACPI
    CONFIG_CXL_BUS
    CONFIG_CXL_MEM
    CONFIG_CXL_PCI
    CONFIG_CXL_PORT
    CONFIG_CXL_REGION
    CONFIG_DEV_DAX
    CONFIG_DEV_DAX_CXL
    CONFIG_DEV_DAX_KMEM

User Policy
    CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE (<=v6.13)
    CONFIG_MHP_DEFAULT_ONLINE_TYPE       (>=v6.14)
    memhp_default_state                  (boot param)
    daxctl online-memory daxN.Y          (userland)

Nuances
    Early-boot resource re-use
    Memory Block Alignment

--------------------------------------------------------------------
Next Up:
    Memory (Block) Hotplug - Zones and Kernel Use of CXL
    RAS - Poison, MCE, and why you probably want CXL=ZONE_MOVABLE
    Interleave - RAS and Region Management (Hotplug-ability)

~Gregory