Date: Thu, 6 Mar 2025 18:56:13 -0500
From: Gregory Price
To: lsf-pc@lists.linux-foundation.org
Cc: linux-mm@kvack.org, linux-cxl@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: CXL Boot to Bash - Section 2a (Drivers): CXL Decoder Programming

I decided to dig into decoder programming as an addendum to the
Driver section - where I said I *wouldn't* do this.  It's important
though, when discussing interleave.

So alas, we should at least have some base understanding of what
the heck decoders are actually doing.

This is not a regurgitation of the spec, you can think of it closer
to a "Theory of Operation" or whatever.  I will show discrete
examples of how ACPI tables, system memory map, and decoders relate.

----------------------------------------
Definitions: Addresses and HDM Decoders.
----------------------------------------
An HDM Decoder can be thought of, in shorthand, as a "routing"
mechanism, where a Physical Address is used to determine one of:

1) Fabric routing (i.e. which pipe to send a request down)
2) Address translation (Host to Device Physical Address)

In section 2, I referenced a simple device-to-decoder mapping:

  root      --- decoder0.0 -- Root Port Decoder
   |                |
  port1     --- decoder1.0 -- Host Bridge Decoder
   |                |
  endpoint0 --- decoder2.0 -- Endpoint Decoder

Barring any special innovations (cough) - endpoint decoders should
be the only decoders that actually "translate" addresses - at least
for basic volatile memory devices.

All other decoders (Root, Host Bridge, Switch, etc) should simply
forward DMA requests with the original Physical Address intact to
the correct downstream device.

For extra confusion, there are now 3 "Physical Address" domains:

System Physical Address (SPA)
   The physical address of some location according to Linux.
   This is the address you see in the system memory map.

Host Physical Address (HPA)
   An abstract address used by decoders (I'll explain later)

Device Physical Address (DPA)
   A device-local physical address (e.g. if a device has 1TB of
   memory, its DPA range might be 0-0x10000000000)

----------------------------
DMA Routing (No Interleave).
----------------------------
Ok, we have some decoders and confusing physical address definitions,
how does a DMA actually go from processor to DRAM via these decoders?

Let's consider our simple fabric with 256MB of memory at SPA base 4GB.

Let's assume this was all set up statically by BIOS.  We'd have the
following CEDT CFMWS (See Section 0 - ACPI) and decoder programming.

```
CEDT
           Subtable Type : 01 [CXL Fixed Memory Window Structure]
                Reserved : 00
                  Length : 002C
                Reserved : 00000000
     Window base address : 0000000100000000   <- Memory Region
             Window size : 0000000010000000   <- 256MB
Interleave Members (2^n) : 00                 <- Not interleaved

Memory Map:
  [mem 0x0000000100000000-0x0000000110000000] usable  <- SPA

Decoders
  root      --- decoder0.0 -- range=[0x100000000, 0x110000000]
   |                |
  port1     --- decoder1.0 -- range=[0x100000000, 0x110000000]
   |                |
  endpoint0 --- decoder2.0 -- range=[0x100000000, 0x110000000]
```

When the CPU accesses an address in this range, the memory controller
sends the request down the CXL fabric.  The following steps occur:

0) CPU accesses SPA(0x101234567)

1) root decoder identifies HPA(0x101234567) is valid and forwards to
   host bridge associated with that address (port 1)

2) host bridge decoder identifies HPA(0x101234567) is valid and
   forwards to endpoint associated with that address (endpoint0)

3) endpoint decoder identifies HPA(0x101234567) is valid and
   translates that address to DPA(0x01234567).

4) The endpoint device uses DPA(0x01234567) to fulfill the request.
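
If it helps to see steps 0-4 as code, here is a rough sketch in Python.
This is a toy model, not how the hardware or the Linux driver is
implemented: the `Decoder` class, its fields, and `route()` are names I
made up for illustration, it assumes SPA=HPA, and it assumes each decoder
has exactly one downstream target (no interleave).

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Decoder:
    """Toy model of one HDM decoder (hypothetical, for illustration only)."""
    name: str
    base: int                            # first HPA this decoder claims
    size: int                            # number of bytes claimed
    target: Optional["Decoder"] = None   # next hop; None marks an endpoint decoder
    dpa_base: int = 0                    # DPA that `base` maps to (endpoint only)

    def claims(self, hpa: int) -> bool:
        return self.base <= hpa < self.base + self.size


def route(root: Decoder, spa: int) -> Tuple[str, int]:
    """Walk the decoder chain and return (endpoint decoder name, DPA)."""
    hpa = spa  # assumption: the platform does no SPA->HPA translation
    dec = root
    while True:
        if not dec.claims(hpa):
            raise ValueError(f"{dec.name} does not claim HPA {hpa:#x}")
        if dec.target is None:
            # Endpoint decoder: the only hop that rewrites the address.
            return dec.name, dec.dpa_base + (hpa - dec.base)
        # Root / host bridge decoder: forward with the address untouched.
        dec = dec.target


# The 256MB-at-4GB example from the tables above.
BASE, SIZE = 0x100000000, 0x10000000
endpoint = Decoder("decoder2.0", BASE, SIZE)
bridge = Decoder("decoder1.0", BASE, SIZE, target=endpoint)
root = Decoder("decoder0.0", BASE, SIZE, target=bridge)

name, dpa = route(root, 0x101234567)
print(f"{name} -> DPA({dpa:#010x})")
```

The root and host bridge hops only check the range and pass the address
along; the subtraction of the decoder base happens once, at the endpoint.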

In this scenario, our endpoint has a DPA range of (0, 0x10000000),
but technically DPA address space is device-defined and may be sparse.

As you can see, the root and host bridge decoders simply "route" the
access to the next appropriate hop, while the endpoint decoder actually
does the translation work.

What if instead, we had two 256MB endpoints on the same host bridge?

```
CEDT
           Subtable Type : 01 [CXL Fixed Memory Window Structure]
                Reserved : 00
                  Length : 002C
                Reserved : 00000000
     Window base address : 0000000100000000   <- Memory Region
             Window size : 0000000020000000   <- 512MB
Interleave Members (2^n) : 00                 <- Not interleaved

Memory Map:
  [mem 0x0000000100000000-0x0000000120000000] usable  <- SPA

Decoders
                         decoder0.0
              range=[0x100000000, 0x120000000]
                             |
                         decoder1.0
              range=[0x100000000, 0x120000000]
                    /                 \
           decoder2.0                 decoder3.0
range=[0x100000000, 0x110000000] range=[0x110000000, 0x120000000]
```

We still only have a single root port decoder and a single host bridge
decoder, each covering the entire 512MB range, but there are now 2
differently programmed endpoint decoders.

This makes the routing a little more obvious.  The root and host bridge
decoders cover the entire SPA space (512MB), while the endpoint decoders
only cover their own address space (256MB).  The host bridge in this case
is responsible for routing the request to the correct endpoint.

What if we had 2 endpoints, each attached to their own host bridges?
In this case we'd have 2 root port decoders and 2 host bridge decoders.

```
CEDT
           Subtable Type : 01 [CXL Fixed Memory Window Structure]
                Reserved : 00
                  Length : 002C
                Reserved : 00000000
     Window base address : 0000000100000000   <- Memory Region 1
             Window size : 0000000010000000   <- 256MB
Interleave Members (2^n) : 00                 <- Not interleaved
            First Target : 00000007           <- Host Bridge _UID

           Subtable Type : 01 [CXL Fixed Memory Window Structure]
                Reserved : 00
                  Length : 002C
                Reserved : 00000000
     Window base address : 0000000110000000   <- Memory Region 2
             Window size : 0000000010000000   <- 256MB
Interleave Members (2^n) : 00                 <- Not interleaved
            First Target : 00000006           <- Host Bridge _UID

Memory Map - this may or may not be collapsed depending on Linux arch
  [mem 0x0000000100000000-0x0000000110000000] usable  <- System Phys Address
  [mem 0x0000000110000000-0x0000000120000000] usable  <- System Phys Address

Decoders
         decoder0.0                   decoder1.0          - roots
[0x100000000, 0x110000000]   [0x110000000, 0x120000000]
             |                            |
         decoder2.0                   decoder3.0          - host bridges
[0x100000000, 0x110000000]   [0x110000000, 0x120000000]
             |                            |
         decoder4.0                   decoder5.0          - endpoints
[0x100000000, 0x110000000]   [0x110000000, 0x120000000]
```

This scenario looks functionally the same as the first - with two
distinct, non-overlapping sets of decoders (any given SPA may only be
serviced by one device).  The platform memory controller is responsible
for routing the address to the correct root decoder.

In Section 4 (Interleave) we'll discuss a bit about how interleave is
accomplished - as this depends on whether you are interleaving across
host bridges (aggregation) or within a host bridge (bifurcation).

---------------------------------------------
Nuance: Host Physical Address... translation?
---------------------------------------------
You might have noticed that all the addresses in the examples I showed
are direct subsets of their parent decoder address ranges.  The root is
assigned a System Physical Address according to the system memory map,
and all decoders under it are a subset of that range.
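
That subset property is easy to check mechanically.  A minimal sketch,
using the ranges from the two-endpoints-on-one-host-bridge example and
treating them as half-open [start, end).  The dictionaries and helper
names here are hypothetical, just a convenient way to write the diagram
down; nothing in the driver looks like this.

```python
# Ranges from the "two 256MB endpoints on the same host bridge" example,
# treated as half-open [start, end).
RANGES = {
    "decoder0.0": (0x100000000, 0x120000000),  # root
    "decoder1.0": (0x100000000, 0x120000000),  # host bridge
    "decoder2.0": (0x100000000, 0x110000000),  # endpoint
    "decoder3.0": (0x110000000, 0x120000000),  # endpoint
}

# child -> parent links (hypothetical encoding of the diagram).
PARENT = {
    "decoder1.0": "decoder0.0",
    "decoder2.0": "decoder1.0",
    "decoder3.0": "decoder1.0",
}


def is_subset(child, parent):
    """True if the child range lies entirely inside the parent range."""
    return parent[0] <= child[0] and child[1] <= parent[1]


def overlaps(a, b):
    """True if two half-open ranges share at least one address."""
    return a[0] < b[1] and b[0] < a[1]


def check_tree(ranges, parent):
    """Return the names of decoders whose range escapes their parent's."""
    return [child for child, par in parent.items()
            if not is_subset(ranges[child], ranges[par])]


print(check_tree(RANGES, PARENT))                             # decoders violating the subset rule
print(overlaps(RANGES["decoder2.0"], RANGES["decoder3.0"]))   # do the sibling endpoints collide?
```

If a child decoder ever claims addresses its parent does not, nothing
upstream will route those addresses to it, so they are unreachable.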

You may have even noticed routing steps suddenly change from SPA to HPA:

0) CPU accesses SPA(0x101234567)

1) root decoder identifies HPA(0x101234567) is valid and forwards to
   host bridge associated with that address (port 1)

So what the heck is a "Host Physical Address"?
Why isn't everything just described as a "System Physical Address"?

CXL HDM decoders *definitionally* handle HPA to DPA translations.

That's it, that's the definition of an HPA.

On MOST systems, what you see in the memory map is an SPA, and SPA=HPA,
so all the decoders will appear to be programmed with SPA.  The platform
MAY perform translation before a request is routed to the decoder complex.

I will cover an example of this in-depth in an interleave addendum.

So the answer is that some ambiguity exists regarding whether platforms
can/should do translation prior to HDM decoders even being utilized.

So for the sake of making everything more complicated and confusing for
very little value:

1) decoders definitionally do "HPA to DPA" translation
2) most of the time "SPA=HPA"
3) so decoders mostly do "SPA to DPA" translation

If you're confused, that's ok, I was too - and still am.  But hopefully
between this section and Section 4 (Interleave) we can be marginally
less confused together.

-----------------------------------------------
Nuance: Memory Holes and Hotplug Memory Blocks!
-----------------------------------------------
Help, BIOS split my memory device across non-contiguous memory regions!

```
CEDT
           Subtable Type : 01 [CXL Fixed Memory Window Structure]
                Reserved : 00
                  Length : 002C
                Reserved : 00000000
     Window base address : 0000000100000000   <- Memory Region 1
             Window size : 0000000008000000   <- 128MB
Interleave Members (2^n) : 00                 <- Not interleaved
            First Target : 00000007           <- Host Bridge _UID

CEDT
           Subtable Type : 01 [CXL Fixed Memory Window Structure]
                Reserved : 00
                  Length : 002C
                Reserved : 00000000
     Window base address : 0000000110000000   <- Memory Region 2
             Window size : 0000000008000000   <- 128MB
Interleave Members (2^n) : 00                 <- Not interleaved
            First Target : 00000007           <- Host Bridge _UID

Memory Map
  [mem 0x0000000100000000-0x0000000107FFFFFF] usable    <- SPA
  [mem 0x0000000108000000-0x000000010FFFFFFF] reserved
  [mem 0x0000000110000000-0x0000000117FFFFFF] usable    <- SPA
```

Take a breath.  Everything will be ok.

You can have multiple decoders at each point in the decoder complex!
(Most devices should implement multiple decoders.)

```
Decoders
                        Root Port 0
             /                             \
         decoder0.0                    decoder0.1
[0x100000000, 0x108000000]    [0x110000000, 0x118000000]
             \                             /
                       Host Bridge 7
             /                             \
         decoder1.0                    decoder1.1
[0x100000000, 0x108000000]    [0x110000000, 0x118000000]
             \                             /
                        Endpoint 0
             /                             \
         decoder2.0                    decoder2.1
[0x100000000, 0x108000000]    [0x110000000, 0x118000000]
```

If your BIOS adds a memory hole, it better also use multiple decoders.

Oh, wait, Section 2 and Section 3 allude to hotplug memory blocks
having size and alignment issues!

If your BIOS adds a memory hole, it better also do it on Linux hotplug
memory block alignment (2GB on x86) or you'll lose 1 hotplug memory
block of capacity per CFMWS.

Oi, talk about some rough edges, right? :[

---------------------------------------
Nuance: BIOS vs OS Programmed Decoders.
---------------------------------------
The driver can (and does) program these decoders.  However, it's
entirely normal for BIOS/EFI to program decoders prior to OS init.

Earlier in section 2 I said:

   Most associations built by the driver are done by validating decoders

What I meant by this is the driver does one of two things with decoders:

1) Detects BIOS programmed decoders and sanity checks them.
   If an unexpected configuration is found, it bails out.
   This memory is not accessible if EFI_MEMORY_SP is set.

2) Provides an interface for user policy configuration of the decoders.

For the most part, the mechanism is the same.  This carve-out is to
tell you that if something isn't working, you should check whether the
BIOS/EFI or the driver programmed the decoders.  It will help you debug
the issue quicker.

In my experience, it's USUALLY a bad ACPI table.

This distinction will be more important in Section 4 (Interleave) when
we discuss Inter-Host-Bridge and Intra-Host-Bridge interleave.

~Gregory