From: "Zhijian Li (Fujitsu)" <lizhijian@fujitsu.com>
To: Gregory Price <gourry@gourry.net>,
"lsf-pc@lists.linux-foundation.org"
<lsf-pc@lists.linux-foundation.org>
Cc: "linux-mm@kvack.org" <linux-mm@kvack.org>,
"linux-cxl@vger.kernel.org" <linux-cxl@vger.kernel.org>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: Re: CXL Boot to Bash - Section 2a (Drivers): CXL Decoder Programming
Date: Fri, 7 Mar 2025 00:57:18 +0000
Message-ID: <570b18f4-3790-4e57-8d80-a5301e5d8af2@fujitsu.com>
In-Reply-To: <Z8o2HfVd0P_tMhV2@gourry-fedora-PF4VCD3F>
Hey Gregory,
Thank you so much for your detailed introduction to the entire CXL
software ecosystem, which I have thoroughly read. You are truly excellent.
On 07/03/2025 07:56, Gregory Price wrote:
> I decided to dig into decoder programming as an addendum to the
> Driver section - where I said I *wouldn't* do this. It's important
> though, when discussing interleave. So alas, we should at least have
> some base understanding of what the heck decoders are actually doing.
>
> This is not a regurgitation of the spec, you can think of it closer to
> a "Theory of Operation" or whatever. I will show discrete examples of
> how ACPI tables, system memory map, and decoders relate.
>
> ----------------------------------------
> Definitions: Addresses and HDM Decoders.
> ----------------------------------------
>
> An HDM Decoder can be thought of, in shorthand, as a "routing" mechanism,
> where a Physical Address is used to determine one of:
>
> 1) Fabric routing (i.e. which pipe to send a request down)
> 2) Address translation (Host to Device Physical Address)
>
> In section 2, I referenced a simple device-to-decoder mapping:
>
> root --- decoder0.0 -- Root Port Decoder
> | |
> port1 --- decoder1.0 -- Host Bridge Decoder
> | |
> endpoint0 --- decoder2.0 -- Endpoint Decoder
Here, I noticed something that differs slightly from my understanding:
"root --- decoder0.0 -- Root Port Decoder."
From the perspective of the Linux driver, decoder0.0 usually corresponds to
a CFMWS. Moreover, according to Spec r3.1 Table 8-22 (CXL HDM Decoder
Capability), the CXL Root Port (denoted R in the table) is not permitted to
implement an HDM decoder.
If I have misunderstood something, please let me know.
Thanks
Zhijian
>
> Barring any special innovations (cough) - endpoint decoders should
> be the only decoders that actually "translate" addresses - at least
> for basic volatile memory devices.
>
> All other decoders (Root, Host Bridge, Switch, etc) should simply
> forward DMA requests with the original Physical Address intact to
> the correct downstream device.
>
> For extra confusion, there are now 3 "Physical Address" domains
>
> System Physical Address (SPA)
> The physical address of some location according to Linux.
> This is the address you see in the system memory map.
>
> Host Physical Address (HPA)
> An abstract address used by decoders (I'll explain later)
>
> Device Physical Address (DPA)
> A device-local physical address (e.g. if a device has 1TB of
> memory, its DPA range might be 0-0x10000000000)
>
>
> ----------------------------
> DMA Routing (No Interleave).
> ----------------------------
> Ok, we have some decoders and confusing physical address definitions,
> how does a DMA actually go from processor to DRAM via these decoders?
>
> Let's consider our simple fabric with 256MB of memory at SPA base 4GB.
>
> Let's assume this was all set up statically by BIOS. We'd have the
> following CEDT CFMWS (See Section 0 - ACPI) and decoder programming.
>
> ```
> CEDT
> Subtable Type : 01 [CXL Fixed Memory Window Structure]
> Reserved : 00
> Length : 002C
> Reserved : 00000000
> Window base address : 0000000100000000 <- Memory Region
> Window size : 0000000010000000 <- 256MB
> Interleave Members (2^n) : 00 <- Not interleaved
>
> Memory Map:
> [mem 0x0000000100000000-0x0000000110000000] usable <- SPA
>
> Decoders
> root --- decoder0.0 -- range=[0x100000000, 0x110000000]
> | |
> port1 --- decoder1.0 -- range=[0x100000000, 0x110000000]
> | |
> endpoint0 --- decoder2.0 -- range=[0x100000000, 0x110000000]
> ```
>
> When the CPU accesses an address in this range, the memory controller
> will send the request down the CXL fabric. The following steps occur:
>
> 0) CPU accesses SPA(0x101234567)
>
> 1) root decoder identifies HPA(0x101234567) is valid and forwards
> to host bridge associated with that address (port 1)
>
> 2) host bridge decoder identifies HPA(0x101234567) is valid and
> forwards to endpoint associated with that address (endpoint0)
>
> 3) endpoint decoder identifies HPA(0x101234567) is valid and
> translates that address to DPA(0x01234567).
>
> 4) The endpoint device uses DPA(0x01234567) to fulfill the request.
>
> In this scenario, our endpoint has a DPA range of (0, 0x10000000),
> but technically DPA address space is device-defined and may be sparse.
>
> As you can see, the root and host bridge decoders simply "route" the
> access to the next appropriate hop, while the endpoint decoder actually
> does the translation work.
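The walk above can be sketched in a few lines of Python. This is an
illustrative model only, not driver code: the `Decoder` class and `route`
function are hypothetical names, ranges are treated as half-open
[base, base + size), and the endpoint is assumed to map its window onto
DPA 0 with no interleave.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Decoder:
    """Hypothetical model of one decoder in the chain (not a kernel struct)."""
    name: str
    base: int                            # first HPA claimed by this decoder
    size: int                            # number of bytes claimed
    target: Optional["Decoder"] = None   # next hop; None for an endpoint
    dpa_base: int = 0                    # assumed DPA offset (endpoint only)

    def claims(self, hpa: int) -> bool:
        return self.base <= hpa < self.base + self.size


def route(root: Decoder, hpa: int) -> tuple[str, int]:
    """Follow the decoder chain and return (endpoint decoder name, DPA)."""
    dec = root
    while True:
        if not dec.claims(hpa):
            raise ValueError(f"{dec.name} does not claim {hpa:#x}")
        if dec.target is None:
            # Only the endpoint decoder translates: strip the HPA base.
            return dec.name, hpa - dec.base + dec.dpa_base
        dec = dec.target  # root and host bridge forward the HPA unchanged


endpoint = Decoder("decoder2.0", 0x100000000, 0x10000000)
bridge = Decoder("decoder1.0", 0x100000000, 0x10000000, target=endpoint)
root = Decoder("decoder0.0", 0x100000000, 0x10000000, target=bridge)

name, dpa = route(root, 0x101234567)
print(name, hex(dpa))  # decoder2.0 0x1234567
```

An address outside the window (for example 0x110000000) is rejected at the
first hop in this model, which mirrors step 1 failing the "is valid" check.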
>
>
> What if instead, we had two 256MB endpoints on the same host bridge?
>
> ```
> CEDT
> Subtable Type : 01 [CXL Fixed Memory Window Structure]
> Reserved : 00
> Length : 002C
> Reserved : 00000000
> Window base address : 0000000100000000 <- Memory Region
> Window size : 0000000020000000 <- 512MB
> Interleave Members (2^n) : 00 <- Not interleaved
>
> Memory Map:
> [mem 0x0000000100000000-0x0000000120000000] usable <- SPA
>
> Decoders
> decoder0.0
> range=[0x100000000, 0x120000000]
> |
> decoder1.0
> range=[0x100000000, 0x120000000]
> / \
> decoder2.0 decoder3.0
> range=[0x100000000, 0x110000000] range=[0x110000000, 0x120000000]
> ```
>
> We still only have a single root port and host bridge decoder that
> covers the entire 512MB range, but there are now 2 differently
> programmed endpoint decoders.
>
> This makes the routing a little more obvious. The root and host bridge
> decoders cover the entire SPA space (512MB), while the endpoint decoders
> only cover their own address space (256MB).
>
> The host bridge in this case is responsible for routing the request to
> the correct endpoint.
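A minimal sketch of that selection, assuming the two endpoint windows from
the diagram above. It models only the outcome (which endpoint ends up
servicing an address), not how the host bridge hardware picks a downstream
port; `ENDPOINT_WINDOWS` and `pick_endpoint` are hypothetical names.

```python
# Half-open HPA windows of the two endpoint decoders in the example.
ENDPOINT_WINDOWS = {
    "decoder2.0": (0x100000000, 0x110000000),
    "decoder3.0": (0x110000000, 0x120000000),
}


def pick_endpoint(hpa: int) -> tuple[str, int]:
    """Return (endpoint decoder, DPA) for an HPA inside the 512MB window."""
    for name, (base, end) in ENDPOINT_WINDOWS.items():
        if base <= hpa < end:
            # Assumption: each device's DPA space starts at 0.
            return name, hpa - base
    raise ValueError(f"no endpoint decoder claims {hpa:#x}")


name, dpa = pick_endpoint(0x111234567)
print(name, hex(dpa))  # decoder3.0 0x1234567
```

Note that the same DPA (0x1234567) exists on both devices; only the HPA
window decides which one is addressed.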
>
>
> What if we had 2 endpoints, each attached to their own host bridges?
> In this case we'd have 2 root ports and host bridge decoders.
>
> ```
> CEDT
> Subtable Type : 01 [CXL Fixed Memory Window Structure]
> Reserved : 00
> Length : 002C
> Reserved : 00000000
> Window base address : 0000000100000000 <- Memory Region 1
> Window size : 0000000010000000 <- 256MB
> Interleave Members (2^n) : 00 <- Not interleaved
> First Target : 00000007 <- Host Bridge _UID
>
> Subtable Type : 01 [CXL Fixed Memory Window Structure]
> Reserved : 00
> Length : 002C
> Reserved : 00000000
> Window base address : 0000000110000000 <- Memory Region 2
> Window size : 0000000010000000 <- 256MB
> Interleave Members (2^n) : 00 <- Not interleaved
> First Target : 00000006 <- Host Bridge _UID
>
> Memory Map - this may or may not be collapsed depending on Linux arch
> [mem 0x0000000100000000-0x0000000110000000] usable <- System Phys Address
> [mem 0x0000000110000000-0x0000000120000000] usable <- System Phys Address
>
> Decoders
> decoder0.0 decoder1.0 - roots
> [0x100000000, 0x110000000] [0x110000000, 0x120000000]
> | |
> decoder2.0 decoder3.0 - host bridges
> [0x100000000, 0x110000000] [0x110000000, 0x120000000]
> | |
> decoder4.0 decoder5.0 - endpoints
> [0x100000000, 0x110000000] [0x110000000, 0x120000000]
> ```
>
> This scenario looks functionally the same as the first - with two distinct,
> non-overlapping sets of decoders (any given SPA may only be serviced by
> one device). The platform memory controller is responsible for routing
> the address to the correct root decoder.
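As a rough illustration of the two properties stated above (windows do not
overlap, and every address maps to exactly one host bridge), here is a small
sketch using the base, size, and First Target values from the CFMWS dump.
The `CFMWS` list layout and both function names are made up for this
example.

```python
# (window base, window size, first target _UID), copied from the dump above.
CFMWS = [
    (0x100000000, 0x10000000, 7),
    (0x110000000, 0x10000000, 6),
]


def windows_overlap(windows) -> bool:
    """True if any two half-open windows share an address."""
    spans = sorted((base, base + size) for base, size, _ in windows)
    return any(prev_end > next_base
               for (_, prev_end), (next_base, _) in zip(spans, spans[1:]))


def host_bridge_for(spa: int) -> int:
    """Return the _UID of the host bridge whose window contains spa."""
    for base, size, uid in CFMWS:
        if base <= spa < base + size:
            return uid
    raise ValueError(f"{spa:#x} is not in any CXL window")


print(windows_overlap(CFMWS))                                      # False
print(host_bridge_for(0x101234567), host_bridge_for(0x111234567))  # 7 6
```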
>
> In Section 4 (Interleave) we'll discuss a bit about how the interleave is
> accomplished - as this depends on whether you are interleaving across
> host bridges (aggregation) or within a host bridge (bifurcation).
>
>
>
> ---------------------------------------------
> Nuance: Host Physical Address... translation?
> ---------------------------------------------
>
> You might have noticed that all the addresses in the examples I showed
> are direct subsets of their parent decoder address ranges. The root is
> assigned a System Physical Address according to the system memory map,
> and all decoders under it are a subset of that range.
>
> You may have even noticed routing steps suddenly change from SPA to HPA
>
> 0) CPU accesses SPA(0x101234567)
>
> 1) root decoder identifies HPA(0x101234567) is valid and forwards
> to host bridge associated with that address (port 1)
>
> So what the heck is a "Host Physical Address"?
> Why isn't everything just described as a "System Physical Address"?
>
> CXL HDM decoders *definitionally* handle HPA to DPA translations.
>
> That's it, that's the definition of an HPA.
>
> On MOST systems, what you see in the memory map is an SPA, and SPA=HPA,
> so all the decoders will appear to be programmed with SPA. The platform
> MAY perform translation before a request is routed to the decoder complex.
>
> I will cover an example of this in-depth in an interleave addendum.
>
> So the answer is that some ambiguity exists regarding whether platforms
> can/should do translation prior to HDM decoders even being utilized. So
> for the sake of making everything more complicated and confusing for very
> little value:
>
> 1) decoders definitionally do "HPA to DPA" translation
> 2) most of the time "SPA=HPA"
> 3) so decoders mostly do "SPA to DPA" translation
>
> If you're confused, that's ok, I was too - and still am. But hopefully
> between this section and Section 4 (Interleave) we can be marginally
> less confused together.
>
>
> -----------------------------------------------
> Nuance: Memory Holes and Hotplug Memory Blocks!
> -----------------------------------------------
> Help, BIOS split my memory device across non-contiguous memory regions!
>
> ```
> CEDT
> Subtable Type : 01 [CXL Fixed Memory Window Structure]
> Reserved : 00
> Length : 002C
> Reserved : 00000000
> Window base address : 0000000100000000 <- Memory Region 1
> Window size : 0000000008000000 <- 128MB
> Interleave Members (2^n) : 00 <- Not interleaved
> First Target : 00000007 <- Host Bridge _UID
>
> CEDT
> Subtable Type : 01 [CXL Fixed Memory Window Structure]
> Reserved : 00
> Length : 002C
> Reserved : 00000000
> Window base address : 0000000110000000 <- Memory Region 2
> Window size : 0000000008000000 <- 128MB
> Interleave Members (2^n) : 00 <- Not interleaved
> First Target : 00000007 <- Host Bridge _UID
>
> Memory Map
> [mem 0x0000000100000000-0x0000000107FFFFFF] usable <- SPA
> [mem 0x0000000108000000-0x000000010FFFFFFF] reserved
> [mem 0x0000000110000000-0x0000000117FFFFFF] usable <- SPA
> ```
>
> Take a breath. Everything will be ok.
>
> You can have multiple decoders at each point in the decoder complex!
> (Most devices should implement multiple decoders).
>
> ```
> Decoders
> Root Port 0
> / \
> decoder0.0 decoder0.1
> [0x100000000, 0x108000000] [0x110000000, 0x118000000]
> \ /
> Host Bridge 7
> / \
> decoder1.0 decoder1.1
> [0x100000000, 0x108000000] [0x110000000, 0x118000000]
> \ /
> Endpoint 0
> / \
> decoder2.0 decoder2.1
> [0x100000000, 0x108000000] [0x110000000, 0x118000000]
> ```
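To make the two-decoder endpoint concrete, here is a hedged sketch of the
lookup. The DPA bases are an assumption on my part (the two 128MB windows
packed back to back in device address space); the quoted text does not say
how the device lays them out. `ENDPOINT_DECODERS` and `hpa_to_dpa` are
hypothetical names.

```python
MB = 1 << 20

# (name, HPA base, size, assumed DPA base) for each endpoint decoder.
# Assumption: decoder2.1 continues in DPA space where decoder2.0 ends.
ENDPOINT_DECODERS = [
    ("decoder2.0", 0x100000000, 128 * MB, 0),
    ("decoder2.1", 0x110000000, 128 * MB, 128 * MB),
]


def hpa_to_dpa(hpa: int):
    """Return (decoder, DPA), or None when the HPA falls in the hole."""
    for name, base, size, dpa_base in ENDPOINT_DECODERS:
        if base <= hpa < base + size:
            return name, dpa_base + (hpa - base)
    return None


for hpa in (0x100001000, 0x108000000, 0x110001000):
    print(hex(hpa), hpa_to_dpa(hpa))
```

Under that assumption, an address in the reserved hole (0x108000000) matches
no decoder, while addresses on either side land in one contiguous 256MB DPA
range.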
>
> If your BIOS adds a memory hole, it better also use multiple decoders.
>
> Oh, wait, Section 2 and Section 3 allude to hotplug memory blocks
> having size and alignment issues!
>
> If your BIOS adds a memory hole, it better also do it on Linux hotplug
> memory block alignment (2GB on x86) or you'll lose 1 hotplug memory
> block of capacity per CFMWS.
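The alignment cost can be estimated with simple arithmetic. The sketch below
is a simplified model, assuming capacity is only usable in whole memory
blocks that are aligned to the block size (2GB, as quoted above for x86);
`usable_blocks` is a hypothetical helper, not a kernel function.

```python
GB = 1 << 30
MB = 1 << 20


def usable_blocks(base: int, size: int, block: int = 2 * GB) -> int:
    """Count whole, block-aligned memory blocks inside [base, base + size)."""
    first = -(-base // block) * block       # round the base up
    last = (base + size) // block * block   # round the end down
    return max(0, (last - first) // block)


# A 4GB window that starts on a 2GB boundary holds two whole blocks.
print(usable_blocks(0x100000000, 4 * GB))             # 2
# Shift the same window by 128MB and only one whole block still fits.
print(usable_blocks(0x100000000 + 128 * MB, 4 * GB))  # 1
```

In this model a misaligned window of the same size loses one block, which
matches the "1 hotplug memory block of capacity per CFMWS" figure above.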
>
> Oi, talk about some rough edges, right? :[
>
> ---------------------------------------
> Nuance: BIOS vs OS Programmed Decoders.
> ---------------------------------------
> The driver can (and does) program these decoders. However, it's
> entirely normal for BIOS/EFI to program decoders prior to OS init.
>
> Earlier in section 2 I said:
> Most associations built by the driver are done by validating decoders
>
> What I meant by this is the driver does one of two things with decoders:
>
> 1) Detects BIOS programmed decoders and sanity checks them.
> If an unexpected configuration is found, it bails out.
> This memory is not accessible if EFI_MEMORY_SP is set.
>
> 2) Provides an interface for user policy configuration of the decoders
>
> For the most part, the mechanism is the same. The point of this carve-out:
> if something isn't working, check whether the BIOS/EFI or the driver
> programmed the decoders. It will help you debug the issue more quickly.
>
> In my experience, it's USUALLY a bad ACPI table.
>
> This distinction will be more important in Section 4 (Interleave) when
> we discuss Inter-Host-Bridge and Intra-Host-Bridge interleave.
>
> ~Gregory
>