linux-mm.kvack.org archive mirror
From: "Zhijian Li (Fujitsu)" <lizhijian@fujitsu.com>
To: Gregory Price <gourry@gourry.net>,
	"lsf-pc@lists.linux-foundation.org"
	<lsf-pc@lists.linux-foundation.org>
Cc: "linux-mm@kvack.org" <linux-mm@kvack.org>,
	"linux-cxl@vger.kernel.org" <linux-cxl@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: Re: CXL Boot to Bash - Section 2a (Drivers): CXL Decoder Programming
Date: Fri, 7 Mar 2025 00:57:18 +0000
Message-ID: <570b18f4-3790-4e57-8d80-a5301e5d8af2@fujitsu.com>
In-Reply-To: <Z8o2HfVd0P_tMhV2@gourry-fedora-PF4VCD3F>

Hey Gregory,

Thank you so much for your detailed introduction to the entire CXL
software ecosystem, which I have thoroughly read. You are truly excellent.


On 07/03/2025 07:56, Gregory Price wrote:
> I decided to dig into decoder programming as an addendum to the
> Driver section - where I said I *wouldn't* do this. It's important
> though, when discussing interleave. So alas, we should at least have
> some base understanding of what the heck decoders are actually doing.
> 
> This is not a regurgitation of the spec; you can think of it as closer to
> a "Theory of Operation" or whatever.  I will show discrete examples of
> how ACPI tables, system memory map, and decoders relate.
> 
> ----------------------------------------
> Definitions: Addresses and HDM Decoders.
> ----------------------------------------
> 
> An HDM Decoder can be thought of, in shorthand, as a "routing" mechanism,
> where a Physical Address is used to determine one of:
> 
>    1) Fabric routing (i.e. which pipe to send a request down)
>    2) Address translation (Host to Device Physical Address)
> 
> In section 2, I referenced a simple device-to-decoder mapping:
> 
>      root    ---  decoder0.0   -- Root Port Decoder
>       |               |
>     port1    ---  decoder1.0   -- Host Bridge Decoder
>       |               |
>    endpoint0 ---  decoder2.0   -- Endpoint Decoder



Here, I noticed something that differs slightly from my understanding:
"root --- decoder0.0 -- Root Port Decoder."

From the perspective of the Linux driver, decoder0.0 usually refers to
a decoder associated with a CFMWS. Moreover, according to the CXL spec
r3.1, Table 8-22 (CXL HDM Decoder Capability), the CXL Root Port (denoted
R in the table) is not permitted to implement an HDM decoder.

If I have misunderstood something, please let me know.


Thanks
Zhijian

> 
> Barring any special innovations (cough) - endpoint decoders should
> be the only decoders that actually "Translate" addresses - at least
> for basic volatile memory devices.
> 
> All other decoders (Root, Host Bridge, Switch, etc) should simply
> forward DMA requests with the original Physical Address intact to
> the correct downstream device.
> 
> For extra confusion, there are now 3 "Physical Address" domains
> 
> System Physical Address (SPA)
>    The physical address of some location according to Linux.
>    This is the address you see in the system memory map.
> 
> Host Physical Address   (HPA)
>    An abstract address used by decoders (I'll explain later)
> 
> Device Physical Address (DPA)
>    A device-local physical address (e.g. if a device has 1TB of
>    memory, its DPA range might be 0-0x10000000000)
> 
> 
> ----------------------------
> DMA Routing (No Interleave).
> ----------------------------
> Ok, we have some decoders and confusing physical address definitions,
> how does a DMA actually go from processor to DRAM via these decoders?
> 
> Let's consider our simple fabric with 256MB of memory at SPA base 4GB.
> 
> Let's assume this was all set up statically by BIOS.  We'd have the
> following CEDT CFMWS (See Section 0 - ACPI) and decoder programming.
> 
> ```
> CEDT
>             Subtable Type : 01 [CXL Fixed Memory Window Structure]
>                  Reserved : 00
>                    Length : 002C
>                  Reserved : 00000000
>       Window base address : 0000000100000000   <- Memory Region
>               Window size : 0000000010000000   <- 256MB
> Interleave Members (2^n) : 00                 <- Not interleaved
> 
> Memory Map:
>    [mem 0x0000000100000000-0x0000000110000000] usable    <- SPA
> 
> Decoders
>    root    ---  decoder0.0   -- range=[0x100000000, 0x110000000]
>     |               |
>   port1    ---  decoder1.0   -- range=[0x100000000, 0x110000000]
>     |               |
> endpoint0 ---  decoder2.0   -- range=[0x100000000, 0x110000000]
> ```
> 
> When the CPU accesses an address in this range, the memory controller
> will send the request down the CXL fabric. The following steps occur:
> 
>    0) CPU accesses SPA(0x101234567)
> 
>    1) root decoder identifies HPA(0x101234567) is valid and forwards
>       to host bridge associated with that address (port 1)
> 
>    2) host bridge decoder identifies HPA(0x101234567) is valid and
>       forwards to endpoint associated with that address (endpoint0)
> 
>    3) endpoint decoder identifies HPA(0x101234567) is valid and
>       translates that address to DPA(0x01234567).
> 
>    4) The endpoint device uses DPA(0x01234567) to fulfill the request.
> 
> In this scenario, our endpoint has a DPA range of (0, 0x10000000),
> but technically DPA address space is device-defined and may be sparse.
> 
> As you can see, the root and host bridge decoders simply "route" the
> access to the next appropriate hop, while the endpoint decoder actually
> does the translation work.
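
The steps above can be sketched as a walk down the decoder chain. This is only an illustrative model, not driver code: the `Decoder` class, the `route` helper, and the assumption that the endpoint's DPA space starts at 0 are hypothetical, and range ends are treated as exclusive.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Decoder:
    name: str
    base: int                       # first HPA covered by this decoder
    size: int                       # number of bytes covered
    dpa_base: Optional[int] = None  # set only on the endpoint decoder (assumption)

    def contains(self, hpa: int) -> bool:
        return self.base <= hpa < self.base + self.size


def route(chain, hpa):
    """Walk root -> host bridge -> endpoint and return the resulting DPA."""
    for dec in chain:
        if not dec.contains(hpa):
            raise ValueError(f"{dec.name}: HPA {hpa:#x} is not decoded")
        if dec.dpa_base is not None:
            # Only the endpoint decoder translates; the others just forward.
            return dec.dpa_base + (hpa - dec.base)
    raise ValueError("no endpoint decoder in chain")


WINDOW_BASE, WINDOW_SIZE = 0x100000000, 0x10000000  # 256MB at 4GB
chain = [
    Decoder("decoder0.0", WINDOW_BASE, WINDOW_SIZE),
    Decoder("decoder1.0", WINDOW_BASE, WINDOW_SIZE),
    Decoder("decoder2.0", WINDOW_BASE, WINDOW_SIZE, dpa_base=0),
]

print(hex(route(chain, 0x101234567)))  # 0x1234567
```

An address outside the window, such as 0x110000000, makes `route` raise at the first decoder, which mirrors the "identifies HPA is valid" check in each step.
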
> 
> 
> What if instead, we had two 256MB endpoints on the same host bridge?
> 
> ```
> CEDT
>             Subtable Type : 01 [CXL Fixed Memory Window Structure]
>                  Reserved : 00
>                    Length : 002C
>                  Reserved : 00000000
>       Window base address : 0000000100000000   <- Memory Region
>               Window size : 0000000020000000   <- 512MB
> Interleave Members (2^n) : 00                 <- Not interleaved
> 
> Memory Map:
>    [mem 0x0000000100000000-0x0000000120000000] usable  <- SPA
> 
> Decoders
>                              decoder0.0
>                    range=[0x100000000, 0x120000000]
>                                  |
>                              decoder1.0
>                    range=[0x100000000, 0x120000000]
>                    /                              \
>              decoder2.0                        decoder3.0
>    range=[0x100000000, 0x110000000]   range=[0x110000000, 0x120000000]
> ```
> 
> We still only have a single root port and host bridge decoder that
> covers the entire 512MB range, but there are now 2 differently
> programmed endpoint decoders.
> 
> This makes the routing a little more obvious.  The root and host bridge
> decoders cover the entire SPA space (512MB), while the endpoint decoders
> only cover their own address space (256MB).
> 
> The host bridge in this case is responsible for routing the request to
> the correct endpoint.
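
A minimal sketch of that host bridge routing decision, assuming each endpoint's DPA space starts at 0 and treating range ends as exclusive. The `pick_endpoint` helper and the dictionary layout are hypothetical names used for illustration only.

```python
# Endpoint decoder ranges from the two-endpoint example: name -> (base, end).
ENDPOINTS = {
    "decoder2.0": (0x100000000, 0x110000000),
    "decoder3.0": (0x110000000, 0x120000000),
}


def pick_endpoint(endpoints, hpa):
    """Return (decoder name, DPA) for the endpoint that claims this HPA."""
    for name, (base, end) in endpoints.items():
        if base <= hpa < end:
            # Assumption: the endpoint maps its range onto DPA 0 upward.
            return name, hpa - base
    return None  # no endpoint decodes this address


print(pick_endpoint(ENDPOINTS, 0x101234567))
print(pick_endpoint(ENDPOINTS, 0x111234567))
```

Both addresses land at the same DPA offset (0x1234567) but on different endpoints, since each endpoint decoder only covers its own 256MB.
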
> 
> 
> What if we had 2 endpoints, each attached to their own host bridges?
> In this case we'd have 2 root ports and host bridge decoders.
> 
> ```
> CEDT
>             Subtable Type : 01 [CXL Fixed Memory Window Structure]
>                  Reserved : 00
>                    Length : 002C
>                  Reserved : 00000000
>       Window base address : 0000000100000000   <- Memory Region 1
>               Window size : 0000000010000000   <- 256MB
> Interleave Members (2^n) : 00                 <- Not interleaved
>              First Target : 00000007           <- Host Bridge _UID
> 
>             Subtable Type : 01 [CXL Fixed Memory Window Structure]
>                  Reserved : 00
>                    Length : 002C
>                  Reserved : 00000000
>       Window base address : 0000000110000000   <- Memory Region 2
>               Window size : 0000000010000000   <- 256MB
> Interleave Members (2^n) : 00                 <- Not interleaved
>              First Target : 00000006           <- Host Bridge _UID
> 
> Memory Map - this may or may not be collapsed depending on Linux arch
>    [mem 0x0000000100000000-0x0000000110000000] usable  <- System Phys Address
>    [mem 0x0000000110000000-0x0000000120000000] usable  <- System Phys Address
> 
> Decoders
>              decoder0.0                     decoder1.0   - roots
>      [0x100000000, 0x110000000]     [0x110000000, 0x120000000]
>                  |                              |
>              decoder2.0                     decoder3.0   - host bridges
>      [0x100000000, 0x110000000]     [0x110000000, 0x120000000]
>                  |                              |
>              decoder4.0                     decoder5.0   - endpoints
>      [0x100000000, 0x110000000]     [0x110000000, 0x120000000]
> ```
> 
> This scenario looks functionally the same as the first - with two distinct,
> non-overlapping sets of decoders (any given SPA may only be serviced by
> one device).  The platform memory controller is responsible for routing
> the address to the correct root decoder.
> 
> In Section 4 (Interleave) we'll discuss how interleave is
> accomplished - as this depends on whether you are interleaving across
> host bridges (aggregation) or within a host bridge (bifurcation).
> 
> 
> 
> ---------------------------------------------
> Nuance: Host Physical Address... translation?
> ---------------------------------------------
> 
> You might have noticed that all the addresses in the examples I showed
> are direct subsets of their parent decoder address ranges.  The root is
> assigned a System Physical Address according to the system memory map,
> and all decoders under it are a subset of that range.
> 
> You may have even noticed routing steps suddenly change from SPA to HPA
> 
>    0) CPU accesses SPA(0x101234567)
> 
>    1) root decoder identifies HPA(0x101234567) is valid and forwards
>       to host bridge associated with that address (port 1)
> 
> So what the heck is a "Host Physical Address"?
> Why isn't everything just described as a "System Physical Address"?
> 
> CXL HDM decoders *definitionally* handle HPA to DPA translations.
> 
> That's it, that's the definition of an HPA.
> 
> On MOST systems, what you see in the memory map is an SPA, and SPA=HPA,
> so all the decoders will appear to be programmed with SPA.  The platform
> MAY perform translation before a request is routed to the decoder complex.
> 
> I will cover an example of this in-depth in an interleave addendum.
> 
> So the answer is that some ambiguity exists regarding whether platforms
> can/should do translation prior to HDM decoders even being utilized.  So
> for the sake of making everything more complicated and confusing for very
> little value:
> 
> 1) decoders definitionally do "HPA to DPA" translation
> 2) most of the time "SPA=HPA"
> 3) so decoders mostly do "SPA to DPA" translation
> 
> If you're confused, that's ok, I was too - and still am.  But hopefully
> between this section and Section 4 (Interleave) we can be marginally
> less confused together.
> 
> 
> -----------------------------------------------
> Nuance: Memory Holes and Hotplug Memory Blocks!
> -----------------------------------------------
> Help, BIOS split my memory device across non-contiguous memory regions!
> 
> ```
> CEDT
>             Subtable Type : 01 [CXL Fixed Memory Window Structure]
>                  Reserved : 00
>                    Length : 002C
>                  Reserved : 00000000
>       Window base address : 0000000100000000   <- Memory Region 1
>               Window size : 0000000008000000   <- 128MB
> Interleave Members (2^n) : 00                 <- Not interleaved
>              First Target : 00000007           <- Host Bridge _UID
> 
> CEDT
>             Subtable Type : 01 [CXL Fixed Memory Window Structure]
>                  Reserved : 00
>                    Length : 002C
>                  Reserved : 00000000
>       Window base address : 0000000110000000   <- Memory Region 2
>               Window size : 0000000008000000   <- 128MB
> Interleave Members (2^n) : 00                 <- Not interleaved
>              First Target : 00000007           <- Host Bridge _UID
> 
> Memory Map
>    [mem 0x0000000100000000-0x0000000107FFFFFF] usable  <- SPA
>    [mem 0x0000000108000000-0x000000010FFFFFFF] reserved
>    [mem 0x0000000110000000-0x0000000117FFFFFF] usable  <- SPA
> ```
> 
> Take a breath. Everything will be ok.
> 
> You can have multiple decoders at each point in the decoder complex!
> (Most devices should implement multiple decoders).
> 
> ```
> Decoders
>                                Root Port 0
>                               /          \
>                      decoder0.0          decoder0.1
>      [0x100000000, 0x108000000]          [0x110000000, 0x118000000]
>                               \          /
>                              Host Bridge 7
>                               /          \
>                      decoder1.0          decoder1.1
>      [0x100000000, 0x108000000]          [0x110000000, 0x118000000]
>                               \          /
>                                Endpoint 0
>                               /          \
>                      decoder2.0          decoder2.1
>      [0x100000000, 0x108000000]          [0x110000000, 0x118000000]
> ```
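
To make the split concrete, here is a rough sketch of how the two endpoint decoders might translate addresses on either side of the hole. The DPA bases are an assumption (the second decoder is placed directly after the first in device address space), the `hpa_to_dpa` helper is hypothetical, and range ends are treated as exclusive.

```python
MB = 1 << 20

# (name, hpa_base, size, dpa_base); the dpa_base values are assumed here.
ENDPOINT_DECODERS = [
    ("decoder2.0", 0x100000000, 128 * MB, 0),
    ("decoder2.1", 0x110000000, 128 * MB, 128 * MB),
]


def hpa_to_dpa(hpa):
    """Return (decoder name, DPA), or None if no decoder claims the HPA."""
    for name, base, size, dpa_base in ENDPOINT_DECODERS:
        if base <= hpa < base + size:
            return name, dpa_base + (hpa - base)
    return None  # address is in the hole or outside both windows


print(hpa_to_dpa(0x100000000))  # first byte of the first window
print(hpa_to_dpa(0x110000000))  # first byte of the second window
print(hpa_to_dpa(0x108000000))  # inside the reserved hole
```

Under these assumptions the device sees one contiguous 256MB DPA range even though the host addresses are split, and an access to the reserved hole matches no decoder.
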
> 
> If your BIOS adds a memory hole, it better also use multiple decoders.
> 
> Oh, wait, Section 2 and Section 3 allude to hotplug memory blocks
> having size and alignment issues!
> 
> If your BIOS adds a memory hole, it better also do it on Linux hotplug
> memory block alignment (2GB on x86) or you'll lose 1 hotplug memory
> block of capacity per CFMWS.
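
The arithmetic behind that capacity loss can be sketched as follows. This is a simplified model, not the kernel's actual logic: it assumes only whole, aligned memory blocks inside a window are usable, and the `whole_blocks` helper and the example windows are hypothetical.

```python
MB = 1 << 20
GB = 1 << 30


def whole_blocks(base, size, block_size=2 * GB):
    """Count memory blocks that fit entirely inside [base, base + size)."""
    first = -(-base // block_size) * block_size        # round base up
    last = ((base + size) // block_size) * block_size  # round end down
    return max(0, (last - first) // block_size)


# A 4GB window aligned to the 2GB block size keeps both blocks.
print(whole_blocks(4 * GB, 4 * GB))             # 2

# The same 4GB window shifted by 128MB straddles block boundaries
# and keeps only one whole block.
print(whole_blocks(4 * GB + 128 * MB, 4 * GB))  # 1
```

In this model a window smaller than one block, such as the 128MB windows in the example above, yields no whole block at all.
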
> 
> Oi, talk about some rough edges, right? :[
> 
> ---------------------------------------
> Nuance: BIOS vs OS Programmed Decoders.
> ---------------------------------------
> The driver can (and does) program these decoders.  However, it's
> entirely normal for BIOS/EFI to program decoders prior to OS init.
> 
> Earlier in section 2 I said:
>    Most associations built by the driver are done by validating decoders
> 
> What I meant by this is the driver does one of two things with decoders:
> 
>     1) Detects BIOS programmed decoders and sanity checks them.
>        If an unexpected configuration is found, it bails out.
>        This memory is not accessible if EFI_MEMORY_SP is set.
> 
>     2) Provide an interface for user policy configuration of the decoders
> 
> For the most part, the mechanism is the same.  This carve-out is to tell
> you that if something isn't working, you should check whether the BIOS/EFI
> or the driver programmed the decoders. It will help debug the issue quicker.
> 
>          In my experience, it's USUALLY a bad ACPI table.
> 
> This distinction will be more important in Section 4 (Interleave) when
> we discuss Inter-Host-Bridge and Intra-Host-Bridge interleave.
> 
> ~Gregory
> 
