linux-mm.kvack.org archive mirror
From: Jonathan Cameron <Jonathan.Cameron@huawei.com>
To: Gregory Price <gourry@gourry.net>
Cc: <lsf-pc@lists.linux-foundation.org>, <linux-mm@kvack.org>,
	<linux-cxl@vger.kernel.org>, <linux-kernel@vger.kernel.org>
Subject: Re: [LSF/MM] CXL Boot to Bash - Section 0a: CFMWS and NUMA Flexiblity
Date: Thu, 13 Mar 2025 17:20:04 +0000
Message-ID: <20250313172004.00002236@huawei.com>
In-Reply-To: <Z8u4GTrr-UytqXCB@gourry-fedora-PF4VCD3F>

On Fri, 7 Mar 2025 22:23:05 -0500
Gregory Price <gourry@gourry.net> wrote:

> In the last section we discussed how the CEDT CFMWS and SRAT Memory
> Affinity structures are used by Linux to "create" NUMA nodes (or at
> least mark them as possible). However, the examples I used suggested
> that there was a 1-to-1 relationship between CFMWS and devices or
> host bridges.
> 
> This is not true - in fact, CFMWS are simply a carve-out of System
> Physical Address space which may be used to map any number of endpoint
> devices behind the associated Host Bridge(s).
> 
> The limiting factor is what your platform vendor BIOS supports.
> 
> This section describes a handful of *possible* configurations, what NUMA
> structure they will create, and what flexibility this provides.
> 
> All of these CFMWS configurations are made up, and may or may not exist
> in real machines. They are a conceptual teaching tool, not a roadmap.
> 
> (When discussing interleave in this section, please note that I am
>  intentionally omitting details about decoder programming, as this
>  will be covered later.)
> 
> 
> -------------------------------
> One 2GB Device, Multiple CFMWS.
> -------------------------------
> Let's imagine we have one 2GB device attached to a host bridge.
> 
> In this example, the device hosts 2GB of persistent memory - but we
> might want the flexibility to map capacity as volatile or persistent.

Fairly sure the kernel blocks persistent capacity in a volatile CFMWS.
Does any BIOS actually do this?

You might have a variable partition device, but I thought that in the
kernel at least we decided no one was building anything that crazy?

Maybe a QoS split is a better example to motivate one range, two places?

> 
> The platform vendor may decide that they want to reserve two entirely
> separate system physical address ranges to represent the capacity.
> 
> ```
>            Subtable Type : 01 [CXL Fixed Memory Window Structure]
>                 Reserved : 00
>                   Length : 002C
>                 Reserved : 00000000
>      Window base address : 0000000100000000   <- Memory Region
>              Window size : 0000000080000000   <- 2GB
> Interleave Members (2^n) : 00
>    Interleave Arithmetic : 00
>                 Reserved : 0000
>              Granularity : 00000000
>             Restrictions : 0006               <- Bit(2) - Volatile
>                    QtgId : 0001
>             First Target : 00000007           <- Host Bridge _UID
> 
>            Subtable Type : 01 [CXL Fixed Memory Window Structure]
>                 Reserved : 00
>                   Length : 002C
>                 Reserved : 00000000
>      Window base address : 0000000200000000   <- Memory Region
>              Window size : 0000000080000000   <- 2GB
> Interleave Members (2^n) : 00
>    Interleave Arithmetic : 00
>                 Reserved : 0000
>              Granularity : 00000000
>             Restrictions : 000A               <- Bit(3) - Persistent
>                    QtgId : 0001
>             First Target : 00000007           <- Host Bridge _UID
> 
> NUMA effect: 2 nodes marked POSSIBLE (1 for each CFMWS)
> ```
> 
> You might have a CEDT with two CFMWS as above, where the base addresses
> are `0x100000000` and `0x200000000` respectively, but whose window sizes
> cover the entire 2GB capacity of the device.  This affords the user
> flexibility in where the memory is mapped, depending on whether it is
> mapped as volatile or persistent, while keeping the two SPA ranges separate.
> 
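
As a rough sketch of how the two windows quoted above read, the snippet below decodes the Restrictions field and computes each window's SPA range. Bits 2 and 3 are annotated in the dump itself; the meaning of bit 1 (Type 3 memory) is my reading of the CXL spec layout and should be treated as an assumption. The helper names are hypothetical.

```python
# Bits 2 and 3 come from the annotations in the quoted dump.
# Bit 1 ("type3") is an assumption based on the CXL spec layout.
RESTRICTION_BITS = {
    1: "type3",
    2: "volatile",
    3: "persistent",
}


def decode_restrictions(value):
    """Return the names of the known restriction bits set in value."""
    return [name for bit, name in RESTRICTION_BITS.items() if value & (1 << bit)]


def window_range(base, size):
    """Return the inclusive (start, end) SPA range covered by a window."""
    return base, base + size - 1


def overlaps(a, b):
    """True if two inclusive ranges share at least one address."""
    return a[0] <= b[1] and b[0] <= a[1]


# Values taken from the two CFMWS entries quoted above.
volatile_win = window_range(0x100000000, 0x80000000)
pmem_win = window_range(0x200000000, 0x80000000)
```

Both windows are large enough to hold the whole 2GB device, and they do not overlap in SPA, which is what lets the platform advertise them as separate ranges.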
> This is allowed because the endpoint decoders commit device physical
> address space *in order*, meaning no region of device physical address
> space can be mapped to more than one system physical address.
> 
> i.e.: DPA(0) can only map to SPA(0x200000000) xor SPA(0x100000000)
> 
> (See Section 2a - decoder programming).
> 
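
A toy model of the "DPA(0) maps to one window xor the other" point, for anyone who wants to poke at it. It assumes only what the quoted text says: device capacity is handed out in order, lowest DPA first. The `Endpoint` class and its methods are hypothetical and say nothing about how the real decoders are programmed (that is Section 2a).

```python
GIB = 1 << 30


class Endpoint:
    """Toy device: DPA is committed in order, lowest address first."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.next_dpa = 0
        self.mappings = []  # list of (dpa_start, size, spa_base)

    def commit(self, spa_base, size):
        """Map the next `size` bytes of DPA at `spa_base`."""
        if self.next_dpa + size > self.capacity:
            raise ValueError("no device capacity left to map")
        self.mappings.append((self.next_dpa, size, spa_base))
        self.next_dpa += size

    def spa_for(self, dpa):
        """Return the SPA a DPA is mapped at, or None if unmapped."""
        for start, size, spa_base in self.mappings:
            if start <= dpa < start + size:
                return spa_base + (dpa - start)
        return None


dev = Endpoint(2 * GIB)
dev.commit(0x200000000, 2 * GIB)  # all capacity goes to the persistent window

try:
    dev.commit(0x100000000, 2 * GIB)  # the same capacity cannot back both
    second_commit_ok = True
except ValueError:
    second_commit_ok = False
```

Once the full 2GB is committed to one window there is nothing left to place in the other, so each DPA ends up with at most one SPA.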

> -------------------------------------------------------------
> Two Devices On One Host Bridge - With and Without Interleave.
> -------------------------------------------------------------
> What if we wanted some capacity on each endpoint hosted on its own NUMA
> node, and wanted to interleave a portion of each device capacity?

If anyone hits the lock on commit (i.e. an annoying BIOS), the ordering
checks on HPA kick in here and restrict flexibility a lot
(assuming I understand them correctly, that is).

This is a good illustration of why we should at some point revisit
multiple NUMA nodes per CFMWS.  We have to burn SPA space just
to get nodes.  From a spec point of view all that is needed here
is a single CFMWS. 

> 
> We could produce the following CFMWS configuration.
> ```
>            Subtable Type : 01 [CXL Fixed Memory Window Structure]
>                 Reserved : 00
>                   Length : 002C
>                 Reserved : 00000000
>      Window base address : 0000000100000000   <- Memory Region 1
>              Window size : 0000000080000000   <- 2GB
> Interleave Members (2^n) : 00
>    Interleave Arithmetic : 00
>                 Reserved : 0000
>              Granularity : 00000000
>             Restrictions : 0006               <- Bit(2) - Volatile
>                    QtgId : 0001
>             First Target : 00000007           <- Host Bridge _UID
> 
>            Subtable Type : 01 [CXL Fixed Memory Window Structure]
>                 Reserved : 00
>                   Length : 002C
>                 Reserved : 00000000
>      Window base address : 0000000200000000   <- Memory Region 2
>              Window size : 0000000080000000   <- 2GB
> Interleave Members (2^n) : 00
>    Interleave Arithmetic : 00
>                 Reserved : 0000
>              Granularity : 00000000
>             Restrictions : 0006               <- Bit(2) - Volatile
>                    QtgId : 0001
>             First Target : 00000007           <- Host Bridge _UID
> 
>            Subtable Type : 01 [CXL Fixed Memory Window Structure]
>                 Reserved : 00
>                   Length : 002C
>                 Reserved : 00000000
>      Window base address : 0000000300000000   <- Memory Region 3
>              Window size : 0000000100000000   <- 4GB
> Interleave Members (2^n) : 00
>    Interleave Arithmetic : 00
>                 Reserved : 0000
>              Granularity : 00000000
>             Restrictions : 0006               <- Bit(2) - Volatile
>                    QtgId : 0001
>             First Target : 00000007           <- Host Bridge _UID
> 
> NUMA effect: 3 nodes marked POSSIBLE (1 for each CFMWS)
> ```
> 
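
To put a number on the SPA cost of getting nodes this way, here is a quick back-of-the-envelope calculation using the three windows quoted above and the two 2GB devices from the example. Nothing here is kernel code; the variable names are just for illustration.

```python
GIB = 1 << 30

# (base, size) of the three CFMWS entries quoted above.
windows = [
    (0x100000000, 0x080000000),  # Memory Region 1, 2GB
    (0x200000000, 0x080000000),  # Memory Region 2, 2GB
    (0x300000000, 0x100000000),  # Memory Region 3, 4GB
]

device_capacity = 2 * (2 * GIB)  # two endpoints, 2GB each
spa_reserved = sum(size for _, size in windows)
possible_nodes = len(windows)  # one node marked possible per CFMWS
```

That is 8GB of SPA reserved to describe 4GB of device capacity, purely so that three nodes exist.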
> In this configuration, we could still do what we did with the prior
> configuration (2 CFMWS), but we could also use the third root decoder
> to simplify decoder programming of interleave.
> 
> Since the third region has sufficient capacity (4GB) to cover both
> devices (2GB/each), we can actually associate the entire capacity of
> both devices with that region.
> 
> We'll discuss this decoder structure in-depth in Section 4.
> 
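
For intuition only, a minimal sketch of what 2-way interleave across the two devices in the 4GB window could look like. It assumes simple modulo interleave at a 256-byte granularity; neither value is stated for the device-level interleave in the text above, and the real decoder programming is left to Section 4. The `decode` helper is hypothetical.

```python
GRANULARITY = 256  # bytes per interleave chunk (assumed)
WAYS = 2           # two devices behind the host bridge


def decode(offset):
    """Map an offset within the interleaved window to (device, DPA offset)."""
    chunk, within = divmod(offset, GRANULARITY)
    device = chunk % WAYS                         # chunks alternate devices
    dpa = (chunk // WAYS) * GRANULARITY + within  # position on that device
    return device, dpa


first = decode(0)
second_chunk = decode(256)
third_chunk = decode(512)
last_byte = decode(0x100000000 - 1)  # last byte of the 4GB window
```

With these assumptions the last byte of the 4GB window lands on the last byte of the second device's 2GB, i.e. the window exactly covers both devices.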


