From: Yuquan Wang <wangyuquan1236@phytium.com.cn>
To: Gregory Price <gourry@gourry.net>
Cc: lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org,
	linux-cxl@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [LSF/MM] CXL Boot to Bash - Section 4: Interleave
Date: Thu, 13 Mar 2025 16:31:31 +0800
Message-ID: <Z9KX4/zF6/yFdWLQ@phytium.com.cn>
In-Reply-To: <Z9DQnjPWbkjqrI9n@gourry-fedora-PF4VCD3F>

On Tue, Mar 11, 2025 at 08:09:02PM -0400, Gregory Price wrote:
> 
> -----------------------------
> Intra-Host-Bridge Interleave.
> -----------------------------
> Now let's consider a system where we've placed 2 CXL devices on the same
> Host Bridge.  Maybe each CXL device is only capable of x8 PCIE, and we
> want to make full use of a single x16 link.
> 
> This setup only requires the BIOS to create a CEDT CFMWS which reports
> the entire capacity of all devices under the host bridge, but does not
> need to set up any interleaving.
> 
> In the following case, the BIOS has configured a single 4GB memory region
> which only targets the single host bridge, but maps the entire memory
> capacity of both devices (2GB).
> 
> ```
> CFMWS:
>           Subtable Type : 01 [CXL Fixed Memory Window Structure]
>                 Reserved : 00
>                   Length : 002C
>                 Reserved : 00000000
>      Window base address : 0000000300000000   <- Memory Region
>              Window size : 0000000080000000   <- 2GB

I think it should be "Window size : 0000000100000000   <- 4GB" here.

> Interleave Members (2^n) : 00                 <- No host bridge interleave
>    Interleave Arithmetic : 00
>                 Reserved : 0000
>              Granularity : 00000000
>             Restrictions : 0006               <- Bit(2) - Volatile
>                    QtgId : 0001
>             First Target : 00000007           <- Host Bridge _UID
> ```
> 
> Assuming no other CEDT or SRAT entries exist, this will result in Linux
> creating the following NUMA topology, where all CXL memory is in Node 1.
> 
> ```
> NUMA Structure:
>         ---------     --------   |    ----------
>         | cpu0  |-----| DRAM |---|----| Node 0 |
>         ---------     --------   |    ----------
>             |                    |
>          -------                 |    ----------
>          | HB0 |-----------------|----| Node 1 |
>          -------                 |    ----------
>         /       \                |
>    CXL Dev     CXL Dev           |
> ```
> 
> In this scenario, we program the decoders like so:
> ```
> Decoders
>                            CXL Root
>                               |
>                           decoder0.0
>                          IW:1  IG:256
>                   [0x300000000, 0x3FFFFFFFF]
>                               |
>                           Host Bridge
>                               |
>                           decoder1.0
>                          IW:2  IG:256
>                    [0x300000000, 0x3FFFFFFFF]
>                              /   \
>                    Endpoint 0     Endpoint 1
>                        |              |
>                    decoder2.0     decoder3.0
>                  IW:2  IG:256     IW:2  IG:256
>     [0x300000000, 0x3FFFFFFFF]    [0x300000000, 0x3FFFFFFFF]
> ```
> 
> The root decoder in this scenario does not participate in interleave;
> it simply forwards all accesses in this range to the host bridge.
> 
> The host bridge then applies the interleave across its connected devices
> and the decoders apply translation accordingly.
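> 
> To make the decode concrete, here is a simplified sketch of the
> arithmetic one decode level performs (illustrative only - real
> hardware selects specific address bits per the CXL spec, and this
> helper is hypothetical):
> 
> ```
> #include <stdint.h>
> 
> /* Which downstream target services an access, at one decode level.
>  * ig = interleave granularity in bytes, iw = interleave ways.
>  */
> static unsigned int interleave_target(uint64_t hpa, uint64_t base,
>                                       uint64_t ig, unsigned int iw)
> {
>         return ((hpa - base) / ig) % iw; /* round-robin ig-sized chunks */
> }
> 
> /* decoder1.0 above (IW:2, IG:256):
>  *   interleave_target(0x300000000, 0x300000000, 256, 2) == 0 -> Endpoint 0
>  *   interleave_target(0x300000100, 0x300000000, 256, 2) == 1 -> Endpoint 1
>  */
> ```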
> 
> -----------------------
> Combination Interleave.
> -----------------------
> Let's now consider a system where 2 Host Bridges have 2 CXL devices each,
> and we want to interleave the entire set.  This requires us to make use
> of both inter and intra host bridge interleave.
> 
> First, we can interleave this with a single CEDT entry, the same as
> the first inter-host-bridge CEDT (now assuming 1GB per device).
> 
> ```
>            Subtable Type : 01 [CXL Fixed Memory Window Structure]
>                 Reserved : 00
>                   Length : 002C
>                 Reserved : 00000000
>      Window base address : 0000000300000000   <- Memory Region
>              Window size : 0000000100000000   <- 4GB
> Interleave Members (2^n) : 01                 <- 2-way interleave
>    Interleave Arithmetic : 00
>                 Reserved : 0000
>              Granularity : 00000000
>             Restrictions : 0006               <- Bit(2) - Volatile
>                    QtgId : 0001
>             First Target : 00000007           <- Host Bridge _UID
>              Next Target : 00000006           <- Host Bridge _UID
> ```
> 
> This gives us a NUMA structure as follows:
> ```
> NUMA Structure:
> 
>          ----------     --------    |   ----------
>          |  cpu0  |-----| DRAM |----|---| Node 0 |
>          ----------     --------    |   ----------
>         /         \                 |
>     -------     -------             |   ----------
>     | HB0 |-----| HB1 |-------------|---| Node 1 |
>     -------     -------             |   ----------
>       / \         / \               |
>   CXL0   CXL1  CXL2  CXL3           |
> ```
> 
> And the respective decoder programming looks as follows
> ```
> Decoders:
>                              CXL  Root
>                                  |
>                              decoder0.0
>                             IW:2   IG:256
>                       [0x300000000, 0x3FFFFFFFF]
>                              /         \
>                 Host Bridge 7           Host Bridge 6
>                 /                                    \
>            decoder1.0                             decoder2.0
>           IW:2   IG:512                          IW:2   IG:512
>   [0x300000000, 0x3FFFFFFFF]              [0x300000000, 0x3FFFFFFFF]
>             /    \                                  /    \
>    endpoint0      endpoint1                endpoint2      endpoint3
>       |               |                       |               |
>   decoder3.0      decoder4.0              decoder5.0      decoder6.0
>           IW:4  IG:256                            IW:4  IG:256
>   [0x300000000, 0x3FFFFFFFF]              [0x300000000, 0x3FFFFFFFF]
> ```
> 
> Notice at both the root and the host bridge, the Interleave Ways is 2.
> There are two targets at each level.  The host bridge has a granularity
> of 512 to capture its parent's ways and granularity (`2*256`).
> 
> Each decoder is programmed with the total number of targets (4) and the
> overall granularity (256B).
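> 
> To make the relationship concrete, here is a simplified sketch of how
> the three levels compose in this topology (illustrative arithmetic
> only, with hypothetical names - not the driver's actual code):
> 
> ```
> #include <stdint.h>
> #include <stdio.h>
> 
> /* Root:     IW:2 IG:256 -> selects the host bridge
>  * HB:       IW:2 IG:512 -> selects the endpoint under that bridge
>  * Endpoint: IW:4 IG:256 -> strips the interleave to produce a DPA
>  */
> static void route(uint64_t hpa)
> {
>         uint64_t off = hpa - 0x300000000ULL;
>         unsigned int hb = (off / 256) % 2;           /* root decoder */
>         unsigned int ep = (off / 512) % 2;           /* HB decoder   */
>         uint64_t dpa = (off / (4 * 256)) * 256 + (off % 256);
> 
>         printf("hpa %#lx -> hb %u, ep %u, dpa %#lx\n",
>                (unsigned long)hpa, hb, ep, (unsigned long)dpa);
> }
> 
> /* route(0x300000000) -> hb 0, ep 0, dpa 0
>  * route(0x300000100) -> hb 1, ep 0, dpa 0
>  * route(0x300000200) -> hb 0, ep 1, dpa 0
>  * route(0x300000300) -> hb 1, ep 1, dpa 0
>  */
> ```
> 
> The endpoint's IW:4 / IG:256 is the product of the levels above it
> (2 ways x 2 ways, at the root's 256B granularity), which is what lets
> each endpoint recover a contiguous device address space.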

Is there any relationship between the endpoints' decoder setup (IW & IG)
and the other decoders' settings?

> 
> We might use this setup if each CXL device is capable of x8 PCIE, and
> we have 2 Host Bridges capable of full x16 - utilizing all bandwidth
> available.
> 
> ---------------------------------------------
> Nuance: Hardware Interleave and Memory Holes.
> ---------------------------------------------
> You may encounter a system which cannot place the entire memory capacity
> into a single contiguous System Physical Address range.  That's ok,
> because we can just use multiple decoders to capture this nuance.
> 
> Most CXL devices allow for multiple decoders.
> 
> This may require an SRAT entry to keep these regions on the same node.
> (Obviously this relies on your platform vendor's BIOS.)
> 
> ```
> CFMWS:
>          Subtable Type : 01 [CXL Fixed Memory Window Structure]
>                 Reserved : 00
>                   Length : 002C
>                 Reserved : 00000000
>      Window base address : 0000000300000000   <- Memory Region
>              Window size : 0000000080000000   <- 2GB
> Interleave Members (2^n) : 00                 <- No host bridge interleave
>    Interleave Arithmetic : 00
>                 Reserved : 0000
>              Granularity : 00000000
>             Restrictions : 0006               <- Bit(2) - Volatile
>                    QtgId : 0001
>             First Target : 00000007           <- Host Bridge 7
> 
>          Subtable Type : 01 [CXL Fixed Memory Window Structure]
>                 Reserved : 00
>                   Length : 002C
>                 Reserved : 00000000
>      Window base address : 0000000400000000   <- Memory Region
>              Window size : 0000000080000000   <- 2GB
> Interleave Members (2^n) : 00                 <- No host bridge interleave
>    Interleave Arithmetic : 00
>                 Reserved : 0000
>              Granularity : 00000000
>             Restrictions : 0006               <- Bit(2) - Volatile
>                    QtgId : 0001
>             First Target : 00000007           <- Host Bridge 7
> 
> SRAT:
>         Subtable Type : 01 [Memory Affinity]
>                Length : 28
>      Proximity Domain : 00000001          <- NUMA Node 1
>             Reserved1 : 0000
>          Base Address : 0000000300000000  <- Physical Memory Region
>        Address Length : 0000000080000000  <- first 2GB
> 
>         Subtable Type : 01 [Memory Affinity]
>                Length : 28
>      Proximity Domain : 00000001          <- NUMA Node 1
>             Reserved1 : 0000
>          Base Address : 0000000400000000  <- Physical Memory Region
>        Address Length : 0000000080000000  <- second 2GB
> ```
> 
> The SRAT entries allow us to keep the regions attached to the same node.
> ```
> 
> NUMA Structure:
>         ---------     --------   |    ----------
>         | cpu0  |-----| DRAM |---|----| Node 0 |
>         ---------     --------   |    ----------
>             |                    |
>          -------                 |    ----------
>          | HB0 |-----------------|----| Node 1 |
>          -------                 |    ----------
>         /       \                |
>    CXL Dev     CXL Dev           |
> ```
>
Hi, Gregory

Seeing this, I have a hypothetical configuration to discuss.

If the same system uses tables like the ones below:

```
CFMWS:
         Subtable Type : 01 [CXL Fixed Memory Window Structure]
                Reserved : 00
                  Length : 002C
                Reserved : 00000000
     Window base address : 0000000300000000   <- Memory Region
             Window size : 0000000080000000   <- 2GB
Interleave Members (2^n) : 00                 <- No host bridge interleave
   Interleave Arithmetic : 00
                Reserved : 0000
             Granularity : 00000000
            Restrictions : 0006               <- Bit(2) - Volatile
                   QtgId : 0001
            First Target : 00000007           <- Host Bridge 7

         Subtable Type : 01 [CXL Fixed Memory Window Structure]
                Reserved : 00
                  Length : 002C
                Reserved : 00000000
     Window base address : 0000000400000000   <- Memory Region
             Window size : 0000000080000000   <- 2GB
Interleave Members (2^n) : 00                 <- No host bridge interleave
   Interleave Arithmetic : 00
                Reserved : 0000
             Granularity : 00000000
            Restrictions : 0006               <- Bit(2) - Volatile
                   QtgId : 0001
            First Target : 00000007           <- Host Bridge 7

SRAT:
        Subtable Type : 01 [Memory Affinity]
               Length : 28
     Proximity Domain : 00000000          <- NUMA Node 0
            Reserved1 : 0000
         Base Address : 0000000300000000  <- Physical Memory Region
       Address Length : 0000000080000000  <- first 2GB

        Subtable Type : 01 [Memory Affinity]
               Length : 28
     Proximity Domain : 00000001          <- NUMA Node 1
            Reserved1 : 0000
         Base Address : 0000000400000000  <- Physical Memory Region
       Address Length : 0000000080000000  <- second 2GB
```


The first 2GB CXL memory region would then be located on node 0 along with DRAM.

```
NUMA Structure:

        ---------     --------   |            ----------
        | cpu0  |-----| DRAM |---|------------| Node 0 |
        ---------     --------   |   /        ----------
            |                    |  /first 2GB
         -------                 | /          ----------
         | HB0 |-----------------|------------| Node 1 |
         -------                 |second 2GB  ----------
        /       \                |
   CXL Dev     CXL Dev           |
```

Is the above configuration and structure valid?

Yuquan
> And the decoder programming would look like so
> ```
> Decoders:
>                                CXL  Root
>                              /           \
>                     decoder0.0           decoder0.1
>                   IW:1  IG:256           IW:1  IG:256
>     [0x300000000, 0x37FFFFFFF]           [0x400000000, 0x47FFFFFFF]
>                              \           /
>                               Host Bridge
>                              /           \
>                     decoder1.0           decoder1.1
>                   IW:2  IG:256           IW:2  IG:256
>     [0x300000000, 0x37FFFFFFF]           [0x400000000, 0x47FFFFFFF]
>               /   \                                /   \
>     Endpoint 0     Endpoint 1            Endpoint 0     Endpoint 1
>         |              |                     |              |
>     decoder2.0     decoder3.0            decoder2.1     decoder3.1
>             IW:2 IG:256                          IW:2 IG:256
>     [0x300000000, 0x37FFFFFFF]           [0x400000000, 0x47FFFFFFF]
> ```
> 
> Linux manages decoders in relation to the associated component, so
> decoders are N.M where N is the component and M is the decoder number.
> 
> If you look, you'll see each side of this tree looks individually
> equivalent to the intra-host-bridge interleave example, just with one
> half of the total memory each (matching the CFMWS ranges).
> 
> Each of the root decoders still has an interleave width of 1 because
> they both only target one host bridge (despite it being the same one).
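> 
> As a rough illustration, decode across multiple root decoders is just
> a range match before the usual interleave math (hypothetical structs,
> for illustration only - not the driver's actual data model):
> 
> ```
> #include <stdint.h>
> #include <stdbool.h>
> 
> /* Each root decoder covers one contiguous SPA window. */
> struct window { uint64_t base, size; };
> 
> static const struct window windows[] = {
>         { 0x300000000ULL, 0x80000000ULL },  /* decoder0.0: first 2GB  */
>         { 0x400000000ULL, 0x80000000ULL },  /* decoder0.1: second 2GB */
> };
> 
> static bool spa_in_window(uint64_t spa, const struct window *w)
> {
>         return spa >= w->base && spa < w->base + w->size;
> }
> ```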
> 
> 
> --------------------------------
> Software Interleave (Mempolicy).
> --------------------------------
> Linux provides a software mechanism to allow a task to interleave its
> memory across NUMA nodes - which may have different performance
> characteristics.  This component is called `mempolicy`, and is primarily
> operated on with the `set_mempolicy()` and `mbind()` syscalls.
> 
> These syscalls take a nodemask (bitmask representing NUMA node ids) as
> an argument to describe the intended allocation policy of the task.
> 
> The following policies are presently supported (as of v6.13)
> ```
> enum {
>         MPOL_DEFAULT,
>         MPOL_PREFERRED,
>         MPOL_BIND,
>         MPOL_INTERLEAVE,
>         MPOL_LOCAL,
>         MPOL_PREFERRED_MANY,
>         MPOL_WEIGHTED_INTERLEAVE,
> };
> ```
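> 
> As a minimal sketch, a task could select one of these policies via
> set_mempolicy() like so (link with -lnuma; error handling omitted):
> 
> ```
> #include <numaif.h>   /* set_mempolicy(), MPOL_* */
> #include <stdlib.h>
> 
> int main(void)
> {
>         /* nodemask with bits 0 and 1 set: interleave across nodes 0,1 */
>         unsigned long nodemask = (1UL << 0) | (1UL << 1);
> 
>         if (set_mempolicy(MPOL_INTERLEAVE, &nodemask,
>                           8 * sizeof(nodemask)))
>                 return 1;
> 
>         /* pages faulted in from here on are interleaved 1:1 */
>         void *buf = malloc(1 << 20);
>         return buf == NULL;
> }
> ```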
> 
> Let's look at `MPOL_INTERLEAVE` and `MPOL_WEIGHTED_INTERLEAVE`.
> 
> To quote the man page:
> ```
> MPOL_INTERLEAVE
>     This  mode  interleaves  page  allocations  across the nodes specified
>     in nodemask in numeric node ID order.  This optimizes for bandwidth
>     instead of latency by spreading out pages and memory accesses to those
>     pages across multiple nodes.  However, accesses to a single page will
>     still be limited to the memory bandwidth of a single node.
> 
> MPOL_WEIGHTED_INTERLEAVE (since Linux 6.9)
>     This mode interleaves page allocations across the nodes specified in
>     nodemask according to the weights in
>         /sys/kernel/mm/mempolicy/weighted_interleave
>     For example, if bits 0, 2, and 5 are set in nodemask and the contents of
>         /sys/kernel/mm/mempolicy/weighted_interleave/node0
>         /sys/ ... /node2
>         /sys/ ... /node5
>     are 4, 7, and 9, respectively, then pages in this region will be
>     allocated on nodes 0, 2, and 5 in a 4:7:9 ratio.
> ```
> 
> To put it simply, MPOL_INTERLEAVE will interleave allocations at a page
> granularity (4KB, 2MB, etc) across nodes in a 1:1 ratio, while
> MPOL_WEIGHTED_INTERLEAVE takes into account weights - which presumably
> map to the bandwidth of each respective node.
> 
> Or more concretely:
> 
> MPOL_INTERLEAVE
>     1:1 Interleave between two nodes.
>     malloc(4096)  ->  node0
>     malloc(4096)  ->  node1
>     malloc(4096)  ->  node0
>     malloc(4096)  ->  node1
>     ... and so on ...
> 
> MPOL_WEIGHTED_INTERLEAVE
>     2:1 Interleave between two nodes.
>     malloc(4096)  ->  node0
>     malloc(4096)  ->  node0
>     malloc(4096)  ->  node1
>     malloc(4096)  ->  node0
>     malloc(4096)  ->  node0
>     malloc(4096)  ->  node1
>     ... and so on ...
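> 
> As a hedged sketch, a 2:1 weighted interleave could be set up like
> this (requires Linux 6.9+; writing the sysfs weights needs privilege,
> and older numaif.h headers may not define the constant):
> 
> ```
> #include <numaif.h>
> 
> #ifndef MPOL_WEIGHTED_INTERLEAVE
> #define MPOL_WEIGHTED_INTERLEAVE 6   /* uapi value as of v6.9 */
> #endif
> 
> int main(void)
> {
>         /* weights are global, set beforehand via sysfs, e.g.:
>          *   echo 2 > /sys/kernel/mm/mempolicy/weighted_interleave/node0
>          *   echo 1 > /sys/kernel/mm/mempolicy/weighted_interleave/node1
>          */
>         unsigned long nodemask = (1UL << 0) | (1UL << 1);
>         return set_mempolicy(MPOL_WEIGHTED_INTERLEAVE, &nodemask,
>                              8 * sizeof(nodemask)) != 0;
> }
> ```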
> 
> This is the preferred mechanism for *heterogeneous interleave* on Linux,
> as it allows for predictable performance based on the explicit (and
> visible) placement of memory.
> 
> It also allows for memory ZONE restrictions to enable better performance
> predictability (e.g. keeping kernel locks out of CXL while allowing
> workloads to leverage it for expansion or bandwidth).
> 
> ======================
> Mempolicy Limitations.
> ======================
> Mempolicy is a *per-task* allocation policy that is inherited by
> child-tasks on clone/fork. It can only be changed by the task itself,
> though cgroups may affect the effective nodemask via cpusets.
> 
> This means that once a task has been launched, an external actor cannot
> change the policy of a running task - except possibly by migrating that
> task between cgroups or changing the cpusets.mems value of the cgroup
> the task lives in.
> 
> Additionally, if capacity on a given node is not available, allocations
> will fall back to another node in the node mask - which may cause
> interleave to become unbalanced.
> 
> ================================
> Hardware Interleave Limitations.
> ================================
> Granularities:
>    Granularities are limited by hardware
>    (typically 256B up to 16KB, in powers of 2)
> 
> Ways:
>    Ways are limited by the CXL configuration to:
>    2, 4, 8, 16, 3, 6, 12
> 
> Balance:
>    Linux does not allow imbalanced interleave configurations
>    (e.g. 3-way interleave where 2 targets are on 1 HB and 1 on another)
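> 
> A rough sanity check mirroring these limits might look like this
> (illustrative only - the authoritative checks live in the CXL driver
> core):
> 
> ```
> #include <stdbool.h>
> 
> static bool valid_ways(unsigned int iw)
> {
>         switch (iw) {
>         case 2: case 4: case 8: case 16:
>         case 3: case 6: case 12:
>                 return true;
>         }
>         return false;
> }
> 
> static bool valid_granularity(unsigned int ig)
> {
>         /* power of 2, 256B up to 16KB */
>         return ig >= 256 && ig <= 16384 && (ig & (ig - 1)) == 0;
> }
> ```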
> 
> Depending on your platform vendor and type of interleave, you may not
> be able to deconstruct an interleave region at all (decoders may be
> locked).  In this case, you may not have the flexibility to convert
> operation from interleaved to non-interleaved via the driver interface.
> 
> In the scenario where your interleave configuration is entirely driver
> managed, you cannot adjust the size of an interleave set without
> deconstructing the entire set.
> 
> ------------------------------------------------------------------------
> 
> Next we'll discuss how memory allocations occur in a CXL-enabled system,
> which may be affected by things like Reclaim and Tiering systems.
> 
> ~Gregory


