From: Gregory Price <gourry@gourry.net>
To: lsf-pc@lists.linux-foundation.org
Cc: linux-mm@kvack.org, linux-cxl@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: CXL Boot to Bash - Section 2: The Drivers
Date: Wed, 5 Feb 2025 11:06:08 -0500
Message-ID: <Z6OMcLt3SrsZjgvw@gourry-fedora-PF4VCD3F>
In-Reply-To: <Z226PG9t-Ih7fJDL@gourry-fedora-PF4VCD3F>

(background reading as we build up complexity)

Driver Management - Decoders, HPA/SPA, DAX, and RAS.

The Drivers
===========
----------------------
The Story Up 'til Now.
----------------------

When we left the Platform arena, assuming we've configured with special
purpose memory, we are left with an entry in the memory map like so:

BIOS-e820:   [mem 0x000000c050000000-0x000000fcefffffff] soft reserved
/proc/iomem: c050000000-fcefffffff : Soft Reserved

This resource (see kernel/resource.c) is left unused until a driver
comes along to actually surface it to allocators (or some other
interface).
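
For anyone who wants to find these ranges programmatically, here is a
minimal sketch that pulls Soft Reserved entries out of /proc/iomem
style text.  The function name and the sample input are illustrative
only; on a live system the text would come from /proc/iomem, which
generally needs root to show real addresses.

```python
import re

# Matches lines such as "c050000000-fcefffffff : Soft Reserved".
# Nested /proc/iomem entries are indented, so leading space is allowed.
IOMEM_LINE = re.compile(r"^\s*([0-9a-fA-F]+)-([0-9a-fA-F]+)\s*:\s*(.+?)\s*$")


def soft_reserved_ranges(iomem_text):
    """Return (start, end, size) tuples for every Soft Reserved entry."""
    ranges = []
    for line in iomem_text.splitlines():
        match = IOMEM_LINE.match(line)
        if match and match.group(3) == "Soft Reserved":
            start = int(match.group(1), 16)
            end = int(match.group(2), 16)
            ranges.append((start, end, end - start + 1))  # end is inclusive
    return ranges


# Sample input copied from the /proc/iomem line shown above.
sample = "c050000000-fcefffffff : Soft Reserved\n"
for start, end, size in soft_reserved_ranges(sample):
    print(f"{start:#x}-{end:#x} size {size:#x}")
```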

In our case, the drivers involved (or at least the ones we'll reference)
are:

drivers/base/     : device probing, memory (block) hotplug
drivers/acpi/     : device hotplug
drivers/acpi/numa : NUMA ACPI table info (SRAT, CEDT, HMAT, ...)
drivers/pci/      : PCI device probing
drivers/cxl/      : CXL device probing
drivers/dax/      : CXL device to memory resource association

We don't necessarily care about the specifics of each driver; we'll
focus on just the aspects that ultimately affect memory management.

-------------------------------
Step 4: Basic build complexity.
-------------------------------
To make a long story short:

CXL Build Configurations:
  CONFIG_CXL_ACPI
  CONFIG_CXL_BUS
  CONFIG_CXL_MEM
  CONFIG_CXL_PCI
  CONFIG_CXL_PORT
  CONFIG_CXL_REGION

DAX Build Configurations:
  CONFIG_DEV_DAX
  CONFIG_DEV_DAX_CXL
  CONFIG_DEV_DAX_KMEM

Without all of these enabled, your journey will end up cut short because
some piece of the probe process will stop progressing.

The most common misconfiguration I run into is CONFIG_DEV_DAX_CXL not
being enabled. You end up with memory regions without dax devices.

[/sys/bus/cxl/devices]# ls
dax_region0  decoder0.0  decoder1.0  decoder2.0 .....
dax_region1  decoder0.1  decoder1.1  decoder3.0 .....

^^^ These dax regions require `CONFIG_DEV_DAX_CXL` to be enabled to
fully surface as dax devices, which can then be converted to System
RAM.
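
A quick way to catch a missing option is to scan the kernel config
text.  The sketch below is a minimal example, assuming the config is
available as plain text (for instance from /boot/config-<version>,
which is a distro convention and not guaranteed to exist).  The
function name and the sample config are made up for illustration.

```python
# Options listed above; all of them must be built in (y) or modular (m).
REQUIRED_OPTIONS = [
    "CONFIG_CXL_ACPI", "CONFIG_CXL_BUS", "CONFIG_CXL_MEM",
    "CONFIG_CXL_PCI", "CONFIG_CXL_PORT", "CONFIG_CXL_REGION",
    "CONFIG_DEV_DAX", "CONFIG_DEV_DAX_CXL", "CONFIG_DEV_DAX_KMEM",
]


def missing_options(config_text, required=REQUIRED_OPTIONS):
    """Return the required options that are not set to y or m."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith("#") or "=" not in line:
            continue  # skips comments such as "# CONFIG_FOO is not set"
        name, value = line.split("=", 1)
        if value in ("y", "m"):
            enabled.add(name)
    return [opt for opt in required if opt not in enabled]


# Hypothetical config fragment with the common misconfiguration.
sample = """\
CONFIG_CXL_ACPI=m
CONFIG_CXL_BUS=m
CONFIG_CXL_MEM=m
CONFIG_CXL_PCI=m
CONFIG_CXL_PORT=m
CONFIG_CXL_REGION=y
CONFIG_DEV_DAX=m
# CONFIG_DEV_DAX_CXL is not set
CONFIG_DEV_DAX_KMEM=m
"""
print(missing_options(sample))
```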


---------------------------------------------------------------
Step 5: The CXL driver associating devices and iomem resources.
---------------------------------------------------------------

The CXL driver wires up the following devices:
   root        :  CXL root
   portN       :  An intermediate or endpoint destination for accesses
   memN        :  memory devices


Each device in the hierarchy may have one or more decoders
   decoderN.M  :  Address routing and translation devices


The driver will also create additional objects and associations
   regionN     :  device-to-iomem resource mapping
   dax_regionN :  region-to-dax device mapping


Most associations built by the driver are done by validating decoders
against each other at each point in the hierarchy.

  Root decoders describe memory regions and route DMA to ports.
  Intermediate decoders route DMA through CXL fabric.
  Endpoint decoders translate addresses (Host to device).


The root has 1 decoder per associated CFMW in the CEDT
   decoder0.0  ->  `c050000000-fcefffffff   : Soft Reserved`


A region (iomem resource mapping) can be created for these decoders
   [/sys/bus/cxl/devices/region0]# cat resource size target0
      0xc050000000   0x3ca0000000   decoder5.0


A dax_region surfaces these regions as a dax device
   [/sys/bus/cxl/devices/dax_region0/dax0.0]# cat resource
      0xc050000000


So in a simple environment with 1 device, we end up with a mapping
that looks something like this.

     root      ---   decoder0.0  --- region0 -- dax_region0 -- dax0.0
       |                |              |
     port1     ---   decoder1.0        |
       |                |              |
     endpoint0 ---   decoder3.0--------/
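
To get a feel for what the driver created on a given machine, the
device names can be grouped by kind.  This is a minimal sketch that
relies only on the naming patterns shown in this section; the
function name is illustrative, and any name that does not match a
known pattern lands in "other".

```python
import re
from collections import defaultdict

# Naming patterns as they appear in this section (rootN, portN, ...).
PATTERNS = [
    ("dax_region", re.compile(r"^dax_region\d+$")),
    ("decoder", re.compile(r"^decoder\d+\.\d+$")),
    ("region", re.compile(r"^region\d+$")),
    ("root", re.compile(r"^root\d*$")),
    ("port", re.compile(r"^port\d+$")),
    ("endpoint", re.compile(r"^endpoint\d+$")),
    ("mem", re.compile(r"^mem\d+$")),
]


def group_devices(names):
    """Group device names by kind; unknown names go under 'other'."""
    groups = defaultdict(list)
    for name in sorted(names):
        for kind, pattern in PATTERNS:
            if pattern.match(name):
                groups[kind].append(name)
                break
        else:
            groups["other"].append(name)
    return dict(groups)


# Names taken from the single-device diagram above.
names = ["root0", "port1", "endpoint0", "decoder0.0", "decoder1.0",
         "decoder3.0", "region0", "dax_region0"]
for kind, members in group_devices(names).items():
    print(f"{kind:11s}: {' '.join(members)}")
```

On a live system the input could come from
os.listdir("/sys/bus/cxl/devices").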


Much of the complexity in region creation stems from validating decoder
programming and associating regions with targets (endpoint decoders).

The take-away from this section is the existence of "decoders", of which
there may be an arbitrary number between the root and endpoint.

This will be relevant when we talk about RAS (Poison) and Interleave.


---------------------------------------------------------------
Step 6: DAX surfacing Memory Blocks - First bit of User Policy.
---------------------------------------------------------------

The last step in surfacing memory to allocators is to convert a dax
device into memory blocks. On most default kernel builds, dax devices
are not automatically converted to SystemRAM.

Policy Choices
   userland policy:  daxctl
   default-online :  CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE
                     or
                     CONFIG_MHP_DEFAULT_ONLINE_TYPE_*
                     or
                     memhp_default_state=*

To convert a dax device to SystemRAM utilizing daxctl:

  daxctl online-memory dax0.0 [--no-movable]

  By default the memory will online into ZONE_MOVABLE.
  The --no-movable option will online the memory in ZONE_NORMAL.


Alternatively, this can be done at Build or Boot time using
  CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE   (v6.13 or below)
  CONFIG_MHP_DEFAULT_ONLINE_TYPE_*       (v6.14 or above)
  memhp_default_state=*                  (boot param predating cxl)

I will save the discussion of ZONE selection to the next section,
which will cover more memory-hotplug specifics.

At this point, the memory blocks are exposed to the kernel mm allocators
and may be used as normal System RAM.
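
To see which policy is in effect and whether the blocks actually came
online, the memory-hotplug sysfs directory can be summarized.  The
sketch below is a minimal example that reads `auto_online_blocks`,
`block_size_bytes`, and each `memoryN/state` file.  The function name
is illustrative, and the directory is a parameter so the code can be
pointed at a copy of the tree.

```python
import os
import re
from collections import Counter


def memory_block_summary(memory_dir="/sys/devices/system/memory"):
    """Summarize hotplug policy and block states found under memory_dir."""

    def read(*parts):
        with open(os.path.join(memory_dir, *parts)) as f:
            return f.read().strip()

    # Count memory blocks by state (e.g. "online", "offline").
    states = Counter()
    for entry in os.listdir(memory_dir):
        if re.fullmatch(r"memory\d+", entry):
            states[read(entry, "state")] += 1

    return {
        "auto_online_blocks": read("auto_online_blocks"),
        # block_size_bytes is reported in hexadecimal
        "block_size_bytes": int(read("block_size_bytes"), 16),
        "states": dict(states),
    }
```

Calling memory_block_summary() with no arguments before and after
`daxctl online-memory` should show the count of online blocks change
if the conversion worked.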


---------------------------------------------------------
Second bit of nuanced complexity: Memory Block Alignment.
---------------------------------------------------------
In section 1, we introduced CEDT / CFMW and how they map to iomem
resources.  In this section we discussed how we surface memory blocks
to the kernel allocators.

However, at no time did platform, arch code, and driver communicate
about the expected size of a memory block. In most cases, the size
of a memory block is defined by the architecture - unaware of CXL.

On x86, for example, the heuristic for memory block size is:
   1) user boot-arg value
   2) Maximize size (up to 2GB) if operating on bare metal
   3) Use largest value that aligns with the end of memory

The problem is that [SOFT RESERVED] memory is not considered in the
alignment calculation - and not all [SOFT RESERVED] memory *should*
be considered for alignment.

In the case of our working example (real system, btw):

         Subtable Type : 01 [CXL Fixed Memory Window Structure]
   Window base address : 000000C050000000
           Window size : 0000003CA0000000

The base is 256MB aligned (the minimum for the CXL Spec), and the
window size is 512MB aligned.  With 2GB memory blocks, trimming the
window to block boundaries loses more than a full memory block worth
of memory (768MB on the front, and 1792MB on the back).

This is a loss of ~1% of capacity (2.5GB) for that region (242.5GB).
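
The trimming can be checked numerically.  This is a minimal sketch,
assuming the window is shrunk inward to memory block boundaries (base
rounded up, end rounded down).  The function name is illustrative,
and the 128MB minimum block size is an assumption about x86 that is
not stated above.

```python
MIB = 1 << 20
GIB = 1 << 30


def trim_to_blocks(base, size, block_size):
    """Shrink [base, base + size) inward to block_size boundaries.

    Returns (front_loss, back_loss, usable) in bytes.  Assumes the
    window spans at least one full aligned block.
    """
    end = base + size                                    # exclusive end
    aligned_start = -(-base // block_size) * block_size  # round base up
    aligned_end = (end // block_size) * block_size       # round end down
    return aligned_start - base, end - aligned_end, aligned_end - aligned_start


# Window base and size from the CFMW dump above.
base, size = 0xC050000000, 0x3CA0000000

# 128MB: assumed x86 minimum block size; 2GB: the bare-metal maximum.
for block_size in (128 * MIB, 2 * GIB):
    front, back, usable = trim_to_blocks(base, size, block_size)
    print(f"block={block_size // MIB}MB front={front // MIB}MB "
          f"back={back // MIB}MB usable={usable / GIB:.2f}GB")
```

Running it with other block sizes shows how quickly the loss shrinks
once the block size matches the window's natural alignment.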

[1] has been proposed to allow for drivers (specifically ACPI) to advise
the memory hotplug system on the suggested alignment, and for arch code
to choose how to utilize this advisement.

[1] https://lore.kernel.org/linux-mm/20250127153405.3379117-1-gourry@gourry.net/


--------------------------------------------------------------------
The Complexity story up 'til now (what's likely to show up in slides)
--------------------------------------------------------------------
Platform and BIOS:
  May configure all the devices prior to kernel hand-off.
  May or may not support reconfiguring / hotplug.
BIOS and EFI:
  EFI_MEMORY_SP              - used to defer management to drivers
Kernel Build and Boot:
  CONFIG_EFI_SOFT_RESERVE=n  - Will always result in CXL as SystemRAM
  nosoftreserve              - Will always result in CXL as SystemRAM
  kexec                      - SystemRAM configs carry over to target
Driver Build Options Required:
  CONFIG_CXL_ACPI
  CONFIG_CXL_BUS
  CONFIG_CXL_MEM
  CONFIG_CXL_PCI
  CONFIG_CXL_PORT
  CONFIG_CXL_REGION
  CONFIG_DEV_DAX
  CONFIG_DEV_DAX_CXL
  CONFIG_DEV_DAX_KMEM
User Policy:
  CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE (<=v6.13)
  CONFIG_MHP_DEFAULT_ONLINE_TYPE_*     (>=v6.14)
  memhp_default_state                  (boot param)
  daxctl online-memory daxN.Y          (userland)
Nuances:
  Early-boot resource re-use
  Memory Block Alignment

--------------------------------------------------------------------
Next Up:
   Memory (Block) Hotplug - Zones and Kernel Use of CXL
   RAS - Poison, MCE, and why you probably want CXL=ZONE_MOVABLE
   Interleave - RAS and Region Management (Hotplug-ability)

~Gregory

