From: Dan Williams <dan.j.williams@intel.com>
To: Gregory Price <gourry@gourry.net>, <lsf-pc@lists.linux-foundation.org>
Cc: <linux-mm@kvack.org>, <linux-cxl@vger.kernel.org>,
	<linux-kernel@vger.kernel.org>
Subject: Re: CXL Boot to Bash - Section 2: The Drivers
Date: Wed, 5 Feb 2025 16:47:17 -0800
Message-ID: <67a4069572eab_2d2c294d4@dwillia2-xfh.jf.intel.com.notmuch>
In-Reply-To: <Z6OMcLt3SrsZjgvw@gourry-fedora-PF4VCD3F>

Gregory Price wrote:
> (background reading as we build up complexity)

Thanks for this taxonomy!

> 
> Driver Management - Decoders, HPA/SPA, DAX, and RAS.
> 
> The Drivers
> ===========
> ----------------------
> The Story Up 'til Now.
> ----------------------
> 
> When we left the Platform arena, assuming we've configured with special
> purpose memory, we are left with an entry in the memory map like so:
> 
> BIOS-e820:   [mem 0x000000c050000000-0x000000fcefffffff] soft reserved
> /proc/iomem: c050000000-fcefffffff : Soft Reserved
> 
> This resource (see mm/resource.c) is left unused until a driver comes
> along to actually surface it to allocators (or some other interface).
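
As an illustrative aside, here is a minimal Python sketch that picks such
entries out of /proc/iomem-style text. The function name and the sample
string are made up for illustration. Note that reading the real
/proc/iomem without root typically shows zeroed addresses.

```python
import re

# Matches lines such as "c050000000-fcefffffff : Soft Reserved".
# Nested entries in /proc/iomem are indented, hence the leading \s*.
_IOMEM_LINE = re.compile(r"^\s*([0-9a-fA-F]+)-([0-9a-fA-F]+) : (.*)$")


def soft_reserved_ranges(iomem_text):
    """Return (start, end, size) tuples for every Soft Reserved entry.

    `end` is inclusive, as in /proc/iomem, so size = end - start + 1.
    """
    ranges = []
    for line in iomem_text.splitlines():
        match = _IOMEM_LINE.match(line)
        if match and match.group(3).strip() == "Soft Reserved":
            start = int(match.group(1), 16)
            end = int(match.group(2), 16)
            ranges.append((start, end, end - start + 1))
    return ranges


if __name__ == "__main__":
    # Sample line copied from the memory map entry quoted above.
    sample = "c050000000-fcefffffff : Soft Reserved\n"
    for start, end, size in soft_reserved_ranges(sample):
        print(f"{start:#x}-{end:#x} size {size:#x}")
```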
> 
> In our case, the drivers involved (or at least the ones we'll reference) are:
> 
> drivers/base/     : device probing, memory (block) hotplug
> drivers/acpi/     : device hotplug
> drivers/acpi/numa : NUMA ACPI table info (SRAT, CEDT, HMAT, ...)
> drivers/pci/      : PCI device probing
> drivers/cxl/      : CXL device probing
> drivers/dax/      : cxl device to memory resource association
> 
> We don't necessarily care about the specifics of each driver; we'll
> focus on just the aspects that ultimately affect memory management.
> 
> -------------------------------
> Step 4: Basic build complexity.
> -------------------------------
> To make a long story short:
> 
> CXL Build Configurations:
>   CONFIG_CXL_ACPI
>   CONFIG_CXL_BUS
>   CONFIG_CXL_MEM
>   CONFIG_CXL_PCI
>   CONFIG_CXL_PORT
>   CONFIG_CXL_REGION
> 
> DAX Build Configurations:
>   CONFIG_DEV_DAX
>   CONFIG_DEV_DAX_CXL
>   CONFIG_DEV_DAX_KMEM
> 
> Without all of these enabled, your journey will end up cut short because
> some piece of the probe process will stop progressing.
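
A quick way to check for this is to scan a kernel .config-style file for
the options listed above. The sketch below is only an illustration: the
function name is made up, and it assumes the usual "CONFIG_FOO=y" /
"# CONFIG_FOO is not set" file format, typically found in places such as
/boot/config-$(uname -r) or /proc/config.gz, depending on the
distribution.

```python
# Options taken from the two lists quoted above.
REQUIRED_OPTIONS = (
    "CONFIG_CXL_ACPI",
    "CONFIG_CXL_BUS",
    "CONFIG_CXL_MEM",
    "CONFIG_CXL_PCI",
    "CONFIG_CXL_PORT",
    "CONFIG_CXL_REGION",
    "CONFIG_DEV_DAX",
    "CONFIG_DEV_DAX_CXL",
    "CONFIG_DEV_DAX_KMEM",
)


def missing_options(config_text, required=REQUIRED_OPTIONS):
    """Return the required options that are neither =y nor =m."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        # "# CONFIG_FOO is not set" and other comments are skipped.
        if line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        if value in ("y", "m"):
            enabled.add(name)
    return [opt for opt in required if opt not in enabled]


if __name__ == "__main__":
    sample = "CONFIG_CXL_BUS=y\n# CONFIG_DEV_DAX_CXL is not set\n"
    print(missing_options(sample))
```

An empty list means every option in the two lists is built in or
available as a module. It does not prove the modules are actually loaded.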
> 
> The most common misconfiguration I run into is CONFIG_DEV_DAX_CXL not
> being enabled. You end up with memory regions without dax devices.
> 
> [/sys/bus/cxl/devices]# ls
> dax_region0  decoder0.0  decoder1.0  decoder2.0 .....
> dax_region1  decoder0.1  decoder1.1  decoder3.0 .....
> 
> ^^^ These dax regions require `CONFIG_DEV_DAX_CXL` enabled to fully
> surface as dax devices, which can then be converted to system ram.

At least for this problem the plan is to fall back to
CONFIG_DEV_DAX_HMEM [1] which skips all of the RAS and device
enumeration benefits and just shunts EFI_MEMORY_SP over to device_dax.

There is also the panic button of efi=nosoftreserve which is the flag of
surrender if the kernel fails to parse the CXL configuration.

I am otherwise open to suggestions about a better model for how to
handle a type of memory capacity that elicits diverging opinions on
whether it should be treated as System RAM, dedicated application
memory, or some kind of cold-memory swap target.

[1]: http://lore.kernel.org/cover.1737046620.git.nathan.fontenot@amd.com

> ---------------------------------------------------------------
> Step 5: The CXL driver associating devices and iomem resources.
> ---------------------------------------------------------------
> 
> The CXL driver wires up the following devices:
>    root        :  CXL root
>    portN       :  An intermediate or endpoint destination for accesses
>    memN        :  memory devices
> 
> 
> Each device in the hierarchy may have one or more decoders
>    decoderN.M  :  Address routing and translation devices
> 
> 
> The driver will also create additional objects and associations
>    regionN     :  device-to-iomem resource mapping
>    dax_regionN :  region-to-dax device mapping
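
For orientation when staring at a long `ls /sys/bus/cxl/devices`
listing, the sketch below groups entry names by the object kinds named
above. It is only a naming-convention helper: the patterns are
assumptions based on the names shown in this mail (plus "endpointN" from
the diagram further down), and the function name is made up.

```python
import re
from collections import defaultdict

# (kind, pattern) pairs. Every pattern is anchored, so order only
# matters for readability.
_PATTERNS = (
    ("dax_region", re.compile(r"^dax_region\d+$")),
    ("region", re.compile(r"^region\d+$")),
    ("decoder", re.compile(r"^decoder\d+\.\d+$")),
    ("endpoint", re.compile(r"^endpoint\d+$")),
    ("port", re.compile(r"^port\d+$")),
    ("mem", re.compile(r"^mem\d+$")),
    ("root", re.compile(r"^root\d*$")),
)


def group_cxl_devices(names):
    """Group device names by kind; unknown names land in 'other'."""
    groups = defaultdict(list)
    for name in names:
        for kind, pattern in _PATTERNS:
            if pattern.match(name):
                groups[kind].append(name)
                break
        else:
            groups["other"].append(name)
    return dict(groups)


if __name__ == "__main__":
    listing = ["dax_region0", "decoder0.0", "decoder1.0", "region0"]
    for kind, members in group_cxl_devices(listing).items():
        print(kind, members)
```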
> 
> 
> Most associations built by the driver are done by validating decoders
> against each other at each point in the hierarchy.
> 
>   Root decoders describe memory regions and route DMA to ports.
>   Intermediate decoders route DMA through CXL fabric.
>   Endpoint decoders translate addresses (Host to device).
> 
> 
> A Root port has 1 decoder per associated CFMW in the CEDT
>    decoder0.0  ->  `c050000000-fcefffffff   : Soft Reserved`
> 
> 
> A region (iomem resource mapping) can be created for these decoders
>    [/sys/bus/cxl/devices/region0]# cat resource size target0
>       0xc050000000   0x3ca0000000   decoder5.0
> 
> 
> A dax_region surfaces these regions as a dax device
>    [/sys/bus/cxl/devices/dax_region0/dax0.0]# cat resource
>       0xc050000000
> 
> 
> So in a simple environment with 1 device, we end up with a mapping
> that looks something like this.
> 
>      root      ---   decoder0.0  --- region0 -- dax_region0 -- dax0
>        |                |              |
>      port1     ---   decoder1.0        |
>        |                |              |
>      endpoint0 ---   decoder3.0--------/
> 
> 
> Much of the complexity in region creation stems from validating decoder
> programming and associating regions with targets (endpoint decoders).
> 
> The take-away from this section is the existence of "decoders", of which
> there may be an arbitrary number between the root and endpoint.
> 
> This will be relevant when we talk about RAS (Poison) and Interleave.

Good summary. I often look at this pile of objects and wonder "why so
complex", but then I look at the heroics of drivers/edac/. Compared to
that wide range of implementation-specific quirks of various memory
controllers, the CXL object hierarchy does not look that bad.

> ---------------------------------------------------------------
> Step 6: DAX surfacing Memory Blocks - First bit of User Policy.
> ---------------------------------------------------------------
> 
> The last step in surfacing memory to allocators is to convert a dax
> device into memory blocks. On most default kernel builds, dax devices
> are not automatically converted to SystemRAM.

I thought most distributions are shipping with
CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE, or the default online udev rule?
For example Fedora is CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=y and RHEL is
CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=n, but with the udev hotplug rule.

> Policy Choices
>    userland policy:  daxctl
>    default-online :  CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE
>                      or
>                      CONFIG_MHP_DEFAULT_ONLINE_TYPE_*
>                      or
>                      memhp_default_state=*
> 
> To convert a dax device to SystemRAM utilizing daxctl:
> 
>   daxctl online-memory dax0.0 [--no-movable]

On RHEL at least it finds that udev already took care of it.

> 
>   By default the memory will online into ZONE_MOVABLE
>   The --no-movable option will online the memory in ZONE_NORMAL
> 
> 
> Alternatively, this can be done at Build or Boot time using
>   CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE   (v6.13 or below)
>   CONFIG_MHP_DEFAULT_ONLINE_TYPE_*       (v6.14 or above)
>   memhp_default_state=*                  (boot param predating cxl)
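
To see which boot-time policy a system was started with, the kernel
command line can be scanned for memhp_default_state=. The sketch below
is a hedged illustration: the function name is made up, it takes the
command line as a string (for example the contents of /proc/cmdline),
and it assumes the last occurrence wins and that unrecognized values
are ignored.

```python
# Values accepted by memhp_default_state=.
VALID_STATES = ("offline", "online", "online_kernel", "online_movable")


def memhp_default_state(cmdline):
    """Return the memhp_default_state value on cmdline, or None.

    Assumption: when the parameter is repeated, the last valid
    occurrence takes effect; invalid values are skipped.
    """
    state = None
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key == "memhp_default_state" and sep and value in VALID_STATES:
            state = value
    return state


if __name__ == "__main__":
    print(memhp_default_state("ro quiet memhp_default_state=online_movable"))
```

A result of None only says the boot parameter was not given. The build
time default and any udev rule still apply.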

Oh, TIL the new CONFIG_MHP_DEFAULT_ONLINE_TYPE_* option.

> 
> I will save the discussion of ZONE selection to the next section,
> which will cover more memory-hotplug specifics.
> 
> At this point, the memory blocks are exposed to the kernel mm allocators
> and may be used as normal System RAM.
> 
> 
> ---------------------------------------------------------
> Second bit of nuanced complexity: Memory Block Alignment.
> ---------------------------------------------------------
> In section 1, we introduced CEDT / CFMW and how they map to iomem
> resources.  In this section we discussed how we surface memory blocks
> to the kernel allocators.
> 
> However, at no time did platform, arch code, and driver communicate
> about the expected size of a memory block. In most cases, the size
> of a memory block is defined by the architecture - unaware of CXL.
> 
> On x86, for example, the heuristic for memory block size is:
>    1) user boot-arg value
>    2) Maximize size (up to 2GB) if operating on bare metal
>    3) Use smallest value that aligns with the end of memory
> 
> The problem is that [SOFT RESERVED] memory is not considered in the
> alignment calculation - and not all [SOFT RESERVED] memory *should*
> be considered for alignment.
> 
> In the case of our working example (real system, btw):
> 
>          Subtable Type : 01 [CXL Fixed Memory Window Structure]
>    Window base address : 000000C050000000
>            Window size : 0000003CA0000000
> 
> The base is 256MB aligned (the minimum for the CXL Spec), and the
> window size is 512MB aligned.  This results in a loss of almost a full
> memory block worth of memory (~1280MB on the front, and ~512MB on the
> back).
> 
> This is a loss of ~0.7% of capacity (1.5GB) for that region (121.25GB).
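
The arithmetic behind this kind of loss can be sketched as follows,
assuming that only whole, block-aligned memory blocks can be onlined.
The 2GB block size used in the demo is an assumption (the x86 bare-metal
maximum from the heuristic above); the block size of the quoted system is
not stated here, and the result changes with it. Function names are made
up for illustration.

```python
GIB = 1 << 30
MIB = 1 << 20


def align_up(value, alignment):
    """Round value up to the next multiple of alignment."""
    return -(-value // alignment) * alignment


def align_down(value, alignment):
    """Round value down to the previous multiple of alignment."""
    return value // alignment * alignment


def alignment_loss(base, size, block_size):
    """Return (front, back) bytes that fall outside whole memory blocks.

    If no whole block fits inside the window, the entire window is
    reported as front loss.
    """
    end = base + size  # exclusive end of the window
    usable_start = align_up(base, block_size)
    usable_end = align_down(end, block_size)
    if usable_end <= usable_start:
        return size, 0
    return usable_start - base, end - usable_end


if __name__ == "__main__":
    # Window base and size from the CFMWS quoted above; 2GB is assumed.
    front, back = alignment_loss(0xC050000000, 0x3CA0000000, 2 * GIB)
    print(f"front: {front // MIB} MB, back: {back // MIB} MB")
```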

This feels like an example of "hey platform vendors, I understand
that the spec grants you the freedom to misalign, please refrain from
taking advantage of that freedom".

> 
> [1] has been proposed to allow for drivers (specifically ACPI) to advise
> the memory hotplug system on the suggested alignment, and for arch code
> to choose how to utilize this advisement.
> 
> [1] https://lore.kernel.org/linux-mm/20250127153405.3379117-1-gourry@gourry.net/
> 
> 
> --------------------------------------------------------------------
> The Complexity story up til now (what's likely to show up in slides)
> --------------------------------------------------------------------
> Platform and BIOS:
>   May configure all the devices prior to kernel hand-off.
>   May or may not support reconfiguring / hotplug.
> BIOS and EFI:
>   EFI_MEMORY_SP              - used to defer management to drivers
> Kernel Build and Boot:
>   CONFIG_EFI_SOFT_RESERVE=n  - Will always result in CXL as SystemRAM
>   nosoftreserve              - Will always result in CXL as SystemRAM
>   kexec                      - SystemRAM configs carry over to target
> Driver Build Options Required
>   CONFIG_CXL_ACPI
>   CONFIG_CXL_BUS
>   CONFIG_CXL_MEM
>   CONFIG_CXL_PCI
>   CONFIG_CXL_PORT
>   CONFIG_CXL_REGION
>   CONFIG_DEV_DAX
>   CONFIG_DEV_DAX_CXL
>   CONFIG_DEV_DAX_KMEM
> User Policy
>   CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE (<=v6.13)
>   CONFIG_MHP_DEFAULT_ONLINE_TYPE       (>=v6.14)
>   memhp_default_state                  (boot param)
>   daxctl online-memory daxN.Y          (userland)

    memory hotplug udev rule             (userland)

> Nuances
>   Early-boot resource re-use
>   Memory Block Alignment
> 
> --------------------------------------------------------------------
> Next Up:
>    Memory (Block) Hotplug - Zones and Kernel Use of CXL
>    RAS - Poison, MCE, and why you probably want CXL=ZONE_MOVABLE
>    Interleave - RAS and Region Management (Hotplug-ability)

Really appreciate you organizing all of this information.

