Re: [LSF/MM] CXL Boot to Bash - Section 0: ACPI and Linux Resources

From: Yuquan Wang <wangyuquan1236@phytium.com.cn>
To: Gregory Price <gourry@gourry.net>
Cc: lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org,
	linux-cxl@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [LSF/MM] CXL Boot to Bash - Section 0: ACPI and Linux Resources
Date: Thu, 6 Mar 2025 09:37:49 +0800
Message-ID: <Z8j8bZ5TS+gDV8+M@phytium.com.cn>
In-Reply-To: <Z8jORKIWC3ZwtzI4@gourry-fedora-PF4VCD3F>

On Wed, Mar 05, 2025 at 05:20:52PM -0500, Gregory Price wrote:
> ====
> SRAT
> ====
> The System/Static Resource Affinity Table describes resource (CPU,
> Memory) affinity to "Proximity Domains". This table is technically
> optional, but for performance information (see "HMAT") to be enumerated
> by linux it must be present.
> 
> 
> # Proximity Domain
> A proximity domain is ROUGHLY equivalent to "NUMA Node" - though a
> 1-to-1 mapping is not guaranteed.  There are scenarios where "Proximity
> Domain 4" may map to "NUMA Node 3", for example.  (See "NUMA Node Creation")
> 
> # Memory Affinity
> Generally speaking, if a host does any amount of CXL fabric (decoder)
> programming in BIOS - an SRAT entry for that memory needs to be present.
> 
> ```
>         Subtable Type : 01 [Memory Affinity]
>                Length : 28
>      Proximity Domain : 00000001          <- NUMA Node 1
>             Reserved1 : 0000
>          Base Address : 000000C050000000  <- Physical Memory Region
>        Address Length : 0000003CA0000000
>             Reserved2 : 00000000
> Flags (decoded below) : 0000000B
>              Enabled : 1
>        Hot Pluggable : 1
>         Non-Volatile : 0
> ```
> 
> # Generic Initiator / Port
> In the scenario where CXL devices are not present or configured by
> BIOS, we may still want to generate proximity domain configurations
> for those devices.   The Generic Initiator interfaces are intended to
> fill this gap, so that performance information can still be utilized
> when the devices become available at runtime.
> 
> I won't cover the details here, for now, but I will link to the
> proposal from Dan Williams and Jonathan Cameron if you would like
> more information.
> https://lore.kernel.org/all/e1a52da9aec90766da5de51b1b839fd95d63a5af.camel@intel.com/
> 
> ====
> HMAT
> ====
> The Heterogeneous Memory Attributes Table contains information such as
> cache attributes and bandwidth and latency details for memory proximity
> domains.  For the purpose of this document, we will only discuss the
> SLLBI entry.
> 
> # SLLBI
> The System Locality Latency and Bandwidth Information records latency
> and bandwidth information for proximity domains. This table is used by
> Linux to configure interleave weights and memory tiers.
> 
> ```
> Heavily truncated for brevity
>               Structure Type : 0001 [SLLBI]
>                    Data Type : 00         <- Latency
> Target Proximity Domain List : 00000000
> Target Proximity Domain List : 00000001
>                        Entry : 0080       <- DRAM LTC
>                        Entry : 0100       <- CXL LTC
> 
>               Structure Type : 0001 [SLLBI]
>                    Data Type : 03         <- Bandwidth
> Target Proximity Domain List : 00000000
> Target Proximity Domain List : 00000001
>                        Entry : 1200       <- DRAM BW
>                        Entry : 0200       <- CXL BW
> ```
> 
> 
> ---------------------------------
> Part 00: Linux Resource Creation.
> ---------------------------------
> 
> ==================
> NUMA node creation
> ==================
> NUMA nodes are *NOT* hot-pluggable.  All *POSSIBLE* NUMA nodes are
> identified at `__init` time, more specifically during `mm_init`.
> 
> What this means is that the CEDT and SRAT must contain sufficient
> `proximity domain` information for linux to identify how many NUMA
> nodes are required (and what memory regions to associate with them).
> 
Hi, Gregory.

Recently, I found a corner case in CXL NUMA node creation.

Conditions:
1) A UMA/NUMA system where the SRAT is absent but CEDT.CFMWS is present
2) CONFIG_ACPI_NUMA is enabled

Results:
1) acpi_numa_init(): fake_pxm will be 0 and is passed to acpi_parse_cfmws()
2) If a CXL RAM region is created dynamically, the CXL memory is assigned
to node 0 rather than to a new fake node.

Questions:
1) Does using CXL memory require a NUMA system with an SRAT? As you
mentioned in the SRAT section:

"This table is technically optional, but for performance information
to be enumerated by linux it must be present."

Hence, as I understand it, this looks like a bug in the kernel.

2) If it is a bug, could we forbid this situation by adding a fake_pxm
check to acpi_numa_init() and returning an error?

3) If not, maybe we can add some kernel logic to allow creating these fake
nodes on a system without an SRAT?

Yuquan
> The relevant code exists in: linux/drivers/acpi/numa/srat.c
> ```
> static int __init
> acpi_parse_memory_affinity(union acpi_subtable_headers *header,
>                            const unsigned long table_end)
> {
> ... heavily truncated for brevity
>         pxm = ma->proximity_domain;
>         node = acpi_map_pxm_to_node(pxm);
>         if (numa_add_memblk(node, start, end) < 0)
>             ....
>         node_set(node, numa_nodes_parsed);    <--- mark node N_POSSIBLE
> }
> 
> static int __init acpi_parse_cfmws(union acpi_subtable_headers *header,
>                                    void *arg, const unsigned long table_end)
> {
> ... heavily truncated for brevity
>         /*
>          * The SRAT may have already described NUMA details for all,
>          * or a portion of, this CFMWS HPA range. Extend the memblks
>          * found for any portion of the window to cover the entire
>          * window.
>          */
>         if (!numa_fill_memblks(start, end))
>                 return 0;
> 
>         /* No SRAT description. Create a new node. */
>         node = acpi_map_pxm_to_node(*fake_pxm);
>         if (numa_add_memblk(node, start, end) < 0)
> 	        ....
>         node_set(node, numa_nodes_parsed);    <--- mark node N_POSSIBLE
> }
> 
> int __init acpi_numa_init(void)
> {
> ...
>     if (!acpi_table_parse(ACPI_SIG_SRAT, acpi_parse_srat)) {
>         cnt = acpi_table_parse_srat(ACPI_SRAT_TYPE_MEMORY_AFFINITY,
>                                     acpi_parse_memory_affinity, 0);
>     }
>     /* fake_pxm is the next unused PXM value after SRAT parsing */
>     acpi_table_parse_cedt(ACPI_CEDT_TYPE_CFMWS, acpi_parse_cfmws,
>                           &fake_pxm);
> 
> ```
> 
> Basically, the heuristic is as follows:
> 1) Add one NUMA node per Proximity Domain described in SRAT
> 2) If the SRAT describes all memory described by all CFMWS
>    - do not create nodes for CFMWS
> 3) If SRAT does not describe all memory described by CFMWS
>    - create a node for that CFMWS
> 
> Generally speaking, you will see one NUMA node per Host bridge, unless
> inter-host-bridge interleave is in use (see Section 4 - Interleave).
> 
> 
> ============
> Memory Tiers
> ============
> The `abstract distance` of a node dictates what tier it lands in (and
> therefore, what tiers are created).  This is calculated based on the
> following heuristic, using HMAT data:
> 
> ```
> int mt_perf_to_adistance(struct access_coordinate *perf, int *adist)
> {
>  ...
>     /*
>      * The abstract distance of a memory node is in direct proportion to
>      * its memory latency (read + write) and inversely proportional to its
>      * memory bandwidth (read + write).  The abstract distance, memory
>      * latency, and memory bandwidth of the default DRAM nodes are used as
>      * the base.
>      */
>     *adist = MEMTIER_ADISTANCE_DRAM *
>         (perf->read_latency + perf->write_latency) /
>         (default_dram_perf.read_latency + default_dram_perf.write_latency) *
>         (default_dram_perf.read_bandwidth + default_dram_perf.write_bandwidth) /
>         (perf->read_bandwidth + perf->write_bandwidth);
>     return 0;
> }
> ```
> 
> Debugging hint: If you have DRAM and CXL memory in separate numa nodes
>                 but only find 1 memory tier, validate the HMAT!
> 
> 
> ============================
> Memory Tier Demotion Targets
> ============================
> When `demotion` is enabled (see Section 5 - allocation), the reclaim
> system may opportunistically demote a page from one memory tier to
> another.  The selection of a `demotion target` is partially based on
> Abstract Distance and Performance Data.
> 
> ```
> An example of demotion targets from memory-tiers.c
> /* Example 1:
>  *
>  * Node 0 & 1 are CPU + DRAM nodes, node 2 & 3 are PMEM nodes.
>  *
>  * node distances:
>  * node   0    1    2    3
>  *    0  10   20   30   40
>  *    1  20   10   40   30
>  *    2  30   40   10   40
>  *    3  40   30   40   10
>  *
>  * memory_tiers0 = 0-1
>  * memory_tiers1 = 2-3
>  *
>  * node_demotion[0].preferred = 2
>  * node_demotion[1].preferred = 3
>  * node_demotion[2].preferred = <empty>
>  * node_demotion[3].preferred = <empty>
>  */
> ```
> 
> =============================
> Mempolicy Weighted Interleave
> =============================
> The `weighted interleave` functionality of `mempolicy` utilizes weights
> to distribute memory across NUMA nodes according to some set weight.
> There is a proposal to auto-configure these weights based on HMAT data.
> 
> https://lore.kernel.org/linux-mm/20250305200506.2529583-1-joshua.hahnjy@gmail.com/T/#u
> 
> See Section 4 - Interleave, for more information on weighted interleave.
> 
> 
> 
> --------------
> Build Options.
> --------------
> We can add these build configurations to our complexity picture.
> 
> CONFIG_NUMA        - req for ACPI numa, mempolicy, and memory tiers
> CONFIG_ACPI_NUMA   -- enables srat and cedt parsing
> CONFIG_ACPI_HMAT   -- enables hmat parsing
> 
> 
> ~Gregory
