From: Yuquan Wang <wangyuquan1236@phytium.com.cn>
To: Gregory Price <gourry@gourry.net>
Cc: lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org,
linux-cxl@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [LSF/MM] CXL Boot to Bash - Section 0: ACPI and Linux Resources
Date: Thu, 6 Mar 2025 09:37:49 +0800 [thread overview]
Message-ID: <Z8j8bZ5TS+gDV8+M@phytium.com.cn> (raw)
In-Reply-To: <Z8jORKIWC3ZwtzI4@gourry-fedora-PF4VCD3F>
On Wed, Mar 05, 2025 at 05:20:52PM -0500, Gregory Price wrote:
> ====
> SRAT
> ====
> The System/Static Resource Affinity Table describes resource (CPU,
> Memory) affinity to "Proximity Domains". This table is technically
> optional, but for performance information (see "HMAT") to be enumerated
> by Linux it must be present.
>
>
> # Proximity Domain
> A proximity domain is ROUGHLY equivalent to "NUMA Node" - though a
> 1-to-1 mapping is not guaranteed. There are scenarios where "Proximity
> Domain 4" may map to "NUMA Node 3", for example. (See "NUMA Node Creation")
>
> # Memory Affinity
> Generally speaking, if a host does any amount of CXL fabric (decoder)
> programming in BIOS - an SRAT entry for that memory needs to be present.
>
> ```
> Subtable Type : 01 [Memory Affinity]
> Length : 28
> Proximity Domain : 00000001 <- NUMA Node 1
> Reserved1 : 0000
> Base Address : 000000C050000000 <- Physical Memory Region
> Address Length : 0000003CA0000000
> Reserved2 : 00000000
> Flags (decoded below) : 0000000B
> Enabled : 1
> Hot Pluggable : 1
> Non-Volatile : 0
> ```
>
> # Generic Initiator / Port
> In the scenario where CXL devices are not present or configured by
> BIOS, we may still want to generate proximity domain configurations
> for those devices. The Generic Initiator interfaces are intended to
> fill this gap, so that performance information can still be utilized
> when the devices become available at runtime.
>
> I won't cover the details here, for now, but I will link to the
> proposal from Dan Williams and Jonathan Cameron if you would like
> more information.
> https://lore.kernel.org/all/e1a52da9aec90766da5de51b1b839fd95d63a5af.camel@intel.com/
>
> ====
> HMAT
> ====
> The Heterogeneous Memory Attributes Table contains information such as
> cache attributes and bandwidth and latency details for memory proximity
> domains. For the purpose of this document, we will only discuss the
> SLLBI entry.
>
> # SLLBI
> The System Locality Latency and Bandwidth Information records latency
> and bandwidth information for proximity domains. This table is used by
> Linux to configure interleave weights and memory tiers.
>
> ```
> Heavily truncated for brevity
> Structure Type : 0001 [SLLBI]
> Data Type : 00 <- Latency
> Target Proximity Domain List : 00000000
> Target Proximity Domain List : 00000001
> Entry : 0080 <- DRAM LTC
> Entry : 0100 <- CXL LTC
>
> Structure Type : 0001 [SLLBI]
> Data Type : 03 <- Bandwidth
> Target Proximity Domain List : 00000000
> Target Proximity Domain List : 00000001
> Entry : 1200 <- DRAM BW
> Entry : 0200 <- CXL BW
> ```
>
>
> ---------------------------------
> Part 00: Linux Resource Creation.
> ---------------------------------
>
> ==================
> NUMA node creation
> ==================
> NUMA nodes are *NOT* hot-pluggable. All *POSSIBLE* NUMA nodes are
> identified at `__init` time, more specifically during `mm_init`.
>
> What this means is that the CEDT and SRAT must contain sufficient
> `proximity domain` information for Linux to identify how many NUMA
> nodes are required (and what memory regions to associate with them).
>
Hi, Gregory.

Recently, I found a corner case in CXL NUMA node creation.

Condition:
1) A UMA/NUMA system where the SRAT is absent, but which keeps CEDT.CFMWS
2) CONFIG_ACPI_NUMA is enabled

Results:
1) acpi_numa_init: fake_pxm will be 0 and is passed to acpi_parse_cfmws()
2) If a CXL ram region is created dynamically, the CXL memory is assigned
to node0 rather than to a fake new node.

Confusions:
1) Does CXL memory usage require a NUMA system with SRAT? As you
mentioned in the SRAT section:
"This table is technically optional, but for performance information
to be enumerated by linux it must be present."
Hence, as I understand it, this seems to be a bug in the kernel.
2) If it is a bug, could we forbid this situation by adding a fake_pxm
check and returning an error in acpi_numa_init()?
3) If not, maybe we could add some kernel logic to allow creating these
fake nodes on a system without SRAT?

Yuquan
> The relevant code exists in: linux/drivers/acpi/numa/srat.c
> ```
> static int __init
> acpi_parse_memory_affinity(union acpi_subtable_headers *header,
> const unsigned long table_end)
> {
> ... heavily truncated for brevity
> pxm = ma->proximity_domain;
> node = acpi_map_pxm_to_node(pxm);
> if (numa_add_memblk(node, start, end) < 0)
> ....
> node_set(node, numa_nodes_parsed); <--- mark node N_POSSIBLE
> }
>
> static int __init acpi_parse_cfmws(union acpi_subtable_headers *header,
> void *arg, const unsigned long table_end)
> {
> ... heavily truncated for brevity
> /*
> * The SRAT may have already described NUMA details for all,
> * or a portion of, this CFMWS HPA range. Extend the memblks
> * found for any portion of the window to cover the entire
> * window.
> */
> if (!numa_fill_memblks(start, end))
> return 0;
>
> /* No SRAT description. Create a new node. */
> node = acpi_map_pxm_to_node(*fake_pxm);
> if (numa_add_memblk(node, start, end) < 0)
> ....
> node_set(node, numa_nodes_parsed); <--- mark node N_POSSIBLE
> }
>
> int __init acpi_numa_init(void)
> {
> ...
> if (!acpi_table_parse(ACPI_SIG_SRAT, acpi_parse_srat)) {
> cnt = acpi_table_parse_srat(ACPI_SRAT_TYPE_MEMORY_AFFINITY,
> acpi_parse_memory_affinity, 0);
> }
> /* fake_pxm is the next unused PXM value after SRAT parsing */
> acpi_table_parse_cedt(ACPI_CEDT_TYPE_CFMWS, acpi_parse_cfmws,
> &fake_pxm);
>
> ```
>
> Basically, the heuristic is as follows:
> 1) Add one NUMA node per Proximity Domain described in SRAT
> 2) If the SRAT describes all memory described by all CFMWS
> - do not create nodes for CFMWS
> 3) If SRAT does not describe all memory described by CFMWS
> - create a node for that CFMWS
>
> Generally speaking, you will see one NUMA node per Host bridge, unless
> inter-host-bridge interleave is in use (see Section 4 - Interleave).
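> 
> As an illustration of the heuristic, here is a minimal user-space sketch
> (hypothetical addresses; srat_covers() is a simplified stand-in for
> numa_fill_memblks(), which in reality also extends partial overlaps to
> cover the whole window):
> 
> ```c
> #include <stdio.h>
> #include <stdbool.h>
> 
> /* A memory range in host physical address space */
> struct range { unsigned long long start, end; };
> 
> /* Does any SRAT memblk overlap this CFMWS window? */
> static bool srat_covers(const struct range *srat, int nr,
>                         const struct range *win)
> {
>         for (int i = 0; i < nr; i++)
>                 if (srat[i].start < win->end && win->start < srat[i].end)
>                         return true;
>         return false;
> }
> 
> int main(void)
> {
>         /* Hypothetical layout: SRAT describes DRAM plus one CXL window */
>         struct range srat[] = {
>                 { 0x0ULL,          0x80000000ULL   },  /* DRAM, PXM 0 */
>                 { 0xC050000000ULL, 0xFCF0000000ULL },  /* CXL,  PXM 1 */
>         };
>         struct range cfmws[] = {
>                 { 0xC050000000ULL,  0xFCF0000000ULL  }, /* in the SRAT */
>                 { 0x10000000000ULL, 0x14000000000ULL }, /* not in SRAT */
>         };
>         int fake_pxm = 2; /* next unused PXM after SRAT parsing */
> 
>         for (int i = 0; i < 2; i++) {
>                 if (srat_covers(srat, 2, &cfmws[i]))
>                         printf("CFMWS %d: covered by SRAT, no new node\n", i);
>                 else
>                         printf("CFMWS %d: new node from fake PXM %d\n",
>                                i, fake_pxm++);
>         }
>         return 0;
> }
> ```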
>
>
> ============
> Memory Tiers
> ============
> The `abstract distance` of a node dictates what tier it lands in (and
> therefore, what tiers are created). This is calculated based on the
> following heuristic, using HMAT data:
>
> ```
> int mt_perf_to_adistance(struct access_coordinate *perf, int *adist)
> {
> ...
> /*
> * The abstract distance of a memory node is in direct proportion to
> * its memory latency (read + write) and inversely proportional to its
> * memory bandwidth (read + write). The abstract distance, memory
> * latency, and memory bandwidth of the default DRAM nodes are used as
> * the base.
> */
> *adist = MEMTIER_ADISTANCE_DRAM *
> (perf->read_latency + perf->write_latency) /
> (default_dram_perf.read_latency + default_dram_perf.write_latency) *
> (default_dram_perf.read_bandwidth + default_dram_perf.write_bandwidth) /
> (perf->read_bandwidth + perf->write_bandwidth);
> return 0;
> }
> ```
>
> Debugging hint: If you have DRAM and CXL memory in separate NUMA nodes
> but only find 1 memory tier, validate the HMAT!
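> 
> Plugging the (hypothetical) numbers from the SLLBI example above into
> this formula, and assuming read and write performance each equal the
> single entry shown, with MEMTIER_ADISTANCE_DRAM at its recent-kernel
> value of 4 chunks of 256 (= 1024):
> 
> ```c
> #include <stdio.h>
> 
> /* Assumed values, mirroring include/linux/memory-tiers.h in recent kernels */
> #define MEMTIER_CHUNK_SIZE     256
> #define MEMTIER_ADISTANCE_DRAM (4 * MEMTIER_CHUNK_SIZE)
> 
> struct access_coordinate {
>         unsigned int read_latency, write_latency;
>         unsigned int read_bandwidth, write_bandwidth;
> };
> 
> /* Same arithmetic (and left-to-right integer evaluation order)
>  * as mt_perf_to_adistance() */
> static int adistance(const struct access_coordinate *dram,
>                      const struct access_coordinate *perf)
> {
>         return MEMTIER_ADISTANCE_DRAM *
>                 (perf->read_latency + perf->write_latency) /
>                 (dram->read_latency + dram->write_latency) *
>                 (dram->read_bandwidth + dram->write_bandwidth) /
>                 (perf->read_bandwidth + perf->write_bandwidth);
> }
> 
> int main(void)
> {
>         /* Values from the SLLBI example: assume read == write */
>         struct access_coordinate dram = { 128, 128, 4608, 4608 };
>         struct access_coordinate cxl  = { 256, 256,  512,  512 };
> 
>         printf("DRAM adist: %d\n", adistance(&dram, &dram)); /* 1024 */
>         /* 1024 * (512/256) * (9216/1024) = 1024 * 2 * 9 = 18432 */
>         printf("CXL  adist: %d\n", adistance(&dram, &cxl));
>         return 0;
> }
> ```
> 
> The CXL node lands roughly 18x "farther" than DRAM, which puts it in a
> lower tier.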
>
>
> ============================
> Memory Tier Demotion Targets
> ============================
> When `demotion` is enabled (see Section 5 - allocation), the reclaim
> system may opportunistically demote a page from one memory tier to
> another. The selection of a `demotion target` is partially based on
> Abstract Distance and Performance Data.
>
> ```
> An example of demotion targets from memory-tiers.c
> /* Example 1:
> *
> * Node 0 & 1 are CPU + DRAM nodes, node 2 & 3 are PMEM nodes.
> *
> * node distances:
> * node 0 1 2 3
> * 0 10 20 30 40
> * 1 20 10 40 30
> * 2 30 40 10 40
> * 3 40 30 40 10
> *
> * memory_tiers0 = 0-1
> * memory_tiers1 = 2-3
> *
> * node_demotion[0].preferred = 2
> * node_demotion[1].preferred = 3
> * node_demotion[2].preferred = <empty>
> * node_demotion[3].preferred = <empty>
> */
> ```
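> 
> The selection logic can be approximated by a small sketch (not the
> kernel's actual implementation, which also builds fallback lists): for
> each node, prefer the closest node in the next-lower tier. Using the
> distance matrix from the example above:
> 
> ```c
> #include <stdio.h>
> 
> #define NR_NODES  4
> #define NO_TARGET -1
> 
> /* Node distance matrix from the memory-tiers.c example */
> static const int node_distance[NR_NODES][NR_NODES] = {
>         { 10, 20, 30, 40 },
>         { 20, 10, 40, 30 },
>         { 30, 40, 10, 40 },
>         { 40, 30, 40, 10 },
> };
> 
> /* tier[n]: 0 = DRAM tier (nodes 0-1), 1 = PMEM tier (nodes 2-3) */
> static const int tier[NR_NODES] = { 0, 0, 1, 1 };
> 
> /* Pick the closest node in the next-lower tier, or NO_TARGET */
> static int preferred_demotion_target(int node)
> {
>         int best = NO_TARGET;
> 
>         for (int t = 0; t < NR_NODES; t++) {
>                 if (tier[t] != tier[node] + 1)
>                         continue;
>                 if (best == NO_TARGET ||
>                     node_distance[node][t] < node_distance[node][best])
>                         best = t;
>         }
>         return best;
> }
> 
> int main(void)
> {
>         /* Prints 2, 3, -1, -1 -- matching the example above */
>         for (int n = 0; n < NR_NODES; n++)
>                 printf("node_demotion[%d].preferred = %d\n",
>                        n, preferred_demotion_target(n));
>         return 0;
> }
> ```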
>
> =============================
> Mempolicy Weighted Interleave
> =============================
> The `weighted interleave` functionality of `mempolicy` distributes
> memory across NUMA nodes according to configurable per-node weights.
> There is a proposal to auto-configure these weights based on HMAT data.
>
> https://lore.kernel.org/linux-mm/20250305200506.2529583-1-joshua.hahnjy@gmail.com/T/#u
>
> See Section 4 - Interleave, for more information on weighted interleave.
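> 
> One plausible way such weights could be derived (a sketch of the
> general idea only, not the code in the linked proposal): weight each
> node by its HMAT bandwidth, reduced by the greatest common divisor.
> With the hypothetical SLLBI bandwidths above:
> 
> ```c
> #include <stdio.h>
> 
> static unsigned int gcd(unsigned int a, unsigned int b)
> {
>         while (b) {
>                 unsigned int t = a % b;
>                 a = b;
>                 b = t;
>         }
>         return a;
> }
> 
> int main(void)
> {
>         /* Hypothetical read + write bandwidths from the SLLBI example */
>         unsigned int bw[] = { 9216, 1024 }; /* node 0 DRAM, node 1 CXL */
>         unsigned int g = gcd(bw[0], bw[1]);
> 
>         /* Weight each node proportionally to its bandwidth share:
>          * yields 9 pages on node 0 for every 1 page on node 1 */
>         for (int n = 0; n < 2; n++)
>                 printf("node%d weight: %u\n", n, bw[n] / g);
>         return 0;
> }
> ```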
>
>
>
> --------------
> Build Options.
> --------------
> We can add these build configurations to our complexity picture.
>
> CONFIG_NUMA      -- required for ACPI NUMA, mempolicy, and memory tiers
> CONFIG_ACPI_NUMA -- enables SRAT and CEDT parsing
> CONFIG_ACPI_HMAT -- enables HMAT parsing
>
>
> ~Gregory