Date: Wed, 5 Mar 2025 17:20:52 -0500
From: Gregory Price
To: lsf-pc@lists.linux-foundation.org
Cc: linux-mm@kvack.org, linux-cxl@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [LSF/MM] CXL Boot to Bash - Section 0: ACPI and Linux Resources

--------------------
Part 0: ACPI Tables.
--------------------
I considered publishing this section first, or at least under
"Platform", but I've found this information largely useful in
debugging interleave configurations and tiering mechanisms -
which are higher level concepts.

Much of the information in this section is most relevant to
Interleave (yet to be published Section 4).

I promise not to simply regurgitate the entire ACPI specification,
and to limit this to the commentary necessary to describe how these
tables relate to actual Linux resources (like NUMA nodes and tiers).

At the very least, if you find yourself trying to figure out why
your CXL system isn't producing NUMA nodes, memory tiers, root
decoders, memory regions - etc - I would check these tables
first for aberrations.  Almost all my personal strife has been
associated with ACPI table misconfiguration.

ACPI tables can be inspected with the `acpica-tools` package.
    mkdir acpi_tables && cd acpi_tables
    acpidump -b
    iasl -d *
    -- inspect the *.dsl files

====
CEDT
====
The CXL Early Discovery Table is generated by BIOS to describe
the CXL devices present and configured (to some extent) at boot
by the BIOS.

# CHBS
The CXL Host Bridge Structure describes CXL host bridges.  Other
than describing device register information, it reports the specific
host bridge UID for this host bridge.
These host bridge IDs will be referenced in other tables.

Debug hint: check that the host bridge IDs between tables are
            consistent - stuff breaks oddly if they're not!

```
         Subtable Type : 00 [CXL Host Bridge Structure]
              Reserved : 00
                Length : 0020
Associated host bridge : 00000007         <- Host bridge _UID
 Specification version : 00000001
              Reserved : 00000000
         Register base : 0000010370400000
       Register length : 0000000000010000
```

# CFMWS
The CXL Fixed Memory Window structure describes a memory region
associated with one or more CXL host bridges (as described by the
CHBS).  It additionally describes any inter-host-bridge interleave
configuration that may have been programmed by BIOS. (Section 4)

```
           Subtable Type : 01 [CXL Fixed Memory Window Structure]
                Reserved : 00
                  Length : 002C
                Reserved : 00000000
     Window base address : 000000C050000000   <- Memory Region
             Window size : 0000003CA0000000
Interleave Members (2^n) : 01                 <- Interleave configuration
   Interleave Arithmetic : 00
                Reserved : 0000
             Granularity : 00000000
            Restrictions : 0006
                   QtgId : 0001
            First Target : 00000007           <- Host Bridge _UID
             Next Target : 00000006           <- Host Bridge _UID
```

INTRA-host-bridge interleave (multiple devices on one host bridge) is
NOT reported in this structure, and is solely defined via CXL device
decoder programming (host bridge and endpoint decoders).  This will be
described later (Section 4 - Interleave)

====
SRAT
====
The System/Static Resource Affinity Table describes resource (CPU,
Memory) affinity to "Proximity Domains".  This table is technically
optional, but for performance information (see "HMAT") to be enumerated
by Linux it must be present.

# Proximity Domain
A proximity domain is ROUGHLY equivalent to "NUMA Node" - though a
1-to-1 mapping is not guaranteed.  There are scenarios where "Proximity
Domain 4" may map to "NUMA Node 3", for example.
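
To see how a proximity domain can land on a differently numbered node,
here is a minimal sketch of a first-come, first-served PXM-to-node
mapping.  It is a simplified model in the spirit of
acpi_map_pxm_to_node(), not the kernel implementation: `struct pxm_map`,
`pxm_map_init()`, `map_pxm_to_node()` and the MAX_* limits are all
hypothetical names made up for illustration, and it assumes node ids are
handed out densely in the order PXMs are first seen.

```c
#include <assert.h>

#define MAX_PXM   16	/* hypothetical limits for this sketch */
#define MAX_NODES 16
#define NO_NODE   (-1)

/* Hypothetical bookkeeping: which node id each PXM was given. */
struct pxm_map {
	int pxm_to_node[MAX_PXM];	/* NO_NODE until the PXM is seen */
	int next_node;			/* lowest node id not yet handed out */
};

static void pxm_map_init(struct pxm_map *m)
{
	for (int i = 0; i < MAX_PXM; i++)
		m->pxm_to_node[i] = NO_NODE;
	m->next_node = 0;
}

/*
 * Return the node for a PXM.  The first time a PXM is seen it is given
 * the next free node id, so node numbering is dense even when the PXM
 * numbering in the table has gaps.
 */
static int map_pxm_to_node(struct pxm_map *m, int pxm)
{
	assert(pxm >= 0 && pxm < MAX_PXM);

	if (m->pxm_to_node[pxm] == NO_NODE) {
		if (m->next_node >= MAX_NODES)
			return NO_NODE;
		m->pxm_to_node[pxm] = m->next_node++;
	}
	return m->pxm_to_node[pxm];
}
```

Under this model, a table that describes PXMs 0, 1, 2 and 4 (skipping 3)
ends up with PXM 4 on node 3, and asking for PXM 4 again returns the same
node.
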
(See "NUMA Node Creation")

# Memory Affinity
Generally speaking, if a host does any amount of CXL fabric (decoder)
programming in BIOS - an SRAT entry for that memory needs to be present.

```
        Subtable Type : 01 [Memory Affinity]
               Length : 28
     Proximity Domain : 00000001          <- NUMA Node 1
            Reserved1 : 0000
         Base Address : 000000C050000000  <- Physical Memory Region
       Address Length : 0000003CA0000000
            Reserved2 : 00000000
Flags (decoded below) : 0000000B
              Enabled : 1
        Hot Pluggable : 1
         Non-Volatile : 0
```

# Generic Initiator / Port
In the scenario where CXL devices are not present or configured by
BIOS, we may still want to generate proximity domain configurations
for those devices.  The Generic Initiator interfaces are intended to
fill this gap, so that performance information can still be utilized
when the devices become available at runtime.

I won't cover the details here, for now, but I will link to the
proposal from Dan Williams and Jonathan Cameron if you would like
more information.
https://lore.kernel.org/all/e1a52da9aec90766da5de51b1b839fd95d63a5af.camel@intel.com/

====
HMAT
====
The Heterogeneous Memory Attributes Table contains information such as
cache attributes and bandwidth and latency details for memory proximity
domains.  For the purpose of this document, we will only discuss the
SLLBI entry.

# SLLBI
The System Locality Latency and Bandwidth Information records latency
and bandwidth information for proximity domains.  This table is used by
Linux to configure interleave weights and memory tiers.

```
Heavily truncated for brevity
              Structure Type : 0001 [SLLBI]
                   Data Type : 00         <- Latency
Target Proximity Domain List : 00000000
Target Proximity Domain List : 00000001
                       Entry : 0080       <- DRAM LTC
                       Entry : 0100       <- CXL LTC

              Structure Type : 0001 [SLLBI]
                   Data Type : 03         <- Bandwidth
Target Proximity Domain List : 00000000
Target Proximity Domain List : 00000001
                       Entry : 1200       <- DRAM BW
                       Entry : 0200       <- CXL BW
```

---------------------------------
Part 00: Linux Resource Creation.
---------------------------------

==================
NUMA node creation
==================
NUMA nodes are *NOT* hot-pluggable.  All *POSSIBLE* NUMA nodes are
identified at `__init` time, more specifically during `mm_init`.

What this means is that the CEDT and SRAT must contain sufficient
`proximity domain` information for Linux to identify how many NUMA
nodes are required (and what memory regions to associate with them).

The relevant code exists in: linux/drivers/acpi/numa/srat.c
```
static int __init
acpi_parse_memory_affinity(union acpi_subtable_headers *header,
			   const unsigned long table_end)
{
	/* ... heavily truncated for brevity ... */
	pxm = ma->proximity_domain;
	node = acpi_map_pxm_to_node(pxm);
	if (numa_add_memblk(node, start, end) < 0)
		/* ... */
	node_set(node, numa_nodes_parsed);	/* <--- mark node N_POSSIBLE */
}

static int __init acpi_parse_cfmws(union acpi_subtable_headers *header,
				   void *arg, const unsigned long table_end)
{
	/* ... heavily truncated for brevity ... */
	/*
	 * The SRAT may have already described NUMA details for all,
	 * or a portion of, this CFMWS HPA range. Extend the memblks
	 * found for any portion of the window to cover the entire
	 * window.
	 */
	if (!numa_fill_memblks(start, end))
		return 0;

	/* No SRAT description. Create a new node. */
	node = acpi_map_pxm_to_node(*fake_pxm);
	if (numa_add_memblk(node, start, end) < 0)
		/* ... */
	node_set(node, numa_nodes_parsed);	/* <--- mark node N_POSSIBLE */
}

int __init acpi_numa_init(void)
{
	...
	if (!acpi_table_parse(ACPI_SIG_SRAT, acpi_parse_srat)) {
		cnt = acpi_table_parse_srat(ACPI_SRAT_TYPE_MEMORY_AFFINITY,
					    acpi_parse_memory_affinity, 0);
	}
	/* fake_pxm is the next unused PXM value after SRAT parsing */
	acpi_table_parse_cedt(ACPI_CEDT_TYPE_CFMWS, acpi_parse_cfmws,
			      &fake_pxm);
	...
}
```

Basically, the heuristic is as follows:
1) Add one NUMA node per Proximity Domain described in SRAT
2) If the SRAT describes all (or a portion) of the memory described by
   a CFMWS - extend the existing memblks to cover the window and do
   not create a node for that CFMWS
3) If the SRAT does not describe any of the memory described by a
   CFMWS - create a node for that CFMWS

Generally speaking, you will see one NUMA node per Host bridge, unless
inter-host-bridge interleave is in use (see Section 4 - Interleave).

============
Memory Tiers
============
The `abstract distance` of a node dictates what tier it lands in (and
therefore, what tiers are created).  This is calculated based on the
following heuristic, using HMAT data:

```
int mt_perf_to_adistance(struct access_coordinate *perf, int *adist)
{
	...
	/*
	 * The abstract distance of a memory node is in direct proportion to
	 * its memory latency (read + write) and inversely proportional to its
	 * memory bandwidth (read + write).  The abstract distance, memory
	 * latency, and memory bandwidth of the default DRAM nodes are used as
	 * the base.
	 */
	*adist = MEMTIER_ADISTANCE_DRAM *
		(perf->read_latency + perf->write_latency) /
		(default_dram_perf.read_latency + default_dram_perf.write_latency) *
		(default_dram_perf.read_bandwidth + default_dram_perf.write_bandwidth) /
		(perf->read_bandwidth + perf->write_bandwidth);

	return 0;
}
```

Debugging hint: If you have DRAM and CXL memory in separate NUMA nodes
                but only find 1 memory tier, validate the HMAT!

============================
Memory Tier Demotion Targets
============================
When `demotion` is enabled (see Section 5 - allocation), the reclaim
system may opportunistically demote a page from one memory tier to
another.
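
Since demotion moves pages between tiers, it can help to work the
abstract distance formula from the Memory Tiers section by hand.  The
following is a rough sketch, not kernel code: `struct perf`,
`perf_to_adistance()` and `adistance_to_chunk()` are hypothetical
stand-ins, and the MEMTIER_* values are assumed to match recent kernels
(check include/linux/memory-tiers.h for your tree).  It also assumes that
nodes whose abstract distances fall in the same chunk share a tier.

```c
#include <assert.h>

/* Assumed values, believed to match recent include/linux/memory-tiers.h. */
#define MEMTIER_CHUNK_BITS	10
#define MEMTIER_CHUNK_SIZE	(1 << MEMTIER_CHUNK_BITS)
#define MEMTIER_ADISTANCE_DRAM	((4 * MEMTIER_CHUNK_SIZE) + (MEMTIER_CHUNK_SIZE >> 1))

/* Hypothetical, simplified stand-in for struct access_coordinate. */
struct perf {
	unsigned int read_latency, write_latency;
	unsigned int read_bandwidth, write_bandwidth;
};

/*
 * Same shape as the mt_perf_to_adistance() expression: scale the DRAM
 * base distance up by the latency ratio and by the inverse of the
 * bandwidth ratio.  Integer math, evaluated left to right.
 */
static int perf_to_adistance(const struct perf *dram, const struct perf *p)
{
	assert(dram->read_latency + dram->write_latency > 0);
	assert(p->read_bandwidth + p->write_bandwidth > 0);

	return MEMTIER_ADISTANCE_DRAM *
		(p->read_latency + p->write_latency) /
		(dram->read_latency + dram->write_latency) *
		(dram->read_bandwidth + dram->write_bandwidth) /
		(p->read_bandwidth + p->write_bandwidth);
}

/* Assumption: a tier covers one chunk of abstract distance. */
static int adistance_to_chunk(int adist)
{
	return adist >> MEMTIER_CHUNK_BITS;
}
```

Plugging in the truncated SLLBI entries from the HMAT example as relative
values (latency 0x80 vs 0x100, bandwidth 0x1200 vs 0x200, with read equal
to write) gives 4608 for DRAM and 82944 for CXL, i.e. chunks 4 and 81, so
the two nodes would land in different tiers.  Real HMAT entries are scaled
by the entry base unit, so treat these numbers as ratios only.
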
The selection of a `demotion target` is partially based on Abstract
Distance and Performance Data.

An example of demotion targets from memory-tiers.c:
```
/* Example 1:
 *
 * Node 0 & 1 are CPU + DRAM nodes, node 2 & 3 are PMEM nodes.
 *
 * node distances:
 * node   0    1    2    3
 *    0  10   20   30   40
 *    1  20   10   40   30
 *    2  30   40   10   40
 *    3  40   30   40   10
 *
 * memory_tiers0 = 0-1
 * memory_tiers1 = 2-3
 *
 * node_demotion[0].preferred = 2
 * node_demotion[1].preferred = 3
 * node_demotion[2].preferred = <empty>
 * node_demotion[3].preferred = <empty>
 */
```

=============================
Mempolicy Weighted Interleave
=============================
The `weighted interleave` functionality of `mempolicy` utilizes weights
to distribute memory across NUMA nodes according to some set weight.

There is a proposal to auto-configure these weights based on HMAT data.
https://lore.kernel.org/linux-mm/20250305200506.2529583-1-joshua.hahnjy@gmail.com/T/#u

See Section 4 - Interleave, for more information on weighted interleave.

--------------
Build Options.
--------------
We can add these build configurations to our complexity picture.

CONFIG_NUMA      -- required for ACPI NUMA, mempolicy, and memory tiers
CONFIG_ACPI_NUMA -- enables SRAT and CEDT parsing
CONFIG_ACPI_HMAT -- enables HMAT parsing

~Gregory