Date: Fri, 7 Mar 2025 22:23:05 -0500
From: Gregory Price
To: lsf-pc@lists.linux-foundation.org
Cc: linux-mm@kvack.org, linux-cxl@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [LSF/MM] CXL Boot to Bash - Section 0a: CFMWS and NUMA Flexiblity

In the last section we discussed how the CEDT CFMWS and SRAT Memory
Affinity structures are used by Linux to "create" NUMA nodes (or at
least mark them as possible). However, the examples I used suggested
that there was a 1-to-1 relationship between CFMWS and devices or
host bridges.

This is not true - in fact, CFMWS are simply a carve out of System
Physical Address space which may be used to map any number of endpoint
devices behind the associated Host Bridge(s).

The limiting factor is what your platform vendor BIOS supports.

This section describes a handful of *possible* configurations, what
NUMA structure they will create, and what flexibility this provides.

All of these CFMWS configurations are made up, and may or may not exist
in real machines. They are a conceptual teaching tool, not a roadmap.

(When discussing interleave in this section, please note that I am
intentionally omitting details about decoder programming, as this
will be covered later.)

-------------------------------
One 2GB Device, Multiple CFMWS.
-------------------------------
Let's imagine we have one 2GB device attached to a host bridge.

In this example, the device hosts 2GB of persistent memory - but we
might want the flexibility to map capacity as volatile or persistent.

The platform vendor may decide that they want to reserve two entirely
separate system physical address ranges to represent the capacity.
```
           Subtable Type : 01 [CXL Fixed Memory Window Structure]
                Reserved : 00
                  Length : 002C
                Reserved : 00000000
     Window base address : 0000000100000000   <- Memory Region
             Window size : 0000000080000000   <- 2GB
Interleave Members (2^n) : 00
   Interleave Arithmetic : 00
                Reserved : 0000
             Granularity : 00000000
            Restrictions : 0006               <- Bit(2) - Volatile
                   QtgId : 0001
            First Target : 00000007           <- Host Bridge _UID

           Subtable Type : 01 [CXL Fixed Memory Window Structure]
                Reserved : 00
                  Length : 002C
                Reserved : 00000000
     Window base address : 0000000200000000   <- Memory Region
             Window size : 0000000080000000   <- 2GB
Interleave Members (2^n) : 00
   Interleave Arithmetic : 00
                Reserved : 0000
             Granularity : 00000000
            Restrictions : 000A               <- Bit(3) - Persistent
                   QtgId : 0001
            First Target : 00000007           <- Host Bridge _UID

NUMA effect: 2 nodes marked POSSIBLE (1 for each CFMWS)
```

You might have a CEDT with two CFMWS as above, where the base addresses
are `0x100000000` and `0x200000000` respectively, and each window size
covers the entire 2GB capacity of the device. This affords the user
flexibility in where the memory is mapped, depending on whether it is
mapped as volatile or persistent, while keeping the two SPA ranges
separate.

This is allowed because the endpoint decoders commit device physical
address space *in order*, meaning no region of device physical address
space can be mapped to more than one system physical address.

i.e.: DPA(0) can only map to SPA(0x200000000) xor SPA(0x100000000)

(See Section 2a - decoder programming).

-------------------------------
Two Devices On One Host Bridge.
-------------------------------
Let's say we have two CXL 2GB devices behind a single host bridge, and
we may or may not want to interleave some or all of those devices.

There are (at least) 2 ways to provide this flexibility.

First, we might simply have two CFMWS.
```
           Subtable Type : 01 [CXL Fixed Memory Window Structure]
                Reserved : 00
                  Length : 002C
                Reserved : 00000000
     Window base address : 0000000100000000   <- Memory Region
             Window size : 0000000080000000   <- 2GB
Interleave Members (2^n) : 00
   Interleave Arithmetic : 00
                Reserved : 0000
             Granularity : 00000000
            Restrictions : 0006               <- Bit(2) - Volatile
                   QtgId : 0001
            First Target : 00000007           <- Host Bridge _UID

           Subtable Type : 01 [CXL Fixed Memory Window Structure]
                Reserved : 00
                  Length : 002C
                Reserved : 00000000
     Window base address : 0000000200000000   <- Memory Region
             Window size : 0000000080000000
Interleave Members (2^n) : 00
   Interleave Arithmetic : 00
                Reserved : 0000
             Granularity : 00000000
            Restrictions : 0006               <- Bit(2) - Volatile
                   QtgId : 0001
            First Target : 00000007           <- Host Bridge _UID

NUMA effect: 2 nodes marked POSSIBLE (1 for each CFMWS)
```

These CFMWS target the same host bridge, but are NOT necessarily
limited to mapping memory from any one device. We could program
decoders in either of the following ways.

Example: Host bridge and endpoints are programmed WITHOUT interleave.

```
Decoders
                            CXL Root
                          /          \
                  decoder0.0          decoder1.0
[0x100000000, 0x17FFFFFFF]          [0x200000000, 0x27FFFFFFF]
                          \          /
                           Host Bridge
                          /          \
                  decoder2.0          decoder2.1
[0x100000000, 0x17FFFFFFF]          [0x200000000, 0x27FFFFFFF]
                      |                    |
                  Endpoint 0          Endpoint 1
                      |                    |
                  decoder4.0          decoder5.0
[0x100000000, 0x17FFFFFFF]          [0x200000000, 0x27FFFFFFF]

NUMA effect:
    All of Endpoint 0 memory is on NUMA node A
    All of Endpoint 1 memory is on NUMA node B
```

Alternatively, these decoders could be programmed to interleave memory
accesses across endpoints. We'll cover this configuration in depth
later. For now, just know the above structure means that each endpoint
has its own NUMA node - but this is not required.

-------------------------------------------------------------
Two Devices On One Host Bridge - With and Without Interleave.
-------------------------------------------------------------
What if we wanted some capacity on each endpoint hosted on its own NUMA
node, and wanted to interleave a portion of each device capacity?

We could produce the following CFMWS configuration.

```
           Subtable Type : 01 [CXL Fixed Memory Window Structure]
                Reserved : 00
                  Length : 002C
                Reserved : 00000000
     Window base address : 0000000100000000   <- Memory Region 1
             Window size : 0000000080000000   <- 2GB
Interleave Members (2^n) : 00
   Interleave Arithmetic : 00
                Reserved : 0000
             Granularity : 00000000
            Restrictions : 0006               <- Bit(2) - Volatile
                   QtgId : 0001
            First Target : 00000007           <- Host Bridge _UID

           Subtable Type : 01 [CXL Fixed Memory Window Structure]
                Reserved : 00
                  Length : 002C
                Reserved : 00000000
     Window base address : 0000000200000000   <- Memory Region 2
             Window size : 0000000080000000   <- 2GB
Interleave Members (2^n) : 00
   Interleave Arithmetic : 00
                Reserved : 0000
             Granularity : 00000000
            Restrictions : 0006               <- Bit(2) - Volatile
                   QtgId : 0001
            First Target : 00000007           <- Host Bridge _UID

           Subtable Type : 01 [CXL Fixed Memory Window Structure]
                Reserved : 00
                  Length : 002C
                Reserved : 00000000
     Window base address : 0000000300000000   <- Memory Region 3
             Window size : 0000000100000000   <- 4GB
Interleave Members (2^n) : 00
   Interleave Arithmetic : 00
                Reserved : 0000
             Granularity : 00000000
            Restrictions : 0006               <- Bit(2) - Volatile
                   QtgId : 0001
            First Target : 00000007           <- Host Bridge _UID

NUMA effect: 3 nodes marked POSSIBLE (1 for each CFMWS)
```

In this configuration, we could still do what we did with the prior
configuration (2 CFMWS), but we could also use the third root decoder
to simplify decoder programming of interleave.

Since the third region has sufficient capacity (4GB) to cover both
devices (2GB each), we can actually associate the entire capacity of
both devices with that region.

We'll discuss this decoder structure in depth in Section 4.

-------------------------------------
Two devices on separate host bridges.
-------------------------------------
We may have placed the devices on separate host bridges.

In this case we may naturally have one CFMWS per host bridge.

```
           Subtable Type : 01 [CXL Fixed Memory Window Structure]
                Reserved : 00
                  Length : 002C
                Reserved : 00000000
     Window base address : 0000000100000000   <- Memory Region 1
             Window size : 0000000080000000   <- 2GB
Interleave Members (2^n) : 00
   Interleave Arithmetic : 00
                Reserved : 0000
             Granularity : 00000000
            Restrictions : 0006               <- Bit(2) - Volatile
                   QtgId : 0001
            First Target : 00000007           <- Host Bridge _UID

           Subtable Type : 01 [CXL Fixed Memory Window Structure]
                Reserved : 00
                  Length : 002C
                Reserved : 00000000
     Window base address : 0000000200000000   <- Memory Region 2
             Window size : 0000000080000000   <- 2GB
Interleave Members (2^n) : 00
   Interleave Arithmetic : 00
                Reserved : 0000
             Granularity : 00000000
            Restrictions : 0006               <- Bit(2) - Volatile
                   QtgId : 0001
            First Target : 00000006           <- Host Bridge _UID

NUMA effect: 2 nodes marked POSSIBLE (1 for each CFMWS)
```

But we may also want to interleave *across* host bridges. To do this,
the platform vendor may add the following CFMWS (either by itself if
done statically, or in addition to the above two for flexibility).

```
           Subtable Type : 01 [CXL Fixed Memory Window Structure]
                Reserved : 00
                  Length : 002C
                Reserved : 00000000
     Window base address : 0000000300000000   <- Memory Region
             Window size : 0000000100000000   <- 4GB
Interleave Members (2^n) : 01                 <- 2-way interleave
   Interleave Arithmetic : 00
                Reserved : 0000
             Granularity : 00000000
            Restrictions : 0006               <- Bit(2) - Volatile
                   QtgId : 0001
            First Target : 00000007           <- Host Bridge 7
             Next Target : 00000006           <- Host Bridge 6

NUMA effect: an additional NUMA node marked POSSIBLE
```

This greatly simplifies the decoder programming structure, and allows
us to aggregate bandwidth across host bridges. The decoder programming
might look as follows in this setup.
```
Decoders:
                             CXL Root
                                |
                            decoder0.0
                    [0x300000000, 0x3FFFFFFFF]
                         /              \
               Host Bridge 7          Host Bridge 6
                     /                      \
                decoder1.0               decoder2.0
[0x300000000, 0x3FFFFFFFF]               [0x300000000, 0x3FFFFFFFF]
                    |                        |
                Endpoint 0               Endpoint 1
                    |                        |
                decoder3.0               decoder4.0
[0x300000000, 0x3FFFFFFFF]               [0x300000000, 0x3FFFFFFFF]
```

We'll discuss this more in depth in Section 4 - but you can see how
straightforward this is. All the decoders are programmed the same.

----------
SRAT Note.
----------
If you remember from the first portion of Section 0, the SRAT may be
used to statically assign memory regions to specific proximity domains.

```
   Subtable Type : 01 [Memory Affinity]
          Length : 28
Proximity Domain : 00000001          <- NUMA Node 1
       Reserved1 : 0000
    Base Address : 000000C050000000  <- Physical Memory Region
  Address Length : 0000003CA0000000
```

There is a careful dance between the CEDT and SRAT tables and how NUMA
nodes are created. If things don't look quite the way you expect -
check the SRAT Memory Affinity entries and CEDT CFMWS to determine what
your platform actually supports in terms of flexible topologies.

--------
Summary.
--------
In the first part of Section 0 we showed how CFMWS and SRAT affect how
Linux creates NUMA nodes. Here we demonstrated that CFMWS do not have a
1-to-1 relationship with either CXL devices or Host Bridges.

Instead, CFMWS are simply a System Physical Address carve out which can
be used in a number of ways to define your memory topology in software.

This is a core piece of the "Software Defined Memory" puzzle.

How your platform vendor decides to program the CEDT will dictate how
flexibly you can manage CXL devices in software.

~Gregory
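
Addendum: a rough Python sketch of the bookkeeping behind the examples
above. This is not kernel code. The `Cfmws` class and the helper names
are hypothetical, and only the two restriction bits annotated in the
dumps (Bit(2) volatile, Bit(3) persistent) are decoded. It models the
"separate host bridges plus cross-bridge interleave" layout, counts one
POSSIBLE node per CFMWS, and looks up which window covers a given SPA.

```python
from dataclasses import dataclass

# Restriction bits as annotated in the dumps; other bits are ignored here.
RESTRICT_VOLATILE = 1 << 2
RESTRICT_PERSISTENT = 1 << 3


@dataclass(frozen=True)
class Cfmws:
    """Hypothetical, simplified view of one CFMWS entry."""
    base: int
    size: int
    restrictions: int
    targets: tuple  # host bridge _UIDs

    @property
    def end(self):
        # Inclusive last address of the window.
        return self.base + self.size - 1

    def kinds(self):
        names = []
        if self.restrictions & RESTRICT_VOLATILE:
            names.append("volatile")
        if self.restrictions & RESTRICT_PERSISTENT:
            names.append("persistent")
        return names


def possible_nodes(windows):
    # One node marked POSSIBLE per CFMWS, as in the "NUMA effect" notes
    # (assumes no SRAT entry already covers these ranges).
    return len(windows)


def window_for(windows, spa):
    # Return the window containing a system physical address, or None.
    for w in windows:
        if w.base <= spa <= w.end:
            return w
    return None


GB = 1 << 30
# Two per-bridge windows plus the 2-way cross-bridge interleave window.
WINDOWS = [
    Cfmws(0x100000000, 2 * GB, 0x0006, (7,)),
    Cfmws(0x200000000, 2 * GB, 0x0006, (6,)),
    Cfmws(0x300000000, 4 * GB, 0x0006, (7, 6)),
]

if __name__ == "__main__":
    print("possible nodes:", possible_nodes(WINDOWS))
    for w in WINDOWS:
        print(f"[{w.base:#x}, {w.end:#x}] {w.kinds()} targets={w.targets}")
```

The inclusive `end` matches the decoder ranges in the diagrams, for
example a 2GB window at 0x100000000 ends at 0x17FFFFFFF.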