Date: Thu, 13 Mar 2025 17:20:04 +0000
From: Jonathan Cameron
To: Gregory Price
Subject: Re: [LSF/MM] CXL Boot to Bash - Section 0a: CFMWS and NUMA Flexiblity
Message-ID: <20250313172004.00002236@huawei.com>

On Fri, 7 Mar 2025 22:23:05 -0500
Gregory Price wrote:

> In the last section we discussed how the CEDT CFMWS and SRAT Memory
> Affinity structures are used by Linux to "create" NUMA nodes (or at
> least mark them as possible). However, the examples I used suggested
> that there was a 1-to-1 relationship between CFMWS and devices or
> host bridges.
>
> This is not true - in fact, CFMWS are simply a carve out of System
> Physical Address space which may be used to map any number of endpoint
> devices behind the associated Host Bridge(s).
>
> The limiting factor is what your platform vendor BIOS supports.
>
> This section describes a handful of *possible* configurations, what NUMA
> structure they will create, and what flexibility this provides.
>
> All of these CFMWS configurations are made up, and may or may not exist
> in real machines. They are a conceptual teaching tool, not a roadmap.
>
> (When discussing interleave in this section, please note that I am
> intentionally omitting details about decoder programming, as this
> will be covered later.)
>
>
> -------------------------------
> One 2GB Device, Multiple CFMWS.
> -------------------------------
> Let's imagine we have one 2GB device attached to a host bridge.
>
> In this example, the device hosts 2GB of persistent memory - but we
> might want the flexibility to map capacity as volatile or persistent.

Fairly sure we block persistent in a volatile CFMWS in the kernel.
Does any BIOS actually do this?

You might have a variable partition device, but I thought that in the
kernel at least we decided that no one was building anything that crazy?

Maybe a QoS split is a better example to motivate one range, two places?

>
> The platform vendor may decide that they want to reserve two entirely
> separate system physical address ranges to represent the capacity.
>
> ```
>            Subtable Type : 01 [CXL Fixed Memory Window Structure]
>                 Reserved : 00
>                   Length : 002C
>                 Reserved : 00000000
>      Window base address : 0000000100000000 <- Memory Region
>              Window size : 0000000080000000 <- 2GB
> Interleave Members (2^n) : 00
>    Interleave Arithmetic : 00
>                 Reserved : 0000
>              Granularity : 00000000
>             Restrictions : 0006 <- Bit(2) - Volatile
>                    QtgId : 0001
>             First Target : 00000007 <- Host Bridge _UID
>
>            Subtable Type : 01 [CXL Fixed Memory Window Structure]
>                 Reserved : 00
>                   Length : 002C
>                 Reserved : 00000000
>      Window base address : 0000000200000000 <- Memory Region
>              Window size : 0000000080000000 <- 2GB
> Interleave Members (2^n) : 00
>    Interleave Arithmetic : 00
>                 Reserved : 0000
>              Granularity : 00000000
>             Restrictions : 000A <- Bit(3) - Persistent
>                    QtgId : 0001
>             First Target : 00000007 <- Host Bridge _UID
>
> NUMA effect: 2 nodes marked POSSIBLE (1 for each CFMWS)
> ```
>
> You might have a CEDT with two CFMWS as above, where the base addresses
> are `0x100000000` and `0x200000000` respectively, but whose window sizes
> cover the entire 2GB capacity of the device. This affords the user
> flexibility in where the memory is mapped depending on if it is mapped
> as volatile or persistent while keeping the two SPA ranges separate.
>
> This is allowed because the endpoint decoders commit device physical
> address space *in order*, meaning no two regions of device physical
> address space can be mapped to more than one system physical address.
>
> i.e.: DPA(0) can only map to SPA(0x200000000) xor SPA(0x100000000)
>
> (See Section 2a - decoder programming).
>
> -------------------------------------------------------------
> Two Devices On One Host Bridge - With and Without Interleave.
> -------------------------------------------------------------
> What if we wanted some capacity on each endpoint hosted on its own NUMA
> node, and wanted to interleave a portion of each device capacity?

If anyone hits the lock on commit (i.e.
annoying BIOS) the ordering checks on HPA kick in here and restrict
flexibility a lot (assuming I understand them correctly, that is).

This is a good illustration of why we should at some point revisit
multiple NUMA nodes per CFMWS. We have to burn SPA space just to get
nodes. From a spec point of view all that is needed here is a single
CFMWS.

>
> We could produce the following CFMWS configuration.
> ```
>            Subtable Type : 01 [CXL Fixed Memory Window Structure]
>                 Reserved : 00
>                   Length : 002C
>                 Reserved : 00000000
>      Window base address : 0000000100000000 <- Memory Region 1
>              Window size : 0000000080000000 <- 2GB
> Interleave Members (2^n) : 00
>    Interleave Arithmetic : 00
>                 Reserved : 0000
>              Granularity : 00000000
>             Restrictions : 0006 <- Bit(2) - Volatile
>                    QtgId : 0001
>             First Target : 00000007 <- Host Bridge _UID
>
>            Subtable Type : 01 [CXL Fixed Memory Window Structure]
>                 Reserved : 00
>                   Length : 002C
>                 Reserved : 00000000
>      Window base address : 0000000200000000 <- Memory Region 2
>              Window size : 0000000080000000 <- 2GB
> Interleave Members (2^n) : 00
>    Interleave Arithmetic : 00
>                 Reserved : 0000
>              Granularity : 00000000
>             Restrictions : 0006 <- Bit(2) - Volatile
>                    QtgId : 0001
>             First Target : 00000007 <- Host Bridge _UID
>
>            Subtable Type : 01 [CXL Fixed Memory Window Structure]
>                 Reserved : 00
>                   Length : 002C
>                 Reserved : 00000000
>      Window base address : 0000000300000000 <- Memory Region 3
>              Window size : 0000000100000000 <- 4GB
> Interleave Members (2^n) : 00
>    Interleave Arithmetic : 00
>                 Reserved : 0000
>              Granularity : 00000000
>             Restrictions : 0006 <- Bit(2) - Volatile
>                    QtgId : 0001
>             First Target : 00000007 <- Host Bridge _UID
>
> NUMA effect: 3 nodes marked POSSIBLE (1 for each CFMWS)
> ```
>
> In this configuration, we could still do what we did with the prior
> configuration (2 CFMWS), but we could also use the third root decoder
> to simplify decoder programming of interleave.
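
For anyone following along, here is a minimal sketch (Python, nothing
kernel-side) of how the three made-up windows above decode. The names
for Restrictions bits 0, 1 and 4 are my reading of the CXL 2.0 Window
Restrictions field and should be treated as assumptions, since the
quoted text only names bits 2 and 3. The helper names are hypothetical,
and the node count simply follows the "1 for each CFMWS" rule stated in
the quoted text.

```python
# Illustrative only: values copied from the three made-up CFMWS entries.
# Bit names other than Volatile (bit 2) and Persistent (bit 3) are
# assumptions based on the CXL 2.0 "Window Restrictions" field.
RESTRICTION_BITS = {
    0: "CXL Type 2 memory",
    1: "CXL Type 3 memory",
    2: "Volatile",
    3: "Persistent",
    4: "Fixed device configuration",
}


def decode_restrictions(value):
    """Return the names of the restriction bits set in value, lowest bit first."""
    return [name for bit, name in RESTRICTION_BITS.items() if value & (1 << bit)]


# (window base, window size, restrictions) for each CFMWS
WINDOWS = [
    (0x100000000, 0x080000000, 0x0006),
    (0x200000000, 0x080000000, 0x0006),
    (0x300000000, 0x100000000, 0x0006),
]


def overlaps(a, b):
    """True if two (base, size, restrictions) windows share any SPA."""
    return a[0] < b[0] + b[1] and b[0] < a[0] + a[1]


def possible_nodes(windows):
    """One NUMA node marked possible per CFMWS, per the quoted text."""
    return len(windows)


for base, size, restrictions in WINDOWS:
    last = base + size - 1
    print(f"{base:#x}-{last:#x} {size >> 30}GB {decode_restrictions(restrictions)}")
print("possible nodes:", possible_nodes(WINDOWS))
```

Note that 0006 sets both bit 1 and bit 2, and 000A in the earlier
example sets bit 1 and bit 3, so the annotations in the tables only call
out the bit that differs.
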
>
> Since the third region has sufficient capacity (4GB) to cover both
> devices (2GB/each), we can actually associate the entire capacity of
> both devices in that region.
>
> We'll discuss this decoder structure in-depth in Section 4.
>
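
The arithmetic behind that can be sketched as below. This is only a
rough illustration using the made-up sizes from the example. The helper
names are hypothetical, and nothing here reflects how the kernel or any
BIOS actually checks this.

```python
GIB = 1 << 30

# Made-up values from the example: two 2GB endpoints behind one host bridge,
# and the window sizes of the three CFMWS entries.
DEVICE_CAPACITIES = [2 * GIB, 2 * GIB]
WINDOW_SIZES = [0x080000000, 0x080000000, 0x100000000]


def window_fits(window_size, capacities):
    """True if the combined device capacity fits inside one window."""
    return sum(capacities) <= window_size


def spa_reserved(window_sizes):
    """Total System Physical Address space set aside by the windows."""
    return sum(window_sizes)


print("third window covers both devices:",
      window_fits(WINDOW_SIZES[2], DEVICE_CAPACITIES))
print("SPA reserved (GB):", spa_reserved(WINDOW_SIZES) // GIB)
print("device capacity (GB):", sum(DEVICE_CAPACITIES) // GIB)
```

With these numbers the three windows reserve 8GB of SPA for 4GB of
device capacity, which is the SPA cost of getting three nodes.
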