Date: Tue, 31 Jan 2023 17:41:15 +0000
From: Jonathan Cameron
To: Viacheslav A.Dubeyko
CC: Dan Williams, "Adam Manzanares", Cong Wang, Viacheslav Dubeyko
Subject: Re: [LSF/MM/BPF TOPIC] CXL Fabric Manager (FM) architecture
Message-ID: <20230131174115.00007493@Huawei.com>
In-Reply-To: <7F001EAF-C512-436A-A9DD-E08730C91214@bytedance.com>
Organization: Huawei Technologies Research and Development (UK) Ltd.

On Mon, 30 Jan 2023 11:11:23 -0800
"Viacheslav A.Dubeyko" wrote:

> Hello,

Hi Slava,

I'll throw some opinions at this :)

>
> I would like to suggest Fabric Manager (FM) architecture discussion. As far as I can see,
> FM architecture requires: (1) FM configuration tool, (2) FM daemon, (3) QEMU emulation
> of CXL hardware features. FM daemon receives requests from configuration tool and
> executes commands by means of interaction with kernel-space subsystem and CXL switch
> (that can be emulated by QEMU). So, the key questions for discussion:

Worth describing the operating modes to be supported. You kind of cover this later,
but I think pulling it out makes it clearer that we want one bit of software to do
several different things.

1) FM separate from hosts and talked to by higher level orchestration software
   but using a Switch CCI or MHD mailbox (over PCI).
   This one is fairly easy because any security / shooting self in foot problems
   are an issue for higher level software.
2) FM on host. Probably mostly going to be relevant for debug but may use the
   same mailbox as is being used by the existing CXL drivers (for Multi Head Device
   it might be the end point mailbox, for Multi Logical Device behind a switch it
   might be the switch mailbox).
3) All out of band (MCTP or similar - want some shared code, but no need for
   anything in kernel as far as I can tell).

> (1) How to distribute functionality between user-space and kernel-space?

Kernel for transport if mailbox based (switch or MHD). Possibly help in kernel
with the host to Multiheaded device FM LD tunneling and host to switch to
Multi Logical Device - Logical Device tunneling, but that could also be left to
userspace.

If MCTP, use the existing MCTP framework, which is independent of the underlying
transport.
I posted a PoC for how this might work a while ago (a hack on top of MCTP-I2C
and some emulation). It is described in the cover letter of the emulation PoC:
https://lore.kernel.org/linux-cxl/20220520170128.4436-1-Jonathan.Cameron@huawei.com/

I think everything else belongs in userspace. I believe there are Redfish APIs
etc. that would then be used to query and drive the userspace program from an
orchestrator or similar level software.

> (2) Which functionality kernel-space needs to provide for implementation FM features?
>     Which kernel-space functionality do we need to implement yet?

Very little is needed if we just expose the transport via PCI mailboxes.
There is a possible concern that FM-API commands are frequently destructive
and currently we don't let userspace poke destructive commands. That may
just need a specific opt in to say we know we can shoot ourselves in the foot.

> (3) Do we need MCTP (Management Component Transport Protocol) or some other
>     protocol can be used for interaction between configuration tool, FM daemon, and
>     CXL switch?

Yes, MCTP is needed. I don't think we want the actual management code to be
different depending on transport / protocol. However, we might layer it so that
there is an interface program that sits between the management library / program
and the FM-API transport.

Note I was struggling to find a suitable MCTP interface to emulate, so I would
welcome suggestions on that. I hacked the above PoC using an aspeed i2c controller
that supported the right magic combination of features needed for MCTP over I2C,
but it doesn't have ACPI support, which rather limits usage (and I doubt anyone will
be keen on adding ACPI support just to test CXL related code :)

If anyone knows of a suitable MCTP host we could use for this that would be great
(MCTP over PCI VDM might be nice, for example).

> (4) What architecture FM implementation requires?
> (5) Does it make sense to use Rust as implementation language?

Take your pick ;) First person to write a lot of code gets to pick the language.

>
> CXL Fabric Manager (FM) is the application logic responsible for system composition and
> allocation of resources. The FM can be embedded in the firmware of a device such as
> a CXL switch, reside on a host, or could be run on a Baseboard Management Controller (BMC).
> CXL Specification 3.0 defines Fabric Management as: “CXL devices can be configured statically
> or dynamically via a Fabric Manager (FM), an external logical process that queries and configures
> the system’s operational state using the FM commands defined in this specification. The FM is
> defined as the logical process that decides when reconfiguration is necessary and initiates
> the commands to perform configurations. It can take any form, including, but not limited to,
> software running on a host machine, embedded software running on a BMC, embedded firmware
> running on another CXL device or CXL switch, or a state machine running within the CXL device
> itself.” CXL devices are configured by FM through the Fabric Manager Application Programming
> Interface (FM API) command sets through a CCI (Component Command Interface). A CCI is
> exposed through a device’s Mailbox registers or through an MCTP-capable (Management
> Component Transport Protocol) interface.
>
> FM API Commands (defined by CXL Specification 3.0):
> (1) Physical switch (Identify Switch Device, Get Physical Port State, Physical Port Control,
>     Send PPB (PCI-to-PCI Bridge) CXL.io Configuration Request).
> (2) Virtual Switch (Get Virtual CXL Switch Info, Bind vPPB (Virtual PCI-to-PCI Bridge),
>     Unbind vPPB, Generate AER (Advanced Error Reporting) Event).
> (3) MLD Port (Tunnel Management Command, Send LD (Logical Device) or
>     FMLD (Fabric Manager-owned Logical Device) CXL.io Configuration Request,
>     Send LD CXL.io Memory Request).
> (4) MLD Components (Get LD (Logical Device) Info, Get LD Allocations, Set LD Allocations,
>     Get QoS Control, Set QoS Control, Get QoS Status, Get QoS Allocated Bandwidth,
>     Set QoS Allocated Bandwidth, Get QoS Bandwidth Limit, Set QoS Bandwidth Limit).
> (5) Multi-Headed Devices (Get Multi-Headed Info).
> (6) DCD (Dynamic Capacity Device) Management (Get DCD Info, Get Host Dynamic
>     Capacity Region Configuration, Set Dynamic Capacity Region Configuration, Get DCD
>     Extent Lists, Initiate Dynamic Capacity Add, Initiate Dynamic Capacity Release).
>
> After the initial configuration is complete and a CCI on the switch is operational, an FM can
> send Management Commands to the switch. An FM may perform the following dynamic
> management actions on a CXL switch: (1) Query switch information and configuration details,
> (2) Bind or Unbind ports, (3) Register to receive and handle event notifications from the switch
> (e.g., hot plug, surprise removal, and failures). A switch with MLD (Multi-Logical Device)
> requires an FM to perform the following management activities: (1) MLD discovery,
> (2) LD (Logical Device) binding/unbinding, (3) Management Command Tunneling. The FM can
> connect to an MLD (Multi-Logical Device) over a direct connection or by tunneling its
> management commands through the CCI of the CXL switch to which the device is connected.
> The FM can perform the following operations: (1) Memory allocation and QoS Telemetry
> management, (2) Security (e.g., LD erasure after unbinding), (3) Error handling.
>
> FM configuration tool requires such commands:

A command line tool is fine, but likely the 'real' FM configuration interface will
be via a protocol (e.g. Redfish).
https://www.dmtf.org/standards/redfish
There is a WIP for CXL, though I'm not sure of the latest status on this (the document
on there is from 2021).

So ultimately I'd expect fm_cli to be a wrapper around libredfish / redfishtool
http://github.com/DMTF/RedFishTool etc. that just makes it a bit easier to poke
with common commands. I'm far from an expert on Redfish so may have this all wrong.

>
> Discover - discover available agents
> Subcommands:
>   - fm_cli discover fm - discover FM instances

If we are allowing more than one FM then I'd expect all the other commands
to be directed at a particular FM by some sort of FM specific ID. If only one, what
does this command do that isn't better done with fm get_info?

>   - fm_cli discover cxl_devices - discover CXL devices
>   - fm_cli discover logical_devices - discover logical devices

Discover switches as well.

>
> FM - manage Fabric Manager
> Subcommands:
>   - fm_cli fm get_info - get FM status/info
>   - fm_cli fm start - start FM instance
>   - fm_cli fm restart - restart FM instance
>   - fm_cli fm stop - stop FM instance
>   - fm_cli fm get_config - get FM configuration
>   - fm_cli fm set_config - set FM configuration

I'd keep this slim for now. No idea what FM config we might want to set,
so don't bother listing commands yet.

>   - fm_cli fm get_events - get event records

Not sure what FM would have in the way of events (as opposed to things it
is talking to).

>
> Switch - manage CXL switch
> Subcommands:
>   - fm_cli switch get_info - get CXL switch info/status

These all need an ID field of some type to identify which switch.
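
To make the two ID comments above concrete, here is a minimal sketch of how a command line could carry both an FM selector and a switch selector. Everything in it is an assumption for illustration: the option names `--fm-id` and `--switch-id`, the default `fm0`, and the use of Python's argparse are not taken from any existing fm_cli.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a hypothetical fm_cli parser with explicit FM and switch IDs."""
    parser = argparse.ArgumentParser(prog="fm_cli")
    # Hypothetical option: selects which FM instance all commands are sent to.
    parser.add_argument("--fm-id", default="fm0", help="target FM instance")
    groups = parser.add_subparsers(dest="group", required=True)

    fm = groups.add_parser("fm", help="manage Fabric Manager")
    fm_commands = fm.add_subparsers(dest="command", required=True)
    fm_commands.add_parser("get_info", help="get FM status/info")

    switch = groups.add_parser("switch", help="manage CXL switch")
    # Hypothetical option: every switch command must say which switch it means.
    switch.add_argument("--switch-id", required=True, help="target switch")
    switch_commands = switch.add_subparsers(dest="command", required=True)
    switch_commands.add_parser("get_info", help="get CXL switch info/status")
    switch_commands.add_parser("get_config", help="get switch configuration")
    return parser


# Example invocation: fm_cli --fm-id fm1 switch --switch-id sw3 get_info
example = build_parser().parse_args(
    ["--fm-id", "fm1", "switch", "--switch-id", "sw3", "get_info"]
)
print(example.fm_id, example.switch_id, example.group, example.command)
```

Making the switch ID a required option means a command such as `fm_cli switch get_info` with no ID is rejected by the parser rather than silently acting on some default switch.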

>   - fm_cli switch get_config - get switch configuration
>   - fm_cli switch set_config - set switch configuration
>
> Logical Device - manage logical devices
> Subcommands:
>   - fm_cli multi_headed_device info - retrieves the number of heads, number of
>     supported LDs, and Head-to-LD mapping of a Multi-Headed device
>   - fm_cli logical_device bind - bind logical device
>   - fm_cli logical_device unbind - unbind logical device
>   - fm_cli logical_device connect - connect Multi Logical Device to CXL switch
>   - fm_cli logical_device disconnect - disconnect Multi Logical Device from CXL switch
>   - fm_cli logical_device get_allocation - Get LD Allocations (retrieves the memory
>     allocations of the MLD)
>   - fm_cli logical_device set_allocation - Set LD Allocations (sets the memory allocation
>     for each LD)
>   - fm_cli logical_device get_qos_control - Get QoS Control (retrieves the MLD’s QoS
>     control parameters)
>   - fm_cli logical_device set_qos_control - Set QoS Control (sets the MLD’s QoS control
>     parameters)
>   - fm_cli logical_device get_qos_status - Get QoS Status (retrieves the MLD’s QoS Status)
>   - fm_cli logical_device get_qos_allocated_bandwidth - Get QoS Allocated Bandwidth
>     (retrieves the MLD’s QoS allocated bandwidth on a per-LD basis)
>   - fm_cli logical_device set_qos_allocated_bandwidth - Set QoS Allocated Bandwidth
>     (sets the MLD’s QoS allocated bandwidth on a per-LD basis)
>   - fm_cli logical_device get_qos_bandwidth_limit - Get QoS Bandwidth Limit (retrieves the
>     MLD’s QoS bandwidth limit on a per-LD basis)
>   - fm_cli logical_device set_qos_bandwidth_limit - Set QoS Bandwidth Limit (sets the
>     MLD’s QoS bandwidth limit on a per-LD basis)
>   - fm_cli logical_device erase - secure erase after unbinding
>
> PCI-to-PCI Bridge - manage PPB (PCI-to-PCI Bridge)
> Subcommands:
>   - fm_cli ppb config - Send PPB (PCI-to-PCI Bridge) CXL.io Configuration Request

That one may want a more convenient interface, as likely a lot of commands would
be sent if the aim is to configure a device before binding. Also, CXL.io Memory
Requests want to be here I think.

>   - fm_cli ppb bind - Bind vPPB (Virtual PCI-to-PCI Bridge inside a CXL switch that is
>     host-owned)
>   - fm_cli ppb unbind - Unbind vPPB (unbinds the physical port or LD from the virtual
>     hierarchy PPB)
>
> Physical Port - manage physical ports
> Subcommands:
>   - fm_cli physical_port get_info - get state of physical port
>   - fm_cli physical_port control - control unbound ports and MLD ports, including issuing
>     resets and controlling sidebands
>   - fm_cli physical_port bind - bind physical port to vPPB (Virtual PCI-to-PCI Bridge)
>   - fm_cli physical_port unbind - unbind physical port from vPPB (Virtual PCI-to-PCI Bridge)
>
> MLD (Multi-Logical Device) Port - manage Multi-Logical Device ports
> Subcommands:
>   - fm_cli mld_port tunnel - Tunnel Management Command (tunnels the provided command
>     to LD FFFFh of the MLD on the specified port)

Make it clear how nesting of commands in a tunnel would be specified.
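
One possible way to specify that nesting, sketched under loud assumptions: each tunnel level is written as `tunnel key=value ...` and a literal `--` introduces the command carried inside it. The `--` separator, the `key=value` syntax, and the names `port`, `ld`, and `get_ld_info` are all hypothetical and are not taken from the CXL specification or from any proposed tool. The sketch only parses the text into a nested structure; it does not build FM-API payloads.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Command:
    name: str
    args: Dict[str, str]
    inner: Optional["Command"] = None  # command carried inside a tunnel


def parse_nested(tokens: List[str]) -> Command:
    """Parse 'name key=value ... [-- inner command ...]' recursively."""
    if "--" in tokens:
        split = tokens.index("--")
        head, rest = tokens[:split], tokens[split + 1:]
    else:
        head, rest = tokens, []
    if not head:
        raise ValueError("empty command")
    name, *pairs = head
    # Each argument must look like key=value (hypothetical syntax).
    args = dict(pair.split("=", 1) for pair in pairs)
    inner = parse_nested(rest) if rest else None
    if inner is not None and name != "tunnel":
        raise ValueError(f"only 'tunnel' can carry an inner command, not {name!r}")
    if name == "tunnel" and inner is None:
        raise ValueError("'tunnel' needs an inner command after '--'")
    return Command(name, args, inner)


# Host -> switch -> MLD -> LD, written as two nested tunnels.
nested = parse_nested("tunnel port=4 -- tunnel ld=1 -- get_ld_info".split())
```

With this convention the depth of nesting is visible on the command line, and a tunnel with nothing inside it is rejected up front.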

>   - fm_cli mld_port send_config - Send LD (Logical Device) or FMLD (Fabric
>     Manager-owned Logical Device) CXL.io Configuration Request
>   - fm_cli mld_port send_memory_request - Send LD CXL.io Memory Request
>
> DCD (Dynamic Capacity Device) - manage Dynamic Capacity Device
> Subcommands:
>   - fm_cli dcd get_info - Get DCD Info (retrieves the number of supported hosts,
>     total Dynamic Capacity of the device, and supported region configurations)
>   - fm_cli dcd get_capacity_config - Get Host Dynamic Capacity Region Configuration
>     (retrieves the Dynamic Capacity configuration for a specified host)
>   - fm_cli dcd set_capacity_config - Set Dynamic Capacity Region Configuration
>     (sets the configuration of a DC Region)
>   - fm_cli dcd get_extent_list - Get DCD Extent Lists (retrieves the Dynamic Capacity
>     Extent List for a specified host)
>   - fm_cli dcd add_capacity - Initiate Dynamic Capacity Add (initiates the addition of
>     Dynamic Capacity to the specified region on a host)

That one is complex ;) Probably needs a whole man page to itself.

>   - fm_cli dcd release_capacity - Initiate Dynamic Capacity Release (initiates the release of
>     Dynamic Capacity from a host)
>
> FM daemon receives requests from configuration tool and executes commands by means of
> interaction with kernel-space subsystems. The responsibility of FM daemon could be:
>   - Execute configuration tool commands
>   - Manage hot-add and hot-removal of devices

In what sense? I'd expect it to notify some higher level entity (orchestrator or similar)
but not sure I see what management the FM would do.

>   - Manage surprise removal of devices

Likewise, beyond reporting I wouldn't expect the FM daemon to have any idea
what to do in the way of managing this. Scream loudly?

>   - Receive and handle event notifications from the CXL switch
>   - Logging events
>   - Memory allocation and QoS Telemetry management
>   - Error/Failure handling

I'm not sure on separation of role between this component and higher level
policy / admin driven software. For memory allocation it might take a
'give host A this much memory with this characteristic set' command and own
the allocations across all present devices, or it might just act as an interface
layer to higher level software that does the fine detail of figuring out which
device to allocate memory from to satisfy such a request.

Whilst I agree having a broad vision for an interface is good, there are a lot
of subtle details in some of these commands, so I'd not spend too long refining
the whole lot. Probably better to look at them one at a time and then just
have whoever ends up maintaining / reviewing this thing responsible for
making sure the parameter format etc. is consistent across commands.

Fun fun fun

Jonathan

>
> Thanks,
> Slava.
>
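
The layering suggested in the answers to questions (2) and (3) can be sketched as follows: management code written once against a small transport interface, with an explicit opt in before any destructive command is sent. This is only an illustration under stated assumptions. The class and function names are hypothetical, the opcode values are placeholders and NOT taken from the CXL specification, which commands count as destructive is assumed, and the loopback transport stands in for a real PCI mailbox or MCTP path.

```python
from typing import List, Protocol, Tuple


class Transport(Protocol):
    """Anything that can carry an FM-API style request and return a response."""

    def send(self, opcode: int, payload: bytes) -> bytes: ...


class LoopbackTransport:
    """Stand-in for a real transport (PCI mailbox or MCTP); echoes the payload."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.sent: List[Tuple[int, bytes]] = []

    def send(self, opcode: int, payload: bytes) -> bytes:
        self.sent.append((opcode, payload))
        return payload


# Placeholder opcodes for illustration only, not real FM-API opcodes.
OP_GET_INFO = 0x0001
OP_UNBIND = 0x0002
# Assumption: which commands are treated as destructive.
DESTRUCTIVE_OPCODES = {OP_UNBIND}


class FabricManager:
    """Management logic that does not care which transport it runs over."""

    def __init__(self, transport: Transport, allow_destructive: bool = False) -> None:
        self._transport = transport
        self._allow_destructive = allow_destructive

    def command(self, opcode: int, payload: bytes = b"") -> bytes:
        # Refuse destructive commands unless the caller explicitly opted in.
        if opcode in DESTRUCTIVE_OPCODES and not self._allow_destructive:
            raise PermissionError("destructive command needs allow_destructive=True")
        return self._transport.send(opcode, payload)


# The same management code runs over either transport.
via_mailbox = FabricManager(LoopbackTransport("mailbox"))
via_mctp = FabricManager(LoopbackTransport("mctp"), allow_destructive=True)
```

Only the transport object differs between the two instances, so swapping a mailbox for MCTP would not touch the management logic.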