Date: Thu, 2 Feb 2023 09:54:02 +0000
From: Jonathan Cameron
To: Viacheslav A.Dubeyko
CC: Dan Williams, Adam Manzanares, Cong Wang, Viacheslav Dubeyko
Subject: Re: [External] [LSF/MM/BPF TOPIC] CXL Fabric Manager (FM) architecture
Message-ID: <20230202095402.0000585d@Huawei.com>
In-Reply-To: <5671D3B3-83B3-49FF-A662-509648E6D297@bytedance.com>
References: <7F001EAF-C512-436A-A9DD-E08730C91214@bytedance.com> <20230131174115.00007493@Huawei.com> <5671D3B3-83B3-49FF-A662-509648E6D297@bytedance.com>
Organization: Huawei Technologies Research and Development (UK) Ltd.

On Wed, 1 Feb 2023 12:04:56
-0800 "Viacheslav A.Dubeyko" wrote:

> Hi Jonathan,
>
> > On Jan 31, 2023, at 9:41 AM, Jonathan Cameron wrote:
> >
> > On Mon, 30 Jan 2023 11:11:23 -0800
> > "Viacheslav A.Dubeyko" wrote:
> >
> >> Hello,
> >
> > Hi Slava,
> >
> > I'll throw some opinions at this :)
> >
> >>
> >> I would like to suggest Fabric Manager (FM) architecture discussion. As far as I can see,
> >> FM architecture requires: (1) FM configuration tool, (2) FM daemon, (3) QEMU emulation
> >> of CXL hardware features. FM daemon receives requests from configuration tool and
> >> executes commands by means of interaction with kernel-space subsystem and CXL switch
> >> (that can be emulated by QEMU). So, the key questions for discussion:
> >
> > Worth describing operating modes to be supported: You kind of cover this later
> > but I think pulling it out makes it clearer that we want one bit of software to
> > do several different things.
> >
> > 1) FM separate from hosts and talked to by higher level orchestration software
> >    but using a Switch CCI or MHD mailbox (over PCI)
> >    This one is fairly easy because any security / shooting self in foot problems
> >    are an issue for higher level software.
> > 2) FM on host. Probably mostly going to be relevant for debug but may use
> >    the same mailbox as is being used by the existing CXL drivers (for Multi
> >    Head Device it might be the end point mailbox, for Multi Logical Device
> >    behind a switch it might be the switch mailbox).
> > 3) All out of band (MCTP or similar - want some shared code, but no
> >    need for anything in kernel as far as I can tell).
> >
>
> Most probably, we will have multiple FM implementations in firmware.
> Yes, FM on host could be important for debug and to verify correctness of
> firmware-based implementations. But FM daemon on host could be important
> to receive notifications and react somehow on these events.
> Also, journalling
> of events/messages could be an important responsibility of FM daemon
> on host.

I agree with an FM daemon somewhere (potentially running on the BMC type chip
that also has the lower level FM-API access). I think it is somewhat
separate from the rest of this on the basis it may well just be talking redfish
to the FM and there are lots of tools for that sort of handling already.

>
> >
> >> (1) How to distribute functionality between user-space and kernel-space?
> >
> > Kernel for transport if mailbox based (switch or MHD).
> > Possibly help in kernel with the host to Multiheaded device FM LD tunneling
> > and host to switch to Multi Logical Device - Logical Device tunneling
> > but that could also be left to userspace.
> >
>
> People love to move everything into user-space now. But I believe we could have
> both kernel-space and user-space solutions. I think we need to check which way could be
> the more efficient and elegant solution.

Agreed - though I think we need to remember running this on the host that
is using the devices isn't likely to be a common actual usecase. So we should
design for that to 'work' but not to be the assumed method. Hence if any
sync type activity is needed it might be a case of don't do the wrong thing
rather than hard protections.

>
> > If MCTP use the existing MCTP framework which is underlying transport independent.
> > I posted a PoC for how this might work a while ago (hack on top of MCTP-I2C
> > and some emulation) In the cover letter of the emulation PoC
> >
>
> Sounds interesting. Let me check it. But I believe it might not be the first task
> in this implementation. :)

Some level of MCTP support needs to be early enough that we don't get any
design decisions wrong. For MCTP I think the vast majority of handling has
to be in userspace. I don't want to end up with duplication because we did
some of that down in the kernel for the mailbox solution.
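
To illustrate the "no duplication" point only: a minimal sketch of userspace
command code written once against a transport interface, so that a mailbox
backend and an MCTP backend could share it. Everything here is an assumption
made for illustration. Transport, LoopbackTransport, run_command and the
opcode value are hypothetical names, not taken from the PoC or from the
FM-API, and the loopback backend only stands in for real mailbox or MCTP code.

```python
from abc import ABC, abstractmethod


class Transport(ABC):
    """Hypothetical interface: carries an opaque request, returns the reply."""

    @abstractmethod
    def send(self, opcode: int, payload: bytes) -> bytes:
        ...


class LoopbackTransport(Transport):
    """Stand-in for a mailbox or MCTP backend; echoes the request back."""

    def __init__(self, name: str):
        self.name = name
        self.sent = []  # record of (opcode, payload) pairs, for inspection

    def send(self, opcode: int, payload: bytes) -> bytes:
        self.sent.append((opcode, payload))
        # Reply is the 16-bit little-endian opcode followed by the payload.
        return opcode.to_bytes(2, "little") + payload


def run_command(transport: Transport, opcode: int, payload: bytes = b"") -> bytes:
    """Management code depends only on Transport, not on a specific backend."""
    return transport.send(opcode, payload)


# Placeholder opcode for illustration; not a real FM-API opcode.
PLACEHOLDER_OPCODE = 0x0001

mailbox = LoopbackTransport("mailbox")
mctp = LoopbackTransport("mctp")
reply_mailbox = run_command(mailbox, PLACEHOLDER_OPCODE, b"\x00")
reply_mctp = run_command(mctp, PLACEHOLDER_OPCODE, b"\x00")
```

With this shape, only the Transport subclasses differ per backend; whether
the mailbox one ends up as a thin wrapper over a kernel interface is exactly
the open design question above.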
>
> >
> > I think everything else belongs in userspace. I believe there are redfish APIs
> > etc that would then be used to query and drive the userspace program from an
> > orchestrator or similar level software.
> >
>
> I need to check the redfish API. It sounds reasonable to employ some existing
> framework.
>
> >> (2) Which functionality kernel-space needs to provide for implementation FM features?
> >> Which kernel-space functionality do we need to implement yet?
> >
> > Very little needed if we just expose the transport via PCI mailboxes.
> > There is a possible concern that FM-API commands are frequently
> > destructive and currently we don't let userspace poke destructive
> > commands. That may just need a specific opt in to say we know we
> > can shoot ourselves in the foot.
> >
>
> I think this is why we need kernel. It sounds to me that we have to have user-space
> and kernel-space collaboration here.

I think it will be lightweight and look like the existing CXL mailbox userspace
interface (some commands are the same).

>
> >> (3) Do we need MCTP (Management Component Transport Protocol) or some other
> >> protocol can be used for interaction between configuration tool, FM daemon, and
> >> CXL switch?
> >
> > Yes MCTP is needed.
> > I don't think we want the actual management code to be different
> > depending on transport / protocol. However we might layer it so that there
> > is an interface program that sits between the management library / program and
> > the FM-API transport.
> >
> > Note I was struggling to find a suitable MCTP interface to emulate - so would
> > welcome suggestions on that.
> > I hacked the above PoC using an aspeed i2c
> > controller that supported the right magic combination of features needed
> > for MCTP over I2C but it doesn't have ACPI support which rather limits
> > usage (and I doubt anyone will be keen on adding ACPI support just to
> > test CXL related code :) If anyone knows of a suitable MCTP host we
> > could use for this that would be great (MCTP over PCI VDM might be nice for
> > example)
> >
>
> Let us start some command/feature implementation and we will figure it out.
> But, I assume we need to start from something like CXL devices discovery at first.

Sure - some of the kernel side of that was present in the switch-cci mailbox PoC
Obviously tooling was a test hack though ;)

>
> >> (4) What architecture FM implementation requires?
> >> (5) Does it make sense to use Rust as implementation language?
> >
> > Take your pick ;) First person to write a lot of code gets to pick the language.
> >
>
> Yeah, I see the point. Rust can provide some benefits (memory safety model, for example).
> But it could introduce some issues with collaboration and make implementation
> slower. Everybody develops in C language. But switching to Rust might not be so easy
> a target.
>
> >
> >>
> >>
> >> FM configuration tool requires such commands:
> >
> > A command line tool is fine, but likely the 'real' FM configuration interface will be via
> > a protocol (e.g. redfish).
> > There is a WIP for CXL, though not sure on latest status on this (document on there is from
> > 2021)
> >
> > So ultimately I'd expect fm_cli to be a wrapper around libredfish / redfishtool
> > that just makes it a bit easier to poke
> > with common commands.
> >
> > I'm far from an expert of redfish so may have this all wrong.
> >
>
> Sounds reasonable to me. Let me check how good it could be for this project.
>
> >>
> >> Discover - discover available agents
> >> Subcommands:
> >> - fm_cli discover fm - discover FM instances
> >
> > If we are allowing more than one FM then I'd expect all the
> > other commands to be directed at that by some sort of FM specific
> > ID. If only one, what does this command do that isn't better
> > done with fm get_info
> >
>
> Yes, we need to identify every object somehow. And it’s an interesting point.
> From point of view, some human-friendly names could be good.
> But firmware-based FM implementation needs to follow the same rules.
> And it sounds to me that CXL specification should define how CXL FM or
> CXL device identifies itself. Anyway, we need to ask CXL device and it should
> return to us some ID. Probably, it will be some GUID or likewise number.
>
> >
> >> - fm_cli discover cxl_devices - discover CXL devices
> >> - fm_cli discover logical_devices - discover logical devices
> >
> > Discover switches as well.
> >
>
> I assumed that CXL switch is a subclass of CXL devices. Do you mean that
> it is independent case?

Maybe simpler broken out. What you do with a switch is often very different
from type 3 devices.

>
> >>
> >> FM - manage Fabric Manager
> >> Subcommands:
> >> - fm_cli fm get_info - get FM status/info
> >> - fm_cli fm start - start FM instance
> >> - fm_cli fm restart - restart FM instance
> >> - fm_cli fm stop - stop FM instance
> >> - fm_cli fm get_config - get FM configuration
> >> - fm_cli fm set_config - set FM configuration
> >
> > I'd keep this slim for now. No idea what FM config we might want to
> > set so don't bother listing command yet.
> >
>
> Yeah, it’s not completely clear yet.
> But I assume we can consider such
> configuration options like:
> (1) register to receive event notifications
> (2) logging of events
> (3) errors handling
>
> >> - fm_cli fm get_events - get event records
> > Not sure what FM would have in the way of events (as opposed to
> > things it is talking to).
> >
>
> I think FM can log events. If we consider FM daemon on host, then it
> could issue messages to end user as reaction to some events.
>
> >>
> >> Switch - manage CXL switch
> >> Subcommands:
> >> - fm_cli switch get_info - get CXL switch info/status
> >
> > These all need an ID field of some type to identify which switch.
> >
>
> Yeah, it is exactly what we need for every command. We need to identify
> an object for a request.
>
> >> - fm_cli switch get_config - get switch configuration
> >> - fm_cli switch set_config - set switch configuration
>
> >
> >>
> >> DCD (Dynamic Capacity Device) - manage Dynamic Capacity Device
> >> Subcommands:
> >> - fm_cli dcd get_info - Get DCD Info (retrieves the number of supported hosts,
> >>   total Dynamic Capacity of the device, and supported region configurations)
> >> - fm_cli dcd get_capacity_config - Get Host Dynamic Capacity Region Configuration
> >>   (retrieves the Dynamic Capacity configuration for a specified host)
> >> - fm_cli dcd set_capacity_config - Set Dynamic Capacity Region Configuration
> >>   (sets the configuration of a DC Region)
> >> - fm_cli dcd get_extent_list - Get DCD Extent Lists (retrieves the Dynamic Capacity
> >>   Extent List for a specified host)
> >> - fm_cli dcd add_capacity - Initiate Dynamic Capacity Add (initiates the addition of
> >>   Dynamic Capacity to the specified region on a host)
> >
> > That one is complex ;) Probably needs a whole man page to itself.
> >
>
> Currently, it’s only declaration of command set. Yeah, implementation will be complex.
> :)
>
> >> - fm_cli dcd release_capacity - Initiate Dynamic Capacity Release (initiates the release of
> >>   Dynamic Capacity from a host)
> >>
> >> FM daemon receives requests from configuration tool and executes commands by means of
> >> interaction with kernel-space subsystems. The responsibility of FM daemon could be:
> >> - Execute configuration tool commands
> >> - Manage hot-add and hot-removal of devices
> >
> > In what sense? I'd expect it to notify some higher level entity
> > (orchestrator or similar) but not sure I see what management the
> > FM would do.
> >
>
> I assume that if FM manages some metadata, then hot-add or hot-removal could
> require some metadata corrections. Also, hot-add and hot-removal can generate some
> events that FM can receive and process somehow. For example, it is possible to log
> event messages into some journal.

Ok. Potentially stuff there - though exactly which layer ends up managing this
stuff isn't obvious to me yet.

>
> >> - Manage surprise removal of devices
> >
> > Likewise, beyond reporting I wouldn't expect the FM daemon to have any idea
> > what to do in the way of managing this. Scream loudly?
> >
>
> Maybe, it could require application(s) notification. Let’s imagine that application
> uses some resources from removed device. Maybe, FM can manage kernel-space
> metadata correction and help to manage application requests to non-existent
> entities.

Notifications for the host are likely to come via inband means - so type3 driver
handling rather than related to FM. As far as the host is concerned this is the
same as the case where there is no FM and someone ripped a device out.

There might indeed be metadata to manage, but I doubt it will have anything to
do with the kernel.
>
> >> - Receive and handle event notifications from the CXL switch
> >> - Logging events
> >> - Memory allocation and QoS Telemetry management
> >> - Error/Failure handling
> >
> > I'm not sure on separation of role between this component and
> > higher level policy / admin driven software.
> >
> > For memory allocation it might take a 'give host A this much
> > memory with this characteristic set' command and own the
> > allocations across all present devices, or it might just
> > act as an interface layer to higher level software that does
> > the fine detail of figuring out which device to allocate memory
> > from to satisfy such a request.
> >
> > Whilst I agree having a broad vision for an interface is good
> > there are a lot of subtle details in some of these commands
> > so I'd not spend too long refining the whole lot. Probably better
> > to look at them one at a time and then just have whoever ends
> > up maintaining / reviewing this thing responsible for making sure the
> > parameter format etc is consistent across commands.
> >
>
> Yes, I agree. Let’s do it step by step. I believe we need to start from
> implementing the application that processes commands and does nothing
> at first. And the first command that needs to be implemented is a discovery
> of CXL devices, switches, and FM instances because we need to identify
> CXL object somehow for any other command.

Agreed, discovery of devices and capabilities is definitely where to start
+ I think presenting that as a redfish model.

Jonathan

>
> Thanks,
> Slava.
>
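
For reference, the "process commands and do nothing at first" starting point
agreed above could be sketched roughly as below. This is an illustration
under assumptions, not an agreed design: the subcommand names follow the
fm_cli list proposed in this thread, the "switches" target reflects the
suggestion to break switches out, and build_parser / main are hypothetical
helper names. No object ID argument is included because the ID scheme is
still undecided.

```python
import argparse

# Discovery targets from the thread; "switches" is the suggested addition.
DISCOVER_TARGETS = ("fm", "cxl_devices", "logical_devices", "switches")


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for 'fm_cli discover <target>' only."""
    parser = argparse.ArgumentParser(prog="fm_cli")
    groups = parser.add_subparsers(dest="group", required=True)
    discover = groups.add_parser("discover", help="discover available agents")
    discover.add_argument("target", choices=DISCOVER_TARGETS)
    return parser


def main(argv=None) -> int:
    """Parse the command line and do nothing else yet."""
    args = build_parser().parse_args(argv)
    # No FM daemon or transport behind this yet: just acknowledge the request.
    print(f"{args.group} {args.target}: not implemented")
    return 0
```

Calling main(["discover", "cxl_devices"]) parses the request and returns 0
without touching any device, which leaves room to plug in a real backend
(mailbox, MCTP, or redfish) once the discovery model is settled.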