Date: Thu, 9 Feb 2023 11:05:02 +0000
From: Jonathan Cameron
To: Viacheslav A.Dubeyko
CC: Adam Manzanares, "lsf-pc@lists.linux-foundation.org", "linux-mm@kvack.org", "linux-cxl@vger.kernel.org", Dan Williams, "Cong Wang", Viacheslav Dubeyko
Subject: Re: [External] [LSF/MM/BPF TOPIC] CXL Fabric Manager (FM) architecture
Message-ID: <20230209110502.00001a7a@Huawei.com>
In-Reply-To: <7E864E85-A36F-487B-8B70-C8C49FBECD73@bytedance.com>
References: <7F001EAF-C512-436A-A9DD-E08730C91214@bytedance.com> <20230131174115.00007493@Huawei.com> <5671D3B3-83B3-49FF-A662-509648E6D297@bytedance.com> <20230202095402.0000585d@Huawei.com> <20230208163844.GA407917@bgt-140510-bm01> <7E864E85-A36F-487B-8B70-C8C49FBECD73@bytedance.com>
Organization: Huawei Technologies Research and Development (UK) Ltd.

On Wed, 8 Feb 2023 10:03:57 -0800
"Viacheslav A.Dubeyko" wrote:

> > On Feb 8, 2023, at 8:38 AM, Adam Manzanares wrote:
> >
> > On Thu, Feb 02, 2023 at 09:54:02AM +0000, Jonathan Cameron wrote:
> >> On Wed, 1 Feb 2023 12:04:56 -0800
> >> "Viacheslav A.Dubeyko" wrote:
> >>
> >>>
> >>> Most probably, we will have multiple FM implementations in firmware.
> >>> Yes, FM on host could be important for debug and to verify correctness
> >>> firmware-based implementations. But FM daemon on host could be important
> >>> to receive notifications and react somehow on these events. Also, journalling
> >>> of events/messages/events could be important responsibility of FM daemon
> >>> on host.
> >>
> >> I agree with an FM daemon somewhere (potentially running on the BMC type chip
> >> that also has the lower level FM-API access). I think it is somewhat
> >> separate from the rest of this on basis it may well just be talking redfish
> >> to the FM and there are lots of tools for that sort of handling already.
> >>
> >
> > I would be interested in participating in a BOF about this topic. I wonder what
> > happens when we have multiple switches with multiple FMs each on a separate BMC.
> > In this case, does it make more sense to have an owner of the global FM state
> > be a user space application. Is this the job of the orchestrator?

This partly comes down to terminology. Ultimately there is an FM that is
responsible for the whole fabric (could be distributed software) and that in
turn will talk to the various BMCs that then talk to the switches.

Depending on the setup it may not be necessary for any entity to see the
whole fabric.

Interesting point in general though. I think it boils down to getting the
layering in any software correct, and that is easier done from the outset.

I don't know whether the Redfish stuff is flexible enough to cover this, but
if it is, I'd envision the actual FM talking Redfish to a bunch of sub-FMs and
in turn presenting Redfish to the orchestrator.

Any of these components might run on separate machines, or in firmware on
some device, or indeed all run on one server that is acting as the FM and
a node in the orchestrator layer.

> >
> > The BMC based FM seems to have scalability issues, but will we hit them in
> > practice any time soon.

Who knows ;) If anyone builds the large scale fabric stuff in CXL 3.0 then
we definitely will in the medium term.

>
> I had discussion recently and it looks like there are interesting points:
> (1) If we have multiple CXL switches (especially with complex hierarchy), then it is
> very compute-intensive activity. So, potentially, FM on firmware side could be not
> capable to digest and executes all responsibilities without potential performance
> degradation.

There is firmware and there is firmware ;) It's not uncommon for BMCs to be
significant devices in their own right and run Linux or other heavyweight OSes.

> (2) However, if we have FM on host side, then there is security concerns because
> FM sees everything and all details of multiple hosts and subsystems.

Agreed. Other than testing I wouldn't expect the FM to run on a 'host', but in
at least some implementations it will be running on a capable Linux machine.
In large fabrics that may be very capable indeed (basically a server dedicated
to this role).

> (3) Technically speaking, there is one potential capability that user-space FM daemon
> can run as on host side as on CXL switch side. I mean here that if we implement
> user-space FM daemon, then it could be used to execute FM functionality on CXL
> switch side (maybe????). :)

Sure, anything could run anywhere. We should draw up some 'reference'
architectures though to guide discussion down the line.
Mind you I think there are a lot of steps along the way and the starting point
should be a simple PoC where all the FM stuff is in Linux userspace (other than
comms). That's easy enough to do. If I get a quiet week or so I'll hammer out
what we need on the emulation side to start playing with this.

Jonathan

>
> >
> >>>>> - Manage surprise removal of devices
> >>>>
> >>>> Likewise, beyond reporting I wouldn't expect the FM daemon to have any idea
> >>>> what to do in the way of managing this. Scream loudly?
> >>>>
> >>>
> >>> Maybe, it could require application(s) notification. Let’s imagine that application
> >>> uses some resources from removed device. Maybe, FM can manage kernel-space
> >>> metadata correction and helping to manage application requests to not existing
> >>> entities.
> >>
> >> Notifications for the host are likely to come via inband means - so type3 driver
> >> handling rather than related to FM. As far as the host is concerned this is the
> >> same as case where there is no FM and someone ripped a device out.
> >>
> >> There might indeed be meta data to manage, but doubt it will have anything to
> >> do with kernel.
> >>
> >
> > I've also had similar thoughts, I think the OS responds to notifications that
> > are generated in-band after changes to the state of the FM are made through
> > OOB means.
> >
> > I envision the host sends REDFISH requests to a switch BMC that has an FM
> > implementation. Once the changes are implemented by the FM it would show up
> > as changes to the PCIe hierarchy on a host, which is capable of responding to
> > such changes.
> >
>
> I think I am not completely follow your point. :) First of all, I assume that if host
> sends REDFISH request, then it will be expected the confirmation of request execution.
> It means for me that host needs to receive some packet that informs that request
> executed successfully or failed.
> It means that some subsystem or application requested
> this change and only after receiving the confirmation requested capabilities can be used.
> And if FM is on CXL switch side, then how FM will show up the changes? It sounds for me
> that some FM subsystem should be on the host side to receive confirmation/notification
> and to execute the real changes in PCIe hierarchy. Am missing something here?

Another terminology issue I think. FM from the CXL side of things is an abstract
thing (potentially highly layered / distributed) that acts on instructions from
an orchestrator (also potentially highly distributed; in one implementation the
hosts can be the orchestrator) and configures the fabric.
The downstream APIs to the switches and EPs are all in FM-API (CXL spec).
Upstream is probably all Redfish. What happens in between is implementation
defined (though obviously mapping to Redfish or FM-API as applicable may make
it more reusable and flexible).

I think some diagrams of what is where will help.
I think we need (note I've always kept the controller hosts as normal hosts as
well, as that includes the case where it never uses the fabric - so BMC type
cases are a subset without needing to double the number of diagrams):

1) Diagram of a single host with the FM as one 'thing' on that host - direct
   interfaces to a single switch - interface options include switch CCI MB,
   MCTP over PCI VDM, MCTP over say I2C.
2) Diagram of the same as above, with a multi-headed device (MHD) all connected
   to one host.
3) Diagram of 1 (maybe with MHD below switches), but now with multiple hosts,
   one of which is responsible for fabric management (FM in that manager host
   and orchestrator) - agents on other hosts able to send requests for services
   to that host.
4) Diagram of 3, but now with multiple switches, each with a separate
   controlling host. Some other hosts that don't have any fabric control.
   Distributed FM across the controlling hosts.
5) Diagram of 4 but with layered FM and separate orchestrator.
   Hosts all talk to the orchestrator, which then talks to the FM.
6) 4, but push some management entities down into switches (from an architecture
   point of view this is no different from the layered case with a separate BMC
   per switch - there is still either a distributed FM or a layered FM, which
   the orchestrator talks to).

Can mess with the exact distribution of who does what across the various layers.

I can sketch this lot up (and that will probably make some gaps in these cases
apparent) but it will take a little while, hence text descriptions in the
meantime.

I come back to my personal view though - which is don't worry too much at this
early stage, beyond making sure we have some layering in code so that we can
distribute it across a distributed or layered architecture later!

Jonathan

>
> Thanks,
> Slava.
>
>
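
A minimal Python sketch of the "layering in code" idea from the message above. It is not taken from the thread or from any existing FM implementation. The assumption is that every layer exposes the same small upward-facing interface, so a leaf FM driving one switch and a layered FM aggregating sub-FMs look identical to whatever sits above them (an orchestrator, say). All names (FabricManager, SwitchFM, LayeredFM, list_ports, bind, and the "switch/port" naming) are hypothetical, and the leaf is an in-memory stub where a real one would speak FM-API to the switch over some transport.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Dict, List


class FabricManager(ABC):
    """Hypothetical upward-facing interface shared by every FM layer."""

    @abstractmethod
    def list_ports(self) -> List[str]:
        """Return fabric-wide port names of the form 'switch/port'."""

    @abstractmethod
    def bind(self, port: str, host: str) -> bool:
        """Bind a port to a host; return True if some layer handled it."""


@dataclass
class SwitchFM(FabricManager):
    """Leaf FM for a single switch.

    Assumption: a real leaf would issue FM-API commands to the switch.
    This stub only records the requested bindings in memory.
    """
    name: str
    ports: List[str]
    bindings: Dict[str, str] = field(default_factory=dict)

    def list_ports(self) -> List[str]:
        return [f"{self.name}/{p}" for p in self.ports]

    def bind(self, port: str, host: str) -> bool:
        switch, _, local = port.partition("/")
        if switch != self.name or local not in self.ports:
            return False  # not ours; let another FM claim it
        self.bindings[local] = host
        return True


@dataclass
class LayeredFM(FabricManager):
    """FM that owns no switch itself and delegates to sub-FMs.

    Children may be leaves or further LayeredFM instances, and nothing
    requires them to live in the same process in a real deployment.
    """
    children: List[FabricManager]

    def list_ports(self) -> List[str]:
        return [p for child in self.children for p in child.list_ports()]

    def bind(self, port: str, host: str) -> bool:
        # Stops at the first child that accepts the request.
        return any(child.bind(port, host) for child in self.children)


# Usage: two switches behind one layered FM.
fabric = LayeredFM([SwitchFM("sw0", ["p0", "p1"]), SwitchFM("sw1", ["p0"])])
print(fabric.list_ports())
print(fabric.bind("sw1/p0", "hostA"))
```

Because LayeredFM only depends on the shared interface, moving from the single-FM case to the layered or distributed cases in the list above would mean swapping what sits behind that interface, not changing its callers.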