Date: Fri, 10 Feb 2023 12:32:57 +0000
From: Jonathan Cameron <jonathan.cameron@huawei.com>
To: Viacheslav A.Dubeyko
Cc: Adam Manzanares, lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, linux-cxl@vger.kernel.org, Dan Williams, Cong Wang, Viacheslav Dubeyko
Subject: Re: [External] [LSF/MM/BPF TOPIC] CXL Fabric Manager (FM) architecture
Message-ID: <20230210123257.000029a9@Huawei.com>
In-Reply-To: <89DC75A8-0507-4AA1-B121-4AC398F615BC@bytedance.com>
Organization: Huawei Technologies Research and Development (UK) Ltd.
On Thu, 9 Feb 2023 14:04:13 -0800
"Viacheslav A.Dubeyko" wrote:

> > On Feb 9, 2023, at 3:05 AM, Jonathan Cameron wrote:
> > 
> > On Wed, 8 Feb 2023 10:03:57 -0800
> > "Viacheslav A.Dubeyko" wrote:
> > 
> >>> On Feb 8, 2023, at 8:38 AM, Adam Manzanares wrote:
> >>> 
> >>> On Thu, Feb 02, 2023 at 09:54:02AM +0000, Jonathan Cameron wrote:
> >>>> On Wed, 1 Feb 2023 12:04:56 -0800
> >>>> "Viacheslav A.Dubeyko" wrote:
> >>>> 
> >>>>> Most probably, we will have multiple FM implementations in firmware.
> >>>>> Yes, an FM on the host could be important for debugging and for verifying
> >>>>> the correctness of firmware-based implementations. But an FM daemon on the
> >>>>> host could be important for receiving notifications and reacting somehow to
> >>>>> these events. Journalling of events/messages could also be an important
> >>>>> responsibility of the FM daemon on the host.
> >>>> 
> >>>> I agree with an FM daemon somewhere (potentially running on the BMC type chip
> >>>> that also has the lower level FM-API access). I think it is somewhat
> >>>> separate from the rest of this, on the basis that it may well just be talking
> >>>> Redfish to the FM, and there are lots of tools for that sort of handling already.
> >>>> 
> >>> 
> >>> I would be interested in participating in a BoF about this topic. I wonder what
> >>> happens when we have multiple switches with multiple FMs, each on a separate BMC.
> >>> In this case, does it make more sense to have the owner of the global FM state
> >>> be a user-space application? Is this the job of the orchestrator?
> > 
> > This partly comes down to terminology. Ultimately there is an FM that is
> > responsible for the whole fabric (it could be distributed software) and that
> > in turn will talk to the various BMCs that then talk to the switches.
> > 
> > Depending on the setup, it may not be necessary for any entity to see the
> > whole fabric.
> > 
> > Interesting point in general though. I think it boils down to getting the
> > layering in any software correct, and that is easier done from the outset.
> > 
> > I don't know whether the Redfish stuff is flexible enough to cover this, but
> > if it is, I'd envision the actual FM talking Redfish to a bunch of sub-FMs
> > and in turn presenting Redfish to the orchestrator.
> > 
> > Any of these components might run on separate machines, or in firmware on
> > some device, or indeed all run on one server that is acting as the FM and
> > as a node in the orchestrator layer.
> > 
> >>> The BMC-based FM seems to have scalability issues, but will we hit them in
> >>> practice any time soon?
> > 
> > Who knows ;) If anyone builds the large-scale fabric stuff in CXL 3.0, then
> > we definitely will in the medium term.
> > 
> >> I had a discussion recently, and it looks like there are interesting points:
> >> (1) If we have multiple CXL switches (especially in a complex hierarchy), then
> >> fabric management is a very compute-intensive activity. So, potentially, an FM
> >> on the firmware side might not be capable of digesting and executing all its
> >> responsibilities without potential performance degradation.
> > 
> > There is firmware and there is firmware ;) It's not uncommon for BMCs to be
> > significant devices in their own right and to run Linux or other heavyweight OSes.
> > 
> >> (2) However, if we have the FM on the host side, then there are security
> >> concerns, because the FM sees everything, including all the details of multiple
> >> hosts and subsystems.
> > 
> > Agreed. Other than for testing, I wouldn't expect the FM to run on a 'host', but
> > in at least some implementations it will be running on a capable Linux machine.
> > In large fabrics that may be very capable indeed (basically a server dedicated to
> > this role).
> > 
> >> (3) Technically speaking, there is one potential capability: a user-space FM
> >> daemon could run on the host side as well as on the CXL switch side. I mean here
> >> that if we implement a user-space FM daemon, then it could be used to execute FM
> >> functionality on the CXL switch side (maybe????). :)
> > 
> > Sure, anything could run anywhere. We should draw up some 'reference'
> > architectures though, to guide discussion down the line. Mind you, I think there
> > are a lot of steps along the way, and the starting point should be a simple PoC
> > where all the FM stuff is in Linux userspace (other than comms). That's easy
> > enough to do. If I get a quiet week or so, I'll hammer out what we need on the
> > emulation side to start playing with this.
> > 
> > Jonathan
> > 
> >>>>>>> - Manage surprise removal of devices
> >>>>>> 
> >>>>>> Likewise, beyond reporting I wouldn't expect the FM daemon to have any idea
> >>>>>> what to do in the way of managing this. Scream loudly?
> >>>>> 
> >>>>> Maybe it could require notifying application(s). Let's imagine that an
> >>>>> application uses some resources from the removed device. Maybe the FM can
> >>>>> manage kernel-space metadata correction and help to manage application
> >>>>> requests to no-longer-existing entities.
> >>>> 
> >>>> Notifications for the host are likely to come via in-band means - so type 3
> >>>> driver handling rather than anything related to the FM. As far as the host is
> >>>> concerned, this is the same as the case where there is no FM and someone
> >>>> ripped a device out.
> >>>> 
> >>>> There might indeed be metadata to manage, but I doubt it will have anything
> >>>> to do with the kernel.
> >>> 
> >>> I've also had similar thoughts. I think the OS responds to notifications that
> >>> are generated in-band after changes to the state of the FM are made through
> >>> OOB means.
> >>> 
> >>> I envision the host sending Redfish requests to a switch BMC that has an FM
> >>> implementation. Once the changes are implemented by the FM, they would show up
> >>> as changes to the PCIe hierarchy on a host, which is capable of responding to
> >>> such changes.
> >> 
> >> I don't think I completely follow your point. :) First of all, I assume that if
> >> a host sends a Redfish request, then confirmation of the request's execution
> >> will be expected. That means the host needs to receive some packet informing it
> >> that the request executed successfully or failed. It means that some subsystem
> >> or application requested this change, and only after receiving the confirmation
> >> can the requested capabilities be used. And if the FM is on the CXL switch side,
> >> then how will the FM show up the changes? It sounds to me like some FM subsystem
> >> should be on the host side to receive the confirmation/notification and to
> >> execute the real changes in the PCIe hierarchy. Am I missing something here?
> > 
> > Another terminology issue, I think. The FM, from the CXL side of things, is an
> > abstract thing (potentially highly layered / distributed) that acts on
> > instructions from an orchestrator (also potentially highly distributed; one
> > implementation is that hosts can be the orchestrator) and configures the fabric.
> > The downstream APIs to the switches and EPs are all in the FM-API (CXL spec).
> > Upstream is probably all Redfish. What happens in between is impdef (though
> > obviously mapping to Redfish or the FM-API as applicable may make it more
> > reusable and flexible).
> > 
> > I think some diagrams of what is where will help.
> > I think we need the following (note I've always kept the controller hosts as
> > normal hosts as well, as that includes the case where one never uses the fabric
> > - so BMC type cases as a subset, without needing to double the number of
> > diagrams):
> > 
> > 1) Diagram of a single host with the FM as one 'thing' on that host - direct
> >    interfaces to a single switch - interface options include the switch CCI
> >    mailbox, MCTP over PCI VDM, and MCTP over, say, I2C.
> > 
> > 2) Diagram of the same as above, with a multi-head device all connected to one
> >    host.
> > 
> > 3) Diagram of 1 (maybe with MHDs below the switches), but now with multiple
> >    hosts, one of which is responsible for fabric management. FM in that manager
> >    host (and orchestrator) - agents on the other hosts able to send requests for
> >    services to that host.
> > 
> > 4) Diagram of 3, but now with multiple switches, each with a separate
> >    controlling host. Some other hosts that don't have any fabric control.
> >    Distributed FM across the controlling hosts.
> > 
> > 5) Diagram of 4, but with a layered FM and a separate orchestrator. The hosts
> >    all talk to the orchestrator, which then talks to the FM.
> > 
> > 6) 4, but push some management entities down into the switches (from an
> >    architecture point of view this is no different from the layered case with a
> >    separate BMC per switch - there is still either a distributed FM or a layered
> >    FM, which the orchestrator talks to).
> > 
> > We can mess with exactly how the distribution of who does what falls across the
> > various layers.
> > 
> > I can sketch this lot up (and that will probably make some gaps in these cases
> > apparent), but it will take a little while, hence the text descriptions in the
> > meantime.
> > 
> > I come back to my personal view though - which is: don't worry too much at this
> > early stage, beyond making sure we have some layering in the code so that we can
> > distribute it across a distributed or layered architecture later!
> 
> I had a slightly more simplified image in my mind. :) We definitely need to have
> diagrams to clarify the vision. But which collaboration tool could we use to work
> publicly on diagrams? Any suggestions?
Ascii art :) To have a broad discussion it needs to be on the mailing list, and that
is effectively the only option.

> 
> Thanks,
> Slava.
> 
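[Editorial sketch] The userspace PoC Jonathan suggests above - all the FM logic in
Linux userspace, with layering preserved so it can later be split across machines -
could be modelled roughly as below. This is a hypothetical Python sketch only: the
class names, the `bind` operation, and the vPPB-to-host mapping are illustrative
stand-ins for the layered topology of case 5, not the real CXL FM-API commands or
Redfish schemas.

```python
# Toy model of a layered Fabric Manager (case 5 above): an orchestrator
# talks only to the top-level FM, which fans requests out to per-switch
# sub-FMs (e.g. each running on that switch's BMC). All names are
# hypothetical; this models message routing/layering, nothing more.

from dataclasses import dataclass, field


@dataclass
class SubFM:
    """Manages a single switch; stands in for a per-switch BMC agent."""
    switch_id: str
    bindings: dict = field(default_factory=dict)  # vPPB number -> host name

    def bind(self, vppb: int, host: str) -> str:
        # Stand-in for an FM-API-style "bind vPPB to host" operation.
        self.bindings[vppb] = host
        return f"{self.switch_id}: vPPB {vppb} bound to {host}"


@dataclass
class FabricManager:
    """Top-level FM: owns the whole-fabric view, delegates per switch."""
    sub_fms: dict = field(default_factory=dict)  # switch_id -> SubFM

    def add_switch(self, switch_id: str) -> None:
        self.sub_fms[switch_id] = SubFM(switch_id)

    def handle_request(self, switch_id: str, vppb: int, host: str) -> str:
        # The orchestrator never talks to a sub-FM directly - the layering
        # boundary that makes a later distributed split possible.
        return self.sub_fms[switch_id].bind(vppb, host)


if __name__ == "__main__":
    fm = FabricManager()
    fm.add_switch("switch0")
    fm.add_switch("switch1")
    print(fm.handle_request("switch0", 2, "hostA"))
```

Replacing the in-process method calls with Redfish upstream and FM-API downstream
transports would not change this structure, which is the point of getting the
layering right from the outset.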