From: "Viacheslav A.Dubeyko" <viacheslav.dubeyko@bytedance.com>
To: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Adam Manzanares <a.manzanares@samsung.com>,
"lsf-pc@lists.linux-foundation.org"
<lsf-pc@lists.linux-foundation.org>,
"linux-mm@kvack.org" <linux-mm@kvack.org>,
"linux-cxl@vger.kernel.org" <linux-cxl@vger.kernel.org>,
Dan Williams <dan.j.williams@intel.com>,
Cong Wang <cong.wang@bytedance.com>,
Viacheslav Dubeyko <slava@dubeyko.com>
Subject: Re: [External] [LSF/MM/BPF TOPIC] CXL Fabric Manager (FM) architecture
Date: Fri, 17 Feb 2023 10:31:15 -0800 [thread overview]
Message-ID: <2EA73B59-7E5B-4FF6-9830-6C4C24FDDB6C@bytedance.com> (raw)
In-Reply-To: <20230210123257.000029a9@Huawei.com>
> On Feb 10, 2023, at 4:32 AM, Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
>
> On Thu, 9 Feb 2023 14:04:13 -0800
> "Viacheslav A.Dubeyko" <viacheslav.dubeyko@bytedance.com> wrote:
>
>>> On Feb 9, 2023, at 3:05 AM, Jonathan Cameron <Jonathan.Cameron@Huawei.com> wrote:
>>>
>>> On Wed, 8 Feb 2023 10:03:57 -0800
>>> "Viacheslav A.Dubeyko" <viacheslav.dubeyko@bytedance.com> wrote:
>>>
>>>>> On Feb 8, 2023, at 8:38 AM, Adam Manzanares <a.manzanares@samsung.com> wrote:
>>>>>
>>>>> On Thu, Feb 02, 2023 at 09:54:02AM +0000, Jonathan Cameron wrote:
>>>>>> On Wed, 1 Feb 2023 12:04:56 -0800
>>>>>> "Viacheslav A.Dubeyko" <viacheslav.dubeyko@bytedance.com> wrote:
>>>>>>
>>>>>>>>
>>>>
>>>> <skipped>
>>>>
>>>>>>>
>>>>>>> Most probably, we will have multiple FM implementations in firmware.
>>>>>>> Yes, an FM on the host could be important for debugging and for verifying
>>>>>>> the correctness of firmware-based implementations. But an FM daemon on the
>>>>>>> host could also be important for receiving notifications and reacting to
>>>>>>> these events. Journalling of events/messages could be another important
>>>>>>> responsibility of an FM daemon on the host.
>>>>>>
>>>>>> I agree with an FM daemon somewhere (potentially running on the BMC type chip
>>>>>> that also has the lower level FM-API access). I think it is somewhat
>>>>>> separate from the rest of this on the basis that it may well just be talking
>>>>>> Redfish to the FM, and there are lots of tools for that sort of handling already.
>>>>>>
>>>>>
>>>>> I would be interested in participating in a BOF about this topic. I wonder what
>>>>> happens when we have multiple switches with multiple FMs, each on a separate BMC.
>>>>> In this case, does it make more sense to have the owner of the global FM state
>>>>> be a user space application? Is this the job of the orchestrator?
>>>
>>> This partly comes down to terminology. Ultimately there is an FM that is
>>> responsible for the whole fabric (could be distributed software) and that
>>> in turn will talk to the various BMCs that then talk to the switches.
>>>
>>> Depending on the setup it may not be necessary for any entity to see the
>>> whole fabric.
>>>
>>> Interesting point in general though. I think it boils down to getting the
>>> layering in any software correct, and that is easier done from the outset.
>>>
>>> I don't know whether the redfish stuff is flexible enough to cover this, but
>>> if it is, I'd envision, the actual FM talking redfish to a bunch of sub-FMs
>>> and in turn presenting redfish to the orchestrator.
>>>
>>> Any of these components might run on separate machines, or in firmware on
>>> some device, or indeed all run on one server that is acting as the FM and
>>> a node in the orchestrator layer.
>>>
>>>>>
>>>>> The BMC based FM seems to have scalability issues, but will we hit them in
>>>>> practice any time soon?
>>>
>>> Who knows ;) If anyone builds the large scale fabric stuff in CXL 3.0 then
>>> we definitely will in the medium term.
>>>
>>>>
>>>> I had a discussion recently, and a few interesting points came up:
>>>> (1) If we have multiple CXL switches (especially with a complex hierarchy), then
>>>> fabric management is a very compute-intensive activity. So, potentially, an FM on
>>>> the firmware side might not be capable of digesting and executing all its
>>>> responsibilities without performance degradation.
>>>
>>> There is firmware and there is firmware ;) It's not uncommon for BMCs to be
>>> significant devices in their own right and run Linux or other heavyweight OSes.
>>>
>>>> (2) However, if we have the FM on the host side, then there are security concerns,
>>>> because the FM sees everything and all the details of multiple hosts and subsystems.
>>>
>>> Agreed. Other than for testing I wouldn't expect the FM to run on a 'host', but in
>>> at least some implementations it will be running on a capable Linux machine.
>>> In large fabrics that may be very capable indeed (basically a server dedicated to
>>> this role).
>>>
>>>> (3) Technically speaking, there is one potential capability: a user-space FM daemon
>>>> could run both on the host side and on the CXL switch side. I mean that if we
>>>> implement a user-space FM daemon, then it could also be used to execute FM
>>>> functionality on the CXL switch side (maybe?). :)
>>>
>>> Sure, anything could run anywhere. We should draw up some 'reference' architectures
>>> though to guide discussion down the line. Mind you, I think there are a lot of
>>> steps along the way, and the starting point should be a simple PoC where all the FM
>>> stuff is in Linux userspace (other than comms). That's easy enough to do.
>>> If I get a quiet week or so I'll hammer out what we need on the emulation side to
>>> start playing with this.
>>>
>>> Jonathan
>>>
>>>
>>>
>>>>
>>>> <skipped>
>>>>
>>>>>>>>> - Manage surprise removal of devices
>>>>>>>>
>>>>>>>> Likewise, beyond reporting I wouldn't expect the FM daemon to have any idea
>>>>>>>> what to do in the way of managing this. Scream loudly?
>>>>>>>>
>>>>>>>
>>>>>>> Maybe it would require notifying the application(s). Let's imagine that an
>>>>>>> application uses some resources from the removed device. Maybe the FM can manage
>>>>>>> kernel-space metadata correction and help to handle application requests to
>>>>>>> no-longer-existing entities.
>>>>>>
>>>>>> Notifications for the host are likely to come via inband means - so type3 driver
>>>>>> handling rather than related to FM. As far as the host is concerned this is the
>>>>>> same as case where there is no FM and someone ripped a device out.
>>>>>>
>>>>>> There might indeed be metadata to manage, but I doubt it will have anything to
>>>>>> do with the kernel.
>>>>>>
>>>>>
>>>>> I've also had similar thoughts; I think the OS responds to notifications that
>>>>> are generated in-band after changes to the state of the FM are made through
>>>>> OOB means.
>>>>>
>>>>> I envision the host sends REDFISH requests to a switch BMC that has an FM
>>>>> implementation. Once the changes are implemented by the FM it would show up
>>>>> as changes to the PCIe hierarchy on a host, which is capable of responding to
>>>>> such changes.
>>>>>
>>>>
>>>> I don't think I completely follow your point. :) First of all, I assume that if the
>>>> host sends a Redfish request, then a confirmation of the request's execution will be
>>>> expected. To me that means the host needs to receive some packet informing it that
>>>> the request executed successfully or failed. It means that some subsystem or
>>>> application requested this change, and the requested capabilities can be used only
>>>> after the confirmation is received. And if the FM is on the CXL switch side, then how
>>>> will the FM show up the changes? It sounds to me like some FM subsystem should be on
>>>> the host side to receive the confirmation/notification and to execute the real
>>>> changes in the PCIe hierarchy. Am I missing something here?
>>>
>>> Another terminology issue I think. FM from the CXL side of things is an abstract thing
>>> (potentially highly layered / distributed) that acts on instructions from an
>>> orchestrator (also potentially highly distributed; in one implementation the hosts
>>> can be the orchestrator) and configures the fabric.
>>> The downstream APIs to the switches and EPs are all FM-API (CXL spec);
>>> upstream is probably all Redfish. What happens in between is implementation-defined
>>> (though obviously mapping to Redfish or FM-API as applicable may make it more
>>> reusable and flexible).
>>>
>>> I think some diagrams of what is where will help.
>>> I think we need the following (note I've always kept the controller hosts as normal
>>> hosts as well, as that includes the case where the controller never uses the fabric -
>>> so BMC-type cases are a subset without needing to double the number of diagrams):
>>>
>>> 1) Diagram of a single host with the FM as one 'thing' on that host - direct interfaces
>>> to a single switch - interface options include the switch CCI mailbox, MCTP over
>>> PCIe VDM, or MCTP over, say, I2C.
>>>
>>> 2) Diagram of the same as above, with a multi-head device, all connected to one host.
>>>
>>> 3) Diagram of 1 (maybe with an MHD below the switches), but now with multiple hosts,
>>> one of which is responsible for fabric management (FM and orchestrator in that
>>> manager host) - agents on other hosts able to send requests for services to that host.
>>>
>>> 4) Diagram of 3, but now with multiple switches, each with separate controlling host.
>>> Some other hosts that don't have any fabric control.
>>> Distributed FM across the controlling hosts.
>>>
>>> 5) Diagram of 4 but with layered FM and separate Orchestrator. Hosts all talk to the
>>> orchestrator, that then talks to the FM.
>>>
>>> 6) 4, but push some management entities down into the switches (from an architecture
>>> point of view this is no different from the layered case with a separate BMC per
>>> switch - there is still either a distributed FM or a layered FM, which the
>>> orchestrator talks to.)
>>>
>>> We can mess with the exact distribution of who does what across the various layers.
>>>
>>> I can sketch this lot up (and that will probably make some gaps in these cases apparent)
>>> but it will take a little while, hence text descriptions in the meantime.
>>>
>>> I come back to my personal view though - which is don't worry too much at this early
>>> stage, beyond making sure we have some layering in code so that we can distribute
>>> it across a distributed or layered architecture later!
>>>
>>
>> I had a slightly more simplified picture in my mind. :) We definitely need diagrams
>> to clarify the vision. But which collaboration tool could we use to work publicly on
>> diagrams? Any suggestions?
>
> ASCII art :) To have a broad discussion it needs to be on the mailing list, and that
> is effectively the only option.
>
I tried to prepare a diagram in ASCII art. :) It looks pretty terrible in email:
----------------------------              ------------------
|  ---------      ------   |              |                |
|  | Agent | <--> | FM |   |              |                |
|  ---------      ------   | <----------> |   CXL switch   |
|           Host           |              |                |
----------------------------              ------------------
I think we need to use some online resource anyway. Adam and I are discussing what we
can do here.
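As a thought experiment, the Agent/FM split on the host side could be sketched in userspace roughly as follows. This is purely illustrative (no FM-API or Redfish transport here); every class, method, and path name is my invention, not from the CXL spec or any existing code. It only shows the two responsibilities mentioned earlier: journalling events first, then notifying interested agents.

```python
import json
import time
from collections import deque

class FMDaemon:
    """Toy host-side FM daemon sketch: journals fabric events, then
    fans them out to registered host agents.  A real FM would receive
    these events over FM-API/Redfish rather than a direct call."""

    def __init__(self, journal_path):
        self.journal_path = journal_path
        self.agents = []          # callbacks registered by host agents
        self.backlog = deque()    # events kept in memory for later queries

    def register_agent(self, callback):
        self.agents.append(callback)

    def handle_switch_event(self, event):
        # 1. Journal the raw event first, so nothing is lost on a crash.
        record = {"ts": time.time(), "event": event}
        with open(self.journal_path, "a") as f:
            f.write(json.dumps(record) + "\n")
        self.backlog.append(record)
        # 2. Then notify every registered agent about the event.
        for notify in self.agents:
            notify(event)

# Usage sketch: an agent reacting to a surprise removal (hypothetical
# event shape and device name).
fm = FMDaemon("/tmp/fm-journal.log")
seen = []
fm.register_agent(lambda ev: seen.append(ev["type"]))
fm.handle_switch_event({"type": "surprise-removal", "device": "mem0"})
```

The journal-before-notify ordering is the one design point worth keeping from this sketch: the daemon can replay the journal after a restart even if an agent callback failed.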
You introduced the Orchestrator entity. I realized that I don't completely follow the
responsibility of this subsystem. Do you mean some central point of management for
multiple FM instances? Something like a router that has a knowledge base and can
redirect a request to the proper FM instance? Am I correct? It sounds to me like the
orchestrator needs to implement some sub-API of the FM. Or maybe it only needs to
parse Redfish packets, for example, and redirect them.
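If the orchestrator really were just such a router, it could be as small as the sketch below. Again this is hypothetical: the routing key, request shape, and endpoint names are made up, and the payload is treated as opaque rather than parsed as real Redfish.

```python
class Orchestrator:
    """Toy orchestrator-as-router sketch: forwards a request to the FM
    instance that owns the target switch, without implementing any FM
    sub-API itself."""

    def __init__(self):
        self.routes = {}  # knowledge base: switch id -> FM handler

    def add_fm(self, switch_id, fm_handler):
        self.routes[switch_id] = fm_handler

    def dispatch(self, request):
        # Look only at the routing key; leave the rest of the request opaque.
        fm = self.routes.get(request.get("switch"))
        if fm is None:
            return {"status": "error", "reason": "unknown switch"}
        return fm(request)

# Usage sketch with two fake FM instances behind one orchestrator.
orch = Orchestrator()
orch.add_fm("switch0", lambda req: {"status": "ok", "fm": "fm0"})
orch.add_fm("switch1", lambda req: {"status": "ok", "fm": "fm1"})
reply = orch.dispatch({"switch": "switch1", "op": "bind"})
```

The sketch makes the trade-off in the question concrete: a pure router only needs the routing key, whereas implementing a sub-API of the FM would mean understanding (and validating) the payload as well.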
Thanks,
Slava.
Thread overview: 13+ messages
2023-01-30 19:11 Viacheslav A.Dubeyko
2023-01-31 17:41 ` Jonathan Cameron
2023-02-01 20:04 ` [External] " Viacheslav A.Dubeyko
2023-02-02 9:54 ` Jonathan Cameron
2023-02-08 16:38 ` Adam Manzanares
2023-02-08 18:03 ` Viacheslav A.Dubeyko
2023-02-09 11:05 ` Jonathan Cameron
2023-02-09 22:04 ` Viacheslav A.Dubeyko
2023-02-10 12:32 ` Jonathan Cameron
2023-02-17 18:31 ` Viacheslav A.Dubeyko [this message]
2023-02-20 11:59 ` Jonathan Cameron
2023-02-09 22:10 ` Adam Manzanares
2023-02-09 22:22 ` Viacheslav A.Dubeyko