Subject: Re: [External] [LSF/MM/BPF TOPIC] CXL Fabric Manager (FM) architecture
From: "Viacheslav A.Dubeyko"
Date: Thu, 9 Feb 2023 14:04:13 -0800
To: Jonathan Cameron
Cc: Adam Manzanares, "lsf-pc@lists.linux-foundation.org",
 "linux-mm@kvack.org", "linux-cxl@vger.kernel.org", Dan Williams,
 Cong Wang, Viacheslav Dubeyko
Message-Id: <89DC75A8-0507-4AA1-B121-4AC398F615BC@bytedance.com>
In-Reply-To: <20230209110502.00001a7a@Huawei.com>
References: <7F001EAF-C512-436A-A9DD-E08730C91214@bytedance.com>
 <20230131174115.00007493@Huawei.com>
 <5671D3B3-83B3-49FF-A662-509648E6D297@bytedance.com>
 <20230202095402.0000585d@Huawei.com>
 <20230208163844.GA407917@bgt-140510-bm01>
 <7E864E85-A36F-487B-8B70-C8C49FBECD73@bytedance.com>
 <20230209110502.00001a7a@Huawei.com>

> On Feb 9, 2023, at 3:05 AM, Jonathan Cameron wrote:
>
> On Wed, 8 Feb 2023 10:03:57 -0800
> "Viacheslav A.Dubeyko" wrote:
>
>>> On Feb 8, 2023, at 8:38 AM, Adam Manzanares wrote:
>>>
>>> On Thu, Feb 02, 2023 at 09:54:02AM +0000, Jonathan Cameron wrote:
>>>> On Wed, 1 Feb 2023 12:04:56 -0800
>>>> "Viacheslav A.Dubeyko" wrote:
>>>>
>>>>>>
>>
>>>>>
>>>>> Most probably, we will have multiple FM implementations in firmware.
>>>>> Yes, FM on host could be important for debug and to verify correctness of
>>>>> firmware-based implementations. But FM daemon on host could be important
>>>>> to receive notifications and react somehow to these events. Also, journalling
>>>>> of events/messages could be an important responsibility of FM daemon
>>>>> on host.
>>>>
>>>> I agree with an FM daemon somewhere (potentially running on the BMC type chip
>>>> that also has the lower level FM-API access). I think it is somewhat
>>>> separate from the rest of this on the basis it may well just be talking Redfish
>>>> to the FM and there are lots of tools for that sort of handling already.
>>>>
>>>
>>> I would be interested in participating in a BOF about this topic. I wonder what
>>> happens when we have multiple switches with multiple FMs each on a separate BMC.
>>> In this case, does it make more sense to have an owner of the global FM state
>>> be a user space application. Is this the job of the orchestrator?
>
> This partly comes down to terminology. Ultimately there is an FM that is
> responsible for the whole fabric (could be distributed software) and that
> in turn will talk to the various BMCs that then talk to the switches.
>
> Depending on the setup it may not be necessary for any entity to see the
> whole fabric.
>
> Interesting point in general though. I think it boils down to getting
> layering in any software correct and that is easier done from outset.
>
> I don't know whether the Redfish stuff is flexible enough to cover this, but
> if it is, I'd envision the actual FM talking Redfish to a bunch of sub-FMs
> and in turn presenting Redfish to the orchestrator.
>
> Any of these components might run on separate machines, or in firmware on
> some device, or indeed all run on one server that is acting as the FM and
> a node in the orchestrator layer.
>
>>>
>>> The BMC based FM seems to have scalability issues, but will we hit them in
>>> practice any time soon.
>
> Who knows ;) If anyone builds the large scale fabric stuff in CXL 3.0 then
> we definitely will in the medium term.
>
>>
>> I had a discussion recently and it looks like there are interesting points:
>> (1) If we have multiple CXL switches (especially with complex hierarchy), then it is
>> a very compute-intensive activity. So, potentially, FM on firmware side might not be
>> able to digest and execute all responsibilities without potential performance
>> degradation.
>
> There is firmware and there is firmware ;) It's not uncommon for BMCs to be
> significant devices in their own right and run Linux or other heavy weight OSes.
>
>> (2) However, if we have FM on host side, then there are security concerns because
>> FM sees everything and all details of multiple hosts and subsystems.
>
> Agreed. Other than testing I wouldn't expect the FM to run on a 'host', but in
> at least some implementations it will be running on a capable Linux machine.
> In large fabrics that may be very capable indeed (basically a server dedicated to
> this role).
>
>> (3) Technically speaking, there is one potential capability: a user-space FM daemon
>> can run on host side as well as on CXL switch side. I mean here that if we implement
>> user-space FM daemon, then it could be used to execute FM functionality on CXL
>> switch side (maybe????). :)
>
> Sure, anything could run anywhere.
> We should draw up some 'reference' architectures
> though to guide discussion down the line. Mind you I think there are a lot of
> steps along the way and starting point should be a simple PoC where all the FM
> stuff is in Linux userspace (other than comms). That's easy enough to do.
> If I get a quiet week or so I'll hammer out what we need on emulation side to
> start playing with this.
>
> Jonathan
>
>>
>>>>>>> - Manage surprise removal of devices
>>>>>>
>>>>>> Likewise, beyond reporting I wouldn't expect the FM daemon to have any idea
>>>>>> what to do in the way of managing this. Scream loudly?
>>>>>>
>>>>>
>>>>> Maybe, it could require application(s) notification. Let’s imagine that application
>>>>> uses some resources from removed device. Maybe, FM can manage kernel-space
>>>>> metadata correction and help to manage application requests to non-existent
>>>>> entities.
>>>>
>>>> Notifications for the host are likely to come via inband means - so type3 driver
>>>> handling rather than related to FM. As far as the host is concerned this is the
>>>> same as case where there is no FM and someone ripped a device out.
>>>>
>>>> There might indeed be meta data to manage, but doubt it will have anything to
>>>> do with kernel.
>>>>
>>>
>>> I've also had similar thoughts, I think the OS responds to notifications that
>>> are generated in-band after changes to the state of the FM are made through
>>> OOB means.
>>>
>>> I envision the host sends Redfish requests to a switch BMC that has an FM
>>> implementation. Once the changes are implemented by the FM it would show up
>>> as changes to the PCIe hierarchy on a host, which is capable of responding to
>>> such changes.
>>>
>>
>> I think I do not completely follow your point. :) First of all, I assume that if host
>> sends a Redfish request, then confirmation of the request execution will be expected.
>> It means for me that host needs to receive some packet that informs that request
>> executed successfully or failed. It means that some subsystem or application requested
>> this change and only after receiving the confirmation requested capabilities can be used.
>> And if FM is on CXL switch side, then how will FM show up the changes? It sounds to me
>> that some FM subsystem should be on the host side to receive confirmation/notification
>> and to execute the real changes in PCIe hierarchy. Am I missing something here?
>
> Another terminology issue I think. FM from CXL side of things is an abstract thing
> (potentially highly layered / distributed) that acts on instructions from an
> orchestrator (also potentially highly distributed, one implementation is hosts
> can be the orchestrator) and configures the fabric.
> The downstream APIs to the switches and EPs are all in FM-API (CXL spec).
> Upstream probably all Redfish. What happens in between is impdef (though
> obviously mapping to Redfish or FM-API as applicable may make it more
> reusable and flexible).
>
> I think some diagrams of what is where will help.
> I think we need (note I've always kept the controller hosts as normal hosts as well
> as that includes the case where it never uses the Fabric - so BMC type cases as
> a subset without needing to double the number of diagrams).
>
> 1) Diagram of single host with the FM as one 'thing' on that host - direct interfaces
>    to a single switch - interface options include switch CCI MB, MCTP over PCI VDM,
>    MCTP over say I2C.
>
> 2) Diagram of same as above, with a multiple head device all connected to one host.
>
> 3) Diagram of 1 (maybe with MHD below switches), but now with multiple hosts,
>    one of which is responsible for fabric management (FM in that manager host
>    and orchestrator) - agents on other hosts able to send requests for services to that host.
>
> 4) Diagram of 3, but now with multiple switches, each with separate controlling host.
>    Some other hosts that don't have any fabric control.
>    Distributed FM across the controlling hosts.
>
> 5) Diagram of 4 but with layered FM and separate Orchestrator. Hosts all talk to the
>    orchestrator, that then talks to the FM.
>
> 6) 4, but push some management entities down into switches (from architecture point of
>    view this is no different from layered case with a separate BMC per switch - there
>    is still either a distributed FM or a layered FM, which the orchestrator talks to.)
>
> Can mess with the exact distribution of who does what across the various layers.
>
> I can sketch this lot up (and that will probably make some gaps in these cases apparent)
> but will take a little while, hence text descriptions in the meantime.
>
> I come back to my personal view though - which is don't worry too much at this early
> stage, beyond making sure we have some layering in code so that we can distribute
> it across a distributed or layered architecture later!
>

I had a slightly more simplified image in my mind. :) We definitely need to have diagrams
to clarify the vision. But which collaboration tool could we use to work publicly on
diagrams? Any suggestion?

Thanks,
Slava.
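
To make the "layering in code" point above a bit more concrete, here is a minimal Python sketch of one possible shape for it. It is only an illustration under stated assumptions: every class, field and method name (FabricManager, SwitchFM, LayeredFM, BindRequest, bind, switches) is hypothetical, and nothing here corresponds to actual FM-API commands or Redfish schemas. The only idea taken from the thread is that a leaf FM and a layered FM could present the same upward-facing interface, so the same code can later be stacked or distributed.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class BindRequest:
    """Hypothetical upstream request: bind a device to a host port on a switch."""
    switch_id: str
    host_port: int
    device_id: str


class FabricManager(ABC):
    """Hypothetical upward-facing interface shared by every FM layer."""

    @abstractmethod
    def switches(self) -> set[str]:
        """Return the IDs of all switches reachable through this FM."""

    @abstractmethod
    def bind(self, req: BindRequest) -> bool:
        """Apply a bind request; return True on success, False otherwise."""


@dataclass
class SwitchFM(FabricManager):
    """Leaf FM owning one switch. A real one would talk to the switch here."""
    switch_id: str
    bindings: dict[int, str] = field(default_factory=dict)

    def switches(self) -> set[str]:
        return {self.switch_id}

    def bind(self, req: BindRequest) -> bool:
        # Reject requests for another switch or for an already bound port.
        if req.switch_id != self.switch_id or req.host_port in self.bindings:
            return False
        # Stand-in for the downstream call to the switch (assumption).
        self.bindings[req.host_port] = req.device_id
        return True


@dataclass
class LayeredFM(FabricManager):
    """Upper-layer FM: routes requests to sub-FMs and looks like an FM itself."""
    children: list[FabricManager]

    def switches(self) -> set[str]:
        return set().union(*(child.switches() for child in self.children))

    def bind(self, req: BindRequest) -> bool:
        # Forward to the first sub-FM that can reach the target switch.
        for child in self.children:
            if req.switch_id in child.switches():
                return child.bind(req)
        return False
```

Because LayeredFM is itself a FabricManager, it can hold other LayeredFM instances as children, so an orchestrator written against the interface would not need to know how many layers sit below it or where they run.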