Subject: Re: [External] [LSF/MM/BPF TOPIC] CXL Fabric Manager (FM) architecture
From: "Viacheslav A.Dubeyko"
Date: Wed, 1 Feb 2023 12:04:56 -0800
To: Jonathan Cameron
Cc: lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, linux-cxl@vger.kernel.org,
    Dan Williams, Adam Manzanares, Cong Wang, Viacheslav Dubeyko
Message-Id: <5671D3B3-83B3-49FF-A662-509648E6D297@bytedance.com>
In-Reply-To: <20230131174115.00007493@Huawei.com>
References: <7F001EAF-C512-436A-A9DD-E08730C91214@bytedance.com> <20230131174115.00007493@Huawei.com>

Hi Jonathan,

> On Jan 31, 2023, at 9:41 AM, Jonathan Cameron wrote:
>
> On Mon, 30 Jan 2023 11:11:23 -0800
> "Viacheslav A.Dubeyko" wrote:
>
>> Hello,
>
> Hi Slava,
>
> I'll throw some opinions at this :)
>
>>
>> I
would like to suggest Fabric Manager (FM) architecture discussion. As far as I can see,
>> FM architecture requires: (1) FM configuration tool, (2) FM daemon, (3) QEMU emulation
>> of CXL hardware features. FM daemon receives requests from configuration tool and
>> executes commands by means of interaction with kernel-space subsystem and CXL switch
>> (that can be emulated by QEMU). So, the key questions for discussion:
>
> Worth describing operating modes to be supported: You kind of cover this later
> but I think pulling it out make it clearer that we want one bit of software to
> do several different things.
>
> 1) FM separate from hosts and talked to by higher level orchestration software
>    but using a Switch CCI or MHD mailbox (over PCI)
>    This one is fairly easy because any security / shooting self in foot problems
>    are an issue for higher level software.
> 2) FM on host. Probably mostly going be relevant for debug but may use
>    the same mailbox as is being used by the existing CXL drivers (for Multi
>    Head Device it might be the end point mailbox, for Multi Logical Device
>    behind a switch it might be the switch mailbox).
> 3) All out of band (MCTP or similar - want some shared code, but no
>    need for anything in kernel as far as I can tell).
>

Most probably, we will have multiple FM implementations in firmware.
Yes, FM on host could be important for debug and for verifying the correctness
of firmware-based implementations. But an FM daemon on the host could also be
important for receiving notifications and reacting to these events. Also, journalling
of events/messages could be an important responsibility of the FM daemon
on the host.

>
>> (1) How to distribute functionality between user-space and kernel-space?
>
> Kernel for transport if mailbox based (switch or MHD).
> Possibly help in kernel with the host to Multiheaded device FM LD tunneling
> and host to switch to Multi Logical Device - Logical Device tunneling
> but that could also be left to userspace.
>

People love to move everything into user-space now. But I believe we could have
kernel-space as well as user-space solutions. I think we need to check which way
would be the more efficient and elegant solution.

> If MCTP use the existing MCTP framework which is underlying transport independent.
> I posted a PoC for how this might work a while ago (hack on top of MCTP-I2C
> and some emulation) In the cover letter of the emulation PoC
>

Sounds interesting. Let me check it. But I believe it would not be the
first task in this implementation. :)

>
> I think everything else belongs in userspace. I believe there are redfish APIs
> etc that would then be used to query and drive the userspace program from an
> orchestrator or similar level software.
>

I need to check the Redfish API. It sounds reasonable to employ some existing
framework.

>> (2) Which functionality kernel-space needs to provide for implementation FM features?
>> Which kernel-space functionality do we need to implement yet?
>
> Very little needed if we just expose the transport via PCI mailboxes.
> There is a possible concern that FM-API commands are frequently
> destructive and currently we don't let userspace poke destructive
> commands. That may just need a specific opt in to say we know we
> can shoot ourselves in the foot.
>

I think this is why we need the kernel. It sounds to me that we have to have
user-space and kernel-space collaboration here.

>> (3) Do we need MCTP (Management Component Transport Protocol) or some other
>> protocol can be used for interaction between configuration tool, FM daemon, and
>> CXL switch?
>
> Yes MCTP is needed.
> I don't think we want the actual management code to be different
> depending on transport / protocol.
However we might layer it so that there
> is an interface program that sits between the management library / program and
> the FM-API transport.
>
> Note I was struggling to find a suitable MCTP interface to emulate - so would
> welcome suggestions on that. I hacked the above PoC using an aspeed i2c
> controller that supported the right magic combination of features needed
> for MCTP over I2C but it doesn't have ACPI support which rather limits
> usage (and I doubt anyone will be keen on adding ACPI support just to
> test CXL related code :) If anyone knows of a suitable MCTP host we
> could use for this that would be great (MCTP over PCI VDM might be nice for
> example)
>

Let us start some command/feature implementation and we will figure it out.
But I assume we need to start with something like CXL device discovery.

>> (4) What architecture FM implementation requires?
>> (5) Does it make sense to use Rust as implementation language?
>
> Take your pick ;) First person to write a lot of code gets to pick the language.
>

Yeah, I see the point. Rust can provide some benefits (the memory safety model,
for example). But it could introduce some issues with collaboration and make
the implementation slower. Everybody develops in C, and switching to Rust
might not be an easy target.

>>
>>
>> FM configuration tool requires such commands:
>
> A command line tool is fine, but like the 'real' FM configuration interface will be via
> a protocol (e.g. redfish).
> There is a WIP for CXL, though not sure on latest status on this (document on there is from
> 2021)
>
> So ultimately I'd expect fm_cli to be a wrapper around libredfish / redfishtoo
> that just makes it a bit easier to poke
> with common commands.
>
> I'm far from an expert of redfish so may have this all wrong.
>

Sounds reasonable to me. Let me check how good it could be for this project.
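
On the layering suggested for question (3) above, here is a rough sketch of how
I read it: the management code talks to one small transport interface, and the
mailbox, MCTP, or emulated backend plugs in underneath. This is only an
illustration in Python. The names FmTransport, LoopbackTransport, get_info and
the opcode value are all hypothetical and are not taken from the CXL
specification or from any existing code.

```python
from abc import ABC, abstractmethod

# Hypothetical opcode for illustration only; NOT a value from the CXL spec.
OPCODE_GET_INFO = 0x0001


class FmTransport(ABC):
    """Anything able to carry one FM-API style request and return the reply."""

    @abstractmethod
    def send(self, opcode: int, payload: bytes) -> bytes:
        ...


class LoopbackTransport(FmTransport):
    """Stand-in for an emulated endpoint. A mailbox or MCTP backend would
    implement the same single method."""

    def send(self, opcode: int, payload: bytes) -> bytes:
        if opcode == OPCODE_GET_INFO:
            return b"emulated-switch"
        raise ValueError(f"unsupported opcode {opcode:#06x}")


def get_info(transport: FmTransport) -> str:
    """Management code: identical whatever transport is plugged in."""
    return transport.send(OPCODE_GET_INFO, b"").decode("ascii")


print(get_info(LoopbackTransport()))
```

With this split, adding another transport would mean adding one more class and
leaving get_info() untouched.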
>>
>> Discover - discover available agents
>> Subcommands:
>>     - fm_cli discover fm - discover FM instances
>
> If we are allowing more than one FM then I'd expect all the
> other commands to be directed at that by some sort of FM specific
> ID. If only one, what does this command do that isn't better
> done with fm get_info
>

Yes, we need to identify every object somehow. And it’s an interesting point.
From my point of view, some human-friendly names could be good.
But a firmware-based FM implementation needs to follow the same rules.
And it sounds to me that the CXL specification should define how a CXL FM
or CXL device identifies itself. Anyway, we need to ask the CXL device and it should
return some ID to us. Probably, it will be a GUID or a similar number.

>
>>     - fm_cli discover cxl_devices - discover CXL devices
>>     - fm_cli discover logical_devices - discover logical devices
>
> Discover switches as well.
>

I assumed that a CXL switch is a subclass of CXL devices. Do you mean that
it is an independent case?

>>
>> FM - manage Fabric Manager
>> Subcommands:
>>     - fm_cli fm get_info - get FM status/info
>>     - fm_cli fm start - start FM instance
>>     - fm_cli fm restart - restart FM instance
>>     - fm_cli fm stop - stop FM instance
>>     - fm_cli fm get_config - get FM configuration
>>     - fm_cli fm set_config - set FM configuration
>
> I'd keep this slim for now. No idea what FM config we might want to
> set so don't bother listing command yet.
>

Yeah, it’s not completely clear yet. But I assume we can consider
configuration options such as: (1) registering to receive event notifications,
(2) logging of events, (3) error handling.

>>     - fm_cli fm get_events - get event records
> Not sure what FM would have in the way of events (as opposed to
> things it is talking to).
>

I think FM can log events. If we consider an FM daemon on the host, then it could
issue messages to the end user in reaction to some events.
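
To make the identification point more concrete, here is a small sketch of an
inventory where every discovered object carries a kind and an ID, with an
optional human-friendly name. It is only an assumption of how this could look:
the class names are hypothetical, the GUID-like ID is my guess rather than
something the specification defines, and switches are treated as their own
kind, which is just one possible answer to the question above.

```python
import uuid
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    FM = "fm"
    SWITCH = "switch"              # assumption: tracked separately from devices
    CXL_DEVICE = "cxl_device"
    LOGICAL_DEVICE = "logical_device"


@dataclass(frozen=True)
class FabricObject:
    kind: Kind
    object_id: uuid.UUID           # assumption: a GUID-like ID reported by the object
    name: str = ""                 # optional human-friendly alias


class Inventory:
    """Keeps what discovery found, so later commands can address it by ID."""

    def __init__(self):
        self._objects = {}

    def add(self, obj: FabricObject) -> None:
        self._objects[obj.object_id] = obj

    def discover(self, kind: Kind) -> list:
        return [o for o in self._objects.values() if o.kind is kind]

    def lookup(self, object_id: uuid.UUID) -> FabricObject:
        return self._objects[object_id]


# Placeholder IDs; a real ID would have to come from the device or FM itself.
inventory = Inventory()
inventory.add(FabricObject(Kind.SWITCH, uuid.UUID(int=1), "sw0"))
inventory.add(FabricObject(Kind.CXL_DEVICE, uuid.UUID(int=2)))
print([o.name for o in inventory.discover(Kind.SWITCH)])
```

Every other command would then take an object_id and resolve it through
lookup() before doing anything.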
>>
>> Switch - manage CXL switch
>> Subcommands:
>>     - fm_cli switch get_info - get CXL switch info/status
>
> These all need an ID field of some type to identify which switch.
>

Yeah, it is exactly what we need for every command. We need to identify
an object for a request.

>>     - fm_cli switch get_config - get switch configuration
>>     - fm_cli switch set_config - set switch configuration
>>
>> DCD (Dynamic Capacity Device) - manage Dynamic Capacity Device
>> Subcommands:
>>     - fm_cli dcd get_info - Get DCD Info (retrieves the number of supported hosts,
>>       total Dynamic Capacity of the device, and supported region configurations)
>>     - fm_cli dcd get_capacity_config - Get Host Dynamic Capacity Region Configuration
>>       (retrieves the Dynamic Capacity configuration for a specified host)
>>     - fm_cli dcd set_capacity_config - Set Dynamic Capacity Region Configuration
>>       (sets the configuration of a DC Region)
>>     - fm_cli dcd get_extent_list - Get DCD Extent Lists (retrieves the Dynamic Capacity
>>       Extent List for a specified host)
>>     - fm_cli dcd add_capacity - Initiate Dynamic Capacity Add (initiates the addition of
>>       Dynamic Capacity to the specified region on a host)
>
> That one is complex ;) Probably needs a whole man page to itself.
>

Currently, it’s only a declaration of the command set. Yeah, the implementation
will be complex. :)

>>     - fm_cli dcd release_capacity - Initiate Dynamic Capacity Release (initiates the release of
>>       Dynamic Capacity from a host)
>>
>> FM daemon receives requests from configuration tool and executes commands by means of
>> interaction with kernel-space subsystems. The responsibility of FM daemon could be:
>>     - Execute configuration tool commands
>>     - Manage hot-add and hot-removal of devices
>
> In what sense? I'd expect it to notify some higher level entity
> (orchestrator or similar) but not sure I see what management the
> FM would do.
>

I assume that if FM manages some metadata, then hot-add or hot-removal could
require some metadata corrections. Also, hot-add and hot-removal can generate
events that FM can receive and process. For example, it is possible
to log event messages into a journal.

>>     - Manage surprise removal of devices
>
> Likewise, beyond reporting I wouldn't expect the FM daemon to have any idea
> what to do in the way of managing this. Scream loudly?
>

Maybe it could require notifying the application(s). Let’s imagine that an
application uses some resources from the removed device. Maybe FM can manage
kernel-space metadata correction and help to handle application requests to
entities that no longer exist.

>>     - Receive and handle event notifications from the CXL switch
>>     - Logging events
>>     - Memory allocation and QoS Telemetry management
>>     - Error/Failure handling
>
> I'm not sure on separation of role between this component and
> higher level policy / admin driven software.
>
> For memory allocation it might take a 'give host A this much
> memory with this characteristic set' command and own the
> allocations across all present devices, or it might just
> act as an interface layer to higher level software that does
> the fine detail of figuring out which device to allocate memory
> from to satisfy such a request.
>
> Whilst I agree having a broad vision for an interface is good
> there are a lot of subtle details in some of these commands
> so I'd not spend too long refining the whole lot. Probably better
> to look at them one at a time and then just have whoever ends
> up maintaining / reviewing this thing responsible for making sure the
> parameter format etc is consistent across commands.
>

Yes, I agree. Let’s do it step by step. I believe we need to start by
implementing an application that processes commands and does nothing at first.
And the first command that needs to be implemented is discovery of CXL devices,
switches, and FM instances, because we need to identify a CXL object somehow
for any other command.

Thanks,
Slava.
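
P.S. A minimal sketch of what "processes commands and does nothing" could mean
as a first step. It is written in Python purely for brevity, not as a vote on
the implementation language. The command groups and subcommands are the ones
listed in this thread; the --id option and the function names are hypothetical,
since object identification is still undecided.

```python
import argparse

# Command groups and subcommands as listed in this thread.
COMMANDS = {
    "discover": ["fm", "cxl_devices", "logical_devices"],
    "fm": ["get_info", "start", "restart", "stop",
           "get_config", "set_config", "get_events"],
    "switch": ["get_info", "get_config", "set_config"],
    "dcd": ["get_info", "get_capacity_config", "set_capacity_config",
            "get_extent_list", "add_capacity", "release_capacity"],
}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="fm_cli")
    groups = parser.add_subparsers(dest="group", required=True)
    for group, subcommands in COMMANDS.items():
        group_parser = groups.add_parser(group)
        group_parser.add_argument("subcommand", choices=subcommands)
        # Hypothetical option: how objects are identified is still open.
        group_parser.add_argument("--id", default=None)
    return parser


def run(argv: list) -> str:
    args = build_parser().parse_args(argv)
    # Deliberately a no-op: only report what would have been executed.
    return f"{args.group} {args.subcommand}: not implemented (id={args.id})"


print(run(["discover", "cxl_devices"]))
print(run(["switch", "get_info", "--id", "sw0"]))
```

Each subcommand can then be filled in one at a time, starting with discovery,
while the parameter format stays in one place.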