Date: Sun, 18 Aug 2024 21:12:34 -0500
From: John Groves
To: Jonathan Cameron
Cc: David Hildenbrand, linux-mm@kvack.org, linux-cxl@vger.kernel.org,
    Davidlohr Bueso, Ira Weiny, virtualization@lists.linux.dev,
    Oscar Salvador, qemu-devel@nongnu.org, Dave Jiang, Dan Williams,
    linuxarm@huawei.com, wangkefeng.wang@huawei.com, John Groves,
    Fan Ni, Navneet Singh, "Michael S. Tsirkin", Igor Mammedov,
    Philippe Mathieu-Daudé
Subject: Re: [RFC] Virtualizing tagged disaggregated memory capacity (app specific, multi host shared)
References: <20240815172223.00001ca7@Huawei.com>
In-Reply-To: <20240815172223.00001ca7@Huawei.com>
On 24/08/15 05:22PM, Jonathan Cameron wrote:
> Introduction
> ============
>
> If we think application specific memory (including inter-host shared
> memory) is a thing, it will also be a thing people want to use with
> virtual machines, potentially nested. So how do we present it at the
> Host to VM boundary?
>
> This RFC is perhaps premature given we haven't yet merged upstream
> support for the bare metal case. However I'd like to get the
> discussion going given we've touched briefly on this in a number of
> CXL sync calls and it is clear no one is

Excellent write-up, thanks Jonathan. Hannes' suggestion of an in-person
discussion at LPC is a great idea - count me in. As the proprietor of
famfs [1] I have many thoughts.

First, I like the concept of application-specific memory (ASM), but I
wonder if there might be a better term for it. ASM suggests that there
is one application, but I'd suggest that a more concise statement of
the concept is that the Linux kernel never accesses or mutates the
memory - even though multiple apps might share it (e.g. via famfs).
It's a subtle point, but an important one for RAS etc. ASM might
better be called non-kernel-managed memory - though that name does not
have as good a ring to it. Will mull this over further...

Now a few level-setting comments on CXL and Dynamic Capacity Devices
(DCDs), some of which will be obvious to many of you:

* A DCD is just a memory device with an allocator and host-level
  access control built in.

* Usable memory from a DCD is not available until the fabric manager
  (likely on behalf of an orchestrator) performs an Initiate Dynamic
  Capacity Add command to the DCD.

* A DCD allocation has a tag (uuid), which is the invariant way of
  identifying the memory from that allocation.

* The tag becomes known to the host from the DCD extents provided via
  a CXL event following a successful allocation.

* The memory associated with a tagged allocation will surface as a dax
  device on each host that has access to it. But of course dax device
  naming & numbering won't be consistent across separate hosts - so we
  need to use the uuids to find specific memory.

A few less foundational observations:

* It does not make sense to "online" shared or sharable memory as
  system-ram, because system-ram gets zeroed, which blows up use cases
  for sharable memory. So the default for sharable memory must be
  devdax mode.

* Tags are mandatory for sharable allocations, and allowed but
  optional for non-sharable allocations. The implication is that
  non-sharable allocations may get onlined automatically as
  system-ram, so we don't need a namespace for those. (I argued for
  mandatory tags on all allocations - hey, you don't have to use them -
  but encountered objections and dropped it.)

* CXL access control only goes to host root ports; CXL has no concept
  of giving access to a VM. So some component on a host (perhaps
  logically an orchestrator component) needs to plumb memory to VMs as
  appropriate.

So tags are a namespace for finding specific memory "allocations"
(which in the CXL consortium we usually refer to as "tagged
capacity"). In an orchestrated environment, the orchestrator would
allocate resources (including tagged memory capacity), make that
capacity visible on the right host(s), and then provide the tag when
starting the app if needed. If (e.g.) the memory contains a famfs file
system, famfs needs the uuid of the root memory allocation to find the
right memory device. Once mounted, it's a file system, so apps can be
directed to the mount path. Apps that consume the dax devices directly
also need the uuid, because /dev/dax0.0 is not invariant across a
cluster...

I have been assuming that when the CXL stack discovers a new DCD
allocation, it will configure the devdax device and provide some way
to find it by tag - /sys/cxl/<tag>/dev or whatever. That works as far
as it goes, but I'm coming around to thinking that the uuid-to-dax map
should not be overtly CXL-specific.
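To make that concrete, here is the userspace-eye view of the lookup I
have in mind. This is a sketch only: the /sys/cxl/<tag>/dev attribute
is the hypothetical "or whatever" above and the uuid is made up; the
only real plumbing is the standard /sys/dev/char/<major>:<minor>
symlink.

    # Hypothetical: resolve a tagged-capacity uuid to its devdax node.
    # Assumes a (not yet existing) /sys/cxl/<tag>/dev attribute that
    # holds "major:minor", like other sysfs dev attributes do.
    TAG=3fa85f64-5717-4562-b3fc-2c963f66afa6     # example uuid
    DEVNO=$(cat /sys/cxl/$TAG/dev)               # e.g. "252:0"
    DAXDEV=$(basename "$(readlink /sys/dev/char/$DEVNO)")
    echo "/dev/$DAXDEV"                          # e.g. /dev/dax0.0

A non-CXL-specific variant would just root the tag namespace somewhere
neutral (under /sys/bus/dax/, say) so that any tagged-capacity
provider could register there.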
General thoughts regarding VMs and qemu:

Physical connections to CXL memory are handled by physical servers. I
don't think there is a scenario in which a VM should interact directly
with the pcie function(s) of CXL devices. They will be configured as
dax devices (findable by their tags!) by the host OS, and should be
provided to VMs (when appropriate) as DAX devices. And software in a
VM needs to be able to find the right DAX device the same way it would
running on bare metal - by the tag.

Qemu can already get memory from files (-object
memory-backend-file,...), and I believe this works whether it's an
actual file or a devdax device. So far, so good. Qemu can back a
virtual pmem device by one of these, but currently (AFAIK) not a
virtual devdax device. I think virtual devdax is needed as a
first-class abstraction. If we can add the tag as a property of the
memory-backend-file, we're almost there - we just need a way to look
up a daxdev by tag. (See the P.S. below for how those pieces line up.)

Summary thoughts:

* A mechanism for resolving tags to "tagged capacity" devdax devices
  is essential (and I don't think there are specific proposals about
  this mechanism so far).

* Said mechanism should not be explicitly CXL-specific.

* Finding a tagged-capacity devdax device in a VM should work the same
  as it does running on bare metal.

* The file-backed (and devdax-backed) devdax abstraction is needed in
  qemu.

* Beyond that, I'm not yet sure what the lookup mechanism should be.
  Extra points for being easy to implement in both physical and
  virtual systems.

Thanks for teeing this up!
John

[1] https://github.com/cxl-micron-reskit/famfs/blob/master/README.md
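P.S. For concreteness on the qemu side, here is roughly how the pieces
line up today via the virtual pmem path mentioned above. The first
invocation is (to my knowledge) current qemu syntax; the tag= property
and the devdax front end in the second are hypothetical - they are the
pieces I'm suggesting we add.

    # Works today: back a virtual pmem (nvdimm) device with a host
    # devdax device via memory-backend-file.
    qemu-system-x86_64 -machine pc,nvdimm=on -m 4G,slots=2,maxmem=36G \
      -object memory-backend-file,id=mem1,share=on,mem-path=/dev/dax0.0,size=16G,align=2M \
      -device nvdimm,id=nvdimm1,memdev=mem1

    # Hypothetical: same backend carrying a tag the guest can resolve,
    # fronting a first-class virtual devdax device instead of pmem.
    #   -object memory-backend-file,id=mem1,share=on,mem-path=/dev/dax0.0,size=16G,align=2M,tag=<uuid>
    #   -device dax,id=dax1,memdev=mem1   <-- does not exist yet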