Date: Sun, 16 Jul 2023 08:09:02 -0700
From: Catalin Marinas
To: Benjamin Herrenschmidt
Cc: Jason Gunthorpe, Marc Zyngier, ankita@nvidia.com,
 alex.williamson@redhat.com, naoya.horiguchi@nec.com,
 oliver.upton@linux.dev, aniketa@nvidia.com, cjia@nvidia.com,
 kwankhede@nvidia.com, targupta@nvidia.com, vsethi@nvidia.com,
 acurrid@nvidia.com, apopple@nvidia.com, jhubbard@nvidia.com,
 danw@nvidia.com, kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
 linux-arm-kernel@lists.infradead.org, linux-mm@kvack.org,
 Lorenzo Pieralisi, Clint Sbisa, osamaabb@amazon.com
Subject: Re: [PATCH v3 1/6] kvm: determine memory type from VMA
References: <20230405180134.16932-1-ankita@nvidia.com>
 <20230405180134.16932-2-ankita@nvidia.com>
 <86r0spl18x.wl-maz@kernel.org>
 <67a7374a72053107661ecc2b2f36fdb3ff6cc6ae.camel@kernel.crashing.org>
In-Reply-To: <67a7374a72053107661ecc2b2f36fdb3ff6cc6ae.camel@kernel.crashing.org>

Hi Ben,

On
Fri, Jul 14, 2023 at 06:10:39PM +1000, Benjamin Herrenschmidt wrote:
> On Wed, 2023-05-31 at 12:35 +0100, Catalin Marinas wrote:
> > There were several off-list discussions, I'm trying to summarise my
> > understanding here. This series aims to relax the VFIO mapping to
> > cacheable and have KVM map it into the guest with the same attributes.
> > Somewhat related past threads also tried to relax the KVM device
> > pass-through mapping from Device_nGnRnE (pgprot_noncached) to Normal_NC
> > (pgprot_writecombine). Those were initially using the PCIe prefetchable
> > BAR attribute but that's not a reliable means to infer whether Normal vs
> > Device is safe. Anyway, I think we'd need to unify these threads and
> > come up with some common handling that can cater for various attributes
> > required by devices/drivers. Therefore replying in this thread.
>
> So picking up on this as I was just trying to start a separate
> discussion on the subject for write combine :-)

Basically this thread started as a fix/improvement for KVM by mimicking
the VFIO user mapping attributes in the guest, but the conclusion we came
to is that the VFIO PCIe driver cannot reliably tell when WC is
possible.

> In this case, not so much for KVM as much as for VFIO to userspace
> though.
>
> The rough idea is that the "userspace driver" (ie DPDK or equivalent)
> for the device is the one to "know" whether a BAR or portion of a BAR
> can/should be mapped write-combine, and is expected to also "know"
> what to do to enforce ordering when necessary.

I agree in principle. On the KVM side we concluded that it's the guest
driver that knows the attributes, so the hypervisor should not restrict
them. In the DPDK case, it would be the user driver that knows the
device it is mapping and the required attributes. In terms of security
for arm64 at least, Device vs Normal NC (or nc vs wc in Linux
terminology) doesn't make much difference, with the former occasionally
being worse.
The kernel would probably trust the DPDK code if it allows direct device
access.

> So the userspace component needs to be responsible for selecting the
> mapping, the same way using the PCI sysfs resource files today allows
> doing that by selecting the _wc variant.

I guess the sysfs interface is just trying to work around the VFIO
limitations.

> I don't know how much of this makes sense for KVM, but I think what we
> really want is for userspace to be able to specify some "attributes"
> (which we can initially limit to writecombine, full cacheability
> probably requires a device specific kernel driver providing adequate
> authority, separate discussion in any case), for all or a portion of a
> BAR mapping.

For KVM, at least the WC case, user-space doesn't need to be involved as
it normally should not access the same BAR concurrently with the guest.
But at some point, for CXL-attached memory for example, it may need to
be able to map it as cacheable so that it has the same attributes as the
guest.

> The easy way is an ioctl to affect the attributes of the next mmap but
> it's a rather gross interface.
>
> A better approach (still requires some coordination but not nearly as
> bad) would be to have an ioctl to create "subregions", ie, dynamically
> add new "struct vfio_pci_region" (using the existing dynamic index
> API), which are children of existing regions (including real BARs) and
> provide different attributes, which mmap can then honor.
>
> This is particularly suited for the case (which used to exist, I don't
> know if it still does) where the buffer that wants write combining
> resides in the same BAR as registers that otherwise don't.

IIUC that's still the case for some devices (I think Jason mentioned
some Mellanox cards).

> A simpler compromise if that latter case is deemed irrelevant would be
> an ioctl to selectively set a region index (including BARs) to be WC
> prior to mmap.
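
To make the subregion idea a bit more concrete, below is a rough sketch
of the kind of request such an ioctl might carry and the sanity checks
the kernel side would probably want before honouring it. Everything
here is hypothetical: the sketch_* names, the structure layout and the
attribute values are made up for illustration and are not part of the
VFIO UAPI.

```c
#include <stdbool.h>
#include <stdint.h>

/* HYPOTHETICAL: none of these names exist in the VFIO UAPI. */
enum sketch_map_attr {
	SKETCH_ATTR_DEFAULT = 0,	/* same attributes as the parent region */
	SKETCH_ATTR_WC      = 1,	/* write-combine */
};

/* Stand-in for an existing region such as a BAR. */
struct sketch_region {
	uint64_t size;			/* size of the parent region in bytes */
};

/* What userspace might pass to a "create subregion" ioctl. */
struct sketch_subregion_req {
	uint32_t parent_index;		/* index of an existing region */
	uint32_t attr;			/* one of enum sketch_map_attr */
	uint64_t offset;		/* start of the window within the parent */
	uint64_t size;			/* length of the window in bytes */
};

/*
 * Accept only a known attribute and a non-empty, page-aligned window
 * that lies entirely inside the parent region.
 */
bool sketch_subregion_valid(const struct sketch_region *parent,
			    const struct sketch_subregion_req *req,
			    uint64_t page_size)
{
	if (page_size == 0)
		return false;
	if (req->attr != SKETCH_ATTR_DEFAULT && req->attr != SKETCH_ATTR_WC)
		return false;
	if (req->size == 0 || req->offset % page_size || req->size % page_size)
		return false;
	/* Written this way so that offset + size cannot overflow. */
	if (req->offset > parent->size ||
	    req->size > parent->size - req->offset)
		return false;
	return true;
}
```

With a 64K BAR, a WC window covering the upper 32K (offset 0x8000, size
0x8000) would pass, leaving the registers in the lower half with the
default attributes. A window running past the end of the BAR or one that
is not page aligned would be rejected.
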
>
> I don't know if that fits in the ideas you have for KVM, I think it
> could by having the userspace component require mappings using a
> "special" attribute which we could define as being the most relaxed
> allowed to pass to a VM, which can then be architecture defined. The
> guest can then enforce specifics. Does this make sense?

I think this interface would help KVM when we need a cacheable mapping.
For WC, we are ok without any VFIO changes.

-- 
Catalin
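
For reference, the sysfs route mentioned earlier in the thread (picking
the _wc variant of a PCI resource file) looks roughly like the sketch
below from userspace. The helper names pci_resource_path() and
pci_map_bar() are made up for this example. The resourceN_wc file is
not present for every BAR or on every architecture, so a real user
would need a fallback to the plain resourceN file.

```c
#include <fcntl.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Build the sysfs path for BAR `bar` of the PCI device `bdf`
 * (e.g. "0000:01:00.0"). A non-zero `wc` selects the write-combine
 * variant. Returns 0 on success, -1 on a bad name or short buffer.
 */
int pci_resource_path(char *buf, size_t len, const char *bdf, int bar, int wc)
{
	int n;

	/* A device name must not contain a path separator. */
	if (strchr(bdf, '/'))
		return -1;

	n = snprintf(buf, len, "/sys/bus/pci/devices/%s/resource%d%s",
		     bdf, bar, wc ? "_wc" : "");
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Map `len` bytes of the BAR, where `len` must not exceed the BAR size.
 * Returns MAP_FAILED on any error, including a missing _wc file.
 */
void *pci_map_bar(const char *bdf, int bar, int wc, size_t len)
{
	char path[128];
	void *p;
	int fd;

	if (pci_resource_path(path, sizeof(path), bdf, bar, wc))
		return MAP_FAILED;

	fd = open(path, O_RDWR);
	if (fd < 0)
		return MAP_FAILED;

	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd);	/* the mapping stays valid after close */
	return p;
}
```

The point of the sketch is that the choice between the two mappings is
made entirely by the userspace driver through the file it opens, which
is the property Ben would like VFIO to offer as well.
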