Message-ID: <67a7374a72053107661ecc2b2f36fdb3ff6cc6ae.camel@kernel.crashing.org>
Subject: Re: [PATCH v3 1/6] kvm: determine memory type from VMA
From: Benjamin Herrenschmidt
To: Catalin Marinas, Jason Gunthorpe
Cc: Marc Zyngier, ankita@nvidia.com, alex.williamson@redhat.com,
 naoya.horiguchi@nec.com, oliver.upton@linux.dev, aniketa@nvidia.com,
 cjia@nvidia.com, kwankhede@nvidia.com, targupta@nvidia.com,
 vsethi@nvidia.com, acurrid@nvidia.com, apopple@nvidia.com,
 jhubbard@nvidia.com, danw@nvidia.com, kvm@vger.kernel.org,
 linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
 linux-mm@kvack.org, Lorenzo Pieralisi, Clint Sbisa, osamaabb@amazon.com
Date: Fri, 14 Jul 2023 18:10:39 +1000
References: <20230405180134.16932-1-ankita@nvidia.com>
 <20230405180134.16932-2-ankita@nvidia.com>
 <86r0spl18x.wl-maz@kernel.org>

On Wed, 2023-05-31 at 12:35 +0100, Catalin Marinas wrote:
> There were several off-list discussions, I'm trying to summarise my
> understanding here. This series aims to relax the VFIO mapping to
> cacheable and have KVM map it into the guest with the same attributes.
> Somewhat related past threads also tried to relax the KVM device
> pass-through mapping from Device_nGnRnE (pgprot_noncached) to Normal_NC
> (pgprot_writecombine). Those were initially using the PCIe prefetchable
> BAR attribute but that's not a reliable means to infer whether Normal vs
> Device is safe. Anyway, I think we'd need to unify these threads and
> come up with some common handling that can cater for various attributes
> required by devices/drivers.
> Therefore replying in this thread.

So picking up on this as I was just trying to start a separate
discussion on the subject for write combine :-)

In this case, not so much for KVM as for VFIO to userspace though.

The rough idea is that the "userspace driver" (i.e. DPDK or equivalent)
for the device is the one to "know" whether a BAR or portion of a BAR
can/should be mapped write-combine, and is expected to also "know" what
to do to enforce ordering when necessary.

So the userspace component needs to be responsible for selecting the
mapping, the same way the PCI sysfs resource files allow that today by
selecting the _wc variant.

I posted a separate message that Lorenzo CCed back to some of you, but
let's recap here to keep the discussion localized.

I don't know how much of this makes sense for KVM, but I think what we
really want is for userspace to be able to specify some "attributes"
for all or a portion of a BAR mapping. We can initially limit these to
writecombine; full cacheability probably requires a device-specific
kernel driver providing adequate authority, which is a separate
discussion in any case.

The easy way is an ioctl to affect the attributes of the next mmap, but
it's a rather gross interface.

A better approach (still requires some coordination but not nearly as
bad) would be to have an ioctl to create "subregions", i.e. dynamically
add new "struct vfio_pci_region" (using the existing dynamic index
API), which are children of existing regions (including real BARs) and
provide different attributes, which mmap can then honor. This is
particularly suited for the case (which used to exist, I don't know if
it still does) where the buffer that wants write combining resides in
the same BAR as registers that don't.

A simpler compromise, if that latter case is deemed irrelevant, would
be an ioctl to selectively set a region index (including BARs) to be WC
prior to mmap.
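To make the subregion idea above a bit more concrete, here is a rough,
userspace-compilable sketch of what such a request could look like.
Every name in it (struct hyp_vfio_subregion, HYP_VFIO_ATTR_WC,
hyp_subregion_valid) is made up for illustration; nothing like this
exists in the VFIO uAPI today, and the checks are only a guess at what
the kernel side might want to validate before honoring the attribute at
mmap time.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical attribute bits; only write-combine is proposed initially. */
#define HYP_VFIO_ATTR_WC	(1u << 0)
#define HYP_VFIO_ATTR_MASK	HYP_VFIO_ATTR_WC

/*
 * Hypothetical argument for a "create subregion" ioctl. The argsz/flags
 * layout mimics the convention of existing VFIO ioctl structures, but
 * this structure itself is an assumption, not an existing interface.
 */
struct hyp_vfio_subregion {
	uint32_t argsz;		/* size of this structure */
	uint32_t flags;		/* reserved, must be zero */
	uint32_t parent_index;	/* existing region, e.g. a real BAR */
	uint32_t attrs;		/* HYP_VFIO_ATTR_* wanted for the child */
	uint64_t offset;	/* start of the child within the parent */
	uint64_t size;		/* length of the child in bytes */
};

/*
 * Sanity checks a kernel implementation might plausibly apply: known
 * attributes only, page-aligned range, fully contained in the parent.
 */
static inline bool hyp_subregion_valid(const struct hyp_vfio_subregion *req,
				       uint64_t parent_size,
				       uint64_t page_size)
{
	if (!page_size)
		return false;
	if (req->argsz < sizeof(*req) || req->flags)
		return false;
	if (!req->attrs || (req->attrs & ~HYP_VFIO_ATTR_MASK))
		return false;
	/* mmap operates on whole pages, so the child must be page aligned */
	if (!req->size || req->offset % page_size || req->size % page_size)
		return false;
	/* containment check written so that offset + size cannot overflow */
	if (req->offset > parent_size || req->size > parent_size - req->offset)
		return false;
	return true;
}
```

Under these assumptions, the userspace driver would describe the WC
portion of a BAR with such a request, get a new region index back, and
mmap that index as usual; registers in the rest of the BAR keep the
default mapping through the parent region.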
I don't know if that fits in the ideas you have for KVM. I think it
could, by having the userspace component request mappings using a
"special" attribute which we could define as being the most relaxed
allowed to pass to a VM, which can then be architecture defined. The
guest can then enforce specifics.

Does this make sense?

Cheers,
Ben.