Date: Tue, 18 Mar 2025 09:39:30 +0000
Message-ID: <86wmcmn0dp.wl-maz@kernel.org>
From: Marc Zyngier
To: Catalin Marinas
Cc: Ankit Agrawal, Jason Gunthorpe, "oliver.upton@linux.dev",
	"joey.gouly@arm.com", "suzuki.poulose@arm.com",
	"yuzenghui@huawei.com", "will@kernel.org", "ryan.roberts@arm.com",
	"shahuang@redhat.com", "lpieralisi@kernel.org", "david@redhat.com",
	Aniket Agashe, Neo Jia, Kirti Wankhede, "Tarun Gupta (SW-GPU)",
	Vikram Sethi, Andy Currid, Alistair Popple, John Hubbard,
	Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke, Dheeraj Nigam,
	Krishnakant Jaju, "alex.williamson@redhat.com",
	"sebastianene@google.com", "coltonlewis@google.com",
	"kevin.tian@intel.com", "yi.l.liu@intel.com", "ardb@kernel.org",
	"akpm@linux-foundation.org", "gshan@redhat.com",
	"linux-mm@kvack.org", "ddutile@redhat.com", "tabba@google.com",
	"qperret@google.com", "seanjc@google.com",
	"kvmarm@lists.linux.dev", "linux-kernel@vger.kernel.org",
	"linux-arm-kernel@lists.infradead.org"
Subject: Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
References: <20250310103008.3471-1-ankita@nvidia.com>
	<20250310103008.3471-2-ankita@nvidia.com>
	<861pv5p0c3.wl-maz@kernel.org>
	<86r033olwv.wl-maz@kernel.org>
	<87tt7y7j6r.wl-maz@kernel.org>
	<8634fcnh0n.wl-maz@kernel.org>

On Mon, 17 Mar 2025 19:54:25 +0000,
Catalin Marinas wrote:
> 
> On Mon, Mar 17, 2025 at 09:27:52AM +0000, Marc Zyngier wrote:
> > On Mon, 17 Mar 2025 05:55:55 +0000,
> > Ankit Agrawal wrote:
> > > 
> > > >> For my education, what is an accepted way to communicate this? Please let
> > > >> me know if there are any relevant examples that you may be aware of.
> > > >
> > > > A KVM capability is what is usually needed.
> > > 
> > > I see. If IIUC, this would involve a corresponding Qemu (usermode) change
> > > to fetch the new KVM cap. Then it could fail in case the FWB is not
> > > supported with some additional conditions (so that the currently supported
> > > configs with !FWB won't break on usermode).
> > > 
> > > The proposed code change is to map in S2 as NORMAL when vma flags
> > > has VM_PFNMAP. However, Qemu cannot know that driver is mapping
> > > with PFNMAP or not. So how may Qemu decide whether it is okay to
> > > fail for !FWB or not?
> > 
> > This is not about FWB as far as userspace is concerned. This is about
> > PFNMAP as non-device memory. If the host doesn't have FWB, then the
> > "PFNMAP as non-device memory" capability doesn't exist, and userspace
> > fails early.
> > 
> > Userspace must also have some knowledge of what device it obtains the
> > mapping from, and whether that device requires some extra host
> > capability to be assigned to the guest.
> > 
> > You can then check whether the VMA associated with the memslot is
> > PFNMAP or not, if the memslot has been enabled for PFNMAP mappings
> > (either globally or on a per-memslot basis, I don't really care).
> 
> Trying to page this back in, I think there are three stages:
> 
> 1. A KVM cap that the VMM can use to check for non-device PFNMAP (or
>    rather cacheable PFNMAP since we already support Normal NC).
> 
> 2. Memslot registration - we need a way for the VMM to require such
>    cacheable PFNMAP and for KVM to check. Current patch relies on (a)
>    the stage 1 vma attributes which I'm not a fan of. An alternative I
>    suggested was (b) a VM_FORCE_CACHEABLE vma flag, on the assumption
>    that the vfio driver knows if it supports cacheable (it's a bit of a
>    stretch trying to make this generic). Yet another option is (c) a
>    KVM_MEM_CACHEABLE flag that the VMM passes at memslot registration.
> 
> 3. user_mem_abort() - follows the above logic (whatever we decide),
>    maybe with some extra check and WARN in case we got the logic wrong.
> 
> The problems in (2) are that we need to know that the device supports
> cacheable mappings and we don't introduce additional issues or end up
> with FWB on a PFNMAP that does not support cacheable. Without any vma
> flag like the current VM_ALLOW_ANY_UNCACHED, the next best thing is
> relying on the stage 1 attributes. But we don't know them at the memslot
> registration, only later in step (3) after a GUP on the VMM address
> space.
> 
> So in (2), when !FWB, we only want to reject VM_PFNMAP slots if we know
> they are going to be mapped as cacheable. So we need this information
> somehow, either from the vma->vm_flags or slot->flags.

Yup, that's mostly how I think of it.

Obtaining a mapping from the xPU driver must result in VM_PFNMAP being
set in the VMA. I don't think that's particularly controversial.

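For illustration only, the three stages quoted above can be sketched in
plain C, assuming option (c), a memslot flag, is the one chosen. This is
not kernel code: every sketch_* / SKETCH_* name is a hypothetical
stand-in, the FWB state is passed in as a plain bool, and only the
VM_PFNMAP case is modelled. The fault-time handling of a VMA with
conflicting attributes is deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for option (c), a KVM_MEM_CACHEABLE-style flag. */
#define SKETCH_MEM_CACHEABLE (1u << 0)

/* Stage 2 attribute picked for a VM_PFNMAP VMA (hypothetical enum). */
enum sketch_s2_attr {
    SKETCH_S2_UNCACHED,  /* non-cacheable, i.e. the existing behaviour */
    SKETCH_S2_CACHEABLE, /* the new cacheable PFNMAP mapping */
};

/* Hypothetical memslot: only the flags matter for this sketch. */
struct sketch_memslot {
    unsigned int flags;
};

/* Stage 1: the capability only exists when the host has FWB. */
static bool sketch_cap_cacheable_pfnmap(bool host_has_fwb)
{
    return host_has_fwb;
}

/*
 * Stage 2: memslot registration. Returns 0 on success, or -1 when a
 * cacheable slot is requested although the capability is not there.
 */
static int sketch_register_memslot(struct sketch_memslot *slot,
                                   unsigned int flags, bool host_has_fwb)
{
    assert(slot != NULL);

    if ((flags & SKETCH_MEM_CACHEABLE) &&
        !sketch_cap_cacheable_pfnmap(host_has_fwb))
        return -1;

    slot->flags = flags;
    return 0;
}

/* Stage 3: at fault time, a PFNMAP VMA follows the memslot flag. */
static enum sketch_s2_attr
sketch_pfnmap_fault_attr(const struct sketch_memslot *slot)
{
    assert(slot != NULL);

    return (slot->flags & SKETCH_MEM_CACHEABLE) ? SKETCH_S2_CACHEABLE
                                                : SKETCH_S2_UNCACHED;
}
```

With these definitions, registering a slot with SKETCH_MEM_CACHEABLE on
a host without FWB is rejected up front, while a slot registered without
the flag keeps the non-cacheable mapping whether or not FWB is present.
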
The memslot must also be created with a new flag ((2c) in the taxonomy
above) that carries the "Please map VM_PFNMAP VMAs as cacheable". This
flag is only allowed if (1) is valid.

This results in the following behaviours:

- If the VMM creates the memslot with the cacheable attribute without
  (1) being advertised, we fail.

- If the VMM creates the memslot without the cacheable attribute, we
  map as NC, as it is today.

What this doesn't do is *automatically* decide for the VMM what
attributes to use. The VMM must know what it is doing, and only
provide the memslot flag when appropriate. Doing otherwise may eat
your data and/or take the machine down (cacheable mapping on a device
can be great fun). If you want to address this, then "someone" needs
to pass some additional VMA flag that KVM can check.

Of course, all of this only caters for well behaved userspace, and we
need to gracefully handle (3) when the VMM sneaks in a new VMA that
has conflicting attributes.

For that, we need a reasonable fault reporting interface that allows
userspace to correctly handle it. I don't think this is unique to this
case, but also covers things like MTE and other funky stuff that
relies on the backing memory having some particular "attributes".

An alternative could be to require the VMA to be sealed, which would
prevent any overlapping mapping. But I only have looked at that for 2
minutes...

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.