Date: Wed, 26 Mar 2025 11:24:32 -0700
In-Reply-To: <86wmcbllg2.wl-maz@kernel.org>
References: <20250319170429.GK9311@nvidia.com> <20250319192246.GQ9311@nvidia.com>
 <86y0wrlrxt.wl-maz@kernel.org> <86wmcbllg2.wl-maz@kernel.org>
Subject: Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
From: Sean Christopherson
To: Marc Zyngier
Cc: Ankit Agrawal, Catalin Marinas, Jason Gunthorpe, Oliver Upton,
 "joey.gouly@arm.com", "suzuki.poulose@arm.com", "yuzenghui@huawei.com",
 "will@kernel.org", "ryan.roberts@arm.com", "shahuang@redhat.com",
 "lpieralisi@kernel.org", "david@redhat.com", Aniket Agashe, Neo Jia,
 Kirti Wankhede, "Tarun Gupta (SW-GPU)", Vikram Sethi, Andy Currid,
 Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs,
 Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju,
 "alex.williamson@redhat.com", "sebastianene@google.com",
 "coltonlewis@google.com", "kevin.tian@intel.com", "yi.l.liu@intel.com",
 "ardb@kernel.org", "akpm@linux-foundation.org", "gshan@redhat.com",
 "linux-mm@kvack.org", "ddutile@redhat.com", "tabba@google.com",
 "qperret@google.com", "kvmarm@lists.linux.dev",
 "linux-kernel@vger.kernel.org", "linux-arm-kernel@lists.infradead.org"
Content-Type: text/plain; charset="us-ascii"

On Wed, Mar 26, 2025, Marc Zyngier wrote:
> On Wed, 26 Mar 2025 16:10:45 +0000,
> Sean Christopherson wrote:
> >
> > On Wed, Mar 26, 2025, Marc Zyngier wrote:
> > > On Wed, 26 Mar 2025 14:53:34 +0000,
> > > Sean Christopherson wrote:
> > > >
> > > > On Wed, Mar 26, 2025, Ankit Agrawal wrote:
> > > > > > On Wed, Mar 19, 2025 at 04:22:46PM -0300, Jason Gunthorpe wrote:
> > > > > > > On Wed, Mar 19, 2025 at 06:11:02PM +0000, Catalin Marinas wrote:
> > > > > > > > On Wed, Mar 19, 2025 at 02:04:29PM -0300, Jason Gunthorpe wrote:
> > > > > > > > > On Wed, Mar 19, 2025 at 12:01:29AM -0700, Oliver Upton wrote:
> > > > > > > > > > You have a very good point that KVM is broken for cacheable PFNMAP'd
> > > > > > > > > > crap since we demote to something non-cacheable, and maybe that
> > > > > > > > > > deserves fixing first. Hopefully nobody notices that we've taken away
> > > > > > > > > > the toys...
> > > > > > > > >
> > > > > > > > > Fixing it is either faulting all access attempts or mapping it
> > > > > > > > > cacheable to the S2 (as this series is trying to do)..
> > > > > > > >
> > > > > > > > As I replied earlier, it might be worth doing both - fault on !FWB
> > > > > > > > hardware (or rather reject the memslot creation), cacheable S2
> > > > > > > > otherwise.
> > > > > > >
> > > > > > > I have no objection, Ankit are you able to make a failure patch?
> > > > > >
> > > > > > I'd wait until the KVM maintainers have their say.
> > > > > >
> > > > >
> > > > > Maz, Oliver any thoughts on this? Can we conclude to create this failure
> > > > > patch in memslot creation?
> > > >
> > > > That's not sufficient. As pointed out multiple times in this thread, any checks
> > > > done at memslot creation are best effort "courtesies" provided to userspace to
> > > > avoid terminating running VMs when the memory is faulted in.
> > > >
> > > > I.e. checking at memslot creation is optional, checking at fault-in/mapping is
> > > > not.
> > > >
> > > > With that in place, I don't see any need for a memslot flag. IIUC, without FWB,
> > > > cacheable pfn-mapped memory is broken and needs to be disallowed. But with FWB,
> > > > KVM can simply honor the cacheability based on the VMA. Neither of those requires
> > >
> > > Remind me how this works with stuff such as guestmemfd, which, by
> > > definition, doesn't have a userspace mapping?
> >
> > Definitely not through a memslot flag. The cacheability would be a property of
> > the guest_memfd inode, similar to how it's a property of the underlying device
> > in this case.
>
> It's *not* a property of the device. It's a property of the mapping.

Sorry, bad phrasing. I was trying to say that the entity that controls the
cacheability is ultimately whatever kernel subsystem/driver controls the mapping.

> > I don't entirely see what guest_memfd has to do with this.
>
> You were the one mentioning sampling the cacheability via the VMA. As
> far as I understand guestmemfd, there is no VMA to speak of.
>
> > One of the big
> > advantages of guest_memfd is that KVM has complete control over the lifecycle of
> > the memory. IIUC, the issue with !FWB hosts is that KVM can't guarantee there
> > are valid host mappings when memory is unmapped from the guest, and so can't do
> > the necessary maintenance. I agree with Jason's earlier statement that that's a
> > solvable kernel flaw.
> >
> > For guest_memfd, KVM already does maintenance operations when memory is reclaimed,
> > for both SNP and TDX. I don't think ARM's cacheability stuff would require any
> > new functionality in guest_memfd.
>
> I don't know how you reconcile the lack of host mapping and cache
> maintenance. The latter cannot take place without the former.

I assume cache maintenance only requires _a_ mapping to the physical memory.
With guest_memfd, KVM has the pfn (which happens to always be struct page memory
today), and so can establish a VA=>PA mapping as needed.

> > > > a memslot flag. A KVM capability to enumerate FWB support would be nice though,
> > > > e.g. so userspace can assert and bail early without ever hitting an
> > > > ioctl error.
> > >
> > > It's not "nice". It's mandatory. And FWB is definitely *not* something
> > > we want to expose as such.
> >
> > I agree a capability is mandatory if we're adding a memslot flag, but I don't
> > think it's mandatory if this is all handled through kernel plumbing.
>
> It is mandatory, full stop. Otherwise, userspace is able to migrate a
> VM from an FWB host to a non-FWB one, start the VM, blow up on the
> first page fault. That's not an acceptable outcome.
> > >
> > > > If we want to support existing setups that happen to work by dumb luck or careful
> > > > configuration, then that should probably be an admin decision to support the
> > > > "unsafe" behavior, i.e. an off-by-default KVM module param, not a memslot flag.
> > >
> > > No. That's not how we handle an ABI issue.
> > > VM migration, with and without FWB, can happen in both directions, and
> > > must have clear semantics. So NAK to a kernel parameter.
> > >
> > > If I have a VM with a device mapped as *device* on FWB host, I must be
> > > able to migrate it to non-FWB host, and back. A device mapped as
> > > *cacheable* can only be migrated between FWB-capable hosts.
> >
> > But I thought the whole problem is that mapping this fancy memory as device is
> > unsafe on non-FWB hosts? If it's safe, then why does KVM need to reject anything
> > in the first place?
>
> I don't know where you got that idea. This is all about what memory
> type is exposed to a guest:
>
> - with FWB, no need for CMOs, so cacheable memory is allowed if the
>   device supports it (i.e. it actually exposes memory), and device
>   otherwise.
>
> - without FWB, CMOs are required, and we don't have a host mapping for
>   these pages. As a fallback, the mapping is device only, as this
>   doesn't require any CMO by definition.
>
> There is no notion of "safety" here.

Ah, the safety I'm talking about is the CMO requirement. IIUC, not doing CMOs
if the memory is cacheable could result in data corruption, i.e. would be a
safety issue for the host. But I missed that you were proposing that the !FWB
behavior would be to force device mappings.

> > > Importantly, it is *userspace* that is in charge of deciding how the
> > > device is mapped at S2. And the memslot flag is the correct
> > > abstraction for that.
> >
> > I strongly disagree. Whatever owns the underlying physical memory is in charge,
> > not userspace. For memory that's backed by a VMA, userspace can influence the
> > behavior through mmap(), mprotect(), etc., but ultimately KVM needs to pull state
> > from mm/, via the VMA. Or in the guest_memfd case, from guest_memfd.
>
> I don't buy that. Userspace needs to know the semantics of the memory
> it gives to the guest.
> Or at least discover that the same device plugged into two different
> hosts will have different behaviours. Just letting things rip is not an
> acceptable outcome.

Agreed, but that doesn't require a memslot flag. A capability to enumerate that
KVM can do cacheable mappings for PFNMAP memory would suffice. And if we want to
have KVM reject memslots that are cacheable in the VMA, but would get device in
stage-2, then we can provide that functionality through the capability, i.e. let
userspace decide if it wants "fallback to device" vs. "error on creation" on a
per-VM basis.

What I object to is adding a memslot flag.

> > I have no objection to adding KVM uAPI to let userspace add _restrictions_, e.g.
> > to disallow mapping memory as writable even if the VMA is writable. But IMO,
> > adding a memslot flag to control cacheability isn't purely subtractive.
>
> I don't see how that solves the problem at hand: given the presence or
> absence of FWB, allow userspace to discover as early as possible what
> behaviour a piece of memory provided by a device will have, and
> control it to handle migration.
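
For readers following along, the policy debated above can be written down as a
small model. This is only a sketch in plain userspace C, not KVM code: every
name in it (s2_memtype, noncoherent_policy, pick_s2_memtype, can_migrate) is
hypothetical and made up for illustration. It assumes the behaviour described
in the thread: with FWB a cacheable VMA gets a cacheable stage-2 mapping,
without FWB the mapping either falls back to device or is rejected depending
on a per-VM choice, and a VM using a cacheable mapping can only move between
FWB-capable hosts.

```c
#include <stdbool.h>

/* All names below are hypothetical; none are real KVM symbols or uAPI. */
enum s2_memtype {
	S2_DEVICE,	/* device mapping at stage 2, no CMOs needed */
	S2_CACHEABLE,	/* cacheable mapping at stage 2, relies on FWB */
	S2_REJECT,	/* refuse the mapping/memslot */
};

/* Per-VM choice for "cacheable VMA on a !FWB host", as floated above. */
enum noncoherent_policy {
	POLICY_FALLBACK_TO_DEVICE,
	POLICY_ERROR_ON_CREATION,
};

/* Pick the stage-2 memory type for a PFNMAP region. */
enum s2_memtype pick_s2_memtype(bool host_has_fwb, bool vma_cacheable,
				enum noncoherent_policy policy)
{
	/* A non-cacheable VMA is mapped as device regardless of FWB. */
	if (!vma_cacheable)
		return S2_DEVICE;

	/* With FWB no CMOs are required, so honor the VMA's cacheability. */
	if (host_has_fwb)
		return S2_CACHEABLE;

	/* Without FWB: device fallback, or fail early if userspace asked. */
	return policy == POLICY_ERROR_ON_CREATION ? S2_REJECT : S2_DEVICE;
}

/* A cacheable stage-2 mapping can only migrate to an FWB-capable host. */
bool can_migrate(bool uses_cacheable_s2, bool dest_has_fwb)
{
	return !uses_cacheable_s2 || dest_has_fwb;
}
```

For example, pick_s2_memtype(false, true, POLICY_FALLBACK_TO_DEVICE) yields
S2_DEVICE, while the same call with POLICY_ERROR_ON_CREATION yields S2_REJECT.
Whether that choice is exposed through a capability or a memslot flag is the
open question in the thread, and the sketch takes no position on it.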