Date: Wed, 28 Feb 2024 12:44:47 +0000
From: Quentin Perret
To: David Hildenbrand
Cc: Matthew Wilcox, Fuad Tabba, kvm@vger.kernel.org, kvmarm@lists.linux.dev,
	pbonzini@redhat.com, chenhuacai@kernel.org, mpe@ellerman.id.au,
	anup@brainfault.org, paul.walmsley@sifive.com, palmer@dabbelt.com,
	aou@eecs.berkeley.edu, seanjc@google.com, viro@zeniv.linux.org.uk,
	brauner@kernel.org, akpm@linux-foundation.org, xiaoyao.li@intel.com,
	yilun.xu@intel.com, chao.p.peng@linux.intel.com, jarkko@kernel.org,
	amoorthy@google.com, dmatlack@google.com, yu.c.zhang@linux.intel.com,
	isaku.yamahata@intel.com, mic@digikod.net, vbabka@suse.cz,
	vannapurve@google.com, ackerleytng@google.com,
	mail@maciej.szmigiero.name, michael.roth@amd.com, wei.w.wang@intel.com,
	liam.merwick@oracle.com, isaku.yamahata@gmail.com,
	kirill.shutemov@linux.intel.com, suzuki.poulose@arm.com,
	steven.price@arm.com, quic_mnalajal@quicinc.com,
	quic_tsoni@quicinc.com, quic_svaddagi@quicinc.com,
	quic_cvanscha@quicinc.com, quic_pderrin@quicinc.com,
	quic_pheragu@quicinc.com, catalin.marinas@arm.com, james.morse@arm.com,
	yuzenghui@huawei.com, oliver.upton@linux.dev, maz@kernel.org,
	will@kernel.org, keirf@google.com, linux-mm@kvack.org
Subject: Re: folio_mmapped
References: <20240222161047.402609-1-tabba@google.com>
	<20240222141602976-0800.eberman@hu-eberman-lv.qualcomm.com>
	<40a8fb34-868f-4e19-9f98-7516948fc740@redhat.com>
	<20240226105258596-0800.eberman@hu-eberman-lv.qualcomm.com>
	<925f8f5d-c356-4c20-a6a5-dd7efde5ee86@redhat.com>
	<755911e5-8d4a-4e24-89c7-a087a26ec5f6@redhat.com>
In-Reply-To: <755911e5-8d4a-4e24-89c7-a087a26ec5f6@redhat.com>

On Wednesday 28 Feb 2024 at 12:11:30 (+0100), David Hildenbrand wrote:
> On 28.02.24 11:48, Quentin Perret wrote:
> > On Tuesday 27 Feb 2024 at 15:59:37 (+0100), David Hildenbrand wrote:
> > > >
> > > > Ah, this was something I hadn't thought about. I think both Fuad and I
> > > > need to update our series to check the refcount rather than mapcount
> > > > (kvm_is_gmem_mapped for Fuad, gunyah_folio_lend_safe for me).
> > >
> > > An alternative might be !folio_mapped() && !folio_maybe_dma_pinned(). But
> > > checking for any unexpected references might be better (there are still some
> > > GUP users that don't use FOLL_PIN).
> >
> > As a non-mm person I'm not sure I understand the consequences of holding
> > a GUP pin on a page that is not covered by any VMA. The absence of VMAs
> > implies that userspace cannot access the page, right? Presumably the kernel
> > can't be coerced into accessing that page either? Is that correct?
>
> Simple example: register the page using an io_uring fixed buffer, then unmap
> the VMA. io_uring now has the page pinned and can read/write it using an
> address in the kernel virtual address space (direct map).
>
> Then, you can happily make the kernel read/write that page using io_uring,
> even though no VMA still covers/maps that page.

Makes sense, and yes, that would be a major bug if we let that happen. Thanks
for the explanation.

> [...]
>
> > > Instead of
> > >
> > > 1) Converting a page to private only if there are no unexpected
> > >    references (no mappings, GUP pins, ...) and no VMAs covering it where
> > >    we could fault it in later
> > > 2) Disallowing mmap when the range would contain any private page
> > > 3) Handling races between mmap and page conversion
> >
> > The one thing that makes the second option cleaner from a userspace
> > perspective (IMO) is that the conversion to private is happening lazily
> > during guest faults. So whether or not an mmapped page can indeed be
> > accessed from userspace will be entirely nondeterministic, as it depends
> > on the guest faulting pattern, which userspace is entirely unaware of.
> > Elliot's suggestion would prevent spurious crashes caused by that
> > somewhat odd behaviour, though arguably sane userspace software
> > shouldn't be doing that to start with.
>
> The last sentence is the important one. User space should not access that
> memory. If it does, it gets a slap on the hand. Because it should not access
> that memory.
>
> We might even be able to export to user space which pages are currently
> accessible and which ones not (e.g., pagemap), although it would be racy as
> long as the VM is running and can trigger a conversion.
>
> >
> > To add a layer of paint to the shed, the usage of SIGBUS for
> > something that is really a permission access problem doesn't feel
>
> SIGBUS stands for "BUS error (bad memory access)."
>
> Which makes sense, if you try accessing something that can no longer be
> accessed. It's now inaccessible. Even if only temporarily.
>
> Just like a page with an MCE error. Swapin errors. Etc. You cannot access
> it.
>
> It might be a permission problem on the pKVM side, but it's not the
> traditional "permission problem" as in mprotect() and friends. You cannot
> resolve that permission problem yourself. It's a higher entity that turned
> that memory inaccessible.

Well, that's where I'm not sure I agree. Userspace can, in fact, get back all
of that memory by simply killing the protected VM.
With the approach suggested here, the guestmem pages are entirely accessible
to the host until they are attached to a running protected VM, which triggers
the protection. It is very much userspace saying "I promise not to touch these
pages from now on" when it does that, in a way that I personally find very
comparable to the mprotect case. It is not some other entity that pulls the
carpet from under userspace's feet; it is userspace being inconsistent with
itself that causes the issue here, and that's why SIGBUS feels kinda wrong, as
it tends to be used to report external errors of some sort.

> > appropriate. Allocating memory via guestmem and donating that to a
> > protected guest is a way for userspace to voluntarily relinquish access
> > permissions to the memory it allocated. So a userspace process violating
> > that could, IMO, reasonably expect a SIGSEGV instead of SIGBUS. By the
> > point that signal would be sent, the page would have been accounted
> > against that userspace process, so not sure the paging examples that
> > were discussed earlier are exactly comparable. To illustrate that
> > differently, given that pKVM and Gunyah use MMU-based protection, there
> > is nothing architecturally that prevents a guest from sharing a page
> > back with Linux as RO.
>
> Sure, then allow page faults that allow for reads and give a signal on write
> faults.
>
> In that scenario, it even makes more sense to not constantly require new
> mmaps from user space just to access a now-shared page.
>
> > Note that we don't currently support this, so I
> > don't want to conflate this use case, but that hopefully makes it a
> > little more obvious that this is a "there is a page, but you don't
> > currently have the permission to access it" problem rather than "sorry
> > but we ran out of pages" problem.
>
> We could use other signals, as long as the semantics are clear and it's
> documented. Maybe SIGSEGV would be warranted.
>
> I consider that a minor detail, though.
>
> Requiring mmap()/munmap() dances just to access a page that is now shared
> from user space sounds a bit suboptimal. But I don't know all the details of
> the user space implementation.

Agreed, if we could avoid having to mmap() each page that gets shared back,
that would be a nice performance optimization.

> "mmap() the whole thing once and only access what you are supposed to
> access" sounds reasonable to me. If you don't play by the rules, you get a
> signal.

"... you get a signal, or maybe you don't". But yes, I understand your point,
and as per the above there are real benefits to this approach, so why not.

What do we expect userspace to do when a page goes from shared back to being
guest-private, e.g. because the guest decides to unshare? Use munmap() on that
page? Or perhaps an madvise() call of some sort? Note that this will be needed
when starting a guest as well, as userspace needs to copy the guest payload
into the guestmem file prior to starting the protected VM.

Thanks,
Quentin
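
For reference, the SIGBUS vs. SIGSEGV distinction debated above can be
reproduced with ordinary mappings. The following is only a rough user-space
sketch, assuming Linux and CPython. It does not touch guestmem, pKVM or
Gunyah, and every function name in it is made up for illustration. It faults
once on a file mapping whose backing page was removed by truncation (the
classic "the page is gone" case) and once on a mapping the process itself
made inaccessible with mprotect().

```python
import ctypes
import mmap
import os
import signal
import tempfile

PAGE = mmap.PAGESIZE


def signal_from_child(fn):
    """Run fn() in a forked child; return the signal that killed it, or None."""
    pid = os.fork()
    if pid == 0:
        try:
            fn()
        finally:
            os._exit(0)  # only reached if fn() did not take a fatal signal
    _, status = os.waitpid(pid, 0)
    if os.WIFSIGNALED(status):
        return signal.Signals(os.WTERMSIG(status))
    return None


def touch_after_truncate():
    """Backing page removed behind the mapping's back: expect SIGBUS."""
    with tempfile.TemporaryFile() as f:
        f.truncate(PAGE)                  # file is exactly one page long
        mm = mmap.mmap(f.fileno(), PAGE)  # shared file mapping
        f.truncate(0)                     # mapping stays, backing page is gone
        mm[0]                             # fault past end-of-file


def touch_after_mprotect():
    """Process revokes its own access with mprotect(): expect SIGSEGV."""
    mm = mmap.mmap(-1, PAGE)              # anonymous, page-aligned mapping
    addr = ctypes.addressof(ctypes.c_char.from_buffer(mm))
    libc = ctypes.CDLL(None, use_errno=True)
    prot_none = 0                         # PROT_NONE on Linux
    libc.mprotect(ctypes.c_void_p(addr), ctypes.c_size_t(PAGE), prot_none)
    mm[0]                                 # fault on a PROT_NONE page


if __name__ == "__main__":
    print("truncate:", signal_from_child(touch_after_truncate))
    print("mprotect:", signal_from_child(touch_after_mprotect))
```

Which of the two cases a guest-private page resembles more is exactly the
open question in this thread; the sketch only shows what the existing
semantics look like from user space.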