From: Nikita Kalyazin
Date: Thu, 13 Mar 2025 22:13:23 +0000
Subject: Re: [RFC PATCH 0/5] KVM: guest_memfd: support for uffd missing
To: Peter Xu
CC: James Houghton
Message-ID: <507e6ad7-2e28-4199-948a-4001e0d6f421@amazon.com>
References: <9e7536cc-211d-40ca-b458-66d3d8b94b4d@amazon.com> <7c304c72-1f9c-4a5a-910b-02d0f1514b01@amazon.com> <69dc324f-99fb-44ec-8501-086fe7af9d0d@amazon.com>

On 13/03/2025 19:12, Peter Xu wrote:
> On Thu, Mar 13, 2025 at 03:25:16PM +0000, Nikita Kalyazin wrote:
>>
>>
>> On 12/03/2025 19:32, Peter Xu wrote:
>>> On Wed, Mar 12, 2025 at 05:07:25PM +0000, Nikita Kalyazin wrote:
>>>> However if MISSING is not registered, the kernel will auto-populate with a
>>>> clear page, ie there is no way to inject custom content from userspace. To
>>>> explain my use case a bit more, the population thread will be trying to copy
>>>> all guest memory proactively, but there will inevitably be cases where a
>>>> page is accessed through pgtables _before_ it gets populated. It is not
>>>> desirable for such access to result in a clear page provided by the kernel.
>>>
>>> IMHO populating with a zero page in the page cache is fine. It needs to
>>> make sure all accesses will go via the pgtable, as discussed below in my
>>> previous email [1], then nobody will be able to see the zero page, not
>>> until someone updates the content then follow up with a CONTINUE to install
>>> the pgtable entry.
>>>
>>> If there is any way that the page can be accessed without the pgtable
>>> installation, minor faults won't work indeed.
>>
>> I think I see what you mean now. I agree, it isn't the end of the world if
>> the kernel clears the page and then userspace overwrites it.
>>
>> The way I see it is:
>>
>> @@ -400,20 +401,26 @@ static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
>>  	if (WARN_ON_ONCE(folio_test_large(folio))) {
>>  		ret = VM_FAULT_SIGBUS;
>>  		goto out_folio;
>>  	}
>>
>>  	if (!folio_test_uptodate(folio)) {
>>  		clear_highpage(folio_page(folio, 0));
>>  		kvm_gmem_mark_prepared(folio);
>>  	}
>>
>> +	if (userfaultfd_minor(vmf->vma)) {
>> +		folio_unlock(folio);
>> +		filemap_invalidate_unlock_shared(inode->i_mapping);
>> +		return handle_userfault(vmf, VM_UFFD_MISSING);
>> +	}
>
> I suppose you meant s/MISSING/MINOR/.

Yes, that's what I meant, thank you.

>> +
>>  	vmf->page = folio_file_page(folio, vmf->pgoff);
>>
>>  out_folio:
>>  	if (ret != VM_FAULT_LOCKED) {
>>  		folio_unlock(folio);
>>  		folio_put(folio);
>>  	}
>>
>> On the first fault (cache miss), the kernel will allocate/add/clear the page
>> (as there is no MISSING trap now), and once the page is in the cache, a
>> MINOR event will be sent for userspace to copy its content. Please let me
>> know if this is an acceptable semantics.
>>
>> Since userspace is getting notified after KVM calls
>> kvm_gmem_mark_prepared(), which removes the page from the direct map [1],
>> userspace can't use write() to populate the content because write() relies
>> on direct map [2]. However userspace can do a plain memcpy that would use
>> user pagetables instead. This forces userspace to respond to stage-2 and
>> VMA faults in guest_memfd differently, via write() and memcpy respectively.
>> It doesn't seem like a significant problem though.
>
> It looks ok in general, but could you remind me why you need to stick with
> write() syscall?
>
> IOW, if gmemfd will always need mmap() and it's fully accessible from
> userspace in your use case, wouldn't mmap()+memcpy() always work already,
> and always better than write()?

Yes, that's right, mmap() + memcpy() is functionally sufficient. write() is
an optimisation.
Most of the pages in guest_memfd are only ever accessed by the vCPU (not
userspace) via TDP (stage-2 pagetables), so they don't need userspace
pagetables set up. By using write() we can avoid VMA faults, installing the
corresponding PTEs, and the double page initialisation we discussed earlier.
The optimised path then consists only of pagecache population via write().
Even TDP faults can be avoided by using the KVM prefaulting API [1].

[1] https://docs.kernel.org/virt/kvm/api.html#kvm-pre-fault-memory

>
> Thanks,
>
>>
>> I believe, with this approach the original race condition is gone because
>> UFFD messages are only sent on cache hit and it is up to userspace to
>> serialise writes. Please correct me if I'm wrong here.
>>
>> [1] https://lore.kernel.org/kvm/20250221160728.1584559-1-roypat@amazon.co.uk/T/#mdf41fe2dc33332e9c500febd47e14ae91ad99724
>> [2] https://lore.kernel.org/kvm/20241129123929.64790-1-kalyazin@amazon.com/T/#mf5d794aa31d753cbc73e193628f31e418051983d
>>
>>>>
>>>>> as long as the content can only be accessed from the pgtable (either via
>>>>> mmap() or GUP on top of it), then afaiu it could work similarly like
>>>>> MISSING faults, because anything trying to access it will be trapped.
>>>
>>> [1]
>>>
>>> --
>>> Peter Xu
>>>
>>
>>
>
> --
> Peter Xu
>
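
To make the two userspace population paths discussed above concrete, here is a minimal sketch in C. It is an illustration under stated assumptions, not code from the series: an ordinary temporary file stands in for guest_memfd (write() on guest_memfd is only proposed in [2]), the helper names populate_with_write(), populate_with_memcpy() and demo_population_paths() are hypothetical, and neither userfaultfd nor direct map removal is modelled. It only shows that the write() path fills the pagecache without creating a mapping, while the mmap() + memcpy() path goes through user pagetables.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: fill [off, off + len) through the pagecache only.
 * No VMA is created, so no user PTEs are installed for these pages. */
int populate_with_write(int fd, const void *src, size_t len, off_t off)
{
	const char *p = src;

	while (len > 0) {
		ssize_t n = pwrite(fd, p, len, off);

		if (n <= 0)
			return -1;
		p += n;
		off += n;
		len -= (size_t)n;
	}
	return 0;
}

/* Hypothetical helper: fill the same kind of range via mmap() + memcpy().
 * The stores fault on the VMA and go through user pagetables. */
int populate_with_memcpy(int fd, const void *src, size_t len, off_t off)
{
	void *dst;

	/* mmap() requires a page-aligned file offset. */
	assert(off % sysconf(_SC_PAGESIZE) == 0);

	dst = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, off);
	if (dst == MAP_FAILED)
		return -1;
	memcpy(dst, src, len);
	return munmap(dst, len);
}

/* Populate page 0 with write() and page 1 with memcpy(), then read both
 * back. Returns 0 if both pages hold the source content, -1 otherwise. */
int demo_population_paths(void)
{
	char path[] = "/tmp/gmem-stand-in-XXXXXX"; /* stand-in, not guest_memfd */
	long page = sysconf(_SC_PAGESIZE);
	size_t len;
	char *src, *back;
	int fd, ret = -1;

	if (page <= 0)
		return -1;
	len = (size_t)page;
	src = malloc(len);
	back = malloc(len);
	fd = mkstemp(path);
	if (fd >= 0)
		unlink(path);
	if (!src || !back || fd < 0)
		goto out;
	if (ftruncate(fd, 2 * (off_t)len) < 0)
		goto out;

	memset(src, 0xA5, len);
	if (populate_with_write(fd, src, len, 0) < 0)
		goto out;
	if (populate_with_memcpy(fd, src, len, (off_t)len) < 0)
		goto out;

	if (pread(fd, back, len, 0) != (ssize_t)len || memcmp(src, back, len))
		goto out;
	if (pread(fd, back, len, (off_t)len) != (ssize_t)len ||
	    memcmp(src, back, len))
		goto out;
	ret = 0;
out:
	if (fd >= 0)
		close(fd);
	free(src);
	free(back);
	return ret;
}
```

Calling demo_population_paths() from a test program returns 0 when both paths leave identical content in the file. In the guest_memfd case the difference is what happens on the way: per the discussion above, only the memcpy() path would incur VMA faults and PTE installation.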