From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <69dc324f-99fb-44ec-8501-086fe7af9d0d@amazon.com>
Date: Thu, 13 Mar 2025 15:25:16 +0000
Subject: Re: [RFC PATCH 0/5] KVM: guest_memfd: support for uffd missing
From: Nikita Kalyazin
To: Peter Xu
CC: James Houghton
References: <20250303133011.44095-1-kalyazin@amazon.com>
 <9e7536cc-211d-40ca-b458-66d3d8b94b4d@amazon.com>
 <7c304c72-1f9c-4a5a-910b-02d0f1514b01@amazon.com>

On 12/03/2025 19:32, Peter Xu wrote:
> On Wed, Mar 12, 2025 at 05:07:25PM +0000, Nikita Kalyazin wrote:
>> However if MISSING is not registered, the kernel will auto-populate with a
>> clear page, ie there is no way to inject custom content from userspace. To
>> explain my use case a bit more, the population thread will be trying to copy
>> all guest memory proactively, but there will inevitably be cases where a
>> page is accessed through pgtables _before_ it gets populated. It is not
>> desirable for such access to result in a clear page provided by the kernel.
>
> IMHO populating with a zero page in the page cache is fine. It needs to
> make sure all accesses will go via the pgtable, as discussed below in my
> previous email [1], then nobody will be able to see the zero page, not
> until someone updates the content then follow up with a CONTINUE to install
> the pgtable entry.
>
> If there is any way that the page can be accessed without the pgtable
> installation, minor faults won't work indeed.

I think I see what you mean now. I agree, it isn't the end of the world
if the kernel clears the page and then userspace overwrites it.

The way I see it is:

@@ -400,20 +401,26 @@ static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
 	if (WARN_ON_ONCE(folio_test_large(folio))) {
 		ret = VM_FAULT_SIGBUS;
 		goto out_folio;
 	}

 	if (!folio_test_uptodate(folio)) {
 		clear_highpage(folio_page(folio, 0));
 		kvm_gmem_mark_prepared(folio);
 	}

+	if (userfaultfd_minor(vmf->vma)) {
+		folio_unlock(folio);
+		filemap_invalidate_unlock_shared(inode->i_mapping);
+		return handle_userfault(vmf, VM_UFFD_MINOR);
+	}
+
 	vmf->page = folio_file_page(folio, vmf->pgoff);

 out_folio:
 	if (ret != VM_FAULT_LOCKED) {
 		folio_unlock(folio);
 		folio_put(folio);
 	}

On the first fault (cache miss), the kernel will allocate/add/clear the
page (as there is no MISSING trap now), and once the page is in the
cache, a MINOR event will be sent for userspace to copy its content.
Please let me know if these are acceptable semantics.

Since userspace is getting notified after KVM calls
kvm_gmem_mark_prepared(), which removes the page from the direct map
[1], userspace can't use write() to populate the content because write()
relies on the direct map [2]. However, userspace can do a plain memcpy
that would use user pagetables instead. This forces userspace to respond
to stage-2 and VMA faults in guest_memfd differently, via write() and
memcpy respectively. It doesn't seem like a significant problem though.

I believe that with this approach the original race condition is gone,
because UFFD messages are only sent on cache hit and it is up to
userspace to serialise writes. Please correct me if I'm wrong here.

[1] https://lore.kernel.org/kvm/20250221160728.1584559-1-roypat@amazon.co.uk/T/#mdf41fe2dc33332e9c500febd47e14ae91ad99724
[2] https://lore.kernel.org/kvm/20241129123929.64790-1-kalyazin@amazon.com/T/#mf5d794aa31d753cbc73e193628f31e418051983d

>>
>>> as long as the content can only be accessed from the pgtable (either via
>>> mmap() or GUP on top of it), then afaiu it could work similarly like
>>> MISSING faults, because anything trying to access it will be trapped.
>
> [1]
>
> --
> Peter Xu
>
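
For illustration only, the ordering described above (clear on cache miss,
then a MINOR event, then a userspace memcpy followed by CONTINUE) can be
modelled in plain C. This is a toy sketch, not kernel code and not part
of the patch: it makes no userfaultfd or KVM calls, and every name in it
(toy_page, toy_fault, toy_handle_minor, TOY_PAGE_SIZE) is made up. It
assumes one page and one faulting thread, so it says nothing about
locking or about how userspace serialises writes.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_PAGE_SIZE 16	/* hypothetical, tiny on purpose */

/* Hypothetical stand-in for one guest_memfd page; not a kernel structure. */
struct toy_page {
	bool in_cache;	/* page has been added to the page cache */
	bool mapped;	/* pgtable entry installed, i.e. CONTINUE was done */
	unsigned char data[TOY_PAGE_SIZE];
};

enum toy_event { TOY_EVENT_NONE, TOY_EVENT_MINOR };

/*
 * Models the proposed fault path: on a cache miss the "kernel" clears and
 * caches the page; while no pgtable entry is installed, every fault is
 * reported to userspace as a MINOR event.
 */
static enum toy_event toy_fault(struct toy_page *p)
{
	if (p->mapped)
		return TOY_EVENT_NONE;

	if (!p->in_cache) {
		memset(p->data, 0, sizeof(p->data));
		p->in_cache = true;
	}

	return TOY_EVENT_MINOR;
}

/*
 * Models the userspace handler: copy the real content with a plain memcpy
 * (standing in for a copy through user pagetables), then "CONTINUE" to
 * install the pgtable entry.
 */
static void toy_handle_minor(struct toy_page *p, const unsigned char *src)
{
	assert(p->in_cache);	/* MINOR is only raised once the page is cached */
	memcpy(p->data, src, TOY_PAGE_SIZE);
	p->mapped = true;
}
```

In this model the zeroed page sits in the cache between toy_fault() and
toy_handle_minor(), but nothing is mapped until the handler has copied
the content, which is the property the pgtable-only access argument
relies on.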