Date: Thu, 13 Mar 2025 15:12:11 -0400
From: Peter Xu
To: Nikita Kalyazin
Cc: James Houghton, akpm@linux-foundation.org, pbonzini@redhat.com,
    shuah@kernel.org, kvm@vger.kernel.org, linux-kselftest@vger.kernel.org,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    lorenzo.stoakes@oracle.com, david@redhat.com, ryan.roberts@arm.com,
    quic_eberman@quicinc.com, graf@amazon.de, jgowans@amazon.com,
    roypat@amazon.co.uk, derekmn@amazon.com, nsaenz@amazon.es,
    xmarcalx@amazon.com
Subject: Re: [RFC PATCH 0/5] KVM: guest_memfd: support for uffd missing
References: <9e7536cc-211d-40ca-b458-66d3d8b94b4d@amazon.com>
    <7c304c72-1f9c-4a5a-910b-02d0f1514b01@amazon.com>
    <69dc324f-99fb-44ec-8501-086fe7af9d0d@amazon.com>
In-Reply-To: <69dc324f-99fb-44ec-8501-086fe7af9d0d@amazon.com>

On Thu, Mar 13, 2025 at 03:25:16PM +0000, Nikita Kalyazin wrote:
>
>
> On 12/03/2025 19:32, Peter Xu wrote:
> > On Wed, Mar 12, 2025 at 05:07:25PM +0000, Nikita Kalyazin wrote:
> > > However if MISSING is not registered, the kernel will auto-populate with a
> > > clear page, ie there is no way to inject custom content from userspace. To
> > > explain my use case a bit more, the population thread will be trying to copy
> > > all guest memory proactively, but there will inevitably be cases where a
> > > page is accessed through pgtables _before_ it gets populated. It is not
> > > desirable for such access to result in a clear page provided by the kernel.
> >
> > IMHO populating with a zero page in the page cache is fine. It needs to
> > make sure all accesses will go via the pgtable, as discussed below in my
> > previous email [1], then nobody will be able to see the zero page, not
> > until someone updates the content then follow up with a CONTINUE to install
> > the pgtable entry.
> >
> > If there is any way that the page can be accessed without the pgtable
> > installation, minor faults won't work indeed.
>
> I think I see what you mean now. I agree, it isn't the end of the world if
> the kernel clears the page and then userspace overwrites it.
>
> The way I see it is:
>
> @@ -400,20 +401,26 @@ static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
>          if (WARN_ON_ONCE(folio_test_large(folio))) {
>                  ret = VM_FAULT_SIGBUS;
>                  goto out_folio;
>          }
>
>          if (!folio_test_uptodate(folio)) {
>                  clear_highpage(folio_page(folio, 0));
>                  kvm_gmem_mark_prepared(folio);
>          }
>
> +        if (userfaultfd_minor(vmf->vma)) {
> +                folio_unlock(folio);
> +                filemap_invalidate_unlock_shared(inode->i_mapping);
> +                return handle_userfault(vmf, VM_UFFD_MISSING);
> +        }

I suppose you meant s/MISSING/MINOR/.

> +
>          vmf->page = folio_file_page(folio, vmf->pgoff);
>
>  out_folio:
>          if (ret != VM_FAULT_LOCKED) {
>                  folio_unlock(folio);
>                  folio_put(folio);
>          }
>
> On the first fault (cache miss), the kernel will allocate/add/clear the page
> (as there is no MISSING trap now), and once the page is in the cache, a
> MINOR event will be sent for userspace to copy its content. Please let me
> know if this is an acceptable semantics.
>
> Since userspace is getting notified after KVM calls
> kvm_gmem_mark_prepared(), which removes the page from the direct map [1],
> userspace can't use write() to populate the content because write() relies
> on direct map [2]. However userspace can do a plain memcpy that would use
> user pagetables instead. This forces userspace to respond to stage-2 and
> VMA faults in guest_memfd differently, via write() and memcpy respectively.
> It doesn't seem like a significant problem though.

It looks ok in general, but could you remind me why you need to stick with
write() syscall?

IOW, if gmemfd will always need mmap() and it's fully accessible from
userspace in your use case, wouldn't mmap()+memcpy() always work already,
and always better than write()?

Thanks,

>
> I believe, with this approach the original race condition is gone because
> UFFD messages are only sent on cache hit and it is up to userspace to
> serialise writes. Please correct me if I'm wrong here.
>
> [1] https://lore.kernel.org/kvm/20250221160728.1584559-1-roypat@amazon.co.uk/T/#mdf41fe2dc33332e9c500febd47e14ae91ad99724
> [2] https://lore.kernel.org/kvm/20241129123929.64790-1-kalyazin@amazon.com/T/#mf5d794aa31d753cbc73e193628f31e418051983d
>
> > >
> > > > as long as the content can only be accessed from the pgtable (either via
> > > > mmap() or GUP on top of it), then afaiu it could work similarly like
> > > > MISSING faults, because anything trying to access it will be trapped.
> >
> > [1]
> >
> > --
> > Peter Xu
> >
>
>

-- 
Peter Xu
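
For illustration only, here is a minimal userspace sketch of the mmap()+memcpy()
population path asked about above. It is not code from the series. It assumes a
plain memfd (memfd_create()) as a stand-in for a guest_memfd, because a real
guest_memfd comes from KVM and its mmap() support is what this RFC thread is
still discussing. The function name demo_populate_page() is hypothetical. The
userfaultfd step is only marked with a comment: in a real minor-fault handler,
the copy would presumably be followed by a UFFDIO_CONTINUE ioctl to install the
pgtable entry.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Populate one page of a file-backed mapping with memcpy() through user
 * page tables, then read it back through the fd.
 *
 * ASSUMPTION: a plain memfd stands in for guest_memfd here.
 * Returns 0 on success, -1 on any failure.
 */
static int demo_populate_page(void)
{
	static const char payload[] = "guest page data";
	char readback[sizeof(payload)] = { 0 };
	const long page = sysconf(_SC_PAGESIZE);
	char *map;
	int ret = -1;
	int fd;

	fd = memfd_create("gmem-standin", MFD_CLOEXEC);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, page) < 0)
		goto out_close;

	map = mmap(NULL, page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		goto out_close;

	/* First touch: the kernel allocates a zero-filled page cache page. */
	if (map[0] != 0)
		goto out_unmap;

	/* Userspace overwrites the cleared page via its own page tables. */
	memcpy(map, payload, sizeof(payload));

	/*
	 * In a userfaultfd minor-mode handler, a UFFDIO_CONTINUE ioctl
	 * would be issued at this point (not done in this sketch).
	 */

	/* The new content is visible through the page cache. */
	if (pread(fd, readback, sizeof(readback), 0) != (ssize_t)sizeof(readback))
		goto out_unmap;
	if (memcmp(readback, payload, sizeof(payload)) == 0)
		ret = 0;

out_unmap:
	munmap(map, page);
out_close:
	close(fd);
	return ret;
}
```

A caller could check it with assert(demo_populate_page() == 0). On a plain memfd
the pread() read-back works; per the discussion above, the write()/read() style
path is the part that would not work on guest_memfd pages once they are removed
from the direct map.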