linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Nikita Kalyazin <kalyazin@amazon.com>
To: Peter Xu <peterx@redhat.com>, James Houghton <jthoughton@google.com>
Cc: <akpm@linux-foundation.org>, <pbonzini@redhat.com>,
	<shuah@kernel.org>, <kvm@vger.kernel.org>,
	<linux-kselftest@vger.kernel.org>, <linux-kernel@vger.kernel.org>,
	<linux-mm@kvack.org>, <lorenzo.stoakes@oracle.com>,
	<david@redhat.com>, <ryan.roberts@arm.com>,
	<quic_eberman@quicinc.com>, <graf@amazon.de>,
	<jgowans@amazon.com>, <roypat@amazon.co.uk>, <derekmn@amazon.com>,
	<nsaenz@amazon.es>, <xmarcalx@amazon.com>
Subject: Re: [RFC PATCH 0/5] KVM: guest_memfd: support for uffd missing
Date: Mon, 10 Mar 2025 18:12:22 +0000	[thread overview]
Message-ID: <fdae95e3-962b-4eaf-9ae7-c6bd1062c518@amazon.com> (raw)
In-Reply-To: <Z8i0HXen8gzVdgnh@x1.local>



On 05/03/2025 20:29, Peter Xu wrote:
> On Wed, Mar 05, 2025 at 11:35:27AM -0800, James Houghton wrote:
>> I think it might be useful to implement an fs-generic MINOR mode. The
>> fault handler is already easy enough to do generically (though it
>> would become more difficult to determine if the "MINOR" fault is
>> actually a MISSING fault, but at least for my userspace, the
>> distinction isn't important. :)) So the question becomes: what should
>> UFFDIO_CONTINUE look like?
>>
>> And I think it would be nice if UFFDIO_CONTINUE just called
>> vm_ops->fault() to get the page we want to map and then mapped it,
>> instead of having shmem-specific and hugetlb-specific versions (though
>> maybe we need to keep the hugetlb specialization...). That would avoid
>> putting kvm/gmem/etc. symbols in mm/userfaultfd code.
>>
>> I've actually wanted to do this for a while but haven't had a good
>> reason to pursue it. I wonder if it can be done in a
>> backwards-compatible fashion...
> 
> Yes I also thought about that. :)

Hi Peter, hi James.  Thanks for pointing at the race condition!

I did some experimentation and it indeed looks possible to call 
vm_ops->fault() from userfault_continue() to make it generic and 
decouple from KVM, at least for non-hugetlb cases.  One thing is we'd 
need to prevent a recursive handle_userfault() invocation, which I 
believe can be solved by adding a new VMF flag to ignore the userfault 
path when the fault handler is called from userfault_continue().  I'm 
open to a more elegant solution though.

Regarding usage of the MINOR notification, in what case do you recommend 
sending it?  If following the logic implemented in shmem and hugetlb, ie 
if the page is _present_ in the pagecache, I can't see how it is going 
to work with the write syscall, as we'd like to know when the page is 
_missing_ in order to respond with the population via the write.  If 
going against shmem/hugetlb logic, and sending the MINOR event when the 
page is missing from the pagecache, how would it solve the race 
condition problem?

Also, where would the check for the folio_test_uptodate() mentioned by 
James fit into here?  Would it only be used for fortifying the MINOR 
(present) against the race?

> When Axel added minor fault, it's not a major concern as it's the only fs
> that will consume the feature anyway in the do_fault() path - hugetlbfs has
> its own path to take care of.. even until now.
> 
> And there's some valid points too if someone would argue to put it there
> especially on folio lock - do that in shmem.c can avoid taking folio lock
> when generating minor fault message.  It might make some difference when
> the faults are heavy and when folio lock is frequently taken elsewhere too.

Peter, could you expand on this?  Are you referring to the following 
(shmem_get_folio_gfp)?

	if (folio) {
		folio_lock(folio);

		/* Has the folio been truncated or swapped out? */
		if (unlikely(folio->mapping != inode->i_mapping)) {
			folio_unlock(folio);
			folio_put(folio);
			goto repeat;
		}
		if (sgp == SGP_WRITE)
			folio_mark_accessed(folio);
		if (folio_test_uptodate(folio))
			goto out;
		/* fallocated folio */
		if (sgp != SGP_READ)
			goto clear;
		folio_unlock(folio);
		folio_put(folio);
	}

Could you explain in what case the lock can be avoided?  AFAIC, the 
function is called by both the shmem fault handler and userfault_continue().

> It might boil down to how many more FSes would support minor fault, and
> whether we would care about such difference at last to shmem users. If gmem
> is the only one after existing ones, IIUC there's still option we implement
> it in gmem code.  After all, I expect the change should be very under
> control (<20 LOCs?)..
> 
> --
> Peter Xu
> 



  reply	other threads:[~2025-03-10 18:12 UTC|newest]

Thread overview: 22+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-03-03 13:30 Nikita Kalyazin
2025-03-03 13:30 ` [RFC PATCH 1/5] KVM: guest_memfd: add kvm_gmem_vma_is_gmem Nikita Kalyazin
2025-03-03 13:30 ` [RFC PATCH 2/5] KVM: guest_memfd: add support for uffd missing Nikita Kalyazin
2025-03-03 13:30 ` [RFC PATCH 3/5] mm: userfaultfd: allow to register userfaultfd for guest_memfd Nikita Kalyazin
2025-03-03 13:30 ` [RFC PATCH 4/5] mm: userfaultfd: support continue " Nikita Kalyazin
2025-03-03 13:30 ` [RFC PATCH 5/5] KVM: selftests: add uffd missing test " Nikita Kalyazin
2025-03-03 21:29 ` [RFC PATCH 0/5] KVM: guest_memfd: support for uffd missing Peter Xu
2025-03-05 19:35   ` James Houghton
2025-03-05 20:29     ` Peter Xu
2025-03-10 18:12       ` Nikita Kalyazin [this message]
2025-03-10 19:57         ` Peter Xu
2025-03-11 16:56           ` Nikita Kalyazin
2025-03-12 15:45             ` Peter Xu
2025-03-12 17:07               ` Nikita Kalyazin
2025-03-12 19:32                 ` Peter Xu
2025-03-13 15:25                   ` Nikita Kalyazin
2025-03-13 19:12                     ` Peter Xu
2025-03-13 22:13                       ` Nikita Kalyazin
2025-03-13 22:38                         ` Peter Xu
2025-03-14 17:12                           ` Nikita Kalyazin
2025-03-14 18:32                             ` Peter Xu
2025-03-14 20:04                             ` Peter Xu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=fdae95e3-962b-4eaf-9ae7-c6bd1062c518@amazon.com \
    --to=kalyazin@amazon.com \
    --cc=akpm@linux-foundation.org \
    --cc=david@redhat.com \
    --cc=derekmn@amazon.com \
    --cc=graf@amazon.de \
    --cc=jgowans@amazon.com \
    --cc=jthoughton@google.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-kselftest@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=lorenzo.stoakes@oracle.com \
    --cc=nsaenz@amazon.es \
    --cc=pbonzini@redhat.com \
    --cc=peterx@redhat.com \
    --cc=quic_eberman@quicinc.com \
    --cc=roypat@amazon.co.uk \
    --cc=ryan.roberts@arm.com \
    --cc=shuah@kernel.org \
    --cc=xmarcalx@amazon.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox