From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 225C3C282EC for ; Tue, 11 Mar 2025 16:57:09 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 71A50280002; Tue, 11 Mar 2025 12:57:06 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 6C92F280001; Tue, 11 Mar 2025 12:57:06 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 5697A280002; Tue, 11 Mar 2025 12:57:06 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0012.hostedemail.com [216.40.44.12]) by kanga.kvack.org (Postfix) with ESMTP id 3871C280001 for ; Tue, 11 Mar 2025 12:57:06 -0400 (EDT) Received: from smtpin07.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay05.hostedemail.com (Postfix) with ESMTP id 97C245064E for ; Tue, 11 Mar 2025 16:57:04 +0000 (UTC) X-FDA: 83209875168.07.44801DE Received: from smtp-fw-80007.amazon.com (smtp-fw-80007.amazon.com [99.78.197.218]) by imf29.hostedemail.com (Postfix) with ESMTP id B95EA120016 for ; Tue, 11 Mar 2025 16:56:57 +0000 (UTC) Authentication-Results: imf29.hostedemail.com; dkim=pass header.d=amazon.com header.s=amazon201209 header.b=hUXVRI64; spf=pass (imf29.hostedemail.com: domain of "prvs=158b0d9c1=kalyazin@amazon.co.uk" designates 99.78.197.218 as permitted sender) smtp.mailfrom="prvs=158b0d9c1=kalyazin@amazon.co.uk"; dmarc=pass (policy=quarantine) header.from=amazon.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1741712218; h=from:from:sender:reply-to:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=tlV9Mo9aek/qm0du1iKz9kkFBfH6SyE9Qgk2PjqxtjE=; b=QxWJ7HjXkZABr+yCn3/cyGZ/ej5AhJxje2Fyln8xPVreW9fN5Lrvai/1O6LErOXwsI1qTl 4SbHZ3+nwAjCKJwAe5ajAxRjs7ea6bYyMEdEDE6qeietjuAY4Ux33M2cPQ1LdgUBDaQq36 dh1TwkaBUWkBdChuUi71b4ofVsrGFRY= ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1741712218; a=rsa-sha256; cv=none; b=RwBchIKNZnnugkOUdBJ2e+K/xZJ0t9pBSAQawDQ5AP71c5OAoJ9oc/ehfnom/RDz2/HjSw UbIpLr0lQ0EynBwe1LPxMaeM0SvBaXkb7weZZRXRXwHnV6+iYsd0c7Wn86gqkdJARLoYrR H08GpGi0F6w9QJ0+AwE0Rr2VaQRt4cg= ARC-Authentication-Results: i=1; imf29.hostedemail.com; dkim=pass header.d=amazon.com header.s=amazon201209 header.b=hUXVRI64; spf=pass (imf29.hostedemail.com: domain of "prvs=158b0d9c1=kalyazin@amazon.co.uk" designates 99.78.197.218 as permitted sender) smtp.mailfrom="prvs=158b0d9c1=kalyazin@amazon.co.uk"; dmarc=pass (policy=quarantine) header.from=amazon.com DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amazon.com; i=@amazon.com; q=dns/txt; s=amazon201209; t=1741712218; x=1773248218; h=message-id:date:mime-version:reply-to:subject:to:cc: references:from:in-reply-to:content-transfer-encoding; bh=tlV9Mo9aek/qm0du1iKz9kkFBfH6SyE9Qgk2PjqxtjE=; b=hUXVRI64Y79i+S7FJ+zrSyB/es8vILdCYurnMvJct0VeP2aqlTlH95lV 5AVJy+3uF2bwzOB5dqjATIDImMdWwCIxykC3TNTmYuOzh851ri9UjTU1N 7M4UmarOvNTIV0PjCI0V+x2gTntaoWDSacasQTog8URucRVmFx6AYmafc 0=; X-IronPort-AV: E=Sophos;i="6.14,239,1736812800"; d="scan'208";a="385729533" Received: from pdx4-co-svc-p1-lb2-vlan2.amazon.com (HELO smtpout.prod.us-east-1.prod.farcaster.email.amazon.dev) ([10.25.36.210]) by smtp-border-fw-80007.pdx80.corp.amazon.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 11 Mar 2025 16:56:50 +0000 Received: from EX19MTAEUA002.ant.amazon.com [10.0.17.79:6023] by smtpin.naws.eu-west-1.prod.farcaster.email.amazon.dev [10.0.26.251:2525] with esmtp (Farcaster) id f548cd40-c792-49af-ad1d-1f5d23f80cf5; Tue, 11 Mar 2025 16:56:49 +0000 (UTC) X-Farcaster-Flow-ID: f548cd40-c792-49af-ad1d-1f5d23f80cf5 Received: from EX19D022EUC002.ant.amazon.com (10.252.51.137) by EX19MTAEUA002.ant.amazon.com (10.252.50.126) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA) id 15.2.1544.14; Tue, 11 Mar 2025 16:56:49 +0000 Received: from [10.95.111.253] (10.95.111.253) by EX19D022EUC002.ant.amazon.com (10.252.51.137) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA) id 15.2.1544.14; Tue, 11 Mar 2025 16:56:48 +0000 Message-ID: <9e7536cc-211d-40ca-b458-66d3d8b94b4d@amazon.com> Date: Tue, 11 Mar 2025 16:56:47 +0000 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Reply-To: Subject: Re: [RFC PATCH 0/5] KVM: guest_memfd: support for uffd missing To: Peter Xu CC: James Houghton , , , , , , , , , , , , , , , , , References: <20250303133011.44095-1-kalyazin@amazon.com> Content-Language: en-US From: Nikita Kalyazin Autocrypt: addr=kalyazin@amazon.com; keydata= xjMEY+ZIvRYJKwYBBAHaRw8BAQdA9FwYskD/5BFmiiTgktstviS9svHeszG2JfIkUqjxf+/N JU5pa2l0YSBLYWx5YXppbiA8a2FseWF6aW5AYW1hem9uLmNvbT7CjwQTFggANxYhBGhhGDEy BjLQwD9FsK+SyiCpmmTzBQJnrNfABQkFps9DAhsDBAsJCAcFFQgJCgsFFgIDAQAACgkQr5LK IKmaZPOpfgD/exazh4C2Z8fNEz54YLJ6tuFEgQrVQPX6nQ/PfQi2+dwBAMGTpZcj9Z9NvSe1 CmmKYnYjhzGxzjBs8itSUvWIcMsFzjgEY+ZIvRIKKwYBBAGXVQEFAQEHQCqd7/nb2tb36vZt ubg1iBLCSDctMlKHsQTp7wCnEc4RAwEIB8J+BBgWCAAmFiEEaGEYMTIGMtDAP0Wwr5LKIKma ZPMFAmes18AFCQWmz0MCGwwACgkQr5LKIKmaZPNTlQEA+q+rGFn7273rOAg+rxPty0M8lJbT i2kGo8RmPPLu650A/1kWgz1AnenQUYzTAFnZrKSsXAw5WoHaDLBz9kiO5pAK In-Reply-To: Content-Type: text/plain; charset="UTF-8"; format=flowed Content-Transfer-Encoding: 7bit X-Originating-IP: [10.95.111.253] X-ClientProxiedBy: EX19D010EUA004.ant.amazon.com (10.252.50.94) To EX19D022EUC002.ant.amazon.com (10.252.51.137) X-Rspam-User: X-Rspamd-Queue-Id: B95EA120016 X-Stat-Signature: ydkk9p3io77jxd6xm9bbqjms7ti7pd4a X-Rspamd-Server: rspam10 X-HE-Tag: 1741712217-514019 X-HE-Meta: U2FsdGVkX18iPrzXi69J1YNK2sh7lRSe9nHXNu/pisnVAZUuagm+GuFPUABckTsvOLAJ/3SlDTTDlmn5sYmQhyuGCHzZkcQNc26kor84huZQw6UoWdbkrKDgh6W8G7OdgLnO2+er/Y3A1TRo2LhxVe2WIvZxUA1pKoQJ4477bywP4lJx8JpcB3eWIhF5D+1WsS58nM1rtTio7O0D9xFIEg3yKcjoT6DL573kqBdLZhWF8SPmL3I/KlTRX4Ht1Qa8OT8N9uBmR+i3jCfqLso+SjsAidBcVtyKw/LhiwQe6qdUDk1p40VrgovdXEk/9fQ0u8Pr5gNoeYhx3EhnYdu7oq2rZjPySL7M07m1na3kV2lmx6itZZ9zQhP21pZ7mmP2lSnbZJuYQ7U7rK506XVW2RdCL5ZNOabeSEwUpgg1dDj6dn8fwGnyhzSWDB32wOa38NcacFFdjw4MQvxLUqEKRzQTfod5aZFvxu1+QO2ELGvwPYj29GZtpuzgCGU4L04jCahh/cwGLBwHKruTeS498/VTgHgwG4liKshc/iRZQ5DPo3ghLA3eCKMVKfICe1tSLaZ/KxH2QMdRq9LaLhnO5xme95ksCP4xv/oVeVekbRLe2iyn/I3Xy/aWnVklm0gnvs2HnjWWD1oSP9nB3xeOgj7GF1GMGI3+NLsjcT312+l6FtKj5MItg7UL6gQ84TLePx9ELA3GCCFlPcfva3IGcm/qFQwq/mslqjba44PjsJvbbKW3jpkST6kzG0EKcUYwUL2dSg30T+rJLWfothOfpS/2RSsDUCQwvnXt7VqrdOA+XJ9KPwNktt0JTs3MK/qQaY2QDFamiuWWEB3SK54RPTTMhtXRWw/elknCcQqbqAfGQhnP19XRPxPKfxVfluEbUxkFOx7yPzfWRxOlvb0opPtUvQLfEe9I82O8GINP5pQc3tCBYdT2bpeyjjrb79HPx3UDmBXlfVkmC/tLDqo PgkyYL3H GHWZ05VRr9S2grAC8AXQMEz0EcoPWH62YpRs3Um3XDguNtExH9sVLjGEFhkfMHEMO+Jyii7woQ1jfDCKs62YtxfeExHzVxKEwS/lhgHvv508tFBPWoi247IrlOWmwHpuSnj89UP/X9UgCe9m8yP/gmLtMRNUVVqA+mqe57MxGfRpZIGrhPsu0YYlLjyQQmOxv5yD6fTmXGA8Em27XBuRiarviJ7JBlop2oKeyCy1AhXs8O4qk03eUqYOX0/Wb61yNjUwGPjiyQEYFehY/cwxiqh0/0u41ojQ2Uxuz2Qh2+JDAGZmHTk/e8saLwuLOJUeNZH7VPAPWB0xhebNY9VstxQ+5j9x54RGCFRgLLxDtljqogVcOYSsdmgZADlWaksPY/rp+w2GOSRMW8LhjfWC9sPIBilrN8JrwHU4Af30oSk21Lh1dhOKCi8SOax70bymAB5U6MAfxDQHWqYXu5cCpaLdAAtLKEoFKwA1v7rjqacKN7rGlpyVI4mKumGcNJRu1RLjzqYEuoC6BKZYdfb6g7xDAFPy3OZ/MGSKgPjd+9+CnlRQepyBw/7n+Ogg3xQRrq6O1ey8UXlpxd5TpBLjs/+Vxc1XQHA8qMdZ1g6ehQtmnaOg0jipw/1tADiTK5vIlgYIaiqueNnzILJtAWtBFmeWmIaaroViGt9tWjXz5BOyPyx3dD3kG9AWJ0J77BsDtJhaCahfTC+c1uqBOXZTL/iTZZk3+FAjd3a2lJYS0gVWU9dxr4YNMyirxnw0c6MiL98k713PP6ZnYm8wInfauO8Jguqo5rE82X2WiztgrcH7qcLhpaP2cFIT1cOwk0dQGP0p5wPhFKgEFQeNr0T4+UHDr00un6kR9Ly49h8gktOHEh34= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On 10/03/2025 19:57, Peter Xu wrote: > On Mon, Mar 10, 2025 at 06:12:22PM +0000, Nikita Kalyazin wrote: >> >> >> On 05/03/2025 20:29, Peter Xu wrote: >>> On Wed, Mar 05, 2025 at 11:35:27AM -0800, James Houghton wrote: >>>> I think it might be useful to implement an fs-generic MINOR mode. The >>>> fault handler is already easy enough to do generically (though it >>>> would become more difficult to determine if the "MINOR" fault is >>>> actually a MISSING fault, but at least for my userspace, the >>>> distinction isn't important. :)) So the question becomes: what should >>>> UFFDIO_CONTINUE look like? >>>> >>>> And I think it would be nice if UFFDIO_CONTINUE just called >>>> vm_ops->fault() to get the page we want to map and then mapped it, >>>> instead of having shmem-specific and hugetlb-specific versions (though >>>> maybe we need to keep the hugetlb specialization...). That would avoid >>>> putting kvm/gmem/etc. symbols in mm/userfaultfd code. >>>> >>>> I've actually wanted to do this for a while but haven't had a good >>>> reason to pursue it. I wonder if it can be done in a >>>> backwards-compatible fashion... >>> >>> Yes I also thought about that. :) >> >> Hi Peter, hi James. Thanks for pointing at the race condition! >> >> I did some experimentation and it indeed looks possible to call >> vm_ops->fault() from userfault_continue() to make it generic and decouple >> from KVM, at least for non-hugetlb cases. One thing is we'd need to prevent >> a recursive handle_userfault() invocation, which I believe can be solved by >> adding a new VMF flag to ignore the userfault path when the fault handler is >> called from userfault_continue(). I'm open to a more elegant solution >> though. > > It sounds working to me. Adding fault flag can also be seen as part of > extension of vm_operations_struct ops. So we could consider reusing > fault() API indeed. Great! >> >> Regarding usage of the MINOR notification, in what case do you recommend >> sending it? If following the logic implemented in shmem and hugetlb, ie if >> the page is _present_ in the pagecache, I can't see how it is going to work > > It could be confusing when reading that chunk of code, because it looks > like it notifies minor fault when cache hit. But the critical part here is > that we rely on the pgtable missing causing the fault() to trigger first. > So it's more like "cache hit && pgtable missing" for minor fault. Right, but the cache hit still looks like a precondition for the minor fault event? >> with the write syscall, as we'd like to know when the page is _missing_ in >> order to respond with the population via the write. If going against >> shmem/hugetlb logic, and sending the MINOR event when the page is missing >> from the pagecache, how would it solve the race condition problem? > > Should be easier we stick with mmap() rather than write(). E.g. for shmem > case of current code base: > > if (folio && vma && userfaultfd_minor(vma)) { > if (!xa_is_value(folio)) > folio_put(folio); > *fault_type = handle_userfault(vmf, VM_UFFD_MINOR); > return 0; > } > > vma is only availble if vmf!=NULL, aka in fault context. With that, in > write() to shmem inodes, nothing will generate a message, because minor > fault so far is only about pgtable missing. It needs to be mmap()ed first, > and has nothing yet to do with write() syscalls. Yes, that's true that write() itself isn't going to generate a message. My idea was to _respond_ to a message generated by the fault handler (vmf != NULL) with a write(). I didn't mean to generate it from write(). What I wanted to achieve was send a message on fault + cache miss and respond to the message with a write() to fill the cache followed by a UFFDIO_CONTINUE to set up pagetables. I understand that a MINOR trap (MINOR + UFFDIO_CONTINUE) is preferable, but how does it fit into this model? What/how will guarantee a cache hit that would trigger the MINOR message? To clarify, I would like to be able to populate pages _on-demand_, not only proactively (like in the original UFFDIO_CONTINUE cover letter [1]). Do you think the MINOR trap could still be applicable or would it necessarily require the MISSING trap? [1] https://lore.kernel.org/linux-fsdevel/20210301222728.176417-1-axelrasmussen@google.com/T/ >> >> Also, where would the check for the folio_test_uptodate() mentioned by James >> fit into here? Would it only be used for fortifying the MINOR (present) >> against the race? >> >>> When Axel added minor fault, it's not a major concern as it's the only fs >>> that will consume the feature anyway in the do_fault() path - hugetlbfs has >>> its own path to take care of.. even until now. >>> >>> And there's some valid points too if someone would argue to put it there >>> especially on folio lock - do that in shmem.c can avoid taking folio lock >>> when generating minor fault message. It might make some difference when >>> the faults are heavy and when folio lock is frequently taken elsewhere too. >> >> Peter, could you expand on this? Are you referring to the following >> (shmem_get_folio_gfp)? >> >> if (folio) { >> folio_lock(folio); >> >> /* Has the folio been truncated or swapped out? */ >> if (unlikely(folio->mapping != inode->i_mapping)) { >> folio_unlock(folio); >> folio_put(folio); >> goto repeat; >> } >> if (sgp == SGP_WRITE) >> folio_mark_accessed(folio); >> if (folio_test_uptodate(folio)) >> goto out; >> /* fallocated folio */ >> if (sgp != SGP_READ) >> goto clear; >> folio_unlock(folio); >> folio_put(folio); >> } >> >> Could you explain in what case the lock can be avoided? AFAIC, the function >> is called by both the shmem fault handler and userfault_continue(). > > I think you meant the UFFDIO_CONTINUE side of things. I agree with you, we > always need the folio lock. > > What I was saying is the trapping side, where the minor fault message can > be generated without the folio lock now in case of shmem. It's about > whether we could generalize the trapping side, so handle_mm_fault() can > generate the minor fault message instead of by shmem.c. > > If the only concern is "referring to a module symbol from core mm", then > indeed the trapping side should be less of a concern anyway, because the > trapping side (when in the module codes) should always be able to reference > mm functions. > > Actually.. if we have a fault() flag introduced above, maybe we can > generalize the trap side altogether without the folio lock overhead. When > the flag set, if we can always return the folio unlocked (as long as > refcount held), then in UFFDIO_CONTINUE ioctl we can lock it. Where does this locking happen exactly during trapping? I was thinking it was only done when the page was allocated. The trapping part (quoted by you above) only looks up the page in the cache and calls handle_userfault(). Am I missing something? >> >>> It might boil down to how many more FSes would support minor fault, and >>> whether we would care about such difference at last to shmem users. If gmem >>> is the only one after existing ones, IIUC there's still option we implement >>> it in gmem code. After all, I expect the change should be very under >>> control (<20 LOCs?).. >>> >>> -- >>> Peter Xu >>> >> > > -- > Peter Xu >