Date: Mon, 10 Mar 2025 18:12:22 +0000
From: Nikita Kalyazin
To: Peter Xu, James Houghton
Subject: Re: [RFC PATCH 0/5] KVM: guest_memfd: support for uffd missing
References: <20250303133011.44095-1-kalyazin@amazon.com>

On 05/03/2025 20:29, Peter Xu wrote:
> On Wed, Mar 05, 2025 at 11:35:27AM -0800, James Houghton wrote:
>> I think it might be useful to implement an fs-generic MINOR mode. The
>> fault handler is already easy enough to do generically (though it
>> would become more difficult to determine if the "MINOR" fault is
>> actually a MISSING fault, but at least for my userspace, the
>> distinction isn't important. :)) So the question becomes: what should
>> UFFDIO_CONTINUE look like?
>>
>> And I think it would be nice if UFFDIO_CONTINUE just called
>> vm_ops->fault() to get the page we want to map and then mapped it,
>> instead of having shmem-specific and hugetlb-specific versions (though
>> maybe we need to keep the hugetlb specialization...). That would avoid
>> putting kvm/gmem/etc. symbols in mm/userfaultfd code.
>>
>> I've actually wanted to do this for a while but haven't had a good
>> reason to pursue it. I wonder if it can be done in a
>> backwards-compatible fashion...
>
> Yes I also thought about that. :)

Hi Peter, hi James. Thanks for pointing out the race condition!

I did some experimentation, and it indeed looks possible to call
vm_ops->fault() from userfault_continue() to make it generic and
decouple it from KVM, at least for non-hugetlb cases. One thing we'd
need is to prevent a recursive handle_userfault() invocation, which I
believe can be solved by adding a new VMF flag that tells the fault
handler to skip the userfault path when it is called from
userfault_continue(). I'm open to a more elegant solution though.

Regarding usage of the MINOR notification, in what case do you
recommend sending it? If we follow the logic implemented in shmem and
hugetlb, i.e. send it when the page is _present_ in the pagecache, I
can't see how it is going to work with the write syscall, as we'd like
to know when the page is _missing_ in order to respond by populating it
via the write.
If we go against the shmem/hugetlb logic and send the MINOR event when
the page is missing from the pagecache, how would that solve the race
condition problem?

Also, where would the folio_test_uptodate() check mentioned by James
fit in here? Would it only be used for fortifying the MINOR (present)
case against the race?

> When Axel added minor fault, it's not a major concern as it's the only fs
> that will consume the feature anyway in the do_fault() path - hugetlbfs has
> its own path to take care of.. even until now.
>
> And there's some valid points too if someone would argue to put it there
> especially on folio lock - do that in shmem.c can avoid taking folio lock
> when generating minor fault message. It might make some difference when
> the faults are heavy and when folio lock is frequently taken elsewhere too.

Peter, could you expand on this? Are you referring to the following
(shmem_get_folio_gfp)?

	if (folio) {
		folio_lock(folio);

		/* Has the folio been truncated or swapped out? */
		if (unlikely(folio->mapping != inode->i_mapping)) {
			folio_unlock(folio);
			folio_put(folio);
			goto repeat;
		}
		if (sgp == SGP_WRITE)
			folio_mark_accessed(folio);
		if (folio_test_uptodate(folio))
			goto out;
		/* fallocated folio */
		if (sgp != SGP_READ)
			goto clear;
		folio_unlock(folio);
		folio_put(folio);
	}

Could you explain in what case the lock can be avoided? AFAICS, the
function is called by both the shmem fault handler and
userfault_continue().

> It might boil down to how many more FSes would support minor fault, and
> whether we would care about such difference at last to shmem users. If gmem
> is the only one after existing ones, IIUC there's still option we implement
> it in gmem code. After all, I expect the change should be very under
> control (<20 LOCs?)..
>
> --
> Peter Xu
>
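
For illustration only, the recursion guard discussed above (a flag that
makes the fault handler skip the userfault path when it is invoked from
the continue path) can be modelled in plain userspace C. This is a toy
sketch under stated assumptions, not kernel code and not the proposed
patch: TOY_FAULT_FLAG_NO_USERFAULT, struct toy_file, toy_fault() and
toy_continue() are all hypothetical names, and the model does not
distinguish MISSING from MINOR events.

```c
/* Toy userspace model; none of these names exist in the kernel. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NPAGES 4

/* Hypothetical flag: "called from the continue path, skip userfault". */
#define TOY_FAULT_FLAG_NO_USERFAULT 0x1u

enum toy_result { TOY_MAPPED, TOY_SENT_TO_USERSPACE, TOY_NO_PAGE };

struct toy_file {
	bool in_pagecache[TOY_NPAGES];	/* stands in for the page cache */
	bool mapped[TOY_NPAGES];	/* stands in for the page tables */
	bool uffd_registered;		/* range registered with userfaultfd */
	unsigned int events;		/* notifications sent to userspace */
};

/* Stand-in for a generic vm_ops->fault() implementation. */
static enum toy_result toy_fault(struct toy_file *f, size_t idx,
				 unsigned int flags)
{
	assert(idx < TOY_NPAGES);

	if (f->uffd_registered && !(flags & TOY_FAULT_FLAG_NO_USERFAULT)) {
		/* Ordinary fault: notify userspace instead of mapping. */
		f->events++;
		return TOY_SENT_TO_USERSPACE;
	}
	if (!f->in_pagecache[idx])
		return TOY_NO_PAGE;	/* nothing to map yet */
	f->mapped[idx] = true;
	return TOY_MAPPED;
}

/* Stand-in for the continue path: reuse the fault handler, guarded. */
static enum toy_result toy_continue(struct toy_file *f, size_t idx)
{
	return toy_fault(f, idx, TOY_FAULT_FLAG_NO_USERFAULT);
}
```

In this model, a first toy_fault() on a registered range only raises an
event; once "userspace" has populated the page (in_pagecache set),
toy_continue() maps it without raising a second event. Without the flag,
toy_continue() would re-enter the notification branch, which is the
recursion the flag is meant to prevent.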