From: Muhammad Usama Anjum
To: Andrew Morton
Cc: Muhammad Usama Anjum, Alexander Viro, Shuah Khan, Christian Brauner,
 Yang Shi, Vlastimil Babka, "Liam R . Howlett", Yun Zhou,
 Suren Baghdasaryan, Alex Sierra, Matthew Wilcox, Pasha Tatashin,
 Axel Rasmussen, "Gustavo A . R . Silva", Dan Williams,
 linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
 linux-mm@kvack.org, linux-kselftest@vger.kernel.org, Greg KH,
 kernel@collabora.com, Cyrill Gorcunov, Mike Rapoport, Nadav Amit,
 David Hildenbrand, Danylo Mocherniuk, Michał Mirosław, Andrei Vagin,
 Paul Gofman, Peter Xu
Subject: Re: [PATCH v33 0/6] Implement IOCTL to get and optionally clear info about PTEs
Date: Wed, 6 Sep 2023 17:44:33 +0500
References: <20230821141518.870589-1-usama.anjum@collabora.com>
In-Reply-To: <20230821141518.870589-1-usama.anjum@collabora.com>

Soft reminder.

On 8/21/23 7:15 PM, Muhammad Usama Anjum wrote:
> *Changes in v33*:
> - Add PAGE_IS_FILE support for THPs
>
> *Changes in v31 and v32*:
> - Minor updates
>
> *Changes in v30*:
> - Rebase on top of next-20230815
> - Minor nitpicks
>
> *Changes in v29*:
> - Polish IOCTL and improve documentation
>
> *Changes in v28*:
> - Fix walk_end and add 17 test cases in selftests patch
>
> *Changes in v27*:
> - Handle review comments and minor improvements
> - Add performance improvement patch on top with test for easy review
>
> *Changes in v26*:
> - Code restructuring and API changes in PAGEMAP_IOCTL
>
> *Changes in v25*:
> - Do proper filtering on hole as well (hole got missed earlier)
>
> *Changes in v24*:
> - Rebase on top of next-20230710
> - Place WP markers in case of hole as well
>
> *Changes in v23*:
> - Set vec_buf_index in loop only when vec_buf_index is set
> - Return -EFAULT instead of -EINVAL if vec is NULL
> - Correctly return the walk ending address to the page granularity
>
> *Changes in v22*:
> - Interface change:
>   - Replace [start, start + len) with [start, end)
>   - Return the ending address of the address walk in start
>
> *Changes in v21*:
> - Abort walk instead of returning error if WP is to be performed on
>   partial hugetlb
>
> *Changes in v20*:
> - Correct PAGE_IS_FILE and add PAGE_IS_PFNZERO
>
> *Changes in v19*:
> - Minor changes and interface updates
>
> *Changes in v18*:
> - Rebase on top of next-20230613
> - Minor updates
>
> *Changes in v17*:
> - Rebase on top of next-20230606
> - Minor improvements in PAGEMAP_SCAN IOCTL patch
>
> *Changes in v16*:
> - Fix a corner case
> - Add exclusive PM_SCAN_OP_WP back
>
> *Changes in v15*:
> - Build fix (Add missed build fix in RESEND)
>
> *Changes in v14*:
> - Fix build error caused by #ifdef added at last minute in some configs
>
> *Changes in v13*:
> - Rebase on top of next-20230414
> - Give up on using uffd_wp_range() and write new helpers, flush tlb only
>   once
>
> *Changes in v12*:
> - Update and other memory types to UFFD_FEATURE_WP_ASYNC
> - Rebase on top of next-20230406
> - Review updates
>
> *Changes in v11*:
> - Rebase on top of next-20230307
> - Base patches on UFFD_FEATURE_WP_UNPOPULATED
> - Do a lot of cosmetic changes and review updates
> - Remove ENGAGE_WP + !GET operation as it can be performed with
>   UFFDIO_WRITEPROTECT
>
> *Changes in v10*:
> - Add specific condition to return error if hugetlb is used with wp
>   async
> - Move changes in tools/include/uapi/linux/fs.h to separate patch
> - Add documentation
>
> *Changes in v9*:
> - Correct fault resolution for userfaultfd wp async
> - Fix build warnings and errors which were happening on some configs
> - Simplify pagemap ioctl's code
>
> *Changes in v8*:
> - Update uffd async wp implementation
> - Improve PAGEMAP_IOCTL implementation
>
> *Changes in v7*:
> - Add uffd wp async
> - Update the IOCTL to use uffd under the hood instead of soft-dirty
>   flags
>
> *Motivation*
> The real motivation for adding the PAGEMAP_SCAN IOCTL is to emulate the
> Windows GetWriteWatch() and ResetWriteWatch() syscalls [1].
> GetWriteWatch() retrieves the addresses of the pages that have been
> written to in a region of virtual memory.
>
> This syscall is used in Windows applications, games, etc. It is currently
> emulated rather slowly in userspace.
> Our goal is to enhance the kernel so that this syscall can be emulated
> efficiently. Currently some out-of-tree hack patches are being used to
> emulate it efficiently in some kernels. We intend to replace those with
> these patches. Gaming on Linux as a whole would benefit from this, which
> means this code would have a large number of users.
>
> CRIU use case [2] was mentioned by Andrei and Danylo:
>> Use cases for migrating sparse VMAs are binaries sanitized with ASAN,
>> MSAN or TSAN [3]. All of these sanitizers produce sparse mappings of
>> shadow memory [4]. Being able to migrate such binaries allows to highly
>> reduce the amount of work needed to identify and fix post-migration
>> crashes, which happen constantly.
>
> Andrei defines the following uses of this code:
> * It is more granular and allows us to track changed pages more
>   effectively. The current interface can clear dirty bits for the entire
>   process only. In addition, reading info about pages is a separate
>   operation. It means we must freeze the process to read information
>   about all its pages, reset dirty bits, and only then can we start
>   dumping pages. The information about pages becomes more and more
>   outdated while we are processing pages. The new interface solves both
>   these downsides. First, it allows us to read pte bits and clear the
>   soft-dirty bit atomically. It means that CRIU will not need to freeze
>   processes to pre-dump their memory. Second, it clears soft-dirty bits
>   for a specified region of memory. It means CRIU will have up-to-date
>   info about pages at the moment of dumping them.
> * The new interface has to be much faster because basic page filtering
>   is happening in the kernel. With the old interface, we have to read
>   pagemap for each page.
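
For context, a minimal sketch of the existing procfs interface that the
points above compare against. This is not code from the series. The bit
positions follow Documentation/admin-guide/mm/pagemap.rst, and writing
"4" to clear_refs is the documented way to clear soft-dirty bits; the
helper names are hypothetical and chosen only for illustration.

```python
import mmap
import struct

PAGE_SIZE = mmap.PAGESIZE

# Bit positions of a 64-bit /proc/PID/pagemap entry (see pagemap.rst).
PM_SOFT_DIRTY = 1 << 55   # pte is soft-dirty
PM_FILE = 1 << 61         # page is a file page or shared-anon
PM_SWAPPED = 1 << 62      # page is swapped
PM_PRESENT = 1 << 63      # page is present


def decode_pagemap_entry(entry):
    """Split one pagemap entry into the flags discussed above."""
    return {
        "present": bool(entry & PM_PRESENT),
        "swapped": bool(entry & PM_SWAPPED),
        "file_or_shared": bool(entry & PM_FILE),
        "soft_dirty": bool(entry & PM_SOFT_DIRTY),
    }


def pagemap_offset(vaddr):
    """File offset of the entry for vaddr: one 8-byte entry per page."""
    return (vaddr // PAGE_SIZE) * 8


def soft_dirty_pages(pid, start, end):
    """Return addresses in [start, end) whose soft-dirty bit is set.

    Userspace has to fetch and test one entry per page; no filtering
    happens in the kernel.
    """
    npages = (end - start) // PAGE_SIZE
    with open(f"/proc/{pid}/pagemap", "rb") as f:
        f.seek(pagemap_offset(start))
        data = f.read(npages * 8)
    dirty = []
    for i in range(len(data) // 8):
        (entry,) = struct.unpack_from("=Q", data, i * 8)
        if entry & PM_SOFT_DIRTY:
            dirty.append(start + i * PAGE_SIZE)
    return dirty


def clear_soft_dirty(pid):
    """Clear soft-dirty bits. This applies to the whole process and is a
    separate step from reading, so get and clear are not atomic."""
    with open(f"/proc/{pid}/clear_refs", "w") as f:
        f.write("4")
```

Reading and clearing are two independent operations here, and the clear
cannot be limited to a range. Those are the two gaps listed above.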
>
> *Implementation Evolution (Short Summary)*
> From the definition of GetWriteWatch(), we felt that the kernel's
> soft-dirty feature could be used under the hood with some additions:
> * reset the soft-dirty flag for only a specific region of memory instead
>   of clearing the flag for the entire process
> * get and clear the soft-dirty flag for a specific region atomically
>
> So we decided to use an ioctl on the pagemap file to read and/or reset the
> soft-dirty flag. But with the soft-dirty flag we sometimes get extra pages
> which weren't even written. They had become soft-dirty because of VMA
> merging and the VM_SOFTDIRTY flag. This breaks the definition of
> GetWriteWatch(). We were able to bypass this shortcoming by ignoring
> VM_SOFTDIRTY until David reported that mprotect etc. messes up the
> soft-dirty flag while ignoring VM_SOFTDIRTY [5]. This wasn't happening
> until [6] got introduced. We discussed whether we could revert these
> patches, but we could not reach any conclusion. So at this point, I made a
> couple of attempts to solve this whole VM_SOFTDIRTY issue by correcting
> the soft-dirty implementation:
> * [7] Correct the bug fixed wrongly back in 2014. It had the potential to
>   cause regressions, so we left it behind.
> * [8] Keep a list of the soft-dirty parts of a VMA across splits and
>   merges. The reply I got was not to increase the size of the VMA by
>   8 bytes.
>
> At this point, we left soft-dirty behind, considering it too delicate, and
> userfaultfd [9] seemed like the only way forward. From there onward, we
> have been basing soft-dirty emulation on the userfaultfd wp feature, where
> the kernel resolves the faults itself when the WP_ASYNC feature is used.
> It was straightforward to add the WP_ASYNC feature to userfaultfd. Now
> only those pages which have actually been written are reported as dirty
> or written-to. (PS: another userfaultfd feature, WP_UNPOPULATED, is also
> required to avoid pre-faulting memory before write-protecting [9].)
>
> All the different masks were added at the request of the CRIU devs to
> make the interface more generic.
>
> [1] https://learn.microsoft.com/en-us/windows/win32/api/memoryapi/nf-memoryapi-getwritewatch
> [2] https://lore.kernel.org/all/20221014134802.1361436-1-mdanylo@google.com
> [3] https://github.com/google/sanitizers
> [4] https://github.com/google/sanitizers/wiki/AddressSanitizerAlgorithm#64-bit
> [5] https://lore.kernel.org/all/bfcae708-db21-04b4-0bbe-712badd03071@redhat.com
> [6] https://lore.kernel.org/all/20220725142048.30450-1-peterx@redhat.com/
> [7] https://lore.kernel.org/all/20221122115007.2787017-1-usama.anjum@collabora.com
> [8] https://lore.kernel.org/all/20221220162606.1595355-1-usama.anjum@collabora.com
> [9] https://lore.kernel.org/all/20230306213925.617814-1-peterx@redhat.com
> [10] https://lore.kernel.org/all/20230125144529.1630917-1-mdanylo@google.com
>
> *Original cover letter from v8*
> Hello,
>
> Note:
> Soft-dirty pages and pages which have been written-to are synonyms. As the
> kernel already has a soft-dirty feature, which we have given up on using,
> we use the written-to terminology while using UFFD async WP under the
> hood.
>
> It is possible to find and clear soft-dirty pages entirely in userspace.
> But it isn't efficient:
> - The mprotect and SIGSEGV handler for bookkeeping
> - The userfaultfd wp (synchronous) with the handler for bookkeeping
>
> Some benchmarks can be seen here [1]. This series adds features that
> weren't present earlier:
> - The kernel has no atomic operation to get the soft-dirty/written-to
>   status and clear it.
> - The pages which have been written-to cannot be found accurately.
>   (The kernel's soft-dirty PTE bit + soft_dirty VMA bit shows more
>   soft-dirty pages than there actually are.)
>
> Historically, soft-dirty PTE bit tracking has been used in the CRIU
> project. The procfs interface is enough for finding the soft-dirty bit
> status and clearing the soft-dirty bit of all the pages of a process.
> We have a use case where we need to track the soft-dirty PTE bit for
> only specific pages on demand. We need this tracking and clearing
> mechanism for a region of memory while the process is running, in order to
> emulate the GetWriteWatch() syscall of Windows.
>
> *(Moved to using UFFD instead of soft-dirty feature to find pages which
> have been written-to from v7 patch series)*:
> Stop using the soft-dirty flags for finding which pages have been
> written to. The feature is too delicate and inaccurate, as it shows more
> soft-dirty pages than there actually are. There is no interest in
> correcting it [2][3], as this is how the feature was written years ago
> and its behaviour shouldn't be changed now. Peter Xu has suggested
> using the async version of the UFFD WP [4] as it is based inherently
> on the PTEs.
>
> So in this patch series, I've added a new mode to the UFFD which is an
> asynchronous version of write protect. When this variant of the
> UFFD WP is used, the page faults are resolved automatically by the
> kernel. The pages which have been written-to can be found by reading the
> pagemap file (!PM_UFFD_WP). This feature can be used successfully to
> find which pages have been written to since the time the pages were
> write protected. This works just like the soft-dirty flag without
> showing any extra pages which aren't soft-dirty in reality.
>
> Information on whether a page is file-mapped, present or swapped is
> required for the CRIU project [5][6]. The required mask, any mask,
> excluded mask and return mask are also required for the CRIU
> project [5].
>
> The IOCTL returns the addresses of the pages which match the specific
> masks. The page addresses are returned in struct page_region in a compact
> form. The max_pages is needed to support a use case where the user only
> wants to get a specific number of pages. So there is no need to find all
> the pages of interest in the range when max_pages is specified.
> The IOCTL returns when the maximum number of pages has been found. The
> max_pages is optional. If max_pages is specified, it must be greater than
> or equal to vec_size. This restriction is needed to handle the worst case,
> when each page_region contains info of only one page and cannot be
> compacted. This is needed to emulate the Windows GetWriteWatch() syscall.
>
> The patch series includes a detailed selftest which can be used as an
> example for the uffd async wp test and PAGEMAP_IOCTL. It shows the
> interface usage as well.
>
> [1] https://lore.kernel.org/lkml/54d4c322-cd6e-eefd-b161-2af2b56aae24@collabora.com/
> [2] https://lore.kernel.org/all/20221220162606.1595355-1-usama.anjum@collabora.com
> [3] https://lore.kernel.org/all/20221122115007.2787017-1-usama.anjum@collabora.com
> [4] https://lore.kernel.org/all/Y6Hc2d+7eTKs7AiH@x1n
> [5] https://lore.kernel.org/all/YyiDg79flhWoMDZB@gmail.com/
> [6] https://lore.kernel.org/all/20221014134802.1361436-1-mdanylo@google.com/
>
> Regards,
> Muhammad Usama Anjum
>
> Muhammad Usama Anjum (5):
>   fs/proc/task_mmu: Implement IOCTL to get and optionally clear info
>     about PTEs
>   fs/proc/task_mmu: Add fast paths to get/clear PAGE_IS_WRITTEN flag
>   tools headers UAPI: Update linux/fs.h with the kernel sources
>   mm/pagemap: add documentation of PAGEMAP_SCAN IOCTL
>   selftests: mm: add pagemap ioctl tests
>
> Peter Xu (1):
>   userfaultfd: UFFD_FEATURE_WP_ASYNC
>
>  Documentation/admin-guide/mm/pagemap.rst     |   89 +
>  Documentation/admin-guide/mm/userfaultfd.rst |   35 +
>  fs/proc/task_mmu.c                           |  722 ++++++++
>  fs/userfaultfd.c                             |   26 +-
>  include/linux/hugetlb.h                      |    1 +
>  include/linux/userfaultfd_k.h                |   28 +-
>  include/uapi/linux/fs.h                      |   59 +
>  include/uapi/linux/userfaultfd.h             |    9 +-
>  mm/hugetlb.c                                 |   34 +-
>  mm/memory.c                                  |   28 +-
>  tools/include/uapi/linux/fs.h                |   59 +
>  tools/testing/selftests/mm/.gitignore        |    2 +
>  tools/testing/selftests/mm/Makefile          |    3 +-
>  tools/testing/selftests/mm/config            |    1 +
>  tools/testing/selftests/mm/pagemap_ioctl.c   | 1660 ++++++++++++++++++
>  tools/testing/selftests/mm/run_vmtests.sh    |    4 +
>  16 files changed, 2736 insertions(+), 24 deletions(-)
>  create mode 100644 tools/testing/selftests/mm/pagemap_ioctl.c
>

-- 
BR,
Muhammad Usama Anjum
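
The "compact form" of page_region and the vec_size/max_pages limits
described in the quoted cover letter can be pictured with a small
userspace simulation. This is only an illustrative sketch under assumed
semantics, not the kernel implementation: compact_regions() and its
parameters are hypothetical names, and a region is modelled as a
(start, end, categories) tuple with an exclusive end.

```python
def compact_regions(pages, page_size=4096, vec_len=None, max_pages=None):
    """Merge matching pages into (start, end, categories) regions.

    pages     -- iterable of (address, categories), sorted by address
    vec_len   -- optional cap on the number of output regions
    max_pages -- optional cap on the number of pages reported
    """
    regions = []
    reported = 0
    for addr, cats in pages:
        if max_pages is not None and reported >= max_pages:
            break
        if regions and regions[-1][1] == addr and regions[-1][2] == cats:
            # Adjacent page with the same categories: extend the region.
            start, _, _ = regions[-1]
            regions[-1] = (start, addr + page_size, cats)
        else:
            # A new region is needed; stop if the output vector is full.
            if vec_len is not None and len(regions) >= vec_len:
                break
            regions.append((addr, addr + page_size, cats))
        reported += 1
    return regions


contiguous = [(0x1000, 2), (0x2000, 2), (0x3000, 2), (0x5000, 2)]
sparse = [(0x1000, 2), (0x3000, 2), (0x5000, 2)]

# Three adjacent pages collapse into one region, the fourth stands alone.
print(compact_regions(contiguous))
# Worst case: no two pages are adjacent, so each needs its own region and
# a two-entry vector holds only the first two pages.
print(compact_regions(sparse, vec_len=2))
# Stop after two pages have been reported.
print(compact_regions(contiguous, max_pages=2))
```

In the sparse case nothing can be merged, so the number of regions
equals the number of pages. That is the worst case the max_pages versus
vec_size restriction in the cover letter refers to.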