From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID: <44d9cbca-333a-379b-356e-b6ec8b422075@collabora.com>
Date: Wed, 7 Jun 2023 11:02:42 +0500
From: Muhammad Usama Anjum <usama.anjum@collabora.com>
To: David Hildenbrand, Michał Mirosław, Danylo Mocherniuk, Mike Rapoport,
 Andrei Vagin
Cc: kernel@collabora.com, Andrew Morton, "open list : MEMORY MANAGEMENT",
 open list
Subject: Re: [PATCH v16 0/5] Implement IOCTL to get and optionally clear info
 about PTEs
References: <20230525085517.281529-1-usama.anjum@collabora.com>
 <598965cb-85d3-5b33-a1d4-2f49e94ee8ea@collabora.com>
In-Reply-To: <598965cb-85d3-5b33-a1d4-2f49e94ee8ea@collabora.com>

On 5/30/23 7:07 PM, Muhammad Usama Anjum wrote:
> Hello Peter, David, Michal, Mike, Danylo and Andrei,
>
> I hope you are well. You (and some others) have been helping and
> reviewing up to this point. Thank you so much! Please review the patches
> again and mention anything that comes to mind. Please send Reviewed-by or
> Tested-by tags in the hope that we can merge these soon for the next
> release. The current patches fulfill all of our requirements regarding ABI
> and performance. I think we should merge now (if something comes up, we
> can always fix it along the way). Your Acked-by/Reviewed-by/Tested-by
> tags would be much appreciated.

You have all contributed during development. I believe we are very close. I
really want your comments so that we can get this merged soon.

Latest revision (v17) for review:
https://lore.kernel.org/all/20230606060822.1065182-1-usama.anjum@collabora.com

>
> Any thoughts/comments are welcome. I'm sending this email as we have not
> had many reviews in the past 3-4 revisions.
>
> Thanks,
> Usama
>
> On 5/25/23 1:55 PM, Muhammad Usama Anjum wrote:
>> *Changes in v16*
>> - Fix a corner case
>> - Add exclusive PM_SCAN_OP_WP back
>>
>> *Changes in v15*
>> - Build fix (Add missed build fix in RESEND)
>>
>> *Changes in v14*
>> - Fix build error caused by #ifdef added at last minute in some configs
>>
>> *Changes in v13*
>> - Rebase on top of next-20230414
>> - Give up on using uffd_wp_range() and write new helpers, flush tlb only
>>   once
>>
>> *Changes in v12*
>> - Update and other memory types to UFFD_FEATURE_WP_ASYNC
>> - Rebase on top of next-20230406
>> - Review updates
>>
>> *Changes in v11*
>> - Rebase on top of next-20230307
>> - Base patches on UFFD_FEATURE_WP_UNPOPULATED
>> - Do a lot of cosmetic changes and review updates
>> - Remove ENGAGE_WP + !GET operation as it can be performed with
>>   UFFDIO_WRITEPROTECT
>>
>> *Changes in v10*
>> - Add specific condition to return error if hugetlb is used with wp
>>   async
>> - Move changes in tools/include/uapi/linux/fs.h to separate patch
>> - Add documentation
>>
>> *Changes in v9:*
>> - Correct fault resolution for userfaultfd wp async
>> - Fix build warnings and errors which were happening on some configs
>> - Simplify pagemap ioctl's code
>>
>> *Changes in v8:*
>> - Update uffd async wp implementation
>> - Improve PAGEMAP_IOCTL implementation
>>
>> *Changes in v7:*
>> - Add uffd wp async
>> - Update the IOCTL to use uffd under the hood instead of soft-dirty
>>   flags
>>
>> *Motivation*
>> The real motivation for adding the PAGEMAP_SCAN IOCTL is to emulate the
>> Windows GetWriteWatch() syscall [1]. GetWriteWatch() retrieves the
>> addresses of the pages that are written to in a region of virtual memory.
>>
>> This syscall is used in Windows applications, games, etc. It is currently
>> emulated in a pretty slow manner in userspace. Our purpose is to enhance
>> the kernel so that we can emulate it efficiently.
>> Currently some out-of-tree hack patches are being used to efficiently
>> emulate it in some kernels. We intend to replace those with these patches
>> so that all of gaming on Linux can benefit from this. It means there
>> would be tons of users of this code.
>>
>> CRIU use case [2] was mentioned by Andrei and Danylo:
>>> Use cases for migrating sparse VMAs are binaries sanitized with ASAN,
>>> MSAN or TSAN [3]. All of these sanitizers produce sparse mappings of
>>> shadow memory [4]. Being able to migrate such binaries allows to highly
>>> reduce the amount of work needed to identify and fix post-migration
>>> crashes, which happen constantly.
>>
>> Andrei defines the following uses of this code:
>> * It is more granular and allows us to track changed pages more
>>   effectively. The current interface can clear dirty bits for the entire
>>   process only. In addition, reading info about pages is a separate
>>   operation. It means we must freeze the process to read information
>>   about all its pages, reset dirty bits, only then we can start dumping
>>   pages. The information about pages becomes more and more outdated,
>>   while we are processing pages. The new interface solves both these
>>   downsides. First, it allows us to read pte bits and clear the
>>   soft-dirty bit atomically. It means that CRIU will not need to freeze
>>   processes to pre-dump their memory. Second, it clears soft-dirty bits
>>   for a specified region of memory. It means CRIU will have actual info
>>   about pages to the moment of dumping them.
>> * The new interface has to be much faster because basic page filtering
>>   is happening in the kernel. With the old interface, we have to read
>>   pagemap for each page.
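
To make the cost of the existing interface concrete, here is a minimal
userspace sketch in C of the soft-dirty flow described above. It relies only
on the documented procfs behaviour: writing "4" to /proc/PID/clear_refs
clears the soft-dirty bits of the whole process, and bit 55 of a 64-bit
pagemap entry is the soft-dirty bit. The helper names are made up for
illustration and none of this is code from the series.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define PM_ENTRY_BYTES 8ULL
#define PM_SOFT_DIRTY  (1ULL << 55)	/* pagemap bit 55: pte is soft-dirty */

/* Decode the soft-dirty bit from one raw pagemap entry. */
int pm_entry_soft_dirty(uint64_t entry)
{
	return (entry & PM_SOFT_DIRTY) != 0;
}

/* Byte offset inside /proc/PID/pagemap of the entry describing vaddr. */
uint64_t pm_entry_offset(uint64_t vaddr, uint64_t page_size)
{
	return (vaddr / page_size) * PM_ENTRY_BYTES;
}

/*
 * Reset soft-dirty tracking. The old interface offers no way to limit
 * this to a range: it always applies to the entire calling process.
 */
int clear_soft_dirty_whole_process(void)
{
	int fd = open("/proc/self/clear_refs", O_WRONLY);
	ssize_t n;

	if (fd < 0)
		return -1;
	n = write(fd, "4", 1);
	close(fd);
	return n == 1 ? 0 : -1;
}

/*
 * Count soft-dirty pages in a range by reading one pagemap entry per page.
 * This is a separate step from the clear above, so the two are not atomic.
 * Returns -1 on I/O error.
 */
long count_soft_dirty(const void *start, size_t npages)
{
	uint64_t page_size = (uint64_t)sysconf(_SC_PAGESIZE);
	int fd = open("/proc/self/pagemap", O_RDONLY);
	long dirty = 0;

	if (fd < 0)
		return -1;
	for (size_t i = 0; i < npages; i++) {
		uint64_t vaddr = (uint64_t)(uintptr_t)start + i * page_size;
		off_t off = (off_t)pm_entry_offset(vaddr, page_size);
		uint64_t entry;

		if (pread(fd, &entry, sizeof(entry), off) != (ssize_t)sizeof(entry)) {
			close(fd);
			return -1;
		}
		dirty += pm_entry_soft_dirty(entry);
	}
	close(fd);
	return dirty;
}
```

A tracker built on these helpers has to call count_soft_dirty() and then
clear_soft_dirty_whole_process() as two separate steps, which is the window
the quoted text says the new interface closes.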
>>
>> *Implementation Evolution (Short Summary)*
>> From the definition of GetWriteWatch(), we felt that the kernel's
>> soft-dirty feature could be used under the hood with some additions like:
>> * reset soft-dirty flag for only a specific region of memory instead of
>>   clearing the flag for the entire process
>> * get and clear soft-dirty flag for a specific region atomically
>>
>> So we decided to use an ioctl on the pagemap file to read and/or reset
>> the soft-dirty flag. But with the soft-dirty flag, we sometimes get extra
>> pages which weren't even written. They had become soft-dirty because of
>> VMA merging and the VM_SOFTDIRTY flag. This breaks the definition of
>> GetWriteWatch(). We were able to bypass this shortcoming by ignoring
>> VM_SOFTDIRTY until David reported that mprotect etc. messes up the
>> soft-dirty flag while ignoring VM_SOFTDIRTY [5]. This wasn't happening
>> until [6] got introduced. We discussed whether we could revert these
>> patches, but we could not reach any conclusion. So at this point, I made
>> a couple of attempts to solve this whole VM_SOFTDIRTY issue by correcting
>> the soft-dirty implementation:
>> * [7] Correct the bug fixed wrongly back in 2014. It had the potential to
>>   cause regressions. We left it behind.
>> * [8] Keep a list of the soft-dirty parts of a VMA across splits and
>>   merges. I got the reply not to increase the size of the VMA by 8 bytes.
>>
>> At this point, we abandoned soft-dirty, considering it too delicate, and
>> userfaultfd [9] seemed like the only way forward. From there onward, we
>> have been basing soft-dirty emulation on the userfaultfd wp feature, where
>> the kernel resolves the faults itself when the WP_ASYNC feature is used.
>> It was straightforward to add the WP_ASYNC feature to userfaultfd. Now we
>> get only those pages reported as dirty or written-to which have really
>> been written. (PS: another userfaultfd feature, WP_UNPOPULATED, is
>> required to avoid pre-faulting memory before write-protecting [9].)
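
As a rough illustration of how a written-to page is recognised with the
userfaultfd-based approach, here is a small sketch in C. It assumes the
range was registered with userfaultfd in write-protect mode and
write-protected beforehand (otherwise every page would look written), and it
relies on the documented pagemap layout where bit 57 means "pte is uffd-wp
write-protected". The function names are hypothetical, and the entries are
assumed to have been read from /proc/PID/pagemap already.

```c
#include <stddef.h>
#include <stdint.h>

#define PM_UFFD_WP (1ULL << 57)	/* pagemap bit 57: pte is uffd-wp write-protected */

/*
 * With async write-protect the kernel resolves the fault itself and drops
 * the protection, so a page that was protected earlier and no longer has
 * PM_UFFD_WP set has been written to since.
 */
int pm_entry_written(uint64_t entry)
{
	return (entry & PM_UFFD_WP) == 0;
}

/*
 * Turn a buffer of raw pagemap entries covering [base, base + npages pages)
 * into a list of written page addresses, in the spirit of GetWriteWatch().
 * Stops once out_cap addresses have been stored; returns the number stored.
 */
size_t collect_written(const uint64_t *entries, size_t npages,
		       uint64_t base, uint64_t page_size,
		       uint64_t *out, size_t out_cap)
{
	size_t n = 0;

	for (size_t i = 0; i < npages && n < out_cap; i++)
		if (pm_entry_written(entries[i]))
			out[n++] = base + i * page_size;
	return n;
}
```

Doing this in userspace still means reading one entry per page and
re-protecting the range in a separate step, which is what the ioctl is meant
to fold into a single call.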
>>
>> All the different masks were added at the request of the CRIU devs to
>> make the interface more generic and better.
>>
>> [1] https://learn.microsoft.com/en-us/windows/win32/api/memoryapi/nf-memoryapi-getwritewatch
>> [2] https://lore.kernel.org/all/20221014134802.1361436-1-mdanylo@google.com
>> [3] https://github.com/google/sanitizers
>> [4] https://github.com/google/sanitizers/wiki/AddressSanitizerAlgorithm#64-bit
>> [5] https://lore.kernel.org/all/bfcae708-db21-04b4-0bbe-712badd03071@redhat.com
>> [6] https://lore.kernel.org/all/20220725142048.30450-1-peterx@redhat.com/
>> [7] https://lore.kernel.org/all/20221122115007.2787017-1-usama.anjum@collabora.com
>> [8] https://lore.kernel.org/all/20221220162606.1595355-1-usama.anjum@collabora.com
>> [9] https://lore.kernel.org/all/20230306213925.617814-1-peterx@redhat.com
>> [10] https://lore.kernel.org/all/20230125144529.1630917-1-mdanylo@google.com
>>
>> *Original Cover letter from v8*
>> Hello,
>>
>> Note:
>> Soft-dirty pages and pages which have been written-to are synonyms. As
>> the kernel already has a soft-dirty feature which we have given up on
>> using, we use the written-to terminology while using UFFD async WP under
>> the hood.
>>
>> This IOCTL, PAGEMAP_SCAN on the pagemap file, can be used to get and/or
>> clear the info about page table entries. The following operations are
>> supported in this ioctl:
>> - Get the information if the pages have been written-to (PAGE_IS_WRITTEN),
>>   file mapped (PAGE_IS_FILE), present (PAGE_IS_PRESENT) or swapped
>>   (PAGE_IS_SWAPPED).
>> - Write-protect the pages (PAGEMAP_WP_ENGAGE) to start finding which
>>   pages have been written-to.
>> - Find pages which have been written-to and write protect the pages
>>   (atomic PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE)
>>
>> It is possible to find and clear soft-dirty pages entirely in userspace.
>> But it isn't efficient:
>> - The mprotect and SIGSEGV handler for bookkeeping
>> - The userfaultfd wp (synchronous) with the handler for bookkeeping
>>
>> Some benchmarks can be seen here [1]. This series fills the following
>> gaps:
>> - There is no atomic "get soft-dirty/written-to status and clear"
>>   operation in the kernel.
>> - The pages which have been written-to cannot be found accurately.
>>   (The kernel's soft-dirty PTE bit + soft-dirty VMA bit show more
>>   soft-dirty pages than there actually are.)
>>
>> Historically, soft-dirty PTE bit tracking has been used in the CRIU
>> project. The procfs interface is enough for finding the soft-dirty bit
>> status and clearing the soft-dirty bit of all the pages of a process.
>> We have the use case where we need to track the soft-dirty PTE bit for
>> only specific pages on-demand. We need this tracking and clear mechanism
>> of a region of memory while the process is running to emulate the
>> GetWriteWatch() syscall of Windows.
>>
>> *(Moved to using UFFD instead of soft-dirty feature to find pages which
>> have been written-to from v7 patch series)*:
>> Stop using the soft-dirty flags for finding which pages have been
>> written to. It is too delicate and wrong as it shows more soft-dirty
>> pages than the actual soft-dirty pages. There is no interest in
>> correcting it [2][3] as this is how the feature was written years ago.
>> Its behaviour shouldn't be changed now. Peter Xu has suggested
>> using the async version of the UFFD WP [4] as it is based inherently
>> on the PTEs.
>>
>> So in this patch series, I've added a new mode to the UFFD which is
>> an asynchronous version of the write protect. When this variant of the
>> UFFD WP is used, the page faults are resolved automatically by the
>> kernel. The pages which have been written-to can be found by reading
>> pagemap file (!PM_UFFD_WP).
>> This feature can be used successfully to find which pages have been
>> written to from the time the pages were write protected. This works just
>> like the soft-dirty flag without showing any extra pages which aren't
>> soft-dirty in reality.
>>
>> The information about whether a page is file mapped, present or swapped
>> is required for the CRIU project [5][6]. The required mask, any mask,
>> excluded mask and return mask are also needed for the CRIU project [5].
>>
>> The IOCTL returns the addresses of the pages which match the specified
>> masks. The page addresses are returned in struct page_region in a compact
>> form. The max_pages is needed to support a use case where user only wants
>> to get a specific number of pages, so there is no need to find all the
>> pages of interest in the range when max_pages is specified. The IOCTL
>> returns when the maximum number of pages has been found. The max_pages is
>> optional. If max_pages is specified, it must be equal to or greater than
>> the vec_size. This restriction is needed to handle the worst case, when
>> one page_region only contains info of one page and it cannot be
>> compacted. This is needed to emulate the Windows GetWriteWatch() syscall.
>>
>> The patch series includes a detailed selftest which can be used as an
>> example for the uffd async wp test and PAGEMAP_IOCTL. It shows the
>> interface usage as well.
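
To illustrate the filtering and compaction described above, here is a
userspace model in C. It is only a sketch under assumptions and not the
kernel implementation: the mask semantics (all required bits set, at least
one any-of bit set when that mask is non-zero, no excluded bit set) are my
reading of the mask names, and the EX_* flag values, struct ex_scan_masks
and struct ex_page_region are hypothetical stand-ins for the UAPI
definitions added by the series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative flag values only; the real ones live in the UAPI header. */
#define EX_PAGE_IS_WRITTEN (1ULL << 0)
#define EX_PAGE_IS_FILE    (1ULL << 1)
#define EX_PAGE_IS_PRESENT (1ULL << 2)
#define EX_PAGE_IS_SWAPPED (1ULL << 3)

struct ex_scan_masks {
	uint64_t required;	/* every bit here must be set */
	uint64_t anyof;		/* if non-zero, at least one bit must be set */
	uint64_t excluded;	/* no bit here may be set */
	uint64_t ret;		/* bits reported back in the region flags */
};

struct ex_page_region {
	uint64_t start;		/* address of the first page */
	uint64_t len;		/* number of pages */
	uint64_t flags;		/* flags shared by all pages in the region */
};

int ex_page_matches(uint64_t flags, const struct ex_scan_masks *m)
{
	if ((flags & m->required) != m->required)
		return 0;
	if (m->anyof && !(flags & m->anyof))
		return 0;
	if (flags & m->excluded)
		return 0;
	return 1;
}

/*
 * Walk per-page flags, keep matching pages and merge neighbours that are
 * contiguous and report the same flags into one region. max_pages == 0
 * means "no limit". Returns the number of regions written to vec.
 */
size_t ex_scan(const uint64_t *page_flags, size_t npages,
	       uint64_t base, uint64_t page_size,
	       const struct ex_scan_masks *m,
	       struct ex_page_region *vec, size_t vec_size, size_t max_pages)
{
	size_t nregions = 0, found = 0;

	assert(page_flags && m && vec);
	for (size_t i = 0; i < npages; i++) {
		struct ex_page_region *last = nregions ? &vec[nregions - 1] : NULL;
		uint64_t addr = base + i * page_size;
		uint64_t flags = page_flags[i] & m->ret;

		if (max_pages && found >= max_pages)
			break;
		if (!ex_page_matches(page_flags[i], m))
			continue;
		if (last && last->flags == flags &&
		    last->start + last->len * page_size == addr) {
			last->len++;		/* extend the current region */
		} else {
			if (nregions == vec_size)
				break;		/* output vector is full */
			vec[nregions].start = addr;
			vec[nregions].len = 1;
			vec[nregions].flags = flags;
			nregions++;
		}
		found++;
	}
	return nregions;
}
```

In the worst case no two matching pages can be merged, so each region
describes a single page; this is the situation the max_pages/vec_size
restriction above refers to.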
>>
>> [1] https://lore.kernel.org/lkml/54d4c322-cd6e-eefd-b161-2af2b56aae24@collabora.com/
>> [2] https://lore.kernel.org/all/20221220162606.1595355-1-usama.anjum@collabora.com
>> [3] https://lore.kernel.org/all/20221122115007.2787017-1-usama.anjum@collabora.com
>> [4] https://lore.kernel.org/all/Y6Hc2d+7eTKs7AiH@x1n
>> [5] https://lore.kernel.org/all/YyiDg79flhWoMDZB@gmail.com/
>> [6] https://lore.kernel.org/all/20221014134802.1361436-1-mdanylo@google.com/
>>
>> Regards,
>> Muhammad Usama Anjum
>>
>> Muhammad Usama Anjum (4):
>>   fs/proc/task_mmu: Implement IOCTL to get and optionally clear info
>>     about PTEs
>>   tools headers UAPI: Update linux/fs.h with the kernel sources
>>   mm/pagemap: add documentation of PAGEMAP_SCAN IOCTL
>>   selftests: mm: add pagemap ioctl tests
>>
>> Peter Xu (1):
>>   userfaultfd: UFFD_FEATURE_WP_ASYNC
>>
>>  Documentation/admin-guide/mm/pagemap.rst     |   58 +
>>  Documentation/admin-guide/mm/userfaultfd.rst |   35 +
>>  fs/proc/task_mmu.c                           |  503 ++++++
>>  fs/userfaultfd.c                             |   26 +-
>>  include/linux/userfaultfd_k.h                |   21 +-
>>  include/uapi/linux/fs.h                      |   53 +
>>  include/uapi/linux/userfaultfd.h             |    9 +-
>>  mm/hugetlb.c                                 |   32 +-
>>  mm/memory.c                                  |   27 +-
>>  tools/include/uapi/linux/fs.h                |   53 +
>>  tools/testing/selftests/mm/.gitignore        |    1 +
>>  tools/testing/selftests/mm/Makefile          |    3 +-
>>  tools/testing/selftests/mm/config            |    1 +
>>  tools/testing/selftests/mm/pagemap_ioctl.c   | 1459 ++++++++++++++++++
>>  tools/testing/selftests/mm/run_vmtests.sh    |    4 +
>>  15 files changed, 2262 insertions(+), 23 deletions(-)
>>  create mode 100644 tools/testing/selftests/mm/pagemap_ioctl.c
>>  mode change 100644 => 100755 tools/testing/selftests/mm/run_vmtests.sh
>>
>

-- 
BR,
Muhammad Usama Anjum