From: Muhammad Usama Anjum <usama.anjum@collabora.com>
To: "Cyrill Gorcunov" <gorcunov@gmail.com>,
"Peter Xu" <peterx@redhat.com>,
"David Hildenbrand" <david@redhat.com>,
"Andrew Morton" <akpm@linux-foundation.org>,
"Michał Mirosław" <emmir@google.com>,
"Andrei Vagin" <avagin@gmail.com>,
"Danylo Mocherniuk" <mdanylo@google.com>
Cc: Muhammad Usama Anjum <usama.anjum@collabora.com>,
Alexander Viro <viro@zeniv.linux.org.uk>,
Shuah Khan <shuah@kernel.org>,
Christian Brauner <brauner@kernel.org>,
Yang Shi <shy828301@gmail.com>, Vlastimil Babka <vbabka@suse.cz>,
"Liam R . Howlett" <Liam.Howlett@Oracle.com>,
Yun Zhou <yun.zhou@windriver.com>,
Suren Baghdasaryan <surenb@google.com>,
Alex Sierra <alex.sierra@amd.com>,
Matthew Wilcox <willy@infradead.org>,
Pasha Tatashin <pasha.tatashin@soleen.com>,
Mike Rapoport <rppt@kernel.org>, Nadav Amit <namit@vmware.com>,
Axel Rasmussen <axelrasmussen@google.com>,
"Gustavo A . R . Silva" <gustavoars@kernel.org>,
Dan Williams <dan.j.williams@intel.com>,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
linux-mm@kvack.org, linux-kselftest@vger.kernel.org,
Greg KH <gregkh@linuxfoundation.org>,
kernel@collabora.com, Paul Gofman <pgofman@codeweavers.com>
Subject: Re: [PATCH v7 0/4] Implement IOCTL to get and/or the clear info about PTEs
Date: Wed, 18 Jan 2023 11:55:00 +0500
Message-ID: <9bc72983-91dc-74b5-54dd-cf419d6deab4@collabora.com>
In-Reply-To: <20230109064519.3555250-1-usama.anjum@collabora.com>
On 1/9/23 11:45 AM, Muhammad Usama Anjum wrote:
> *Changes in v7:*
> - Add uffd wp async
> - Update the IOCTL to use uffd under the hood instead of soft-dirty
> flags
>
> Stop using the soft-dirty flags to find which pages have been written
> to. The mechanism is fragile and over-reports: it marks more pages as
> soft-dirty than were actually written. There is no interest in
> correcting it [A][B], as this is how the feature was written years ago
> and its behaviour shouldn't be changed now. Peter Xu has suggested
> using the async version of UFFD WP [C], as it is inherently based on
> the PTEs.
>
> So in this patch series, I've added a new mode to UFFD, an
> asynchronous version of write protect. When this variant of UFFD WP
> is used, page faults are resolved automatically by the kernel. The
> pages which have been written to can be found by reading the pagemap
> file (!PM_UFFD_WP). This feature can be used to find which pages have
> been written to since they were write protected. It works just like
> the soft-dirty flag, without reporting extra pages which aren't
> actually soft-dirty.
Any thoughts on this version are highly welcome. Please review.
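
To make the "!PM_UFFD_WP" check in the quoted paragraph concrete, here is
a minimal userspace sketch. The bit positions are the ones documented for
/proc/PID/pagemap entries; the helper names page_written() and
count_written() are made up for illustration, and the sketch assumes the
range was write protected beforehand through the async UFFD WP mode.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bit positions of a 64-bit /proc/PID/pagemap entry. */
#define PM_SOFT_DIRTY (1ULL << 55)
#define PM_UFFD_WP    (1ULL << 57)
#define PM_FILE       (1ULL << 61)
#define PM_SWAP       (1ULL << 62)
#define PM_PRESENT    (1ULL << 63)

/*
 * Hypothetical helper: a page that was write protected earlier counts as
 * written once its PM_UFFD_WP bit is no longer set.
 */
static bool page_written(uint64_t entry)
{
	return (entry & PM_UFFD_WP) == 0;
}

/* Hypothetical helper: count written pages in a buffer of pagemap entries. */
static size_t count_written(const uint64_t *entries, size_t n)
{
	size_t written = 0;

	assert(entries != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (page_written(entries[i]))
			written++;
	return written;
}
```

The entries would come from a plain read() of the pagemap file, one 64-bit
word per virtual page in the range.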
>
> [A] https://lore.kernel.org/all/20221220162606.1595355-1-usama.anjum@collabora.com
> [B] https://lore.kernel.org/all/20221122115007.2787017-1-usama.anjum@collabora.com
> [C] https://lore.kernel.org/all/Y6Hc2d+7eTKs7AiH@x1n
>
> *Changes in v6:*
> - Updated the interface and made cosmetic changes
>
> *Cover Letter in v5:*
> Hello,
>
> This patch series implements an IOCTL on the pagemap procfs file to get
> information about the page table entries (PTEs). The following
> operations are supported by this IOCTL:
> - Get information about whether the pages are soft-dirty, file mapped,
> present or swapped.
> - Clear the soft-dirty PTE bit of the pages.
> - Get and clear the soft-dirty PTE bit of the pages atomically.
>
> The soft-dirty PTE bit of memory pages can be read by using the pagemap
> procfs file. The soft-dirty PTE bit for the whole memory range of the
> process can be cleared by writing to the clear_refs file. There are
> other methods that mimic this information entirely in userspace, with
> poor performance:
> - The mprotect syscall and SIGSEGV handler for bookkeeping
> - The userfaultfd syscall with the handler for bookkeeping
> Some benchmarks can be seen here[1]. This series adds features that
> weren't present earlier:
> - An atomic operation to get the soft-dirty PTE bit status and clear
> it, which was not possible before.
> - Clearing the soft-dirty PTE bit of only a part of memory, which was
> not possible before.
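
For reference, the existing procfs interface described in the quoted text
can be exercised roughly as below. This is only a sketch: the helper names
are made up for illustration, error handling is minimal, and it assumes a
kernel built with soft-dirty support. It also shows the two limitations
listed above: the read and the clear are separate steps, and the clear
applies to the whole process.

```c
#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define PM_SOFT_DIRTY (1ULL << 55)	/* bit 55 of a pagemap entry */
#define PM_ENTRY_SIZE 8			/* one 64-bit entry per virtual page */

/* Byte offset in /proc/PID/pagemap of the entry describing addr. */
static uint64_t pagemap_offset(uintptr_t addr, uint64_t page_size)
{
	assert(page_size != 0);
	return (addr / page_size) * PM_ENTRY_SIZE;
}

/* Returns 1 if the page holding addr is soft-dirty, 0 if not, -1 on error. */
static int page_soft_dirty(const void *addr)
{
	uint64_t entry;
	long page_size = sysconf(_SC_PAGESIZE);
	int fd = open("/proc/self/pagemap", O_RDONLY);
	ssize_t n;

	if (fd < 0)
		return -1;
	if (page_size <= 0) {
		close(fd);
		return -1;
	}
	n = pread(fd, &entry, sizeof(entry),
		  (off_t)pagemap_offset((uintptr_t)addr, (uint64_t)page_size));
	close(fd);
	if (n != (ssize_t)sizeof(entry))
		return -1;
	return (entry & PM_SOFT_DIRTY) != 0;
}

/* Writing "4" to clear_refs clears soft-dirty for every page of the process. */
static int clear_soft_dirty_all(void)
{
	int fd = open("/proc/self/clear_refs", O_WRONLY);
	ssize_t n;

	if (fd < 0)
		return -1;
	n = write(fd, "4", 1);
	close(fd);
	return n == 1 ? 0 : -1;
}
```

A write that lands between page_soft_dirty() and clear_soft_dirty_all()
would be lost to the tracker, which is the gap the atomic get-and-clear
operation is meant to close.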
>
> Historically, soft-dirty PTE bit tracking has been used in the CRIU
> project. The procfs interface is enough for finding the soft-dirty bit
> status and clearing the soft-dirty bit of all the pages of a process.
> We have a use case where we need to track the soft-dirty PTE bit for
> only specific pages on demand. We need this tracking and clearing
> mechanism for a region of memory while the process is running, to
> emulate the getWriteWatch() syscall of Windows. This syscall is used by
> games to keep track of dirty pages so that only the dirty pages are
> processed.
>
> Information about whether a page is file mapped, present or swapped is
> required for the CRIU project[2][3]. The required mask, any mask,
> excluded mask and return mask are also required for the CRIU
> project[2].
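
One plausible reading of how the four masks combine is sketched below. The
flag values, the struct ex_page_filter layout and the function names are
hypothetical and only for illustration; they are not taken from the
patches. The assumed semantics: all bits of the required mask must be set,
at least one bit of the any mask must be set when that mask is non-empty,
no bit of the excluded mask may be set, and the return mask selects which
bits are reported back.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-page category bits. */
#define EX_PAGE_SOFT_DIRTY (1u << 0)
#define EX_PAGE_FILE       (1u << 1)
#define EX_PAGE_PRESENT    (1u << 2)
#define EX_PAGE_SWAPPED    (1u << 3)

/* Hypothetical container for the four masks named in the cover letter. */
struct ex_page_filter {
	uint32_t required_mask;	/* every bit here must be set */
	uint32_t anyof_mask;	/* if non-zero, at least one bit must be set */
	uint32_t excluded_mask;	/* no bit here may be set */
	uint32_t return_mask;	/* bits reported for matching pages */
};

static bool page_matches(uint32_t flags, const struct ex_page_filter *f)
{
	assert(f != NULL);
	if ((flags & f->required_mask) != f->required_mask)
		return false;
	if (f->anyof_mask && !(flags & f->anyof_mask))
		return false;
	if (flags & f->excluded_mask)
		return false;
	return true;
}

static uint32_t reported_flags(uint32_t flags, const struct ex_page_filter *f)
{
	assert(f != NULL);
	return flags & f->return_mask;
}
```

With such a filter a caller could, for example, ask for present pages that
are either soft-dirty or file mapped, skip swapped ones, and get back only
the soft-dirty bit.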
>
> The IOCTL returns the addresses of the pages which match the specified
> masks. The page addresses are returned in struct page_region in a
> compact form. max_pages supports the use case where the user only wants
> a specific number of pages, so there is no need to find all the pages
> of interest in the range when max_pages is specified. The IOCTL returns
> once the maximum number of pages has been found. max_pages is optional.
> If max_pages is specified, it must be equal to or greater than the
> vec_size. This restriction is needed to handle the worst case, in which
> each page_region contains info for only one page and cannot be
> compacted. This is needed to emulate the Windows getWriteWatch()
> syscall.
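
The compaction idea can be modelled in userspace as follows. This is a
sketch only: struct ex_page_region and compact_pages() are hypothetical
stand-ins, not the layout or code from the patches. Adjacent pages with
identical flags are merged into one region, and the walk stops when either
the output vector is full or max_pages pages have been reported (0 is
taken to mean "no limit" here, which is an assumption).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical compact description of a run of pages with equal flags. */
struct ex_page_region {
	uint64_t start;	/* address of the first page in the run */
	uint64_t len;	/* number of pages in the run */
	uint64_t flags;	/* flags shared by every page in the run */
};

/*
 * flags[i] describes the page at base + i * page_size; 0 means the page is
 * not of interest. Returns the number of regions written to vec.
 */
static size_t compact_pages(uint64_t base, uint64_t page_size,
			    const uint64_t *flags, size_t n_pages,
			    struct ex_page_region *vec, size_t vec_len,
			    size_t max_pages)
{
	size_t used = 0, reported = 0;

	assert(page_size != 0);
	for (size_t i = 0; i < n_pages; i++) {
		uint64_t addr = base + i * page_size;
		struct ex_page_region *last = used ? &vec[used - 1] : NULL;

		if (flags[i] == 0)
			continue;
		if (max_pages && reported == max_pages)
			break;
		if (last && last->flags == flags[i] &&
		    last->start + last->len * page_size == addr) {
			last->len++;		/* extend the current run */
		} else {
			if (used == vec_len)
				break;		/* no room for a new region */
			vec[used].start = addr;
			vec[used].len = 1;
			vec[used].flags = flags[i];
			used++;
		}
		reported++;
	}
	return used;
}
```

When matching pages alternate with non-matching ones, nothing can be
merged and every reported page consumes its own region, which is the
worst case mentioned in the quoted paragraph.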
>
> Some non-dirty pages get marked as dirty because of the kernel's
> internal activity (such as VMA merging, since a soft-dirty bit
> difference isn't considered when deciding whether to merge VMAs). The
> dirty bit of a page is stored both in the VMA flags and in the per-page
> flags. If either of these two bits is set, the page is considered to be
> soft-dirty. Suppose you have cleared the soft-dirty bit of half of a
> VMA, which is done by splitting the VMA and clearing the soft-dirty flag
> in that half VMA and in the pages in it. The kernel may then decide to
> merge the VMAs again, so the half VMA becomes dirty again. This
> splitting/merging costs performance. The application receives a lot of
> pages which aren't dirty in reality but are marked as dirty, so
> performance is lost again here. Also, sometimes the user doesn't want
> newly allocated memory to be marked as dirty. The
> PAGEMAP_NO_REUSED_REGIONS flag solves both problems. It removes the
> dependence on the soft-dirty flag in the VMA flags, so VMA splitting and
> merging doesn't happen; only the soft-dirty bit of the individual pages
> is considered. Thus, when this flag is used, there may be a scenario in
> which newly created memory regions don't look dirty when seen through
> the IOCTL but look dirty when seen from procfs. This seems okay, as the
> users of this flag know the implications of using it.
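
The difference between the two views can be reduced to a small model. This
is not kernel code: struct ex_vma and page_reported_dirty() are
hypothetical and only mirror the rule described in the quoted paragraph,
namely that a page counts as soft-dirty if either the VMA-level flag or
its own PTE bit is set, while PAGEMAP_NO_REUSED_REGIONS makes the decision
depend on the PTE bit alone.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a VMA: one VMA-wide flag plus one bit per page. */
struct ex_vma {
	bool vm_soft_dirty;		/* VMA-level soft-dirty flag */
	size_t n_pages;
	const bool *pte_soft_dirty;	/* per-page soft-dirty bits */
};

/*
 * no_reused_regions models PAGEMAP_NO_REUSED_REGIONS: when set, the
 * VMA-level flag is ignored and only the per-page bit decides.
 */
static bool page_reported_dirty(const struct ex_vma *vma, size_t idx,
				bool no_reused_regions)
{
	assert(vma != NULL && idx < vma->n_pages);
	if (!no_reused_regions && vma->vm_soft_dirty)
		return true;
	return vma->pte_soft_dirty[idx];
}
```

In this model, a VMA whose flag was set again by a merge reports every
page as dirty in the default view, whereas with the flag only the pages
whose own bit is set are reported.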
>
> [1] https://lore.kernel.org/lkml/54d4c322-cd6e-eefd-b161-2af2b56aae24@collabora.com/
> [2] https://lore.kernel.org/all/YyiDg79flhWoMDZB@gmail.com/
> [3] https://lore.kernel.org/all/20221014134802.1361436-1-mdanylo@google.com/
>
> Regards,
> Muhammad Usama Anjum
>
> Muhammad Usama Anjum (4):
> userfaultfd: Add UFFD WP Async support
> userfaultfd: split mwriteprotect_range()
> fs/proc/task_mmu: Implement IOCTL to get and/or the clear info about
> PTEs
> selftests: vm: add pagemap ioctl tests
>
> fs/proc/task_mmu.c | 300 +++++++
> fs/userfaultfd.c | 161 ++--
> include/linux/userfaultfd_k.h | 10 +
> include/uapi/linux/fs.h | 50 ++
> include/uapi/linux/userfaultfd.h | 6 +
> mm/userfaultfd.c | 40 +-
> tools/include/uapi/linux/fs.h | 50 ++
> tools/testing/selftests/vm/.gitignore | 1 +
> tools/testing/selftests/vm/Makefile | 5 +-
> tools/testing/selftests/vm/pagemap_ioctl.c | 884 +++++++++++++++++++++
> 10 files changed, 1424 insertions(+), 83 deletions(-)
> create mode 100644 tools/testing/selftests/vm/pagemap_ioctl.c
>
--
BR,
Muhammad Usama Anjum