* [PATCH 3/5] mm: wire up process_memwatch syscall for x86
[not found] <20220726161854.276359-1-usama.anjum@collabora.com>
@ 2022-07-26 16:18 ` Muhammad Usama Anjum
2022-08-10 8:45 ` [PATCH 0/5] Add process_memwatch syscall Muhammad Usama Anjum
` (2 subsequent siblings)
3 siblings, 0 replies; 6+ messages in thread
From: Muhammad Usama Anjum @ 2022-07-26 16:18 UTC
To: Jonathan Corbet, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen,
maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
H. Peter Anvin, Arnd Bergmann, Andrew Morton, Peter Zijlstra,
Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
Jiri Olsa, Namhyung Kim, Shuah Khan, open list:DOCUMENTATION,
open list, open list:PROC FILESYSTEM, open list:ABI/API,
open list:GENERIC INCLUDE/ASM HEADER FILES,
open list:MEMORY MANAGEMENT,
open list:PERFORMANCE EVENTS SUBSYSTEM,
open list:KERNEL SELFTEST FRAMEWORK, krisman
Cc: Muhammad Usama Anjum, kernel
Wire up syscall entry point for both i386 and x86_64 architectures.
Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
---
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
include/linux/syscalls.h | 3 ++-
include/uapi/asm-generic/unistd.h | 5 ++++-
kernel/sys_ni.c | 1 +
tools/include/uapi/asm-generic/unistd.h | 5 ++++-
tools/perf/arch/x86/entry/syscalls/syscall_64.tbl | 1 +
7 files changed, 14 insertions(+), 3 deletions(-)
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 320480a8db4f..601d33909880 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -455,3 +455,4 @@
448 i386 process_mrelease sys_process_mrelease
449 i386 futex_waitv sys_futex_waitv
450 i386 set_mempolicy_home_node sys_set_mempolicy_home_node
+451 i386 process_memwatch sys_process_memwatch
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index c84d12608cd2..3bddea588ce7 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -372,6 +372,7 @@
448 common process_mrelease sys_process_mrelease
449 common futex_waitv sys_futex_waitv
450 common set_mempolicy_home_node sys_set_mempolicy_home_node
+451 common process_memwatch sys_process_memwatch
#
# Due to a historical design error, certain syscalls are numbered differently
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index a34b0f9a9972..efa240510e4c 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -939,7 +939,6 @@ asmlinkage long sys_move_pages(pid_t pid, unsigned long nr_pages,
const int __user *nodes,
int __user *status,
int flags);
-
asmlinkage long sys_rt_tgsigqueueinfo(pid_t tgid, pid_t pid, int sig,
siginfo_t __user *uinfo);
asmlinkage long sys_perf_event_open(
@@ -1056,6 +1055,8 @@ asmlinkage long sys_memfd_secret(unsigned int flags);
asmlinkage long sys_set_mempolicy_home_node(unsigned long start, unsigned long len,
unsigned long home_node,
unsigned long flags);
+asmlinkage long sys_process_memwatch(int pidfd, void __user *addr, int len,
+ unsigned int flags, loff_t __user *vec, int vec_len);
/*
* Architecture-specific system calls
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 45fa180cc56a..805a8d5cf0c4 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -886,8 +886,11 @@ __SYSCALL(__NR_futex_waitv, sys_futex_waitv)
#define __NR_set_mempolicy_home_node 450
__SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node)
+#define __NR_process_memwatch 451
+__SC_COMP(__NR_process_memwatch, sys_process_memwatch, compat_sys_process_memwatch)
+
#undef __NR_syscalls
-#define __NR_syscalls 451
+#define __NR_syscalls 452
/*
* 32 bit systems traditionally used different
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index a492f159624f..74f31317481a 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -298,6 +298,7 @@ COND_SYSCALL(set_mempolicy);
COND_SYSCALL(migrate_pages);
COND_SYSCALL(move_pages);
COND_SYSCALL(set_mempolicy_home_node);
+COND_SYSCALL(process_memwatch);
COND_SYSCALL(perf_event_open);
COND_SYSCALL(accept4);
diff --git a/tools/include/uapi/asm-generic/unistd.h b/tools/include/uapi/asm-generic/unistd.h
index 45fa180cc56a..805a8d5cf0c4 100644
--- a/tools/include/uapi/asm-generic/unistd.h
+++ b/tools/include/uapi/asm-generic/unistd.h
@@ -886,8 +886,11 @@ __SYSCALL(__NR_futex_waitv, sys_futex_waitv)
#define __NR_set_mempolicy_home_node 450
__SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node)
+#define __NR_process_memwatch 451
+__SC_COMP(__NR_process_memwatch, sys_process_memwatch, compat_sys_process_memwatch)
+
#undef __NR_syscalls
-#define __NR_syscalls 451
+#define __NR_syscalls 452
/*
* 32 bit systems traditionally used different
diff --git a/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl b/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl
index c84d12608cd2..3bddea588ce7 100644
--- a/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl
@@ -372,6 +372,7 @@
448 common process_mrelease sys_process_mrelease
449 common futex_waitv sys_futex_waitv
450 common set_mempolicy_home_node sys_set_mempolicy_home_node
+451 common process_memwatch sys_process_memwatch
#
# Due to a historical design error, certain syscalls are numbered differently
--
2.30.2
* Re: [PATCH 0/5] Add process_memwatch syscall
[not found] <20220726161854.276359-1-usama.anjum@collabora.com>
2022-07-26 16:18 ` [PATCH 3/5] mm: wire up process_memwatch syscall for x86 Muhammad Usama Anjum
@ 2022-08-10 8:45 ` Muhammad Usama Anjum
2022-08-10 9:03 ` David Hildenbrand
2022-08-10 9:22 ` Peter.Enderborg
3 siblings, 0 replies; 6+ messages in thread
From: Muhammad Usama Anjum @ 2022-08-10 8:45 UTC
To: Jonathan Corbet, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen,
maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
H. Peter Anvin, Arnd Bergmann, Andrew Morton, Peter Zijlstra,
Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
Jiri Olsa, Namhyung Kim, Shuah Khan, open list:DOCUMENTATION,
open list, open list:PROC FILESYSTEM, open list:ABI/API,
open list:GENERIC INCLUDE/ASM HEADER FILES,
open list:MEMORY MANAGEMENT,
open list:PERFORMANCE EVENTS SUBSYSTEM,
open list:KERNEL SELFTEST FRAMEWORK, krisman
Cc: usama.anjum, kernel
On 7/26/22 9:18 PM, Muhammad Usama Anjum wrote:
> Hello,
>
> This patch series implements a new syscall, process_memwatch. Currently,
> only the support to watch soft-dirty PTE bit is added. This syscall is
> generic to watch the memory of the process. There is enough room to add
> more operations like this to watch memory in the future.
>
> Soft-dirty PTE bit of the memory pages can be viewed by using pagemap
> procfs file. The soft-dirty PTE bit for the memory in a process can be
> cleared by writing to the clear_refs file. This series adds features that
> weren't possible through the Proc FS interface.
> - There is no atomic get soft-dirty PTE bit status and clear operation
> possible.
> - The soft-dirty PTE bit of only a part of memory cannot be cleared.
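For reference, the existing procfs interface described above can be sketched roughly as below. This relies on the documented pagemap layout (bit 55 of each 64-bit entry is the soft-dirty flag) and on writing "4" to clear_refs, which clears the soft-dirty bits for the whole task. The helper names are made up for illustration and error handling is minimal.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Bit 55 of a pagemap entry: the PTE is soft-dirty. */
#define PM_SOFT_DIRTY_BIT 55

/* Byte offset of the 64-bit pagemap entry that describes vaddr. */
static off_t pagemap_offset(uintptr_t vaddr, long page_size)
{
	assert(page_size > 0);
	return (off_t)(vaddr / (uintptr_t)page_size) * (off_t)sizeof(uint64_t);
}

/* Extract the soft-dirty flag from a raw pagemap entry. */
static int pagemap_entry_soft_dirty(uint64_t entry)
{
	return (int)((entry >> PM_SOFT_DIRTY_BIT) & 1);
}

/* Returns 1 if the page holding addr is soft-dirty, 0 if not, -1 on error. */
static int page_is_soft_dirty(const void *addr)
{
	uint64_t entry;
	ssize_t n;
	int fd = open("/proc/self/pagemap", O_RDONLY);

	if (fd < 0)
		return -1;
	n = pread(fd, &entry, sizeof(entry),
		  pagemap_offset((uintptr_t)addr, sysconf(_SC_PAGESIZE)));
	close(fd);
	if (n != (ssize_t)sizeof(entry))
		return -1;
	return pagemap_entry_soft_dirty(entry);
}

/* Clears soft-dirty for the whole task; no address range can be given. */
static int clear_soft_dirty_all(void)
{
	ssize_t n;
	int fd = open("/proc/self/clear_refs", O_WRONLY);

	if (fd < 0)
		return -1;
	n = write(fd, "4", 1);
	close(fd);
	return n == 1 ? 0 : -1;
}
```

Note that the read and the clear are two separate operations on two separate files, so a write landing between them goes unnoticed. That is the gap the two bullet points above refer to.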
>
> Historically, soft-dirty PTE bit tracking has been used in the CRIU
> project. The Proc FS interface is enough for that as I think the process
> is frozen. We have the use case where we need to track the soft-dirty
> PTE bit for running processes. We need this tracking and clear mechanism
> of a region of memory while the process is running to emulate the
> getWriteWatch() syscall of Windows. This syscall is used by games to keep
> track of dirty pages and keep processing only the dirty pages. This
> syscall can be used by the CRIU project and other applications which
> require soft-dirty PTE bit information.
>
> As in the current kernel there is no way to clear a part of memory (instead
> of clearing the Soft-Dirty bits for the entire process) and get+clear
> operation cannot be performed atomically, there are other methods to mimic
> this information entirely in userspace with poor performance:
> - The mprotect syscall and SIGSEGV handler for bookkeeping
> - The userfaultfd syscall with the handler for bookkeeping
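To make the first userspace alternative concrete, here is a minimal sketch of write tracking with mprotect() and a SIGSEGV handler. It is an illustration only, not code from this series: the function and variable names are invented, it tracks a single region, and it calls mprotect() from a signal handler, which works on Linux but is not on the POSIX async-signal-safe list.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define MAX_TRACKED_PAGES 1024

/* Hypothetical bookkeeping state for one tracked region. */
static char *track_base;
static size_t track_pages;
static size_t track_page_size;
static volatile sig_atomic_t page_dirty[MAX_TRACKED_PAGES];

/* First write to a protected page lands here: record it, then unprotect. */
static void on_write_fault(int sig, siginfo_t *info, void *ctx)
{
	uintptr_t addr = (uintptr_t)info->si_addr;
	uintptr_t base = (uintptr_t)track_base;
	size_t idx;

	(void)sig;
	(void)ctx;
	/* A fault outside the region is a real crash; a fuller handler
	 * would chain to the previous handler instead of exiting. */
	if (addr < base || addr >= base + track_pages * track_page_size)
		_exit(1);
	idx = (addr - base) / track_page_size;
	page_dirty[idx] = 1;
	if (mprotect(track_base + idx * track_page_size, track_page_size,
		     PROT_READ | PROT_WRITE) < 0)
		_exit(1);
}

/* Forget all recorded writes and write-protect the region again. */
static int write_watch_reset(void)
{
	size_t i;

	for (i = 0; i < track_pages; i++)
		page_dirty[i] = 0;
	return mprotect(track_base, track_pages * track_page_size, PROT_READ);
}

/* Start tracking npages pages at base (base must be page aligned). */
static int write_watch_start(void *base, size_t npages)
{
	struct sigaction sa;

	assert(npages <= MAX_TRACKED_PAGES);
	track_base = base;
	track_pages = npages;
	track_page_size = (size_t)sysconf(_SC_PAGESIZE);

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = on_write_fault;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGSEGV, &sa, NULL) < 0)
		return -1;
	return write_watch_reset();
}
```

Every first write to a page costs a fault, a signal delivery and an mprotect() call, and each reset costs another mprotect(). This is where the overhead of this approach comes from.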
>
> long process_memwatch(int pidfd, unsigned long start, int len,
> unsigned int flags, void *vec, int vec_len);
Any thoughts?
>
> This syscall can be used by the CRIU project and other applications which
> require soft-dirty PTE bit information. The following operations are
> supported in this syscall:
> - Get the pages that are soft-dirty.
> - Clear the pages which are soft-dirty.
> - The optional flag to ignore the VM_SOFTDIRTY and only track per page
> soft-dirty PTE bit
>
> There are two decisions which have been taken about how to get the output
> from the syscall.
> - Return offsets of the pages from the start in the vec
> - Stop execution when vec is filled with dirty pages
> These two decisions don't follow the mincore() philosophy, where the
> output array corresponds to the address range in one-to-one fashion, hence
> the output buffer length isn't passed and only a flag is set if the page
> is present. This makes mincore() easy to use, but with less control. We are
> passing the size of the output array and putting the return data
> consecutively, which is the offset of dirty pages from the start. The user
> can convert these offsets back into the dirty page addresses easily.
> Suppose the user wants to get the first 10 dirty pages from a total memory
> of 100 pages. He'll allocate an output buffer of size 10 and the
> process_memwatch() syscall will stop after finding the 10 pages. This
> behaviour is needed to support Windows' getWriteWatch(). Behaviour like
> mincore() can be achieved by passing an output buffer of size 100. This
> interface can be used for any desired behaviour.
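The output convention described above can be modelled in plain userspace C as below. This is only a model of the semantics, not the kernel implementation: the helper names are invented, a byte array stands in for the soft-dirty bits, and it assumes the offsets in vec are counted in pages and that the number of entries written is returned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Userspace model only: walk a per-page dirty map and store the offsets
 * (in pages, relative to the start of the range) of the dirty pages
 * consecutively in vec. Stops as soon as vec is full.
 * Returns the number of entries written.
 */
static int collect_dirty_offsets(const unsigned char *dirty, int npages,
				 long long *vec, int vec_len)
{
	int found = 0;
	int i;

	assert(npages >= 0 && vec_len >= 0);
	for (i = 0; i < npages && found < vec_len; i++)
		if (dirty[i])
			vec[found++] = i;
	return found;
}

/* Convert one returned page offset back into a virtual address. */
static uintptr_t offset_to_address(uintptr_t start, long long page_offset,
				   size_t page_size)
{
	return start + (uintptr_t)page_offset * page_size;
}
```

With a range of 100 pages, a vec of 10 entries yields at most the first 10 dirty pages, while a vec of 100 entries can never fill up early and so covers the whole range, which is the mincore()-like case.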
>
> Regards,
> Muhammad Usama Anjum
>
> Muhammad Usama Anjum (5):
> fs/proc/task_mmu: make functions global to be used in other files
> mm: Implement process_memwatch syscall
> mm: wire up process_memwatch syscall for x86
> selftests: vm: add process_memwatch syscall tests
> mm: add process_memwatch syscall documentation
>
> Documentation/admin-guide/mm/soft-dirty.rst | 48 +-
> arch/x86/entry/syscalls/syscall_32.tbl | 1 +
> arch/x86/entry/syscalls/syscall_64.tbl | 1 +
> fs/proc/task_mmu.c | 84 +--
> include/linux/mm_inline.h | 99 +++
> include/linux/syscalls.h | 3 +-
> include/uapi/asm-generic/unistd.h | 5 +-
> include/uapi/linux/memwatch.h | 12 +
> kernel/sys_ni.c | 1 +
> mm/Makefile | 2 +-
> mm/memwatch.c | 285 ++++++++
> tools/include/uapi/asm-generic/unistd.h | 5 +-
> .../arch/x86/entry/syscalls/syscall_64.tbl | 1 +
> tools/testing/selftests/vm/.gitignore | 1 +
> tools/testing/selftests/vm/Makefile | 2 +
> tools/testing/selftests/vm/memwatch_test.c | 635 ++++++++++++++++++
> 16 files changed, 1098 insertions(+), 87 deletions(-)
> create mode 100644 include/uapi/linux/memwatch.h
> create mode 100644 mm/memwatch.c
> create mode 100644 tools/testing/selftests/vm/memwatch_test.c
>
--
Muhammad Usama Anjum
* Re: [PATCH 0/5] Add process_memwatch syscall
[not found] <20220726161854.276359-1-usama.anjum@collabora.com>
2022-07-26 16:18 ` [PATCH 3/5] mm: wire up process_memwatch syscall for x86 Muhammad Usama Anjum
2022-08-10 8:45 ` [PATCH 0/5] Add process_memwatch syscall Muhammad Usama Anjum
@ 2022-08-10 9:03 ` David Hildenbrand
2022-08-10 17:05 ` Gabriel Krisman Bertazi
2022-08-10 9:22 ` Peter.Enderborg
3 siblings, 1 reply; 6+ messages in thread
From: David Hildenbrand @ 2022-08-10 9:03 UTC
To: Muhammad Usama Anjum, Jonathan Corbet, Andy Lutomirski,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
H. Peter Anvin, Arnd Bergmann, Andrew Morton, Peter Zijlstra,
Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
Jiri Olsa, Namhyung Kim, Shuah Khan, open list:DOCUMENTATION,
open list, open list:PROC FILESYSTEM, open list:ABI/API,
open list:GENERIC INCLUDE/ASM HEADER FILES,
open list:MEMORY MANAGEMENT,
open list:PERFORMANCE EVENTS SUBSYSTEM,
open list:KERNEL SELFTEST FRAMEWORK, krisman
Cc: kernel
On 26.07.22 18:18, Muhammad Usama Anjum wrote:
> Hello,
Hi,
>
> This patch series implements a new syscall, process_memwatch. Currently,
> only the support to watch soft-dirty PTE bit is added. This syscall is
> generic to watch the memory of the process. There is enough room to add
> more operations like this to watch memory in the future.
>
> Soft-dirty PTE bit of the memory pages can be viewed by using pagemap
> procfs file. The soft-dirty PTE bit for the memory in a process can be
> cleared by writing to the clear_refs file. This series adds features that
> weren't possible through the Proc FS interface.
> - There is no atomic get soft-dirty PTE bit status and clear operation
> possible.
Such an interface might be easy to add, no?
> - The soft-dirty PTE bit of only a part of memory cannot be cleared.
Same.
So I'm curious why we need a new syscall for that.
>
> Historically, soft-dirty PTE bit tracking has been used in the CRIU
> project. The Proc FS interface is enough for that as I think the process
> is frozen. We have the use case where we need to track the soft-dirty
> PTE bit for running processes. We need this tracking and clear mechanism
> of a region of memory while the process is running to emulate the
> getWriteWatch() syscall of Windows. This syscall is used by games to keep
> track of dirty pages and keep processing only the dirty pages. This
> syscall can be used by the CRIU project and other applications which
> require soft-dirty PTE bit information.
>
> As in the current kernel there is no way to clear a part of memory (instead
> of clearing the Soft-Dirty bits for the entire process) and get+clear
> operation cannot be performed atomically, there are other methods to mimic
> this information entirely in userspace with poor performance:
> - The mprotect syscall and SIGSEGV handler for bookkeeping
> - The userfaultfd syscall with the handler for bookkeeping
You write "poor performance". Did you actually implement a prototype
using userfaultfd-wp? Can you share numbers for comparison?
Adding a new syscall just for handling a corner case feature
(soft-dirty, which we all love, of course) needs good justification.
>
> long process_memwatch(int pidfd, unsigned long start, int len,
> unsigned int flags, void *vec, int vec_len);
>
> This syscall can be used by the CRIU project and other applications which
> require soft-dirty PTE bit information. The following operations are
> supported in this syscall:
> - Get the pages that are soft-dirty.
> - Clear the pages which are soft-dirty.
> - The optional flag to ignore the VM_SOFTDIRTY and only track per page
> soft-dirty PTE bit
Huh, why? VM_SOFTDIRTY is an internal implementation detail and should
remain such.
VM_SOFTDIRTY translates to "all pages in this VMA are soft-dirty".
--
Thanks,
David / dhildenb
* Re: [PATCH 0/5] Add process_memwatch syscall
[not found] <20220726161854.276359-1-usama.anjum@collabora.com>
` (2 preceding siblings ...)
2022-08-10 9:03 ` David Hildenbrand
@ 2022-08-10 9:22 ` Peter.Enderborg
2022-08-10 16:53 ` Gabriel Krisman Bertazi
3 siblings, 1 reply; 6+ messages in thread
From: Peter.Enderborg @ 2022-08-10 9:22 UTC
To: Muhammad Usama Anjum, Jonathan Corbet, Andy Lutomirski,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
H. Peter Anvin, Arnd Bergmann, Andrew Morton, Peter Zijlstra,
Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
Jiri Olsa, Namhyung Kim, Shuah Khan, open list:DOCUMENTATION,
open list, open list:PROC FILESYSTEM, open list:ABI/API,
open list:GENERIC INCLUDE/ASM HEADER FILES,
open list:MEMORY MANAGEMENT,
open list:PERFORMANCE EVENTS SUBSYSTEM,
open list:KERNEL SELFTEST FRAMEWORK, krisman
Cc: kernel
On 7/26/22 18:18, Muhammad Usama Anjum wrote:
> Hello,
>
> This patch series implements a new syscall, process_memwatch. Currently,
> only the support to watch soft-dirty PTE bit is added. This syscall is
> generic to watch the memory of the process. There is enough room to add
> more operations like this to watch memory in the future.
>
> Soft-dirty PTE bit of the memory pages can be viewed by using pagemap
> procfs file. The soft-dirty PTE bit for the memory in a process can be
> cleared by writing to the clear_refs file. This series adds features that
> weren't possible through the Proc FS interface.
> - There is no atomic get soft-dirty PTE bit status and clear operation
> possible.
> - The soft-dirty PTE bit of only a part of memory cannot be cleared.
>
> Historically, soft-dirty PTE bit tracking has been used in the CRIU
> project. The Proc FS interface is enough for that as I think the process
> is frozen. We have the use case where we need to track the soft-dirty
> PTE bit for running processes. We need this tracking and clear mechanism
> of a region of memory while the process is running to emulate the
> getWriteWatch() syscall of Windows. This syscall is used by games to keep
> track of dirty pages and keep processing only the dirty pages. This
> syscall can be used by the CRIU project and other applications which
> require soft-dirty PTE bit information.
>
> As in the current kernel there is no way to clear a part of memory (instead
> of clearing the Soft-Dirty bits for the entire process) and get+clear
> operation cannot be performed atomically, there are other methods to mimic
> this information entirely in userspace with poor performance:
> - The mprotect syscall and SIGSEGV handler for bookkeeping
> - The userfaultfd syscall with the handler for bookkeeping
>
> long process_memwatch(int pidfd, unsigned long start, int len,
> unsigned int flags, void *vec, int vec_len);
>
> This syscall can be used by the CRIU project and other applications which
> require soft-dirty PTE bit information. The following operations are
> supported in this syscall:
> - Get the pages that are soft-dirty.
> - Clear the pages which are soft-dirty.
> - The optional flag to ignore the VM_SOFTDIRTY and only track per page
> soft-dirty PTE bit
>
Why can it not be done as an IOCTL?
> There are two decisions which have been taken about how to get the output
> from the syscall.
> - Return offsets of the pages from the start in the vec
> - Stop execution when vec is filled with dirty pages
> These two decisions don't follow the mincore() philosophy, where the
> output array corresponds to the address range in one-to-one fashion, hence
> the output buffer length isn't passed and only a flag is set if the page
> is present. This makes mincore() easy to use, but with less control. We are
> passing the size of the output array and putting the return data
> consecutively, which is the offset of dirty pages from the start. The user
> can convert these offsets back into the dirty page addresses easily.
> Suppose the user wants to get the first 10 dirty pages from a total memory
> of 100 pages. He'll allocate an output buffer of size 10 and the
> process_memwatch() syscall will stop after finding the 10 pages. This
> behaviour is needed to support Windows' getWriteWatch(). Behaviour like
> mincore() can be achieved by passing an output buffer of size 100. This
> interface can be used for any desired behaviour.
>
> Regards,
> Muhammad Usama Anjum
>
> Muhammad Usama Anjum (5):
> fs/proc/task_mmu: make functions global to be used in other files
> mm: Implement process_memwatch syscall
> mm: wire up process_memwatch syscall for x86
> selftests: vm: add process_memwatch syscall tests
> mm: add process_memwatch syscall documentation
>
> Documentation/admin-guide/mm/soft-dirty.rst | 48 +-
> arch/x86/entry/syscalls/syscall_32.tbl | 1 +
> arch/x86/entry/syscalls/syscall_64.tbl | 1 +
> fs/proc/task_mmu.c | 84 +--
> include/linux/mm_inline.h | 99 +++
> include/linux/syscalls.h | 3 +-
> include/uapi/asm-generic/unistd.h | 5 +-
> include/uapi/linux/memwatch.h | 12 +
> kernel/sys_ni.c | 1 +
> mm/Makefile | 2 +-
> mm/memwatch.c | 285 ++++++++
> tools/include/uapi/asm-generic/unistd.h | 5 +-
> .../arch/x86/entry/syscalls/syscall_64.tbl | 1 +
> tools/testing/selftests/vm/.gitignore | 1 +
> tools/testing/selftests/vm/Makefile | 2 +
> tools/testing/selftests/vm/memwatch_test.c | 635 ++++++++++++++++++
> 16 files changed, 1098 insertions(+), 87 deletions(-)
> create mode 100644 include/uapi/linux/memwatch.h
> create mode 100644 mm/memwatch.c
> create mode 100644 tools/testing/selftests/vm/memwatch_test.c
>
* Re: [PATCH 0/5] Add process_memwatch syscall
2022-08-10 9:22 ` Peter.Enderborg
@ 2022-08-10 16:53 ` Gabriel Krisman Bertazi
0 siblings, 0 replies; 6+ messages in thread
From: Gabriel Krisman Bertazi @ 2022-08-10 16:53 UTC
To: Peter.Enderborg
Cc: Muhammad Usama Anjum, Jonathan Corbet, Andy Lutomirski,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
H. Peter Anvin, Arnd Bergmann, Andrew Morton, Peter Zijlstra,
Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
Jiri Olsa, Namhyung Kim, Shuah Khan, open list:DOCUMENTATION,
open list, open list:PROC FILESYSTEM, open list:ABI/API,
open list:GENERIC INCLUDE/ASM HEADER FILES,
open list:MEMORY MANAGEMENT,
open list:PERFORMANCE EVENTS SUBSYSTEM,
open list:KERNEL SELFTEST FRAMEWORK, kernel
"Peter.Enderborg@sony.com" <Peter.Enderborg@sony.com> writes:
>>
>> This syscall can be used by the CRIU project and other applications which
>> require soft-dirty PTE bit information. The following operations are
>> supported in this syscall:
>> - Get the pages that are soft-dirty.
>> - Clear the pages which are soft-dirty.
>> - The optional flag to ignore the VM_SOFTDIRTY and only track per page
>> soft-dirty PTE bit
>>
Hi Peter,
(For context, I wrote a previous version of this patch and have been
working with Usama on the current patch).
> Why can it not be done as an IOCTL?
Considering an ioctl is basically a namespaced syscall with extra steps,
surely we can do it :) There are a few reasons we haven't, though:
1) ioctl auditing/controlling is much harder than for a syscall
2) There is a concern for performance, since this might be executed
frequently by Windows applications running over Wine. There is an extra
cost with unnecessary copy_[from/to]_user that we wanted to avoid, even
though we haven't measured it.
3) I originally wrote this at the time process_madvise was merged. I
felt it fits the same kind of interface exposed by
process_madvise/process_mrelease, recently merged.
4) Not obvious whether the ioctl would be against pagemap/clear_refs.
Neither file name describes both input and output semantics.
Obviously, all of those reasons can be worked around, and we can turn
this into an ioctl.
Thanks,
--
Gabriel Krisman Bertazi
* Re: [PATCH 0/5] Add process_memwatch syscall
2022-08-10 9:03 ` David Hildenbrand
@ 2022-08-10 17:05 ` Gabriel Krisman Bertazi
0 siblings, 0 replies; 6+ messages in thread
From: Gabriel Krisman Bertazi @ 2022-08-10 17:05 UTC
To: David Hildenbrand
Cc: Muhammad Usama Anjum, Jonathan Corbet, Andy Lutomirski,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
H. Peter Anvin, Arnd Bergmann, Andrew Morton, Peter Zijlstra,
Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
Jiri Olsa, Namhyung Kim, Shuah Khan, open list:DOCUMENTATION,
open list, open list:PROC FILESYSTEM, open list:ABI/API,
open list:GENERIC INCLUDE/ASM HEADER FILES,
open list:MEMORY MANAGEMENT,
open list:PERFORMANCE EVENTS SUBSYSTEM,
open list:KERNEL SELFTEST FRAMEWORK, kernel
David Hildenbrand <david@redhat.com> writes:
> On 26.07.22 18:18, Muhammad Usama Anjum wrote:
>> Hello,
>
> Hi,
>
>>
>> This patch series implements a new syscall, process_memwatch. Currently,
>> only the support to watch soft-dirty PTE bit is added. This syscall is
>> generic to watch the memory of the process. There is enough room to add
>> more operations like this to watch memory in the future.
>>
>> Soft-dirty PTE bit of the memory pages can be viewed by using pagemap
>> procfs file. The soft-dirty PTE bit for the memory in a process can be
>> cleared by writing to the clear_refs file. This series adds features that
>> weren't possible through the Proc FS interface.
>> - There is no atomic get soft-dirty PTE bit status and clear operation
>> possible.
>
> Such an interface might be easy to add, no?
>
>> - The soft-dirty PTE bit of only a part of memory cannot be cleared.
>
> Same.
>
> So I'm curious why we need a new syscall for that.
Hi David,
Yes, sure. Though it has to be through an ioctl, since we need both input
and output semantics in the same call to keep the atomic semantics.
I answered Peter Enderborg about our concerns when turning this into an
ioctl. But they are possible to overcome.
>> project. The Proc FS interface is enough for that as I think the process
>> is frozen. We have the use case where we need to track the soft-dirty
>> PTE bit for running processes. We need this tracking and clear mechanism
>> of a region of memory while the process is running to emulate the
>> getWriteWatch() syscall of Windows. This syscall is used by games to keep
>> track of dirty pages and keep processing only the dirty pages. This
>> syscall can be used by the CRIU project and other applications which
>> require soft-dirty PTE bit information.
>>
>> As in the current kernel there is no way to clear a part of memory (instead
>> of clearing the Soft-Dirty bits for the entire process) and get+clear
>> operation cannot be performed atomically, there are other methods to mimic
>> this information entirely in userspace with poor performance:
>> - The mprotect syscall and SIGSEGV handler for bookkeeping
>> - The userfaultfd syscall with the handler for bookkeeping
>
> You write "poor performance". Did you actually implement a prototype
> using userfaultfd-wp? Can you share numbers for comparison?
Yes, we did. I think Usama can share some numbers.
The problem with userfaultfd, as far as I understand, is that it will
require a second userspace process to be called in order to handle the
annotation that a page was touched, before remapping the page to make it
accessible to the originating process, every time a page is touched.
This context switch is prohibitively expensive for our use case, where
Windows applications might invoke it quite often. The soft-dirty bit,
instead, allows the page tracking to be done entirely in kernelspace.
If I understand correctly, userfaultfd is useful for VM/container
migration, where the cost of the context switch is not a real concern,
since there are much bigger costs from the migration itself.
Maybe we're missing some feature about userfaultfd that would allow us
to avoid the cost, but from our observations we didn't find a way to
overcome it.
>> long process_memwatch(int pidfd, unsigned long start, int len,
>> unsigned int flags, void *vec, int vec_len);
>>
>> This syscall can be used by the CRIU project and other applications which
>> require soft-dirty PTE bit information. The following operations are
>> supported in this syscall:
>> - Get the pages that are soft-dirty.
>> - Clear the pages which are soft-dirty.
>> - The optional flag to ignore the VM_SOFTDIRTY and only track per page
>> soft-dirty PTE bit
>
> Huh, why? VM_SOFTDIRTY is an internal implementation detail and should
> remain such.
> VM_SOFTDIRTY translates to "all pages in this VMA are soft-dirty".
That is something very specific about our use case, and we should
explain it a bit better. The problem is that VM_SOFTDIRTY modifications
introduce the overhead of the mm write lock acquisition, which is very
visible in our benchmarks of Windows games running over Wine.
Since the main reason for VM_SOFTDIRTY to exist, as far as we understand
it, is to track vma remapping, and this is a use case we don't need to
worry about when implementing Windows semantics, we'd like to be able to
avoid this extra overhead, optionally, iff userspace knows it can be
done safely.
VM_SOFTDIRTY is indeed an internal interface. Which is why we are
proposing to expose the feature in terms of tracking VMA reuse.
Thanks,
--
Gabriel Krisman Bertazi