Add the fincore() system call. fincore() is mincore() for file descriptors. The functionality of fincore() can be emulated with an mmap(), mincore(), and munmap(), but this emulation requires more system calls and requires page table modifications. fincore() can provide a significant performance improvement for non-sequential in-core queries. CC: Andi Kleen CC: Wu Fengguang CC: Andrew Morton Signed-off-by: Chris Frost Signed-off-by: Steve VanDeBogart --- Andi, Wu, and Andrew, Thanks for the feedback. I have incorporated most of your suggestions into this patch. A man page for fincore(2) is attached. A few questions about the suggestions: * Early return when a signal is queued Andi pointed out that it would be good to return early if a signal is queued for the calling process. I see two options to do this: 1. Return -EINTR (the patch in this email takes this approach). 2. Return the results discovered so far. And the tradeoffs: 1. Returning -EINTR abandons good work. 2. fincore needs to inform the caller of the number of valid vec entries. Three approaches: - Return the number of file bytes with valid page entries. Issue: The return type is a long. For a 32 bit process, sizeof(long) is significantly less than sizeof(loff_t). - Return the number of pages with valid page entries (use type ssize_t). Issue: I don't see any significant issues. - Add an 'loff_t *result' parameter to the system call (a la _llseek). Issue: does this push the number of arguments too high? Given the above, I feel that returning the number of valid page entries is the best approach. Feedback? * Radix gang lookup Andi and Andrew suggested using radix gang lookups for larger ranges. My benchmarks show that my change to do this isn't a strict performance win. Attached is my patch that changes the below code to do gang lookups. Here are the microbenchmark results for querying a 1 GiB file (results are times in seconds): density | none in-core | many in-core | pages/syscall | 1 | 8192 | 1 | 8192 | --------------+------+-------+------+-------| fincore | .055 | .0045 | .078 | .0250 | fincore-batch | .059 | .0010 | .123 | .0170 | "none in-core" means none of the file pages were in ram and "many in-core" that the majority were in ram. the next row indicates how many pages are queried in each fincore(2) call by the process (1 or 8192 pages). These results show that fincore-batch is ~2-5x faster for large count calls, but that it is slower for small count calls. Both classes of calls are made by our macrobenchmarks. The macrobenchmarks (a SQLite query and a GIMP image blur) show the batch version slows performance by 2-12%. Perhaps my gang patch can be further optimized, or is incorrect? (Of note, I did not avoid the gets and puts.) If not, fincore_pages() could be changed to decide choose which of a single vs. gang lookup to do based on a heuristic tradeoff value. Or it could stay simple and only make single lookups. * Kernel output buffer size Andi pointed out that the optimization to avoid allocating a buffer page for small queries could be changed to claim more than 64 bytes of on-stack buffer space. I haven't seen increasing this space to significantly affect my particular macrobenchmarks. I'm ok with leaving it at 64 bytes or changing it; this number was picked fairly arbitrarily. * Upper query size limit Andi suggested imposing a reasonable upper limit on the number of pages that can be queried in a single fincore call, because the process would not be debuggable while the system call is executing. Andi, could you cite an example system call for me to look at that takes this approach? * System call argument feasibility across architectures The fincore(2) parameters are now reordered so that they should fit in six registers on all platforms. I intend for the libc wrapper to expose the call as shown in the man page, rather than the system call's ordering. I think this issue is now resolved, but wanted to mention it just in case I missed anything; does this look good for all architectures? include/linux/syscalls.h | 2 fs/fincore.c | 126 +++++++++++++++++++++++++++++++++++++ fs/Makefile | 2 include/asm-generic/unistd.h | 5 + arch/x86/ia32/ia32entry.S | 1 arch/x86/include/asm/unistd_32.h | 3 arch/x86/include/asm/unistd_64.h | 2 arch/x86/kernel/syscall_table_32.S | 1 8 files changed, 139 insertions(+), 3 deletions(-) diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index a990ace..814e4f5 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -534,6 +534,8 @@ asmlinkage long sys_munlockall(void); asmlinkage long sys_madvise(unsigned long start, size_t len, int behavior); asmlinkage long sys_mincore(unsigned long start, size_t len, unsigned char __user * vec); +asmlinkage long sys_fincore(unsigned int fd, unsigned char __user *vec, + loff_t start, loff_t len); asmlinkage long sys_pivot_root(const char __user *new_root, const char __user *put_old); diff --git a/fs/fincore.c b/fs/fincore.c new file mode 100644 index 0000000..f329fe4 --- /dev/null +++ b/fs/fincore.c @@ -0,0 +1,126 @@ +/* + * fs/fincore.c + * + * Copyright (C) 2009, 2010 Chris Frost, UC Regents + * Copyright (C) 2008 Steve VanDeBogart, UC Regents + */ + +/* + * The fincore() system call. + */ +#include +#include +#include +#include +#include + +static unsigned char fincore_page(struct address_space *mapping, pgoff_t pgoff) +{ + unsigned char present = 0; + struct page *page = find_get_page(mapping, pgoff); + if (page) { + present = PageUptodate(page); + page_cache_release(page); + } + + return present; +} + +/* + * The fincore(2) system call. + * + * fincore() returns the memory residency status of the pages backing + * a file range specified by fd and [start, start + len). + * The status is returned in a vector of bytes. The least significant + * bit of each byte is 1 if the referenced page is in memory, otherwise + * it is zero. + * + * Because the status of a page can change after fincore() checks it + * but before it returns to the application, the returned vector may + * contain stale information. Only locked pages are guaranteed to + * remain in memory. + * + * Note that the parameter order for this system calls differ from the order + * for the libc wrapper. This syscall order allows the parameters to fit + * in six registers on all architectures. + * + * return values: + * zero - success + * -EBADF - fd is an illegal file descriptor + * -EFAULT - vec points to an illegal address + * -EINVAL - start is not a multiple of PAGE_CACHE_SIZE + * -EAGAIN - A kernel resource was temporarily unavailable + * -EINTR - The call was interrupted by a signal + */ +SYSCALL_DEFINE4(fincore, unsigned int, fd, unsigned char __user *, vec, + loff_t, start, loff_t, len) +{ + long retval; + pgoff_t pgoff = start >> PAGE_SHIFT; + pgoff_t npages = (len + PAGE_SIZE - 1) >> PAGE_SHIFT; + pgoff_t pgend = pgoff + npages; + struct file *filp; + int fput_needed; + loff_t file_nbytes; + pgoff_t file_npages; + unsigned char *kernel_vec = NULL; + unsigned char kernel_vec_small[64]; + unsigned kernel_vec_count; + int i; + + /* Check the start address: needs to be page-aligned.. */ + if (start & ~PAGE_CACHE_MASK) + return -EINVAL; + + filp = fget_light(fd, &fput_needed); + if (!filp) + return -EBADF; + + file_nbytes = i_size_read(filp->f_mapping->host); + + file_npages = (file_nbytes + PAGE_SIZE - 1) >> PAGE_SHIFT; + + /* + * Allocate buffer vector page. + * Optimize allocation for small values of npages because the + * __get_free_page() call doubles fincore(2) runtime when npages == 1. + */ + if (npages <= sizeof(kernel_vec_small)) { + kernel_vec = kernel_vec_small; + kernel_vec_count = sizeof(kernel_vec_small); + } else { + kernel_vec = (void *) __get_free_page(GFP_USER); + if (!kernel_vec) { + retval = -EAGAIN; + goto done; + } + kernel_vec_count = PAGE_SIZE; + } + + while (pgoff < pgend) { + /* + * Do at most kernel_vec_count entries per iteration, due to + * the limited buffer size. + */ + for (i = 0; pgoff < pgend && i < kernel_vec_count; pgoff++, i++) + kernel_vec[i] = fincore_page(filp->f_mapping, pgoff); + + if (copy_to_user(vec, kernel_vec, i)) { + retval = -EFAULT; + break; + } + vec += i; + + if (signal_pending(current)) { + retval = -EINTR; + break; + } + cond_resched(); + } + retval = 0; +done: + if (kernel_vec && kernel_vec != kernel_vec_small) + free_page((unsigned long) kernel_vec); + fput_light(filp, fput_needed); + return retval; +} diff --git a/fs/Makefile b/fs/Makefile index af6d047..a3ccd6b 100644 --- a/fs/Makefile +++ b/fs/Makefile @@ -11,7 +11,7 @@ obj-y := open.o read_write.o file_table.o super.o \ attr.o bad_inode.o file.o filesystems.o namespace.o \ seq_file.o xattr.o libfs.o fs-writeback.o \ pnode.o drop_caches.o splice.o sync.o utimes.o \ - stack.o fs_struct.o + stack.o fs_struct.o fincore.o ifeq ($(CONFIG_BLOCK),y) obj-y += buffer.o bio.o block_dev.o direct-io.o mpage.o ioprio.o diff --git a/include/asm-generic/unistd.h b/include/asm-generic/unistd.h index d76b66a..ce76dc8 100644 --- a/include/asm-generic/unistd.h +++ b/include/asm-generic/unistd.h @@ -623,8 +623,11 @@ __SYSCALL(__NR_rt_tgsigqueueinfo, sys_rt_tgsigqueueinfo) #define __NR_perf_event_open 241 __SYSCALL(__NR_perf_event_open, sys_perf_event_open) +#define __NR_fincore 242 +__SYSCALL(__NR_fincore, sys_fincore) + #undef __NR_syscalls -#define __NR_syscalls 242 +#define __NR_syscalls 243 /* * All syscalls below here should go away really, diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S index 581b056..cbf96e6 100644 --- a/arch/x86/ia32/ia32entry.S +++ b/arch/x86/ia32/ia32entry.S @@ -841,4 +841,5 @@ ia32_sys_call_table: .quad compat_sys_pwritev .quad compat_sys_rt_tgsigqueueinfo /* 335 */ .quad sys_perf_event_open + .quad sys_fincore ia32_syscall_end: diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h index 6fb3c20..088b235 100644 --- a/arch/x86/include/asm/unistd_32.h +++ b/arch/x86/include/asm/unistd_32.h @@ -342,10 +342,11 @@ #define __NR_pwritev 334 #define __NR_rt_tgsigqueueinfo 335 #define __NR_perf_event_open 336 +#define __NR_fincore 337 #ifdef __KERNEL__ -#define NR_syscalls 337 +#define NR_syscalls 338 #define __ARCH_WANT_IPC_PARSE_VERSION #define __ARCH_WANT_OLD_READDIR diff --git a/arch/x86/include/asm/unistd_64.h b/arch/x86/include/asm/unistd_64.h index 8d3ad0a..ebc04b5 100644 --- a/arch/x86/include/asm/unistd_64.h +++ b/arch/x86/include/asm/unistd_64.h @@ -661,6 +661,8 @@ __SYSCALL(__NR_pwritev, sys_pwritev) __SYSCALL(__NR_rt_tgsigqueueinfo, sys_rt_tgsigqueueinfo) #define __NR_perf_event_open 298 __SYSCALL(__NR_perf_event_open, sys_perf_event_open) +#define __NR_fincore 299 +__SYSCALL(__NR_fincore, sys_fincore) #ifndef __NO_STUBS #define __ARCH_WANT_OLD_READDIR diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S index 0157cd2..1fdc8bc 100644 --- a/arch/x86/kernel/syscall_table_32.S +++ b/arch/x86/kernel/syscall_table_32.S @@ -336,3 +336,4 @@ ENTRY(sys_call_table) .long sys_pwritev .long sys_rt_tgsigqueueinfo /* 335 */ .long sys_perf_event_open + .long sys_fincore -- Chris Frost http://www.frostnet.net/chris/