On Tue, Feb 16, 2016 at 04:08:02PM -0800, Andrew Morton wrote:
> On Thu, 10 Dec 2015 16:03:37 -0800 Shaohua Li wrote:
>
> > In jemalloc, a free(3) doesn't immediately release the memory to the
> > OS even if the memory is page aligned and page sized, in the hope
> > that the memory can be reused soon. Later the virtual address space
> > becomes fragmented, and more and more free memory accumulates. If
> > the amount of free memory is large, jemalloc uses
> > madvise(MADV_DONTNEED) to actually release the memory back to the
> > OS.
> >
> > The madvise has significant overhead, particularly because of the
> > TLB flush. jemalloc does madvise for several virtual address ranges
> > at a time. Instead of calling madvise for each of the ranges, we
> > introduce a new syscall to purge memory for several ranges in one
> > call. This way we can merge the TLB flushes for all the ranges into
> > one big TLB flush. It also reduces mmap_sem locking and
> > kernel/userspace switching.
> >
> > I'm running a simple memory allocation benchmark. 32 threads do
> > random malloc/free/realloc.
>
> CPU count?  (Does that matter much?)

32. It does matter; the TLB flush overhead depends on the CPU count.

> > Corresponding jemalloc patch to utilize this API is attached.
>
> No it isn't ;)

Sorry, I attached it in the first post, but not this one. Attached is
the one I tested against this patch.

> Who maintains jemalloc?  Are they signed up to actually apply the
> patch?  It would be bad to add the patch to the kernel and then find
> that the jemalloc maintainers choose not to use it!

Jason Evans (cced) is the author of jemalloc. I talked to him before,
and he is very positive about this new syscall.

> > Without patch:
> > real    0m18.923s
> > user    1m11.819s
> > sys     7m44.626s
> > Each CPU gets around 3000k TLB flush interrupts per second. perf
> > shows the TLB flush is among the hottest functions. mmap_sem read
> > locking (because of page faults) is also heavy.
> >
> > With patch:
> > real    0m15.026s
> > user    0m48.548s
> > sys     6m41.153s
> > Each CPU gets around 140k TLB flush interrupts per second. The TLB
> > flush isn't hot at all; mmap_sem read locking (still because of
> > page faults) becomes the sole hot spot.
>
> This is a somewhat underwhelming improvement, given that it's a
> synthetic microbenchmark.

Yes, this test does malloc, free, calloc and realloc, so it doesn't
only benchmark the madvisev.

> > Another test mallocs a bunch of memory in 48 threads, then all
> > threads free the memory. I measure the time of the memory free.
> > Without patch: 34.332s
> > With patch:    17.429s
>
> This is more whelming.
>
> Do we have a feel for how much benefit this patch will have for
> real-world workloads?  That's pretty important.

Sure, we'll post some real-world data.

> > MADV_FREE does the same TLB flush as MADV_NEED, so this also
> > applies to
>
> I'll do s/MADV_NEED/MADV_DONTNEED/
>
> > MADV_FREE. Other madvise types can have small benefits too, like
> > reduced syscalls and mmap_sem locking.
>
> Could we please get a testcase for the syscall(s) into
> tools/testing/selftests/vm?  For long-term maintenance reasons and
> as a service to arch maintainers - make it easy for them to check
> the functionality without having to roll their own (possibly
> incomplete) test app.
>
> I'm not sure *how* we'd develop a test case.  Use mincore()?

Ok, I'll add this later; a rough sketch of such a test is below.

> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -21,7 +21,10 @@
> >  #include
> >  #include
> >  #include
> > -
> > +#include
> > +#ifdef CONFIG_COMPAT
> > +#include <linux/compat.h>
> > +#endif
>
> I'll nuke the ifdefs - compat.h already does that.
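Here is a rough sketch of what a mincore()-based selftest could look
like: fault in some anonymous memory, purge a few ranges with the new
syscall, then check with mincore() that the pages are no longer
resident. The (vec, nvec, behavior) argument order is my placeholder
for illustration, and __NR_madvisev is assumed to come from the
patched kernel headers - both would follow whatever the final patch
actually wires up:

	/* sketch for tools/testing/selftests/vm, not the final test */
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>
	#include <sys/mman.h>
	#include <sys/uio.h>
	#include <sys/syscall.h>

	#define NR_RANGES 4

	int main(void)
	{
		long page = sysconf(_SC_PAGESIZE);
		size_t len = 64 * page;
		struct iovec vec[NR_RANGES];
		unsigned char before = 0, after = 0;
		char *buf;
		int i;

		buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (buf == MAP_FAILED)
			return 1;

		memset(buf, 1, len);	/* fault all pages in */

		/* four scattered one-page ranges, like a fragmented heap */
		for (i = 0; i < NR_RANGES; i++) {
			vec[i].iov_base = buf + i * 16 * page;
			vec[i].iov_len = page;
		}

		if (mincore(buf, page, &before))
			return 1;

		/* one vectored purge instead of NR_RANGES madvise() calls */
		if (syscall(__NR_madvisev, vec, NR_RANGES, MADV_DONTNEED))
			return 1;

		if (mincore(buf, page, &after))
			return 1;

		/* resident before the purge, gone afterwards */
		printf("%s\n", (before & 1) && !(after & 1) ? "PASS" : "FAIL");
		return 0;
	}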
> It would be good for us to have a look at the manpage before going
> too far with the patch - this helps reviewers to think about the
> proposed interface and behaviour.
>
> I'll queue this up for a bit of testing, although it won't get
> tested much.  The syscall fuzzers will presumably hit on it.

Thanks!
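P.S. To seed the manpage discussion, here is a strawman of the
userspace-facing interface as I understand the current patch. The
wrapper name and argument order are illustrative, not final:

	#include <sys/mman.h>
	#include <sys/syscall.h>
	#include <sys/uio.h>
	#include <unistd.h>

	/*
	 * Strawman: apply one madvise behavior to all nvec ranges in a
	 * single call, so the kernel can batch the TLB flushes and take
	 * mmap_sem once.  __NR_madvisev is assumed to be defined by the
	 * patched kernel headers.
	 */
	static inline int madvisev(const struct iovec *vec, int nvec,
				   int behavior)
	{
		return syscall(__NR_madvisev, vec, nvec, behavior);
	}

	/* e.g. how jemalloc could purge its dirty runs in one shot */
	static int purge_ranges(struct iovec *ranges, int n)
	{
		return madvisev(ranges, n, MADV_DONTNEED);
	}

The win over a loop of madvise() calls is that all the ranges share
one TLB shootdown and one mmap_sem acquisition instead of one per
range.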