From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wm0-f47.google.com (mail-wm0-f47.google.com [74.125.82.47]) by kanga.kvack.org (Postfix) with ESMTP id AF04D6B0038 for ; Tue, 1 Dec 2015 18:59:29 -0500 (EST) Received: by wmww144 with SMTP id w144so35244040wmw.0 for ; Tue, 01 Dec 2015 15:59:29 -0800 (PST) Received: from mail.linuxfoundation.org (mail.linuxfoundation.org. [140.211.169.12]) by mx.google.com with ESMTPS id aw7si402333wjc.90.2015.12.01.15.59.28 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 01 Dec 2015 15:59:28 -0800 (PST) Date: Tue, 1 Dec 2015 15:59:26 -0800 From: Andrew Morton Subject: Re: [PATCH V2] mm: add a new vector based madvise syscall Message-Id: <20151201155926.6291e5e541c49d453079849b@linux-foundation.org> In-Reply-To: References: Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Shaohua Li Cc: linux-mm@kvack.org, linux-api@vger.kernel.org, je@fb.com, Kernel-team@fb.com, Rik van Riel , Mel Gorman , Hugh Dickins , Johannes Weiner , Andrea Arcangeli , Andi Kleen , Minchan Kim On Mon, 9 Nov 2015 11:44:54 -0800 Shaohua Li wrote: > In jemalloc, a free(3) doesn't immediately free the memory to OS even > the memory is page aligned/size, and hope the memory can be reused soon. > Later the virtual address becomes fragmented, and more and more free > memory are aggregated. If the free memory size is large, jemalloc uses > madvise(DONT_NEED) to actually free the memory back to OS. > > The madvise has significantly overhead paritcularly because of TLB > flush. jemalloc does madvise for several virtual address space ranges > one time. Instead of calling madvise for each of the ranges, we > introduce a new syscall to purge memory for several ranges one time. In > this way, we can merge several TLB flush for the ranges to one big TLB > flush. This also reduce mmap_sem locking and kernel/userspace switching. > > I'm running a simple memory allocation benchmark. 32 threads do random > malloc/free/realloc. Corresponding jemalloc patch to utilize this API is > attached. > Without patch: > real 0m18.923s > user 1m11.819s > sys 7m44.626s > each cpu gets around 3000K/s TLB flush interrupt. Perf shows TLB flush > is hotest functions. mmap_sem read locking (because of page fault) is > also heavy. > > with patch: > real 0m15.026s > user 0m48.548s > sys 6m41.153s > each cpu gets around 140k/s TLB flush interrupt. TLB flush isn't hot at > all. mmap_sem read locking (still because of page fault) becomes the > sole hot spot. > > Another test malloc a bunch of memory in 48 threads, then all threads > free the memory. I measure the time of the memory free. > Without patch: 34.332s > With patch: 17.429s > > Current implementation only supports MADV_DONTNEED. Should be trival to > support MADV_FREE if necessary later. I'd like to see a full description of the proposed userspace interface: arguments, data structures, return values, etc. A propotype manpage, basically. I'd also like to see an analysis of which other userspace allocators will benefit from this. glibc? tcmalloc? > > ... > > +/* > + * The vector madvise(). Like madvise except running for a vector of virtual > + * address ranges > + */ > +SYSCALL_DEFINE3(madvisev, const struct iovec __user *, uvector, > + unsigned long, nr_segs, int, behavior) > +{ > + struct iovec iovstack[UIO_FASTIOV]; > + struct iovec *iov = NULL; > + unsigned long start, end = 0; > + int unmapped_error = 0; > + size_t len; > + struct mmu_gather tlb; > + int error; > + int i; > + > + if (behavior != MADV_DONTNEED) > + return -EINVAL; > + > + error = rw_copy_check_uvector(CHECK_IOVEC_ONLY, uvector, nr_segs, > + UIO_FASTIOV, iovstack, &iov); > + if (error <= 0) > + goto out; > + /* Make sure address in ascend order */ > + sort(iov, nr_segs, sizeof(struct iovec), iov_cmp_func, NULL); Do we really need to sort the addresses? That's something which can be done in userspace and we can easily add a check-for-sortedness to the below loop. It depends on whether userspace can easily generate a sorted array. If basically all userspace will always need to run sort() then it doesn't matter much whether it's done in the kernel or in userspace. But if *some* userspace can naturally generate its array in sorted form then neither userspace nor the kernel needs to run sort() and we should take this out. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org