From mboxrd@z Thu Jan 1 00:00:00 1970 Message-ID: <46130BC8.9050905@yahoo.com.au> Date: Wed, 04 Apr 2007 12:22:00 +1000 From: Nick Piggin MIME-Version: 1.0 Subject: Re: missing madvise functionality References: <46128051.9000609@redhat.com> <46128CC2.9090809@redhat.com> <20070403172841.GB23689@one.firstfloor.org> <20070403125903.3e8577f4.akpm@linux-foundation.org> <4612B645.7030902@redhat.com> <20070403202937.GE355@devserv.devel.redhat.com> <20070403144948.fe8eede6.akpm@linux-foundation.org> <4612DCC6.7000504@cosmosbay.com> In-Reply-To: <4612DCC6.7000504@cosmosbay.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 8bit Sender: owner-linux-mm@kvack.org Return-Path: To: Eric Dumazet Cc: Andrew Morton , Jakub Jelinek , Ulrich Drepper , Andi Kleen , Rik van Riel , Linux Kernel , linux-mm@kvack.org, Hugh Dickins List-ID: Eric Dumazet wrote: > Andrew Morton a ecrit : > >> On Tue, 3 Apr 2007 16:29:37 -0400 >> Jakub Jelinek wrote: >> >>> On Tue, Apr 03, 2007 at 01:17:09PM -0700, Ulrich Drepper wrote: >>> >>>> Andrew Morton wrote: >>>> >>>>> Ulrich, could you suggest a little test app which would demonstrate >>>>> this >>>>> behaviour? >>>> >>>> It's not really reliably possible to demonstrate this with a small >>>> program using malloc. You'd need something like this mysql test case >>>> which Rik said is not hard to run by yourself. >>>> >>>> If somebody adds a kernel interface I can easily produce a glibc patch >>>> so that the test can be run in the new environment. >>>> >>>> But it's of course easy enough to simulate the specific problem in a >>>> micro benchmark. If you want that let me know. >>> >>> I think something like following testcase which simulates what free >>> and malloc do when trimming/growing a non-main arena. >>> >>> My guess is that all the page zeroing is pretty expensive as well and >>> takes significant time, but I haven't profiled it. >>> >>> #include >>> #include >>> #include >>> #include >>> >>> void * >>> tf (void *arg) >>> { >>> (void) arg; >>> size_t ps = sysconf (_SC_PAGE_SIZE); >>> void *p = mmap (NULL, 128 * ps, PROT_READ | PROT_WRITE, >>> MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); >>> if (p == MAP_FAILED) >>> exit (1); >>> int i; >>> for (i = 0; i < 100000; i++) >>> { >>> /* Pretend to use the buffer. */ >>> char *q, *r = (char *) p + 128 * ps; >>> size_t s; >>> for (q = (char *) p; q < r; q += ps) >>> *q = 1; >>> for (s = 0, q = (char *) p; q < r; q += ps) >>> s += *q; >>> /* Free it. Replace this mmap with >>> madvise (p, 128 * ps, MADV_THROWAWAY) when implemented. */ >>> if (mmap (p, 128 * ps, PROT_NONE, >>> MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0) != p) >>> exit (2); >>> /* And immediately malloc again. This would then be deleted. */ >>> if (mprotect (p, 128 * ps, PROT_READ | PROT_WRITE)) >>> exit (3); >>> } >>> return NULL; >>> } >>> >>> int >>> main (void) >>> { >>> pthread_t th[32]; >>> int i; >>> for (i = 0; i < 32; i++) >>> if (pthread_create (&th[i], NULL, tf, NULL)) >>> exit (4); >>> for (i = 0; i < 32; i++) >>> pthread_join (th[i], NULL); >>> return 0; >>> } >>> >> >> whee. 135,000 context switches/sec on a slow 2-way. mmap_sem, most >> likely. That is ungood. >> >> Did anyone monitor the context switch rate with the mysql test? >> >> Interestingly, your test app (with s/100000/1000) runs to completion >> in 13 >> seocnd on the slow 2-way. On a fast 8-way, it took 52 seconds and >> sustained 40,000 context switches/sec. That's a bit unexpected. >> >> Both machines show ~8% idle time, too :( > > > Yes... then add to this some futex work, and you get the picture. > > I do think such workloads might benefit from a vma_cache not shared by > all threads but private to each thread. A sequence could invalidate the > cache(s). > > ie instead of a mm->mmap_cache, having a mm->sequence, and each thread > having a current->mmap_cache and current->mm_sequence I have a patchset to do exactly this, btw. Anyway what is the status of the private futex work. I don't think that is very intrusive or complicated, so it should get merged ASAP (so then at least we have the interface there). -- SUSE Labs, Novell Inc. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org