* Re: pthread_create() slow for many threads; also time to revisit 64b context switch optimization?
       [not found] <af8810200808121736q76640cc1kb814385072fe9b29@mail.gmail.com>
@ 2008-08-13  0:45 ` Pardo
  2008-08-13 10:44   ` Ingo Molnar
  0 siblings, 1 reply; 27+ messages in thread
From: Pardo @ 2008-08-13  0:45 UTC
To: akpm, mingo, hugh, linux-mm, linux-kernel; +Cc: briangrant, cgd, mbligh

[First send rejected by vger.kernel.org due to HTML and/or test program
attachment. Re-send without, please contact me for the test program.]

mmap() is slow on MAP_32BIT allocation failure, sometimes causing
NPTL's pthread_create() to run about three orders of magnitude slower.
As example, in one case creating new threads goes from about 35,000
cycles up to about 25,000,000 cycles -- which is under 100 threads per
second.  Larger stacks reduce the severity of slowdown but also make
slowdown happen after allocating a few thousand threads.  Costs vary
with platform, stack size, etc., but thread allocation rates drop
suddenly on all of a half-dozen platforms I tried.

The cause is NPTL allocates stacks with code of the form (e.g., glibc
2.7 nptl/allocatestack.c):

  sto = mmap(0, ..., MAP_PRIVATE|MAP_32BIT, ...);
  if (sto == MAP_FAILED)
    sto = mmap(0, ..., MAP_PRIVATE, ...);

That is, try to allocate in the low 4GB, and when low addresses are
exhausted, allocate from any location.  Thus, once low addresses run
out, every stack allocation does a failing mmap() followed by a
successful mmap().  The failing mmap() is slow because it does a
linear search of all low-space vma's.

Low-address stacks are preferred because some machines context switch
much faster when the stack address has only 32 significant bits.  Slow
allocation was discussed in 2003 but without resolution.  See, e.g.,
http://ussg.iu.edu/hypermail/linux/kernel/0305.1/0321.html,
http://ussg.iu.edu/hypermail/linux/kernel/0305.1/0517.html,
http://ussg.iu.edu/hypermail/linux/kernel/0305.1/0538.html, and
http://ussg.iu.edu/hypermail/linux/kernel/0305.1/0520.html.  With
increasing use of threads, slow allocation is becoming a problem.

Some old machines were faster switching 32b stacks, but new machines
seem to switch as fast or faster using 64b stacks.  I measured
thread-to-thread context switches on two AMD processors and five Intel
processors.  Tests used the same code with 32b or 64b stack pointers;
tests covered varying numbers of threads switched and varying methods
of allocating stacks.  Two systems gave indistinguishable performance
with 32b or 64b stacks, four gave 5%-10% better performance using 64b
stacks, and of the systems I tested, only the P4 microarchitecture
x86-64 system gave better performance for 32b stacks, in that case
vastly better.  Most systems had thread-to-thread switch costs around
800-1200 cycles.  The P4 microarchitecture system had 32b context
switch costs around 3,000 cycles and 64b context switches around 4,800
cycles.

It appears the kernel's 64-bit switch path handles all 32-bit cases.
So on machines with a fast 64-bit path, context switch speed would
presumably be improved yet further by eliminating the special 32-bit
path.  It appears this would also collapse the task state's fs and
fsindex fields, and the gs and gsindex fields.  These could further
reduce memory, cache, and branch predictor pressure.

Various things would address the slow pthread_create().  Choices include:

 - Be more platform-aware about when to use MAP_32BIT.
 - Abandon use of MAP_32BIT entirely, with worse performance on some
   machines.
 - Change the mmap() algorithm to be faster on allocation failure
   (avoid a linear search of vmas).

Options to improve context switch times include:

 - Do nothing.
 - Be more platform-aware about when to use different 32b and 64b paths.
 - Get rid of the 32b path, which it appears would also make contexts
   smaller.

[Not] Attached is a program to measure context switch costs.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
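The try-low-then-fall-back pattern described in the message above can be sketched as a small self-contained function. This is only an illustration, assuming Linux: `alloc_thread_stack` and `is_low_address` are hypothetical names, not glibc's, and the real code in nptl/allocatestack.c does considerably more. Note that the mmap(2) manual page describes MAP_32BIT as placing the mapping in the first 2 GB, so the exact size of the "low" region may differ from the 4GB quoted in the mail.

```cpp
#include <sys/mman.h>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not glibc's function): ask for a low-address
// mapping first, then fall back to a mapping anywhere.
void *alloc_thread_stack(std::size_t size) {
  assert(size > 0);
  const int prot = PROT_READ | PROT_WRITE;
  const int flags = MAP_PRIVATE | MAP_ANONYMOUS;
#ifdef MAP_32BIT  // Only defined on x86-64 Linux.
  void *low = mmap(nullptr, size, prot, flags | MAP_32BIT, -1, 0);
  if (low != MAP_FAILED) return low;
  // Once the low range is exhausted, every call pays for this failed
  // attempt before reaching the fallback below.
#endif
  return mmap(nullptr, size, prot, flags, -1, 0);
}

// True if the address has no significant bits above bit 31.
bool is_low_address(const void *p) {
  std::uint64_t addr = reinterpret_cast<std::uintptr_t>(p);
  return (addr >> 32) == 0;
}
```

A caller would check the result against MAP_FAILED and release the stack with munmap(). The point of the sketch is that the cost of the first mmap() is paid on every call, whether or not it succeeds.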
* Re: pthread_create() slow for many threads; also time to revisit 64b context switch optimization?
  2008-08-13  0:45 ` Pardo
@ 2008-08-13 10:44   ` Ingo Molnar
  2008-08-13 13:35     ` Arjan van de Ven
  0 siblings, 1 reply; 27+ messages in thread
From: Ingo Molnar @ 2008-08-13 10:44 UTC
To: Pardo
Cc: akpm, hugh, linux-mm, linux-kernel, briangrant, cgd, mbligh,
	Ulrich Drepper, Linus Torvalds, Thomas Gleixner, H. Peter Anvin,
	Arjan van de Ven

* Pardo <pardo@google.com> wrote:

> As example, in one case creating new threads goes from about 35,000
> cycles up to about 25,000,000 cycles -- which is under 100 threads per
> second. [...]

> Various things would address the slow pthread_create().  Choices
> include:
>  - Be more platform-aware about when to use MAP_32BIT.
>  - Abandon use of MAP_32BIT entirely, with worse performance on some
>    machines.
>  - Change the mmap() algorithm to be faster on allocation failure
>    (avoid a linear search of vmas).

Sigh, unfortunately MAP_32BIT use in 64-bit apps for stacks was
apparently created without foresight about what would happen in the MM
when thread stacks exhaust 4GB.

The problem is that MAP_32BIT is used both as a performance hack for
64-bit apps and as an ABI compat mechanism for 32-bit apps. So we cannot
just start disregarding MAP_32BIT in the kernel - we'd break 32-bit
compat apps and/or compat 32-bit libraries.

There are various other options to solve the (severe!) performance
breakdown:

 1- glibc could start not using MAP_32BIT for 64-bit thread stacks (the
    boxes where context-switching is slow probably do not matter all
    that much anymore - they were very slow at everything 64-bit anyway)

    Pros: easiest solution.
    Cons: slows down the affected machines and needs a new glibc.

 2- We could introduce a new MAP_64BIT_STACK flag which we could
    propagate into MAP_32BIT on those old CPUs. It would be disregarded
    on modern CPUs and thread stacks would be 64-bit.

    Pros: cleanest solution.
    Cons: needs both new glibc and new kernel to take advantage of.

 3- We could detect the first-4G-is-full condition and cache it. Problem
    is, there will likely be small holes in it so it's rather hard to do
    it in a sane way. Also, every munmap() of a thread stack will
    invalidate this - triggering a slow linear search every now and
    then.

    Pros: only needs a new kernel to take advantage of.
    Cons: is the most complex and messiest solution with no clear
          benefit to other workloads. Also, does not 100% solve the
          performance problem and prolongs the 4GB stack threads hack.

i'd go for 1) or 2).

	Ingo
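The "be more platform-aware" idea, and option 2's notion of honouring the low-stack request only on old CPUs, both reduce to a per-CPU policy decision. A minimal sketch of such a decision follows. It assumes the affected "P4 microarchitecture" machines can be recognised as vendor "GenuineIntel" with CPUID display family 15, which is an assumption of this sketch and not something the thread states. `cpu_display_family` and `prefers_low_stack` are hypothetical names. The vendor string and signature would come from CPUID leaves 0 and 1, for example via `__get_cpuid` in GCC's `<cpuid.h>`.

```cpp
#include <cassert>
#include <cstring>

// Decode the "display family" from the CPUID leaf-1 EAX signature:
// bits 8-11 hold the base family, bits 20-27 the extended family,
// which is added only when the base family is 0xF.
unsigned cpu_display_family(unsigned signature) {
  unsigned base = (signature >> 8) & 0xF;
  unsigned ext = (signature >> 20) & 0xFF;
  return (base == 0xF) ? base + ext : base;
}

// Hypothetical policy: request low-address stacks only on Intel
// family 15 parts (assumed here to be the slow-64-bit-switch machines).
bool prefers_low_stack(const char *vendor, unsigned signature) {
  assert(vendor != nullptr);
  return std::strcmp(vendor, "GenuineIntel") == 0 &&
         cpu_display_family(signature) == 15;
}
```

The vendor check matters because AMD K8 parts also report base family 0xF. Whether such a check belongs in glibc or in the kernel is the open question in the thread; the sketch takes no side on that.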
* Re: pthread_create() slow for many threads; also time to revisit 64b context switch optimization?
  2008-08-13 10:44   ` Ingo Molnar
@ 2008-08-13 13:35     ` Arjan van de Ven
  2008-08-13 14:21       ` Ulrich Drepper
  0 siblings, 1 reply; 27+ messages in thread
From: Arjan van de Ven @ 2008-08-13 13:35 UTC
To: Ingo Molnar
Cc: Pardo, akpm, hugh, linux-mm, linux-kernel, briangrant, cgd, mbligh,
	Ulrich Drepper, Linus Torvalds, Thomas Gleixner, H. Peter Anvin

On Wed, 13 Aug 2008 12:44:45 +0200
Ingo Molnar <mingo@elte.hu> wrote:

> There are various other options to solve the (severe!) performance
> breakdown:
>
>  1- glibc could start not using MAP_32BIT for 64-bit thread stacks
> (the boxes where context-switching is slow probably do not matter all
> that much anymore - they were very slow at everything 64-bit anyway)
>
>     Pros: easiest solution.
>     Cons: slows down the affected machines and needs a new glibc.
>
>
> i'd go for 1) or 2).

I would go for 1) clearly; it's the cleanest thing going forward for
sure.

--
If you want to reach me at my work email, use arjan@linux.intel.com
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
* Re: pthread_create() slow for many threads; also time to revisit 64b context switch optimization?
  2008-08-13 13:35     ` Arjan van de Ven
@ 2008-08-13 14:21       ` Ulrich Drepper
  2008-08-13 14:25         ` Ingo Molnar
  0 siblings, 1 reply; 27+ messages in thread
From: Ulrich Drepper @ 2008-08-13 14:21 UTC
To: Arjan van de Ven
Cc: Ingo Molnar, akpm, hugh, linux-mm, linux-kernel, briangrant, cgd,
	mbligh, Linus Torvalds, Thomas Gleixner, H. Peter Anvin

Arjan van de Ven wrote:
>> i'd go for 1) or 2).
>
> I would go for 1) clearly; it's the cleanest thing going forward for
> sure.

I want to see numbers first.  If there are problems visible I
definitely would want to see 2.  Andi at the time I wrote that code
was very adamant that I use the flag.

--
Ulrich Drepper, Red Hat, Inc., 444 Castro St, Mountain View, CA
* Re: pthread_create() slow for many threads; also time to revisit 64b context switch optimization?
  2008-08-13 14:21       ` Ulrich Drepper
@ 2008-08-13 14:25         ` Ingo Molnar
  2008-08-13 14:36           ` Ulrich Drepper
  0 siblings, 1 reply; 27+ messages in thread
From: Ingo Molnar @ 2008-08-13 14:25 UTC
To: Ulrich Drepper
Cc: Arjan van de Ven, akpm, hugh, linux-mm, linux-kernel, briangrant,
	cgd, mbligh, Linus Torvalds, Thomas Gleixner, H. Peter Anvin

* Ulrich Drepper <drepper@redhat.com> wrote:

> Arjan van de Ven wrote:
> >> i'd go for 1) or 2).
> >
> > I would go for 1) clearly; it's the cleanest thing going forward for
> > sure.
>
> I want to see numbers first.  If there are problems visible I
> definitely would want to see 2.  Andi at the time I wrote that code
> was very adamant that I use the flag.

not sure exactly what numbers you mean, but there are lots of numbers in
the first mail, attached below. For example:

| As example, in one case creating new threads goes from about 35,000
| cycles up to about 25,000,000 cycles -- which is under 100 threads per
| second.  Larger stacks reduce the severity of slowdown but also make

being able to create only 100 threads per second brings us back to 33
MHz 386 DX Linux performance.

	Ingo

---------------------->

[Pardo's original mail followed here in full; it is identical to the
first message in this thread.]
* Re: pthread_create() slow for many threads; also time to revisit 64b context switch optimization?
  2008-08-13 14:25         ` Ingo Molnar
@ 2008-08-13 14:36           ` Ulrich Drepper
  2008-08-13 15:10             ` Ingo Molnar
  0 siblings, 1 reply; 27+ messages in thread
From: Ulrich Drepper @ 2008-08-13 14:36 UTC
To: Ingo Molnar
Cc: Arjan van de Ven, akpm, hugh, linux-mm, linux-kernel, briangrant,
	cgd, mbligh, Linus Torvalds, Thomas Gleixner, H. Peter Anvin

Ingo Molnar wrote:
> not sure exactly what numbers you mean, but there are lots of numbers in
> the first mail, attached below. For example:

I mean numbers indicating that it doesn't hurt performance on any of
today's machines.  If there are machines where it makes a difference
then we need the flag to indicate the _preference_ for a low stack, as
opposed to indicating a _requirement_.

--
Ulrich Drepper, Red Hat, Inc., 444 Castro St, Mountain View, CA
* Re: pthread_create() slow for many threads; also time to revisit 64b context switch optimization?
  2008-08-13 14:36           ` Ulrich Drepper
@ 2008-08-13 15:10             ` Ingo Molnar
  2008-08-13 15:21               ` Ulrich Drepper
  2008-08-13 20:42               ` Andi Kleen
  0 siblings, 2 replies; 27+ messages in thread
From: Ingo Molnar @ 2008-08-13 15:10 UTC
To: Ulrich Drepper
Cc: Arjan van de Ven, akpm, hugh, linux-mm, linux-kernel, briangrant,
	cgd, mbligh, Linus Torvalds, Thomas Gleixner, H. Peter Anvin

* Ulrich Drepper <drepper@redhat.com> wrote:

> Ingo Molnar wrote:
> > not sure exactly what numbers you mean, but there are lots of numbers in
> > the first mail, attached below. For example:
>
> I mean numbers indicating that it doesn't hurt performance on any of
> today's machines.  If there are machines where it makes a difference
> then we need the flag to indicate the _preference_ for a low stack, as
> opposed to indicating a _requirement_.

there were a few numbers about that as well, and a test-app. The test
app is below. The numbers were:

| I measured thread-to-thread context switches on two AMD processors and
| five Intel processors.  Tests used the same code with 32b or 64b stack
| pointers; tests covered varying numbers of threads switched and
| varying methods of allocating stacks.  Two systems gave
| indistinguishable performance with 32b or 64b stacks, four gave 5%-10%
| better performance using 64b stacks, and of the systems I tested, only
| the P4 microarchitecture x86-64 system gave better performance for 32b
| stacks, in that case vastly better.  Most systems had thread-to-thread
| switch costs around 800-1200 cycles.  The P4 microarchitecture system
| had 32b context switch costs around 3,000 cycles and 64b context
| switches around 4,800 cycles.

i find it pretty unacceptable these days that we limit any aspect of
pure 64-bit apps in any way to 4GB (or any other 32-bit-ish limit).
[other than the small execution model which is 2GB obviously.]

	Ingo

--------------------->

// switch.cc -- measure thread-to-thread context switch times
// using either low-address stacks or high-address stacks

#include <sys/mman.h>
#include <sys/types.h>
#include <pthread.h>
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

const int kRequestedSwaps = 10000;
const int kNumThreads = 2;
const int kRequestedSwapsPerThread = kRequestedSwaps / kNumThreads;
const int kStackSize = 64 * 1024;
const int kTrials = 100;

typedef long long Tsc;
// Large positive sentinel.  The shift is parenthesized explicitly: '-'
// binds tighter than '<<', so the unparenthesized form shifts by one less.
#define LARGEST_TSC (static_cast<Tsc>((1ULL << (8 * sizeof(Tsc) - 2)) - 1))

// Read the CPU timestamp counter.
Tsc now() {
  unsigned int eax_lo, edx_hi;
  Tsc now;
  asm volatile("rdtsc" : "=a" (eax_lo), "=d" (edx_hi));
  now = ((Tsc)eax_lo) | ((Tsc)(edx_hi) << 32);
  return now;
}

// Use 0/1 for size to allow array subscripting.
const int pointer_sizes[] = { 32, 64 };
#define SZ_N (sizeof(pointer_sizes) / sizeof(pointer_sizes[0]))
typedef int PointerSize;

// Return 0 if the address fits in 32 bits, 1 otherwise.
PointerSize address_size(const void *vaddr) {
  intptr_t iaddr = reinterpret_cast<intptr_t>(vaddr);
  return ((iaddr >> 32) == 0) ? 0 : 1;
}

// One instance pointed to by every PerThread.
struct SharedArgs {
  // Read-only during a given test:
  cpu_set_t cpu;  // Only one bit set; all threads run on this CPU.
  // Read/write during a given test:
  pthread_barrier_t start_barrier;
  pthread_barrier_t stop_barrier;
};

// One per thread.
struct PerThread {
  // Thread args
  SharedArgs *shared_args;
  Tsc *stamps;
  // Per-thread storage.
  pthread_t thread;
  void *stack[SZ_N];  // mmap()'d storage
  pthread_attr_t attr;
};

// Distinguish between start/stop timestamp for each iteration
typedef enum { START, STOP } StartStop;

// Record each timestamp in isolation for minimum runtime cache footprint;
// after a run, copy each timestamp to one of these so can sort and also track
// start/stop, etc.
struct Event {
  Tsc time;
  StartStop start_stop;
  int thread_num;
  int iter;
};

// Sort events in increasing time order.  Compare explicitly rather than
// returning the difference, which would be truncated from Tsc to int.
int event_pred(const void *ve0, const void *ve1) {
  const Event *e0 = static_cast<const Event *>(ve0);
  const Event *e1 = static_cast<const Event *>(ve1);
  if (e0->time < e1->time) return -1;
  if (e0->time > e1->time) return 1;
  return 0;
}

// Data to aggregate across runs.  Print only after runs are all over, in order
// to minimize possible overlap of I/O and benchmark.
struct Result {
  int pointer_size;
  int swaps;
  Tsc fastest;
};

// Each thread runs this worker.
void *worker(void *v_per_thread) {
  const PerThread *per_thread = static_cast<const PerThread *>(v_per_thread);
  SharedArgs *shared_args = per_thread->shared_args;

  // Run all threads on the same CPU.
  const cpu_set_t *cpu = &shared_args->cpu;
  int cc = sched_setaffinity(0/*self*/, sizeof(*cpu), cpu);
  if (cc != 0) {
    perror("sched_setaffinity");
    exit(1);
  }

  // Wait for all workers to be ready before running the inner loop.
  cc = pthread_barrier_wait(&shared_args->start_barrier);
  if ((cc != 0) && (cc != PTHREAD_BARRIER_SERIAL_THREAD)) {
    perror("pthread_barrier_wait");
    exit(1);
  }

  // Inner loop: track time before and after a swap.  In principle we
  // can use just one timestamp per iteration, but that gives more
  // variance between timestamps from overheads such as cache misses
  // not related to the context switch.
  Tsc *stamp = per_thread->stamps;
  for (int i = 0; i < kRequestedSwapsPerThread; ++i) {
    // Run timed critical section in as much isolation as possible.
    // Notably, read stamps but avoid saving them to memory and taking
    // cache misses until after both %tsc reads.
    asm volatile ("nop" ::: "memory");
    Tsc start = now();
    sched_yield();
    Tsc stop = now();
    asm volatile ("nop" ::: "memory");
    *stamp++ = start;
    *stamp++ = stop;
  }

  // Release the manager to clean up.
  cc = pthread_barrier_wait(&shared_args->stop_barrier);
  if ((cc != 0) && (cc != PTHREAD_BARRIER_SERIAL_THREAD)) {
    perror("pthread_barrier_wait");
    exit(1);
  }
  return NULL;
}

// Manager code that creates and starts worker threads, waits, then cleans up.
void run_test(PerThread *per_thread, PointerSize ps) {
  // Create worker threads.
  for (int th = 0; th < kNumThreads; ++th) {
    int cc = pthread_attr_setstack(&per_thread[th].attr,
                                   per_thread[th].stack[ps], kStackSize);
    if (cc != 0) {
      perror("pthread_attr_setstack");
      exit(1);
    }
    cc = pthread_create(&per_thread[th].thread, &per_thread[th].attr,
                        worker, &per_thread[th]);
    if (cc != 0) {
      perror("pthread_create");
      exit(1);
    }
  }

  // Release all worker threads to run their inner loop,
  // then wait for all to finish before joining any.
  SharedArgs *shared_args = per_thread->shared_args;
  int cc = pthread_barrier_wait(&shared_args->start_barrier);
  if ((cc != 0) && (cc != PTHREAD_BARRIER_SERIAL_THREAD)) {
    perror("pthread_barrier_wait");
    exit(1);
  }
  cc = pthread_barrier_wait(&shared_args->stop_barrier);
  if ((cc != 0) && (cc != PTHREAD_BARRIER_SERIAL_THREAD)) {
    perror("pthread_barrier_wait");
    exit(1);
  }

  // Clean up worker threads.
  for (int th = 0; th < kNumThreads; ++th) {
    int cc = pthread_join(per_thread[th].thread, NULL);
    if (cc != 0) {
      perror("pthread_join");
      exit(1);
    }
  }
}

// After a run, find out which sched_yield() calls actually did a yield,
// then find out the fastest sched_yield() that occurred during the run.
Result process_data(Event *event, const PerThread per_thread[],
                    int requested_swaps_per_thread,
                    PointerSize pointer_size) {
  // Copy timestamps in to a struct to associate timestamps with thread number.
  int event_num = 0;
  for (int th = 0; th < kNumThreads; ++th) {
    const Tsc *stamps = per_thread[th].stamps;
    int stamp_num = 0;
    StartStop start_stop = START;
    // 2* because there's a start stamp and stop stamp for each swap
    for (int iter = 0; iter < (2 * requested_swaps_per_thread); ++iter) {
      event[event_num].time = stamps[stamp_num++];
      event[event_num].start_stop = start_stop;
      start_stop = (start_stop == START) ? STOP : START;
      event[event_num].thread_num = th;
      event[event_num].iter = iter;
      ++event_num;
    }
  }
  int num_events = event_num;

  // Sort data in timestamp order.
  qsort(event, num_events, sizeof(event[0]), event_pred);

  // A context switch occurred if two adjacent stamps are for
  // different threads.  A requested context switch very likely
  // occurred if a context switch was between a START stamp in the
  // first thread and a STOP stamp in the second.  Note that some
  // non-requested context switches also get logged.  As example, a
  // preemptive cswap could have occurred, and the following
  // sched_yield() may have done a yield-to-self.
  Tsc fastest = LARGEST_TSC;
  int swaps = 0;
  for (int e = 0; e < (num_events - 1); ++e) {
    if ((event[e].thread_num != event[e+1].thread_num) &&
        (event[e].start_stop == START) &&
        (event[e+1].start_stop == STOP)) {
      ++swaps;
      Tsc t = event[e+1].time - event[e].time;
      if (t < fastest)
        fastest = t;
    }
  }

  Result result;
  result.pointer_size = pointer_size;
  result.swaps = swaps;
  result.fastest = fastest;
  return result;
}

// Dump results for one run.  Also aggregate "best of best" and "worst of best".
void dump_one_run(Tsc best[SZ_N], Tsc worst[SZ_N], int trial_num,
                  const Result *result) {
  Tsc t = result->fastest;
  PointerSize ps = result->pointer_size;
  int cc = printf("run: %d pointer-size: %d requested-swaps: %d "
                  "got-swaps: %d fastest: %lld\n",
                  trial_num, pointer_sizes[ps], kRequestedSwaps,
                  result->swaps, result->fastest);
  if (cc < 0) {
    perror("printf");
    exit(1);
  }
  if (t < best[ps])
    best[ps] = t;
  if (t > worst[ps])
    worst[ps] = t;
}

void *mmap_stack(PointerSize pointer_size) {
  int location_flag;
  switch(pointer_sizes[pointer_size]) {
  case 32: location_flag = MAP_32BIT; break;
  case 64: location_flag = 0x0; break;
  default:
    fprintf(stderr, "Implementation error: unhandled stack placement\n");
    exit(1);
  }
  // Anonymous mappings take fd -1 for portability.
  void *stack = mmap(0, kStackSize, PROT_READ|PROT_WRITE,
                     MAP_PRIVATE|MAP_ANONYMOUS|location_flag, -1, 0);
  if (stack == MAP_FAILED) {
    perror("mmap");
    exit(1);
  }
  // Check we got the stack location we requested
  PointerSize got = address_size(stack);
  if (got != pointer_size) {
    // Note: MSWindohs and Linux are asymmetrical about %p: one prints
    // with a leading 0x, the other does not.  Assume here it does not matter.
    fprintf(stderr, "Did not get requested pointer size\n");
    exit(1);
  }
  return stack;
}

void munmap_stack(void *stack) {
  int cc = munmap(stack, kStackSize);
  if (cc != 0) {
    perror("munmap");
    exit(1);
  }
}

int main(int argc, char **argv) {
  SharedArgs shared_args;

  // Find the highest-numbered CPU; all threads run on that CPU only.
  {
    cpu_set_t set;
    int sz = sched_getaffinity(0, sizeof(set), &set);
    // Documentation says sched_getaffinity() returns the size used by
    // the kernel, but by experiment it returns zero on some 2.6.18
    // systems, but with a sensible mask nonetheless.
    if (sz < 0) {
      perror ("sched_getaffinity");
      exit(1);
    }
    // Find an available processor/core.  If possible grab something other
    // than CPU 0 to minimize interference from interrupts preferentially
    // delivered to core 0.
    int proc;
    for (proc=CPU_SETSIZE-1; proc>=0; --proc)
      if (CPU_ISSET(proc, &set))
        break;
    // The loop counts down, so "nothing found" leaves proc at -1.
    if (proc < 0) {
      fprintf (stderr, "No virtual processors!?\n");
      exit(1);
    }
    CPU_ZERO(&shared_args.cpu);
    CPU_SET(proc, &shared_args.cpu);
  }

  // Reusable per-thread setup
  PerThread per_thread[kNumThreads];
  for (int th = 0; th < kNumThreads; ++th) {
    per_thread[th].stamps = new Tsc[2 * kRequestedSwaps];
    per_thread[th].shared_args = &shared_args;
    for (int ps = 0; ps < SZ_N; ++ps)
      per_thread[th].stack[ps] = mmap_stack(static_cast<PointerSize>(ps));
    int cc = pthread_attr_init(&per_thread[th].attr);
    if (cc != 0) {
      perror("pthread_attr_init");
      exit(1);
    }
  }

  // Storage for post-processing timestamps from one trial run.
  // 2 stamps per iteration.  'new' the storage since long runs
  // otherwise overflow the stack.
  Event *event = new Event[kNumThreads * (2 * kRequestedSwaps)];

  // Post-processed data for all trial runs.  Written during the "run
  // tests" phase and read during the "dump data" phase.
  const int kNumRuns = kTrials * SZ_N;
  Result result[kNumRuns];
  int result_num = 0;

  // Pthread barriers are cyclic, so can reuse them.
  // +1 for the manager thread
  pthread_barrier_init(&shared_args.start_barrier, NULL, kNumThreads + 1);
  pthread_barrier_init(&shared_args.stop_barrier, NULL, kNumThreads + 1);

  // Warming runs
  {
    run_test(per_thread, static_cast<PointerSize>(0/*32b*/));
    run_test(per_thread, static_cast<PointerSize>(1/*64b*/));
  }

  // Run tests
  for (int trial = 0; trial < kTrials; ++trial) {
    int requested_swaps_per_thread = kRequestedSwaps / kNumThreads;
    for (int ps = 0; ps < SZ_N; ++ps) {
      PointerSize pointer_size = static_cast<PointerSize>(ps);
      run_test(per_thread, pointer_size);
      // Process data and save to RAM.  Do not do explicit I/O here on the
      // basis background activity may interfere with context switches.
      result[result_num++] = process_data(event, per_thread,
                                          requested_swaps_per_thread,
                                          pointer_size);
    }
  }

  // Cleanup
  pthread_barrier_destroy(&shared_args.start_barrier);
  pthread_barrier_destroy(&shared_args.stop_barrier);
  for (int th = 0; th < kNumThreads; ++th) {
    delete[] per_thread[th].stamps;
    for (int ps = 0; ps < SZ_N; ++ps)
      munmap_stack(per_thread[th].stack[ps]);
    int cc = pthread_attr_destroy(&per_thread[th].attr);
    if (cc != 0) {
      perror("pthread_attr_destroy");
      exit(1);
    }
  }
  delete[] event;

  // Dump data from RAM to stdout.
  Tsc best[SZ_N] = { LARGEST_TSC, LARGEST_TSC };
  Tsc worst[SZ_N] = { 0, 0 };
  for (int r = 0; r < result_num; ++r)
    dump_one_run(best, worst, r, &result[r]);
  for (int sz = 0; sz < SZ_N; ++sz) {
    int cc = printf("best-of-best[%d]: %lld\nworst-of-best[%d]: %lld\n",
                    pointer_sizes[sz], best[sz],
                    pointer_sizes[sz], worst[sz]);
    if (cc < 0) {
      perror("printf");
      exit(1);
    }
  }
}
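The swap-detection rule in the test program (adjacent stamps from different threads, START followed by STOP) is easier to check on a tiny hand-made trace. The following is a minimal sketch of that rule in isolation. `SwapSummary` and `summarize_swaps` are hypothetical names introduced for illustration, and the event layout is simplified from the program's `Event` struct.

```cpp
#include <cassert>
#include <cstddef>

enum StartStop { START, STOP };

// Simplified event record: one timestamp tagged with its thread.
struct Event {
  long long time;
  StartStop start_stop;
  int thread_num;
};

struct SwapSummary {
  int swaps;          // number of requested switches detected
  long long fastest;  // shortest START-to-STOP gap, or -1 if none seen
};

// Scan events already sorted by time.  A requested switch is counted
// when a START stamp in one thread is immediately followed by a STOP
// stamp in a different thread.
SwapSummary summarize_swaps(const Event *event, std::size_t num_events) {
  assert(event != nullptr || num_events == 0);
  SwapSummary s = {0, -1};
  for (std::size_t e = 0; e + 1 < num_events; ++e) {
    if (event[e].thread_num != event[e + 1].thread_num &&
        event[e].start_stop == START &&
        event[e + 1].start_stop == STOP) {
      long long t = event[e + 1].time - event[e].time;
      if (s.fastest < 0 || t < s.fastest) s.fastest = t;
      ++s.swaps;
    }
  }
  return s;
}
```

For the trace (100, START, thread 0), (900, STOP, thread 1), (950, START, thread 1), (1700, STOP, thread 0), the first and last pairs qualify, giving two swaps with gaps of 800 and 750 cycles. The middle pair is the same thread and is skipped.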
* Re: pthread_create() slow for many threads; also time to revisit 64b context switch optimization?
  2008-08-13 15:10             ` Ingo Molnar
@ 2008-08-13 15:21               ` Ulrich Drepper
  2008-08-13 15:40                 ` Ingo Molnar
  2008-08-13 16:05                 ` H. Peter Anvin
  2008-08-13 20:42               ` Andi Kleen
  1 sibling, 2 replies; 27+ messages in thread
From: Ulrich Drepper @ 2008-08-13 15:21 UTC
To: Ingo Molnar
Cc: Arjan van de Ven, akpm, hugh, linux-mm, linux-kernel, briangrant,
	cgd, mbligh, Linus Torvalds, Thomas Gleixner, H. Peter Anvin

Ingo Molnar wrote:
> i find it pretty unacceptable these days that we limit any aspect of
> pure 64-bit apps in any way to 4GB (or any other 32-bit-ish limit).

Sure, but if we can pin-point the sub-archs for which it is the
problem then a flag to optionally request it is even easier to handle.
You'd simply ignore the flag for anything but the P4 architecture.

I personally have no problem removing the whole thing because I have
no such machine running anymore.  But there are people out there who
have.

--
Ulrich Drepper, Red Hat, Inc., 444 Castro St, Mountain View, CA
* Re: pthread_create() slow for many threads; also time to revisit 64b context switch optimization?
  2008-08-13 15:21               ` Ulrich Drepper
@ 2008-08-13 15:40                 ` Ingo Molnar
  2008-08-13 15:55                   ` Ulrich Drepper
  2008-08-13 16:05                 ` H. Peter Anvin
  1 sibling, 1 reply; 27+ messages in thread
From: Ingo Molnar @ 2008-08-13 15:40 UTC
To: Ulrich Drepper
Cc: Arjan van de Ven, akpm, hugh, linux-mm, linux-kernel, briangrant,
	cgd, mbligh, Linus Torvalds, Thomas Gleixner, H. Peter Anvin

* Ulrich Drepper <drepper@redhat.com> wrote:

> Ingo Molnar wrote:
> > i find it pretty unacceptable these days that we limit any aspect of
> > pure 64-bit apps in any way to 4GB (or any other 32-bit-ish limit).
>
> Sure, but if we can pin-point the sub-archs for which it is the
> problem then a flag to optionally request it is even easier to handle.
> You'd simply ignore the flag for anything but the P4 architecture.

i suspect you are talking about option #2 i described. It is the option
which will take the most time to trickle down to people.

> I personally have no problem removing the whole thing because I have
> no such machine running anymore.  But there are people out there who
> have.

hm, i think the set of people running on such boxes _and_ then upgrading
to a new glibc and expecting everything to be just as fast to the
microsecond as before should be minuscule.

Those P4 derived 64-bit boxes were astonishingly painful in 64-bit mode
- most of that hw is running 32-bit i suspect, because 64-bit on it was
really a joke.

Btw., can you see any problems with option #1: simply removing MAP_32BIT
from 64-bit stack allocations in glibc unconditionally? It's the fastest
to execute and also the most obvious solution. +1 usecs overhead in the
64-bit context-switch path on those old slow boxes won't matter much.

10 _millisecs_ to start a single thread on top-of-the-line hw is quite
unacceptable. (and there's little sane we can do in the kernel about
allocation overhead when we have an imperfectly filled 4GB box for all
allocations)

	Ingo
* Re: pthread_create() slow for many threads; also time to revisit 64b context switch optimization?
From: Ulrich Drepper @ 2008-08-13 15:55 UTC
To: Ingo Molnar
Cc: Arjan van de Ven, akpm, hugh, linux-mm, linux-kernel, briangrant, cgd, mbligh, Linus Torvalds, Thomas Gleixner, H. Peter Anvin

Ingo Molnar wrote:
> Btw., can you see any problems with option #1: simply removing MAP_32BIT
> from 64-bit stack allocations in glibc unconditionally?

Yes, as we both agree, there are still such machines out there.

The real problem is: what to do if somebody complains? If we had the extra flag, such people could be accommodated. If there is no such flag then distributions cannot just add the flag (it's part of the kernel API) and they would be caught between a rock and a hard place. Option #2 provides the biggest flexibility.

If the upstream kernel truly doesn't care about such machines anymore there are two options:

- really do nothing at all
- at least reserve a flag in case somebody wants/has to implement option #2

--
Ulrich Drepper, Red Hat, Inc., 444 Castro St, Mountain View, CA
* Re: pthread_create() slow for many threads; also time to revisit 64b context switch optimization?
From: Ingo Molnar @ 2008-08-13 16:02 UTC
To: Ulrich Drepper
Cc: Arjan van de Ven, akpm, hugh, linux-mm, linux-kernel, briangrant, cgd, mbligh, Linus Torvalds, Thomas Gleixner, H. Peter Anvin

* Ulrich Drepper <drepper@redhat.com> wrote:

> Ingo Molnar wrote:
> > Btw., can you see any problems with option #1: simply removing MAP_32BIT
> > from 64-bit stack allocations in glibc unconditionally?
>
> Yes, as we both agree, there are still such machines out there.
>
> The real problem is: what to do if somebody complains? If we had the
> extra flag, such people could be accommodated. If there is no
> such flag then distributions cannot just add the flag (it's part of
> the kernel API) and they would be caught between a rock and a hard
> place. Option #2 provides the biggest flexibility.
>
> If the upstream kernel truly doesn't care about such machines anymore
> there are two options:
>
> - really do nothing at all

do nothing at all is not an option - thread creation can take 10 msecs on top-of-the-line hardware.

> - at least reserve a flag in case somebody wants/has to implement option
> #2

yeah, i already had a patch for that when i wrote my first mail [attached below] and listed it as option #4 - then erased the comment figuring that we'd want to do #1 ;-)

As unimplemented flags just get ignored by the kernel, if this flag goes into v2.6.27 as-is and is ignored by the kernel (i.e. we just use a plain old 64-bit [47-bit] allocation), then you could do the glibc change straight away, correct? So then if people complain we can fix it in the kernel purely.

how about this then?

Ingo

--------------------->
Subject: mmap: add MAP_64BIT_STACK
From: Ingo Molnar <mingo@elte.hu>
Date: Wed Aug 13 12:41:54 CEST 2008

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 include/asm-x86/mman.h |    1 +
 1 file changed, 1 insertion(+)

Index: linux/include/asm-x86/mman.h
===================================================================
--- linux.orig/include/asm-x86/mman.h
+++ linux/include/asm-x86/mman.h
@@ -12,6 +12,7 @@
 #define MAP_NORESERVE   0x4000          /* don't check for reservations */
 #define MAP_POPULATE    0x8000          /* populate (prefault) pagetables */
 #define MAP_NONBLOCK    0x10000         /* do not block on IO */
+#define MAP_64BIT_STACK 0x20000         /* give out 32bit addresses on old CPUs */
 
 #define MCL_CURRENT     1               /* lock all current mappings */
 #define MCL_FUTURE      2               /* lock all future mappings */
* Re: pthread_create() slow for many threads; also time to revisit 64b context switch optimization?
From: Jamie Lokier @ 2008-08-15 15:54 UTC
To: Ingo Molnar
Cc: Ulrich Drepper, Arjan van de Ven, akpm, hugh, linux-mm, linux-kernel, briangrant, cgd, mbligh, Linus Torvalds, Thomas Gleixner, H. Peter Anvin

Ingo Molnar wrote:
> As unimplemented flags just get ignored by the kernel, if this flag goes
> into v2.6.27 as-is and is ignored by the kernel (i.e. we just use a
> plain old 64-bit [47-bit] allocation), then you could do the glibc
> change straight away, correct? So then if people complain we can fix it
> in the kernel purely.
>
> how about this then?

> +#define MAP_64BIT_STACK 0x20000 /* give out 32bit addresses on old CPUs */

I think the flag makes sense but its name is confusing - 64BIT for a flag which means "maybe request 32-bit stack"! Suggest:

+#define MAP_STACK 0x20000 /* 31bit or 64bit address for stack, */
+                          /* whichever is faster on this CPU */

Also, is this _only_ useful for thread stacks, or are there other memory allocations where 31-bitness affects execution speed on old P4s?

-- Jamie
* Re: pthread_create() slow for many threads; also time to revisit 64b context switch optimization?
From: Ingo Molnar @ 2008-08-15 16:03 UTC
To: Jamie Lokier
Cc: Ulrich Drepper, Arjan van de Ven, akpm, hugh, linux-mm, linux-kernel, briangrant, cgd, mbligh, Linus Torvalds, Thomas Gleixner, H. Peter Anvin

* Jamie Lokier <jamie@shareable.org> wrote:

> > how about this then?
>
> > +#define MAP_64BIT_STACK 0x20000 /* give out 32bit addresses on old CPUs */
>
> I think the flag makes sense but it's name is confusing - 64BIT for a
> flag which means "maybe request 32-bit stack"! Suggest:
>
> +#define MAP_STACK 0x20000 /* 31bit or 64bit address for stack, */
> +                          /* whichever is faster on this CPU */

ok. I've applied the patch below to tip/x86/urgent.

> Also, is this _only_ useful for thread stacks, or are there other
> memory allocations where 31-bitness affects execution speed on old
> P4s?

just about anything i guess - but since those CPUs do not really matter anymore in terms of bleeding-edge performance, what we care about is the intended current use of this flag: thread stacks.

Ingo

-------------------->
* Re: pthread_create() slow for many threads; also time to revisit 64b context switch optimization?
From: Ulrich Drepper @ 2008-08-15 17:13 UTC
To: Jamie Lokier
Cc: Ingo Molnar, Arjan van de Ven, akpm, hugh, linux-mm, linux-kernel, briangrant, cgd, mbligh, Linus Torvalds, Thomas Gleixner, H. Peter Anvin

Jamie Lokier wrote:
> Suggest:
>
> +#define MAP_STACK 0x20000 /* 31bit or 64bit address for stack, */
> +                          /* whichever is faster on this CPU */

I agree. Except for the comment.

> Also, is this _only_ useful for thread stacks, or are there other
> memory allocations where 31-bitness affects execution speed on old P4s?

Actually, I would define the flag as "do whatever is best assuming the allocation is used for stacks".

For instance, minimally the /proc/*/maps output could show "[user stack]" or something like this. For security, perhaps, setting of PROT_EXEC can be prevented.
* Re: pthread_create() slow for many threads; also time to revisit 64b context switch optimization?
From: Ingo Molnar @ 2008-08-15 17:19 UTC
To: Ulrich Drepper
Cc: Jamie Lokier, Arjan van de Ven, akpm, hugh, linux-mm, linux-kernel, briangrant, cgd, mbligh, Linus Torvalds, Thomas Gleixner, H. Peter Anvin

* Ulrich Drepper <drepper@gmail.com> wrote:

> Jamie Lokier wrote:
> > Suggest:
> >
> > +#define MAP_STACK 0x20000 /* 31bit or 64bit address for stack, */
> > +                          /* whichever is faster on this CPU */
>
> I agree. Except for the comment.
>
> > Also, is this _only_ useful for thread stacks, or are there other
> > memory allocations where 31-bitness affects execution speed on old P4s?
>
> Actually, I would define the flag as "do whatever is best assuming the
> allocation is used for stacks".
>
> For instance, minimally the /proc/*/maps output could show "[user
> stack]" or something like this. For security, perhaps, setting of
> PROT_EXEC can be prevented.

makes sense. Updated patch below. I've also added your Acked-by. Queued it up in tip/x86/urgent, for v2.6.27 merging.

( also, just to make sure: all Linux kernel versions will ignore such extra flags, so you can just update glibc to use this flag unconditionally, correct? )

Ingo

--------------------------->
* Re: pthread_create() slow for many threads; also time to revisit 64b context switch optimization?
From: Ulrich Drepper @ 2008-08-15 17:23 UTC
To: Ingo Molnar
Cc: Jamie Lokier, Arjan van de Ven, akpm, hugh, linux-mm, linux-kernel, briangrant, cgd, mbligh, Linus Torvalds, Thomas Gleixner, H. Peter Anvin

On Fri, Aug 15, 2008 at 10:19 AM, Ingo Molnar <mingo@elte.hu> wrote:
> ( also, just to make sure: all Linux kernel versions will ignore such
> extra flags, so you can just update glibc to use this flag
> unconditionally, correct? )

As soon as the patch hits Linus' tree I can change the code.
* Re: pthread_create() slow for many threads; also time to revisit 64b context switch optimization?
From: Ingo Molnar @ 2008-08-15 19:00 UTC
To: Ulrich Drepper
Cc: Jamie Lokier, Arjan van de Ven, akpm, hugh, linux-mm, linux-kernel, briangrant, cgd, mbligh, Linus Torvalds, Thomas Gleixner, H. Peter Anvin

* Ulrich Drepper <drepper@gmail.com> wrote:

> On Fri, Aug 15, 2008 at 10:19 AM, Ingo Molnar <mingo@elte.hu> wrote:
> > ( also, just to make sure: all Linux kernel versions will ignore such
> > extra flags, so you can just update glibc to use this flag
> > unconditionally, correct? )
>
> As soon as the patch hits Linus' tree I can change the code.

it's upstream now:

| commit cd98a04a59e2f94fa64d5bf1e26498d27427d5e7
| Author: Ingo Molnar <mingo@elte.hu>
| Date: Wed Aug 13 18:02:18 2008 +0200
|
| x86: add MAP_STACK mmap flag

thanks everyone,

Ingo
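For readers who want to see what a userspace caller of the new flag might look like, here is a minimal sketch in C. It is not the glibc change discussed above; the helper names `alloc_thread_stack` and `free_thread_stack` are hypothetical, and the fallback definition of `MAP_STACK` as 0 (a no-op) for headers that lack it is an assumption made only so the sketch compiles everywhere.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Assumption: on headers without MAP_STACK, treat the flag as a no-op. */
#ifndef MAP_STACK
#define MAP_STACK 0
#endif

/* Hypothetical helper: map a private anonymous region intended for use as
 * a thread stack. A single mmap() call, with no MAP_32BIT attempt and no
 * retry. Returns NULL on failure. */
static void *alloc_thread_stack(size_t size)
{
    assert(size > 0);
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Hypothetical helper: release a region obtained from alloc_thread_stack(). */
static int free_thread_stack(void *stack, size_t size)
{
    return munmap(stack, size);
}
```

A region obtained this way could then be handed to pthread_attr_setstack() before pthread_create(); that step is left out to keep the sketch free of linker dependencies.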
* Re: pthread_create() slow for many threads; also time to revisit 64b context switch optimization?
From: Linus Torvalds @ 2008-08-13 17:09 UTC
To: Ulrich Drepper
Cc: Ingo Molnar, Arjan van de Ven, akpm, hugh, linux-mm, linux-kernel, briangrant, cgd, mbligh, Thomas Gleixner, H. Peter Anvin

On Wed, 13 Aug 2008, Ulrich Drepper wrote:
>
> The real problem is: what to do if somebody complains?

Ulrich, I don't understand why you worry more about a _potential_ (and fairly unlikely) complaint, than about a real one today.

Thinking ahead may be good, but you take it to absolutely ridiculous heights, to the point where you make potential problems be bigger than -actual- problems.

Linus
* Re: pthread_create() slow for many threads; also time to revisit 64b context switch optimization?
From: Ulrich Drepper @ 2008-08-13 18:04 UTC
To: Linus Torvalds
Cc: Ingo Molnar, Arjan van de Ven, akpm, hugh, linux-mm, linux-kernel, briangrant, cgd, mbligh, Thomas Gleixner, H. Peter Anvin

Linus Torvalds wrote:
> Ulrich, I don't understand why you worry more about a _potential_ (and
> fairly unlikely) complaint, than about a real one today.

Of course I care. All I try to do is to prevent going from one extreme (all focus on P4s) to the other (ignore P4s completely).

Even ignoring this one case here, I think it's in any case useful for userlevel to tell the kernel that an anonymous memory region is needed for a stack. This might allow better optimizations and/or security implementations.

--
Ulrich Drepper, Red Hat, Inc., 444 Castro St, Mountain View, CA
* Re: pthread_create() slow for many threads; also time to revisit 64b context switch optimization?
From: Arjan van de Ven @ 2008-08-13 18:16 UTC
To: Ulrich Drepper
Cc: Linus Torvalds, Ingo Molnar, akpm, hugh, linux-mm, linux-kernel, briangrant, cgd, mbligh, Thomas Gleixner, H. Peter Anvin

On Wed, 13 Aug 2008 11:04:29 -0700 Ulrich Drepper <drepper@redhat.com> wrote:

> Linus Torvalds wrote:
> > Ulrich, I don't understand why you worry more about a _potential_
> > (and fairly unlikely) complaint, than about a real one today.
>
> Of course I care. All I try to do is to prevent going from one
> extreme (all focus on P4s) to the other (ignore P4s completely).

(fwiw as far as I know this is only about early 64 bit P4s, not later generations)

> Even ignoring this one case here, I think it's in any case useful for
> userlevel to tell the kernel that an anonymous memory region is needed
> for a stack. This might allow better optimizations and/or security
> implementations.

yeah maybe we should also tell it we expect it to be used downwards. Oh wait.. MAP_GROWSDOWN ?

--
If you want to reach me at my work email, use arjan@linux.intel.com
For development, discussion and tips for power savings, visit http://www.lesswatts.org
* Re: pthread_create() slow for many threads; also time to revisit 64b context switch optimization?
From: Ulrich Drepper @ 2008-08-13 18:22 UTC
To: Arjan van de Ven
Cc: Linus Torvalds, Ingo Molnar, akpm, hugh, linux-mm, linux-kernel, briangrant, cgd, mbligh, Thomas Gleixner, H. Peter Anvin

Arjan van de Ven wrote:
> yeah maybe we should also tell it we expect it to be used downwards.
> Oh wait.. MAP_GROWSDOWN ?

MAP_GROWSDOWN is unusable because we have to allocate the entire address range for the stack. Otherwise some other allocation happens in that range and all of a sudden the stack cannot grow as much as needed anymore.

These flags really can be removed. They should not be used because they are outright dangerous.

--
Ulrich Drepper, Red Hat, Inc., 444 Castro St, Mountain View, CA
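The point about reserving the entire address range up front can be illustrated with a small C sketch. It assumes a downward-growing stack and places one inaccessible guard page at the low end of the region; `struct stack_region`, `reserve_stack` and `release_stack` are hypothetical names, and this is not taken from glibc's allocatestack.c.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical descriptor for a fully reserved stack region. */
struct stack_region {
    void  *base;         /* start of the whole mapping (guard page) */
    size_t total;        /* guard page plus usable bytes */
    void  *usable;       /* lowest usable stack address */
    size_t usable_size;  /* usable bytes, rounded up to whole pages */
};

/* Reserve the whole stack range in one mapping, so that no later
 * allocation can land inside it, and protect the lowest page so that
 * an overflow faults instead of silently corrupting a neighbour. */
static int reserve_stack(struct stack_region *r, size_t want)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(r != NULL);
    if (page <= 0 || want == 0)
        return -1;

    size_t guard  = (size_t)page;
    size_t usable = (want + guard - 1) / guard * guard;
    size_t total  = usable + guard;

    void *base = mmap(NULL, total, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return -1;
    if (mprotect(base, guard, PROT_NONE) != 0) {
        munmap(base, total);
        return -1;
    }

    r->base = base;
    r->total = total;
    r->usable = (char *)base + guard;
    r->usable_size = usable;
    return 0;
}

static int release_stack(struct stack_region *r)
{
    return munmap(r->base, r->total);
}
```

Because the full range is mapped from the start, the stack's maximum size is fixed at creation time, which is exactly the guarantee that on-demand growth cannot give.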
* Re: pthread_create() slow for many threads; also time to revisit 64b context switch optimization?
From: H. Peter Anvin @ 2008-08-13 16:05 UTC
To: Ulrich Drepper
Cc: Ingo Molnar, Arjan van de Ven, akpm, hugh, linux-mm, linux-kernel, briangrant, cgd, mbligh, Linus Torvalds, Thomas Gleixner

Ulrich Drepper wrote:
> Ingo Molnar wrote:
>> i find it pretty unacceptable these days that we limit any aspect of
>> pure 64-bit apps in any way to 4GB (or any other 32-bit-ish limit).
>
> Sure, but if we can pin-point the sub-archs for which it is the problem
> then a flag to optionally request it is even easier to handle. You'd
> simply ignore the flag for anything but the P4 architecture.
>
> I personally have no problem removing the whole thing because I have no
> such machine running anymore. But there are people out there who have.

This could also be done entirely in glibc (thus removing the dependency on the kernel): set the flag if and only if you detect a P4 CPU. You don't even need to enumerate all the CPUs in the system (which would be more painful) if you make the CPUID test wide enough that it catches all compatible CPUs.

-hpa
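A deliberately wide CPUID test of the kind suggested here might look like the following C sketch. It only decodes values that a caller would already have read with the CPUID instruction (the vendor string from leaf 0 and EAX from leaf 1), so it stays portable; the function names are hypothetical, and matching every Intel family 15 part is an assumption chosen to err on the wide side, given that elsewhere in the thread only early 64-bit P4s are said to be affected.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Decode the display family from CPUID leaf 1 EAX: the base family is in
 * bits 11:8, and when it equals 0xF the extended family (bits 27:20) is
 * added to it. */
static unsigned cpuid_display_family(uint32_t leaf1_eax)
{
    unsigned base = (leaf1_eax >> 8) & 0xF;
    unsigned ext  = (leaf1_eax >> 20) & 0xFF;
    return base == 0xF ? base + ext : base;
}

/* Hypothetical wide test: any Intel family 15 part counts as a P4
 * (NetBurst) derivative. 'vendor' is the 12-character string assembled
 * from CPUID leaf 0 (EBX, EDX, ECX). */
static bool is_p4_family(const char *vendor, uint32_t leaf1_eax)
{
    assert(vendor != NULL);
    return strcmp(vendor, "GenuineIntel") == 0 &&
           cpuid_display_family(leaf1_eax) == 0xF;
}
```

A caller would then request low stack addresses only when `is_p4_family()` returns true, and use plain 64-bit allocations otherwise.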
* Re: pthread_create() slow for many threads; also time to revisit 64b context switch optimization?
From: Andi Kleen @ 2008-08-13 20:42 UTC
To: Ingo Molnar
Cc: Ulrich Drepper, Arjan van de Ven, akpm, hugh, linux-mm, linux-kernel, briangrant, cgd, mbligh, Linus Torvalds, Thomas Gleixner, H. Peter Anvin

Ingo Molnar <mingo@elte.hu> writes:
>
> i find it pretty unacceptable these days that we limit any aspect of
> pure 64-bit apps in any way to 4GB (or any other 32-bit-ish limit).

It's not limited to 2GB, there's a fallback to >4GB of course. Ok admittedly the fallback is slow, but it's there.

I would prefer to not slow down the P4s. There are **lots** of them in field. And they ran 64bit still quite well. Also back then I benchmarked on early K8 and it also made a difference there (but I admit I forgot the numbers)

I think it would be better to fix the VM because there are other use cases of applications who prefer to allocate in a lower area. For example Java JVMs now widely use a technique called pointer compression where they dynamically adjust the pointer size based on how much memory the process uses. For that you have to get low memory in the 47bit VM too. The VM should deal with that gracefully.

To be honest I always thought the linear search in the VMA list was a little dumb. I'm sure there are other cases where it hurts too. Perhaps this would be really an opportunity to do something about it :)

-Andi
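To make the cost of the linear search concrete, here is a simplified model in C. It is not the kernel's code: mappings are held in a plain sorted array, `struct mapping` and `first_fit_below` are hypothetical names, and the `visited` counter exists only to show how much work a failing request does.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for a VMA: the half-open range [start, end). */
struct mapping {
    unsigned long start, end;
};

/* Linear first-fit: find the lowest address in [base, limit) with 'len'
 * free bytes, given mappings sorted by start and not overlapping.
 * Returns 0 on failure; *visited reports how many mappings were walked. */
static unsigned long first_fit_below(const struct mapping *maps, size_t n,
                                     unsigned long base, unsigned long limit,
                                     unsigned long len, size_t *visited)
{
    unsigned long addr = base;
    assert(visited != NULL);
    *visited = 0;
    for (size_t i = 0; i < n; i++) {
        (*visited)++;
        /* Does the hole in front of this mapping fit the request? */
        if (maps[i].start >= addr && maps[i].start - addr >= len)
            break;
        if (maps[i].end > addr)
            addr = maps[i].end;
    }
    if (addr >= limit || limit - addr < len)
        return 0;
    return addr;
}
```

When the low range is completely full, every request walks all n mappings before failing, which matches the behaviour reported at the top of the thread once MAP_32BIT space is exhausted.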
* Re: pthread_create() slow for many threads; also time to revisit 64b context switch optimization?
From: Andrew Morton @ 2008-08-13 20:56 UTC
To: Andi Kleen
Cc: mingo, drepper, arjan, hugh, linux-mm, linux-kernel, briangrant, cgd, mbligh, torvalds, tglx, hpa

On Wed, 13 Aug 2008 22:42:48 +0200 Andi Kleen <andi@firstfloor.org> wrote:

> Ingo Molnar <mingo@elte.hu> writes:
> >
> > i find it pretty unacceptable these days that we limit any aspect of
> > pure 64-bit apps in any way to 4GB (or any other 32-bit-ish limit).
>
> It's not limited to 2GB, there's a fallback to >4GB of course. Ok
> admittedly the fallback is slow, but it's there.
>
> I would prefer to not slow down the P4s. There are **lots** of them in
> field. And they ran 64bit still quite well. Also back then I
> benchmarked on early K8 and it also made a difference there (but I
> admit I forgot the numbers)
>
> I think it would be better to fix the VM because there are
> other use cases of applications who prefer to allocate in a lower area.
> For example Java JVMs now widely use a technique called pointer
> compression where they dynamically adjust the pointer size based
> on how much memory the process uses. For that you have to get
> low memory in the 47bit VM too. The VM should deal with that gracefully.
>
> To be honest I always thought the linear search in the VMA list
> was a little dumb. I'm sure there are other cases where it hurts
> too. Perhaps this would be really an opportunity to do something about it :)

Yes, the free_area_cache is always going to have failure modes - I think we've been kind of waiting for it to explode.

I do think that we need an O(log(n)) search in there. It could still be on the fallback path, so we retain the mostly-O(1) benefits of free_area_cache.
* Re: pthread_create() slow for many threads; also time to revisit 64b context switch optimization?
From: Andi Kleen @ 2008-08-13 21:46 UTC
To: Andrew Morton
Cc: Andi Kleen, mingo, drepper, arjan, hugh, linux-mm, linux-kernel, briangrant, cgd, mbligh, torvalds, tglx, hpa

> Yes, the free_area_cache is always going to have failure modes - I
> think we've been kind of waiting for it to explode.
>
> I do think that we need an O(log(n)) search in there. It could still
> be on the fallback path, so we retain the mostly-O(1) benefits of
> free_area_cache.

The standard dumb way to do that would be to have two parallel trees, one to index free space (similar to e.g. the free space btrees in XFS) and the other to index the objects (like today). That would increase the constant factor somewhat by bloating the VMAs, increasing cache overhead etc, and also would be more brute force than elegant. But it would be simple and straightforward.

Perhaps the combined data structure experience of linux-kernel can come up with something better and some data structure that allows to look up both efficiently?

This would be also an opportunity to reevaluate rbtrees for the object index. One drawback of them is that they are not really optimized to be cache friendly because their nodes are too small.

-Andi
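One way to picture a separate index over free space is the following C sketch. It is an illustration only, not a design proposed in the thread: it builds a static max-tree (a segment tree in heap layout) over the gaps between sorted mappings, so that the lowest gap of a given size is found in O(log n) steps and a hopeless request fails in O(1). All names are hypothetical, and a real implementation would also need to update the index as mappings come and go.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct mapping {
    unsigned long start, end;   /* half-open range [start, end) */
};

/* Hypothetical free-space index: leaf i describes the gap in front of
 * mapping i (leaf n is the gap between the last mapping and 'limit');
 * every inner node stores the largest gap found beneath it. */
struct gap_index {
    size_t leaves;        /* n + 1 gaps */
    size_t size;          /* leaves rounded up to a power of two */
    unsigned long *max;   /* 1-based heap layout, 2 * size entries */
    unsigned long *addr;  /* start address of each gap */
};

static int gap_index_build(struct gap_index *g, const struct mapping *maps,
                           size_t n, unsigned long base, unsigned long limit)
{
    size_t size = 1;
    g->leaves = n + 1;
    while (size < g->leaves)
        size *= 2;
    g->size = size;
    g->max = calloc(2 * size, sizeof *g->max);
    g->addr = calloc(g->leaves, sizeof *g->addr);
    if (!g->max || !g->addr) {
        free(g->max);
        free(g->addr);
        return -1;
    }
    unsigned long prev_end = base;
    for (size_t i = 0; i <= n; i++) {
        unsigned long next_start = i < n ? maps[i].start : limit;
        g->addr[i] = prev_end;
        g->max[size + i] = next_start > prev_end ? next_start - prev_end : 0;
        if (i < n)
            prev_end = maps[i].end;
    }
    for (size_t i = size - 1; i >= 1; i--)
        g->max[i] = g->max[2 * i] > g->max[2 * i + 1] ? g->max[2 * i]
                                                      : g->max[2 * i + 1];
    return 0;
}

/* Lowest gap of at least 'len' bytes, or 0 if none exists.
 * *steps counts tree levels descended. */
static unsigned long gap_index_find(const struct gap_index *g,
                                    unsigned long len, size_t *steps)
{
    size_t node = 1;
    assert(steps != NULL);
    *steps = 0;
    if (len == 0 || g->max[1] < len)
        return 0;               /* the root already shows nothing fits */
    while (node < g->size) {
        /* Prefer the left child, which covers lower addresses. */
        node = g->max[2 * node] >= len ? 2 * node : 2 * node + 1;
        (*steps)++;
    }
    return g->addr[node - g->size];
}

static void gap_index_free(struct gap_index *g)
{
    free(g->max);
    free(g->addr);
}
```

The trade-off described above is visible here: the index is extra memory kept alongside the mappings themselves, in exchange for a failure path that no longer depends on how many mappings exist.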
* Re: pthread_create() slow for many threads; also time to revisit 64b context switch optimization?
From: Ingo Molnar @ 2008-08-15 12:43 UTC
To: Andi Kleen
Cc: Ulrich Drepper, Arjan van de Ven, akpm, hugh, linux-mm, linux-kernel, briangrant, cgd, mbligh, Linus Torvalds, Thomas Gleixner, H. Peter Anvin

* Andi Kleen <andi@firstfloor.org> wrote:

> Ingo Molnar <mingo@elte.hu> writes:
> >
> > i find it pretty unacceptable these days that we limit any aspect of
> > pure 64-bit apps in any way to 4GB (or any other 32-bit-ish limit).
>
> It's not limited to 2GB, there's a fallback to >4GB of course. Ok
> admittedly the fallback is slow, but it's there.

Of course - what you are missing is that _10 milliseconds_ thread creation overhead is completely unacceptable overhead: it is so bad as if we didn't even support it.

> I would prefer to not slow down the P4s. There are **lots** of them in
> field. And they ran 64bit still quite well. [...]

Nonsense, i had such a P4 based 64-bit box and it was painful. Everyone with half a brain used them as 32-bit machines. Nor is the context-switch overhead in any way significant. Plus, as Arjan mentioned it, only the earliest P4 64-bit CPUs had this problem.

> [...] Also back then I benchmarked on early K8 and it also made a
> difference there (but I admit I forgot the numbers)

that's a lot of handwaving with no actual numbers. The numbers in this discussion show that the context-switch overhead is small and that the overhead on perfectly good systems that hit this limit is obscurely high.

I'd love to zap MAP_32BIT this very minute from the kernel, but you originally shaped the whole thing in such a stupid way that makes its elimination impossible now due to ABI constraints. It would have cost you _nothing_ to have added MAP_64BIT_STACK back then, but the quick & sloppy solution was to reuse MAP_32BIT for 64-bit tasks. And you are stupid about it even now. Bleh.

The correct solution is to eliminate this flag from glibc right now, and maybe add the MAP_64BIT_STACK flag as well, as i posted it - if anyone with such old boxes still cares (i doubt anyone does). That flag then will take its usual slow route.

Ulrich?

Ingo
* Re: pthread_create() slow for many threads; also time to revisit 64b context switch optimization?
From: Andi Kleen @ 2008-08-15 13:33 UTC
To: Ingo Molnar
Cc: Andi Kleen, Ulrich Drepper, Arjan van de Ven, akpm, hugh, linux-mm, linux-kernel, briangrant, cgd, mbligh, Linus Torvalds, Thomas Gleixner, H. Peter Anvin

On Fri, Aug 15, 2008 at 02:43:50PM +0200, Ingo Molnar wrote:
> i had such a P4 based 64-bit box and it was painful.

I used them as 64bit machines and they weren't painful at all.

> I'd love to zap MAP_32BIT this very minute from the kernel, but you
> originally shaped the whole thing in such a stupid way that makes its
> elimination impossible now due to ABI constraints. It would have cost

MAP_32BIT was not actually added for this originally. It was originally added for the X server's old dynamic loader, which needed 2GB memory. Its main failing, which I freely admit, was to not call it MAP_31BIT.

> you _nothing_ to have added MAP_64BIT_STACK back then, but the quick &

Not sure what the semantics of that would be. For me it would seem ugly to hardcode specific semantics in the kernel for this ("mechanism not policy"). But for most possible semantics I can think of, the data structure would still need to be fixed, I think.

> The correct solution is to eliminate this flag from glibc right now, and

IMHO the correct solution is to fix the data structure to not have such a bad complexity in this corner case. We typically do this for all other data structures as we discover such cases. No reason the VMAs should be any different.

-Andi