Re: pthread_create() slow for many threads; also time to revisit 64b context switch optimization?

linux-mm.kvack.org archive mirror

* Re: pthread_create() slow for many threads; also time to revisit 64b context switch optimization?
       [not found] <af8810200808121736q76640cc1kb814385072fe9b29@mail.gmail.com>
@ 2008-08-13  0:45 ` Pardo
  2008-08-13 10:44   ` Ingo Molnar
  0 siblings, 1 reply; 27+ messages in thread
From: Pardo @ 2008-08-13  0:45 UTC (permalink / raw)
  To: akpm, mingo, hugh, linux-mm, linux-kernel; +Cc: briangrant, cgd, mbligh

[First send rejected by vger.kernel.org due to HTML and/or test
program attachment.  Re-sent without either; please contact me for the
test program.]

mmap() is slow on MAP_32BIT allocation failure, sometimes causing
NPTL's pthread_create() to run about three orders of magnitude slower.
As an example, in one case creating new threads goes from about 35,000
cycles up to about 25,000,000 cycles -- which is under 100 threads per
second.  Larger stacks reduce the severity of slowdown but also make
slowdown happen after allocating a few thousand threads.  Costs vary
with platform, stack size, etc., but thread allocation rates drop
suddenly on all of a half-dozen platforms I tried.
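
To put the cycle counts above in perspective, the arithmetic can be
sketched as below.  The 2.4 GHz clock rate used in the usage note is an
assumption for illustration only; the measurements above do not state a
clock rate, and the function names are made up for this sketch.

```cpp
#include <cassert>
#include <cstdint>

// Threads created per second if each pthread_create() costs
// `cycles_per_create` cycles on a CPU running at `hz` cycles per second.
// Integer division rounds down.
inline std::uint64_t threads_per_second(std::uint64_t hz,
                                        std::uint64_t cycles_per_create) {
  assert(cycles_per_create != 0);
  return hz / cycles_per_create;
}

// Slowdown factor between the slow and the fast case, rounded down.
inline std::uint64_t slowdown_factor(std::uint64_t slow_cycles,
                                     std::uint64_t fast_cycles) {
  assert(fast_cycles != 0);
  return slow_cycles / fast_cycles;
}
```

With an assumed 2.4 GHz clock, threads_per_second(2400000000, 25000000)
gives 96, consistent with "under 100 threads per second", and
slowdown_factor(25000000, 35000) gives 714, i.e. close to three orders
of magnitude.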

The cause is that NPTL allocates stacks with code of the form (e.g., glibc
2.7 nptl/allocatestack.c):

sto = mmap(0, ..., MAP_PRIVATE|MAP_32BIT, ...);
if (sto == MAP_FAILED)
  sto = mmap(0, ..., MAP_PRIVATE, ...);

That is, try to allocate in the low 4GB, and when low addresses are
exhausted, allocate from any location.  Thus, once low addresses run
out, every stack allocation does a failing mmap() followed by a
successful mmap().  The failing mmap() is slow because it does a
linear search of all low-space vma's.
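
A self-contained sketch of that two-step pattern follows.  It is not
the glibc code: the function name, the `tried_fallback` out-parameter
and the protection flags are assumptions chosen for illustration, and
MAP_32BIT is defined as a no-op where the platform lacks it.

```cpp
#include <sys/mman.h>
#include <cassert>
#include <cstddef>

#ifndef MAP_32BIT
#define MAP_32BIT 0  // Assumption: not x86-64, so the flag is a no-op.
#endif

// Allocate a thread stack, preferring low addresses and falling back to
// any address.  Returns MAP_FAILED if both attempts fail.
// `tried_fallback`, if non-null, reports whether the second mmap() ran.
void *alloc_stack_low_first(std::size_t size, bool *tried_fallback) {
  const int base = MAP_PRIVATE | MAP_ANONYMOUS;
  if (tried_fallback) *tried_fallback = false;
  void *sto = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                   base | MAP_32BIT, -1, 0);
  if (sto == MAP_FAILED) {
    // Once the low region is full, every call pays for the failed
    // attempt above before it reaches this second, successful call.
    if (tried_fallback) *tried_fallback = true;
    sto = mmap(nullptr, size, PROT_READ | PROT_WRITE, base, -1, 0);
  }
  return sto;
}
```

Calling alloc_stack_low_first(64 * 1024, &fb) in a loop and watching
when `fb` first becomes true is one way to observe the point at which
low addresses run out.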

Low-address stacks are preferred because some machines context switch
much faster when the stack address has only 32 significant bits.  Slow
allocation was discussed in 2003 but without resolution.  See, e.g.,
http://ussg.iu.edu/hypermail/linux/kernel/0305.1/0321.html,
http://ussg.iu.edu/hypermail/linux/kernel/0305.1/0517.html,
http://ussg.iu.edu/hypermail/linux/kernel/0305.1/0538.html, and
http://ussg.iu.edu/hypermail/linux/kernel/0305.1/0520.html. With
increasing use of threads, slow allocation is becoming a problem.

Some old machines were faster switching 32b stacks, but new machines
seem to switch as fast or faster using 64b stacks.  I measured
thread-to-thread context switches on two AMD processors and five Intel
processors.  Tests used the same code with 32b or 64b stack pointers;
tests covered varying numbers of threads switched and varying methods
of allocating stacks.  Two systems gave indistinguishable performance
with 32b or 64b stacks, four gave 5%-10% better performance using 64b
stacks, and of the systems I tested, only the P4 microarchitecture
x86-64 system gave better performance for 32b stacks, in that case
vastly better.  Most systems had thread-to-thread switch costs around
800-1200 cycles.  The P4 microarchitecture system had 32b context
switch costs around 3,000 cycles and 64b context switches around 4,800
cycles.

It appears the kernel's 64-bit switch path handles all 32-bit cases.
So on machines with a fast 64-bit path, context switch speed would
presumably be improved yet further by eliminating the special 32-bit
path.  It appears this would also collapse the task state's fs and
fsindex fields, and the gs and gsindex fields.  These could further
reduce memory, cache, and branch predictor pressure.

Various things would address the slow pthread_create().  Choices include:
 - Be more platform-aware about when to use MAP_32BIT.
 - Abandon use of MAP_32BIT entirely, with worse performance on some machines.
 - Change the mmap() algorithm to be faster on allocation failure
   (avoid a linear search of vmas).

Options to improve context switch times include:

 - Do nothing.
 - Be more platform-aware about when to use different 32b and 64b paths.
 - Get rid of the 32b path, which it appears would also make contexts smaller.

[Not] Attached is a program to measure context switch costs.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* Re: pthread_create() slow for many threads; also time to revisit 64b context switch optimization?
  2008-08-13  0:45 ` pthread_create() slow for many threads; also time to revisit 64b context switch optimization? Pardo
@ 2008-08-13 10:44   ` Ingo Molnar
  2008-08-13 13:35     ` Arjan van de Ven
  0 siblings, 1 reply; 27+ messages in thread
From: Ingo Molnar @ 2008-08-13 10:44 UTC (permalink / raw)
  To: Pardo
  Cc: akpm, hugh, linux-mm, linux-kernel, briangrant, cgd, mbligh,
	Ulrich Drepper, Linus Torvalds, Thomas Gleixner, H. Peter Anvin,
	Arjan van de Ven

* Pardo <pardo@google.com> wrote:

> As an example, in one case creating new threads goes from about 35,000 
> cycles up to about 25,000,000 cycles -- which is under 100 threads per 
> second. [...]

> Various things would address the slow pthread_create().  Choices 
> include:
>  - Be more platform-aware about when to use MAP_32BIT.
>  - Abandon use of MAP_32BIT entirely, with worse performance on some machines.
>  - Change the mmap() algorithm to be faster on allocation failure
>    (avoid a linear search of vmas).

Sigh, unfortunately MAP_32BIT use in 64-bit apps for stacks was 
apparently created without foresight about what would happen in the MM 
when thread stacks exhaust 4GB.

The problem is that MAP_32BIT is used both as a performance hack for 
64-bit apps and as an ABI compat mechanism for 32-bit apps. So we cannot 
just start disregarding MAP_32BIT in the kernel - we'd break 32-bit 
compat apps and/or compat 32-bit libraries.

There are various other options to solve the (severe!) performance 
breakdown:

1- glibc could start not using MAP_32BIT for 64-bit thread stacks (the 
   boxes where context-switching is slow probably do not matter all that 
   much anymore - they were very slow at everything 64-bit anyway)

     Pros: easiest solution.
     Cons: slows down the affected machines and needs a new glibc.

2- We could introduce a new MAP_64BIT_STACK flag, which we could
   propagate into MAP_32BIT on those old CPUs. It would be 
   disregarded on modern CPUs and thread stacks would be 64-bit.

     Pros: cleanest solution.
     Cons: needs both new glibc and new kernel to take advantage of.

3- We could detect the first-4G-is-full condition and cache it. Problem
   is, there will likely be small holes in it so it's rather hard to do 
   it in a sane way. Also, every munmap() of a thread stack will 
   invalidate this - triggering a slow linear search every now and then.

     Pros: only needs a new kernel to take advantage of.
     Cons: is the most complex and messiest solution with no clear 
           benefit to other workloads. Also, does not 100% solve the 
           performance problem and prolongs the 4GB stack threads 
           hack.
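
The trade-off in option 3 can be illustrated with a toy user-space
model.  This is not kernel code: the class, its slot-based view of the
low region and all names are hypothetical, chosen only to show how a
cached "low region is full" state is invalidated by every unmap.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model of option 3.  Each slot stands in for one stack-sized hole
// in the low region.
class LowRegionModel {
 public:
  explicit LowRegionModel(std::size_t slots) : used_(slots, false) {}

  // Returns a slot index in the low region, or -1 for "allocate high".
  long alloc() {
    if (known_full_) return -1;  // Cached failure: skip the search.
    ++linear_searches_;
    for (std::size_t i = 0; i < used_.size(); ++i) {
      if (!used_[i]) {
        used_[i] = true;
        return static_cast<long>(i);
      }
    }
    known_full_ = true;  // Remember that the search failed.
    return -1;
  }

  // Freeing a low slot invalidates the cached "full" state.
  void free_slot(long slot) {
    used_[static_cast<std::size_t>(slot)] = false;
    known_full_ = false;
  }

  // Number of searches performed so far, successful or not.
  std::size_t linear_searches() const { return linear_searches_; }

 private:
  std::vector<bool> used_;
  bool known_full_ = false;
  std::size_t linear_searches_ = 0;
};
```

In this model, repeated alloc() calls on a full region cost one search
in total, but each free_slot() re-arms the search, which is the "slow
linear search every now and then" described above.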

i'd go for 1) or 2).

	Ingo


* Re: pthread_create() slow for many threads; also time to revisit 64b context switch optimization?
  2008-08-13 10:44   ` Ingo Molnar
@ 2008-08-13 13:35     ` Arjan van de Ven
  2008-08-13 14:21       ` Ulrich Drepper
  0 siblings, 1 reply; 27+ messages in thread
From: Arjan van de Ven @ 2008-08-13 13:35 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Pardo, akpm, hugh, linux-mm, linux-kernel, briangrant, cgd,
	mbligh, Ulrich Drepper, Linus Torvalds, Thomas Gleixner,
	H. Peter Anvin

On Wed, 13 Aug 2008 12:44:45 +0200
Ingo Molnar <mingo@elte.hu> wrote:


> There are various other options to solve the (severe!) performance 
> breakdown:
> 
> 1- glibc could start not using MAP_32BIT for 64-bit thread stacks
> (the boxes where context-switching is slow probably do not matter all
> that much anymore - they were very slow at everything 64-bit anyway)
> 
>      Pros: easiest solution.
>      Cons: slows down the affected machines and needs a new glibc.
> 
> 
> i'd go for 1) or 2).

I would go for 1) clearly; it's the cleanest thing going forward for
sure.



-- 
If you want to reach me at my work email, use arjan@linux.intel.com
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org


* Re: pthread_create() slow for many threads; also time to revisit 64b context switch optimization?
  2008-08-13 13:35     ` Arjan van de Ven
@ 2008-08-13 14:21       ` Ulrich Drepper
  2008-08-13 14:25         ` Ingo Molnar
  0 siblings, 1 reply; 27+ messages in thread
From: Ulrich Drepper @ 2008-08-13 14:21 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Ingo Molnar, akpm, hugh, linux-mm, linux-kernel, briangrant, cgd,
	mbligh, Linus Torvalds, Thomas Gleixner, H. Peter Anvin

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Arjan van de Ven wrote:
>> i'd go for 1) or 2).
> 
> I would go for 1) clearly; it's the cleanest thing going forward for
> sure.

I want to see numbers first.  If there are problems visible I definitely
would want to see 2.  Andi at the time I wrote that code was very
adamant that I use the flag.

- --
Ulrich Drepper, Red Hat, Inc., 444 Castro St, Mountain View, CA
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)

iEYEARECAAYFAkii7gcACgkQ2ijCOnn/RHTveQCeIefB1R5QpuQ71RNMihKL5oWD
ZVoAnjjjKgXznRx8qtbrF+fgvcNwsngA
=dAz2
-----END PGP SIGNATURE-----


* Re: pthread_create() slow for many threads; also time to revisit 64b context switch optimization?
  2008-08-13 14:21       ` Ulrich Drepper
@ 2008-08-13 14:25         ` Ingo Molnar
  2008-08-13 14:36           ` Ulrich Drepper
  0 siblings, 1 reply; 27+ messages in thread
From: Ingo Molnar @ 2008-08-13 14:25 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Arjan van de Ven, akpm, hugh, linux-mm, linux-kernel, briangrant,
	cgd, mbligh, Linus Torvalds, Thomas Gleixner, H. Peter Anvin

* Ulrich Drepper <drepper@redhat.com> wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> Arjan van de Ven wrote:
> >> i'd go for 1) or 2).
> > 
> > I would go for 1) clearly; it's the cleanest thing going forward for
> > sure.
> 
> I want to see numbers first.  If there are problems visible I 
> definitely would want to see 2.  Andi at the time I wrote that code 
> was very adamant that I use the flag.

not sure exactly what numbers you mean, but there are lots of numbers in 
the first mail, attached below. For example:

| As example, in one case creating new threads goes from about 35,000 
| cycles up to about 25,000,000 cycles -- which is under 100 threads per 
| second.  Larger stacks reduce the severity of slowdown but also make

being able to create only 100 threads per second brings us back to 33 
MHz 386 DX Linux performance.

	Ingo

---------------------->

[Full text of the first mail followed here; see the original message above.]

* Re: pthread_create() slow for many threads; also time to revisit 64b context switch optimization?
  2008-08-13 14:25         ` Ingo Molnar
@ 2008-08-13 14:36           ` Ulrich Drepper
  2008-08-13 15:10             ` Ingo Molnar
  0 siblings, 1 reply; 27+ messages in thread
From: Ulrich Drepper @ 2008-08-13 14:36 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Arjan van de Ven, akpm, hugh, linux-mm, linux-kernel, briangrant,
	cgd, mbligh, Linus Torvalds, Thomas Gleixner, H. Peter Anvin

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Ingo Molnar wrote:
> not sure exactly what numbers you mean, but there are lots of numbers in 
> the first mail, attached below. For example:

I mean numbers indicating that it doesn't hurt performance on any of
today's machines.  If there are machines where it makes a difference
then we need the flag to indicate the _preference_ for a low stack, as
opposed to indicating a _requirement_.
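
The distinction between a preference and a requirement can be sketched
as a small decision function.  The flag values, the enum and the
function name below are hypothetical and are not real kernel constants;
`kRequireLow` merely models today's MAP_32BIT semantics and
`kPreferLow` models the proposed hint.

```cpp
#include <cassert>

// Hypothetical flag values for illustration only.
constexpr unsigned kRequireLow = 0x1;  // Models MAP_32BIT: must be low.
constexpr unsigned kPreferLow = 0x2;   // Models the proposed hint.

enum class Placement { kAnywhere, kLowOrFail, kLowIfCheap };

// Decide how to place a mapping, given the request flags and whether
// this CPU context-switches faster with low stack addresses.
inline Placement resolve_placement(unsigned flags, bool cpu_prefers_low) {
  if (flags & kRequireLow) return Placement::kLowOrFail;  // ABI requirement.
  if ((flags & kPreferLow) && cpu_prefers_low) return Placement::kLowIfCheap;
  return Placement::kAnywhere;  // Hint ignored where it brings no benefit.
}
```

Under this model a hint never fails an allocation: on CPUs that do not
benefit, resolve_placement(kPreferLow, false) is simply kAnywhere.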

- --
Ulrich Drepper, Red Hat, Inc., 444 Castro St, Mountain View, CA
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)

iEYEARECAAYFAkii8VcACgkQ2ijCOnn/RHTiLQCfcZ9xJHMi0Jv59l700ZNJUoi6
aEcAn370XuGhs1u1YeD2Gqq35zQnKh26
=rC0v
-----END PGP SIGNATURE-----


* Re: pthread_create() slow for many threads; also time to revisit 64b context switch optimization?
  2008-08-13 14:36           ` Ulrich Drepper
@ 2008-08-13 15:10             ` Ingo Molnar
  2008-08-13 15:21               ` Ulrich Drepper
  2008-08-13 20:42               ` Andi Kleen
  0 siblings, 2 replies; 27+ messages in thread
From: Ingo Molnar @ 2008-08-13 15:10 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Arjan van de Ven, akpm, hugh, linux-mm, linux-kernel, briangrant,
	cgd, mbligh, Linus Torvalds, Thomas Gleixner, H. Peter Anvin

* Ulrich Drepper <drepper@redhat.com> wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> Ingo Molnar wrote:
> > not sure exactly what numbers you mean, but there are lots of numbers in 
> > the first mail, attached below. For example:
> 
> I mean numbers indicating that it doesn't hurt performance on any of 
> today's machines.  If there are machines where it makes a difference 
> then we need the flag to indicate the _preference_ for a low stack, as 
> opposed to indicating a _requirement_.

there were a few numbers about that as well, and a test-app. The test 
app is below. The numbers were:

| I measured thread-to-thread context switches on two AMD processors and 
| five Intel processors.  Tests used the same code with 32b or 64b stack 
| pointers; tests covered varying numbers of threads switched and 
| varying methods of allocating stacks.  Two systems gave 
| indistinguishable performance with 32b or 64b stacks, four gave 5%-10% 
| better performance using 64b stacks, and of the systems I tested, only 
| the P4 microarchitecture x86-64 system gave better performance for 32b 
| stacks, in that case vastly better.  Most systems had thread-to-thread 
| switch costs around 800-1200 cycles.  The P4 microarchitecture system 
| had 32b context switch costs around 3,000 cycles and 64b context 
| switches around 4,800 cycles.

i find it pretty unacceptable these days that we limit any aspect of 
pure 64-bit apps in any way to 4GB (or any other 32-bit-ish limit). 
[other than the small execution model which is 2GB obviously.]

	Ingo

--------------------->
// switch.cc -- measure thread-to-thread context switch times
// using either low-address stacks or high-address stacks

#include <sys/mman.h>
#include <sys/types.h>
#include <pthread.h>
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

const int kRequestedSwaps = 10000;
const int kNumThreads = 2;
const int kRequestedSwapsPerThread = kRequestedSwaps / kNumThreads;
const int kStackSize = 64 * 1024;
const int kTrials = 100;



typedef long long Tsc;
// Parenthesize the shift: '-' binds tighter than '<<'.
#define LARGEST_TSC	(static_cast<Tsc>((1ULL << (8 * sizeof(Tsc) - 2)) - 1))

Tsc now() {
  unsigned int eax_lo, edx_hi;
  Tsc now;
  asm volatile("rdtsc" : "=a" (eax_lo), "=d" (edx_hi));
  now = ((Tsc)eax_lo) | ((Tsc)(edx_hi) << 32);
  return now;
}



// Use 0/1 for size to allow array subscripting.
const int pointer_sizes[] = { 32, 64 };
#define SZ_N  (sizeof(pointer_sizes) / sizeof(pointer_sizes[0]))
typedef int PointerSize;

PointerSize address_size(const void *vaddr) {
  intptr_t iaddr = reinterpret_cast<intptr_t>(vaddr);
  return ((iaddr >> 32) == 0) ? 0 : 1;
}



// One instance pointed to by every PerThread.
struct SharedArgs {
  // Read-only during a given test:
  cpu_set_t cpu;          // Only one bit set; all threads run on this CPU.

  // Read/write during a given test:
  pthread_barrier_t start_barrier;
  pthread_barrier_t stop_barrier;
};

// One per thread.
struct PerThread {
  // Thread args
  SharedArgs *shared_args;
  Tsc *stamps;

  // Per-thread storage.
  pthread_t thread;
  void *stack[SZ_N];                    // mmap()'d storage
  pthread_attr_t attr;
};



// Distinguish between start/stop timestamp for each iteration
typedef enum { START, STOP } StartStop;

// Record each timestamp in isolation for minimum runtime cache footprint;
// after a run, copy each timestamp to one of these so can sort and also track
// start/stop, etc.
struct Event {
  Tsc time;
  StartStop start_stop;
  int thread_num;
  int iter;
};

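// Sort events in increasing time order.  Compare explicitly rather than
// subtracting: the 64-bit difference does not fit the int return value.
int event_pred(const void *ve0, const void *ve1) {
  const Event *e0 = static_cast<const Event *>(ve0);
  const Event *e1 = static_cast<const Event *>(ve1);
  return (e0->time > e1->time) - (e0->time < e1->time);
}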

// Data to aggregate across runs.  Print only after runs are all over, in order
// to minimize possible overlap of I/O and benchmark.
struct Result {
  int pointer_size;
  int swaps;
  Tsc fastest;
};



// Each thread runs this worker.
void *worker(void *v_per_thread) {
  const PerThread *per_thread = static_cast<const PerThread *>(v_per_thread);
  SharedArgs *shared_args = per_thread->shared_args;

  // Run all threads on the same CPU.
  const cpu_set_t *cpu = &shared_args->cpu;
  int cc = sched_setaffinity(0/*self*/, sizeof(*cpu), cpu);
  if (cc != 0) {
    perror("sched_setaffinity");
    exit(1);
  }

  // Wait for all workers to be ready before running the inner loop.
  cc = pthread_barrier_wait(&shared_args->start_barrier);
  if ((cc != 0) && (cc != PTHREAD_BARRIER_SERIAL_THREAD)) {
    perror("pthread_barrier_wait");
    exit(1);
  }

  // Inner loop: track time before and after a swap.  In principle we
  // can use just one timestamp per iteration, but that gives more
  // variance between timestamps from overheads such as cache misses
  // not related to the context switch.
  Tsc *stamp = per_thread->stamps;
  for (int i = 0; i < kRequestedSwapsPerThread; ++i) {
    // Run timed critical section in as much isolation as possible.
    // Notably, read stamps but avoid saving them to memory and taking
    // cache misses until after both %tsc reads.
    asm volatile ("nop" ::: "memory");
    Tsc start = now();
    sched_yield();
    Tsc stop = now();
    asm volatile ("nop" ::: "memory");
    *stamp++ = start;
    *stamp++ = stop;
  }

  // Release the manager to clean up.
  cc = pthread_barrier_wait(&shared_args->stop_barrier);
  if ((cc != 0) && (cc != PTHREAD_BARRIER_SERIAL_THREAD)) {
    perror("pthread_barrier_wait");
    exit(1);
  }

  return NULL;
}


// Manager code that creates and starts worker threads, waits, then cleans up.
void run_test(PerThread *per_thread, PointerSize ps) {
  // Create worker threads.
  for (int th = 0; th < kNumThreads; ++th) {
    int cc = pthread_attr_setstack(&per_thread[th].attr,
                                   per_thread[th].stack[ps], kStackSize);
    if (cc != 0) {
      perror("pthread_attr_setstack");
      exit(1);
    }

    cc = pthread_create(&per_thread[th].thread, &per_thread[th].attr,
                        worker, &per_thread[th]);
    if (cc != 0) {
      perror("pthread_create");
      exit(1);
    }
  }

  // Release all worker threads to run their inner loop,
  // then wait for all to finish before joining any.
  SharedArgs *shared_args = per_thread->shared_args;
  int cc = pthread_barrier_wait(&shared_args->start_barrier);
  if ((cc != 0) && (cc != PTHREAD_BARRIER_SERIAL_THREAD)) {
    perror("pthread_barrier_wait");
    exit(1);
  }
  cc = pthread_barrier_wait(&shared_args->stop_barrier);
  if ((cc != 0) && (cc != PTHREAD_BARRIER_SERIAL_THREAD)) {
    perror("pthread_barrier_wait");
    exit(1);
  }

  // Clean up worker threads.
  for (int th = 0; th < kNumThreads; ++th) {
    int cc = pthread_join(per_thread[th].thread, NULL);
    if (cc != 0) {
      perror("pthread_join");
      exit(1);
    }
  }
}


// After a run, find out which sched_yield() calls actually did a yield,
// then find out the fastest sched_yield() that occurred during the run.
Result process_data(Event *event, const PerThread per_thread[],
                    int requested_swaps_per_thread, PointerSize pointer_size) {
  // Copy timestamps in to a struct to associate timestamps with thread number.
  int event_num = 0;
  for (int th = 0; th < kNumThreads; ++th) {
    const Tsc *stamps = per_thread[th].stamps;
    int stamp_num = 0;
    StartStop start_stop = START;
    // 2* because there's a start stamp and stop stamp for each swap
    for (int iter = 0; iter < (2 * requested_swaps_per_thread); ++iter) {
      event[event_num].time = stamps[stamp_num++];
      event[event_num].start_stop = start_stop;
      start_stop = (start_stop == START) ? STOP : START;
      event[event_num].thread_num = th;
      event[event_num].iter = iter;
      ++event_num;
    }
  }
  int num_events = event_num;

  // Sort data in timestamp order.
  qsort(event, num_events, sizeof(event[0]), event_pred);

  // A context switch occurred if two adjacent stamps are for
  // different threads.  A requested context switch very likely
  // occurred if a context switch was between a START stamp in the
  // first thread and a STOP stamp in the second.  Note that some
  // non-requested context switches also get logged.  As an example, a
  // preemptive cswap could have occurred, and the following
  // sched_yield() may have done a yield-to-self.
  Tsc fastest = LARGEST_TSC;
  int swaps = 0;
  for (int e = 0; e < (num_events - 1); ++e) {
    if ((event[e].thread_num != event[e+1].thread_num) &&
        (event[e].start_stop == START) && (event[e+1].start_stop == STOP)) {
      ++swaps;
      Tsc t = event[e+1].time - event[e].time;
      if (t < fastest)
        fastest = t;
    }
  }

  Result result;
  result.pointer_size = pointer_size;
  result.swaps = swaps;
  result.fastest = fastest;
  return result;
}


// Dump results for one run.  Also aggregate "best of best" and "worst of best".
void dump_one_run(Tsc best[SZ_N], Tsc worst[SZ_N], int trial_num,
                  const Result *result) {
  Tsc t = result->fastest;
  PointerSize ps = result->pointer_size;
  int cc = printf("run: %d pointer-size: %d requested-swaps: %d got-swaps: %d fastest: %lld\n",
                  trial_num, pointer_sizes[ps],
                  kRequestedSwaps, result->swaps, result->fastest);
  if (cc < 0) {
    perror("printf");
    exit(1);
  }
  if (t < best[ps])
    best[ps] = t;
  if (t > worst[ps])
    worst[ps] = t;
}

void *mmap_stack(PointerSize pointer_size) {
  int location_flag;
  switch(pointer_sizes[pointer_size]) {
    case 32: location_flag = MAP_32BIT; break;
    case 64: location_flag = 0x0; break;
    default:
      fprintf(stderr, "Implementation error: unhandled stack placement\n");
      exit(1);
  }

  // Anonymous mappings take no file; pass fd -1 as portable code expects.
  void *stack = mmap(0, kStackSize, PROT_READ|PROT_WRITE,
                     MAP_PRIVATE|MAP_ANONYMOUS|location_flag, -1, 0);
  if (stack == MAP_FAILED) {
    perror("mmap");
    exit(1);
  }

  // Check we got the stack location we requested
  PointerSize got = address_size(stack);
  if (got != pointer_size) {
    // Note: Windows and Linux are asymmetrical about %p: one prints
    // with a leading 0x, the other does not.  Assume here it does not matter.
    fprintf(stderr, "Did not get requested pointer size: got %p\n", stack);
    exit(1);
  }

  return stack;
}

void munmap_stack(void *stack) {
  int cc = munmap(stack, kStackSize);
  if (cc != 0) {
    perror("munmap");
    exit(1);
  }
}

int main(int argc, char **argv) {
  SharedArgs shared_args;

  // Find the highest-numbered CPU; all threads run on that CPU only.
  {
    cpu_set_t set;
    int sz = sched_getaffinity(0, sizeof(set), &set);
    // Documentation says sched_getaffinity() returns the size used by
    // the kernel, but by experiment it returns zero on some 2.6.18
    // systems, but with a sensible mask nonetheless.
    if (sz < 0) {
      perror ("sched_getaffinity");
      exit(1);
    }
    // Find an available processor/core.  If possible grab something other
    // than CPU 0 to minimize interference from interrupts preferentially
    // delivered to core 0.
    int proc;
    for (proc=CPU_SETSIZE-1; proc>=0; --proc)
      if (CPU_ISSET(proc, &set))
        break;
    if (proc < 0) {  // The loop above counts down and ends at -1 if no CPU is set.
      fprintf (stderr, "No virtual processors!?\n");
      exit(1);
    }
    CPU_ZERO(&shared_args.cpu);
    CPU_SET(proc, &shared_args.cpu);
  }

  // Reusable per-thread setup
  PerThread per_thread[kNumThreads];
  for (int th = 0; th < kNumThreads; ++th) {
    per_thread[th].stamps = new Tsc[2 * kRequestedSwaps];
    per_thread[th].shared_args = &shared_args;
    for (int ps = 0; ps < SZ_N; ++ps)
      per_thread[th].stack[ps] = mmap_stack(static_cast<PointerSize>(ps));
    int cc = pthread_attr_init(&per_thread[th].attr);
    if (cc != 0) {
      perror("pthread_attr_init");
      exit(1);
    }
  }

  // Storage for post-processing timestamps from one trial run.
  // 2 stamps per iteration.  'new' the storage since long runs
  // otherwise overflow the stack.
  Event *event = new Event[kNumThreads * (2 * kRequestedSwaps)];

  // Post-processed data for all trial runs.  Written during the "run
  // tests" phase and read during the "dump data" phase.
  const int kNumRuns = kTrials * SZ_N;  // const so result[] is not a VLA
  Result result[kNumRuns];
  int result_num = 0;

  // Pthread barriers are cyclic, so can reuse them. +1 for the manager thread
  pthread_barrier_init(&shared_args.start_barrier, NULL, kNumThreads + 1);
  pthread_barrier_init(&shared_args.stop_barrier, NULL, kNumThreads + 1);

  // Warming runs
  {
    run_test(per_thread, static_cast<PointerSize>(0/*32b*/));
    run_test(per_thread, static_cast<PointerSize>(1/*64b*/));
  }

  // Run tests
  for (int trial = 0; trial < kTrials; ++trial) {
    int requested_swaps_per_thread = kRequestedSwaps / kNumThreads;
    for (int ps = 0; ps < SZ_N; ++ps) {
      PointerSize pointer_size = static_cast<PointerSize>(ps);
      run_test(per_thread, pointer_size);

      // Process data and save to RAM.  Do not do explicit I/O here on the
      // basis background activity may interfere with context switches.
      result[result_num++] = process_data(event,
                                          per_thread,
                                          requested_swaps_per_thread,
                                          pointer_size);
    }
  }

  // Cleanup
  pthread_barrier_destroy(&shared_args.start_barrier);
  pthread_barrier_destroy(&shared_args.stop_barrier);

  for (int th = 0; th < kNumThreads; ++th) {
    delete[] per_thread[th].stamps;
    for (int ps = 0; ps < SZ_N; ++ps)
      munmap_stack(per_thread[th].stack[ps]);
    int cc = pthread_attr_destroy(&per_thread[th].attr);
    if (cc != 0) {
      perror("pthread_attr_destroy");
      exit(1);
    }
  }
  delete[] event;

  // Dump data from RAM to stdout.
  Tsc best[SZ_N] = { LARGEST_TSC, LARGEST_TSC };
  Tsc worst[SZ_N] = { 0, 0 };
  for (int r = 0; r < result_num; ++r)
    dump_one_run(best, worst, r, &result[r]);
  for (int sz = 0; sz < SZ_N; ++sz) {
    int cc = printf("best-of-best[%d]: %lld\nworst-of-best[%d]: %lld\n",
                    pointer_sizes[sz], best[sz], pointer_sizes[sz], worst[sz]);
    if (cc < 0) {
      perror("printf");
      exit(1);
    }
  }
}


* Re: pthread_create() slow for many threads; also time to revisit 64b context switch optimization?
  2008-08-13 15:10             ` Ingo Molnar
@ 2008-08-13 15:21               ` Ulrich Drepper
  2008-08-13 15:40                 ` Ingo Molnar
  2008-08-13 16:05                 ` H. Peter Anvin
  2008-08-13 20:42               ` Andi Kleen
  1 sibling, 2 replies; 27+ messages in thread
From: Ulrich Drepper @ 2008-08-13 15:21 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Arjan van de Ven, akpm, hugh, linux-mm, linux-kernel, briangrant,
	cgd, mbligh, Linus Torvalds, Thomas Gleixner, H. Peter Anvin

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Ingo Molnar wrote:
> i find it pretty unacceptable these days that we limit any aspect of 
> pure 64-bit apps in any way to 4GB (or any other 32-bit-ish limit). 

Sure, but if we can pin-point the sub-archs for which it is the problem
then a flag to optionally request it is even easier to handle.  You'd
simply ignore the flag for anything but the P4 architecture.

I personally have no problem removing the whole thing because I have no
such machine running anymore.  But there are people out there who have.

- --
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)

iEYEARECAAYFAkii/BcACgkQ2ijCOnn/RHQ8FACfZFV+WaBmS6UNqZZ/xDfV/Z7z
gIAAoJSmbauchdaIVIebz8N2rPrszAMF
=WAzJ
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: pthread_create() slow for many threads; also time to revisit 64b context switch optimization?
  2008-08-13 15:21               ` Ulrich Drepper
@ 2008-08-13 15:40                 ` Ingo Molnar
  2008-08-13 15:55                   ` Ulrich Drepper
  2008-08-13 16:05                 ` H. Peter Anvin
  1 sibling, 1 reply; 27+ messages in thread
From: Ingo Molnar @ 2008-08-13 15:40 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Arjan van de Ven, akpm, hugh, linux-mm, linux-kernel, briangrant,
	cgd, mbligh, Linus Torvalds, Thomas Gleixner, H. Peter Anvin

* Ulrich Drepper <drepper@redhat.com> wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> Ingo Molnar wrote:
> > i find it pretty unacceptable these days that we limit any aspect of 
> > pure 64-bit apps in any way to 4GB (or any other 32-bit-ish limit). 
> 
> Sure, but if we can pin-point the sub-archs for which it is the 
> problem then a flag to optionally request it is even easier to handle.  
> You'd simply ignore the flag for anything but the P4 architecture.

i suspect you are talking about option #2 i described. It is the option 
which will take the most time to trickle down to people.

> I personally have no problem removing the whole thing because I have 
> no such machine running anymore.  But there are people out there who 
> have.

hm, i think the set of people running on such boxes _and_ then upgrading 
to a new glibc and expecting everything to be just as fast to the 
microsecond as before should be minuscule. Those P4 derived 64-bit boxes 
were astonishingly painful in 64-bit mode - most of that hw is running 
32-bit i suspect, because 64-bit on it was really a joke.

Btw., can you see any problems with option #1: simply removing MAP_32BIT 
from 64-bit stack allocations in glibc unconditionally? It's the fastest 
to execute and also the most obvious solution. +1 usecs overhead in the 
64-bit context-switch path on those old slow boxes won't matter much. 

10 _millisecs_ to start a single thread on top-of-the-line hw is quite 
unacceptable. (and there's little sane we can do in the kernel about 
allocation overhead when we have an imperfectly filled 4GB box for all 
allocations)

	Ingo

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: pthread_create() slow for many threads; also time to revisit 64b context switch optimization?
  2008-08-13 15:40                 ` Ingo Molnar
@ 2008-08-13 15:55                   ` Ulrich Drepper
  2008-08-13 16:02                     ` Ingo Molnar
  2008-08-13 17:09                     ` Linus Torvalds
  0 siblings, 2 replies; 27+ messages in thread
From: Ulrich Drepper @ 2008-08-13 15:55 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Arjan van de Ven, akpm, hugh, linux-mm, linux-kernel, briangrant,
	cgd, mbligh, Linus Torvalds, Thomas Gleixner, H. Peter Anvin

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Ingo Molnar wrote:
> Btw., can you see any problems with option #1: simply removing MAP_32BIT 
> from 64-bit stack allocations in glibc unconditionally?

Yes, as we both agree, there are still such machines out there.

The real problem is: what to do if somebody complains?  If we would have
the extra flag such people could be accommodated.  If there is no such
flag then distributions cannot just add the flag (it's part of the
kernel API) and they would be caught between a rock and a hard place.
Option #2 provides the biggest flexibility.

If the upstream kernel truly doesn't care about such machines anymore, there
are two options:

- - really do nothing at all

- - at least reserve a flag in case somebody wants/has to implement option
  #2

- --
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)

iEYEARECAAYFAkijA+4ACgkQ2ijCOnn/RHRhLQCdGNvwikwY4hMHBuYUP4WDqsy3
cfcAn2hrN1MoOkN3UIC4iSUCtqD2Yl6W
=yG5T
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: pthread_create() slow for many threads; also time to revisit 64b context switch optimization?
  2008-08-13 15:55                   ` Ulrich Drepper
@ 2008-08-13 16:02                     ` Ingo Molnar
  2008-08-15 15:54                       ` Jamie Lokier
  2008-08-13 17:09                     ` Linus Torvalds
  1 sibling, 1 reply; 27+ messages in thread
From: Ingo Molnar @ 2008-08-13 16:02 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Arjan van de Ven, akpm, hugh, linux-mm, linux-kernel, briangrant,
	cgd, mbligh, Linus Torvalds, Thomas Gleixner, H. Peter Anvin

* Ulrich Drepper <drepper@redhat.com> wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> Ingo Molnar wrote:
> > Btw., can you see any problems with option #1: simply removing MAP_32BIT 
> > from 64-bit stack allocations in glibc unconditionally?
> 
> Yes, as we both agree, there are still such machines out there.
> 
> The real problem is: what to do if somebody complains?  If we would 
> have the extra flag such people could be accommodated.  If there is no 
> such flag then distributions cannot just add the flag (it's part of 
> the kernel API) and they would be caught between a rock and a hard 
> place. Option #2 provides the biggest flexibility.
> 
> If the upstream kernel truly doesn't care about such machines anymore, there
> are two options:
> 
> - - really do nothing at all

do nothing at all is not an option - thread creation can take 10 msecs 
on top-of-the-line hardware.

> - - at least reserve a flag in case somebody wants/has to implement option
>   #2

yeah, i already had a patch for that when i wrote my first mail 
[attached below] and listed it as option #4 - then erased the comment 
figuring that we'd want to do #1 ;-)

As unimplemented flags just get ignored by the kernel, if this flag goes 
into v2.6.27 as-is and is ignored by the kernel (i.e. we just use a 
plain old 64-bit [47-bit] allocation), then you could do the glibc 
change straight away, correct? So then if people complain we can fix it 
in the kernel purely.

how about this then?

	Ingo

--------------------->
Subject: mmap: add MAP_64BIT_STACK
From: Ingo Molnar <mingo@elte.hu>
Date: Wed Aug 13 12:41:54 CEST 2008

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 include/asm-x86/mman.h |    1 +
 1 file changed, 1 insertion(+)

Index: linux/include/asm-x86/mman.h
===================================================================
--- linux.orig/include/asm-x86/mman.h
+++ linux/include/asm-x86/mman.h
@@ -12,6 +12,7 @@
 #define MAP_NORESERVE	0x4000		/* don't check for reservations */
 #define MAP_POPULATE	0x8000		/* populate (prefault) pagetables */
 #define MAP_NONBLOCK	0x10000		/* do not block on IO */
+#define MAP_64BIT_STACK	0x20000		/* give out 32bit addresses on old CPUs */
 
 #define MCL_CURRENT	1		/* lock all current mappings */
 #define MCL_FUTURE	2		/* lock all future mappings */

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: pthread_create() slow for many threads; also time to revisit 64b context switch optimization?
  2008-08-13 16:02                     ` Ingo Molnar
@ 2008-08-15 15:54                       ` Jamie Lokier
  2008-08-15 16:03                         ` Ingo Molnar
  2008-08-15 17:13                         ` Ulrich Drepper
  0 siblings, 2 replies; 27+ messages in thread
From: Jamie Lokier @ 2008-08-15 15:54 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Ulrich Drepper, Arjan van de Ven, akpm, hugh, linux-mm,
	linux-kernel, briangrant, cgd, mbligh, Linus Torvalds,
	Thomas Gleixner, H. Peter Anvin

Ingo Molnar wrote:
> As unimplemented flags just get ignored by the kernel, if this flag goes 
> into v2.6.27 as-is and is ignored by the kernel (i.e. we just use a 
> plain old 64-bit [47-bit] allocation), then you could do the glibc 
> change straight away, correct? So then if people complain we can fix it 
> in the kernel purely.
> 
> how about this then?

> +#define MAP_64BIT_STACK 0x20000         /* give out 32bit addresses on old CPUs */

I think the flag makes sense but its name is confusing - 64BIT for a
flag which means "maybe request 32-bit stack"!  Suggest:

+#define MAP_STACK       0x20000         /* 31bit or 64bit address for stack, */
+                                        /* whichever is faster on this CPU */

Also, is this _only_ useful for thread stacks, or are there other
memory allocations where 31-bitness affects execution speed on old P4s?

-- Jamie

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: pthread_create() slow for many threads; also time to revisit 64b context switch optimization?
  2008-08-15 15:54                       ` Jamie Lokier
@ 2008-08-15 16:03                         ` Ingo Molnar
  2008-08-15 17:13                         ` Ulrich Drepper
  1 sibling, 0 replies; 27+ messages in thread
From: Ingo Molnar @ 2008-08-15 16:03 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Ulrich Drepper, Arjan van de Ven, akpm, hugh, linux-mm,
	linux-kernel, briangrant, cgd, mbligh, Linus Torvalds,
	Thomas Gleixner, H. Peter Anvin

* Jamie Lokier <jamie@shareable.org> wrote:

> > how about this then?
> 
> > +#define MAP_64BIT_STACK 0x20000         /* give out 32bit addresses on old CPUs */
> 
> I think the flag makes sense but its name is confusing - 64BIT for a 
> flag which means "maybe request 32-bit stack"!  Suggest:
> 
> +#define MAP_STACK       0x20000         /* 31bit or 64bit address for stack, */
> +                                        /* whichever is faster on this CPU */

ok. I've applied the patch below to tip/x86/urgent.

> Also, is this _only_ useful for thread stacks, or are there other 
> memory allocations where 31-bitness affects execution speed on old 
> P4s?

just about anything i guess - but since those CPUs do not really matter 
anymore in terms of bleeding-edge performance, what we care about is the 
intended current use of this flag: thread stacks.

	Ingo

-------------------->

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: pthread_create() slow for many threads; also time to revisit 64b context switch optimization?
  2008-08-15 15:54                       ` Jamie Lokier
  2008-08-15 16:03                         ` Ingo Molnar
@ 2008-08-15 17:13                         ` Ulrich Drepper
  2008-08-15 17:19                           ` Ingo Molnar
  1 sibling, 1 reply; 27+ messages in thread
From: Ulrich Drepper @ 2008-08-15 17:13 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Ingo Molnar, Arjan van de Ven, akpm, hugh, linux-mm,
	linux-kernel, briangrant, cgd, mbligh, Linus Torvalds,
	Thomas Gleixner, H. Peter Anvin

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Jamie Lokier wrote:
> Suggest:
> 
> +#define MAP_STACK       0x20000         /* 31bit or 64bit address for stack, */
> +                                        /* whichever is faster on this CPU */

I agree.  Except for the comment.


> Also, is this _only_ useful for thread stacks, or are there other
> memory allocations where 31-bitness affects execution speed on old P4s?

Actually, I would define the flag as "do whatever is best assuming the
allocation is used for stacks".

For instance, minimally the /proc/*/maps output could show "[user
stack]" or something like this.  For security, perhaps, setting of
PROT_EXEC can be prevented.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: Using GnuPG with Fedora - http://enigmail.mozdev.org

iEYEARECAAYFAkiluUMACgkQ2ijCOnn/RHSb5gCfb5VhiLA/wbamoAVqfxR32k4N
tSIAoK/KAmwcVd+RjkPnb9RSuAeL/KLV
=2ynl
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: pthread_create() slow for many threads; also time to revisit 64b context switch optimization?
  2008-08-15 17:13                         ` Ulrich Drepper
@ 2008-08-15 17:19                           ` Ingo Molnar
  2008-08-15 17:23                             ` Ulrich Drepper
  0 siblings, 1 reply; 27+ messages in thread
From: Ingo Molnar @ 2008-08-15 17:19 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Jamie Lokier, Arjan van de Ven, akpm, hugh, linux-mm,
	linux-kernel, briangrant, cgd, mbligh, Linus Torvalds,
	Thomas Gleixner, H. Peter Anvin

* Ulrich Drepper <drepper@gmail.com> wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> Jamie Lokier wrote:
> > Suggest:
> > 
> > +#define MAP_STACK       0x20000         /* 31bit or 64bit address for stack, */
> > +                                        /* whichever is faster on this CPU */
> 
> I agree.  Except for the comment.
> 
> 
> > Also, is this _only_ useful for thread stacks, or are there other
> > memory allocations where 31-bitness affects execution speed on old P4s?
> 
> Actually, I would define the flag as "do whatever is best assuming the
> allocation is used for stacks".
> 
> For instance, minimally the /proc/*/maps output could show "[user
> stack]" or something like this.  For security, perhaps, setting of
> PROT_EXEC can be prevented.

makes sense. Updated patch below. I've also added your Acked-by. Queued 
it up in tip/x86/urgent, for v2.6.27 merging.

( also, just to make sure: all Linux kernel versions will ignore such 
  extra flags, so you can just update glibc to use this flag 
  unconditionally, correct? )

	Ingo

--------------------------->

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: pthread_create() slow for many threads; also time to revisit 64b context switch optimization?
  2008-08-15 17:19                           ` Ingo Molnar
@ 2008-08-15 17:23                             ` Ulrich Drepper
  2008-08-15 19:00                               ` Ingo Molnar
  0 siblings, 1 reply; 27+ messages in thread
From: Ulrich Drepper @ 2008-08-15 17:23 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jamie Lokier, Arjan van de Ven, akpm, hugh, linux-mm,
	linux-kernel, briangrant, cgd, mbligh, Linus Torvalds,
	Thomas Gleixner, H. Peter Anvin

On Fri, Aug 15, 2008 at 10:19 AM, Ingo Molnar <mingo@elte.hu> wrote:
> ( also, just to make sure: all Linux kernel versions will ignore such
>  extra flags, so you can just update glibc to use this flag
>  unconditionally, correct? )

As soon as the patch hits Linus' tree I can change the code.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: pthread_create() slow for many threads; also time to revisit 64b context switch optimization?
  2008-08-15 17:23                             ` Ulrich Drepper
@ 2008-08-15 19:00                               ` Ingo Molnar
  0 siblings, 0 replies; 27+ messages in thread
From: Ingo Molnar @ 2008-08-15 19:00 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Jamie Lokier, Arjan van de Ven, akpm, hugh, linux-mm,
	linux-kernel, briangrant, cgd, mbligh, Linus Torvalds,
	Thomas Gleixner, H. Peter Anvin

* Ulrich Drepper <drepper@gmail.com> wrote:

> On Fri, Aug 15, 2008 at 10:19 AM, Ingo Molnar <mingo@elte.hu> wrote:
> > ( also, just to make sure: all Linux kernel versions will ignore such
> >  extra flags, so you can just update glibc to use this flag
> >  unconditionally, correct? )
> 
> As soon as the patch hits Linus' tree I can change the code.

it's upstream now:

| commit cd98a04a59e2f94fa64d5bf1e26498d27427d5e7
| Author: Ingo Molnar <mingo@elte.hu>
| Date:   Wed Aug 13 18:02:18 2008 +0200
|
|     x86: add MAP_STACK mmap flag

thanks everyone,

	Ingo

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: pthread_create() slow for many threads; also time to revisit 64b context switch optimization?
  2008-08-13 15:55                   ` Ulrich Drepper
  2008-08-13 16:02                     ` Ingo Molnar
@ 2008-08-13 17:09                     ` Linus Torvalds
  2008-08-13 18:04                       ` Ulrich Drepper
  1 sibling, 1 reply; 27+ messages in thread
From: Linus Torvalds @ 2008-08-13 17:09 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Ingo Molnar, Arjan van de Ven, akpm, hugh, linux-mm,
	linux-kernel, briangrant, cgd, mbligh, Thomas Gleixner,
	H. Peter Anvin


On Wed, 13 Aug 2008, Ulrich Drepper wrote:
> 
> The real problem is: what to do if somebody complains?

Ulrich, I don't understand why you worry more about a _potential_ (and 
fairly unlikely) complaint, than about a real one today.

Thinking ahead may be good, but you take it to absolutely ridiculous 
heights, to the point where you make potential problems be bigger than 
-actual- problems.

		Linus

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: pthread_create() slow for many threads; also time to revisit 64b context switch optimization?
  2008-08-13 17:09                     ` Linus Torvalds
@ 2008-08-13 18:04                       ` Ulrich Drepper
  2008-08-13 18:16                         ` Arjan van de Ven
  0 siblings, 1 reply; 27+ messages in thread
From: Ulrich Drepper @ 2008-08-13 18:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Arjan van de Ven, akpm, hugh, linux-mm,
	linux-kernel, briangrant, cgd, mbligh, Thomas Gleixner,
	H. Peter Anvin

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Linus Torvalds wrote:
> Ulrich, I don't understand why you worry more about a _potential_ (and 
> fairly unlikely) complaint, than about a real one today.

Of course I care.  All I try to do is to prevent going from one extreme
(all focus on P4s) to the other (ignore P4s completely).

Even ignoring this one case here, I think it's in any case useful for
userlevel to tell the kernel that an anonymous memory region is needed
for a stack.  This might allow better optimizations and/or security
implementations.

- --
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)

iEYEARECAAYFAkijIi0ACgkQ2ijCOnn/RHRqCwCcCAeJw+BzO9MSwKRtemm5VAq3
FBYAoKbMwR1pkthjLvNlpCSVS76CCoAq
=UfmJ
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: pthread_create() slow for many threads; also time to revisit 64b context switch optimization?
  2008-08-13 18:04                       ` Ulrich Drepper
@ 2008-08-13 18:16                         ` Arjan van de Ven
  2008-08-13 18:22                           ` Ulrich Drepper
  0 siblings, 1 reply; 27+ messages in thread
From: Arjan van de Ven @ 2008-08-13 18:16 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Linus Torvalds, Ingo Molnar, akpm, hugh, linux-mm, linux-kernel,
	briangrant, cgd, mbligh, Thomas Gleixner, H. Peter Anvin

On Wed, 13 Aug 2008 11:04:29 -0700
Ulrich Drepper <drepper@redhat.com> wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> Linus Torvalds wrote:
> > Ulrich, I don't understand why you worry more about a _potential_
> > (and fairly unlikely) complaint, than about a real one today.
> 
> Of course I care.  All I try to do is to prevent going from one
> extreme (all focus on P4s) to the other (ignore P4s completely).

(fwiw as far as I know this is only about early 64 bit P4s, not later
generations)
> 
> Even ignoring this one case here, I think it's in any case useful for
> userlevel to tell the kernel that an anonymous memory region is needed
> for a stack.  This might allow better optimizations and/or security
> implementations.

yeah maybe we should also tell it we expect it to be used downwards.
Oh wait.. MAP_GROWSDOWN ?

-- 
If you want to reach me at my work email, use arjan@linux.intel.com
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: pthread_create() slow for many threads; also time to revisit 64b context switch optimization?
  2008-08-13 18:16                         ` Arjan van de Ven
@ 2008-08-13 18:22                           ` Ulrich Drepper
  0 siblings, 0 replies; 27+ messages in thread
From: Ulrich Drepper @ 2008-08-13 18:22 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Linus Torvalds, Ingo Molnar, akpm, hugh, linux-mm, linux-kernel,
	briangrant, cgd, mbligh, Thomas Gleixner, H. Peter Anvin

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Arjan van de Ven wrote:
> yeah maybe we should also tell it we expect it to be used downwards.
> Oh wait.. MAP_GROWSDOWN ?

MAP_GROWSDOWN is unusable because we have to allocate the entire address
range for the stack.  Otherwise some other allocation happens in that
range and all of a sudden the stack cannot grow as much as needed anymore.

These flags really can be removed.  They should not be used because they
are outright dangerous.

- --
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)

iEYEARECAAYFAkijJm8ACgkQ2ijCOnn/RHQ7/wCfcrLJPlKmtY5AC3c+fuX9LGe8
+YwAnRqLCdSQvwOUdsAz8Hq9H3dmnqEA
=BKsz
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: pthread_create() slow for many threads; also time to revisit 64b context switch optimization?
  2008-08-13 15:21               ` Ulrich Drepper
  2008-08-13 15:40                 ` Ingo Molnar
@ 2008-08-13 16:05                 ` H. Peter Anvin
  1 sibling, 0 replies; 27+ messages in thread
From: H. Peter Anvin @ 2008-08-13 16:05 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Ingo Molnar, Arjan van de Ven, akpm, hugh, linux-mm,
	linux-kernel, briangrant, cgd, mbligh, Linus Torvalds,
	Thomas Gleixner

Ulrich Drepper wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> Ingo Molnar wrote:
>> i find it pretty unacceptable these days that we limit any aspect of 
>> pure 64-bit apps in any way to 4GB (or any other 32-bit-ish limit). 
> 
> Sure, but if we can pin-point the sub-archs for which it is the problem
> then a flag to optionally request it is even easier to handle.  You'd
> simply ignore the flag for anything but the P4 architecture.
> 
> I personally have no problem removing the whole thing because I have no
> such machine running anymore.  But there are people out there who have.
>

This could also be done entirely in glibc (thus removing the dependency 
on the kernel): set the flag if and only if you detect a P4 CPU.  You 
don't even need to enumerate all the CPUs in the system (which would be 
more painful) if you make the CPUID test wide enough that it catches all 
compatible CPUs.

	-hpa

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: pthread_create() slow for many threads; also time to revisit 64b context switch optimization?
  2008-08-13 15:10             ` Ingo Molnar
  2008-08-13 15:21               ` Ulrich Drepper
@ 2008-08-13 20:42               ` Andi Kleen
  2008-08-13 20:56                 ` Andrew Morton
  2008-08-15 12:43                 ` Ingo Molnar
  1 sibling, 2 replies; 27+ messages in thread
From: Andi Kleen @ 2008-08-13 20:42 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Ulrich Drepper, Arjan van de Ven, akpm, hugh, linux-mm,
	linux-kernel, briangrant, cgd, mbligh, Linus Torvalds,
	Thomas Gleixner, H. Peter Anvin

Ingo Molnar <mingo@elte.hu> writes:
>
> i find it pretty unacceptable these days that we limit any aspect of 
> pure 64-bit apps in any way to 4GB (or any other 32-bit-ish limit). 

It's not limited to 2GB, there's a fallback to >4GB of course. Ok
admittedly the fallback is slow, but it's there.

I would prefer to not slow down the P4s. There are **lots** of them in
field. And they ran 64bit still quite well. Also back then I
benchmarked on early K8 and it also made a difference there (but I
admit I forgot the numbers)

I think it would be better to fix the VM because there are
other use cases of applications who prefer to allocate in a lower area.
For example Java JVMs now widely use a technique called pointer
compression where they dynamically adjust the pointer size based
on how much memory the process uses. For that you have to get
low memory in the 47bit VM too. The VM should deal with that gracefully.

To be honest I always thought the linear search in the VMA list
was a little dumb. I'm sure there are other cases where it hurts
too. Perhaps this would be really an opportunity  to do something about it :)

-Andi

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: pthread_create() slow for many threads; also time to revisit 64b context switch optimization?
  2008-08-13 20:42               ` Andi Kleen
@ 2008-08-13 20:56                 ` Andrew Morton
  2008-08-13 21:46                   ` Andi Kleen
  2008-08-15 12:43                 ` Ingo Molnar
  1 sibling, 1 reply; 27+ messages in thread
From: Andrew Morton @ 2008-08-13 20:56 UTC (permalink / raw)
  To: Andi Kleen
  Cc: mingo, drepper, arjan, hugh, linux-mm, linux-kernel, briangrant,
	cgd, mbligh, torvalds, tglx, hpa

On Wed, 13 Aug 2008 22:42:48 +0200
Andi Kleen <andi@firstfloor.org> wrote:

> Ingo Molnar <mingo@elte.hu> writes:
> >
> > i find it pretty unacceptable these days that we limit any aspect of 
> > pure 64-bit apps in any way to 4GB (or any other 32-bit-ish limit). 
> 
> It's not limited to 2GB, there's a fallback to >4GB of course. Ok
> admittedly the fallback is slow, but it's there.
> 
> I would prefer to not slow down the P4s. There are **lots** of them in
> field. And they ran 64bit still quite well. Also back then I
> benchmarked on early K8 and it also made a difference there (but I
> admit I forgot the numbers)
> 
> I think it would be better to fix the VM because there are
> other use cases of applications who prefer to allocate in a lower area.
> For example Java JVMs now widely use a technique called pointer
> compression where they dynamically adjust the pointer size based
> on how much memory the process uses. For that you have to get
> low memory in the 47bit VM too. The VM should deal with that gracefully.
> 
> To be honest I always thought the linear search in the VMA list
> was a little dumb. I'm sure there are other cases where it hurts
> too. Perhaps this would be really an opportunity  to do something about it :)
> 

Yes, the free_area_cache is always going to have failure modes - I
think we've been kind of waiting for it to explode.

I do think that we need an O(log(n)) search in there.  It could still
be on the fallback path, so we retain the mostly-O(1) benefits of
free_area_cache.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: pthread_create() slow for many threads; also time to revisit 64b context switch optimization?
  2008-08-13 20:56                 ` Andrew Morton
@ 2008-08-13 21:46                   ` Andi Kleen
  0 siblings, 0 replies; 27+ messages in thread
From: Andi Kleen @ 2008-08-13 21:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andi Kleen, mingo, drepper, arjan, hugh, linux-mm, linux-kernel,
	briangrant, cgd, mbligh, torvalds, tglx, hpa

> Yes, the free_area_cache is always going to have failure modes - I
> think we've been kind of waiting for it to explode.
> 
> I do think that we need an O(log(n)) search in there.  It could still
> be on the fallback path, so we retain the mostly-O(1) benefits of
> free_area_cache.

The standard dumb way to do that would be to have two parallel trees, one to 
index free space (similar to e.g. the free space btrees in XFS) and the 
other to index the objects (like today). That would increase the constant 
factor somewhat by bloating the VMAs, increasing cache overhead etc, and
also would be more brute force than elegant.   But it would be simple
and straightforward.

Perhaps the combined data structure experience of linux-kernel can come
up with something better: a single data structure that allows both to be
looked up efficiently?

This would also be an opportunity to reevaluate rbtrees for the object
index. One drawback of them is that they are not really optimized to be 
cache friendly because their nodes are too small.

-Andi
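
One possible shape for a single structure that answers both questions is an
object index in which every node also records the largest free gap anywhere
in its subtree. The sketch below is only an illustration under several
assumptions: it uses an implicit balanced tree over a sorted array rather
than an rbtree, it covers the query side only (no insertion or rebalancing),
it does not index the space above the last mapping, and it returns 0 on
failure, so base is assumed non-zero. The names gap_below(), build() and
find_gap() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct area { unsigned long start, end; };   /* one mapping, [start, end) */

/* Free bytes immediately below areas[i]. */
unsigned long gap_below(const struct area *a, size_t i, unsigned long base)
{
    return a[i].start - (i ? a[i - 1].end : base);
}

/*
 * Implicit balanced tree over a[lo..hi): the root of a range is its middle
 * element. max_gap[mid] receives the largest gap found in that subtree.
 */
unsigned long build(const struct area *a, unsigned long *max_gap,
                    size_t lo, size_t hi, unsigned long base)
{
    if (lo >= hi)
        return 0;
    size_t mid = lo + (hi - lo) / 2;
    unsigned long m = gap_below(a, mid, base);
    unsigned long l = build(a, max_gap, lo, mid, base);
    unsigned long r = build(a, max_gap, mid + 1, hi, base);
    if (l > m) m = l;
    if (r > m) m = r;
    return max_gap[mid] = m;
}

/* Lowest free address with at least len bytes below a mapping, or 0. */
unsigned long find_gap(const struct area *a, const unsigned long *max_gap,
                       size_t n, unsigned long base, unsigned long len)
{
    size_t lo = 0, hi = n;

    assert(len > 0);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (max_gap[mid] < len)
            return 0;                       /* nothing in this subtree fits */
        if (lo < mid && max_gap[lo + (mid - lo) / 2] >= len) {
            hi = mid;                       /* a lower gap exists: go left */
            continue;
        }
        if (gap_below(a, mid, base) >= len)
            return mid ? a[mid - 1].end : base;
        lo = mid + 1;                       /* otherwise it is to the right */
    }
    return 0;
}
```

Each step of find_gap() discards half of the remaining range, so a failing
request is rejected after one comparison at the root and a successful one
costs O(log(n)) steps, at the price of one extra word per object.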

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: pthread_create() slow for many threads; also time to revisit 64b context switch optimization?
  2008-08-13 20:42               ` Andi Kleen
  2008-08-13 20:56                 ` Andrew Morton
@ 2008-08-15 12:43                 ` Ingo Molnar
  2008-08-15 13:33                   ` Andi Kleen
  1 sibling, 1 reply; 27+ messages in thread
From: Ingo Molnar @ 2008-08-15 12:43 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Ulrich Drepper, Arjan van de Ven, akpm, hugh, linux-mm,
	linux-kernel, briangrant, cgd, mbligh, Linus Torvalds,
	Thomas Gleixner, H. Peter Anvin

* Andi Kleen <andi@firstfloor.org> wrote:

> Ingo Molnar <mingo@elte.hu> writes:
> >
> > i find it pretty unacceptable these days that we limit any aspect of 
> > pure 64-bit apps in any way to 4GB (or any other 32-bit-ish limit). 
> 
> It's not limited to 2GB, there's a fallback to >4GB of course. Ok 
> admittedly the fallback is slow, but it's there.

Of course - what you are missing is that _10 milliseconds_ of thread 
creation overhead is completely unacceptable: it is as bad as if we 
didn't even support it.

> I would prefer to not slow down the P4s. There are **lots** of them in 
> field. And they ran 64bit still quite well. [...]

Nonsense, i had such a P4 based 64-bit box and it was painful. Everyone 
with half a brain used them as 32-bit machines. Nor is the 
context-switch overhead in any way significant. Plus, as Arjan mentioned, 
only the earliest P4 64-bit CPUs had this problem.

> [...] Also back then I benchmarked on early K8 and it also made a 
> difference there (but I admit I forgot the numbers)

that's a lot of handwaving with no actual numbers. The numbers in this 
discussion show that the context-switch overhead is small and that the 
overhead on perfectly good systems that hit this limit is obscurely 
high.

I'd love to zap MAP_32BIT this very minute from the kernel, but you 
originally shaped the whole thing in such a stupid way that makes its 
elimination impossible now due to ABI constraints. It would have cost 
you _nothing_ to have added MAP_64BIT_STACK back then, but the quick & 
sloppy solution was to reuse MAP_32BIT for 64-bit tasks. And you are 
stupid about it even now. Bleh.

The correct solution is to eliminate this flag from glibc right now, and 
maybe add the MAP_64BIT_STACK flag as well, as i posted it - if anyone 
with such old boxes still cares (i doubt anyone does). That flag then 
will take its usual slow route. Ulrich?

	Ingo
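
For readers following the glibc side of this, the two-step allocation under
discussion can be sketched roughly as below. This is not glibc's code:
alloc_stack() and its prefer_low parameter are hypothetical, the proposed
MAP_64BIT_STACK flag is deliberately not used because it does not exist, and
the MAP_32BIT attempt is compiled in only where the headers define that flag.
Calling it with prefer_low set to 0 corresponds to dropping the flag, so only
the single unrestricted mmap() is made.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for MAP_ANONYMOUS and MAP_32BIT */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: returns a mapping usable as a stack, or NULL. */
void *alloc_stack(size_t size, int prefer_low)
{
    int flags = MAP_PRIVATE | MAP_ANONYMOUS;
    void *p = MAP_FAILED;

    assert(size > 0);
#ifdef MAP_32BIT
    /* First attempt: ask for a low address. This is the call that
     * becomes slow once the low range is exhausted. */
    if (prefer_low)
        p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                 flags | MAP_32BIT, -1, 0);
#else
    (void)prefer_low;           /* flag not available on this platform */
#endif
    /* Fallback (or the only attempt): allocate anywhere. */
    if (p == MAP_FAILED)
        p = mmap(NULL, size, PROT_READ | PROT_WRITE, flags, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}
```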

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: pthread_create() slow for many threads; also time to revisit 64b context switch optimization?
  2008-08-15 12:43                 ` Ingo Molnar
@ 2008-08-15 13:33                   ` Andi Kleen
  0 siblings, 0 replies; 27+ messages in thread
From: Andi Kleen @ 2008-08-15 13:33 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andi Kleen, Ulrich Drepper, Arjan van de Ven, akpm, hugh,
	linux-mm, linux-kernel, briangrant, cgd, mbligh, Linus Torvalds,
	Thomas Gleixner, H. Peter Anvin

On Fri, Aug 15, 2008 at 02:43:50PM +0200, Ingo Molnar wrote:
> i had such a P4 based 64-bit box and it was painful.

I used them as 64bit machines and they weren't painful at all.

> I'd love to zap MAP_32BIT this very minute from the kernel, but you 
> originally shaped the whole thing in such a stupid way that makes its 
> elimination impossible now due to ABI constraints. It would have cost 

MAP_32BIT was not actually added for this originally. It 
was originally added for the X server's old dynamic loader, which
needed memory in the low 2GB.

Its main failing, which I freely admit, was to not call it MAP_31BIT.

> you _nothing_ to have added MAP_64BIT_STACK back then, but the quick & 

Not sure what the semantics of that would be. For me it would
seem ugly to hardcode specific semantics in the kernel for this
("mechanism not policy")

But for most semantics I can think of, the data structure would still 
need to be fixed.

> The correct solution is to eliminate this flag from glibc right now, and 

IMHO the correct solution is to fix the data structure to not have such
a bad complexity in this corner case. We typically do this for all
other data structures as we discover such cases. No reason the VMAs
should be any different. 

-Andi

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2008-08-15 19:00 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <af8810200808121736q76640cc1kb814385072fe9b29@mail.gmail.com>
2008-08-13  0:45 ` pthread_create() slow for many threads; also time to revisit 64b context switch optimization? Pardo
2008-08-13 10:44   ` Ingo Molnar
2008-08-13 13:35     ` Arjan van de Ven
2008-08-13 14:21       ` Ulrich Drepper
2008-08-13 14:25         ` Ingo Molnar
2008-08-13 14:36           ` Ulrich Drepper
2008-08-13 15:10             ` Ingo Molnar
2008-08-13 15:21               ` Ulrich Drepper
2008-08-13 15:40                 ` Ingo Molnar
2008-08-13 15:55                   ` Ulrich Drepper
2008-08-13 16:02                     ` Ingo Molnar
2008-08-15 15:54                       ` Jamie Lokier
2008-08-15 16:03                         ` Ingo Molnar
2008-08-15 17:13                         ` Ulrich Drepper
2008-08-15 17:19                           ` Ingo Molnar
2008-08-15 17:23                             ` Ulrich Drepper
2008-08-15 19:00                               ` Ingo Molnar
2008-08-13 17:09                     ` Linus Torvalds
2008-08-13 18:04                       ` Ulrich Drepper
2008-08-13 18:16                         ` Arjan van de Ven
2008-08-13 18:22                           ` Ulrich Drepper
2008-08-13 16:05                 ` H. Peter Anvin
2008-08-13 20:42               ` Andi Kleen
2008-08-13 20:56                 ` Andrew Morton
2008-08-13 21:46                   ` Andi Kleen
2008-08-15 12:43                 ` Ingo Molnar
2008-08-15 13:33                   ` Andi Kleen
