Re: missing madvise functionality

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* Re: missing madvise functionality
       [not found]     ` <20070403172841.GB23689@one.firstfloor.org>
@ 2007-04-03 19:59       ` Andrew Morton
  2007-04-03 20:09         ` Andi Kleen
  2007-04-03 20:17         ` Ulrich Drepper
  0 siblings, 2 replies; 87+ messages in thread
From: Andrew Morton @ 2007-04-03 19:59 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Ulrich Drepper, Rik van Riel, Linux Kernel, Jakub Jelinek,
	linux-mm, Hugh Dickins

On Tue, 3 Apr 2007 19:28:41 +0200
Andi Kleen <andi@firstfloor.org> wrote:

> On Tue, Apr 03, 2007 at 10:20:02AM -0700, Ulrich Drepper wrote:
> > Andi Kleen wrote:
> > > Why do you need a lock for that? I don't see any problem with
> > > two threads doing that in parallel. The kernel would
> > > serialize it internally and one would fail, but that shouldn't
> > > be a problem. 
> > 
> > There is no lock at all at userlevel.  I'm talking about locks in the
> > kernel.
> 
> mmap_sem? Your new operation wouldn't solve that neither.

It might, a bit.  Both mmap() and mprotect() currently take mmap_sem() for
writing.  If we're careful, we could probably arrange for MADV_ULRICH to
take it for reading, which will help a little bit, hopefully.

It's a little sad that mprotect() takes mmap_sem for writing, really.  I think
the only reason for doing that is because we might do a vma_merge() as a
result.  Perhaps this is on the wrong side of the speed/space tradeoff.

otoh, converting a down_write() to a down_read() may well not have much
effect.

Ulrich, could you suggest a little test app which would demonstrate this
behaviour?

> There were some proposals to fix mmap_sem (it's a big issue
> for futexes too) but they're are quite involved.

yup.

Question:

>   - if an access to a page in the range happens in the future it must
>     succeed.  The old page content can be provided or a new, empty page
>    can be provided

How important is this "use the old page if it is available" feature?  If we
were to simply implement a fast unconditional-free-the-page, so that
subsequent accesses always returned a new, zeroed page, do we expect that
this will be a 90%-good-enough thing, or will it be significantly
inefficient?

If we do implement this retain-the-old-page-if-possible feature, I'm
thinking that we can possibly reuse swapcache concepts.  Such a page is
very similar to a clean, unmapped swapcache page, only it doesn't actually
have a swap mapping (well, it might have a swap mapping, in which case we
don't need to do anything at all, except deactivate it).

So perhaps we can do something like chop swapper_space in half: the lower
50% represent offsets which have a swap mapping and the upper 50% are fake
swapcache pages which don't actually consume swapspace.  These pages are
unmapped from pagetables, marked clean, added to the fake part of
swapper_space and are deactivated.  Teach the low-level swap code to ignore
the request to free physical swapspace when these pages are released.

Or, if that's all too hacky, create a new address_space for these pages and
burn a new page flag.  But I suspect we'd end up duplicating so much
swapcache handling that this will end up looking silly.

This would all halve the maximum amount of swap which can be used.  iirc
i386 supports 27 bits of swapcache indexing, and 26 bits is 274GB, which
is hopefully enough..

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: missing madvise functionality
  2007-04-03 19:59       ` missing madvise functionality Andrew Morton
@ 2007-04-03 20:09         ` Andi Kleen
  2007-04-03 20:17         ` Ulrich Drepper
  1 sibling, 0 replies; 87+ messages in thread
From: Andi Kleen @ 2007-04-03 20:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andi Kleen, Ulrich Drepper, Rik van Riel, Linux Kernel,
	Jakub Jelinek, linux-mm, Hugh Dickins

> It might, a bit.  Both mmap() and mprotect() currently take mmap_sem() for
> writing.  If we're careful, we could probably arrange for MADV_ULRICH to
> take it for reading, which will help a little bit, hopefully.

The cache line bounces would be still there. Not sure that would help MySQL
all that much. 

Besides if the down_write is the real problem one could convert 
the code for all cases over to optimistic locking assuming most calls 
don't merge.

-Andi

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: missing madvise functionality
  2007-04-03 19:59       ` missing madvise functionality Andrew Morton
  2007-04-03 20:09         ` Andi Kleen
@ 2007-04-03 20:17         ` Ulrich Drepper
  2007-04-03 20:29           ` Jakub Jelinek
  2007-04-03 20:51           ` missing madvise functionality Andrew Morton
  1 sibling, 2 replies; 87+ messages in thread
From: Ulrich Drepper @ 2007-04-03 20:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andi Kleen, Rik van Riel, Linux Kernel, Jakub Jelinek, linux-mm,
	Hugh Dickins

[-- Attachment #1: Type: text/plain, Size: 2484 bytes --]

Andrew Morton wrote:
> Ulrich, could you suggest a little test app which would demonstrate this
> behaviour?

It's not really reliably possible to demonstrate this with a small
program using malloc.  You'd need something like this mysql test case
which Rik said is not hard to run by yourself.

If somebody adds a kernel interface I can easily produce a glibc patch
so that the test can be run in the new environment.

But it's of course easy enough to simulate the specific problem in a
micro benchmark.  If you want that let me know.

> Question:
> 
>>   - if an access to a page in the range happens in the future it must
>>     succeed.  The old page content can be provided or a new, empty page
>>    can be provided
> 
> How important is this "use the old page if it is available" feature?  If we
> were to simply implement a fast unconditional-free-the-page, so that
> subsequent accesses always returned a new, zeroed page, do we expect that
> this will be a 90%-good-enough thing, or will it be significantly
> inefficient?

My guess is that the page fault you'd get for every single page is a
huge part of the problem.  If you don't free the pages and just leave
them in the process processes which quickly reuse the memory pool will
experience no noticeable slowdown.  The only difference between not
freeing the memory and and doing it is that one madvise() syscall.

If you unconditionally free the page you we have later mprotect() call
(one mmap_sem lock saved).  But does every page fault then later
requires the semaphore?  Even if not, the additional kernel entry is a
killer.

> So perhaps we can do something like chop swapper_space in half: the lower
> 50% represent offsets which have a swap mapping and the upper 50% are fake
> swapcache pages which don't actually consume swapspace.  These pages are
> unmapped from pagetables, marked clean, added to the fake part of
> swapper_space and are deactivated.  Teach the low-level swap code to ignore
> the request to free physical swapspace when these pages are released.

Sounds good to me.

> This would all halve the maximum amount of swap which can be used.  iirc
> i386 supports 27 bits of swapcache indexing, and 26 bits is 274GB, which
> is hopefully enough..

Boo hoo, poor 32-bit machines.  People with demands of > 274G should get
a real machine instead.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 251 bytes --]

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: missing madvise functionality
  2007-04-03 20:17         ` Ulrich Drepper
@ 2007-04-03 20:29           ` Jakub Jelinek
  2007-04-03 20:38             ` Rik van Riel
                               ` (5 more replies)
  2007-04-03 20:51           ` missing madvise functionality Andrew Morton
  1 sibling, 6 replies; 87+ messages in thread
From: Jakub Jelinek @ 2007-04-03 20:29 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Andrew Morton, Andi Kleen, Rik van Riel, Linux Kernel, linux-mm,
	Hugh Dickins

On Tue, Apr 03, 2007 at 01:17:09PM -0700, Ulrich Drepper wrote:
> Andrew Morton wrote:
> > Ulrich, could you suggest a little test app which would demonstrate this
> > behaviour?
> 
> It's not really reliably possible to demonstrate this with a small
> program using malloc.  You'd need something like this mysql test case
> which Rik said is not hard to run by yourself.
> 
> If somebody adds a kernel interface I can easily produce a glibc patch
> so that the test can be run in the new environment.
> 
> But it's of course easy enough to simulate the specific problem in a
> micro benchmark.  If you want that let me know.

I think something like following testcase which simulates what free
and malloc do when trimming/growing a non-main arena.

My guess is that all the page zeroing is pretty expensive as well and
takes significant time, but I haven't profiled it.

#include <pthread.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

void *
tf (void *arg)
{
  (void) arg;
  size_t ps = sysconf (_SC_PAGE_SIZE);
  void *p = mmap (NULL, 128 * ps, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (p == MAP_FAILED)
    exit (1);
  int i;
  for (i = 0; i < 100000; i++)
    {
      /* Pretend to use the buffer.  */
      char *q, *r = (char *) p + 128 * ps;
      size_t s;
      for (q = (char *) p; q < r; q += ps)
        *q = 1;
      for (s = 0, q = (char *) p; q < r; q += ps)
        s += *q;
      /* Free it.  Replace this mmap with
         madvise (p, 128 * ps, MADV_THROWAWAY) when implemented.  */
      if (mmap (p, 128 * ps, PROT_NONE,
                MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0) != p)
        exit (2);
      /* And immediately malloc again.  This would then be deleted.  */
      if (mprotect (p, 128 * ps, PROT_READ | PROT_WRITE))
        exit (3);
    }
  return NULL;
}

int
main (void)
{
  pthread_t th[32];
  int i;
  for (i = 0; i < 32; i++)
    if (pthread_create (&th[i], NULL, tf, NULL))
      exit (4);
  for (i = 0; i < 32; i++)
    pthread_join (th[i], NULL);
  return 0;
}

	Jakub

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: missing madvise functionality
  2007-04-03 20:29           ` Jakub Jelinek
@ 2007-04-03 20:38             ` Rik van Riel
  2007-04-03 21:49             ` Andrew Morton
                               ` (4 subsequent siblings)
  5 siblings, 0 replies; 87+ messages in thread
From: Rik van Riel @ 2007-04-03 20:38 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Ulrich Drepper, Andrew Morton, Andi Kleen, Linux Kernel,
	linux-mm, Hugh Dickins

Jakub Jelinek wrote:

> My guess is that all the page zeroing is pretty expensive as well and
> takes significant time, but I haven't profiled it.

I'm pretty sure that page freeing, reallocating and zeroing
is more expensive than just letting the page sit there and
only reclaim it lazily when we need the memory.

I'll try to whip up a patch this week.

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: missing madvise functionality
  2007-04-03 20:17         ` Ulrich Drepper
  2007-04-03 20:29           ` Jakub Jelinek
@ 2007-04-03 20:51           ` Andrew Morton
  2007-04-03 20:57             ` Ulrich Drepper
                               ` (2 more replies)
  1 sibling, 3 replies; 87+ messages in thread
From: Andrew Morton @ 2007-04-03 20:51 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Andi Kleen, Rik van Riel, Linux Kernel, Jakub Jelinek, linux-mm,
	Hugh Dickins

On Tue, 03 Apr 2007 13:17:09 -0700
Ulrich Drepper <drepper@redhat.com> wrote:

> Andrew Morton wrote:
> > Ulrich, could you suggest a little test app which would demonstrate this
> > behaviour?
> 
> It's not really reliably possible to demonstrate this with a small
> program using malloc.  You'd need something like this mysql test case
> which Rik said is not hard to run by yourself.
> 
> If somebody adds a kernel interface I can easily produce a glibc patch
> so that the test can be run in the new environment.
> 
> But it's of course easy enough to simulate the specific problem in a
> micro benchmark.  If you want that let me know.
> 
> 
> > Question:
> > 
> >>   - if an access to a page in the range happens in the future it must
> >>     succeed.  The old page content can be provided or a new, empty page
> >>    can be provided
> > 
> > How important is this "use the old page if it is available" feature?  If we
> > were to simply implement a fast unconditional-free-the-page, so that
> > subsequent accesses always returned a new, zeroed page, do we expect that
> > this will be a 90%-good-enough thing, or will it be significantly
> > inefficient?
> 
> My guess is that the page fault you'd get for every single page is a
> huge part of the problem.  If you don't free the pages and just leave
> them in the process processes which quickly reuse the memory pool will
> experience no noticeable slowdown.  The only difference between not
> freeing the memory and and doing it is that one madvise() syscall.
> 
> If you unconditionally free the page you we have later mprotect() call
> (one mmap_sem lock saved).  But does every page fault then later
> requires the semaphore?  Even if not, the additional kernel entry is a
> killer.

Oh.  I was assuming that we'd want to unmap these pages from pagetables and
mark then super-easily-reclaimable.  So a later touch would incur a minor
fault.

But you think that we should leave them mapped into pagetables so no such
fault occurs.

I guess we can still do that - if we follow the "this is just like clean
swapcache" concept, things should just work.

Leaving the pages mapped into pagetables means that they are considerably
less likely to be reclaimed.

But whatever we do, with the current MM design we need to at least take the
mmap_sem for reading so we can descend the vma tree and locate the
pageframes.  And if that locking is the main problem then none of this is
likely to help.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: missing madvise functionality
  2007-04-03 20:51           ` missing madvise functionality Andrew Morton
@ 2007-04-03 20:57             ` Ulrich Drepper
  2007-04-03 21:00             ` Rik van Riel
  2007-04-04 18:49             ` Anton Blanchard
  2 siblings, 0 replies; 87+ messages in thread
From: Ulrich Drepper @ 2007-04-03 20:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andi Kleen, Rik van Riel, Linux Kernel, Jakub Jelinek, linux-mm,
	Hugh Dickins

[-- Attachment #1: Type: text/plain, Size: 535 bytes --]

Andrew Morton wrote:
> But whatever we do, with the current MM design we need to at least take the
> mmap_sem for reading so we can descend the vma tree and locate the
> pageframes.  And if that locking is the main problem then none of this is
> likely to help.

At least it's done only once for the madvise call and not twice as of
today with mmap and mprotect both needing the semaphore.  This can
reduce the contention quite a bit.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 251 bytes --]

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: missing madvise functionality
  2007-04-03 20:51           ` missing madvise functionality Andrew Morton
  2007-04-03 20:57             ` Ulrich Drepper
@ 2007-04-03 21:00             ` Rik van Riel
  2007-04-03 21:10               ` Eric Dumazet
  2007-04-03 21:16               ` Andrew Morton
  2007-04-04 18:49             ` Anton Blanchard
  2 siblings, 2 replies; 87+ messages in thread
From: Rik van Riel @ 2007-04-03 21:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ulrich Drepper, Andi Kleen, Linux Kernel, Jakub Jelinek,
	linux-mm, Hugh Dickins

Andrew Morton wrote:

> Oh.  I was assuming that we'd want to unmap these pages from pagetables and
> mark then super-easily-reclaimable.  So a later touch would incur a minor
> fault.
> 
> But you think that we should leave them mapped into pagetables so no such
> fault occurs.

> Leaving the pages mapped into pagetables means that they are considerably
> less likely to be reclaimed.

If we move the pages to a place where they are very likely to be
reclaimed quickly (end of the inactive list, or a separate
reclaim list) and clear the dirty and referenced lists, we can
both reclaim the page easily *and* avoid the page fault penalty.

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: missing madvise functionality
  2007-04-03 21:00             ` Rik van Riel
@ 2007-04-03 21:10               ` Eric Dumazet
  2007-04-03 21:12                 ` Jörn Engel
                                   ` (3 more replies)
  2007-04-03 21:16               ` Andrew Morton
  1 sibling, 4 replies; 87+ messages in thread
From: Eric Dumazet @ 2007-04-03 21:10 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrew Morton, Ulrich Drepper, Andi Kleen, Linux Kernel,
	Jakub Jelinek, linux-mm, Hugh Dickins

Rik van Riel a A(C)crit :
> Andrew Morton wrote:
> 
>> Oh.  I was assuming that we'd want to unmap these pages from 
>> pagetables and
>> mark then super-easily-reclaimable.  So a later touch would incur a minor
>> fault.
>>
>> But you think that we should leave them mapped into pagetables so no such
>> fault occurs.
> 
>> Leaving the pages mapped into pagetables means that they are considerably
>> less likely to be reclaimed.
> 
> If we move the pages to a place where they are very likely to be
> reclaimed quickly (end of the inactive list, or a separate
> reclaim list) and clear the dirty and referenced lists, we can
> both reclaim the page easily *and* avoid the page fault penalty.
> 

There is one possible speedup :

- If an user app does a madvise(MADV_DONTNEED), we can assume the pages can 
later be bring back without need to zero them. The application doesnt care.

A page fault is not that expensive. But clearing N*PAGE_SIZE bytes is, because 
it potentially evicts a large part of CPU cache.

If I recall well, mysql bench Ulrich mentioned was allocating/freeing large 
areas (100 Kbytes or so) in a loop.

mmap()/brk() must give fresh NULL pages, but maybe madvise(MADV_DONTNEED) can 
relax this requirement (if the pages were reclaimed, then a page fault could 
bring a new page with random content)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: missing madvise functionality
  2007-04-03 21:10               ` Eric Dumazet
@ 2007-04-03 21:12                 ` Jörn Engel
  2007-04-03 21:15                 ` Rik van Riel
                                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 87+ messages in thread
From: Jörn Engel @ 2007-04-03 21:12 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Rik van Riel, Andrew Morton, Ulrich Drepper, Andi Kleen,
	Linux Kernel, Jakub Jelinek, linux-mm, Hugh Dickins

On Tue, 3 April 2007 23:10:14 +0200, Eric Dumazet wrote:
> 
> mmap()/brk() must give fresh NULL pages, but maybe madvise(MADV_DONTNEED) 
> can relax this requirement (if the pages were reclaimed, then a page fault 
> could bring a new page with random content)

...provided that it doesn't leak information from the kernel?

JA?rn

-- 
All art is but imitation of nature.
-- Lucius Annaeus Seneca

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: missing madvise functionality
  2007-04-03 21:10               ` Eric Dumazet
  2007-04-03 21:12                 ` Jörn Engel
@ 2007-04-03 21:15                 ` Rik van Riel
  2007-04-03 21:30                   ` Eric Dumazet
  2007-04-03 21:22                 ` Jeremy Fitzhardinge
  2007-04-03 21:46                 ` Ulrich Drepper
  3 siblings, 1 reply; 87+ messages in thread
From: Rik van Riel @ 2007-04-03 21:15 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andrew Morton, Ulrich Drepper, Andi Kleen, Linux Kernel,
	Jakub Jelinek, linux-mm, Hugh Dickins

Eric Dumazet wrote:
> Rik van Riel a A(C)crit :
>> Andrew Morton wrote:
>>
>>> Oh.  I was assuming that we'd want to unmap these pages from 
>>> pagetables and
>>> mark then super-easily-reclaimable.  So a later touch would incur a 
>>> minor
>>> fault.
>>>
>>> But you think that we should leave them mapped into pagetables so no 
>>> such
>>> fault occurs.
>>
>>> Leaving the pages mapped into pagetables means that they are 
>>> considerably
>>> less likely to be reclaimed.
>>
>> If we move the pages to a place where they are very likely to be
>> reclaimed quickly (end of the inactive list, or a separate
>> reclaim list) and clear the dirty and referenced lists, we can
>> both reclaim the page easily *and* avoid the page fault penalty.
>>
> 
> There is one possible speedup :
> 
> - If an user app does a madvise(MADV_DONTNEED), we can assume the pages 
> can later be bring back without need to zero them. The application 
> doesnt care.

... however, the application that previously used that page might
care a lot!

> mmap()/brk() must give fresh NULL pages, but maybe 
> madvise(MADV_DONTNEED) can relax this requirement (if the pages were 
> reclaimed, then a page fault could bring a new page with random content)

If we bring in a new page, it has to be zeroed for security
reasons.

You don't want somebody else's process to get a page with
your password in it.

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: missing madvise functionality
  2007-04-03 21:00             ` Rik van Riel
  2007-04-03 21:10               ` Eric Dumazet
@ 2007-04-03 21:16               ` Andrew Morton
  1 sibling, 0 replies; 87+ messages in thread
From: Andrew Morton @ 2007-04-03 21:16 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Ulrich Drepper, Andi Kleen, Linux Kernel, Jakub Jelinek,
	linux-mm, Hugh Dickins

On Tue, 03 Apr 2007 17:00:09 -0400
Rik van Riel <riel@redhat.com> wrote:

> Andrew Morton wrote:
> 
> > Oh.  I was assuming that we'd want to unmap these pages from pagetables and
> > mark then super-easily-reclaimable.  So a later touch would incur a minor
> > fault.
> > 
> > But you think that we should leave them mapped into pagetables so no such
> > fault occurs.
> 
> > Leaving the pages mapped into pagetables means that they are considerably
> > less likely to be reclaimed.
> 
> If we move the pages to a place where they are very likely to be
> reclaimed quickly (end of the inactive list, or a separate
> reclaim list) and clear the dirty and referenced lists, we can
> both reclaim the page easily *and* avoid the page fault penalty.
> 

ah, yes, you're right.  That part should work nicely.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: missing madvise functionality
  2007-04-03 21:10               ` Eric Dumazet
  2007-04-03 21:12                 ` Jörn Engel
  2007-04-03 21:15                 ` Rik van Riel
@ 2007-04-03 21:22                 ` Jeremy Fitzhardinge
  2007-04-03 21:29                   ` Rik van Riel
  2007-04-03 21:46                 ` Ulrich Drepper
  3 siblings, 1 reply; 87+ messages in thread
From: Jeremy Fitzhardinge @ 2007-04-03 21:22 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Rik van Riel, Andrew Morton, Ulrich Drepper, Andi Kleen,
	Linux Kernel, Jakub Jelinek, linux-mm, Hugh Dickins

Eric Dumazet wrote:
> mmap()/brk() must give fresh NULL pages, but maybe
> madvise(MADV_DONTNEED) can relax this requirement (if the pages were
> reclaimed, then a page fault could bring a new page with random content) 

Only if those pages were originally from that process.  Otherwise you've
got a bit of an information leak there.

    J

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: missing madvise functionality
  2007-04-03 21:22                 ` Jeremy Fitzhardinge
@ 2007-04-03 21:29                   ` Rik van Riel
  0 siblings, 0 replies; 87+ messages in thread
From: Rik van Riel @ 2007-04-03 21:29 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Eric Dumazet, Andrew Morton, Ulrich Drepper, Andi Kleen,
	Linux Kernel, Jakub Jelinek, linux-mm, Hugh Dickins

Jeremy Fitzhardinge wrote:
> Eric Dumazet wrote:
>> mmap()/brk() must give fresh NULL pages, but maybe
>> madvise(MADV_DONTNEED) can relax this requirement (if the pages were
>> reclaimed, then a page fault could bring a new page with random content) 
> 
> Only if those pages were originally from that process.  Otherwise you've
> got a bit of an information leak there.

Or from another process by the same user, in the same security
context.  That gets a bit more complex though :)

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: missing madvise functionality
  2007-04-03 21:15                 ` Rik van Riel
@ 2007-04-03 21:30                   ` Eric Dumazet
  0 siblings, 0 replies; 87+ messages in thread
From: Eric Dumazet @ 2007-04-03 21:30 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrew Morton, Ulrich Drepper, Andi Kleen, Linux Kernel,
	Jakub Jelinek, linux-mm, Hugh Dickins

Rik van Riel a A(C)crit :
> Eric Dumazet wrote:
>> Rik van Riel a A(C)crit :
>>> Andrew Morton wrote:
>>>
>>>> Oh.  I was assuming that we'd want to unmap these pages from 
>>>> pagetables and
>>>> mark then super-easily-reclaimable.  So a later touch would incur a 
>>>> minor
>>>> fault.
>>>>
>>>> But you think that we should leave them mapped into pagetables so no 
>>>> such
>>>> fault occurs.
>>>
>>>> Leaving the pages mapped into pagetables means that they are 
>>>> considerably
>>>> less likely to be reclaimed.
>>>
>>> If we move the pages to a place where they are very likely to be
>>> reclaimed quickly (end of the inactive list, or a separate
>>> reclaim list) and clear the dirty and referenced lists, we can
>>> both reclaim the page easily *and* avoid the page fault penalty.
>>>
>>
>> There is one possible speedup :
>>
>> - If an user app does a madvise(MADV_DONTNEED), we can assume the 
>> pages can later be bring back without need to zero them. The 
>> application doesnt care.
> 
> ... however, the application that previously used that page might
> care a lot!

The application that does madvise(MADV_WHATEVER_MEANS_KENREL_CAN_DROP)
doesnt care. It it cares, it would use munmap(), or no syscall at all.

> 
>> mmap()/brk() must give fresh NULL pages, but maybe 
>> madvise(MADV_DONTNEED) can relax this requirement (if the pages were 
>> reclaimed, then a page fault could bring a new page with random content)
> 
> If we bring in a new page, it has to be zeroed for security
> reasons.
> 
> You don't want somebody else's process to get a page with
> your password in it.

Then an application that cares of passwd wont use 
madvise(MADV_WHATEVER_MEANS_I_DONT_CARE)

;)

Maybe I was not clear, but I was refering to a pool of 'discardable' pages, 
that would be feeded by applications that want to notify kernel some pages can 
be completly discarded (contains no security data of course, nor data that the 
applications dont want to forget), and might be given to a consumer without 
the need of zeroing it.

We might make this pool private to each process, but then it would benefit to 
less workloads I guess...

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: missing madvise functionality
  2007-04-03 21:10               ` Eric Dumazet
                                   ` (2 preceding siblings ...)
  2007-04-03 21:22                 ` Jeremy Fitzhardinge
@ 2007-04-03 21:46                 ` Ulrich Drepper
  2007-04-03 22:51                   ` Andi Kleen
  3 siblings, 1 reply; 87+ messages in thread
From: Ulrich Drepper @ 2007-04-03 21:46 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Rik van Riel, Andrew Morton, Andi Kleen, Linux Kernel,
	Jakub Jelinek, linux-mm, Hugh Dickins

[-- Attachment #1: Type: text/plain, Size: 419 bytes --]

Eric Dumazet wrote:
> A page fault is not that expensive. But clearing N*PAGE_SIZE bytes is,
> because it potentially evicts a large part of CPU cache.

*A* page fault is not that expensive.  The problem is that you get a
page fault for every single page.  For 200k allocated you get 50 page
faults.  It quickly adds up.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 251 bytes --]

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: missing madvise functionality
  2007-04-03 20:29           ` Jakub Jelinek
  2007-04-03 20:38             ` Rik van Riel
@ 2007-04-03 21:49             ` Andrew Morton
  2007-04-03 23:01               ` Eric Dumazet
                                 ` (2 more replies)
  2007-04-04 13:09             ` William Lee Irwin III
                               ` (3 subsequent siblings)
  5 siblings, 3 replies; 87+ messages in thread
From: Andrew Morton @ 2007-04-03 21:49 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Ulrich Drepper, Andi Kleen, Rik van Riel, Linux Kernel, linux-mm,
	Hugh Dickins

On Tue, 3 Apr 2007 16:29:37 -0400
Jakub Jelinek <jakub@redhat.com> wrote:

> On Tue, Apr 03, 2007 at 01:17:09PM -0700, Ulrich Drepper wrote:
> > Andrew Morton wrote:
> > > Ulrich, could you suggest a little test app which would demonstrate this
> > > behaviour?
> > 
> > It's not really reliably possible to demonstrate this with a small
> > program using malloc.  You'd need something like this mysql test case
> > which Rik said is not hard to run by yourself.
> > 
> > If somebody adds a kernel interface I can easily produce a glibc patch
> > so that the test can be run in the new environment.
> > 
> > But it's of course easy enough to simulate the specific problem in a
> > micro benchmark.  If you want that let me know.
> 
> I think something like following testcase which simulates what free
> and malloc do when trimming/growing a non-main arena.
> 
> My guess is that all the page zeroing is pretty expensive as well and
> takes significant time, but I haven't profiled it.
> 
> #include <pthread.h>
> #include <stdlib.h>
> #include <sys/mman.h>
> #include <unistd.h>
> 
> void *
> tf (void *arg)
> {
>   (void) arg;
>   size_t ps = sysconf (_SC_PAGE_SIZE);
>   void *p = mmap (NULL, 128 * ps, PROT_READ | PROT_WRITE,
>                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>   if (p == MAP_FAILED)
>     exit (1);
>   int i;
>   for (i = 0; i < 100000; i++)
>     {
>       /* Pretend to use the buffer.  */
>       char *q, *r = (char *) p + 128 * ps;
>       size_t s;
>       for (q = (char *) p; q < r; q += ps)
>         *q = 1;
>       for (s = 0, q = (char *) p; q < r; q += ps)
>         s += *q;
>       /* Free it.  Replace this mmap with
>          madvise (p, 128 * ps, MADV_THROWAWAY) when implemented.  */
>       if (mmap (p, 128 * ps, PROT_NONE,
>                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0) != p)
>         exit (2);
>       /* And immediately malloc again.  This would then be deleted.  */
>       if (mprotect (p, 128 * ps, PROT_READ | PROT_WRITE))
>         exit (3);
>     }
>   return NULL;
> }
> 
> int
> main (void)
> {
>   pthread_t th[32];
>   int i;
>   for (i = 0; i < 32; i++)
>     if (pthread_create (&th[i], NULL, tf, NULL))
>       exit (4);
>   for (i = 0; i < 32; i++)
>     pthread_join (th[i], NULL);
>   return 0;
> }
> 

whee.  135,000 context switches/sec on a slow 2-way.  mmap_sem, most
likely.  That is ungood.

Did anyone monitor the context switch rate with the mysql test?

Interestingly, your test app (with s/100000/1000) runs to completion in 13
seocnd on the slow 2-way.  On a fast 8-way, it took 52 seconds and
sustained 40,000 context switches/sec.  That's a bit unexpected.

Both machines show ~8% idle time, too :(

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: missing madvise functionality
  2007-04-03 21:46                 ` Ulrich Drepper
@ 2007-04-03 22:51                   ` Andi Kleen
  2007-04-03 23:07                     ` Ulrich Drepper
  0 siblings, 1 reply; 87+ messages in thread
From: Andi Kleen @ 2007-04-03 22:51 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Eric Dumazet, Rik van Riel, Andrew Morton, Andi Kleen,
	Linux Kernel, Jakub Jelinek, linux-mm, Hugh Dickins

On Tue, Apr 03, 2007 at 02:46:09PM -0700, Ulrich Drepper wrote:
> Eric Dumazet wrote:
> > A page fault is not that expensive. But clearing N*PAGE_SIZE bytes is,
> > because it potentially evicts a large part of CPU cache.
> 
> *A* page fault is not that expensive.  The problem is that you get a
> page fault for every single page.  For 200k allocated you get 50 page
> faults.  It quickly adds up.

If you know in advance you need them it might be possible to 
batch that. e.g. MADV_WILLNEED could be extended to
work on anonymous memory and establish the mappings in the syscall. 
Would that be useful? 

-Andi

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: missing madvise functionality
  2007-04-03 21:49             ` Andrew Morton
@ 2007-04-03 23:01               ` Eric Dumazet
  2007-04-04  2:22                 ` Nick Piggin
  2007-04-03 23:02               ` Andrew Morton
  2007-04-03 23:44               ` Andrew Morton
  2 siblings, 1 reply; 87+ messages in thread
From: Eric Dumazet @ 2007-04-03 23:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jakub Jelinek, Ulrich Drepper, Andi Kleen, Rik van Riel,
	Linux Kernel, linux-mm, Hugh Dickins

Andrew Morton a ecrit :
> On Tue, 3 Apr 2007 16:29:37 -0400
> Jakub Jelinek <jakub@redhat.com> wrote:
> 
>> On Tue, Apr 03, 2007 at 01:17:09PM -0700, Ulrich Drepper wrote:
>>> Andrew Morton wrote:
>>>> Ulrich, could you suggest a little test app which would demonstrate this
>>>> behaviour?
>>> It's not really reliably possible to demonstrate this with a small
>>> program using malloc.  You'd need something like this mysql test case
>>> which Rik said is not hard to run by yourself.
>>>
>>> If somebody adds a kernel interface I can easily produce a glibc patch
>>> so that the test can be run in the new environment.
>>>
>>> But it's of course easy enough to simulate the specific problem in a
>>> micro benchmark.  If you want that let me know.
>> I think something like following testcase which simulates what free
>> and malloc do when trimming/growing a non-main arena.
>>
>> My guess is that all the page zeroing is pretty expensive as well and
>> takes significant time, but I haven't profiled it.
>>
>> #include <pthread.h>
>> #include <stdlib.h>
>> #include <sys/mman.h>
>> #include <unistd.h>
>>
>> void *
>> tf (void *arg)
>> {
>>   (void) arg;
>>   size_t ps = sysconf (_SC_PAGE_SIZE);
>>   void *p = mmap (NULL, 128 * ps, PROT_READ | PROT_WRITE,
>>                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>>   if (p == MAP_FAILED)
>>     exit (1);
>>   int i;
>>   for (i = 0; i < 100000; i++)
>>     {
>>       /* Pretend to use the buffer.  */
>>       char *q, *r = (char *) p + 128 * ps;
>>       size_t s;
>>       for (q = (char *) p; q < r; q += ps)
>>         *q = 1;
>>       for (s = 0, q = (char *) p; q < r; q += ps)
>>         s += *q;
>>       /* Free it.  Replace this mmap with
>>          madvise (p, 128 * ps, MADV_THROWAWAY) when implemented.  */
>>       if (mmap (p, 128 * ps, PROT_NONE,
>>                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0) != p)
>>         exit (2);
>>       /* And immediately malloc again.  This would then be deleted.  */
>>       if (mprotect (p, 128 * ps, PROT_READ | PROT_WRITE))
>>         exit (3);
>>     }
>>   return NULL;
>> }
>>
>> int
>> main (void)
>> {
>>   pthread_t th[32];
>>   int i;
>>   for (i = 0; i < 32; i++)
>>     if (pthread_create (&th[i], NULL, tf, NULL))
>>       exit (4);
>>   for (i = 0; i < 32; i++)
>>     pthread_join (th[i], NULL);
>>   return 0;
>> }
>>
> 
> whee.  135,000 context switches/sec on a slow 2-way.  mmap_sem, most
> likely.  That is ungood.
> 
> Did anyone monitor the context switch rate with the mysql test?
> 
> Interestingly, your test app (with s/100000/1000) runs to completion in 13
> seocnd on the slow 2-way.  On a fast 8-way, it took 52 seconds and
> sustained 40,000 context switches/sec.  That's a bit unexpected.
> 
> Both machines show ~8% idle time, too :(

Yes... then add to this some futex work, and you get the picture.

I do think such workloads might benefit from a vma_cache not shared by all 
threads but private to each thread. A sequence could invalidate the cache(s).

ie instead of a mm->mmap_cache, having a mm->sequence, and each thread having 
a current->mmap_cache and current->mm_sequence

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: missing madvise functionality
  2007-04-03 21:49             ` Andrew Morton
  2007-04-03 23:01               ` Eric Dumazet
@ 2007-04-03 23:02               ` Andrew Morton
  2007-04-04  9:15                 ` Hugh Dickins
  2007-04-03 23:44               ` Andrew Morton
  2 siblings, 1 reply; 87+ messages in thread
From: Andrew Morton @ 2007-04-03 23:02 UTC (permalink / raw)
  To: Jakub Jelinek, Ulrich Drepper, Andi Kleen, Rik van Riel,
	Linux Kernel, linux-mm, Hugh Dickins

On Tue, 3 Apr 2007 14:49:48 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:

> > int
> > main (void)
> > {
> >   pthread_t th[32];
> >   int i;
> >   for (i = 0; i < 32; i++)
> >     if (pthread_create (&th[i], NULL, tf, NULL))
> >       exit (4);
> >   for (i = 0; i < 32; i++)
> >     pthread_join (th[i], NULL);
> >   return 0;
> > }
> > 
> 
> whee.  135,000 context switches/sec on a slow 2-way.  mmap_sem, most
> likely.  That is ungood.
> 
> Did anyone monitor the context switch rate with the mysql test?
> 
> Interestingly, your test app (with s/100000/1000) runs to completion in 13
> seocnd on the slow 2-way.  On a fast 8-way, it took 52 seconds and
> sustained 40,000 context switches/sec.  That's a bit unexpected.
> 
> Both machines show ~8% idle time, too :(

All of which indicates that if we can remove the down_write(mmap_sem) from
this glibc operation, things should get a lot better - there will be no
additional context switches at all.

And we can surely do that if all we're doing is looking up pageframes,
putting pages into fake-swapcache and moving them around on the page LRUs.

Hugh?  Sanity check?

That difference between the 2-way and the 8-way sure is odd.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: missing madvise functionality
  2007-04-03 22:51                   ` Andi Kleen
@ 2007-04-03 23:07                     ` Ulrich Drepper
  0 siblings, 0 replies; 87+ messages in thread
From: Ulrich Drepper @ 2007-04-03 23:07 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Rik van Riel, Andrew Morton, Linux Kernel, Jakub Jelinek,
	linux-mm, Hugh Dickins

[-- Attachment #1: Type: text/plain, Size: 1157 bytes --]

Andi Kleen wrote:
> If you know in advance you need them it might be possible to 
> batch that. e.g. MADV_WILLNEED could be extended to
> work on anonymous memory and establish the mappings in the syscall. 
> Would that be useful? 

Not in the exact way you think.  The problem is that not all pages would
be needed right away.  An allocator requests address space from the
kernel in larger chunks and then uses it piece by piece.  The so-far
unused memory remains untouched and therefore not mapped.  It would be
wasteful to allocate all pages.  It would mean the allocator has to
request smaller blocks from the kernel which in turn means more system
calls.

The behavior is also not good for the malloc()'ed blocks.  A large block
might also not be used fully or at least not necessary.

But I definitely could see cases where I would want that functionality.
 For instance, for memory regions which contain only administrative data
and where every page is used right away.  Substituting N page faults for
one madvise call probably is a win.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 251 bytes --]

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: missing madvise functionality
  2007-04-03 21:49             ` Andrew Morton
  2007-04-03 23:01               ` Eric Dumazet
  2007-04-03 23:02               ` Andrew Morton
@ 2007-04-03 23:44               ` Andrew Morton
  2 siblings, 0 replies; 87+ messages in thread
From: Andrew Morton @ 2007-04-03 23:44 UTC (permalink / raw)
  To: Jakub Jelinek, Ulrich Drepper, Andi Kleen, Rik van Riel,
	Linux Kernel, linux-mm, Hugh Dickins

On Tue, 3 Apr 2007 14:49:48 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:

> > int
> > main (void)
> > {
> >   pthread_t th[32];
> >   int i;
> >   for (i = 0; i < 32; i++)
> >     if (pthread_create (&th[i], NULL, tf, NULL))
> >       exit (4);
> >   for (i = 0; i < 32; i++)
> >     pthread_join (th[i], NULL);
> >   return 0;
> > }
> > 
> 
> whee.  135,000 context switches/sec on a slow 2-way.  mmap_sem, most
> likely.  That is ungood.
> 
> Did anyone monitor the context switch rate with the mysql test?
> 
> Interestingly, your test app (with s/100000/1000) runs to completion in 13
> seocnd on the slow 2-way.  On a fast 8-way, it took 52 seconds and
> sustained 40,000 context switches/sec.  That's a bit unexpected.
> 
> Both machines show ~8% idle time, too :(

Rohit solved this puzzle.

The 2-way is a single package, hyperthreaded.

The 8-way is two-package, four cores in each.

So on the 8-way, that lock is getting transferred between the two packages
like crazy.  Running the benchmark on just cpus 0 and 1 (taskset -c 0,1)
took the runtime down to eight seconds (from 52!) and the context switch
rate went up to 200,000/sec (from 45,000).


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: missing madvise functionality
  2007-04-03 23:01               ` Eric Dumazet
@ 2007-04-04  2:22                 ` Nick Piggin
  2007-04-04  5:41                   ` Eric Dumazet
  2007-04-04  8:25                   ` missing madvise functionality Peter Zijlstra
  0 siblings, 2 replies; 87+ messages in thread
From: Nick Piggin @ 2007-04-04  2:22 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andrew Morton, Jakub Jelinek, Ulrich Drepper, Andi Kleen,
	Rik van Riel, Linux Kernel, linux-mm, Hugh Dickins

Eric Dumazet wrote:
> Andrew Morton a ecrit :
> 
>> On Tue, 3 Apr 2007 16:29:37 -0400
>> Jakub Jelinek <jakub@redhat.com> wrote:
>>
>>> On Tue, Apr 03, 2007 at 01:17:09PM -0700, Ulrich Drepper wrote:
>>>
>>>> Andrew Morton wrote:
>>>>
>>>>> Ulrich, could you suggest a little test app which would demonstrate 
>>>>> this
>>>>> behaviour?
>>>>
>>>> It's not really reliably possible to demonstrate this with a small
>>>> program using malloc.  You'd need something like this mysql test case
>>>> which Rik said is not hard to run by yourself.
>>>>
>>>> If somebody adds a kernel interface I can easily produce a glibc patch
>>>> so that the test can be run in the new environment.
>>>>
>>>> But it's of course easy enough to simulate the specific problem in a
>>>> micro benchmark.  If you want that let me know.
>>>
>>> I think something like following testcase which simulates what free
>>> and malloc do when trimming/growing a non-main arena.
>>>
>>> My guess is that all the page zeroing is pretty expensive as well and
>>> takes significant time, but I haven't profiled it.
>>>
>>> #include <pthread.h>
>>> #include <stdlib.h>
>>> #include <sys/mman.h>
>>> #include <unistd.h>
>>>
>>> void *
>>> tf (void *arg)
>>> {
>>>   (void) arg;
>>>   size_t ps = sysconf (_SC_PAGE_SIZE);
>>>   void *p = mmap (NULL, 128 * ps, PROT_READ | PROT_WRITE,
>>>                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>>>   if (p == MAP_FAILED)
>>>     exit (1);
>>>   int i;
>>>   for (i = 0; i < 100000; i++)
>>>     {
>>>       /* Pretend to use the buffer.  */
>>>       char *q, *r = (char *) p + 128 * ps;
>>>       size_t s;
>>>       for (q = (char *) p; q < r; q += ps)
>>>         *q = 1;
>>>       for (s = 0, q = (char *) p; q < r; q += ps)
>>>         s += *q;
>>>       /* Free it.  Replace this mmap with
>>>          madvise (p, 128 * ps, MADV_THROWAWAY) when implemented.  */
>>>       if (mmap (p, 128 * ps, PROT_NONE,
>>>                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0) != p)
>>>         exit (2);
>>>       /* And immediately malloc again.  This would then be deleted.  */
>>>       if (mprotect (p, 128 * ps, PROT_READ | PROT_WRITE))
>>>         exit (3);
>>>     }
>>>   return NULL;
>>> }
>>>
>>> int
>>> main (void)
>>> {
>>>   pthread_t th[32];
>>>   int i;
>>>   for (i = 0; i < 32; i++)
>>>     if (pthread_create (&th[i], NULL, tf, NULL))
>>>       exit (4);
>>>   for (i = 0; i < 32; i++)
>>>     pthread_join (th[i], NULL);
>>>   return 0;
>>> }
>>>
>>
>> whee.  135,000 context switches/sec on a slow 2-way.  mmap_sem, most
>> likely.  That is ungood.
>>
>> Did anyone monitor the context switch rate with the mysql test?
>>
>> Interestingly, your test app (with s/100000/1000) runs to completion 
>> in 13
>> seocnd on the slow 2-way.  On a fast 8-way, it took 52 seconds and
>> sustained 40,000 context switches/sec.  That's a bit unexpected.
>>
>> Both machines show ~8% idle time, too :(
> 
> 
> Yes... then add to this some futex work, and you get the picture.
> 
> I do think such workloads might benefit from a vma_cache not shared by 
> all threads but private to each thread. A sequence could invalidate the 
> cache(s).
> 
> ie instead of a mm->mmap_cache, having a mm->sequence, and each thread 
> having a current->mmap_cache and current->mm_sequence

I have a patchset to do exactly this, btw.

Anyway what is the status of the private futex work. I don't think that
is very intrusive or complicated, so it should get merged ASAP (so then
at least we have the interface there).

-- 
SUSE Labs, Novell Inc.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: missing madvise functionality
  2007-04-04  2:22                 ` Nick Piggin
@ 2007-04-04  5:41                   ` Eric Dumazet
  2007-04-04  6:09                     ` [patches] threaded vma patches (was Re: missing madvise functionality) Nick Piggin
  2007-04-04  8:25                   ` missing madvise functionality Peter Zijlstra
  1 sibling, 1 reply; 87+ messages in thread
From: Eric Dumazet @ 2007-04-04  5:41 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Jakub Jelinek, Ulrich Drepper, Andi Kleen,
	Rik van Riel, Linux Kernel, linux-mm, Hugh Dickins

Nick Piggin a ecrit :
> Eric Dumazet wrote:
>>
>> I do think such workloads might benefit from a vma_cache not shared by 
>> all threads but private to each thread. A sequence could invalidate 
>> the cache(s).
>>
>> ie instead of a mm->mmap_cache, having a mm->sequence, and each thread 
>> having a current->mmap_cache and current->mm_sequence
> 
> I have a patchset to do exactly this, btw.

Could you repost it please ?

I guess a seqlock could avoid some cache line bouncing on mmap_sem for some 
kind of operations. I wonder if it could speed up do_page_fault() ???

> 
> Anyway what is the status of the private futex work. I don't think that
> is very intrusive or complicated, so it should get merged ASAP (so then
> at least we have the interface there).
> 

It seems nobody but you and me cared.

BTW I am surprised of Ulrich bugging linux on MADV_KERNEL_CAN_DROP, while 
glibc still does :

FILE *F = fopen("/etc/passwd", "r");
fget(line, sizeof(line), F);
fclose(F);

->

open("/etc/passwd", O_RDONLY)           = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=1505, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 
0x2b67097f0000
read(3, "root:x:0:0:root:/root:/bin/bash\n"..., 4096) = 1505
close(3)                                = 0
munmap(0x2b67097f0000, 4096)            = 0


using mmap()/munmap() to allocate one 4096 bytes area is certainly overkill. 
mmap_sem is apparently the thing we must hit forever.

Maybe nobody but me still uses fopen()/fclose() after all ?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* [patches] threaded vma patches (was Re: missing madvise functionality)
  2007-04-04  5:41                   ` Eric Dumazet
@ 2007-04-04  6:09                     ` Nick Piggin
  2007-04-04  6:26                       ` Andrew Morton
  2007-04-04  6:42                       ` Ulrich Drepper
  0 siblings, 2 replies; 87+ messages in thread
From: Nick Piggin @ 2007-04-04  6:09 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andrew Morton, Ulrich Drepper, Andi Kleen, Rik van Riel,
	Linux Kernel, linux-mm, Hugh Dickins

Eric Dumazet wrote:
> Nick Piggin a ecrit :
> 
>> Eric Dumazet wrote:
>>
>>>
>>> I do think such workloads might benefit from a vma_cache not shared 
>>> by all threads but private to each thread. A sequence could 
>>> invalidate the cache(s).
>>>
>>> ie instead of a mm->mmap_cache, having a mm->sequence, and each 
>>> thread having a current->mmap_cache and current->mm_sequence
>>
>>
>> I have a patchset to do exactly this, btw.
> 
> 
> Could you repost it please ?

Sure. I'll send you them privately because they're against an older
kernel.

>> Anyway what is the status of the private futex work. I don't think that
>> is very intrusive or complicated, so it should get merged ASAP (so then
>> at least we have the interface there).
>>
> 
> It seems nobody but you and me cared.

Sad. Although Ulrich did seem interested at one point I think? Ulrich,
do you agree at least with the interface that Eric is proposing? If
yes, then Andrew, do you have any objections to putting Eric's fairly
important patch at least into -mm?

-- 
SUSE Labs, Novell Inc.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patches] threaded vma patches (was Re: missing madvise functionality)
  2007-04-04  6:09                     ` [patches] threaded vma patches (was Re: missing madvise functionality) Nick Piggin
@ 2007-04-04  6:26                       ` Andrew Morton
  2007-04-04  6:38                         ` Nick Piggin
  2007-04-04  6:42                       ` Ulrich Drepper
  1 sibling, 1 reply; 87+ messages in thread
From: Andrew Morton @ 2007-04-04  6:26 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Eric Dumazet, Ulrich Drepper, Andi Kleen, Rik van Riel,
	Linux Kernel, linux-mm, Hugh Dickins

On Wed, 04 Apr 2007 16:09:40 +1000 Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> Andrew, do you have any objections to putting Eric's fairly
> important patch at least into -mm?

you know what to do ;)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patches] threaded vma patches (was Re: missing madvise functionality)
  2007-04-04  6:26                       ` Andrew Morton
@ 2007-04-04  6:38                         ` Nick Piggin
  0 siblings, 0 replies; 87+ messages in thread
From: Nick Piggin @ 2007-04-04  6:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Eric Dumazet, Ulrich Drepper, Andi Kleen, Rik van Riel,
	Linux Kernel, linux-mm, Hugh Dickins

(sorry to change the subjet, I was initially going to send the
threaded vma cache patches on list, but then decided they didn't
have enough changelog!)

Andrew Morton wrote:
> On Wed, 04 Apr 2007 16:09:40 +1000 Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> 
> 
>>Andrew, do you have any objections to putting Eric's fairly
>>important patch at least into -mm?
> 
> 
> you know what to do ;)
> 

Well I did review them when he last posted, but simply didn't have
much to say (that happened in a much older discussion about the
private futex problem, and I ended up agreeing with this approach).
Anyway I'll have another look when they get posted again.

-- 
SUSE Labs, Novell Inc.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patches] threaded vma patches (was Re: missing madvise functionality)
  2007-04-04  6:09                     ` [patches] threaded vma patches (was Re: missing madvise functionality) Nick Piggin
  2007-04-04  6:26                       ` Andrew Morton
@ 2007-04-04  6:42                       ` Ulrich Drepper
  2007-04-04  6:44                         ` Nick Piggin
  2007-04-04  6:50                         ` Eric Dumazet
  1 sibling, 2 replies; 87+ messages in thread
From: Ulrich Drepper @ 2007-04-04  6:42 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Eric Dumazet, Andrew Morton, Andi Kleen, Rik van Riel,
	Linux Kernel, linux-mm, Hugh Dickins

[-- Attachment #1: Type: text/plain, Size: 298 bytes --]

Nick Piggin wrote:
> Sad. Although Ulrich did seem interested at one point I think? Ulrich,
> do you agree at least with the interface that Eric is proposing?

I have no idea what you're talking about.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 251 bytes --]

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patches] threaded vma patches (was Re: missing madvise functionality)
  2007-04-04  6:42                       ` Ulrich Drepper
@ 2007-04-04  6:44                         ` Nick Piggin
  2007-04-04  6:50                         ` Eric Dumazet
  1 sibling, 0 replies; 87+ messages in thread
From: Nick Piggin @ 2007-04-04  6:44 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Eric Dumazet, Andrew Morton, Andi Kleen, Rik van Riel,
	Linux Kernel, linux-mm, Hugh Dickins

Ulrich Drepper wrote:
> Nick Piggin wrote:
> 
>>Sad. Although Ulrich did seem interested at one point I think? Ulrich,
>>do you agree at least with the interface that Eric is proposing?
> 
> 
> I have no idea what you're talking about.
> 

Private futexes.

-- 
SUSE Labs, Novell Inc.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patches] threaded vma patches (was Re: missing madvise functionality)
  2007-04-04  6:42                       ` Ulrich Drepper
  2007-04-04  6:44                         ` Nick Piggin
@ 2007-04-04  6:50                         ` Eric Dumazet
  2007-04-04  6:54                           ` Ulrich Drepper
  1 sibling, 1 reply; 87+ messages in thread
From: Eric Dumazet @ 2007-04-04  6:50 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Nick Piggin, Andrew Morton, Andi Kleen, Rik van Riel,
	Linux Kernel, linux-mm, Hugh Dickins

Ulrich Drepper a A(C)crit :
> Nick Piggin wrote:
>> Sad. Although Ulrich did seem interested at one point I think? Ulrich,
>> do you agree at least with the interface that Eric is proposing?
> 
> I have no idea what you're talking about.
> 

You were CC on this one, you can find an archive here :

http://lkml.org/lkml/2007/3/15/230

This avoids mmap_sem for private futexes (PTHREAD_PROCESS_PRIVATE  semantic)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patches] threaded vma patches (was Re: missing madvise functionality)
  2007-04-04  6:50                         ` Eric Dumazet
@ 2007-04-04  6:54                           ` Ulrich Drepper
  2007-04-04  7:33                             ` Eric Dumazet
  0 siblings, 1 reply; 87+ messages in thread
From: Ulrich Drepper @ 2007-04-04  6:54 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Nick Piggin, Linux Kernel, linux-mm

[-- Attachment #1: Type: text/plain, Size: 329 bytes --]

Eric Dumazet wrote:
> You were CC on this one, you can find an archive here :

You cc:ed my gmail account.  I don't pick out mails sent to me there.
If you want me to look at something you have to send it to my
@redhat.com address.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 251 bytes --]

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patches] threaded vma patches (was Re: missing madvise functionality)
  2007-04-04  6:54                           ` Ulrich Drepper
@ 2007-04-04  7:33                             ` Eric Dumazet
  0 siblings, 0 replies; 87+ messages in thread
From: Eric Dumazet @ 2007-04-04  7:33 UTC (permalink / raw)
  To: Ulrich Drepper; +Cc: Nick Piggin, Linux Kernel, linux-mm

On Tue, 03 Apr 2007 23:54:42 -0700
Ulrich Drepper <drepper@redhat.com> wrote:

> Eric Dumazet wrote:
> > You were CC on this one, you can find an archive here :
> 
> You cc:ed my gmail account.  I don't pick out mails sent to me there.
> If you want me to look at something you have to send it to my
> @redhat.com address.

What I meant is : You got the mails and even replied to one of them :)

http://lkml.org/lkml/2007/3/15/303

I will try to remember your email address, thanks.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: missing madvise functionality
       [not found] <46128051.9000609@redhat.com>
       [not found] ` <p73648dz5oa.fsf@bingen.suse.de>
@ 2007-04-04  7:46 ` Nick Piggin
  2007-04-04  8:04   ` Nick Piggin
                     ` (2 more replies)
  1 sibling, 3 replies; 87+ messages in thread
From: Nick Piggin @ 2007-04-04  7:46 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Rik van Riel, Andrew Morton, Linux Kernel, Jakub Jelinek,
	Linux Memory Management

[-- Attachment #1: Type: text/plain, Size: 2527 bytes --]

Ulrich Drepper wrote:
> People might remember the thread about mysql not scaling and pointing
> the finger quite happily at glibc.  Well, the situation is not like that.
> 
> The problem is glibc has to work around kernel limitations.  If the
> malloc implementation detects that a large chunk of previously allocated
> memory is now free and unused it wants to return the memory to the
> system.  What we currently have to do is this:
> 
>   to free:      mmap(PROT_NONE) over the area
>   to reuse:     mprotect(PROT_READ|PROT_WRITE)
> 
> Yep, that's expensive, both operations need to get locks preventing
> other threads from doing the same.
> 
> Some people were quick to suggest that we simply avoid the freeing in
> many situations (that's what the patch submitted by Yanmin Zhang
> basically does).  That's no solution.  One of the very good properties
> of the current allocator is that it does not use much memory.

Does mmap(PROT_NONE) actually free the memory?


> A solution for this problem is a madvise() operation with the following
> property:
> 
>   - the content of the address range can be discarded
> 
>   - if an access to a page in the range happens in the future it must
>     succeed.  The old page content can be provided or a new, empty page
>     can be provided
> 
> That's it.  The current MADV_DONTNEED doesn't cut it because it zaps the
> pages, causing *all* future reuses to create page faults.  This is what
> I guess happens in the mysql test case where the pages where unused and
> freed but then almost immediately reused.  The page faults erased all
> the benefits of using one mprotect() call vs a pair of mmap()/mprotect()
> calls.

Two questions.

In the case of pages being unused then almost immediately reused, why is
it a bad solution to avoid freeing? Is it that you want to avoid
heuristics because in some cases they could fail and end up using memory?

Secondly, why is MADV_DONTNEED bad? How much more expensive is a pagefault
than a syscall? (including the cost of the TLB fill for the memory access
after the syscall, of course).

zapping the pages puts them on a nice LIFO cache hot list of pages that
can be quickly used when the next fault comes in, or used for any other
allocation in the kernel. Putting them on some sort of reclaim list seems
a bit pointless.

Oh, also: something like this patch would help out MADV_DONTNEED, as it
means it can run concurrently with page faults. I think the locking will
work (but needs forward porting).

-- 
SUSE Labs, Novell Inc.

[-- Attachment #2: madv-mmap_sem.patch --]
[-- Type: text/plain, Size: 1305 bytes --]

Index: linux-2.6/mm/madvise.c
===================================================================
--- linux-2.6.orig/mm/madvise.c
+++ linux-2.6/mm/madvise.c
@@ -12,6 +12,25 @@
 #include <linux/hugetlb.h>
 
 /*
+ * Any behaviour which results in changes to the vma->vm_flags needs to
+ * take mmap_sem for writing. Others, which simply traverse vmas, need
+ * to only take it for reading.
+ */
+static int madvise_need_mmap_write(int behavior)
+{
+	switch (behavior) {
+	case MADV_DOFORK:
+	case MADV_DONTFORK:
+	case MADV_NORMAL:
+	case MADV_SEQUENTIAL:
+	case MADV_RANDOM:
+		return 1;
+	default:
+		return 0;
+	}
+}
+
+/*
  * We can potentially split a vm area into separate
  * areas, each area with its own behavior.
  */
@@ -264,7 +283,10 @@ asmlinkage long sys_madvise(unsigned lon
 	int error = -EINVAL;
 	size_t len;
 
-	down_write(&current->mm->mmap_sem);
+	if (madvise_need_mmap_write(behavior))
+		down_write(&current->mm->mmap_sem);
+	else
+		down_read(&current->mm->mmap_sem);
 
 	if (start & ~PAGE_MASK)
 		goto out;
@@ -323,6 +345,10 @@ asmlinkage long sys_madvise(unsigned lon
 		vma = prev->vm_next;
 	}
 out:
-	up_write(&current->mm->mmap_sem);
+	if (madvise_need_mmap_write(behavior))
+		up_write(&current->mm->mmap_sem);
+	else
+		up_read(&current->mm->mmap_sem);
+
 	return error;
 }

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: missing madvise functionality
  2007-04-04  7:46 ` Nick Piggin
@ 2007-04-04  8:04   ` Nick Piggin
  2007-04-04  8:20   ` Jakub Jelinek
  2007-04-05 18:38   ` Rik van Riel
  2 siblings, 0 replies; 87+ messages in thread
From: Nick Piggin @ 2007-04-04  8:04 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Ulrich Drepper, Rik van Riel, Andrew Morton, Linux Kernel,
	Jakub Jelinek, Linux Memory Management

Nick Piggin wrote:
> Ulrich Drepper wrote:
> 
>> People might remember the thread about mysql not scaling and pointing
>> the finger quite happily at glibc.  Well, the situation is not like that.
>>
>> The problem is glibc has to work around kernel limitations.  If the
>> malloc implementation detects that a large chunk of previously allocated
>> memory is now free and unused it wants to return the memory to the
>> system.  What we currently have to do is this:
>>
>>   to free:      mmap(PROT_NONE) over the area
>>   to reuse:     mprotect(PROT_READ|PROT_WRITE)
>>
>> Yep, that's expensive, both operations need to get locks preventing
>> other threads from doing the same.
>>
>> Some people were quick to suggest that we simply avoid the freeing in
>> many situations (that's what the patch submitted by Yanmin Zhang
>> basically does).  That's no solution.  One of the very good properties
>> of the current allocator is that it does not use much memory.
> 
> 
> Does mmap(PROT_NONE) actually free the memory?
> 
> 
>> A solution for this problem is a madvise() operation with the following
>> property:
>>
>>   - the content of the address range can be discarded
>>
>>   - if an access to a page in the range happens in the future it must
>>     succeed.  The old page content can be provided or a new, empty page
>>     can be provided
>>
>> That's it.  The current MADV_DONTNEED doesn't cut it because it zaps the
>> pages, causing *all* future reuses to create page faults.  This is what
>> I guess happens in the mysql test case where the pages where unused and
>> freed but then almost immediately reused.  The page faults erased all
>> the benefits of using one mprotect() call vs a pair of mmap()/mprotect()
>> calls.
> 
> 
> Two questions.
> 
> In the case of pages being unused then almost immediately reused, why is
> it a bad solution to avoid freeing? Is it that you want to avoid
> heuristics because in some cases they could fail and end up using memory?
> 
> Secondly, why is MADV_DONTNEED bad? How much more expensive is a pagefault
> than a syscall? (including the cost of the TLB fill for the memory access
> after the syscall, of course).
> 
> zapping the pages puts them on a nice LIFO cache hot list of pages that
> can be quickly used when the next fault comes in, or used for any other
> allocation in the kernel. Putting them on some sort of reclaim list seems
> a bit pointless.
> 
> Oh, also: something like this patch would help out MADV_DONTNEED, as it
> means it can run concurrently with page faults. I think the locking will
> work (but needs forward porting).

BTW. and this way it becomes much more attractive than using mmap/mprotect
can ever be, because they must take mmap_sem for writing always.

You don't actually need to protect the ranges unless running with use after
free debugging turned on, do you?

-- 
SUSE Labs, Novell Inc.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: missing madvise functionality
  2007-04-04  7:46 ` Nick Piggin
  2007-04-04  8:04   ` Nick Piggin
@ 2007-04-04  8:20   ` Jakub Jelinek
  2007-04-04  8:47     ` Nick Piggin
  2007-04-05 18:38   ` Rik van Riel
  2 siblings, 1 reply; 87+ messages in thread
From: Jakub Jelinek @ 2007-04-04  8:20 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Ulrich Drepper, Rik van Riel, Andrew Morton, Linux Kernel,
	Linux Memory Management

On Wed, Apr 04, 2007 at 05:46:12PM +1000, Nick Piggin wrote:
> Does mmap(PROT_NONE) actually free the memory?

Yes.
        /* Clear old maps */
        error = -ENOMEM;
munmap_back:
        vma = find_vma_prepare(mm, addr, &prev, &rb_link, &rb_parent);
        if (vma && vma->vm_start < addr + len) {
                if (do_munmap(mm, addr, len))
                        return -ENOMEM;
                goto munmap_back;
        }

> In the case of pages being unused then almost immediately reused, why is
> it a bad solution to avoid freeing? Is it that you want to avoid
> heuristics because in some cases they could fail and end up using memory?

free(3) doesn't know if the memory will be reused soon, late or never.
So avoiding trimming could substantially increase memory consumption with
certain malloc/free patterns, especially in threaded programs that use
multiple arenas.  Implementing some sort of deferred memory trimming
in malloc is "solving" the problem in a wrong place, each app really has no
idea (and should not have) what the current system memory pressure is.

> Secondly, why is MADV_DONTNEED bad? How much more expensive is a pagefault
> than a syscall? (including the cost of the TLB fill for the memory access
> after the syscall, of course).

That's page fault per page rather than a syscall for the whole chunk,
furthermore zeroing is expensive.

We really want something like FreeBSD MADV_FREE in Linux, see e.g.
http://mail.nl.linux.org/linux-mm/2000-03/msg00059.html
for some details.  Apparently FreeBSD malloc is using MADV_FREE for years
(according to their CVS for 10 years already).

	Jakub

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: missing madvise functionality
  2007-04-04  2:22                 ` Nick Piggin
  2007-04-04  5:41                   ` Eric Dumazet
@ 2007-04-04  8:25                   ` Peter Zijlstra
  2007-04-04  8:55                     ` Nick Piggin
  1 sibling, 1 reply; 87+ messages in thread
From: Peter Zijlstra @ 2007-04-04  8:25 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Eric Dumazet, Andrew Morton, Jakub Jelinek, Ulrich Drepper,
	Andi Kleen, Rik van Riel, Linux Kernel, linux-mm, Hugh Dickins

On Wed, 2007-04-04 at 12:22 +1000, Nick Piggin wrote:
> Eric Dumazet wrote:

> > I do think such workloads might benefit from a vma_cache not shared by 
> > all threads but private to each thread. A sequence could invalidate the 
> > cache(s).
> > 
> > ie instead of a mm->mmap_cache, having a mm->sequence, and each thread 
> > having a current->mmap_cache and current->mm_sequence
> 
> I have a patchset to do exactly this, btw.

/me too

However, I decided against pushing it because when it does happen that a
task is not involved with a vma lookup for longer than it takes the seq
count to wrap we have a stale pointer...

We could go and walk the tasks once in a while to reset the pointer, but
it all got a tad involved.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: missing madvise functionality
  2007-04-04  8:20   ` Jakub Jelinek
@ 2007-04-04  8:47     ` Nick Piggin
  2007-04-05  4:23       ` Nick Piggin
  0 siblings, 1 reply; 87+ messages in thread
From: Nick Piggin @ 2007-04-04  8:47 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Ulrich Drepper, Rik van Riel, Andrew Morton, Linux Kernel,
	Linux Memory Management

Jakub Jelinek wrote:
> On Wed, Apr 04, 2007 at 05:46:12PM +1000, Nick Piggin wrote:
> 
>>Does mmap(PROT_NONE) actually free the memory?
> 
> 
> Yes.
>         /* Clear old maps */
>         error = -ENOMEM;
> munmap_back:
>         vma = find_vma_prepare(mm, addr, &prev, &rb_link, &rb_parent);
>         if (vma && vma->vm_start < addr + len) {
>                 if (do_munmap(mm, addr, len))
>                         return -ENOMEM;
>                 goto munmap_back;
>         }

Thanks, I overlooked the mmap vs mprotect detail. So how are the subsequent
access faults avoided?


>>In the case of pages being unused then almost immediately reused, why is
>>it a bad solution to avoid freeing? Is it that you want to avoid
>>heuristics because in some cases they could fail and end up using memory?
> 
> 
> free(3) doesn't know if the memory will be reused soon, late or never.
> So avoiding trimming could substantially increase memory consumption with
> certain malloc/free patterns, especially in threaded programs that use
> multiple arenas.  Implementing some sort of deferred memory trimming
> in malloc is "solving" the problem in a wrong place, each app really has no
> idea (and should not have) what the current system memory pressure is.

Thanks for the clarification.


>>Secondly, why is MADV_DONTNEED bad? How much more expensive is a pagefault
>>than a syscall? (including the cost of the TLB fill for the memory access
>>after the syscall, of course).
> 
> 
> That's page fault per page rather than a syscall for the whole chunk,
> furthermore zeroing is expensive.

Ah, for big allocations. OK, we could make a MADV_POPULATE to prefault
pages (like mmap's MAP_POPULATE, but without the down_write(mmap_sem)).

If you're just about to use the pages anyway, how much of a win would
it be to avoid zeroing? We allocate cache hot pages for these guys...

-- 
SUSE Labs, Novell Inc.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: missing madvise functionality
  2007-04-04  8:25                   ` missing madvise functionality Peter Zijlstra
@ 2007-04-04  8:55                     ` Nick Piggin
  2007-04-04  9:12                       ` William Lee Irwin III
  2007-04-04  9:34                       ` Eric Dumazet
  0 siblings, 2 replies; 87+ messages in thread
From: Nick Piggin @ 2007-04-04  8:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Eric Dumazet, Andrew Morton, Jakub Jelinek, Ulrich Drepper,
	Andi Kleen, Rik van Riel, Linux Kernel, linux-mm, Hugh Dickins

[-- Attachment #1: Type: text/plain, Size: 962 bytes --]

Peter Zijlstra wrote:
> On Wed, 2007-04-04 at 12:22 +1000, Nick Piggin wrote:
> 
>>Eric Dumazet wrote:
> 
> 
>>>I do think such workloads might benefit from a vma_cache not shared by 
>>>all threads but private to each thread. A sequence could invalidate the 
>>>cache(s).
>>>
>>>ie instead of a mm->mmap_cache, having a mm->sequence, and each thread 
>>>having a current->mmap_cache and current->mm_sequence
>>
>>I have a patchset to do exactly this, btw.
> 
> 
> /me too
> 
> However, I decided against pushing it because when it does happen that a
> task is not involved with a vma lookup for longer than it takes the seq
> count to wrap we have a stale pointer...
> 
> We could go and walk the tasks once in a while to reset the pointer, but
> it all got a tad involved.

Well here is my core patch (against I think 2.6.16 + a set of vma cache
cleanups and abstractions). I didn't think the wrapping aspect was
terribly involved.

-- 
SUSE Labs, Novell Inc.

[-- Attachment #2: mm-thread-vma-cache.patch --]
[-- Type: text/plain, Size: 4389 bytes --]

Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -296,6 +296,8 @@ struct mm_struct {
 	struct vm_area_struct *mmap;		/* list of VMAs */
 	struct rb_root mm_rb;
 	struct vm_area_struct *vma_cache;	/* find_vma cache */
+ 	unsigned long vma_sequence;
+
 	unsigned long (*get_unmapped_area) (struct file *filp,
 				unsigned long addr, unsigned long len,
 				unsigned long pgoff, unsigned long flags);
@@ -693,6 +695,8 @@ enum sleep_type {
 	SLEEP_INTERRUPTED,
 };
 
+#define VMA_CACHE_SIZE	4
+
 struct task_struct {
 	volatile long state;	/* -1 unrunnable, 0 runnable, >0 stopped */
 	struct thread_info *thread_info;
@@ -734,6 +738,8 @@ struct task_struct {
 	struct list_head ptrace_list;
 
 	struct mm_struct *mm, *active_mm;
+	struct vm_area_struct *vma_cache[VMA_CACHE_SIZE];
+	unsigned long vma_cache_sequence;
 
 /* task state */
 	struct linux_binfmt *binfmt;
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c
+++ linux-2.6/mm/mmap.c
@@ -32,6 +32,40 @@
 
 static void vma_cache_touch(struct mm_struct *mm, struct vm_area_struct *vma)
 {
+	struct task_struct *curr = current;
+	if (mm == curr->mm) {
+		int i;
+		if (curr->vma_cache_sequence != mm->vma_sequence) {
+			curr->vma_cache_sequence = mm->vma_sequence;
+			curr->vma_cache[0] = vma;
+			for (i = 1; i < VMA_CACHE_SIZE; i++)
+				curr->vma_cache[i] = NULL;
+		} else {
+			int update_mm;
+
+			if (curr->vma_cache[0] == vma)
+				return;
+
+			for (i = 1; i < VMA_CACHE_SIZE; i++) {
+				if (curr->vma_cache[i] == vma)
+					break;
+			}
+			update_mm = 0;
+			if (i == VMA_CACHE_SIZE) {
+				update_mm = 1;
+				i = VMA_CACHE_SIZE-1;
+			}
+			while (i != 0) {
+				curr->vma_cache[i] = curr->vma_cache[i-1];
+				i--;
+			}
+			curr->vma_cache[0] = vma;
+
+			if (!update_mm)
+				return;
+		}
+	}
+
 	if (mm->vma_cache != vma) /* prevent cacheline bouncing */
 		mm->vma_cache = vma;
 }
@@ -39,27 +73,56 @@ static void vma_cache_touch(struct mm_st
 static void vma_cache_replace(struct mm_struct *mm, struct vm_area_struct *vma,
 						struct vm_area_struct *repl)
 {
+	mm->vma_sequence++;
+	if (unlikely(mm->vma_sequence == 0)) {
+		struct task_struct *curr = current, *t;
+		t = curr;
+		rcu_read_lock();
+		do {
+			t->vma_cache_sequence = -1;
+			t = next_thread(t);
+		} while (t != curr);
+		rcu_read_unlock();
+	}
+
 	if (mm->vma_cache == vma)
 		mm->vma_cache = repl;
 }
 
 static void vma_cache_invalidate(struct mm_struct *mm, struct vm_area_struct *vma)
 {
- 	if (mm->vma_cache == vma)
-		mm->vma_cache = NULL;
+	vma_cache_replace(mm, vma, NULL);
 }
 
 static struct vm_area_struct *vma_cache_find(struct mm_struct *mm,
 						unsigned long addr)
 {
-	struct vm_area_struct *vma = mm->vma_cache;
+	struct task_struct *curr;
+	struct vm_area_struct *vma;
 
 	preempt_disable();
 	__inc_page_state(vma_cache_query);
-	if (vma && vma->vm_end > addr && vma->vm_start <= addr)
+
+	curr = current;
+	if (mm == curr->mm && mm->vma_sequence == curr->vma_cache_sequence) {
+		int i;
+		for (i = 0; i < VMA_CACHE_SIZE; i++) {
+			vma = curr->vma_cache[i];
+			if (vma && vma->vm_end > addr && vma->vm_start <= addr){
+				__inc_page_state(vma_cache_hit);
+				goto out;
+			}
+		}
+	}
+
+	vma = mm->vma_cache;
+	if (vma && vma->vm_end > addr && vma->vm_start <= addr) {
 		__inc_page_state(vma_cache_hit);
-	else
-		vma = NULL;
+		goto out;
+	}
+
+	vma = NULL;
+out:
 	preempt_enable();
 
 	return vma;
@@ -1439,9 +1502,9 @@ struct vm_area_struct * find_vma(struct 
 				} else
 					rb_node = rb_node->rb_right;
 			}
-			if (vma)
-				vma_cache_touch(mm, vma);
 		}
+		if (vma)
+			vma_cache_touch(mm, vma);
 	}
 	return vma;
 }
@@ -1487,6 +1550,9 @@ find_vma_prev(struct mm_struct *mm, unsi
 	}
 
 out:
+	if (vma)
+		vma_cache_touch(mm, vma);
+
 	*pprev = prev;
 	return prev ? prev->vm_next : vma;
 }
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c
+++ linux-2.6/kernel/fork.c
@@ -198,6 +198,7 @@ static inline int dup_mmap(struct mm_str
 	mm->locked_vm = 0;
 	mm->mmap = NULL;
 	mm->vma_cache = NULL;
+	mm->vma_sequence = 0;
 	mm->free_area_cache = oldmm->mmap_base;
 	mm->cached_hole_size = ~0UL;
 	mm->map_count = 0;

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: missing madvise functionality
  2007-04-04  8:55                     ` Nick Piggin
@ 2007-04-04  9:12                       ` William Lee Irwin III
  2007-04-04  9:23                         ` Nick Piggin
  2007-04-04  9:34                       ` Eric Dumazet
  1 sibling, 1 reply; 87+ messages in thread
From: William Lee Irwin III @ 2007-04-04  9:12 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Peter Zijlstra, Eric Dumazet, Andrew Morton, Jakub Jelinek,
	Ulrich Drepper, Andi Kleen, Rik van Riel, Linux Kernel, linux-mm,
	Hugh Dickins

On Wed, Apr 04, 2007 at 06:55:18PM +1000, Nick Piggin wrote:
> +		rcu_read_lock();
> +		do {
> +			t->vma_cache_sequence = -1;
> +			t = next_thread(t);
> +		} while (t != curr);
> +		rcu_read_unlock();

LD_ASSUME_KERNEL=2.4.18 anyone?


-- wli

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: missing madvise functionality
  2007-04-03 23:02               ` Andrew Morton
@ 2007-04-04  9:15                 ` Hugh Dickins
  2007-04-04 14:55                   ` Rik van Riel
  2007-04-04 18:04                   ` Andrew Morton
  0 siblings, 2 replies; 87+ messages in thread
From: Hugh Dickins @ 2007-04-04  9:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jakub Jelinek, Ulrich Drepper, Andi Kleen, Rik van Riel,
	Linux Kernel, linux-mm

On Tue, 3 Apr 2007, Andrew Morton wrote:
> 
> All of which indicates that if we can remove the down_write(mmap_sem) from
> this glibc operation, things should get a lot better - there will be no
> additional context switches at all.
> 
> And we can surely do that if all we're doing is looking up pageframes,
> putting pages into fake-swapcache and moving them around on the page LRUs.
> 
> Hugh?  Sanity check?

Setting aside the fake-swapcache part, yes, Rik should be able to do what
Ulrich wants (operating on ptes and pages) without down_write(mmap_sem):
just needing down_read(mmap_sem) to keep the whole vma/pagetable structure
stable, and page table lock (literal or per-page-table) for each contents.

(I didn't understand how Rik would achieve his point 5, _no_ lock
contention while repeatedly re-marking these pages, but never mind.)

(Some mails in this thread overlook that we also use down_write(mmap_sem)
to guard simple things like vma->vm_flags: of course that in itself could
be manipulated with atomics, or spinlock; but like many of the vma fields,
changing it goes hand in hand with the chance that we have to split vma,
which does require the heavy-handed down_write(mmap_sem).  I expect that
splitting those uses apart would be harder than first appears, and better
to go for a more radical redesign - I don't know what.)

But you lose me with the fake-swapcache part of it: that came, I think,
from your initial idea that it would be okay to refault on these ptes.
Don't we all agree now that we'd prefer not to refault on those ptes,
unless some memory pressure has actually decided to pull them out?
(Hmm, yet more list balancing...)

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: missing madvise functionality
  2007-04-04  9:12                       ` William Lee Irwin III
@ 2007-04-04  9:23                         ` Nick Piggin
  0 siblings, 0 replies; 87+ messages in thread
From: Nick Piggin @ 2007-04-04  9:23 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Peter Zijlstra, Eric Dumazet, Andrew Morton, Jakub Jelinek,
	Ulrich Drepper, Andi Kleen, Rik van Riel, Linux Kernel, linux-mm,
	Hugh Dickins

William Lee Irwin III wrote:
> On Wed, Apr 04, 2007 at 06:55:18PM +1000, Nick Piggin wrote:
> 
>>+		rcu_read_lock();
>>+		do {
>>+			t->vma_cache_sequence = -1;
>>+			t = next_thread(t);
>>+		} while (t != curr);
>>+		rcu_read_unlock();
> 
> 
> LD_ASSUME_KERNEL=2.4.18 anyone?

Meaning?

-- 
SUSE Labs, Novell Inc.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: missing madvise functionality
  2007-04-04  8:55                     ` Nick Piggin
  2007-04-04  9:12                       ` William Lee Irwin III
@ 2007-04-04  9:34                       ` Eric Dumazet
  2007-04-04  9:45                         ` Nick Piggin
  2007-04-04 10:05                         ` Nick Piggin
  1 sibling, 2 replies; 87+ messages in thread
From: Eric Dumazet @ 2007-04-04  9:34 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Peter Zijlstra, Andrew Morton, Jakub Jelinek, Ulrich Drepper,
	Andi Kleen, Rik van Riel, Linux Kernel, linux-mm, Hugh Dickins

On Wed, 04 Apr 2007 18:55:18 +1000
Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> Peter Zijlstra wrote:
> > On Wed, 2007-04-04 at 12:22 +1000, Nick Piggin wrote:
> > 
> >>Eric Dumazet wrote:
> > 
> > 
> >>>I do think such workloads might benefit from a vma_cache not shared by 
> >>>all threads but private to each thread. A sequence could invalidate the 
> >>>cache(s).
> >>>
> >>>ie instead of a mm->mmap_cache, having a mm->sequence, and each thread 
> >>>having a current->mmap_cache and current->mm_sequence
> >>
> >>I have a patchset to do exactly this, btw.
> > 
> > 
> > /me too
> > 
> > However, I decided against pushing it because when it does happen that a
> > task is not involved with a vma lookup for longer than it takes the seq
> > count to wrap we have a stale pointer...
> > 
> > We could go and walk the tasks once in a while to reset the pointer, but
> > it all got a tad involved.
> 
> Well here is my core patch (against I think 2.6.16 + a set of vma cache
> cleanups and abstractions). I didn't think the wrapping aspect was
> terribly involved.

Well, I believe this one is too expensive. I was thinking of a light one :

I am not deleting mmap_sem, but adding a sequence number to mm_struct, that is incremented each time a vma is added/deleted, not each time mmap_sem is taken (read or write)

Each thread has its own copy of the sequence, taken at the time find_vma() had to do a full lookup.

I believe some optimized paths could call check_vma_cache() without mmap_sem read lock taken, and if it fails, take the mmap_sem lock and do the slow path.


--- linux-2.6.21-rc5/include/linux/sched.h
+++ linux-2.6.21-rc5-ed/include/linux/sched.h
@@ -319,10 +319,14 @@ typedef unsigned long mm_counter_t;
 		(mm)->hiwater_vm = (mm)->total_vm;	\
 } while (0)
 
+struct vm_area_cache {
+	struct vm_area_struct * mmap_cache; /* last find_vma result */
+	unsigned int sequence;
+	};
+
 struct mm_struct {
 	struct vm_area_struct * mmap;		/* list of VMAs */
 	struct rb_root mm_rb;
-	struct vm_area_struct * mmap_cache;	/* last find_vma result */
 	unsigned long (*get_unmapped_area) (struct file *filp,
 				unsigned long addr, unsigned long len,
 				unsigned long pgoff, unsigned long flags);
@@ -336,6 +340,7 @@ struct mm_struct {
 	atomic_t mm_count;			/* How many references to "struct mm_struct" (users count as 1) */
 	int map_count;				/* number of VMAs */
 	struct rw_semaphore mmap_sem;
+	unsigned int mm_sequence;
 	spinlock_t page_table_lock;		/* Protects page tables and some counters */
 
 	struct list_head mmlist;		/* List of maybe swapped mm's.  These are globally strung
@@ -875,7 +880,7 @@ struct task_struct {
 	struct list_head tasks;
 
 	struct mm_struct *mm, *active_mm;
-
+	struct vm_area_cache vmacache;
 /* task state */
 	struct linux_binfmt *binfmt;
 	int exit_state;
--- linux-2.6.21-rc5/include/linux/mm.h
+++ linux-2.6.21-rc5-ed/include/linux/mm.h
@@ -1176,15 +1176,18 @@ extern int expand_upwards(struct vm_area
 #endif
 
 /* Look up the first VMA which satisfies  addr < vm_end,  NULL if none. */
-extern struct vm_area_struct * find_vma(struct mm_struct * mm, unsigned long addr);
+extern struct vm_area_struct * find_vma(struct mm_struct * mm,
+					unsigned long addr,
+					struct vm_area_cache *cache);
 extern struct vm_area_struct * find_vma_prev(struct mm_struct * mm, unsigned long addr,
 					     struct vm_area_struct **pprev);
 
 /* Look up the first VMA which intersects the interval start_addr..end_addr-1,
    NULL if none.  Assume start_addr < end_addr. */
-static inline struct vm_area_struct * find_vma_intersection(struct mm_struct * mm, unsigned long start_addr, unsigned long end_addr)
+static inline struct vm_area_struct * find_vma_intersection(struct mm_struct * mm, 
+	unsigned long start_addr, unsigned long end_addr, struct vm_area_cache *cache)
 {
-	struct vm_area_struct * vma = find_vma(mm,start_addr);
+	struct vm_area_struct * vma = find_vma(mm,start_addr,cache);
 
 	if (vma && end_addr <= vma->vm_start)
 		vma = NULL;
--- linux-2.6.21-rc5/mm/mmap.c
+++ linux-2.6.21-rc5-ed/mm/mmap.c
@@ -267,7 +267,7 @@ asmlinkage unsigned long sys_brk(unsigne
 	}
 
 	/* Check against existing mmap mappings. */
-	if (find_vma_intersection(mm, oldbrk, newbrk+PAGE_SIZE))
+	if (find_vma_intersection(mm, oldbrk, newbrk+PAGE_SIZE, &current->vmacache))
 		goto out;
 
 	/* Ok, looks good - let it rip. */
@@ -447,6 +447,7 @@ static void vma_link(struct mm_struct *m
 		spin_unlock(&mapping->i_mmap_lock);
 
 	mm->map_count++;
+	mm->mm_sequence++;
 	validate_mm(mm);
 }
 
@@ -473,8 +474,7 @@ __vma_unlink(struct mm_struct *mm, struc
 {
 	prev->vm_next = vma->vm_next;
 	rb_erase(&vma->vm_rb, &mm->mm_rb);
-	if (mm->mmap_cache == vma)
-		mm->mmap_cache = prev;
+	mm->mm_sequence++;
 }
 
 /*
@@ -1201,7 +1201,7 @@ arch_get_unmapped_area(struct file *filp
 
 	if (addr) {
 		addr = PAGE_ALIGN(addr);
-		vma = find_vma(mm, addr);
+		vma = find_vma(mm, addr, &current->vmacache);
 		if (TASK_SIZE - len >= addr &&
 		    (!vma || addr + len <= vma->vm_start))
 			return addr;
@@ -1214,7 +1214,7 @@ arch_get_unmapped_area(struct file *filp
 	}
 
 full_search:
-	for (vma = find_vma(mm, addr); ; vma = vma->vm_next) {
+	for (vma = find_vma(mm, addr, &current->vmacache); ; vma = vma->vm_next) {
 		/* At this point:  (!vma || addr < vma->vm_end). */
 		if (TASK_SIZE - len < addr) {
 			/*
@@ -1275,7 +1275,7 @@ arch_get_unmapped_area_topdown(struct fi
 	/* requesting a specific address */
 	if (addr) {
 		addr = PAGE_ALIGN(addr);
-		vma = find_vma(mm, addr);
+		vma = find_vma(mm, addr, &current->vmacache);
 		if (TASK_SIZE - len >= addr &&
 				(!vma || addr + len <= vma->vm_start))
 			return addr;
@@ -1292,7 +1292,7 @@ arch_get_unmapped_area_topdown(struct fi
 
 	/* make sure it can fit in the remaining address space */
 	if (addr > len) {
-		vma = find_vma(mm, addr-len);
+		vma = find_vma(mm, addr-len, &current->vmacache);
 		if (!vma || addr <= vma->vm_start)
 			/* remember the address as a hint for next time */
 			return (mm->free_area_cache = addr-len);
@@ -1309,7 +1309,7 @@ arch_get_unmapped_area_topdown(struct fi
 		 * else if new region fits below vma->vm_start,
 		 * return with success:
 		 */
-		vma = find_vma(mm, addr);
+		vma = find_vma(mm, addr, &current->vmacache);
 		if (!vma || addr+len <= vma->vm_start)
 			/* remember the address as a hint for next time */
 			return (mm->free_area_cache = addr);
@@ -1397,16 +1397,28 @@ get_unmapped_area(struct file *file, uns
 
 EXPORT_SYMBOL(get_unmapped_area);
 
+struct vm_area_struct * check_vma_cache(struct mm_struct * mm, unsigned long addr, struct vm_area_cache *cache)
+{
+	struct vm_area_struct *vma = cache->mmap_cache;
+	unsigned int mmseq = mm->mm_sequence;
+	smp_rmb();
+	if (cache->sequence == mmseq &&
+		vma &&
+		addr < vma->vm_end && vma->vm_start <= addr)
+		return vma;
+	return NULL;
+}
+
 /* Look up the first VMA which satisfies  addr < vm_end,  NULL if none. */
-struct vm_area_struct * find_vma(struct mm_struct * mm, unsigned long addr)
+struct vm_area_struct * find_vma(struct mm_struct * mm, unsigned long addr, struct vm_area_cache *cache)
 {
 	struct vm_area_struct *vma = NULL;
 
 	if (mm) {
 		/* Check the cache first. */
 		/* (Cache hit rate is typically around 35%.) */
-		vma = mm->mmap_cache;
-		if (!(vma && vma->vm_end > addr && vma->vm_start <= addr)) {
+		vma = check_vma_cache(mm, addr, cache);
+		if (!vma) {
 			struct rb_node * rb_node;
 
 			rb_node = mm->mm_rb.rb_node;
@@ -1426,8 +1438,10 @@ struct vm_area_struct * find_vma(struct 
 				} else
 					rb_node = rb_node->rb_right;
 			}
-			if (vma)
-				mm->mmap_cache = vma;
+			if (vma) {
+				cache->mmap_cache = vma;
+				cache->sequence = mm->mm_sequence;
+			}
 		}
 	}
 	return vma;
@@ -1638,7 +1652,7 @@ find_extend_vma(struct mm_struct * mm, u
 	unsigned long start;
 
 	addr &= PAGE_MASK;
-	vma = find_vma(mm,addr);
+	vma = find_vma(mm,addr,&current->vmacache);
 	if (!vma)
 		return NULL;
 	if (vma->vm_start <= addr)
@@ -1726,7 +1740,7 @@ detach_vmas_to_be_unmapped(struct mm_str
 	else
 		addr = vma ?  vma->vm_end : mm->mmap_base;
 	mm->unmap_area(mm, addr);
-	mm->mmap_cache = NULL;		/* Kill the cache. */
+	mm->mm_sequence++;
 }
 
 /*
@@ -1823,7 +1837,7 @@ int do_munmap(struct mm_struct *mm, unsi
 	}
 
 	/* Does it split the last one? */
-	last = find_vma(mm, end);
+	last = find_vma(mm, end, &current->vmacache);
 	if (last && end > last->vm_start) {
 		int error = split_vma(mm, last, end, 1);
 		if (error)
--- linux-2.6.21-rc5/kernel/fork.c
+++ linux-2.6.21-rc5-ed/kernel/fork.c
@@ -213,7 +213,6 @@ static inline int dup_mmap(struct mm_str
 
 	mm->locked_vm = 0;
 	mm->mmap = NULL;
-	mm->mmap_cache = NULL;
 	mm->free_area_cache = oldmm->mmap_base;
 	mm->cached_hole_size = ~0UL;
 	mm->map_count = 0;
@@ -564,6 +563,7 @@ good_mm:
 
 	tsk->mm = mm;
 	tsk->active_mm = mm;
+	tsk->vmacache.mmap_cache = NULL;
 	return 0;
 
 fail_nomem:
--- linux-2.6.21-rc5/mm/mempolicy.c
+++ linux-2.6.21-rc5-ed/mm/mempolicy.c
@@ -532,7 +532,7 @@ long do_get_mempolicy(int *policy, nodem
 		return -EINVAL;
 	if (flags & MPOL_F_ADDR) {
 		down_read(&mm->mmap_sem);
-		vma = find_vma_intersection(mm, addr, addr+1);
+		vma = find_vma_intersection(mm, addr, addr+1, &current->mmcache);
 		if (!vma) {
 			up_read(&mm->mmap_sem);
 			return -EFAULT;
--- linux-2.6.21-rc5/arch/i386/mm/fault.c
+++ linux-2.6.21-rc5-ed/arch/i386/mm/fault.c
@@ -374,7 +374,7 @@ fastcall void __kprobes do_page_fault(st
 		down_read(&mm->mmap_sem);
 	}
 
-	vma = find_vma(mm, address);
+	vma = find_vma(mm, address, &tsk->vmacache);
 	if (!vma)
 		goto bad_area;
 	if (vma->vm_start <= address)
--- linux-2.6.21-rc5/kernel/futex.c
+++ linux-2.6.21-rc5-ed/kernel/futex.c
@@ -346,7 +346,7 @@ static int futex_handle_fault(unsigned l
 	struct vm_area_struct * vma;
 	struct mm_struct *mm = current->mm;
 
-	if (attempt > 2 || !(vma = find_vma(mm, address)) ||
+	if (attempt > 2 || !(vma = find_vma(mm, address, &current->vmacache)) ||
 	    vma->vm_start > address || !(vma->vm_flags & VM_WRITE))
 		return -EFAULT;
 
--- linux-2.6.21-rc5/mm/fremap.c
+++ linux-2.6.21-rc5-ed/mm/fremap.c
@@ -146,7 +146,7 @@ asmlinkage long sys_remap_file_pages(uns
 	/* We need down_write() to change vma->vm_flags. */
 	down_read(&mm->mmap_sem);
  retry:
-	vma = find_vma(mm, start);
+	vma = find_vma(mm, start, &current->vmacache);
 
 	/*
 	 * Make sure the vma is shared, that it supports prefaulting,
--- linux-2.6.21-rc5/mm/madvise.c
+++ linux-2.6.21-rc5-ed/mm/madvise.c
@@ -329,7 +329,7 @@ asmlinkage long sys_madvise(unsigned lon
 		if (prev)
 			vma = prev->vm_next;
 		else	/* madvise_remove dropped mmap_sem */
-			vma = find_vma(current->mm, start);
+			vma = find_vma(current->mm, start, &current->vmacache);
 	}
 out:
 	up_write(&current->mm->mmap_sem);
--- linux-2.6.21-rc5/mm/memory.c
+++ linux-2.6.21-rc5-ed/mm/memory.c
@@ -2556,7 +2556,7 @@ int make_pages_present(unsigned long add
 	int ret, len, write;
 	struct vm_area_struct * vma;
 
-	vma = find_vma(current->mm, addr);
+	vma = find_vma(current->mm, addr, &current->vmacache);
 	if (!vma)
 		return -1;
 	write = (vma->vm_flags & VM_WRITE) != 0;
--- linux-2.6.21-rc5/mm/mincore.c
+++ linux-2.6.21-rc5-ed/mm/mincore.c
@@ -63,7 +63,7 @@ static long do_mincore(unsigned long add
 	unsigned long nr;
 	int i;
 	pgoff_t pgoff;
-	struct vm_area_struct *vma = find_vma(current->mm, addr);
+	struct vm_area_struct *vma = find_vma(current->mm, addr, &current->vmacache);
 
 	/*
 	 * find_vma() didn't find anything above us, or we're
--- linux-2.6.21-rc5/mm/mremap.c
+++ linux-2.6.21-rc5-ed/mm/mremap.c
@@ -315,7 +315,7 @@ unsigned long do_mremap(unsigned long ad
 	 * Ok, we need to grow..  or relocate.
 	 */
 	ret = -EFAULT;
-	vma = find_vma(mm, addr);
+	vma = find_vma(mm, addr, &current->vmacache);
 	if (!vma || vma->vm_start > addr)
 		goto out;
 	if (is_vm_hugetlb_page(vma)) {
--- linux-2.6.21-rc5/mm/msync.c
+++ linux-2.6.21-rc5-ed/mm/msync.c
@@ -54,7 +54,7 @@ asmlinkage long sys_msync(unsigned long 
 	 * just ignore them, but return -ENOMEM at the end.
 	 */
 	down_read(&mm->mmap_sem);
-	vma = find_vma(mm, start);
+	vma = find_vma(mm, start, &current->vmacache);
 	for (;;) {
 		struct file *file;
 
@@ -86,7 +86,7 @@ asmlinkage long sys_msync(unsigned long 
 			if (error || start >= end)
 				goto out;
 			down_read(&mm->mmap_sem);
-			vma = find_vma(mm, start);
+			vma = find_vma(mm, start, &current->vmacache);
 		} else {
 			if (start >= end) {
 				error = 0;
--- linux-2.6.21-rc5/fs/proc/task_mmu.c
+++ linux-2.6.21-rc5-ed/fs/proc/task_mmu.c
@@ -405,9 +405,15 @@ static void *m_start(struct seq_file *m,
 	down_read(&mm->mmap_sem);
 
 	/* Start with last addr hint */
-	if (last_addr && (vma = find_vma(mm, last_addr))) {
-		vma = vma->vm_next;
-		goto out;
+	if (last_addr) {
+		struct vm_area_cache nocache = {
+			.sequence = mm->mm_sequence - 1,
+			};
+		vma = find_vma(mm, last_addr, &nocache);
+		if (vma) {
+			vma = vma->vm_next;
+			goto out;
+		}
 	}
 
 	/*
--- linux-2.6.21-rc5/drivers/char/mem.c
+++ linux-2.6.21-rc5-ed/drivers/char/mem.c
@@ -633,7 +633,7 @@ static inline size_t read_zero_pagealign
 	down_read(&mm->mmap_sem);
 
 	/* For private mappings, just map in zero pages. */
-	for (vma = find_vma(mm, addr); vma; vma = vma->vm_next) {
+	for (vma = find_vma(mm, addr, &current->vmacache); vma; vma = vma->vm_next) {
 		unsigned long count;
 
 		if (vma->vm_start > addr || (vma->vm_flags & VM_WRITE) == 0)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: missing madvise functionality
  2007-04-04  9:34                       ` Eric Dumazet
@ 2007-04-04  9:45                         ` Nick Piggin
  2007-04-04 10:05                         ` Nick Piggin
  1 sibling, 0 replies; 87+ messages in thread
From: Nick Piggin @ 2007-04-04  9:45 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Peter Zijlstra, Andrew Morton, Jakub Jelinek, Ulrich Drepper,
	Andi Kleen, Rik van Riel, Linux Kernel, linux-mm, Hugh Dickins

Eric Dumazet wrote:
> On Wed, 04 Apr 2007 18:55:18 +1000
> Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> 
> 
>>Peter Zijlstra wrote:
>>
>>>On Wed, 2007-04-04 at 12:22 +1000, Nick Piggin wrote:
>>>
>>>
>>>>Eric Dumazet wrote:
>>>
>>>
>>>>>I do think such workloads might benefit from a vma_cache not shared by 
>>>>>all threads but private to each thread. A sequence could invalidate the 
>>>>>cache(s).
>>>>>
>>>>>ie instead of a mm->mmap_cache, having a mm->sequence, and each thread 
>>>>>having a current->mmap_cache and current->mm_sequence
>>>>
>>>>I have a patchset to do exactly this, btw.
>>>
>>>
>>>/me too
>>>
>>>However, I decided against pushing it because when it does happen that a
>>>task is not involved with a vma lookup for longer than it takes the seq
>>>count to wrap we have a stale pointer...
>>>
>>>We could go and walk the tasks once in a while to reset the pointer, but
>>>it all got a tad involved.
>>
>>Well here is my core patch (against I think 2.6.16 + a set of vma cache
>>cleanups and abstractions). I didn't think the wrapping aspect was
>>terribly involved.
> 
> 
> Well, I believe this one is too expensive. I was thinking of a light one :
> 
> I am not deleting mmap_sem, but adding a sequence number to mm_struct, that is incremented each time a vma is added/deleted, not each time mmap_sem is taken (read or write)

That's exactly what mine does (except IIRC it doesn't invalidate when
you add a vma).


> Each thread has its own copy of the sequence, taken at the time find_vma() had to do a full lookup.
> 
> I believe some optimized paths could call check_vma_cache() without mmap_sem read lock taken, and if it fails, take the mmap_sem lock and do the slow path.

The mmap_sem for read does not only protect the mm_rb rbtree structure, but
the vmas themselves as well as their page tables, so you can't do that.

You could do it if you had a lock-per-vma to synchronise against write
operations, and rcu-freed vmas or some such... but I don't think we should
go down a road like that until we first remove mmap_sem from low hanging
things (like private futexes!) and then see who's complaining.

-- 
SUSE Labs, Novell Inc.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: missing madvise functionality
  2007-04-04  9:34                       ` Eric Dumazet
  2007-04-04  9:45                         ` Nick Piggin
@ 2007-04-04 10:05                         ` Nick Piggin
  2007-04-04 11:54                           ` Eric Dumazet
  1 sibling, 1 reply; 87+ messages in thread
From: Nick Piggin @ 2007-04-04 10:05 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Peter Zijlstra, Andrew Morton, Jakub Jelinek, Ulrich Drepper,
	Andi Kleen, Rik van Riel, Linux Kernel, linux-mm, Hugh Dickins

Eric Dumazet wrote:

> Well, I believe this one is too expensive. I was thinking of a light one :

This one seems worse. Passing your vm_area_cache around everywhere, which
is just intrusive and dangerous because ot becomes decoupled from the mm
struct you are passing around. Watch this:


> @@ -1638,7 +1652,7 @@ find_extend_vma(struct mm_struct * mm, u
>  	unsigned long start;
>  
>  	addr &= PAGE_MASK;
> -	vma = find_vma(mm,addr);
> +	vma = find_vma(mm,addr,&current->vmacache);
>  	if (!vma)
>  		return NULL;
>  	if (vma->vm_start <= addr)

So now you can have current calling find_extend_vma on someone else's mm
but using their cache. So you're going to return current's vma, or current
is going to get one of mm's vmas in its cache :P

-- 
SUSE Labs, Novell Inc.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: missing madvise functionality
  2007-04-04 10:05                         ` Nick Piggin
@ 2007-04-04 11:54                           ` Eric Dumazet
  2007-04-05  2:01                             ` Nick Piggin
  0 siblings, 1 reply; 87+ messages in thread
From: Eric Dumazet @ 2007-04-04 11:54 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Peter Zijlstra, Andrew Morton, Jakub Jelinek, Ulrich Drepper,
	Andi Kleen, Rik van Riel, Linux Kernel, linux-mm, Hugh Dickins

On Wed, 04 Apr 2007 20:05:54 +1000
Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> 
> > @@ -1638,7 +1652,7 @@ find_extend_vma(struct mm_struct * mm, u
> >  	unsigned long start;
> >  
> >  	addr &= PAGE_MASK;
> > -	vma = find_vma(mm,addr);
> > +	vma = find_vma(mm,addr,&current->vmacache);
> >  	if (!vma)
> >  		return NULL;
> >  	if (vma->vm_start <= addr)
> 
> So now you can have current calling find_extend_vma on someone else's mm
> but using their cache. So you're going to return current's vma, or current
> is going to get one of mm's vmas in its cache :P

This was not a working patch, just to throw the idea, since the answers I got showed I was not understood.

In this case, find_extend_vma() should of course have one struct vm_area_cache * argument, like find_vma()

One single cache on one mm is not scalable. oprofile badly hits it on a dual cpu config.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: missing madvise functionality
  2007-04-03 20:29           ` Jakub Jelinek
  2007-04-03 20:38             ` Rik van Riel
  2007-04-03 21:49             ` Andrew Morton
@ 2007-04-04 13:09             ` William Lee Irwin III
  2007-04-04 13:38               ` William Lee Irwin III
  2007-04-04 18:51               ` Andrew Morton
  2007-04-04 23:00             ` preemption and rwsems (was: Re: missing madvise functionality) Andrew Morton
                               ` (2 subsequent siblings)
  5 siblings, 2 replies; 87+ messages in thread
From: William Lee Irwin III @ 2007-04-04 13:09 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Ulrich Drepper, Andrew Morton, Andi Kleen, Rik van Riel,
	Linux Kernel, linux-mm, Hugh Dickins

[-- Attachment #1: Type: text/plain, Size: 338 bytes --]

On Tue, Apr 03, 2007 at 04:29:37PM -0400, Jakub Jelinek wrote:
> void *
> tf (void *arg)
> {
>   (void) arg;
>   size_t ps = sysconf (_SC_PAGE_SIZE);
>   void *p = mmap (NULL, 128 * ps, PROT_READ | PROT_WRITE,
>                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>   if (p == MAP_FAILED)
>     exit (1);
>   int i;

Oh dear.


-- wli

[-- Attachment #2: jakub.c --]
[-- Type: text/x-csrc, Size: 6436 bytes --]

#include <pthread.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <errno.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/time.h>
#include <sys/resource.h>

enum thread_return {
	tr_success	=  0,
	tr_mmap_init	= -1,
	tr_mmap_free	= -2,
	tr_mprotect	= -3,
	tr_madvise	= -4,
	tr_unknown	= -5,
	tr_munmap	= -6,
};

enum release_method {
	release_by_mmap		= 0,
	release_by_madvise	= 1,
	release_by_max		= 2,
};

struct thread_argument {
	size_t page_size;
	int iterations, pages_per_thread, nr_threads;
	enum release_method method;
};

static enum thread_return mmap_release(void *p, size_t n)
{
	void *q;

	q = mmap(p, n, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED, -1, 0);
	if (p != q) {
		perror("thread_function: mmap release failed");
		return tr_mmap_free;
	}
	if (mprotect(p, n, PROT_READ | PROT_WRITE)) {
		perror("thread_function: mprotect failed");
		return tr_mprotect;
	}
	return tr_success;
}

static enum thread_return madvise_release(void *p, size_t n)
{
	if (madvise(p, n, MADV_DONTNEED)) {
		perror("thread_function: madvise failed");
		return tr_madvise;
	}
	return tr_success;
}

static enum thread_return (*release_methods[])(void *, size_t) = {
	mmap_release,
	madvise_release,
};

static void *thread_function(void *__arg)
{
	char *p;
	int i;
	struct thread_argument *arg = __arg;
	size_t arena_size = arg->pages_per_thread * arg->page_size;

	p = (char *)mmap(NULL, arena_size,
				PROT_READ | PROT_WRITE,
				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED) {
		perror("thread_function: arena allocation failed");
		return (void *)tr_mmap_init;
	}
	for (i = 0; i < arg->iterations; i++) {
		size_t s;
		char *q, *r;
		enum thread_return ret;

		/* Pretend to use the buffer.  */
		r = p + arena_size;
		for (q = p; q < r; q += arg->page_size)
			*q = 1;
		for (s = 0, q = p; q < r; q += arg->page_size)
			s += *q;
		if (arg->method >= release_by_max) {
			perror("thread_function: "
				"unknown freeing method specified");
			return (void *)tr_unknown;
		}
		ret = (*release_methods[arg->method])(p, arena_size);
		if (ret != tr_success)
			return (void *)ret;
	}
	if (munmap(p, arena_size)) {
		perror("thread_function: munmap() failed");
		return (void *)tr_munmap;
	}
	return (void *)tr_success;
}

static int configure(struct thread_argument *arg, int argc, char *argv[])
{
	char optstring[] = "t:m:i:p:";
	int c, tmp, ret = 0;
	long n;

	n = sysconf(_SC_PAGE_SIZE);
	if (n < 0) {
		perror("configure: sysconf(_SC_PAGE_SIZE) failed");
		ret = -1;
	}
	arg->nr_threads = 32, 
	arg->page_size = (size_t)n;
	arg->method = release_by_mmap;
	arg->iterations = 100000;
	arg->pages_per_thread = 128;

	while ((c = getopt(argc, argv, optstring)) != -1) {
		switch (c) {
			case 't':
				if (sscanf(optarg, "%d", &tmp) == 1)
					arg->nr_threads = tmp;
				else {
					perror("configure: non-numeric thread count");
					ret = -1;
				}
				break;
			case 'm':
				if (!strcmp(optarg, "mmap"))
					arg->method = release_by_mmap;
				else if (!strcmp(optarg, "madvise"))
					arg->method = release_by_madvise;
				else {
					perror("configure: unrecognised release method");
					ret = -1;
				}
				break;
			case 'i':
				if (sscanf(optarg, "%d", &tmp) == 1)
					arg->iterations = tmp;
				else {
					perror("configure: non-numeric iteration count");
					ret = -1;
				}
				break;
			case 'p':
				if (sscanf(optarg, "%d", &tmp) == 1)
					arg->pages_per_thread = tmp;
				else {
					perror("configure: non-numeric pages per thread count");
					ret = -1;
				}
				break;
			default:
				perror("unrecognignized argument");
				ret = -1;
		}
	}
	if (arg->nr_threads <= 0) {
		perror("configure: zero or negative thread count");
		ret = -1;
	}
	if (arg->iterations < 0) {
		perror("configure: negative iteration count");
		ret = -1;
	}
	if (arg->pages_per_thread <= 0) {
		perror("configure: zero or negative arena size");
		ret = -1;
	}
	if (SIZE_MAX/arg->page_size < (size_t)arg->pages_per_thread) {
		perror("configure: arena size overflow");
		ret = -1;
	}
	return ret;
}

static unsigned long long timeval_to_usec(struct timeval *tv)
{
	return 1000000*tv->tv_sec + tv->tv_usec;
}

static unsigned long long elapsed_usec(struct timeval *tv1, struct timeval *tv2)
{
	return timeval_to_usec(tv2) - timeval_to_usec(tv1);
}

#define user_usec(ru)	timeval_to_usec(&(ru)->ru_utime)
#define sys_usec(ru)	timeval_to_usec(&(ru)->ru_stime)
#define user_sec(ru)	((user_usec(ru) % 60000000ULL)/1000000.0)
#define sys_sec(ru)	((sys_usec(ru) % 60000000ULL)/1000000.0)
#define elapsed_sec(tv1, tv2)						\
		((elapsed_usec(tv1, tv2) % 60000000ULL)/1000000.0)

#define user_min(ru)	((unsigned long)((user_usec(ru)/60000000ULL) % 60))
#define sys_min(ru)	((unsigned long)((sys_usec(ru)/60000000ULL) % 60))
#define elapsed_min(tv1, tv2)						\
		((unsigned long)((elapsed_usec(tv1, tv2)/60000000ULL) % 60))

#define user_hrs(ru)	((unsigned long)(user_usec(ru)/3600000000ULL))
#define sys_hrs(ru)	((unsigned long)(user_usec(ru)/3600000000ULL))
#define elapsed_hrs(tv1, tv2)						\
		((unsigned long)(elapsed_usec(tv1, tv2)/3600000000ULL))

int main(int argc, char *argv[])
{
	int i, ret = EXIT_SUCCESS;
	struct thread_argument arg;
	struct rusage ru;
	struct timeval start, finish;
	pthread_t *th;

	if (gettimeofday(&start, NULL)) {
		perror("main: initial gettimeofday failed");
		return EXIT_FAILURE;
	}
	if (configure(&arg, argc, argv))
		return EXIT_FAILURE;
	th = calloc(arg.nr_threads, sizeof(pthread_t));
	if (!th) {
		perror("main: calloc of thread array failed");
		return EXIT_FAILURE;
	}
	for (i = 0; i < arg.nr_threads; i++) {
		if (pthread_create(&th[i], NULL, thread_function, &arg)) {
			perror("main: pthread_create failed");
			break;
		}
	}
	for (--i; i >= 0; --i) {
		if (pthread_join(th[i], NULL)) {
			perror("main: pthread_join failed");
			ret = EXIT_FAILURE;
		}
	}
	free(th);
	getrusage(RUSAGE_SELF, &ru);
	if (gettimeofday(&finish, NULL)) {
		perror("final gettimeofday failed");
		ret = EXIT_FAILURE;
	}
	if (printf("%lu:%.2lu:%05.2lf elapsed time\n"
		"%lu:%.2lu:%05.2lf user time\n"
		"%lu:%.2lu:%05.2lf system time\n"
		"%ld major faults\n"
		"%ld minor faults\n",
		elapsed_hrs(&start, &finish),
			elapsed_min(&start, &finish),
			elapsed_sec(&start, &finish),
		user_hrs(&ru), user_min(&ru), user_sec(&ru),
		sys_hrs(&ru), sys_min(&ru), sys_sec(&ru),
		ru.ru_majflt,
		ru.ru_minflt) < 0)
			ret = EXIT_FAILURE;
	return ret;
}

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: missing madvise functionality
  2007-04-04 13:09             ` William Lee Irwin III
@ 2007-04-04 13:38               ` William Lee Irwin III
  2007-04-04 18:51               ` Andrew Morton
  1 sibling, 0 replies; 87+ messages in thread
From: William Lee Irwin III @ 2007-04-04 13:38 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Ulrich Drepper, Andrew Morton, Andi Kleen, Rik van Riel,
	Linux Kernel, linux-mm, Hugh Dickins

On Wed, Apr 04, 2007 at 06:09:18AM -0700, William Lee Irwin III wrote:
> 	for (--i; i >= 0; --i) {
> 		if (pthread_join(th[i], NULL)) {
> 			perror("main: pthread_join failed");
> 			ret = EXIT_FAILURE;
> 		}
> 	}

Obligatory brown paper bag patch:


--- ./jakub.c.orig	2007-04-04 05:57:23.409493248 -0700
+++ ./jakub.c	2007-04-04 06:35:34.296043432 -0700
@@ -232,10 +232,14 @@ int main(int argc, char *argv[])
 		}
 	}
 	for (--i; i >= 0; --i) {
-		if (pthread_join(th[i], NULL)) {
+		void *status;
+
+		if (pthread_join(th[i], &status)) {
 			perror("main: pthread_join failed");
 			ret = EXIT_FAILURE;
 		}
+		if (status != (void *)tr_success)
+			ret = EXIT_FAILURE;
 	}
 	free(th);
 	getrusage(RUSAGE_SELF, &ru);


-- wli

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: missing madvise functionality
  2007-04-04  9:15                 ` Hugh Dickins
@ 2007-04-04 14:55                   ` Rik van Riel
  2007-04-04 15:25                     ` Hugh Dickins
  2007-04-04 18:04                   ` Andrew Morton
  1 sibling, 1 reply; 87+ messages in thread
From: Rik van Riel @ 2007-04-04 14:55 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Jakub Jelinek, Ulrich Drepper, Andi Kleen,
	Linux Kernel, linux-mm

Hugh Dickins wrote:

> (I didn't understand how Rik would achieve his point 5, _no_ lock
> contention while repeatedly re-marking these pages, but never mind.)

The CPU marks them accessed&dirty when they are reused.

The VM only moves the reused pages back to the active list
on memory pressure.  This means that when the system is
not under memory pressure, the same page can simply stay
PG_lazyfree for multiple malloc/free rounds.

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: missing madvise functionality
  2007-04-04 14:55                   ` Rik van Riel
@ 2007-04-04 15:25                     ` Hugh Dickins
  2007-04-05  1:44                       ` Nick Piggin
  0 siblings, 1 reply; 87+ messages in thread
From: Hugh Dickins @ 2007-04-04 15:25 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrew Morton, Jakub Jelinek, Ulrich Drepper, Andi Kleen,
	Linux Kernel, linux-mm

On Wed, 4 Apr 2007, Rik van Riel wrote:
> Hugh Dickins wrote:
> 
> > (I didn't understand how Rik would achieve his point 5, _no_ lock
> > contention while repeatedly re-marking these pages, but never mind.)
> 
> The CPU marks them accessed&dirty when they are reused.
> 
> The VM only moves the reused pages back to the active list
> on memory pressure.  This means that when the system is
> not under memory pressure, the same page can simply stay
> PG_lazyfree for multiple malloc/free rounds.

Sure, there's no need for repetitious locking at the LRU end of it;
but you said "if the system has lots of free memory, pages can go
through multiple free/malloc cycles while sitting on the dontneed
list, very lazily with no lock contention".  I took that to mean,
with userspace repeatedly madvising on the ranges they fall in,
which will involve mmap_sem and ptl each time - just in order
to check that no LRU movement is required each time.

(Of course, there's also the problem that we don't leave our
systems with lots of free memory: some LRU balancing decisions.)

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: missing madvise functionality
  2007-04-04  9:15                 ` Hugh Dickins
  2007-04-04 14:55                   ` Rik van Riel
@ 2007-04-04 18:04                   ` Andrew Morton
  2007-04-04 18:08                     ` Rik van Riel
  2007-04-04 18:39                     ` Hugh Dickins
  1 sibling, 2 replies; 87+ messages in thread
From: Andrew Morton @ 2007-04-04 18:04 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Jakub Jelinek, Ulrich Drepper, Andi Kleen, Rik van Riel,
	Linux Kernel, linux-mm

On Wed, 4 Apr 2007 10:15:41 +0100 (BST) Hugh Dickins <hugh@veritas.com> wrote:

> On Tue, 3 Apr 2007, Andrew Morton wrote:
> > 
> > All of which indicates that if we can remove the down_write(mmap_sem) from
> > this glibc operation, things should get a lot better - there will be no
> > additional context switches at all.
> > 
> > And we can surely do that if all we're doing is looking up pageframes,
> > putting pages into fake-swapcache and moving them around on the page LRUs.
> > 
> > Hugh?  Sanity check?
> 
> Setting aside the fake-swapcache part, yes, Rik should be able to do what
> Ulrich wants (operating on ptes and pages) without down_write(mmap_sem):
> just needing down_read(mmap_sem) to keep the whole vma/pagetable structure
> stable, and page table lock (literal or per-page-table) for each contents.
> 
> (I didn't understand how Rik would achieve his point 5, _no_ lock
> contention while repeatedly re-marking these pages, but never mind.)
> 
> (Some mails in this thread overlook that we also use down_write(mmap_sem)
> to guard simple things like vma->vm_flags: of course that in itself could
> be manipulated with atomics, or spinlock; but like many of the vma fields,
> changing it goes hand in hand with the chance that we have to split vma,
> which does require the heavy-handed down_write(mmap_sem).  I expect that
> splitting those uses apart would be harder than first appears, and better
> to go for a more radical redesign - I don't know what.)
> 
> But you lose me with the fake-swapcache part of it: that came, I think,
> from your initial idea that it would be okay to refault on these ptes.
> Don't we all agree now that we'd prefer not to refault on those ptes,
> unless some memory pressure has actually decided to pull them out?
> (Hmm, yet more list balancing...)

The way in which we want to treat these pages is (I believe) to keep them
if there's not a lot of memory pressure, but to reclaim them "easily" if
there is some memory pressure.

A simple way to do that is to move them onto the inactive list.  But how do
we handle these pages when the vm scanner encounters them?

The treatment is identical to clean swapcache pages, with the sole
exception that they don't actually consume any swap space - hence the fake
swapcache entry thing.

There are other ways of doing it - I guess we could use a new page flag to
indicate that this is one-of-those-pages, and add new code to handle it in
all the right places.



One thing which we haven't sorted out with all this stuff: once the
application has marked an address range (and some pages) as
whatever-were-going-call-this-feature, how does the application undo that
change?  What effect will things like mremap, madvise and mlock have upon
these pages?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: missing madvise functionality
  2007-04-04 18:04                   ` Andrew Morton
@ 2007-04-04 18:08                     ` Rik van Riel
  2007-04-04 20:56                       ` Andrew Morton
  2007-04-04 18:39                     ` Hugh Dickins
  1 sibling, 1 reply; 87+ messages in thread
From: Rik van Riel @ 2007-04-04 18:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Jakub Jelinek, Ulrich Drepper, Andi Kleen,
	Linux Kernel, linux-mm

Andrew Morton wrote:

> There are other ways of doing it - I guess we could use a new page flag to
> indicate that this is one-of-those-pages, and add new code to handle it in
> all the right places.

That's what I did.  I'm currently working on the
zap_page_range() side of things.

> One thing which we haven't sorted out with all this stuff: once the
> application has marked an address range (and some pages) as
> whatever-were-going-call-this-feature, how does the application undo that
> change? 

It doesn't have to do anything.  Just access the page and the
MMU will mark it dirty/accessed and the VM will not reclaim
it.

> What effect will things like mremap, madvise and mlock have upon
> these pages?

Good point.  I had not thought about these.

Would you mind if I sent an initial proof of concept
patch that does not take these into account, before
we decide on what should happen in these cases? :)

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: missing madvise functionality
  2007-04-04 18:04                   ` Andrew Morton
  2007-04-04 18:08                     ` Rik van Riel
@ 2007-04-04 18:39                     ` Hugh Dickins
  1 sibling, 0 replies; 87+ messages in thread
From: Hugh Dickins @ 2007-04-04 18:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jakub Jelinek, Ulrich Drepper, Andi Kleen, Rik van Riel,
	Linux Kernel, linux-mm

On Wed, 4 Apr 2007, Andrew Morton wrote:
> 
> The treatment is identical to clean swapcache pages, with the sole
> exception that they don't actually consume any swap space - hence the fake
> swapcache entry thing.

I see, sneaking through try_to_unmap's anon PageSwapCache assumptions
as simply as possible - thanks.

(Coincidentally, Andrea pointed to precisely the same issue in the
no PAGE_ZERO thread, when we were toying with writable but clean.)

> One thing which we haven't sorted out with all this stuff: once the
> application has marked an address range (and some pages) as
> whatever-were-going-call-this-feature, how does the application undo
> that change?

By re-referencing the pages.  (Hmm, so an incorrect app which accesses
"free"d areas, will undo it: well, okay, nothing terrible about that.)

> What effect will things like mremap, madvise and mlock have upon
> these pages?

mlock will undo the state in its make_pages_present: I guess that
should happen in or near follow_page's mark_page_accessed.

mremap?  Other madvises?  Nothing much at all: mremap can move
them around, and the madvises do whatever they do - I don't notice
any problem in that direction, but it'll be easier when we have an
implementation to poke at.

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: missing madvise functionality
  2007-04-03 20:51           ` missing madvise functionality Andrew Morton
  2007-04-03 20:57             ` Ulrich Drepper
  2007-04-03 21:00             ` Rik van Riel
@ 2007-04-04 18:49             ` Anton Blanchard
  2 siblings, 0 replies; 87+ messages in thread
From: Anton Blanchard @ 2007-04-04 18:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ulrich Drepper, Andi Kleen, Rik van Riel, Linux Kernel,
	Jakub Jelinek, linux-mm, Hugh Dickins

Hi,

> Oh.  I was assuming that we'd want to unmap these pages from pagetables and
> mark then super-easily-reclaimable.  So a later touch would incur a minor
> fault.
> 
> But you think that we should leave them mapped into pagetables so no such
> fault occurs.

That would be very nice. The issues are not limited to threaded apps,
we have seen performance problems with single threaded HPC applications
that do a lot of large malloc/frees. It turns out the continual set up
and tear down of pagetables when malloc uses mmap/free is a problem. At
the moment the workaround is:

export MALLOC_MMAP_MAX_=0 MALLOC_TRIM_THRESHOLD_=-1

which forces glibc malloc to use brk instead of mmap/free. Of course brk
is good for keeping pagetables around but bad for keeping memory usage
down.

Anton

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: missing madvise functionality
  2007-04-04 13:09             ` William Lee Irwin III
  2007-04-04 13:38               ` William Lee Irwin III
@ 2007-04-04 18:51               ` Andrew Morton
  2007-04-05  4:14                 ` William Lee Irwin III
  1 sibling, 1 reply; 87+ messages in thread
From: Andrew Morton @ 2007-04-04 18:51 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Jakub Jelinek, Ulrich Drepper, Andi Kleen, Rik van Riel,
	Linux Kernel, linux-mm, Hugh Dickins

On Wed, 4 Apr 2007 06:09:18 -0700 William Lee Irwin III <wli@holomorphy.com> wrote:

> 
> On Tue, Apr 03, 2007 at 04:29:37PM -0400, Jakub Jelinek wrote:
> > void *
> > tf (void *arg)
> > {
> >   (void) arg;
> >   size_t ps = sysconf (_SC_PAGE_SIZE);
> >   void *p = mmap (NULL, 128 * ps, PROT_READ | PROT_WRITE,
> >                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> >   if (p == MAP_FAILED)
> >     exit (1);
> >   int i;
> 
> Oh dear.

what's all this about?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: missing madvise functionality
  2007-04-04 18:08                     ` Rik van Riel
@ 2007-04-04 20:56                       ` Andrew Morton
  0 siblings, 0 replies; 87+ messages in thread
From: Andrew Morton @ 2007-04-04 20:56 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Hugh Dickins, Jakub Jelinek, Ulrich Drepper, Andi Kleen,
	Linux Kernel, linux-mm

On Wed, 04 Apr 2007 14:08:47 -0400
Rik van Riel <riel@redhat.com> wrote:

> Andrew Morton wrote:
> 
> > There are other ways of doing it - I guess we could use a new page flag to
> > indicate that this is one-of-those-pages, and add new code to handle it in
> > all the right places.
> 
> That's what I did.  I'm currently working on the
> zap_page_range() side of things.

Let's try to avoid consuming another page flag if poss, please.  Perhaps
use PAGE_MAPPING_ANON's neighbouring bit?

> > One thing which we haven't sorted out with all this stuff: once the
> > application has marked an address range (and some pages) as
> > whatever-were-going-call-this-feature, how does the application undo that
> > change? 
> 
> It doesn't have to do anything.  Just access the page and the
> MMU will mark it dirty/accessed and the VM will not reclaim
> it.

um, OK.  I suspect it would be good to clear the page's
PageWhateverWereGoingToCallThisThing() state when this happens.  Otherwise
when the page gets clean again (ie: added to swapcache then written out)
then it will look awfully similar to one of these new types of pages and
things might get confusing.  We'll see.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* preemption and rwsems (was: Re: missing madvise functionality)
  2007-04-03 20:29           ` Jakub Jelinek
                               ` (2 preceding siblings ...)
  2007-04-04 13:09             ` William Lee Irwin III
@ 2007-04-04 23:00             ` Andrew Morton
  2007-04-05  7:31             ` missing madvise functionality Rik van Riel
  2007-04-05 12:48             ` preemption and rwsems (was: Re: missing madvise functionality) David Howells
  5 siblings, 0 replies; 87+ messages in thread
From: Andrew Morton @ 2007-04-04 23:00 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Ulrich Drepper, Andi Kleen, Rik van Riel, Linux Kernel, linux-mm,
	Hugh Dickins, Ingo Molnar

On Tue, 3 Apr 2007 16:29:37 -0400
Jakub Jelinek <jakub@redhat.com> wrote:

> #include <pthread.h>
> #include <stdlib.h>
> #include <sys/mman.h>
> #include <unistd.h>
> 
> void *
> tf (void *arg)
> {
>   (void) arg;
>   size_t ps = sysconf (_SC_PAGE_SIZE);
>   void *p = mmap (NULL, 128 * ps, PROT_READ | PROT_WRITE,
>                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>   if (p == MAP_FAILED)
>     exit (1);
>   int i;
>   for (i = 0; i < 100000; i++)
>     {
>       /* Pretend to use the buffer.  */
>       char *q, *r = (char *) p + 128 * ps;
>       size_t s;
>       for (q = (char *) p; q < r; q += ps)
>         *q = 1;
>       for (s = 0, q = (char *) p; q < r; q += ps)
>         s += *q;
>       /* Free it.  Replace this mmap with
>          madvise (p, 128 * ps, MADV_THROWAWAY) when implemented.  */
>       if (mmap (p, 128 * ps, PROT_NONE,
>                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0) != p)
>         exit (2);
>       /* And immediately malloc again.  This would then be deleted.  */
>       if (mprotect (p, 128 * ps, PROT_READ | PROT_WRITE))
>         exit (3);
>     }
>   return NULL;
> }
> 
> int
> main (void)
> {
>   pthread_t th[32];
>   int i;
>   for (i = 0; i < 32; i++)
>     if (pthread_create (&th[i], NULL, tf, NULL))
>       exit (4);
>   for (i = 0; i < 32; i++)
>     pthread_join (th[i], NULL);
>   return 0;
> }

This little test app is fun.

I run it all on a single CPU under `taskset -c 0' on the 8-way and it still
causes 160,000 context switches per second and takes 9.5 seconds (after
s/100000/1000).

The kernel has

# CONFIG_PREEMPT_NONE is not set
CONFIG_PREEMPT_VOLUNTARY=y
# CONFIG_PREEMPT is not set
# CONFIG_PREEMPT_BKL is not set

and when I switch that to

CONFIG_PREEMPT_NONE=y
# CONFIG_PREEMPT_VOLUNTARY is not set
# CONFIG_PREEMPT is not set
# CONFIG_PREEMPT_BKL is not set

the context switch rate falls to zilch and total runtime falls to 6.4
seconds.

Presumably the same problem will occur with CONFIG_PREEMPT_VOLUNTARY on
uniprocessor kernels.

<thinks>

What we effectively have is 32 threads on a single CPU all doing

	for (ever) {
		down_write()
		up_write()
		down_read()
		up_read();
	}

and rwsems are "fair".  So

  thread A                                     thread B

  down_write();

  cond_resched()
  ->schedule()

                                               down_read() -> blocks

  up_write()

  down_read()

  up_read()

  down_write() -> there's a reader: block

                                               down_read() -> succeeds

                                               up_read()

                                               down_write() -> there's another down_writer: block

  down_write() -> succeeds

  up_write()

  down_read() -> there's a down_writer: block

                                               down_write() succeeds

                                               up_write()

                                               down_read() -> succeeds

                                               up_read()

                                               down_write() -> there's a down_reader: block

  down_read() succeeds


ad nauseum.


If that cond_resched() was not there, none of this would ever happen - each
thread merrily chugs away doing its ups and downs until it expires its
timeslice.  Interesting, in a sad sort of way.



Setting CONFIG_PREEMPT_NONE doesn't appear to make any difference to
context switch rate or runtime when all eight CPUs are used, so this
phenomenon is unlikely to be involved in the mysql problem.

I wonder why a similar thing doesn't happen when more than one CPU is used.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: missing madvise functionality
  2007-04-04 15:25                     ` Hugh Dickins
@ 2007-04-05  1:44                       ` Nick Piggin
  0 siblings, 0 replies; 87+ messages in thread
From: Nick Piggin @ 2007-04-05  1:44 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Rik van Riel, Andrew Morton, Jakub Jelinek, Ulrich Drepper,
	Andi Kleen, Linux Kernel, linux-mm

Hugh Dickins wrote:
> On Wed, 4 Apr 2007, Rik van Riel wrote:
> 
>>Hugh Dickins wrote:
>>
>>
>>>(I didn't understand how Rik would achieve his point 5, _no_ lock
>>>contention while repeatedly re-marking these pages, but never mind.)
>>
>>The CPU marks them accessed&dirty when they are reused.
>>
>>The VM only moves the reused pages back to the active list
>>on memory pressure.  This means that when the system is
>>not under memory pressure, the same page can simply stay
>>PG_lazyfree for multiple malloc/free rounds.
> 
> 
> Sure, there's no need for repetitious locking at the LRU end of it;
> but you said "if the system has lots of free memory, pages can go
> through multiple free/malloc cycles while sitting on the dontneed
> list, very lazily with no lock contention".  I took that to mean,
> with userspace repeatedly madvising on the ranges they fall in,
> which will involve mmap_sem and ptl each time - just in order
> to check that no LRU movement is required each time.
> 
> (Of course, there's also the problem that we don't leave our
> systems with lots of free memory: some LRU balancing decisions.)

I don't agree this approach is the best one anyway. I'd rather
just the simple MADV_DONTNEED/MADV_DONEED.

Once you go through the trouble of protecting the memory and
flushing TLBs, unprotecting them afterwards and taking a trap
(even if it is a pure hardware trap), I doubt you've saved much.

You may have saved the cost of zeroing out the page, but that
has to be weighed against the fact that you have left a possibly
cache hot page sitting there to get cold, and your accesses to
initialise the malloced memory might have more cache misses.

If you just free the page, it goes onto a nice LIFO cache hot
list, and when you want to allocate another one, you'll probably
get a cache hot one.

The problem is down_write(mmap_sem) isn't it? We can and should
easily fix that problem now. If we subsequently want to look at
micro optimisations to avoid zeroing using MMU tricks, then we
have a good base to compare with.

-- 
SUSE Labs, Novell Inc.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: missing madvise functionality
  2007-04-04 11:54                           ` Eric Dumazet
@ 2007-04-05  2:01                             ` Nick Piggin
  2007-04-05  6:09                               ` Eric Dumazet
  0 siblings, 1 reply; 87+ messages in thread
From: Nick Piggin @ 2007-04-05  2:01 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Peter Zijlstra, Andrew Morton, Jakub Jelinek, Ulrich Drepper,
	Andi Kleen, Rik van Riel, Linux Kernel, linux-mm, Hugh Dickins

Eric Dumazet wrote:
> On Wed, 04 Apr 2007 20:05:54 +1000
> Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> 
>>>@@ -1638,7 +1652,7 @@ find_extend_vma(struct mm_struct * mm, u
>>> 	unsigned long start;
>>> 
>>> 	addr &= PAGE_MASK;
>>>-	vma = find_vma(mm,addr);
>>>+	vma = find_vma(mm,addr,&current->vmacache);
>>> 	if (!vma)
>>> 		return NULL;
>>> 	if (vma->vm_start <= addr)
>>
>>So now you can have current calling find_extend_vma on someone else's mm
>>but using their cache. So you're going to return current's vma, or current
>>is going to get one of mm's vmas in its cache :P
> 
> 
> This was not a working patch, just to throw the idea, since the answers I got showed I was not understood.
> 
> In this case, find_extend_vma() should of course have one struct vm_area_cache * argument, like find_vma()
> 
> One single cache on one mm is not scalable. oprofile badly hits it on a dual cpu config.

Oh, what sort of workload are you using to show this? The only reason that I
didn't submit my thread cache patches was that I didn't show a big enough
improvement.

-- 
SUSE Labs, Novell Inc.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: missing madvise functionality
  2007-04-04 18:51               ` Andrew Morton
@ 2007-04-05  4:14                 ` William Lee Irwin III
  0 siblings, 0 replies; 87+ messages in thread
From: William Lee Irwin III @ 2007-04-05  4:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jakub Jelinek, Ulrich Drepper, Andi Kleen, Rik van Riel,
	Linux Kernel, linux-mm, Hugh Dickins

On Wed, 4 Apr 2007 06:09:18 -0700 William Lee Irwin III <wli@holomorphy.com> wrote:
>> Oh dear.

On Wed, Apr 04, 2007 at 11:51:05AM -0700, Andrew Morton wrote:
> what's all this about?

I rewrote Jakub's testcase and included it as a MIME attachment.
Current working version inline below. Also at

	http://holomorphy.com/~wli/jakub.c

The basic idea was that I wanted a few more niceties, such as specifying
the number of iterations and other things of that nature on the cmdline.
I threw in a little code reorganization and error checking, too.


-- wli


#include <pthread.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <errno.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/time.h>
#include <sys/resource.h>

enum thread_return {
	tr_success	=  0,
	tr_mmap_init	= -1,
	tr_mmap_free	= -2,
	tr_mprotect	= -3,
	tr_madvise	= -4,
	tr_unknown	= -5,
	tr_munmap	= -6,
};

enum release_method {
	release_by_mmap		= 0,
	release_by_madvise	= 1,
	release_by_max		= 2,
};

struct thread_argument {
	size_t page_size;
	int iterations, pages_per_thread, nr_threads;
	enum release_method method;
};

static enum thread_return mmap_release(void *p, size_t n)
{
	void *q;

	q = mmap(p, n, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED, -1, 0);
	if (p != q) {
		perror("thread_function: mmap release failed");
		return tr_mmap_free;
	}
	if (mprotect(p, n, PROT_READ | PROT_WRITE)) {
		perror("thread_function: mprotect failed");
		return tr_mprotect;
	}
	return tr_success;
}

static enum thread_return madvise_release(void *p, size_t n)
{
	if (madvise(p, n, MADV_DONTNEED)) {
		perror("thread_function: madvise failed");
		return tr_madvise;
	}
	return tr_success;
}

static enum thread_return (*release_methods[])(void *, size_t) = {
	mmap_release,
	madvise_release,
};

static void *thread_function(void *__arg)
{
	char *p;
	int i;
	struct thread_argument *arg = __arg;
	size_t arena_size = arg->pages_per_thread * arg->page_size;

	p = (char *)mmap(NULL, arena_size,
				PROT_READ | PROT_WRITE,
				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED) {
		perror("thread_function: arena allocation failed");
		return (void *)tr_mmap_init;
	}
	for (i = 0; i < arg->iterations; i++) {
		size_t s;
		char *q, *r;
		enum thread_return ret;

		/* Pretend to use the buffer.  */
		r = p + arena_size;
		for (q = p; q < r; q += arg->page_size)
			*q = 1;
		for (s = 0, q = p; q < r; q += arg->page_size)
			s += *q;
		if (arg->method >= release_by_max) {
			perror("thread_function: "
				"unknown freeing method specified");
			return (void *)tr_unknown;
		}
		ret = (*release_methods[arg->method])(p, arena_size);
		if (ret != tr_success)
			return (void *)ret;
	}
	if (munmap(p, arena_size)) {
		perror("thread_function: munmap() failed");
		return (void *)tr_munmap;
	}
	return (void *)tr_success;
}

static int configure(struct thread_argument *arg, int argc, char *argv[])
{
	char optstring[] = "t:m:i:p:";
	int c, tmp, ret = 0;
	long n;

	n = sysconf(_SC_PAGE_SIZE);
	if (n < 0) {
		perror("configure: sysconf(_SC_PAGE_SIZE) failed");
		ret = -1;
	}
	arg->nr_threads = 32, 
	arg->page_size = (size_t)n;
	arg->method = release_by_mmap;
	arg->iterations = 100000;
	arg->pages_per_thread = 128;

	while ((c = getopt(argc, argv, optstring)) != -1) {
		switch (c) {
			case 't':
				if (sscanf(optarg, "%d", &tmp) == 1)
					arg->nr_threads = tmp;
				else {
					perror("configure: non-numeric thread count");
					ret = -1;
				}
				break;
			case 'm':
				if (!strcmp(optarg, "mmap"))
					arg->method = release_by_mmap;
				else if (!strcmp(optarg, "madvise"))
					arg->method = release_by_madvise;
				else {
					perror("configure: unrecognised release method");
					ret = -1;
				}
				break;
			case 'i':
				if (sscanf(optarg, "%d", &tmp) == 1)
					arg->iterations = tmp;
				else {
					perror("configure: non-numeric iteration count");
					ret = -1;
				}
				break;
			case 'p':
				if (sscanf(optarg, "%d", &tmp) == 1)
					arg->pages_per_thread = tmp;
				else {
					perror("configure: non-numeric pages per thread count");
					ret = -1;
				}
				break;
			default:
				perror("unrecognignized argument");
				ret = -1;
		}
	}
	if (arg->nr_threads <= 0) {
		perror("configure: zero or negative thread count");
		ret = -1;
	}
	if (arg->iterations < 0) {
		perror("configure: negative iteration count");
		ret = -1;
	}
	if (arg->pages_per_thread <= 0) {
		perror("configure: zero or negative arena size");
		ret = -1;
	}
	if (SIZE_MAX/arg->page_size < (size_t)arg->pages_per_thread) {
		perror("configure: arena size overflow");
		ret = -1;
	}
	return ret;
}

static unsigned long long timeval_to_usec(struct timeval *tv)
{
	return 1000000*tv->tv_sec + tv->tv_usec;
}

static unsigned long long elapsed_usec(struct timeval *tv1, struct timeval *tv2)
{
	return timeval_to_usec(tv2) - timeval_to_usec(tv1);
}

#define user_usec(ru)	timeval_to_usec(&(ru)->ru_utime)
#define sys_usec(ru)	timeval_to_usec(&(ru)->ru_stime)
#define user_sec(ru)	((user_usec(ru) % 60000000ULL)/1000000.0)
#define sys_sec(ru)	((sys_usec(ru) % 60000000ULL)/1000000.0)
#define elapsed_sec(tv1, tv2)						\
		((elapsed_usec(tv1, tv2) % 60000000ULL)/1000000.0)

#define user_min(ru)	((unsigned long)((user_usec(ru)/60000000ULL) % 60))
#define sys_min(ru)	((unsigned long)((sys_usec(ru)/60000000ULL) % 60))
#define elapsed_min(tv1, tv2)						\
		((unsigned long)((elapsed_usec(tv1, tv2)/60000000ULL) % 60))

#define user_hrs(ru)	((unsigned long)(user_usec(ru)/3600000000ULL))
#define sys_hrs(ru)	((unsigned long)(user_usec(ru)/3600000000ULL))
#define elapsed_hrs(tv1, tv2)						\
		((unsigned long)(elapsed_usec(tv1, tv2)/3600000000ULL))

int main(int argc, char *argv[])
{
	int i, ret = EXIT_SUCCESS;
	struct thread_argument arg;
	struct rusage ru;
	struct timeval start, finish;
	pthread_t *th;

	if (gettimeofday(&start, NULL)) {
		perror("main: initial gettimeofday failed");
		return EXIT_FAILURE;
	}
	if (configure(&arg, argc, argv))
		return EXIT_FAILURE;
	th = calloc(arg.nr_threads, sizeof(pthread_t));
	if (!th) {
		perror("main: calloc of thread array failed");
		return EXIT_FAILURE;
	}
	for (i = 0; i < arg.nr_threads; i++) {
		if (pthread_create(&th[i], NULL, thread_function, &arg)) {
			perror("main: pthread_create failed");
			break;
		}
	}
	for (--i; i >= 0; --i) {
		void *status;

		if (pthread_join(th[i], &status)) {
			perror("main: pthread_join failed");
			ret = EXIT_FAILURE;
		} else if (status != (void *)tr_success)
			ret = EXIT_FAILURE;
	}
	free(th);
	getrusage(RUSAGE_SELF, &ru);
	if (gettimeofday(&finish, NULL)) {
		perror("final gettimeofday failed");
		ret = EXIT_FAILURE;
	}
	if (printf("%lu:%.2lu:%05.2lf elapsed time\n"
		"%lu:%.2lu:%05.2lf user time\n"
		"%lu:%.2lu:%05.2lf system time\n"
		"%ld major faults\n"
		"%ld minor faults\n"
		"%ld voluntary context switches\n"
		"%ld involuntary context switches\n",
		elapsed_hrs(&start, &finish),
			elapsed_min(&start, &finish),
			elapsed_sec(&start, &finish),
		user_hrs(&ru), user_min(&ru), user_sec(&ru),
		sys_hrs(&ru), sys_min(&ru), sys_sec(&ru),
		ru.ru_majflt,
		ru.ru_minflt,
		ru.ru_nvcsw,
		ru.ru_nivcsw) < 0)
			ret = EXIT_FAILURE;
	return ret;
}

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: missing madvise functionality
  2007-04-04  8:47     ` Nick Piggin
@ 2007-04-05  4:23       ` Nick Piggin
  0 siblings, 0 replies; 87+ messages in thread
From: Nick Piggin @ 2007-04-05  4:23 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Ulrich Drepper, Rik van Riel, Andrew Morton, Linux Kernel,
	Linux Memory Management

[-- Attachment #1: Type: text/plain, Size: 1872 bytes --]

Nick Piggin wrote:
> Jakub Jelinek wrote:
> 
>> On Wed, Apr 04, 2007 at 05:46:12PM +1000, Nick Piggin wrote:
>>
>>> Does mmap(PROT_NONE) actually free the memory?
>>
>>
>>
>> Yes.
>>         /* Clear old maps */
>>         error = -ENOMEM;
>> munmap_back:
>>         vma = find_vma_prepare(mm, addr, &prev, &rb_link, &rb_parent);
>>         if (vma && vma->vm_start < addr + len) {
>>                 if (do_munmap(mm, addr, len))
>>                         return -ENOMEM;
>>                 goto munmap_back;
>>         }
> 
> 
> Thanks, I overlooked the mmap vs mprotect detail. So how are the subsequent
> access faults avoided?

AFAIKS, the faults are not avoided. Not for single page allocations, not
for multi-page allocations.

So what glibc currently does to allocate, use, then deallocate a page is
this:
   mprotect(PROT_READ|PROT_WRITE) -> down_write(mmap_sem)
   touch page -> page fault -> down_read(mmap_sem)
   mmap(PROT_NONE) -> down_write(mmap_sem)

What it could be doing is:
   touch page -> page fault -> down_read(mmap_sem)
   madvise(MADV_DONTNEED) -> down_read(mmap_sem)

So after my previously posted patch (attached again) to only take down_read
in madvise where possible...

With 2 threads, the attached test.c ends up doing about 140,000 context
switches per second with just 2 threads/2CPUs, takes a little over 2
million faults, and about 80 seconds to complete, when running the
old_test() function (ie. mprotect,touch,mmap).

When running new_test() (ie. touch,madvise), context switches stay well
under 100, it takes slightly fewer faults, and it completes in about 8
seconds.

With 1 thread, new_test() actually completes in under half the time as
well (4.55 vs 9.88 seconds). This result won't have been altered by my
madvise patch, because the down_write fastpath is no slower than down_read.

Any comments?

-- 
SUSE Labs, Novell Inc.

[-- Attachment #2: madv-mmap_sem.patch --]
[-- Type: text/plain, Size: 1305 bytes --]

Index: linux-2.6/mm/madvise.c
===================================================================
--- linux-2.6.orig/mm/madvise.c
+++ linux-2.6/mm/madvise.c
@@ -12,6 +12,25 @@
 #include <linux/hugetlb.h>
 
 /*
+ * Any behaviour which results in changes to the vma->vm_flags needs to
+ * take mmap_sem for writing. Others, which simply traverse vmas, need
+ * to only take it for reading.
+ */
+static int madvise_need_mmap_write(int behavior)
+{
+	switch (behavior) {
+	case MADV_DOFORK:
+	case MADV_DONTFORK:
+	case MADV_NORMAL:
+	case MADV_SEQUENTIAL:
+	case MADV_RANDOM:
+		return 1;
+	default:
+		return 0;
+	}
+}
+
+/*
  * We can potentially split a vm area into separate
  * areas, each area with its own behavior.
  */
@@ -264,7 +283,10 @@ asmlinkage long sys_madvise(unsigned lon
 	int error = -EINVAL;
 	size_t len;
 
-	down_write(&current->mm->mmap_sem);
+	if (madvise_need_mmap_write(behavior))
+		down_write(&current->mm->mmap_sem);
+	else
+		down_read(&current->mm->mmap_sem);
 
 	if (start & ~PAGE_MASK)
 		goto out;
@@ -323,6 +345,10 @@ asmlinkage long sys_madvise(unsigned lon
 		vma = prev->vm_next;
 	}
 out:
-	up_write(&current->mm->mmap_sem);
+	if (madvise_need_mmap_write(behavior))
+		up_write(&current->mm->mmap_sem);
+	else
+		up_read(&current->mm->mmap_sem);
+
 	return error;
 }

[-- Attachment #3: test.c --]
[-- Type: text/x-csrc, Size: 1868 bytes --]

#include <stdlib.h>
#include <stdio.h>
#include <sys/mman.h>
#include <pthread.h>

#define NR_THREADS	1
#define ITERS	1000000
#define HEAPSIZE	(4*1024)

static void *old_thread(void *heap)
{
	int i;

	for (i = 0; i < ITERS; i++) {
		char *mem = heap;
		if (mprotect(heap, HEAPSIZE, PROT_READ|PROT_WRITE) == -1)
			perror("mprotect"), exit(1);
		*mem = i;
		if (mmap(heap, HEAPSIZE, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED, -1, 0) == MAP_FAILED)
			perror("mmap"), exit(1);
	}

	return NULL;
}

static void old_test(void)
{
	void *heap;
	pthread_t pt[NR_THREADS];
	int i;

	heap = mmap(NULL, NR_THREADS*HEAPSIZE, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
	if (heap == MAP_FAILED)
		perror("mmap"), exit(1);

	for (i = 0; i < NR_THREADS; i++) {
		if (pthread_create(&pt[i], NULL, old_thread, heap + i*HEAPSIZE) == -1)
			perror("pthread_create"), exit(1);
	}
	for (i = 0; i < NR_THREADS; i++) {
		if (pthread_join(pt[i], NULL) == -1)
			perror("pthread_join"), exit(1);
	}

	if (munmap(heap, NR_THREADS*HEAPSIZE) == -1)
		perror("munmap"), exit(1);
}

static void *new_thread(void *heap)
{
	int i;

	for (i = 0; i < ITERS; i++) {
		char *mem = heap;
		*mem = i;
		if (madvise(heap, HEAPSIZE, MADV_DONTNEED) == -1)
			perror("madvise"), exit(1);
	}

	return NULL;
}

static void new_test(void)
{
	void *heap;
	pthread_t pt[NR_THREADS];
	int i;

	heap = mmap(NULL, HEAPSIZE, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
	if (heap == MAP_FAILED)
		perror("mmap"), exit(1);

	for (i = 0; i < NR_THREADS; i++) {
		if (pthread_create(&pt[i], NULL, new_thread, heap + i*HEAPSIZE) == -1)
			perror("pthread_create"), exit(1);
	}
	for (i = 0; i < NR_THREADS; i++) {
		if (pthread_join(pt[i], NULL) == -1)
			perror("pthread_join"), exit(1);
	}

	if (munmap(heap, HEAPSIZE) == -1)
		perror("munmap"), exit(1);
}

int main(void)
{
	old_test();

	exit(0);
}


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: missing madvise functionality
  2007-04-05  2:01                             ` Nick Piggin
@ 2007-04-05  6:09                               ` Eric Dumazet
  2007-04-05  6:19                                 ` Ulrich Drepper
  0 siblings, 1 reply; 87+ messages in thread
From: Eric Dumazet @ 2007-04-05  6:09 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Peter Zijlstra, Andrew Morton, Jakub Jelinek, Ulrich Drepper,
	Andi Kleen, Rik van Riel, Linux Kernel, linux-mm, Hugh Dickins

Nick Piggin a ecrit :
> Eric Dumazet wrote:
> >> This was not a working patch, just to throw the idea, since the
>> answers I got showed I was not understood.
>>
>> In this case, find_extend_vma() should of course have one struct 
>> vm_area_cache * argument, like find_vma()
>>
>> One single cache on one mm is not scalable. oprofile badly hits it on 
>> a dual cpu config.
> 
> Oh, what sort of workload are you using to show this? The only reason 
> that I
> didn't submit my thread cache patches was that I didn't show a big enough
> improvement.
> 

Database workload, where the user multi threaded app is constantly accessing 
GBytes of data, so L2 cache hit is very small. If you want to oprofile it, 
with say a CPU_CLK_UNHALTED:5000 event, then find_vma() is in the top 5.

Each time oprofile has an NMI, it calls find_vma(EIP/RIP) and blows out the 
target process cache (usually plugged on the data vma containing user land 
futexes). Event with private futexes, it will probably be plugged on the brk() 
vma.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: missing madvise functionality
  2007-04-05  6:09                               ` Eric Dumazet
@ 2007-04-05  6:19                                 ` Ulrich Drepper
  2007-04-05  6:54                                   ` Eric Dumazet
  0 siblings, 1 reply; 87+ messages in thread
From: Ulrich Drepper @ 2007-04-05  6:19 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Nick Piggin, Peter Zijlstra, Andrew Morton, Jakub Jelinek,
	Andi Kleen, Rik van Riel, Linux Kernel, linux-mm, Hugh Dickins

[-- Attachment #1: Type: text/plain, Size: 813 bytes --]

Eric Dumazet wrote:
> Database workload, where the user multi threaded app is constantly
> accessing GBytes of data, so L2 cache hit is very small. If you want to
> oprofile it, with say a CPU_CLK_UNHALTED:5000 event, then find_vma() is
> in the top 5.

We did have a workload with lots of Java and databases at some point
when many VMAs were the issue.  I brought this up here one, maybe two
years ago and I think Blaisorblade went on and looked into avoiding VMA
splits by having mprotect() not split VMAs and instead store the flags
in the page table somewhere.  I don't remember the details.

Nothing came out of this but if this is possible it would be yet another
way to avoid mmap_sem locking, right?

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 251 bytes --]

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: missing madvise functionality
  2007-04-05  6:19                                 ` Ulrich Drepper
@ 2007-04-05  6:54                                   ` Eric Dumazet
  0 siblings, 0 replies; 87+ messages in thread
From: Eric Dumazet @ 2007-04-05  6:54 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Nick Piggin, Peter Zijlstra, Andrew Morton, Jakub Jelinek,
	Andi Kleen, Rik van Riel, Linux Kernel, linux-mm, Hugh Dickins

Ulrich Drepper a A(C)crit :
> Eric Dumazet wrote:
>> Database workload, where the user multi threaded app is constantly
>> accessing GBytes of data, so L2 cache hit is very small. If you want to
>> oprofile it, with say a CPU_CLK_UNHALTED:5000 event, then find_vma() is
>> in the top 5.
> 
> We did have a workload with lots of Java and databases at some point
> when many VMAs were the issue.  I brought this up here one, maybe two
> years ago and I think Blaisorblade went on and looked into avoiding VMA
> splits by having mprotect() not split VMAs and instead store the flags
> in the page table somewhere.  I don't remember the details.
> 
> Nothing came out of this but if this is possible it would be yet another
> way to avoid mmap_sem locking, right?
> 

I was speaking about oprofile needs, that may interfere with target process 
needs, since oprofile calls find_vma() on the target process mm and thus zap 
its mmap_cache.

oprofile is yet another mmap_sem user, but also a mmap_cache destroyer.

We could at least have a separate cache, only for oprofile.

If done correctly we might avoid taking mmap_sem when the same vm_area_struct 
contains EIP/RIP snapshots.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: missing madvise functionality
  2007-04-03 20:29           ` Jakub Jelinek
                               ` (3 preceding siblings ...)
  2007-04-04 23:00             ` preemption and rwsems (was: Re: missing madvise functionality) Andrew Morton
@ 2007-04-05  7:31             ` Rik van Riel
  2007-04-05  7:39               ` Rik van Riel
                                 ` (3 more replies)
  2007-04-05 12:48             ` preemption and rwsems (was: Re: missing madvise functionality) David Howells
  5 siblings, 4 replies; 87+ messages in thread
From: Rik van Riel @ 2007-04-05  7:31 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Ulrich Drepper, Andrew Morton, Andi Kleen, Linux Kernel,
	linux-mm, Hugh Dickins

[-- Attachment #1: Type: text/plain, Size: 872 bytes --]

Jakub Jelinek wrote:

> My guess is that all the page zeroing is pretty expensive as well and
> takes significant time, but I haven't profiled it.

With the attached patch (Andrew, I'll change the details around
if you want - I just wanted something to test now), your test
case run time went down considerably.

I modified the test case to only run 1000 loops, so it would run
a bit faster on my system.  I also modified it to use MADV_DONTNEED
to zap the pages, instead of the mmap(PROT_NONE) thing you use.


MADV_DONTNEED, unpatched, 1000 loops

real    0m13.672s
user    0m1.217s
sys     0m45.712s


MADV_DONTNEED, with patch, 1000 loops

real    0m4.169s
user    0m2.033s
sys     0m3.224s


-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.

[-- Attachment #2: linux-2.6-madv_free.patch --]
[-- Type: text/x-patch, Size: 11297 bytes --]

--- linux-2.6.20.noarch/include/asm-alpha/mman.h.madvise	2007-04-04 16:44:50.000000000 -0400
+++ linux-2.6.20.noarch/include/asm-alpha/mman.h	2007-04-04 16:56:24.000000000 -0400
@@ -42,6 +42,7 @@
 #define MADV_WILLNEED	3		/* will need these pages */
 #define	MADV_SPACEAVAIL	5		/* ensure resources are available */
 #define MADV_DONTNEED	6		/* don't need these pages */
+#define MADV_FREE	7		/* don't need the pages or the data */
 
 /* common/generic parameters */
 #define MADV_REMOVE	9		/* remove these pages & resources */
--- linux-2.6.20.noarch/include/asm-generic/mman.h.madvise	2007-04-04 16:44:50.000000000 -0400
+++ linux-2.6.20.noarch/include/asm-generic/mman.h	2007-04-04 16:56:53.000000000 -0400
@@ -29,6 +29,7 @@
 #define MADV_SEQUENTIAL	2		/* expect sequential page references */
 #define MADV_WILLNEED	3		/* will need these pages */
 #define MADV_DONTNEED	4		/* don't need these pages */
+#define MADV_FREE	5		/* don't need the pages or the data */
 
 /* common parameters: try to keep these consistent across architectures */
 #define MADV_REMOVE	9		/* remove these pages & resources */
--- linux-2.6.20.noarch/include/asm-mips/mman.h.madvise	2007-04-04 16:44:50.000000000 -0400
+++ linux-2.6.20.noarch/include/asm-mips/mman.h	2007-04-04 16:58:02.000000000 -0400
@@ -65,6 +65,7 @@
 #define MADV_SEQUENTIAL	2		/* expect sequential page references */
 #define MADV_WILLNEED	3		/* will need these pages */
 #define MADV_DONTNEED	4		/* don't need these pages */
+#define MADV_FREE	5		/* don't need the pages or the data */
 
 /* common parameters: try to keep these consistent across architectures */
 #define MADV_REMOVE	9		/* remove these pages & resources */
--- linux-2.6.20.noarch/include/asm-parisc/mman.h.madvise	2007-04-04 16:44:50.000000000 -0400
+++ linux-2.6.20.noarch/include/asm-parisc/mman.h	2007-04-04 16:58:40.000000000 -0400
@@ -38,6 +38,7 @@
 #define MADV_SPACEAVAIL 5               /* insure that resources are reserved */
 #define MADV_VPS_PURGE  6               /* Purge pages from VM page cache */
 #define MADV_VPS_INHERIT 7              /* Inherit parents page size */
+#define MADV_FREE	8		/* don't need the pages or the data */
 
 /* common/generic parameters */
 #define MADV_REMOVE	9		/* remove these pages & resources */
--- linux-2.6.20.noarch/include/asm-xtensa/mman.h.madvise	2007-04-04 16:44:51.000000000 -0400
+++ linux-2.6.20.noarch/include/asm-xtensa/mman.h	2007-04-04 16:59:14.000000000 -0400
@@ -72,6 +72,7 @@
 #define MADV_SEQUENTIAL	2		/* expect sequential page references */
 #define MADV_WILLNEED	3		/* will need these pages */
 #define MADV_DONTNEED	4		/* don't need these pages */
+#define MADV_FREE	5		/* don't need the pages or the data */
 
 /* common parameters: try to keep these consistent across architectures */
 #define MADV_REMOVE	9		/* remove these pages & resources */
--- linux-2.6.20.noarch/include/linux/mm_inline.h.madvise	2007-04-03 22:53:25.000000000 -0400
+++ linux-2.6.20.noarch/include/linux/mm_inline.h	2007-04-04 22:19:24.000000000 -0400
@@ -13,6 +13,13 @@ add_page_to_inactive_list(struct zone *z
 }
 
 static inline void
+add_page_to_inactive_list_tail(struct zone *zone, struct page *page)
+{
+	list_add_tail(&page->lru, &zone->inactive_list);
+	__inc_zone_state(zone, NR_INACTIVE);
+}
+
+static inline void
 del_page_from_active_list(struct zone *zone, struct page *page)
 {
 	list_del(&page->lru);
--- linux-2.6.20.noarch/include/linux/mm.h.madvise	2007-04-03 22:53:25.000000000 -0400
+++ linux-2.6.20.noarch/include/linux/mm.h	2007-04-04 22:06:45.000000000 -0400
@@ -716,6 +716,7 @@ struct zap_details {
 	pgoff_t last_index;			/* Highest page->index to unmap */
 	spinlock_t *i_mmap_lock;		/* For unmap_mapping_range: */
 	unsigned long truncate_count;		/* Compare vm_truncate_count */
+	short madv_free;			/* MADV_FREE anonymous memory */
 };
 
 struct page *vm_normal_page(struct vm_area_struct *, unsigned long, pte_t);
--- linux-2.6.20.noarch/include/linux/page-flags.h.madvise	2007-04-03 22:54:58.000000000 -0400
+++ linux-2.6.20.noarch/include/linux/page-flags.h	2007-04-05 01:27:38.000000000 -0400
@@ -91,6 +91,8 @@
 #define PG_nosave_free		18	/* Used for system suspend/resume */
 #define PG_buddy		19	/* Page is free, on buddy lists */
 
+#define PG_lazyfree		20	/* MADV_FREE potential throwaway */
+
 /* PG_owner_priv_1 users should have descriptive aliases */
 #define PG_checked		PG_owner_priv_1 /* Used by some filesystems */
 
@@ -237,6 +239,11 @@ static inline void SetPageUptodate(struc
 #define ClearPageReclaim(page)	clear_bit(PG_reclaim, &(page)->flags)
 #define TestClearPageReclaim(page) test_and_clear_bit(PG_reclaim, &(page)->flags)
 
+#define PageLazyFree(page)	test_bit(PG_lazyfree, &(page)->flags)
+#define SetPageLazyFree(page)	set_bit(PG_lazyfree, &(page)->flags)
+#define ClearPageLazyFree(page)	clear_bit(PG_lazyfree, &(page)->flags)
+#define __ClearPageLazyFree(page) __clear_bit(PG_lazyfree, &(page)->flags)
+
 #define PageCompound(page)	test_bit(PG_compound, &(page)->flags)
 #define __SetPageCompound(page)	__set_bit(PG_compound, &(page)->flags)
 #define __ClearPageCompound(page) __clear_bit(PG_compound, &(page)->flags)
--- linux-2.6.20.noarch/include/linux/swap.h.madvise	2007-04-05 00:29:40.000000000 -0400
+++ linux-2.6.20.noarch/include/linux/swap.h	2007-04-04 23:35:00.000000000 -0400
@@ -181,6 +181,7 @@ extern unsigned int nr_free_pagecache_pa
 extern void FASTCALL(lru_cache_add(struct page *));
 extern void FASTCALL(lru_cache_add_active(struct page *));
 extern void FASTCALL(activate_page(struct page *));
+extern void FASTCALL(deactivate_tail_page(struct page *));
 extern void FASTCALL(mark_page_accessed(struct page *));
 extern void lru_add_drain(void);
 extern int lru_add_drain_all(void);
--- linux-2.6.20.noarch/mm/madvise.c.madvise	2007-04-03 21:53:47.000000000 -0400
+++ linux-2.6.20.noarch/mm/madvise.c	2007-04-04 23:48:34.000000000 -0400
@@ -142,8 +142,12 @@ static long madvise_dontneed(struct vm_a
 			.last_index = ULONG_MAX,
 		};
 		zap_page_range(vma, start, end - start, &details);
-	} else
-		zap_page_range(vma, start, end - start, NULL);
+	} else {
+		struct zap_details details = {
+			.madv_free = 1,
+		};
+		zap_page_range(vma, start, end - start, &details);
+	}
 	return 0;
 }
 
@@ -209,7 +213,9 @@ madvise_vma(struct vm_area_struct *vma, 
 		error = madvise_willneed(vma, prev, start, end);
 		break;
 
+	/* FIXME: POSIX says that MADV_DONTNEED cannot throw away data. */
 	case MADV_DONTNEED:
+	case MADV_FREE:
 		error = madvise_dontneed(vma, prev, start, end);
 		break;
 
--- linux-2.6.20.noarch/mm/memory.c.madvise	2007-04-03 21:53:47.000000000 -0400
+++ linux-2.6.20.noarch/mm/memory.c	2007-04-04 23:56:57.000000000 -0400
@@ -661,6 +661,26 @@ static unsigned long zap_pte_range(struc
 				    (page->index < details->first_index ||
 				     page->index > details->last_index))
 					continue;
+
+				/*
+				 * MADV_FREE is used to lazily recycle
+				 * anon memory.  The process no longer
+				 * needs the data and wants to avoid IO.
+				 */
+				if (details->madv_free && PageAnon(page)) {
+					if (unlikely(PageSwapCache(page)) &&
+					    !TestSetPageLocked(page)) {
+						remove_exclusive_swap_page(page);
+						unlock_page(page);
+					}
+					/* Optimize this... */
+					ptep_clear_flush_dirty(vma, addr, pte);
+					ptep_clear_flush_young(vma, addr, pte);
+					SetPageLazyFree(page);
+					if (PageActive(page))
+						deactivate_tail_page(page);
+					continue;
+				}
 			}
 			ptent = ptep_get_and_clear_full(mm, addr, pte,
 							tlb->fullmm);
@@ -689,7 +709,8 @@ static unsigned long zap_pte_range(struc
 		 * If details->check_mapping, we leave swap entries;
 		 * if details->nonlinear_vma, we leave file entries.
 		 */
-		if (unlikely(details))
+		if (unlikely(!details->check_mapping &&
+				!details->nonlinear_vma))
 			continue;
 		if (!pte_file(ptent))
 			free_swap_and_cache(pte_to_swp_entry(ptent));
@@ -755,7 +776,8 @@ static unsigned long unmap_page_range(st
 	pgd_t *pgd;
 	unsigned long next;
 
-	if (details && !details->check_mapping && !details->nonlinear_vma)
+	if (details && !details->check_mapping && !details->nonlinear_vma
+			&& !details->madv_free)
 		details = NULL;
 
 	BUG_ON(addr >= end);
--- linux-2.6.20.noarch/mm/page_alloc.c.madvise	2007-04-03 21:53:47.000000000 -0400
+++ linux-2.6.20.noarch/mm/page_alloc.c	2007-04-05 01:27:55.000000000 -0400
@@ -203,6 +203,7 @@ static void bad_page(struct page *page)
 			1 << PG_slab    |
 			1 << PG_swapcache |
 			1 << PG_writeback |
+			1 << PG_lazyfree |
 			1 << PG_buddy );
 	set_page_count(page, 0);
 	reset_page_mapcount(page);
@@ -442,6 +443,8 @@ static inline int free_pages_check(struc
 		bad_page(page);
 	if (PageDirty(page))
 		__ClearPageDirty(page);
+	if (PageLazyFree(page))
+		__ClearPageLazyFree(page);
 	/*
 	 * For now, we report if PG_reserved was found set, but do not
 	 * clear it, and do not free the page.  But we shall soon need
@@ -588,6 +591,7 @@ static int prep_new_page(struct page *pa
 			1 << PG_swapcache |
 			1 << PG_writeback |
 			1 << PG_reserved |
+			1 << PG_lazyfree |
 			1 << PG_buddy ))))
 		bad_page(page);
 
--- linux-2.6.20.noarch/mm/rmap.c.madvise	2007-04-03 21:53:47.000000000 -0400
+++ linux-2.6.20.noarch/mm/rmap.c	2007-04-04 23:53:29.000000000 -0400
@@ -656,7 +656,17 @@ static int try_to_unmap_one(struct page 
 	/* Update high watermark before we lower rss */
 	update_hiwater_rss(mm);
 
-	if (PageAnon(page)) {
+	/* MADV_FREE is used to lazily free memory from userspace. */
+	if (PageLazyFree(page) && !migration) {
+		/* There is new data in the page.  Reinstate it. */
+		if (unlikely(pte_dirty(pteval))) {
+			set_pte_at(mm, address, pte, pteval);
+			ret = SWAP_FAIL;
+			goto out_unmap;
+		}
+		/* Throw the page away. */
+		dec_mm_counter(mm, anon_rss);
+	} else if (PageAnon(page)) {
 		swp_entry_t entry = { .val = page_private(page) };
 
 		if (PageSwapCache(page)) {
--- linux-2.6.20.noarch/mm/swap.c.madvise	2007-04-03 21:53:47.000000000 -0400
+++ linux-2.6.20.noarch/mm/swap.c	2007-04-04 23:33:27.000000000 -0400
@@ -151,6 +151,20 @@ void fastcall activate_page(struct page 
 	spin_unlock_irq(&zone->lru_lock);
 }
 
+void fastcall deactivate_tail_page(struct page *page)
+{
+	struct zone *zone = page_zone(page);
+
+	spin_lock_irq(&zone->lru_lock);
+	if (PageLRU(page) && PageActive(page)) {
+		del_page_from_active_list(zone, page);
+		ClearPageActive(page);
+		add_page_to_inactive_list_tail(zone, page);
+		__count_vm_event(PGDEACTIVATE);
+	}
+	spin_unlock_irq(&zone->lru_lock);
+}
+
 /*
  * Mark a page as having seen activity.
  *
--- linux-2.6.20.noarch/mm/vmscan.c.madvise	2007-04-03 21:53:47.000000000 -0400
+++ linux-2.6.20.noarch/mm/vmscan.c	2007-04-04 03:34:56.000000000 -0400
@@ -473,6 +473,24 @@ static unsigned long shrink_page_list(st
 
 		sc->nr_scanned++;
 
+		/* 
+		 * MADV_DONTNEED pages get reclaimed lazily, unless the
+		 * process reuses it before we get to it.
+		 */
+		if (PageLazyFree(page)) {
+			switch (try_to_unmap(page, 0)) {
+			case SWAP_FAIL:
+				ClearPageLazyFree(page);
+				goto activate_locked;
+			case SWAP_AGAIN:
+				ClearPageLazyFree(page);
+				goto keep_locked;
+			case SWAP_SUCCESS:
+				ClearPageLazyFree(page);
+				goto free_it;
+			}
+		}
+
 		if (!sc->may_swap && page_mapped(page))
 			goto keep_locked;
 

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: missing madvise functionality
  2007-04-05  7:31             ` missing madvise functionality Rik van Riel
@ 2007-04-05  7:39               ` Rik van Riel
  2007-04-05  8:32                 ` Andrew Morton
  2007-04-05  8:08               ` Eric Dumazet
                                 ` (2 subsequent siblings)
  3 siblings, 1 reply; 87+ messages in thread
From: Rik van Riel @ 2007-04-05  7:39 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Jakub Jelinek, Ulrich Drepper, Andrew Morton, Andi Kleen,
	Linux Kernel, linux-mm, Hugh Dickins

Rik van Riel wrote:

> MADV_DONTNEED, unpatched, 1000 loops
> 
> real    0m13.672s
> user    0m1.217s
> sys     0m45.712s
> 
> 
> MADV_DONTNEED, with patch, 1000 loops
> 
> real    0m4.169s
> user    0m2.033s
> sys     0m3.224s

I just noticed something fun with these numbers.

Without the patch, the system (a quad core CPU) is 10% idle.

With the patch, it is 66% idle - presumably I need Nick's
mmap_sem patch.

However, despite being 66% idle, the test still runs over
3 times as fast!

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: missing madvise functionality
  2007-04-05  7:31             ` missing madvise functionality Rik van Riel
  2007-04-05  7:39               ` Rik van Riel
@ 2007-04-05  8:08               ` Eric Dumazet
  2007-04-05  8:31                 ` Rik van Riel
  2007-04-05  9:45               ` Jakub Jelinek
  2007-04-05 16:10               ` Ulrich Drepper
  3 siblings, 1 reply; 87+ messages in thread
From: Eric Dumazet @ 2007-04-05  8:08 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Jakub Jelinek, Ulrich Drepper, Andrew Morton, Andi Kleen,
	Linux Kernel, linux-mm, Hugh Dickins

On Thu, 05 Apr 2007 03:31:24 -0400
Rik van Riel <riel@redhat.com> wrote:

> Jakub Jelinek wrote:
> 
> > My guess is that all the page zeroing is pretty expensive as well and
> > takes significant time, but I haven't profiled it.
> 
> With the attached patch (Andrew, I'll change the details around
> if you want - I just wanted something to test now), your test
> case run time went down considerably.
> 
> I modified the test case to only run 1000 loops, so it would run
> a bit faster on my system.  I also modified it to use MADV_DONTNEED
> to zap the pages, instead of the mmap(PROT_NONE) thing you use.
> 

Interesting...

Could you please add this patch and see if it helps on your machine ?

[PATCH] VM : mm_struct's mmap_cache should be close to mmap_sem

Avoids cache line dirtying : The first cache line of mm_struct is/should_be mostly read.

In case find_vma() hits the cache, we dont need to access the begining of mm_struct.
Since we just dirtied mmap_sem, access to its cache line is free.

In case find_vma() misses the cache, we dont need to dirty the begining of mm_struct.


Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>

--- linux-2.6.21-rc5/include/linux/sched.h
+++ linux-2.6.21-rc5-ed/include/linux/sched.h
@@ -310,7 +310,6 @@ typedef unsigned long mm_counter_t;
 struct mm_struct {
 	struct vm_area_struct * mmap;		/* list of VMAs */
 	struct rb_root mm_rb;
-	struct vm_area_struct * mmap_cache;	/* last find_vma result */
 	unsigned long (*get_unmapped_area) (struct file *filp,
 				unsigned long addr, unsigned long len,
 				unsigned long pgoff, unsigned long flags);
@@ -324,6 +323,7 @@ struct mm_struct {
 	atomic_t mm_count;			/* How many references to "struct mm_struct" (users count as 1) */
 	int map_count;				/* number of VMAs */
 	struct rw_semaphore mmap_sem;
+	struct vm_area_struct * mmap_cache;	/* last find_vma result */
 	spinlock_t page_table_lock;		/* Protects page tables and some counters */
 
 	struct list_head mmlist;		/* List of maybe swapped mm's.  These are globally strung



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: missing madvise functionality
  2007-04-05  8:08               ` Eric Dumazet
@ 2007-04-05  8:31                 ` Rik van Riel
  2007-04-05  9:06                   ` Eric Dumazet
  0 siblings, 1 reply; 87+ messages in thread
From: Rik van Riel @ 2007-04-05  8:31 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jakub Jelinek, Ulrich Drepper, Andrew Morton, Andi Kleen,
	Linux Kernel, linux-mm, Hugh Dickins

Eric Dumazet wrote:

> Could you please add this patch and see if it helps on your machine ?
> 
> [PATCH] VM : mm_struct's mmap_cache should be close to mmap_sem
> 
> Avoids cache line dirtying

I could, but I already know it's not going to help much.

How do I know this?  I already have 66% idle time when running
with my patch (and without Nick Piggin's patch to take the
mmap_sem for reading only).  Interestingly, despite the idle
time increasing from 10% to 66%, throughput triples...

Saving some CPU time will probably only increase the idle time,
I see no reason your patch would reduce contention and increase
throughput.

I'm not saying your patch doesn't make sense - it probably does.
I just suspect it would have zero impact on this particular
scenario, because of the already huge idle time.

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: missing madvise functionality
  2007-04-05  7:39               ` Rik van Riel
@ 2007-04-05  8:32                 ` Andrew Morton
  2007-04-05 15:47                   ` Rik van Riel
  0 siblings, 1 reply; 87+ messages in thread
From: Andrew Morton @ 2007-04-05  8:32 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Jakub Jelinek, Ulrich Drepper, Andi Kleen, Linux Kernel,
	linux-mm, Hugh Dickins

On Thu, 05 Apr 2007 03:39:29 -0400 Rik van Riel <riel@redhat.com> wrote:

> Rik van Riel wrote:
> 
> > MADV_DONTNEED, unpatched, 1000 loops
> > 
> > real    0m13.672s
> > user    0m1.217s
> > sys     0m45.712s
> > 
> > 
> > MADV_DONTNEED, with patch, 1000 loops
> > 
> > real    0m4.169s
> > user    0m2.033s
> > sys     0m3.224s
> 
> I just noticed something fun with these numbers.
> 
> Without the patch, the system (a quad core CPU) is 10% idle.
> 
> With the patch, it is 66% idle - presumably I need Nick's
> mmap_sem patch.
> 
> However, despite being 66% idle, the test still runs over
> 3 times as fast!

Please quote the context switch rate when testing this stuff (I use vmstat 1).
I've seen it vary by a factor of 10,000 depending upon what's happening.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: missing madvise functionality
  2007-04-05  8:31                 ` Rik van Riel
@ 2007-04-05  9:06                   ` Eric Dumazet
  0 siblings, 0 replies; 87+ messages in thread
From: Eric Dumazet @ 2007-04-05  9:06 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Jakub Jelinek, Ulrich Drepper, Andrew Morton, Andi Kleen,
	Linux Kernel, linux-mm, Hugh Dickins

On Thu, 05 Apr 2007 04:31:55 -0400
Rik van Riel <riel@redhat.com> wrote:

> Eric Dumazet wrote:
> 
> > Could you please add this patch and see if it helps on your machine ?
> > 
> > [PATCH] VM : mm_struct's mmap_cache should be close to mmap_sem
> > 
> > Avoids cache line dirtying
> 
> I could, but I already know it's not going to help much.
> 
> How do I know this?  I already have 66% idle time when running
> with my patch (and without Nick Piggin's patch to take the
> mmap_sem for reading only).  Interestingly, despite the idle
> time increasing from 10% to 66%, throughput triples...
> 
> Saving some CPU time will probably only increase the idle time,
> I see no reason your patch would reduce contention and increase
> throughput.
> 
> I'm not saying your patch doesn't make sense - it probably does.
> I just suspect it would have zero impact on this particular
> scenario, because of the already huge idle time.

I know your cpus have idle time, that not the question.

But *when* your cpus are not idle, they might be slowed down because of cache line transferts between them. This patch doesnt reduce contention, just latencies (and overall performance)

I dont currently have SMP test machine, so I couldnt test it myself.

On x86_64, I am pretty sure the patch would help, because offsetof(mmap_sem) = 0x60
On i386, offsetof(mmap_sem)=0x34, so this patch wont help.

As you said, throughput can raise and idle time raise too.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: missing madvise functionality
  2007-04-05  7:31             ` missing madvise functionality Rik van Riel
  2007-04-05  7:39               ` Rik van Riel
  2007-04-05  8:08               ` Eric Dumazet
@ 2007-04-05  9:45               ` Jakub Jelinek
  2007-04-05 16:15                 ` Rik van Riel
  2007-04-05 16:10               ` Ulrich Drepper
  3 siblings, 1 reply; 87+ messages in thread
From: Jakub Jelinek @ 2007-04-05  9:45 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Ulrich Drepper, Andrew Morton, Andi Kleen, Linux Kernel,
	linux-mm, Hugh Dickins

On Thu, Apr 05, 2007 at 03:31:24AM -0400, Rik van Riel wrote:
> >My guess is that all the page zeroing is pretty expensive as well and
> >takes significant time, but I haven't profiled it.
> 
> With the attached patch (Andrew, I'll change the details around
> if you want - I just wanted something to test now), your test
> case run time went down considerably.

Thanks.

--- linux-2.6.20.noarch/mm/madvise.c.madvise	2007-04-03 21:53:47.000000000 -0400
+++ linux-2.6.20.noarch/mm/madvise.c	2007-04-04 23:48:34.000000000 -0400
@@ -142,8 +142,12 @@ static long madvise_dontneed(struct vm_a
 			.last_index = ULONG_MAX,
 		};
 		zap_page_range(vma, start, end - start, &details);
-	} else
-		zap_page_range(vma, start, end - start, NULL);
+	} else {
+		struct zap_details details = {
+			.madv_free = 1,
+		};
+		zap_page_range(vma, start, end - start, &details);
+	}
 	return 0;
 }
 
@@ -209,7 +213,9 @@ madvise_vma(struct vm_area_struct *vma, 
 		error = madvise_willneed(vma, prev, start, end);
 		break;
 
+	/* FIXME: POSIX says that MADV_DONTNEED cannot throw away data. */
 	case MADV_DONTNEED:
+	case MADV_FREE:
 		error = madvise_dontneed(vma, prev, start, end);
 		break;
 
I think you should only use the new behavior for madvise MADV_FREE, not for
MADV_DONTNEED.  The current MADV_DONTNEED behavior (which conflicts with
POSIX POSIX_MADV_DONTNEED, but that doesn't matter since what glibc
maps posix_madvise POSIX_MADV_DONTNEED in madvise call if anything doesn't
have to be MADV_DONTNEED, but can be anything else) is apparently documented
in Linux man pages:
       MADV_DONTNEED
              Do not expect access in the near future.  (For the time being, the application is finished with  the
              given  range, so the kernel can free resources associated with it.)  Subsequent accesses of pages in
              this range will succeed, but will result either in re-loading of the memory contents from the under-
              lying mapped file (see mmap()) or zero-fill-on-demand pages for mappings without an underlying file.
so it wouldn't surprise me if something relied on zero filling.
So IMHO madv_free in details should be only set if MADV_FREE.

Also, I think MADV_FREE shouldn't do anything at all (i.e. don't call
zap_page_range, but don't fail either) for shared or file backed vmas,
only for private anon memory it should do something.  After all, it
is just an optimization and it makes sense only for private anon mappings.

	Jakub

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: preemption and rwsems (was: Re: missing madvise functionality)
  2007-04-03 20:29           ` Jakub Jelinek
                               ` (4 preceding siblings ...)
  2007-04-05  7:31             ` missing madvise functionality Rik van Riel
@ 2007-04-05 12:48             ` David Howells
  2007-04-05 19:11               ` Ingo Molnar
  2007-04-05 19:27               ` Andrew Morton
  5 siblings, 2 replies; 87+ messages in thread
From: David Howells @ 2007-04-05 12:48 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jakub Jelinek, Ulrich Drepper, Andi Kleen, Rik van Riel,
	Linux Kernel, linux-mm, Hugh Dickins, Ingo Molnar

Andrew Morton <akpm@linux-foundation.org> wrote:

> 
> What we effectively have is 32 threads on a single CPU all doing
> 
> 	for (ever) {
> 		down_write()
> 		up_write()
> 		down_read()
> 		up_read();
> 	}

That's not quite so.  In that test program, most loops do two d/u writes and
then a slew of d/u reads with virtually no delay between them.  One of the
write-locked periods possibly lasts a relatively long time (it frees a bunch
of pages), and the read-locked periods last a potentially long time (have to
allocate a page).

Though, to be fair, as long as you've got way more than 16MB of RAM, the
memory stuff shouldn't take too long, but the locks will be being held for a
long time compared to the periods when you're not holding a lock of any sort.

> and rwsems are "fair".

If they weren't, you'd have to expect writer starvation in this situation.  As
it is, you're guaranteed progress on all threads.

> CONFIG_PREEMPT_VOLUNTARY=y

Which means the periods of lock-holding can be extended by preemption of the
lock holder(s), making the whole situation that much worse.  You have to
remember, you *can* be preempted whilst you hold a semaphore, rwsem or mutex.

> I run it all on a single CPU under `taskset -c 0' on the 8-way and it still
> causes 160,000 context switches per second and takes 9.5 seconds (after
> s/100000/1000).

How about if you have a UP kernel?  (ie: spinlocks -> nops)

> the context switch rate falls to zilch and total runtime falls to 6.4
> seconds.

I presume you don't mean literally zero.

> If that cond_resched() was not there, none of this would ever happen - each
> thread merrily chugs away doing its ups and downs until it expires its
> timeslice.  Interesting, in a sad sort of way.

The trouble is, I think, that you spend so much more time holding (or
attempting to hold) locks than not, and preemption just exacerbates things.

I suspect that the reason the problem doesn't seem so obvious when you've got
8 CPUs crunching their way through at once is probably because you can make
progress on several read loops simultaneously fast enough that the preemption
is lost in the things having to stop to give everyone writelocks.

But short of recording the lock sequence, I don't think there's anyway to find
out for sure.  printk probably won't cut it as a recording mechanism because
its overheads are too great.

David

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: missing madvise functionality
  2007-04-05  8:32                 ` Andrew Morton
@ 2007-04-05 15:47                   ` Rik van Riel
  0 siblings, 0 replies; 87+ messages in thread
From: Rik van Riel @ 2007-04-05 15:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jakub Jelinek, Ulrich Drepper, Andi Kleen, Linux Kernel,
	linux-mm, Hugh Dickins

Andrew Morton wrote:
> On Thu, 05 Apr 2007 03:39:29 -0400 Rik van Riel <riel@redhat.com> wrote:
> 
>> Rik van Riel wrote:
>>
>>> MADV_DONTNEED, unpatched, 1000 loops
>>>
>>> real    0m13.672s
>>> user    0m1.217s
>>> sys     0m45.712s
>>>
>>>
>>> MADV_DONTNEED, with patch, 1000 loops
>>>
>>> real    0m4.169s
>>> user    0m2.033s
>>> sys     0m3.224s
>> I just noticed something fun with these numbers.
>>
>> Without the patch, the system (a quad core CPU) is 10% idle.
>>
>> With the patch, it is 66% idle - presumably I need Nick's
>> mmap_sem patch.
>>
>> However, despite being 66% idle, the test still runs over
>> 3 times as fast!
> 
> Please quote the context switch rate when testing this stuff (I use vmstat 1).
> I've seen it vary by a factor of 10,000 depending upon what's happening.

About context switches 14000 per second.

I'll go compile in Nick's patch to see if that makes
things go faster.  I expect it will.

procs -----------memory---------- ---swap-- -----io---- --system-- 
-----cpu------
  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy 
id wa st
  1  0      0 965232 250024 370848    0    0     0     0 1026 13914 13 
21 67  0  0
  1  0      0 965232 250024 370848    0    0     0     0 1018 14654 12 
20 68  0  0
  1  0      0 965232 250024 370848    0    0     0     0 1023 14006 12 
21 67  0  0


-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: missing madvise functionality
  2007-04-05  7:31             ` missing madvise functionality Rik van Riel
                                 ` (2 preceding siblings ...)
  2007-04-05  9:45               ` Jakub Jelinek
@ 2007-04-05 16:10               ` Ulrich Drepper
  2007-04-06  2:28                 ` Nick Piggin
  3 siblings, 1 reply; 87+ messages in thread
From: Ulrich Drepper @ 2007-04-05 16:10 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Jakub Jelinek, Andrew Morton, Andi Kleen, Linux Kernel, linux-mm,
	Hugh Dickins

[-- Attachment #1: Type: text/plain, Size: 637 bytes --]

In case somebody wants to play around with Rik patch or another
madvise-based patch, I have x86-64 glibc binaries which can use it:

  http://people.redhat.com/drepper/rpms

These are based on the latest Fedora rawhide version.  They should work
on older systems, too, but you screw up your updates.  Use them only if
you know what you do.

By default madvise(MADV_DONTNEED) is used.  With the environment variable

  MALLOC_MADVISE

one can select a different hint.  The value of the envvar must be the
number of that other hint.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 251 bytes --]

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: missing madvise functionality
  2007-04-05  9:45               ` Jakub Jelinek
@ 2007-04-05 16:15                 ` Rik van Riel
  0 siblings, 0 replies; 87+ messages in thread
From: Rik van Riel @ 2007-04-05 16:15 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Ulrich Drepper, Andrew Morton, Andi Kleen, Linux Kernel,
	linux-mm, Hugh Dickins

Jakub Jelinek wrote:

> +	/* FIXME: POSIX says that MADV_DONTNEED cannot throw away data. */
>  	case MADV_DONTNEED:
> +	case MADV_FREE:
>  		error = madvise_dontneed(vma, prev, start, end);
>  		break;
>  
> I think you should only use the new behavior for madvise MADV_FREE, not for
> MADV_DONTNEED. 

I will.  However, we need to double-use MADV_DONTNEED in this
patch for now, so Ulrich's test glibc can be used easily :)

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: missing madvise functionality
  2007-04-04  7:46 ` Nick Piggin
  2007-04-04  8:04   ` Nick Piggin
  2007-04-04  8:20   ` Jakub Jelinek
@ 2007-04-05 18:38   ` Rik van Riel
  2007-04-05 21:07     ` Andrew Morton
  2007-04-06  1:28     ` Nick Piggin
  2 siblings, 2 replies; 87+ messages in thread
From: Rik van Riel @ 2007-04-05 18:38 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Ulrich Drepper, Andrew Morton, Linux Kernel, Jakub Jelinek,
	Linux Memory Management

Nick Piggin wrote:

> Oh, also: something like this patch would help out MADV_DONTNEED, as it
> means it can run concurrently with page faults. I think the locking will
> work (but needs forward porting).

Ironically, your patch decreases throughput on my quad core
test system, with Jakub's test case.

MADV_DONTNEED, my patch, 10000 loops  (14k context switches/second)

real    0m34.890s
user    0m17.256s
sys     0m29.797s

MADV_DONTNEED, my patch & your patch, 10000 loops  (50 context 
switches/second)

real    1m8.321s
user    0m20.840s
sys     1m55.677s

I suspect it's moving the contention onto the page table lock,
in zap_pte_range().  I guess that the thread private memory
areas must be living right next to each other, in the same
page table lock regions :)

For more real world workloads, like the MySQL sysbench one,
I still suspect that your patch would improve things.

Time to move back to debugging other stuff, though.

Andrew, it would be nice if our patches could cook in -mm
for a while.  Want me to change anything before submitting?

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: preemption and rwsems (was: Re: missing madvise functionality)
  2007-04-05 12:48             ` preemption and rwsems (was: Re: missing madvise functionality) David Howells
@ 2007-04-05 19:11               ` Ingo Molnar
  2007-04-05 20:37                 ` Andrew Morton
  2007-04-05 19:27               ` Andrew Morton
  1 sibling, 1 reply; 87+ messages in thread
From: Ingo Molnar @ 2007-04-05 19:11 UTC (permalink / raw)
  To: David Howells
  Cc: Andrew Morton, Jakub Jelinek, Ulrich Drepper, Andi Kleen,
	Rik van Riel, Linux Kernel, linux-mm, Hugh Dickins

* David Howells <dhowells@redhat.com> wrote:

> But short of recording the lock sequence, I don't think there's anyway 
> to find out for sure.  printk probably won't cut it as a recording 
> mechanism because its overheads are too great.

getting a good trace of it is easy: pick up the latest -rt kernel from:

	http://redhat.com/~mingo/realtime-preempt/

enable EVENT_TRACING in that kernel, run the workload 
and do:

	scripts/trace-it > to-ingo.txt

and send me the output. It will be large but interesting. That should 
get us a whole lot closer to what happens. A (much!) more finegrained 
result would be to also enable FUNCTION_TRACING and to do:

	echo 1 > /proc/sys/kernel/mcount_enabled

before running trace-it.

	Ingo

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: preemption and rwsems (was: Re: missing madvise functionality)
  2007-04-05 12:48             ` preemption and rwsems (was: Re: missing madvise functionality) David Howells
  2007-04-05 19:11               ` Ingo Molnar
@ 2007-04-05 19:27               ` Andrew Morton
  1 sibling, 0 replies; 87+ messages in thread
From: Andrew Morton @ 2007-04-05 19:27 UTC (permalink / raw)
  To: David Howells
  Cc: Jakub Jelinek, Ulrich Drepper, Andi Kleen, Rik van Riel,
	Linux Kernel, linux-mm, Hugh Dickins, Ingo Molnar

On Thu, 05 Apr 2007 13:48:58 +0100
David Howells <dhowells@redhat.com> wrote:

> Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> > 
> > What we effectively have is 32 threads on a single CPU all doing
> > 
> > 	for (ever) {
> > 		down_write()
> > 		up_write()
> > 		down_read()
> > 		up_read();
> > 	}
> 
> That's not quite so.  In that test program, most loops do two d/u writes and
> then a slew of d/u reads with virtually no delay between them.  One of the
> write-locked periods possibly lasts a relatively long time (it frees a bunch
> of pages), and the read-locked periods last a potentially long time (have to
> allocate a page).

Whatever.  I think it is still the case that the queueing behaviour of
rwsems causes us to get into this abababababab scenario.  And a single,
sole, solitary cond_resched() is sufficient to trigger the whole process
happening, and once it has started, it is sustained.

> If they weren't, you'd have to expect writer starvation in this situation.  As
> it is, you're guaranteed progress on all threads.
> 
> > CONFIG_PREEMPT_VOLUNTARY=y
> 
> Which means the periods of lock-holding can be extended by preemption of the
> lock holder(s), making the whole situation that much worse.  You have to
> remember, you *can* be preempted whilst you hold a semaphore, rwsem or mutex.

Of course - the same thing happens with CONFIG_PREEMPT=y.

> > I run it all on a single CPU under `taskset -c 0' on the 8-way and it still
> > causes 160,000 context switches per second and takes 9.5 seconds (after
> > s/100000/1000).
> 
> How about if you have a UP kernel?  (ie: spinlocks -> nops)

dunno.

> > the context switch rate falls to zilch and total runtime falls to 6.4
> > seconds.
> 
> I presume you don't mean literally zero.

I do.  At least, I was unable to discern any increase in the context-switch
column in the `vmstat 1' output.

> > If that cond_resched() was not there, none of this would ever happen - each
> > thread merrily chugs away doing its ups and downs until it expires its
> > timeslice.  Interesting, in a sad sort of way.
> 
> The trouble is, I think, that you spend so much more time holding (or
> attempting to hold) locks than not, and preemption just exacerbates things.

No.  Preemption *triggers* things.  We're talking about an increase in
context switch rate by a factor of at least 10,000.  Something changed in a
fundamental way.

> I suspect that the reason the problem doesn't seem so obvious when you've got
> 8 CPUs crunching their way through at once is probably because you can make
> progress on several read loops simultaneously fast enough that the preemption
> is lost in the things having to stop to give everyone writelocks.

The context switch rate is enormous on SMP on all kernel configs.  Perhaps
a better way of looking at it is to observe that the special case of a
single processor running a non-preemptible kernel simply got lucky.

> But short of recording the lock sequence, I don't think there's anyway to find
> out for sure.  printk probably won't cut it as a recording mechanism because
> its overheads are too great.

I think any code sequence which does

	for ( ; ; ) {
		down_write()
		up_write()
		down_read()
		up_read()
	}

is vulnerable to the artifact which I described.


I don't think we can (or should) do anything about it at the lock
implementation level.  It's more a matter of being aware of the possible
failure modes of rwsems, and being more careful to avoid that situation in
the code which uses rwsems.  And, of course, being careful about when and
where we use rwsems as opposed to other types of locks.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: preemption and rwsems (was: Re: missing madvise functionality)
  2007-04-05 19:11               ` Ingo Molnar
@ 2007-04-05 20:37                 ` Andrew Morton
  2007-04-06  9:08                   ` Ingo Molnar
  0 siblings, 1 reply; 87+ messages in thread
From: Andrew Morton @ 2007-04-05 20:37 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: David Howells, Jakub Jelinek, Ulrich Drepper, Andi Kleen,
	Rik van Riel, Linux Kernel, linux-mm, Hugh Dickins

On Thu, 5 Apr 2007 21:11:29 +0200
Ingo Molnar <mingo@elte.hu> wrote:

> 
> * David Howells <dhowells@redhat.com> wrote:
> 
> > But short of recording the lock sequence, I don't think there's anyway 
> > to find out for sure.  printk probably won't cut it as a recording 
> > mechanism because its overheads are too great.
> 
> getting a good trace of it is easy: pick up the latest -rt kernel from:
> 
> 	http://redhat.com/~mingo/realtime-preempt/
> 
> enable EVENT_TRACING in that kernel, run the workload 
> and do:
> 
> 	scripts/trace-it > to-ingo.txt
> 
> and send me the output.

Did that - no output was generated.  config at
http://userweb.kernel.org/~akpm/config-akpm2.txt

> It will be large but interesting. That should 
> get us a whole lot closer to what happens. A (much!) more finegrained 
> result would be to also enable FUNCTION_TRACING and to do:
> 
> 	echo 1 > /proc/sys/kernel/mcount_enabled
> 
> before running trace-it.

Did that - still no output.

I did get an interesting dmesg spew:
http://userweb.kernel.org/~akpm/dmesg-akpm2.txt

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: missing madvise functionality
  2007-04-05 18:38   ` Rik van Riel
@ 2007-04-05 21:07     ` Andrew Morton
  2007-04-05 21:39       ` Rik van Riel
  2007-04-06  1:28     ` Nick Piggin
  1 sibling, 1 reply; 87+ messages in thread
From: Andrew Morton @ 2007-04-05 21:07 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Nick Piggin, Ulrich Drepper, Linux Kernel, Jakub Jelinek,
	Linux Memory Management

On Thu, 05 Apr 2007 14:38:30 -0400
Rik van Riel <riel@redhat.com> wrote:

> Nick Piggin wrote:
> 
> > Oh, also: something like this patch would help out MADV_DONTNEED, as it
> > means it can run concurrently with page faults. I think the locking will
> > work (but needs forward porting).
> 
> Ironically, your patch decreases throughput on my quad core
> test system, with Jakub's test case.
> 
> MADV_DONTNEED, my patch, 10000 loops  (14k context switches/second)
> 
> real    0m34.890s
> user    0m17.256s
> sys     0m29.797s
> 
> 
> MADV_DONTNEED, my patch & your patch, 10000 loops  (50 context 
> switches/second)
> 
> real    1m8.321s
> user    0m20.840s
> sys     1m55.677s
> 
> I suspect it's moving the contention onto the page table lock,
> in zap_pte_range().  I guess that the thread private memory
> areas must be living right next to each other, in the same
> page table lock regions :)

Remember that we have two different ways of doing that locking:


#if NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS
/*
 * We tuck a spinlock to guard each pagetable page into its struct page,
 * at page->private, with BUILD_BUG_ON to make sure that this will not
 * overflow into the next struct page (as it might with DEBUG_SPINLOCK).
 * When freeing, reset page->mapping so free_pages_check won't complain.
 */
#define __pte_lockptr(page)	&((page)->ptl)
#define pte_lock_init(_page)	do {					\
	spin_lock_init(__pte_lockptr(_page));				\
} while (0)
#define pte_lock_deinit(page)	((page)->mapping = NULL)
#define pte_lockptr(mm, pmd)	({(void)(mm); __pte_lockptr(pmd_page(*(pmd)));})
#else
/*
 * We use mm->page_table_lock to guard all pagetable pages of the mm.
 */
#define pte_lock_init(page)	do {} while (0)
#define pte_lock_deinit(page)	do {} while (0)
#define pte_lockptr(mm, pmd)	({(void)(pmd); &(mm)->page_table_lock;})
#endif /* NR_CPUS < CONFIG_SPLIT_PTLOCK_CPUS */


I wonder which way you're using, and whether using the other way changes
things.


> For more real world workloads, like the MySQL sysbench one,
> I still suspect that your patch would improve things.
> 
> Time to move back to debugging other stuff, though.
> 
> Andrew, it would be nice if our patches could cook in -mm
> for a while.  Want me to change anything before submitting?

umm.  I took a quick squint at a patch from you this morning and it looked
OK to me.  Please send the finalish thing when it is fully baked and
performance-tested in the various regions of operation, thanks.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: missing madvise functionality
  2007-04-05 21:07     ` Andrew Morton
@ 2007-04-05 21:39       ` Rik van Riel
  0 siblings, 0 replies; 87+ messages in thread
From: Rik van Riel @ 2007-04-05 21:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nick Piggin, Ulrich Drepper, Linux Kernel, Jakub Jelinek,
	Linux Memory Management

Andrew Morton wrote:

> #if NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS

> I wonder which way you're using, and whether using the other way changes
> things.

I'm using the default Fedora config file, which has
NR_CPUS defined to 64 and CONFIG_SPLIT_PTLOCK_CPUS
to 4, so I am using the split locks.

However, I suspect that each 512kB malloced area
will share one page table lock with 4 others, so
some contention is to be expected.

>> For more real world workloads, like the MySQL sysbench one,
>> I still suspect that your patch would improve things.
>>
>> Time to move back to debugging other stuff, though.
>>
>> Andrew, it would be nice if our patches could cook in -mm
>> for a while.  Want me to change anything before submitting?
> 
> umm.  I took a quick squint at a patch from you this morning and it looked
> OK to me.  Please send the finalish thing when it is fully baked and
> performance-tested in the various regions of operation, thanks.

Will do.

Ulrich has a test version of glibc available that
uses MADV_DONTNEED for free(3), that should test
this thing nicely.

I'll run some tests with that when I get the
time, hopefully next week.

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: missing madvise functionality
  2007-04-05 18:38   ` Rik van Riel
  2007-04-05 21:07     ` Andrew Morton
@ 2007-04-06  1:28     ` Nick Piggin
  1 sibling, 0 replies; 87+ messages in thread
From: Nick Piggin @ 2007-04-06  1:28 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Ulrich Drepper, Andrew Morton, Linux Kernel, Jakub Jelinek,
	Linux Memory Management

Rik van Riel wrote:
> Nick Piggin wrote:
> 
>> Oh, also: something like this patch would help out MADV_DONTNEED, as it
>> means it can run concurrently with page faults. I think the locking will
>> work (but needs forward porting).
> 
> 
> Ironically, your patch decreases throughput on my quad core
> test system, with Jakub's test case.
> 
> MADV_DONTNEED, my patch, 10000 loops  (14k context switches/second)
> 
> real    0m34.890s
> user    0m17.256s
> sys     0m29.797s
> 
> 
> MADV_DONTNEED, my patch & your patch, 10000 loops  (50 context 
> switches/second)
> 
> real    1m8.321s
> user    0m20.840s
> sys     1m55.677s
> 
> I suspect it's moving the contention onto the page table lock,
> in zap_pte_range().  I guess that the thread private memory
> areas must be living right next to each other, in the same
> page table lock regions :)
> 
> For more real world workloads, like the MySQL sysbench one,
> I still suspect that your patch would improve things.

I think it definitely would, because the app will be wanting to
do other things with mmap_sem as well (like futexes *grumble*).

Also, the test case is allocating and freeing 512K chunks, which
I think would be on the high side of typical.

You have 32 threads for 4 CPUs, so then it would actually make
sense to context switch on mmap_sem write lock rather than spin
on ptl. But the kernel doesn't know that.

Testing with a small chunk size or thread == CPUs I think would
show a swing toward my patch.

-- 
SUSE Labs, Novell Inc.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: missing madvise functionality
  2007-04-05 16:10               ` Ulrich Drepper
@ 2007-04-06  2:28                 ` Nick Piggin
  2007-04-06  2:52                   ` Ulrich Drepper
  0 siblings, 1 reply; 87+ messages in thread
From: Nick Piggin @ 2007-04-06  2:28 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Rik van Riel, Jakub Jelinek, Andrew Morton, Andi Kleen,
	Linux Kernel, linux-mm, Hugh Dickins

Ulrich Drepper wrote:
> In case somebody wants to play around with Rik patch or another
> madvise-based patch, I have x86-64 glibc binaries which can use it:
> 
>   http://people.redhat.com/drepper/rpms
> 
> These are based on the latest Fedora rawhide version.  They should work
> on older systems, too, but you screw up your updates.  Use them only if
> you know what you do.
> 
> By default madvise(MADV_DONTNEED) is used.  With the environment variable

Cool. According to my thinking, madvise(MADV_DONTNEED) even in today's
kernels using down_write(mmap_sem) for MADV_DONTNEED is better than
mmap/mprotect, which have more fundamental locking requirements, more
overhead and no benefits (except debugging, I suppose).

MADV_DONTNEED is twice as fast in single threaded performance, and an
order of magnitude faster for multiple threads, when MADV_DONTNEED only
takes mmap_sem for read.

Do you plan to include this change in general glibc releases? Maybe it
will make google malloc obsolete? ;) (I don't suppose you'd be able to
get any tests done, Andrew?)

-- 
SUSE Labs, Novell Inc.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: missing madvise functionality
  2007-04-06  2:28                 ` Nick Piggin
@ 2007-04-06  2:52                   ` Ulrich Drepper
  2007-04-06  2:59                     ` Nick Piggin
  0 siblings, 1 reply; 87+ messages in thread
From: Ulrich Drepper @ 2007-04-06  2:52 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Rik van Riel, Jakub Jelinek, Andrew Morton, Andi Kleen,
	Linux Kernel, linux-mm, Hugh Dickins

[-- Attachment #1: Type: text/plain, Size: 981 bytes --]

Nick Piggin wrote:
> Cool. According to my thinking, madvise(MADV_DONTNEED) even in today's
> kernels using down_write(mmap_sem) for MADV_DONTNEED is better than
> mmap/mprotect, which have more fundamental locking requirements, more
> overhead and no benefits (except debugging, I suppose).

It's a tiny bit faster, see

  http://people.redhat.com/drepper/dontneed.png

I just ran it once so the graph is not smooth.  This is on a UP dual
core machine.  Maybe tomorrow I'll turn on the big 4p machine.

I would have to see dramatically different results on the big machine to
make me change the libc code.  The reason is that there is a big drawback.

So far, when we allocate a new arena, we allocate address space with
PROT_NONE and only when we need memory the protection is changed to
PROT_READ|PROT_WRITE.  This is the advantage of catching wild pointer
accesses.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 251 bytes --]

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: missing madvise functionality
  2007-04-06  2:52                   ` Ulrich Drepper
@ 2007-04-06  2:59                     ` Nick Piggin
  0 siblings, 0 replies; 87+ messages in thread
From: Nick Piggin @ 2007-04-06  2:59 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Rik van Riel, Jakub Jelinek, Andrew Morton, Andi Kleen,
	Linux Kernel, linux-mm, Hugh Dickins

Ulrich Drepper wrote:
> Nick Piggin wrote:
> 
>>Cool. According to my thinking, madvise(MADV_DONTNEED) even in today's
>>kernels using down_write(mmap_sem) for MADV_DONTNEED is better than
>>mmap/mprotect, which have more fundamental locking requirements, more
>>overhead and no benefits (except debugging, I suppose).
> 
> 
> It's a tiny bit faster, see
> 
>   http://people.redhat.com/drepper/dontneed.png
> 
> I just ran it once so the graph is not smooth.  This is on a UP dual
> core machine.  Maybe tomorrow I'll turn on the big 4p machine.

Hmm, I saw an improvement, but that was just on a raw syscall test
with a single page chunk. Real-world use I guess will get progressively
less dramatic as other overheads start being introduced.

Multi-thread performance probably won't get a whole lot better (it does
eliminate 1 down_write(mmap_sem), but one remains) until you use my
madvise patch.


> I would have to see dramatically different results on the big machine to
> make me change the libc code.  The reason is that there is a big drawback.
> 
> So far, when we allocate a new arena, we allocate address space with
> PROT_NONE and only when we need memory the protection is changed to
> PROT_READ|PROT_WRITE.  This is the advantage of catching wild pointer
> accesses.

Sure, yes. And I guess you'd always want to keep that options around as
a debugging aid.

-- 
SUSE Labs, Novell Inc.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: preemption and rwsems (was: Re: missing madvise functionality)
  2007-04-05 20:37                 ` Andrew Morton
@ 2007-04-06  9:08                   ` Ingo Molnar
  2007-04-06 19:30                     ` Andrew Morton
  0 siblings, 1 reply; 87+ messages in thread
From: Ingo Molnar @ 2007-04-06  9:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Howells, Jakub Jelinek, Ulrich Drepper, Andi Kleen,
	Rik van Riel, Linux Kernel, linux-mm, Hugh Dickins

[-- Attachment #1: Type: text/plain, Size: 1274 bytes --]

* Andrew Morton <akpm@linux-foundation.org> wrote:

> > getting a good trace of it is easy: pick up the latest -rt kernel 
> > from:
> > 
> > 	http://redhat.com/~mingo/realtime-preempt/
> > 
> > enable EVENT_TRACING in that kernel, run the workload and do:
> > 
> > 	scripts/trace-it > to-ingo.txt
> > 
> > and send me the output.
> 
> Did that - no output was generated.  config at
> http://userweb.kernel.org/~akpm/config-akpm2.txt

sorry, i forgot to mention that you should turn off 
CONFIG_WAKEUP_TIMING.

i've attached an updated version of trace-it.c, which will turn this off 
itself, using a sysctl. I also made WAKEUP_TIMING default-off.

> I did get an interesting dmesg spew:
> http://userweb.kernel.org/~akpm/dmesg-akpm2.txt

yeah, it's stack footprint measurement/instrumentation. It's 
particularly effective at tracking the worst-case stack footprint if you 
have FUNCTION_TRACING enabled - because in that case the kernel measures 
the stack's size at every function entry point. It does a maximum search 
so after bootup (in search of the 'largest' stack frame) so it's a bit 
verbose, but gets alot rarer later on. If it bothers you then disable:

  CONFIG_DEBUG_STACKOVERFLOW=y

it could interfere with getting a quality scheduling trace anyway.

	Ingo

[-- Attachment #2: trace-it.c --]
[-- Type: text/plain, Size: 2734 bytes --]

/*
 * Copyright (C) 2005, Ingo Molnar <mingo@redhat.com>
 *
 * user-triggered tracing.
 *
 * The -rt kernel has a built-in kernel tracer, which will trace
 * all kernel function calls (and a couple of special events as well),
 * by using a build-time gcc feature that instruments all kernel
 * functions.
 *
 * The tracer is highly automated for a number of latency tracing purposes,
 * but it can also be switched into 'user-triggered' mode, which is a
 * half-automatic tracing mode where userspace apps start and stop the
 * tracer. This file shows a dumb example how to turn user-triggered
 * tracing on, and how to start/stop tracing. Note that if you do
 * multiple start/stop sequences, the kernel will do a maximum search
 * over their latencies, and will keep the trace of the largest latency
 * in /proc/latency_trace. The maximums are also reported to the kernel
 * log. (but can also be read from /proc/sys/kernel/preempt_max_latency)
 *
 * For the tracer to be activated, turn on CONFIG_EVENT_TRACING
 * in the .config, rebuild the kernel and boot into it. The trace will
 * get _alot_ more verbose if you also turn on CONFIG_FUNCTION_TRACING,
 * every kernel function call will be put into the trace. Note that
 * CONFIG_FUNCTION_TRACING has significant runtime overhead, so you dont
 * want to use it for performance testing :)
 */

#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <sys/wait.h>
#include <sys/prctl.h>
#include <linux/unistd.h>

int main (int argc, char **argv)
{
	int ret;

	if (getuid() != 0) {
		fprintf(stderr, "needs to run as root.\n");
		exit(1);
	}
	ret = system("cat /proc/sys/kernel/mcount_enabled >/dev/null 2>/dev/null");
	if (ret) {
		fprintf(stderr, "CONFIG_LATENCY_TRACING not enabled?\n");
		exit(1);
	}
	system("echo 1 > /proc/sys/kernel/trace_user_triggered");
	system("[ -e /proc/sys/kernel/wakeup_timing ] && echo 0 > /proc/sys/kernel/wakeup_timing");
	system("echo 1 > /proc/sys/kernel/trace_enabled");
	system("echo 1 > /proc/sys/kernel/mcount_enabled");
	system("echo 0 > /proc/sys/kernel/trace_freerunning");
	system("echo 0 > /proc/sys/kernel/trace_print_on_crash");
	system("echo 0 > /proc/sys/kernel/trace_verbose");
	system("echo 0 > /proc/sys/kernel/preempt_thresh 2>/dev/null");
	system("echo 0 > /proc/sys/kernel/preempt_max_latency 2>/dev/null");

	// start tracing
	if (prctl(0, 1)) {
		fprintf(stderr, "trace-it: couldnt start tracing!\n");
		return 1;
	}
	usleep(1000000);
	if (prctl(0, 0)) {
		fprintf(stderr, "trace-it: couldnt stop tracing!\n");
		return 1;
	}

	system("echo 0 > /proc/sys/kernel/trace_user_triggered");
	system("echo 0 > /proc/sys/kernel/trace_enabled");
	system("cat /proc/latency_trace");

	return 0;
}

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: preemption and rwsems (was: Re: missing madvise functionality)
  2007-04-06  9:08                   ` Ingo Molnar
@ 2007-04-06 19:30                     ` Andrew Morton
  2007-04-06 19:40                       ` Ingo Molnar
  0 siblings, 1 reply; 87+ messages in thread
From: Andrew Morton @ 2007-04-06 19:30 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: David Howells, Jakub Jelinek, Ulrich Drepper, Andi Kleen,
	Rik van Riel, Linux Kernel, linux-mm, Hugh Dickins

On Fri, 6 Apr 2007 11:08:22 +0200
Ingo Molnar <mingo@elte.hu> wrote:

> * Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> > > getting a good trace of it is easy: pick up the latest -rt kernel 
> > > from:
> > > 
> > > 	http://redhat.com/~mingo/realtime-preempt/
> > > 
> > > enable EVENT_TRACING in that kernel, run the workload and do:
> > > 
> > > 	scripts/trace-it > to-ingo.txt
> > > 
> > > and send me the output.
> > 
> > Did that - no output was generated.  config at
> > http://userweb.kernel.org/~akpm/config-akpm2.txt
> 
> sorry, i forgot to mention that you should turn off 
> CONFIG_WAKEUP_TIMING.
> 
> i've attached an updated version of trace-it.c, which will turn this off 
> itself, using a sysctl. I also made WAKEUP_TIMING default-off.

ok.  http://userweb.kernel.org/~akpm/to-ingo.txt is the trace of

	taskset -c 0 ./jakubs-test-app

while the system was doing the 150,000 context switches/sec.

It isn't very interesting.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: preemption and rwsems (was: Re: missing madvise functionality)
  2007-04-06 19:30                     ` Andrew Morton
@ 2007-04-06 19:40                       ` Ingo Molnar
  0 siblings, 0 replies; 87+ messages in thread
From: Ingo Molnar @ 2007-04-06 19:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Howells, Jakub Jelinek, Ulrich Drepper, Andi Kleen,
	Rik van Riel, Linux Kernel, linux-mm, Hugh Dickins

* Andrew Morton <akpm@linux-foundation.org> wrote:

> > i've attached an updated version of trace-it.c, which will turn this 
> > off itself, using a sysctl. I also made WAKEUP_TIMING default-off.
> 
> ok.  http://userweb.kernel.org/~akpm/to-ingo.txt is the trace of
> 
> 	taskset -c 0 ./jakubs-test-app
> 
> while the system was doing the 150,000 context switches/sec.
> 
> It isn't very interesting.

this shows an idle CPU#7: you should taskset -c 0 trace-it too - it only 
traces the current CPU by default. (there's the 
/proc/sys/kernel/trace_all_cpus flag to trace all cpus, but in this case 
we really want the trace of CPU#0)

	Ingo

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 87+ messages in thread

end of thread, other threads:[~2007-04-06 19:40 UTC | newest]

Thread overview: 87+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <46128051.9000609@redhat.com>
     [not found] ` <p73648dz5oa.fsf@bingen.suse.de>
     [not found]   ` <46128CC2.9090809@redhat.com>
     [not found]     ` <20070403172841.GB23689@one.firstfloor.org>
2007-04-03 19:59       ` missing madvise functionality Andrew Morton
2007-04-03 20:09         ` Andi Kleen
2007-04-03 20:17         ` Ulrich Drepper
2007-04-03 20:29           ` Jakub Jelinek
2007-04-03 20:38             ` Rik van Riel
2007-04-03 21:49             ` Andrew Morton
2007-04-03 23:01               ` Eric Dumazet
2007-04-04  2:22                 ` Nick Piggin
2007-04-04  5:41                   ` Eric Dumazet
2007-04-04  6:09                     ` [patches] threaded vma patches (was Re: missing madvise functionality) Nick Piggin
2007-04-04  6:26                       ` Andrew Morton
2007-04-04  6:38                         ` Nick Piggin
2007-04-04  6:42                       ` Ulrich Drepper
2007-04-04  6:44                         ` Nick Piggin
2007-04-04  6:50                         ` Eric Dumazet
2007-04-04  6:54                           ` Ulrich Drepper
2007-04-04  7:33                             ` Eric Dumazet
2007-04-04  8:25                   ` missing madvise functionality Peter Zijlstra
2007-04-04  8:55                     ` Nick Piggin
2007-04-04  9:12                       ` William Lee Irwin III
2007-04-04  9:23                         ` Nick Piggin
2007-04-04  9:34                       ` Eric Dumazet
2007-04-04  9:45                         ` Nick Piggin
2007-04-04 10:05                         ` Nick Piggin
2007-04-04 11:54                           ` Eric Dumazet
2007-04-05  2:01                             ` Nick Piggin
2007-04-05  6:09                               ` Eric Dumazet
2007-04-05  6:19                                 ` Ulrich Drepper
2007-04-05  6:54                                   ` Eric Dumazet
2007-04-03 23:02               ` Andrew Morton
2007-04-04  9:15                 ` Hugh Dickins
2007-04-04 14:55                   ` Rik van Riel
2007-04-04 15:25                     ` Hugh Dickins
2007-04-05  1:44                       ` Nick Piggin
2007-04-04 18:04                   ` Andrew Morton
2007-04-04 18:08                     ` Rik van Riel
2007-04-04 20:56                       ` Andrew Morton
2007-04-04 18:39                     ` Hugh Dickins
2007-04-03 23:44               ` Andrew Morton
2007-04-04 13:09             ` William Lee Irwin III
2007-04-04 13:38               ` William Lee Irwin III
2007-04-04 18:51               ` Andrew Morton
2007-04-05  4:14                 ` William Lee Irwin III
2007-04-04 23:00             ` preemption and rwsems (was: Re: missing madvise functionality) Andrew Morton
2007-04-05  7:31             ` missing madvise functionality Rik van Riel
2007-04-05  7:39               ` Rik van Riel
2007-04-05  8:32                 ` Andrew Morton
2007-04-05 15:47                   ` Rik van Riel
2007-04-05  8:08               ` Eric Dumazet
2007-04-05  8:31                 ` Rik van Riel
2007-04-05  9:06                   ` Eric Dumazet
2007-04-05  9:45               ` Jakub Jelinek
2007-04-05 16:15                 ` Rik van Riel
2007-04-05 16:10               ` Ulrich Drepper
2007-04-06  2:28                 ` Nick Piggin
2007-04-06  2:52                   ` Ulrich Drepper
2007-04-06  2:59                     ` Nick Piggin
2007-04-05 12:48             ` preemption and rwsems (was: Re: missing madvise functionality) David Howells
2007-04-05 19:11               ` Ingo Molnar
2007-04-05 20:37                 ` Andrew Morton
2007-04-06  9:08                   ` Ingo Molnar
2007-04-06 19:30                     ` Andrew Morton
2007-04-06 19:40                       ` Ingo Molnar
2007-04-05 19:27               ` Andrew Morton
2007-04-03 20:51           ` missing madvise functionality Andrew Morton
2007-04-03 20:57             ` Ulrich Drepper
2007-04-03 21:00             ` Rik van Riel
2007-04-03 21:10               ` Eric Dumazet
2007-04-03 21:12                 ` Jörn Engel
2007-04-03 21:15                 ` Rik van Riel
2007-04-03 21:30                   ` Eric Dumazet
2007-04-03 21:22                 ` Jeremy Fitzhardinge
2007-04-03 21:29                   ` Rik van Riel
2007-04-03 21:46                 ` Ulrich Drepper
2007-04-03 22:51                   ` Andi Kleen
2007-04-03 23:07                     ` Ulrich Drepper
2007-04-03 21:16               ` Andrew Morton
2007-04-04 18:49             ` Anton Blanchard
2007-04-04  7:46 ` Nick Piggin
2007-04-04  8:04   ` Nick Piggin
2007-04-04  8:20   ` Jakub Jelinek
2007-04-04  8:47     ` Nick Piggin
2007-04-05  4:23       ` Nick Piggin
2007-04-05 18:38   ` Rik van Riel
2007-04-05 21:07     ` Andrew Morton
2007-04-05 21:39       ` Rik van Riel
2007-04-06  1:28     ` Nick Piggin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox