linux-mm.kvack.org archive mirror
  • * Re: missing madvise functionality
           [not found] <46128051.9000609@redhat.com>
           [not found] ` <p73648dz5oa.fsf@bingen.suse.de>
    @ 2007-04-04  7:46 ` Nick Piggin
      2007-04-04  8:04   ` Nick Piggin
                         ` (2 more replies)
      1 sibling, 3 replies; 87+ messages in thread
    From: Nick Piggin @ 2007-04-04  7:46 UTC (permalink / raw)
      To: Ulrich Drepper
      Cc: Rik van Riel, Andrew Morton, Linux Kernel, Jakub Jelinek,
    	Linux Memory Management
    
    [-- Attachment #1: Type: text/plain, Size: 2527 bytes --]
    
    Ulrich Drepper wrote:
    > People might remember the thread about mysql not scaling and pointing
    > the finger quite happily at glibc.  Well, the situation is not like that.
    > 
    > The problem is glibc has to work around kernel limitations.  If the
    > malloc implementation detects that a large chunk of previously allocated
    > memory is now free and unused it wants to return the memory to the
    > system.  What we currently have to do is this:
    > 
    >   to free:      mmap(PROT_NONE) over the area
    >   to reuse:     mprotect(PROT_READ|PROT_WRITE)
    > 
    > Yep, that's expensive, both operations need to get locks preventing
    > other threads from doing the same.
    > 
    > Some people were quick to suggest that we simply avoid the freeing in
    > many situations (that's what the patch submitted by Yanmin Zhang
    > basically does).  That's no solution.  One of the very good properties
    > of the current allocator is that it does not use much memory.
    
    Does mmap(PROT_NONE) actually free the memory?
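
For reference, the free/reuse sequence quoted above can be sketched in userspace roughly as follows. This is a minimal sketch, not glibc's actual code: the helper names range_release(), range_reuse() and release_reuse_roundtrip() are hypothetical, and it assumes Linux and a private anonymous mapping. Because MAP_FIXED replaces the existing mapping with a fresh one, the old pages are dropped and the range reads back as zeroes once it is made accessible again.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: "free" a page-aligned range by mapping fresh
 * PROT_NONE anonymous memory over it.  MAP_FIXED replaces the old
 * mapping, so the pages that backed it are dropped. */
static int range_release(void *addr, size_t len)
{
    void *p = mmap(addr, len, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
    return p == MAP_FAILED ? -1 : 0;
}

/* Hypothetical helper: make the range usable again. */
static int range_reuse(void *addr, size_t len)
{
    return mprotect(addr, len, PROT_READ | PROT_WRITE);
}

/* Returns 1 if the range reads back as zeroes after release + reuse,
 * 0 if old content survived, -1 on error. */
static int release_reuse_roundtrip(void)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    size_t step = (size_t)page;
    size_t len = 16 * step;

    unsigned char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;

    memset(buf, 0xAB, len);              /* touch every page */

    if (range_release(buf, len) != 0 || range_reuse(buf, len) != 0) {
        munmap(buf, len);
        return -1;
    }

    int zeroed = 1;
    for (size_t i = 0; i < len; i += step)
        if (buf[i] != 0)                 /* each read faults in a fresh page */
            zeroed = 0;

    munmap(buf, len);
    return zeroed;
}
```

Both calls modify the VMA, which is why each one needs the address-space lock held for writing.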
    
    
    > A solution for this problem is a madvise() operation with the following
    > property:
    > 
    >   - the content of the address range can be discarded
    > 
    >   - if an access to a page in the range happens in the future it must
    >     succeed.  The old page content can be provided or a new, empty page
    >     can be provided
    > 
    > That's it.  The current MADV_DONTNEED doesn't cut it because it zaps the
    > pages, causing *all* future reuses to create page faults.  This is what
    > I guess happens in the mysql test case where the pages were unused and
    > freed but then almost immediately reused.  The page faults erased all
    > the benefits of using one mprotect() call vs a pair of mmap()/mprotect()
    > calls.
    
    Two questions.
    
    In the case of pages being unused then almost immediately reused, why is
    it a bad solution to avoid freeing? Is it that you want to avoid
    heuristics because in some cases they could fail and end up using memory?
    
    Secondly, why is MADV_DONTNEED bad? How much more expensive is a page fault
    than a syscall (including the cost of the TLB fill for the memory access
    after the syscall, of course)?
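
To make the comparison concrete, here is a small sketch of what MADV_DONTNEED does to a private anonymous mapping on Linux: the pages are zapped, the mapping itself stays in place, and the next access succeeds by faulting in a zero-filled page. The function name dontneed_roundtrip() is hypothetical, and the sketch only shows the semantics; it does not measure the relative cost of the fault and the syscall.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns 1 if every page reads back as zero after MADV_DONTNEED,
 * 0 if old content survived, -1 on error. */
static int dontneed_roundtrip(void)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    size_t step = (size_t)page;
    size_t len = 16 * step;

    unsigned char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;

    memset(buf, 0xAB, len);              /* touch every page */

    /* Discard the contents; protections and the VMA are unchanged. */
    if (madvise(buf, len, MADV_DONTNEED) != 0) {
        munmap(buf, len);
        return -1;
    }

    int zeroed = 1;
    for (size_t i = 0; i < len; i += step)
        if (buf[i] != 0)                 /* one page fault per page here */
            zeroed = 0;

    buf[0] = 1;                          /* still writable, no mprotect() needed */

    munmap(buf, len);
    return zeroed;
}
```

Only one syscall is needed to release the range and none to reuse it; the price is one page fault per page on the next touch.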
    
    Zapping the pages puts them on a nice LIFO cache-hot list of pages that
    can be quickly used when the next fault comes in, or used for any other
    allocation in the kernel. Putting them on some sort of reclaim list seems
    a bit pointless.
    
    Oh, also: something like this patch would help out MADV_DONTNEED, as it
    means it can run concurrently with page faults. I think the locking will
    work (but needs forward porting).
    
    -- 
    SUSE Labs, Novell Inc.
    
    [-- Attachment #2: madv-mmap_sem.patch --]
    [-- Type: text/plain, Size: 1305 bytes --]
    
    Index: linux-2.6/mm/madvise.c
    ===================================================================
    --- linux-2.6.orig/mm/madvise.c
    +++ linux-2.6/mm/madvise.c
    @@ -12,6 +12,25 @@
     #include <linux/hugetlb.h>
     
     /*
    + * Any behaviour which results in changes to the vma->vm_flags needs to
    + * take mmap_sem for writing. Others, which simply traverse vmas, need
    + * to only take it for reading.
    + */
    +static int madvise_need_mmap_write(int behavior)
    +{
    +	switch (behavior) {
    +	case MADV_DOFORK:
    +	case MADV_DONTFORK:
    +	case MADV_NORMAL:
    +	case MADV_SEQUENTIAL:
    +	case MADV_RANDOM:
    +		return 1;
    +	default:
    +		return 0;
    +	}
    +}
    +
    +/*
      * We can potentially split a vm area into separate
      * areas, each area with its own behavior.
      */
    @@ -264,7 +283,10 @@ asmlinkage long sys_madvise(unsigned lon
     	int error = -EINVAL;
     	size_t len;
     
    -	down_write(&current->mm->mmap_sem);
    +	if (madvise_need_mmap_write(behavior))
    +		down_write(&current->mm->mmap_sem);
    +	else
    +		down_read(&current->mm->mmap_sem);
     
     	if (start & ~PAGE_MASK)
     		goto out;
    @@ -323,6 +345,10 @@ asmlinkage long sys_madvise(unsigned lon
     		vma = prev->vm_next;
     	}
     out:
    -	up_write(&current->mm->mmap_sem);
    +	if (madvise_need_mmap_write(behavior))
    +		up_write(&current->mm->mmap_sem);
    +	else
    +		up_read(&current->mm->mmap_sem);
    +
     	return error;
     }
    

  • end of thread, other threads:[~2007-04-06 19:40 UTC | newest]
    
    Thread overview: 87+ messages
         [not found] <46128051.9000609@redhat.com>
         [not found] ` <p73648dz5oa.fsf@bingen.suse.de>
         [not found]   ` <46128CC2.9090809@redhat.com>
         [not found]     ` <20070403172841.GB23689@one.firstfloor.org>
    2007-04-03 19:59       ` missing madvise functionality Andrew Morton
    2007-04-03 20:09         ` Andi Kleen
    2007-04-03 20:17         ` Ulrich Drepper
    2007-04-03 20:29           ` Jakub Jelinek
    2007-04-03 20:38             ` Rik van Riel
    2007-04-03 21:49             ` Andrew Morton
    2007-04-03 23:01               ` Eric Dumazet
    2007-04-04  2:22                 ` Nick Piggin
    2007-04-04  5:41                   ` Eric Dumazet
    2007-04-04  6:09                     ` [patches] threaded vma patches (was Re: missing madvise functionality) Nick Piggin
    2007-04-04  6:26                       ` Andrew Morton
    2007-04-04  6:38                         ` Nick Piggin
    2007-04-04  6:42                       ` Ulrich Drepper
    2007-04-04  6:44                         ` Nick Piggin
    2007-04-04  6:50                         ` Eric Dumazet
    2007-04-04  6:54                           ` Ulrich Drepper
    2007-04-04  7:33                             ` Eric Dumazet
    2007-04-04  8:25                   ` missing madvise functionality Peter Zijlstra
    2007-04-04  8:55                     ` Nick Piggin
    2007-04-04  9:12                       ` William Lee Irwin III
    2007-04-04  9:23                         ` Nick Piggin
    2007-04-04  9:34                       ` Eric Dumazet
    2007-04-04  9:45                         ` Nick Piggin
    2007-04-04 10:05                         ` Nick Piggin
    2007-04-04 11:54                           ` Eric Dumazet
    2007-04-05  2:01                             ` Nick Piggin
    2007-04-05  6:09                               ` Eric Dumazet
    2007-04-05  6:19                                 ` Ulrich Drepper
    2007-04-05  6:54                                   ` Eric Dumazet
    2007-04-03 23:02               ` Andrew Morton
    2007-04-04  9:15                 ` Hugh Dickins
    2007-04-04 14:55                   ` Rik van Riel
    2007-04-04 15:25                     ` Hugh Dickins
    2007-04-05  1:44                       ` Nick Piggin
    2007-04-04 18:04                   ` Andrew Morton
    2007-04-04 18:08                     ` Rik van Riel
    2007-04-04 20:56                       ` Andrew Morton
    2007-04-04 18:39                     ` Hugh Dickins
    2007-04-03 23:44               ` Andrew Morton
    2007-04-04 13:09             ` William Lee Irwin III
    2007-04-04 13:38               ` William Lee Irwin III
    2007-04-04 18:51               ` Andrew Morton
    2007-04-05  4:14                 ` William Lee Irwin III
    2007-04-04 23:00             ` preemption and rwsems (was: Re: missing madvise functionality) Andrew Morton
    2007-04-05  7:31             ` missing madvise functionality Rik van Riel
    2007-04-05  7:39               ` Rik van Riel
    2007-04-05  8:32                 ` Andrew Morton
    2007-04-05 15:47                   ` Rik van Riel
    2007-04-05  8:08               ` Eric Dumazet
    2007-04-05  8:31                 ` Rik van Riel
    2007-04-05  9:06                   ` Eric Dumazet
    2007-04-05  9:45               ` Jakub Jelinek
    2007-04-05 16:15                 ` Rik van Riel
    2007-04-05 16:10               ` Ulrich Drepper
    2007-04-06  2:28                 ` Nick Piggin
    2007-04-06  2:52                   ` Ulrich Drepper
    2007-04-06  2:59                     ` Nick Piggin
    2007-04-05 12:48             ` preemption and rwsems (was: Re: missing madvise functionality) David Howells
    2007-04-05 19:11               ` Ingo Molnar
    2007-04-05 20:37                 ` Andrew Morton
    2007-04-06  9:08                   ` Ingo Molnar
    2007-04-06 19:30                     ` Andrew Morton
    2007-04-06 19:40                       ` Ingo Molnar
    2007-04-05 19:27               ` Andrew Morton
    2007-04-03 20:51           ` missing madvise functionality Andrew Morton
    2007-04-03 20:57             ` Ulrich Drepper
    2007-04-03 21:00             ` Rik van Riel
    2007-04-03 21:10               ` Eric Dumazet
    2007-04-03 21:12                 ` Jörn Engel
    2007-04-03 21:15                 ` Rik van Riel
    2007-04-03 21:30                   ` Eric Dumazet
    2007-04-03 21:22                 ` Jeremy Fitzhardinge
    2007-04-03 21:29                   ` Rik van Riel
    2007-04-03 21:46                 ` Ulrich Drepper
    2007-04-03 22:51                   ` Andi Kleen
    2007-04-03 23:07                     ` Ulrich Drepper
    2007-04-03 21:16               ` Andrew Morton
    2007-04-04 18:49             ` Anton Blanchard
    2007-04-04  7:46 ` Nick Piggin
    2007-04-04  8:04   ` Nick Piggin
    2007-04-04  8:20   ` Jakub Jelinek
    2007-04-04  8:47     ` Nick Piggin
    2007-04-05  4:23       ` Nick Piggin
    2007-04-05 18:38   ` Rik van Riel
    2007-04-05 21:07     ` Andrew Morton
    2007-04-05 21:39       ` Rik van Riel
    2007-04-06  1:28     ` Nick Piggin
    
