changing mm->mmap_sem (was: Re: system call for process information?)

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* changing mm->mmap_sem  (was: Re: system call for process information?)
       [not found] <Pine.LNX.4.33.0103141618320.21132-100000@duckman.distro.conectiva>
@ 2001-03-15 12:24 ` Rik van Riel
  2001-03-16  9:49   ` Stephen C. Tweedie
  0 siblings, 1 reply; 9+ messages in thread
From: Rik van Riel @ 2001-03-15 12:24 UTC (permalink / raw)
  To: george anzinger; +Cc: Alexander Viro, linux-mm, bcrl, linux-kernel

On Wed, 14 Mar 2001, Rik van Riel wrote:
> On Wed, 14 Mar 2001, george anzinger wrote:
> 
> > Is it REALLY necessary to prevent them from seeing an
> > inconsistent state?  Seems to me that in the total picture (i.e.
> > system wide) they will never see a consistent state, so why be
> > concerned with a small corner of the system.
> 
> You're right.

Mmmm, I've looked at the code today and it turned out that
we're NOT right ;)

The mmap_sem is used in procfs to prevent the list of VMAs
from changing. In the page fault code it seems to be used
to prevent other page faults to happen at the same time with
the current page fault (and to prevent VMAs from changing
while a page fault is underway).

Maybe we should change the mmap_sem into a R/W semaphore ?

Since page faults seem to be the "common cause" of blocking
procfs access *and* since both page faults and procfs only
need to prevent the VMA list from changing, a read lock would
help here.

Write locks would be used in the code where we actually want
to change the VMA list and page faults would use an extra lock
to protect against each other (possibly a per-pagetable lock so
multithreaded apps can pagefault in different memory regions at
the same time ???).

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com.br/

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: changing mm->mmap_sem  (was: Re: system call for process information?)
  2001-03-15 12:24 ` changing mm->mmap_sem (was: Re: system call for process information?) Rik van Riel
@ 2001-03-16  9:49   ` Stephen C. Tweedie
  2001-03-16 11:50     ` Rik van Riel
  0 siblings, 1 reply; 9+ messages in thread
From: Stephen C. Tweedie @ 2001-03-16  9:49 UTC (permalink / raw)
  To: Rik van Riel
  Cc: george anzinger, Alexander Viro, linux-mm, bcrl, linux-kernel

Hi,

On Thu, Mar 15, 2001 at 09:24:59AM -0300, Rik van Riel wrote:
> On Wed, 14 Mar 2001, Rik van Riel wrote:

> The mmap_sem is used in procfs to prevent the list of VMAs
> from changing. In the page fault code it seems to be used
> to prevent other page faults to happen at the same time with
> the current page fault (and to prevent VMAs from changing
> while a page fault is underway).

The page table spinlock should be quite sufficient to let us avoid
races in the page fault code.  We've had to deal with this before
there was ever a mmap_sem anyway: in ancient times, every page fault
had to do things like check to see if the pte had changed after IO was
complete and once the BKL had been retaken.  We can do the same with
the page fault spinlock without much pain.

> Maybe we should change the mmap_sem into a R/W semaphore ?

Definitely.

> Write locks would be used in the code where we actually want
> to change the VMA list and page faults would use an extra lock
> to protect against each other (possibly a per-pagetable lock

Why do we need another lock?  The critical section where we do the
final update on the pte _already_ takes the page table spinlock to
avoid races against the swapper.

Cheers,
 Stephen
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: changing mm->mmap_sem  (was: Re: system call for process information?)
  2001-03-16  9:49   ` Stephen C. Tweedie
@ 2001-03-16 11:50     ` Rik van Riel
  2001-03-16 12:53       ` Stephen C. Tweedie
  0 siblings, 1 reply; 9+ messages in thread
From: Rik van Riel @ 2001-03-16 11:50 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: george anzinger, Alexander Viro, linux-mm, bcrl, linux-kernel

On Fri, 16 Mar 2001, Stephen C. Tweedie wrote:
> On Thu, Mar 15, 2001 at 09:24:59AM -0300, Rik van Riel wrote:
> > On Wed, 14 Mar 2001, Rik van Riel wrote:
> 
> > The mmap_sem is used in procfs to prevent the list of VMAs
> > from changing. In the page fault code it seems to be used
> > to prevent other page faults to happen at the same time with
> > the current page fault (and to prevent VMAs from changing
> > while a page fault is underway).
> 
> The page table spinlock should be quite sufficient to let us avoid
> races in the page fault code.

> > Write locks would be used in the code where we actually want
> > to change the VMA list and page faults would use an extra lock
> > to protect against each other (possibly a per-pagetable lock
> 
> Why do we need another lock?  The critical section where we do the
> final update on the pte _already_ takes the page table spinlock to
> avoid races against the swapper.

The problem is that mmap_sem seems to be protecting the list
of VMAs, so taking _only_ the page_table_lock could let a VMA
change under us while a page fault is underway ...

Then again, I guess just making mmap_sem a R/W lock should fix
our problems ... and maybe even make it possible (in 2.5?) to
let multithreaded programs have pagefaults at the same time,
instead of having all threads queue up behind mmap_sem.

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com.br/

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: changing mm->mmap_sem  (was: Re: system call for process information?)
  2001-03-16 11:50     ` Rik van Riel
@ 2001-03-16 12:53       ` Stephen C. Tweedie
  2001-03-18  7:23         ` Rik van Riel
  0 siblings, 1 reply; 9+ messages in thread
From: Stephen C. Tweedie @ 2001-03-16 12:53 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Stephen C. Tweedie, george anzinger, Alexander Viro, linux-mm,
	bcrl, linux-kernel

Hi,

On Fri, Mar 16, 2001 at 08:50:25AM -0300, Rik van Riel wrote:
> On Fri, 16 Mar 2001, Stephen C. Tweedie wrote:
> 
> > > Write locks would be used in the code where we actually want
> > > to change the VMA list and page faults would use an extra lock
> > > to protect against each other (possibly a per-pagetable lock
> > 
> > Why do we need another lock?  The critical section where we do the
> > final update on the pte _already_ takes the page table spinlock to
> > avoid races against the swapper.
> 
> The problem is that mmap_sem seems to be protecting the list
> of VMAs, so taking _only_ the page_table_lock could let a VMA
> change under us while a page fault is underway ...

Right, I'm not suggesting removing that: making the mmap_sem
read/write is fine, but yes, we still need that semaphore.  But as for
the "page faults would use an extra lock to protect against each
other" bit --- we already have another lock, the page table lock,
which can be used in this way, so ANOTHER lock should be unnecessary.

--Stephen
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: changing mm->mmap_sem  (was: Re: system call for process information?)
  2001-03-16 12:53       ` Stephen C. Tweedie
@ 2001-03-18  7:23         ` Rik van Riel
  2001-03-18  9:56           ` Mike Galbraith
  0 siblings, 1 reply; 9+ messages in thread
From: Rik van Riel @ 2001-03-18  7:23 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: george anzinger, Alexander Viro, linux-mm, bcrl, linux-kernel

On Fri, 16 Mar 2001, Stephen C. Tweedie wrote:

> Right, I'm not suggesting removing that: making the mmap_sem
> read/write is fine, but yes, we still need that semaphore.

Initial patch (against 2.4.2-ac20) is available at
http://www.surriel.com/patches/

> But as for the "page faults would use an extra lock to protect against
> each other" bit --- we already have another lock, the page table lock,
> which can be used in this way, so ANOTHER lock should be unnecessary.

Tomorrow I'll take a look at the various ->nopage
functions and do_swap_page to see if these functions
would be able to take simultaneous faults at the same
address (from multiple threads).  If not, either we'll
need to modify these functions, or we could add a (few?)
extra lock to prevent these functions from faulting at
the same address at the same time in multiple threads.

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com.br/

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: changing mm->mmap_sem  (was: Re: system call for process information?)
  2001-03-18  7:23         ` Rik van Riel
@ 2001-03-18  9:56           ` Mike Galbraith
  2001-03-18 10:46             ` Rik van Riel
  0 siblings, 1 reply; 9+ messages in thread
From: Mike Galbraith @ 2001-03-18  9:56 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-mm, linux-kernel

On Sun, 18 Mar 2001, Rik van Riel wrote:

> On Fri, 16 Mar 2001, Stephen C. Tweedie wrote:
>
> > Right, I'm not suggesting removing that: making the mmap_sem
> > read/write is fine, but yes, we still need that semaphore.
>
> Initial patch (against 2.4.2-ac20) is available at
> http://www.surriel.com/patches/
>
> > But as for the "page faults would use an extra lock to protect against
> > each other" bit --- we already have another lock, the page table lock,
> > which can be used in this way, so ANOTHER lock should be unnecessary.
>
> Tomorrow I'll take a look at the various ->nopage
> functions and do_swap_page to see if these functions
> would be able to take simultaneous faults at the same
> address (from multiple threads).  If not, either we'll
> need to modify these functions, or we could add a (few?)
> extra lock to prevent these functions from faulting at
> the same address at the same time in multiple threads.

Hi Rik,

I gave this patch a try, and the initial results are extremely encouraging.
Not only do I have vmstat (SCHED_RR) info in realtime with zero delays :))
I also have a _nice_ throughput improvement.  There are some worrisome
warnings below along with the compile changes I made here, but for an
initial patch, things look pretty darn wonderful.

	Cheers,

	-Mike

--- ./include/linux/sched.h.org	Sun Mar 18 10:20:42 2001
+++ ./include/linux/sched.h	Sun Mar 18 10:27:48 2001
@@ -238,7 +238,7 @@
 	mm_users:	ATOMIC_INIT(2), 		\
 	mm_count:	ATOMIC_INIT(1), 		\
 	map_count:	1, 				\
-	mmap_sem:	__MUTEX_INITIALIZER(name.mmap_sem), \
+	mmap_sem:	__RWSEM_INITIALIZER(name.mmap_sem, RW_LOCK_BIAS), \
 	page_table_lock: SPIN_LOCK_UNLOCKED, 		\
 	mmlist:		LIST_HEAD_INIT(name.mmlist),	\
 }
--- ./include/linux/mm.h.org	Sun Mar 18 09:56:55 2001
+++ ./include/linux/mm.h	Sun Mar 18 10:27:59 2001
@@ -533,13 +533,13 @@
 	if (vma->vm_end - address > current->rlim[RLIMIT_STACK].rlim_cur ||
 	    ((vma->vm_mm->total_vm + grow) << PAGE_SHIFT) > current->rlim[RLIMIT_AS].rlim_cur)
 		return -ENOMEM;
-	spin_lock(&mm->page_table_lock);
+	spin_lock(&vma->vm_mm->page_table_lock);
 	vma->vm_start = address;
 	vma->vm_pgoff -= grow;
 	vma->vm_mm->total_vm += grow;
 	if (vma->vm_flags & VM_LOCKED)
 		vma->vm_mm->locked_vm += grow;
-	spin_unlock(&mm->page_table_lock);
+	spin_unlock(&vma->vm_mm->page_table_lock);
 	return 0;
 }

...
VFS: Mounted root (ext2 filesystem) readonly.
Freeing unused kernel memory: 196k freed
Adding Swap: 265064k swap-space (priority 2)
VM: Bad swap entry 00011e00
VM: Bad swap entry 00058d00
Unused swap offset entry in swap_dup 00058d00
Unused swap offset entry in swap_dup 00011e00
VM: Bad swap entry 00011e00
VM: Bad swap entry 00058d00
Unused swap offset entry in swap_dup 00058d00
VM: Bad swap entry 00058d00
Unused swap offset entry in swap_dup 00011e00
Unused swap offset entry in swap_dup 00058d00
VM: Bad swap entry 00011e00
VM: Bad swap entry 00058d00
Unused swap offset entry in swap_dup 00011e00
Unused swap offset entry in swap_dup 00058d00
VM: Bad swap entry 00011e00
VM: Bad swap entry 00058d00
Unused swap offset entry in swap_dup 008f4e00
VM: Bad swap entry 008f4e00
Unused swap offset entry in swap_dup 006ef700
Unused swap offset entry in swap_dup 008f4e00
VM: Bad swap entry 006ef700
VM: Bad swap entry 008f4e00
Unused swap offset entry in swap_dup 006ef700
Unused swap offset entry in swap_dup 008f4e00
VM: Bad swap entry 006ef700
VM: Bad swap entry 008f4e00
Unused swap offset entry in swap_dup 006ef700
Unused swap offset entry in swap_dup 008f4e00
VM: Bad swap entry 006ef700
VM: Bad swap entry 008f4e00
Unused swap offset entry in swap_dup 006ef700
Unused swap offset entry in swap_dup 008f4e00
VM: Bad swap entry 006ef700
VM: Bad swap entry 008f4e00
Unused swap offset entry in swap_dup 006ef700
Unused swap offset entry in swap_dup 008f4e00
VM: Bad swap entry 006ef700
Unused swap offset entry in swap_dup 006ef700
VM: Bad swap entry 006ef700
VM: Bad swap entry 008f4e00
Unused swap offset entry in swap_dup 006ef700
Unused swap offset entry in swap_dup 008f4e00
VM: Bad swap entry 006ef700
VM: Bad swap entry 008f4e00
Unused swap offset entry in swap_count 00011e00
Unused swap offset entry in swap_dup 006ef700
Unused swap offset entry in swap_dup 008f4e00
VM: Bad swap entry 006ef700
VM: Bad swap entry 008f4e00
Unused swap offset entry in swap_dup 008f4e00
Unused swap offset entry in swap_dup 006ef700
VM: Bad swap entry 006ef700
VM: Bad swap entry 008f4e00
Unused swap offset entry in swap_dup 006ef700
Unused swap offset entry in swap_dup 008f4e00
VM: Bad swap entry 006ef700
VM: Bad swap entry 008f4e00
Unused swap offset entry in swap_dup 006ef700
Unused swap offset entry in swap_dup 008f4e00
VM: Bad swap entry 006ef700
VM: Bad swap entry 008f4e00
Unused swap offset entry in swap_dup 006ef700
Unused swap offset entry in swap_dup 008f4e00
VM: Bad swap entry 006ef700
VM: Bad swap entry 008f4e00
Unused swap offset entry in swap_dup 006ef700
Unused swap offset entry in swap_dup 008f4e00
VM: Bad swap entry 006ef700
VM: Bad swap entry 008f4e00
Unused swap offset entry in swap_dup 008f4e00
Unused swap offset entry in swap_dup 006ef700
VM: Bad swap entry 006ef700
VM: Bad swap entry 008f4e00
Unused swap offset entry in swap_dup 006ef700
Unused swap offset entry in swap_dup 008f4e00
Unused swap offset entry in swap_dup 00011e00
VM: Bad swap entry 006ef700
VM: Bad swap entry 008f4e00
Unused swap offset entry in swap_dup 006ef700
Unused swap offset entry in swap_dup 008f4e00
VM: Bad swap entry 00011e00
Unused swap offset entry in swap_count 00011e00
VM: Bad swap entry 006ef700
VM: Bad swap entry 008f4e00
Unused swap offset entry in swap_dup 006ef700
Unused swap offset entry in swap_dup 008f4e00
VM: Bad swap entry 006ef700
VM: Bad swap entry 008f4e00
Unused swap offset entry in swap_dup 008f4e00
Unused swap offset entry in swap_dup 006ef700

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: changing mm->mmap_sem  (was: Re: system call for process information?)
  2001-03-18  9:56           ` Mike Galbraith
@ 2001-03-18 10:46             ` Rik van Riel
  2001-03-18 12:33               ` Mike Galbraith
  0 siblings, 1 reply; 9+ messages in thread
From: Rik van Riel @ 2001-03-18 10:46 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: linux-mm, linux-kernel

On Sun, 18 Mar 2001, Mike Galbraith wrote:

> I gave this patch a try, and the initial results are extremely encouraging.
> Not only do I have vmstat (SCHED_RR) info in realtime with zero delays :))
> I also have a _nice_ throughput improvement.  There are some worrisome
> warnings below along with the compile changes I made here, but for an
> initial patch, things look pretty darn wonderful.

	[snip compile fixes .. integrated]

> VFS: Mounted root (ext2 filesystem) readonly.
> Freeing unused kernel memory: 196k freed
> Adding Swap: 265064k swap-space (priority 2)
> VM: Bad swap entry 00011e00
> VM: Bad swap entry 00058d00
> Unused swap offset entry in swap_dup 00058d00
> Unused swap offset entry in swap_dup 00011e00
> VM: Bad swap entry 00011e00
> VM: Bad swap entry 00058d00

Heh, I guess do_swap_page isn't too happy when multiple threads
of the same program take a page fault at the same address at the
same time.

I take it you were testing something like mysql, jvm or apache2 ?

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com.br/

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: changing mm->mmap_sem  (was: Re: system call for process information?)
  2001-03-18 10:46             ` Rik van Riel
@ 2001-03-18 12:33               ` Mike Galbraith
  0 siblings, 0 replies; 9+ messages in thread
From: Mike Galbraith @ 2001-03-18 12:33 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-mm, linux-kernel

On Sun, 18 Mar 2001, Rik van Riel wrote:

> > VFS: Mounted root (ext2 filesystem) readonly.
> > Freeing unused kernel memory: 196k freed
> > Adding Swap: 265064k swap-space (priority 2)
> > VM: Bad swap entry 00011e00
> > VM: Bad swap entry 00058d00
> > Unused swap offset entry in swap_dup 00058d00
> > Unused swap offset entry in swap_dup 00011e00
> > VM: Bad swap entry 00011e00
> > VM: Bad swap entry 00058d00
>
> Heh, I guess do_swap_page isn't too happy when multiple threads
> of the same program take a page fault at the same address at the
> same time.
>
> I take it you were testing something like mysql, jvm or apache2 ?

No, this was make -j30 bzImage.  (nscd was running though...)

	-Mike

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: changing mm->mmap_sem  (was: Re: system call for process information?)
       [not found] <Pine.LNX.4.33.0103181407520.1426-100000@mikeg.weiden.de>
@ 2001-03-18 14:43 ` Rik van Riel
  0 siblings, 0 replies; 9+ messages in thread
From: Rik van Riel @ 2001-03-18 14:43 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: sct, linux-kernel

On Sun, 18 Mar 2001, Mike Galbraith wrote:

> > No, this was make -j30 bzImage.  (nscd was running though...)
> 
> I rebooted, shut down nscd prior to testing and did 5 builds in a row
> without a single gripe.  Started nscd for sixth run and instantly the
> kernel griped.  Yup.. threaded apps pushing swap.

OK, I'll write some code to prevent multiple threads from
stepping all over each other when they pagefault at the
same address.

What would be the preferred method of fixing this ?

- fixing do_swap_page and all ->nopage functions
- hacking handle_mm_fault to make sure no overlapping
  pagefaults will be served at the same time

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com.br/


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2001-03-18 16:42 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <Pine.LNX.4.33.0103141618320.21132-100000@duckman.distro.conectiva>
2001-03-15 12:24 ` changing mm->mmap_sem (was: Re: system call for process information?) Rik van Riel
2001-03-16  9:49   ` Stephen C. Tweedie
2001-03-16 11:50     ` Rik van Riel
2001-03-16 12:53       ` Stephen C. Tweedie
2001-03-18  7:23         ` Rik van Riel
2001-03-18  9:56           ` Mike Galbraith
2001-03-18 10:46             ` Rik van Riel
2001-03-18 12:33               ` Mike Galbraith
     [not found] <Pine.LNX.4.33.0103181407520.1426-100000@mikeg.weiden.de>
2001-03-18 14:43 ` Rik van Riel

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox