* Re: locking question: do_mmap(), do_munmap()
       [not found] <Pine.LNX.4.10.9910101713010.364-100000@alpha.random>
@ 1999-10-10 15:52 ` Manfred Spraul
  1999-10-10 16:07   ` Alexander Viro
  0 siblings, 1 reply; 51+ messages in thread

From: Manfred Spraul @ 1999-10-10 15:52 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: linux-kernel, linux-mm

Andrea Arcangeli wrote:
>
> On Sun, 10 Oct 1999, Manfred Spraul wrote:
>
> > But this means that both locks are required if you modify the vma list.
> > Single reader, multiple writer synchronization. Unusual, but interesting
> > :-)
>
> Yes, that's always been this way also in 2.2.x.

And which lock protects "mm->rss"? It's not an atomic variable, but it is

* increased by do_swap_page() outside lock_kernel;
* decreased by the swapper.

I've started adding "assert_down()" and "assert_kernellocked()" macros,
and now I don't see the login prompt any more...

E.g. sys_mprotect calls merge_segments without lock_kernel().

--
	Manfred

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to
majordomo@kvack.org.  For more info on Linux MM, see:
http://humbolt.geo.uu.nl/Linux-MM/

^ permalink raw reply	[flat|nested] 51+ messages in thread
* Re: locking question: do_mmap(), do_munmap()
  1999-10-10 15:52 ` locking question: do_mmap(), do_munmap() Manfred Spraul
@ 1999-10-10 16:07   ` Alexander Viro
  1999-10-10 16:25     ` Alexander Viro
  ` (2 more replies)
  0 siblings, 3 replies; 51+ messages in thread

From: Alexander Viro @ 1999-10-10 16:07 UTC (permalink / raw)
To: Manfred Spraul; +Cc: Andrea Arcangeli, linux-kernel, Ingo Molnar, linux-mm

[Cc'd to mingo]

On Sun, 10 Oct 1999, Manfred Spraul wrote:

> I've started adding "assert_down()" and "assert_kernellocked()" macros,
> and now I don't see the login prompt any more...
>
> eg. sys_mprotect calls merge_segments without lock_kernel().

Manfred, Andrea - please stop it. Yes, it does and yes, it should.
Plonking the big lock around every access to VM is _not_ a solution. If
the swapper doesn't use mmap_sem - _the swapper_ should be fixed. How the
hell does lock_kernel() have smaller deadlock potential than
down(&mm->mmap_sem)? If you want to return to 2.2 you know where to
find it.
* Re: locking question: do_mmap(), do_munmap()
  1999-10-10 16:07 ` Alexander Viro
@ 1999-10-10 16:25   ` Alexander Viro
  1999-10-10 16:45     ` Manfred Spraul
  1999-10-10 17:12     ` Andrea Arcangeli
  1999-10-10 16:56   ` Andrea Arcangeli
  1999-10-11 15:41   ` Stephen C. Tweedie
  2 siblings, 2 replies; 51+ messages in thread

From: Alexander Viro @ 1999-10-10 16:25 UTC (permalink / raw)
To: Manfred Spraul; +Cc: Andrea Arcangeli, linux-kernel, Ingo Molnar, linux-mm

On Sun, 10 Oct 1999, Alexander Viro wrote:

>
> [Cc'd to mingo]
>
> On Sun, 10 Oct 1999, Manfred Spraul wrote:
>
> > I've started adding "assert_down()" and "assert_kernellocked()" macros,
> > and now I don't see the login prompt any more...
> >
> > eg. sys_mprotect calls merge_segments without lock_kernel().
>
> Manfred, Andrea - please stop it. Yes, it does and yes, it should.
> Plonking the big lock around every access to VM is _not_ a solution. If
> swapper doesn't use mmap_sem - _swapper_ should be fixed. How the hell
> does lock_kernel() have smaller deadlock potential than
> down(&mm->mmap_sem)?

OK, folks. The code in the swapper (unuse_process(), right?) is called
only from sys_swapoff(). It's a syscall. Andrea, could you show a
scenario for a deadlock here? OK, some process (but not the process
doing swapoff()) may have the map locked. So? It is not going to
release the thing - we are seriously screwed anyway (read: we already
are in a deadlock). We don't hold the semaphore ourselves. Andrea, post
a deadlock scenario, please.
* Re: locking question: do_mmap(), do_munmap()
  1999-10-10 16:25 ` Alexander Viro
@ 1999-10-10 16:45   ` Manfred Spraul
  1999-10-10 17:25     ` Alexander Viro
  1999-10-10 17:12   ` Andrea Arcangeli
  1 sibling, 1 reply; 51+ messages in thread

From: Manfred Spraul @ 1999-10-10 16:45 UTC (permalink / raw)
To: Alexander Viro; +Cc: Andrea Arcangeli, linux-kernel, Ingo Molnar, linux-mm

Alexander Viro wrote:
>
> On Sun, 10 Oct 1999, Alexander Viro wrote:
>
> > On Sun, 10 Oct 1999, Manfred Spraul wrote:
> >
> > > I've started adding "assert_down()" and "assert_kernellocked()" macros,
> > > and now I don't see the login prompt any more...
> > >
> > > eg. sys_mprotect calls merge_segments without lock_kernel().
> >
> > Manfred, Andrea - please stop it. Yes, it does and yes, it should.

Yes, it should cause an oops?

> > Plonking the big lock around every access to VM is _not_ a solution.

I never did that, and I'll never do that; I only notice that the current
code is filled with races.

> > If swapper doesn't use mmap_sem - _swapper_ should be fixed. How the
> > hell does lock_kernel() have smaller deadlock potential than
> > down(&mm->mmap_sem)?

lock_kernel() is dropped on a thread switch; the semaphore is not
dropped.

> > OK, folks. Code in swapper (unuse_process(), right?) is called only from
> > sys_swapoff(). It's a syscall. Andrea, could you show a scenario for
> > deadlock here? OK, some process (but not the process doing swapoff()) may
> > have the map locked. So? it is not going to release the thing - we are
> > seriously screwed anyway (read: we already are in deadlock). We don't hold
> > the semaphore ourselves.

AFAIK the problem is OOM:

* A process accesses a not-present page, i.e. a page fault:
	...
	handle_mm_fault():	this process owns mm->mmap_sem.
	-> handle_pte_fault()
	-> (e.g.) do_wp_page()
	-> get_free_page()
  Now get_free_page() notices that there is no free memory
  --> wake up kswapd.

* The swapper runs, and it tries to swap out data from that process.
  mm->mmap_sem is already acquired --> lock-up.

--
	Manfred
* Re: locking question: do_mmap(), do_munmap()
  1999-10-10 16:45 ` Manfred Spraul
@ 1999-10-10 17:25   ` Alexander Viro
  0 siblings, 0 replies; 51+ messages in thread

From: Alexander Viro @ 1999-10-10 17:25 UTC (permalink / raw)
To: Manfred Spraul; +Cc: Andrea Arcangeli, linux-kernel, Ingo Molnar, linux-mm

[Apologies to Andrea - idiot me looked into the wrong place ;-<]

On Sun, 10 Oct 1999, Manfred Spraul wrote:

> AFAIK the problem is OOM:
> * A process accesses a not-present page, i.e. a page fault:
>	...
>	handle_mm_fault():	this process owns mm->mmap_sem.
>	-> handle_pte_fault()
>	-> (e.g.) do_wp_page()
>	-> get_free_page()
>   Now get_free_page() notices that there is no free memory
>   --> wake up kswapd.
>
> * The swapper runs, and it tries to swap out data from that process.
>   mm->mmap_sem is already acquired --> lock-up.

Nasty... But adding a big lock around all traversals of mm->mmap will
hurt like hell. Let's see... The problem is in swap_out_mm(), right? And
it looks for the first suitable page. OK, it looks like we can get a
deadlock only if we call try_to_free_pages() with ->mmap_sem grabbed and
__GFP_WAIT in the flags. Umhm... Called only from __get_free_pages(),
which sets PF_MEMALLOC... OK, I'll try to look at it.
* Re: locking question: do_mmap(), do_munmap()
  1999-10-10 16:25 ` Alexander Viro
  1999-10-10 16:45   ` Manfred Spraul
@ 1999-10-10 17:12   ` Andrea Arcangeli
  1999-10-10 17:48     ` Alexander Viro
  1999-10-11 15:43     ` Stephen C. Tweedie
  1 sibling, 2 replies; 51+ messages in thread

From: Andrea Arcangeli @ 1999-10-10 17:12 UTC (permalink / raw)
To: Alexander Viro; +Cc: Manfred Spraul, linux-kernel, Ingo Molnar, linux-mm

On Sun, 10 Oct 1999, Alexander Viro wrote:

> sys_swapoff(). It's a syscall. Andrea, could you show a scenario for

do_page_fault -> down() -> GFP -> swap_out() -> down() -> deadlock

To grab the mm semaphore in swap_out() we could run swap_out() only from
kswapd, doing a kind of wakeup_and_wait_kswapd() a la wakeup_bdflush(1),
but it would be slow, and I don't want to run worse than 2.2.x on UP
just to get some more SMP scalability (that won't pay the cost).

The other option is to make the mmap semaphore recursive, checking that
GFP is not called in the middle of a vma change. I don't like this one;
it doesn't sound as robust as the spinlock way to me (see below).

What I like is to go as in 2.2.x with a proper spinlock for doing vma
reads (I am _not_ talking about the big kernel lock!).

Andrea
* Re: locking question: do_mmap(), do_munmap()
  1999-10-10 17:12 ` Andrea Arcangeli
@ 1999-10-10 17:48   ` Alexander Viro
  1999-10-10 18:42     ` Manfred Spraul
  1999-10-11 15:43   ` Stephen C. Tweedie
  1 sibling, 1 reply; 51+ messages in thread

From: Alexander Viro @ 1999-10-10 17:48 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Manfred Spraul, linux-kernel, Ingo Molnar, linux-mm

On Sun, 10 Oct 1999, Andrea Arcangeli wrote:

> On Sun, 10 Oct 1999, Alexander Viro wrote:
>
> > sys_swapoff(). It's a syscall. Andrea, could you show a scenario for
>
> do_page_fault -> down() -> GFP -> swap_out() -> down() -> deadlock

Yes, I had realized that I was looking into the wrong place -
unfortunately only after I had sent the posting. My apologies.

> To grab the mm semaphore in swap_out() we could run swap_out() only from
> kswapd, doing a kind of wakeup_and_wait_kswapd() a la wakeup_bdflush(1),
> but it would be slow, and I don't want to run worse than 2.2.x on UP
> just to get some more SMP scalability (that won't pay the cost).
>
> The other option is to make the mmap semaphore recursive, checking that
> GFP is not called in the middle of a vma change. I don't like this one;
> it doesn't sound as robust as the spinlock way to me (see below).
>
> What I like is to go as in 2.2.x with a proper spinlock for doing vma
> reads (I am _not_ talking about the big kernel lock!).

I'm not sure that it will work (we scan the thing in many places and
quite a few may be blocking ;-/), unless you propose to protect
individual steps of the scan, which will give you lots of overhead. I
suspect that swap_out_mm() needs fixing, not everything else... And it
looks like we can't drop the sucker earlier in handle_mm_fault. Or can
we?

As crazy as it may sound, what about keeping a small cache of pages,
taking from that cache and doing refills when we are crossing the
boundary of the dangerous area (refusing to enter it until the number of
pages in the cache grows bigger than the number of processes in the
dangerous part)?
* Re: locking question: do_mmap(), do_munmap()
  1999-10-10 17:48 ` Alexander Viro
@ 1999-10-10 18:42   ` Manfred Spraul
  1999-10-10 19:03     ` Alexander Viro
  1999-10-11 15:47     ` Stephen C. Tweedie
  0 siblings, 2 replies; 51+ messages in thread

From: Manfred Spraul @ 1999-10-10 18:42 UTC (permalink / raw)
To: Alexander Viro; +Cc: Andrea Arcangeli, linux-kernel, Ingo Molnar, linux-mm

Alexander Viro wrote:
> I'm not sure that it will work (we scan the thing in many places and
> quite a few may be blocking ;-/), unless you propose to protect
> individual steps of the scan, which will give you lots of overhead.

The overhead should be low; we could keep the "double synchronization",
i.e.

* either down(&mm->mmap_sem) or spin_lock(&mm->vma_list_lock) for read;
* both locks for write.

I think that 3 to 5 spin_lock() calls are required.

> I suspect that
> swap_out_mm() needs fixing, not everything else... And it looks like we
> can't drop the sucker earlier in handle_mm_fault. Or can we?

That would be a good idea: for multi-threaded applications, swap-in is
currently single-threaded, i.e. we do not overlap the I/O operations if
2 threads of the same process cause page faults. Everything is fully
serialized.

But I think this would be a huge change, e.g. do_munmap() in one thread
while another thread waits for page-in....

--
	Manfred
* Re: locking question: do_mmap(), do_munmap()
  1999-10-10 18:42 ` Manfred Spraul
@ 1999-10-10 19:03   ` Alexander Viro
  1999-10-10 21:31     ` Manfred Spraul
  ` (2 more replies)
  1 sibling, 3 replies; 51+ messages in thread

From: Alexander Viro @ 1999-10-10 19:03 UTC (permalink / raw)
To: Manfred Spraul; +Cc: Andrea Arcangeli, linux-kernel, Ingo Molnar, linux-mm

On Sun, 10 Oct 1999, Manfred Spraul wrote:

> Alexander Viro wrote:
> > I'm not sure that it will work (we scan the thing in many places and
> > quite a few may be blocking ;-/), unless you propose to protect
> > individual steps of the scan, which will give you lots of overhead.
>
> The overhead should be low; we could keep the "double synchronization",
> i.e.
> * either down(&mm->mmap_sem) or spin_lock(&mm->vma_list_lock) for read;
> * both locks for write.
>
> I think that 3 to 5 spin_lock() calls are required.

Hold on. In swap_out_mm() you have to protect find_vma() (OK, it doesn't
block, but we'll have to take care of mm->mmap_cache) _and_ you'll have
to protect the vma from destruction all the way down to
try_to_swap_out(). And to vma->swapout(). Which can sleep, so spinlocks
are out of the question here.

I still think that just keeping a cyclic list of pages, grabbing from
that list before taking mmap_sem _if_ we have a chance of a blocking
__get_free_page(), refilling if the list is empty (prior to down()) and
returning the page into the list if we didn't use it may be the
simplest way.
* Re: locking question: do_mmap(), do_munmap()
  1999-10-10 19:03 ` Alexander Viro
@ 1999-10-10 21:31   ` Manfred Spraul
  1999-10-10 21:53   ` Andrea Arcangeli
  1999-10-11 15:50   ` Stephen C. Tweedie
  2 siblings, 0 replies; 51+ messages in thread

From: Manfred Spraul @ 1999-10-10 21:31 UTC (permalink / raw)
To: Alexander Viro; +Cc: Andrea Arcangeli, linux-kernel, Ingo Molnar, linux-mm

Alexander Viro wrote:
> Hold on. In swap_out_mm() you have to protect find_vma() (OK, it doesn't
> block, but we'll have to take care of mm->mmap_cache) _and_ you'll have to
> protect vma from destruction all way down to try_to_swap_out(). And to
> vma->swapout(). Which can sleep, so spinlocks are out of question here.

I found vma->swapout() when I tried to implement it. Sh... We could make
vma_list_lock a semaphore, but I haven't checked for any hidden problems
yet.

> I still think that just keeping a cyclic list of pages, grabbing from that
> list before taking mmap_sem _if_ we have a chance for blocking
> __get_free_page(), refilling if the list is empty (prior to down()) and
> returning the page into the list if we didn't use it may be the simplest
> way.

I don't like the idea, but it sounds possible. A problem could be that
the page-in functions can themselves allocate memory:
do_nopage() -> filemap_nopage(): it calls i_op->readpage(), which would
call get_block(); e.g. ext2 loads the indirect page, and this needs
memory --> OOM.

--
	Manfred
* Re: locking question: do_mmap(), do_munmap()
  1999-10-10 19:03 ` Alexander Viro
  1999-10-10 21:31   ` Manfred Spraul
@ 1999-10-10 21:53   ` Andrea Arcangeli
  1999-10-10 22:34     ` Alexander Viro
  1 sibling, 1 reply; 51+ messages in thread

From: Andrea Arcangeli @ 1999-10-10 21:53 UTC (permalink / raw)
To: Alexander Viro; +Cc: Manfred Spraul, linux-kernel, Ingo Molnar, linux-mm

On Sun, 10 Oct 1999, Alexander Viro wrote:

> I still think that just keeping a cyclic list of pages, grabbing from that
> list before taking mmap_sem _if_ we have a chance for blocking
> __get_free_page(), refilling if the list is empty (prior to down()) and
> returning the page into the list if we didn't use it may be the simplest
> way.

I can't understand your plan very well.

We already have a security pool, and we only block when the pool becomes
low. But to refill the existing pool we have to walk the vmas. That's
the problem in the first place.

Andrea
* Re: locking question: do_mmap(), do_munmap()
  1999-10-10 21:53 ` Andrea Arcangeli
@ 1999-10-10 22:34   ` Alexander Viro
  1999-10-10 23:28     ` Andrea Arcangeli
  0 siblings, 1 reply; 51+ messages in thread

From: Alexander Viro @ 1999-10-10 22:34 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Manfred Spraul, linux-kernel, Ingo Molnar, linux-mm

On Sun, 10 Oct 1999, Andrea Arcangeli wrote:

> On Sun, 10 Oct 1999, Alexander Viro wrote:
>
> > I still think that just keeping a cyclic list of pages, grabbing from that
> > list before taking mmap_sem _if_ we have a chance for blocking
> > __get_free_page(), refilling if the list is empty (prior to down()) and
> > returning the page into the list if we didn't use it may be the simplest
> > way.
>
> I can't understand your plan very well.
>
> We already have a security pool, and we only block when the pool becomes
> low. But to refill the existing pool we have to walk the vmas. That's
> the problem in the first place.

I missed the fact that page-in can suck in additional pages. Sorry. The
original idea was to do that _before_ getting the mmap_sem - that would
allow us to grab it in swap_out_mm.

We can't do _anything_ with vmas while swap_out_mm is running over their
mm - no list modifications, no vma removal, etc. We could introduce a
new semaphore (spinlocks are not going to work - ->swapout gets the vma
as an argument and it can sleep). The question being: where can we
trigger __get_free_pages() with __GFP_WAIT while mmap_sem is held? And
another one - where do we modify ->mmap? If they can be easily
separated - fine. Then we need to protect the thing with a semaphore -
no contention in the normal case (we already have mmap_sem), enough
protection in the swapper. Probably it makes sense to choose the
protected area as wide as possible - there will be no contention in the
normal case and we can cut the overhead down.

I'll try to look through the thing tonight (while the truncate stuff
will eat the fs on the testbox ;-), but I'd be really grateful if some
of the VM people would check the results.
* Re: locking question: do_mmap(), do_munmap()
  1999-10-10 22:34 ` Alexander Viro
@ 1999-10-10 23:28   ` Andrea Arcangeli
  0 siblings, 0 replies; 51+ messages in thread

From: Andrea Arcangeli @ 1999-10-10 23:28 UTC (permalink / raw)
To: Alexander Viro; +Cc: Manfred Spraul, linux-kernel, Ingo Molnar, linux-mm

On Sun, 10 Oct 1999, Alexander Viro wrote:

> mm - no list modifications, no vma removal, etc. We could introduce a new
> semaphore (spinlocks are not going to work - ->swapout gets vma as

We could add a reference count to each vma, protected by a per-mm
spinlock. Then we could drop the spinlock in swap_out_mm as soon as we
have incremented the refcount of the vma. I am talking from memory; I am
not sure if it can really be done and if it's the right thing to do this
time. (I'll check the code ASAP.)

> argument and it can sleep. The question being: where can we trigger
> __get_free_pages() with __GFP_WAIT if the mmap_sem is held? And another

All userspace allocations do exactly that.

> one - where do we modify ->mmap? If they can be easily separated -

We modify ->mmap in mmap.c, and in fact we hold the semaphore there too.
The reason they are not separated is that during the whole page fault
path you can't have _your_ vmas change under you while another thread is
running munmap in parallel.

> eat the fs on the testbox ;-), but I'ld be really grateful if some of VM
> people would check the results.

I can check them of course ;).

Andrea
* Re: locking question: do_mmap(), do_munmap()
  1999-10-10 19:03 ` Alexander Viro
  1999-10-10 21:31   ` Manfred Spraul
  1999-10-10 21:53   ` Andrea Arcangeli
@ 1999-10-11 15:50   ` Stephen C. Tweedie
  1999-10-11 16:05     ` Alexander Viro
  2 siblings, 1 reply; 51+ messages in thread

From: Stephen C. Tweedie @ 1999-10-11 15:50 UTC (permalink / raw)
To: Alexander Viro
Cc: Manfred Spraul, Andrea Arcangeli, linux-kernel, Ingo Molnar,
    linux-mm, Stephen Tweedie

Hi,

On Sun, 10 Oct 1999 15:03:45 -0400 (EDT), Alexander Viro
<viro@math.psu.edu> said:

> Hold on. In swap_out_mm() you have to protect find_vma() (OK, it doesn't
> block, but we'll have to take care of mm->mmap_cache) _and_ you'll have to
> protect vma from destruction all way down to try_to_swap_out(). And to
> vma->swapout(). Which can sleep, so spinlocks are out of question
> here.

No, spinlocks would be ideal. The vma swapout code _has_ to be prepared
for the vma to be destroyed as soon as we sleep. In fact, the entire mm
may disappear if the process happens to exit. Once we know which page to
write where, the swapout operation becomes a per-page operation, not
per-vma.

We had some rather interesting bugs in the 1.0 and 1.2 timeframes
surrounding exactly this. Nowadays we assume that all mm context
disappears as soon as the swapper blocks. We bump the refcount on the
page itself to make the page write safe, but we have to be very careful
indeed to do all the mm manipulations before we get to that stage.

--Stephen
* Re: locking question: do_mmap(), do_munmap()
  1999-10-11 15:50 ` Stephen C. Tweedie
@ 1999-10-11 16:05   ` Alexander Viro
  1999-10-11 18:02     ` Manfred Spraul
  1999-10-11 20:13     ` Stephen C. Tweedie
  0 siblings, 2 replies; 51+ messages in thread

From: Alexander Viro @ 1999-10-11 16:05 UTC (permalink / raw)
To: Stephen C. Tweedie
Cc: Manfred Spraul, Andrea Arcangeli, linux-kernel, Ingo Molnar, linux-mm

On Mon, 11 Oct 1999, Stephen C. Tweedie wrote:

> No, spinlocks would be ideal. The vma swapout codes _have_ to be
> prepared for the vma to be destroyed as soon as we sleep. In fact, the
> entire mm may disappear if the process happens to exit. Once we know
> which page to write where, the swapout operation becomes a per-page
> operation, not per-vma.

Aha, so you propose to drop it in ->swapout(), right? (After get_file()
in filemap_write_page()... Ouch. Probably we'd better lambda-expand the
call in filemap_swapout() - the thing is called from other places too.)
* Re: locking question: do_mmap(), do_munmap()
  1999-10-11 16:05 ` Alexander Viro
@ 1999-10-11 18:02   ` Manfred Spraul
  1999-10-11 19:07     ` Kanoj Sarcar
  ` (3 more replies)
  1 sibling, 4 replies; 51+ messages in thread

From: Manfred Spraul @ 1999-10-11 18:02 UTC (permalink / raw)
To: Alexander Viro
Cc: Stephen C. Tweedie, Andrea Arcangeli, linux-kernel, Ingo Molnar,
    linux-mm

Alexander Viro wrote:
>
> On Mon, 11 Oct 1999, Stephen C. Tweedie wrote:
>
> > No, spinlocks would be ideal. The vma swapout codes _have_ to be
> > prepared for the vma to be destroyed as soon as we sleep. In fact, the
> > entire mm may disappear if the process happens to exit. Once we know
> > which page to write where, the swapout operation becomes a per-page
> > operation, not per-vma.
>
> Aha, so you propose to drop it in ->swapout(), right? (after get_file() in
> filemap_write_page()... Ouch. Probably we'ld better lambda-expand the call
> in filemap_swapout() - the thing is called from other places too)...

What about something like a rw-semaphore which protects the vma list:
vma-list modifiers [i.e. merge_segments(), insert_vm_struct() and
do_munmap()] grab it exclusive, the swapper grabs it "shared, starve
exclusive". All other vma-list readers are protected by mm->mmap_sem.

This should not deadlock, and no changes are required in
vm_ops->swapout().

--
	Manfred
* Re: locking question: do_mmap(), do_munmap()
  1999-10-11 18:02 ` Manfred Spraul
@ 1999-10-11 19:07   ` Kanoj Sarcar
  1999-10-11 22:23     ` Stephen C. Tweedie
  1999-10-11 20:15   ` Stephen C. Tweedie
  ` (2 subsequent siblings)
  3 siblings, 1 reply; 51+ messages in thread

From: Kanoj Sarcar @ 1999-10-11 19:07 UTC (permalink / raw)
To: Manfred Spraul; +Cc: viro, sct, andrea, linux-kernel, mingo, linux-mm

>
> What about something like a rw-semaphore which protects the vma list:
> vma-list modifiers [ie merge_segments(), insert_vm_struct() and
> do_munmap()] grab it exclusive, swapper grabs it "shared, starve
> exclusive".
> All other vma-list readers are protected by mm->mmap_sem.
>
> This should not dead-lock, and no changes are required in
> vm_ops->swapout().
>

I have tried to follow most of the logic and solutions proposed on this
thread. This is the best solution, imo. In fact, I had already coded
something along these lines against a 2.2.10 kernel, which I still have
around. I will try to port this against a 2.3.19 kernel over the next
couple of days and post it for everyone to review.

Thanks.

Kanoj
* Re: locking question: do_mmap(), do_munmap()
  1999-10-11 19:07 ` Kanoj Sarcar
@ 1999-10-11 22:23   ` Stephen C. Tweedie
  1999-10-13  1:25     ` Kanoj Sarcar
  0 siblings, 1 reply; 51+ messages in thread

From: Stephen C. Tweedie @ 1999-10-11 22:23 UTC (permalink / raw)
To: Kanoj Sarcar
Cc: Manfred Spraul, viro, sct, andrea, linux-kernel, mingo, linux-mm

Hi,

On Mon, 11 Oct 1999 12:07:08 -0700 (PDT), kanoj@google.engr.sgi.com
(Kanoj Sarcar) said:

>> What about something like a rw-semaphore which protects the vma list:
>> vma-list modifiers [ie merge_segments(), insert_vm_struct() and
>> do_munmap()] grab it exclusive, swapper grabs it "shared, starve
>> exclusive".
>> All other vma-list readers are protected by mm->mmap_sem.

> I have tried to follow most of the logic and solutions proposed
> on this thread.

It will deadlock.

--Stephen
* Re: locking question: do_mmap(), do_munmap()
  1999-10-11 22:23 ` Stephen C. Tweedie
@ 1999-10-13  1:25   ` Kanoj Sarcar
  1999-10-13  7:32     ` Manfred Spraul
  1999-10-13 10:45     ` Stephen C. Tweedie
  0 siblings, 2 replies; 51+ messages in thread

From: Kanoj Sarcar @ 1999-10-13 1:25 UTC (permalink / raw)
To: Stephen C. Tweedie; +Cc: manfreds, viro, andrea, linux-kernel, mingo, linux-mm

> On Mon, 11 Oct 1999 12:07:08 -0700 (PDT), kanoj@google.engr.sgi.com
> (Kanoj Sarcar) said:
>
> >> What about something like a rw-semaphore which protects the vma list:
> >> vma-list modifiers [ie merge_segments(), insert_vm_struct() and
> >> do_munmap()] grab it exclusive, swapper grabs it "shared, starve
> >> exclusive".
> >> All other vma-list readers are protected by mm->mmap_sem.
>
> It will deadlock.

Not sure why you say that. Yes, I did see the nested semaphore grabbing
scenario that you posted, but I think you can prevent that from
happening if you follow some rules. Here's a primitive patch showing the
direction I am thinking of. I do not have any problem with a spinning
lock, but I coded this against 2.2.10, where insert_vm_struct could go
to sleep, hence I had to use sleeping locks to protect the vma chain. A
slight change to vmscan.c is needed to use spinning locks.

Kanoj

This is a skeleton of the solution that prevents kswapd from walking
down a vma chain without protection. I am trying to get comments on this
approach before I try a full blown implementation.

The rules:

1. To modify the vmlist (add/delete), you must hold mmap_sem to guard
   against clones doing mmap/munmap/faults (i.e. all vm system calls and
   faults), and against ptrace, swapin due to swap deletion, etc.
2. To modify the vmlist (add/delete), you must also hold
   vmlist_modify_lock, to guard against page stealers scanning the list.
3. To scan the vmlist, you must either
   a. grab mmap_sem, which should be all cases except the page stealer, or
   b. grab vmlist_access_lock, only done by the page stealer.
4. While holding the vmlist_modify_lock, you must be able to guarantee
   that no code path will lead to page stealing.
5. You must be able to guarantee that while holding vmlist_modify_lock
   or vmlist_access_lock of mm A, you will not try to get either lock
   for mm B.

The assumptions:

1. No code path reachable through insert_vm_struct and merge_segments
   sleeps for memory.

--- /usr/tmp/p_rdiff_a00368/vmscan.c	Tue Oct 12 17:58:50 1999
+++ mm/vmscan.c	Tue Oct 12 16:55:13 1999
@@ -295,6 +295,7 @@
 	/*
 	 * Find the proper vm-area
 	 */
+	vmlist_access_lock(mm);
 	vma = find_vma(mm, address);
 	if (vma) {
 		if (address < vma->vm_start)
@@ -302,8 +303,10 @@
 		for (;;) {
 			int result = swap_out_vma(vma, address, gfp_mask);
-			if (result)
+			if (result) {
+				vmlist_access_unlock(mm);
 				return result;
+			}
 			vma = vma->vm_next;
 			if (!vma)
 				break;
@@ -310,6 +313,7 @@
 			address = vma->vm_start;
 		}
 	}
+	vmlist_access_unlock(mm);
 
 	/* We didn't find anything for the process */
 	mm->swap_cnt = 0;
--- /usr/tmp/p_rdiff_a0036H/mlock.c	Tue Oct 12 17:59:06 1999
+++ mm/mlock.c	Tue Oct 12 16:35:25 1999
@@ -13,7 +13,9 @@
 static inline int mlock_fixup_all(struct vm_area_struct * vma, int newflags)
 {
+	vmlist_modify_lock(vma->vm_mm);
 	vma->vm_flags = newflags;
+	vmlist_modify_unlock(vma->vm_mm);
 	return 0;
 }
 
@@ -26,15 +28,17 @@
 	if (!n)
 		return -EAGAIN;
 	*n = *vma;
-	vma->vm_start = end;
 	n->vm_end = end;
-	vma->vm_offset += vma->vm_start - n->vm_start;
 	n->vm_flags = newflags;
 	if (n->vm_file)
 		get_file(n->vm_file);
 	if (n->vm_ops && n->vm_ops->open)
 		n->vm_ops->open(n);
+	vmlist_modify_lock(vma->vm_mm);
+	vma->vm_offset += end - vma->vm_start;
+	vma->vm_start = end;
 	insert_vm_struct(current->mm, n);
+	vmlist_modify_unlock(vma->vm_mm);
 	return 0;
 }
 
@@ -47,7 +51,6 @@
 	if (!n)
 		return -EAGAIN;
 	*n = *vma;
-	vma->vm_end = start;
 	n->vm_start = start;
 	n->vm_offset += n->vm_start - vma->vm_start;
 	n->vm_flags = newflags;
@@ -55,7 +58,10 @@
 		get_file(n->vm_file);
 	if (n->vm_ops && n->vm_ops->open)
 		n->vm_ops->open(n);
+	vmlist_modify_lock(vma->vm_mm);
+	vma->vm_end = start;
 	insert_vm_struct(current->mm, n);
+	vmlist_modify_unlock(vma->vm_mm);
 	return 0;
 }
 
@@ -75,10 +81,7 @@
 	*left = *vma;
 	*right = *vma;
 	left->vm_end = start;
-	vma->vm_start = start;
-	vma->vm_end = end;
 	right->vm_start = end;
-	vma->vm_offset += vma->vm_start - left->vm_start;
 	right->vm_offset += right->vm_start - left->vm_start;
 	vma->vm_flags = newflags;
 	if (vma->vm_file)
@@ -88,8 +91,14 @@
 		vma->vm_ops->open(left);
 		vma->vm_ops->open(right);
 	}
+	vmlist_modify_lock(vma->vm_mm);
+	vma->vm_offset += start - vma->vm_start;
+	vma->vm_start = start;
+	vma->vm_end = end;
+	vma->vm_flags = newflags;
 	insert_vm_struct(current->mm, left);
 	insert_vm_struct(current->mm, right);
+	vmlist_modify_unlock(vma->vm_mm);
 	return 0;
 }
 
@@ -168,7 +177,9 @@
 			break;
 		}
 	}
+	vmlist_modify_lock(current->mm);
 	merge_segments(current->mm, start, end);
+	vmlist_modify_unlock(current->mm);
 	return error;
 }
 
@@ -240,7 +251,9 @@
 		if (error)
 			break;
 	}
+	vmlist_modify_lock(current->mm);
 	merge_segments(current->mm, 0, TASK_SIZE);
+	vmlist_modify_unlock(current->mm);
 	return error;
 }
* Re: locking question: do_mmap(), do_munmap() 1999-10-13 1:25 ` Kanoj Sarcar @ 1999-10-13 7:32 ` Manfred Spraul 1999-10-15 9:58 ` Ralf Baechle 1999-10-13 10:45 ` Stephen C. Tweedie 1 sibling, 1 reply; 51+ messages in thread From: Manfred Spraul @ 1999-10-13 7:32 UTC (permalink / raw) To: Kanoj Sarcar Cc: Stephen C. Tweedie, viro, andrea, linux-kernel, mingo, linux-mm

Kanoj Sarcar wrote:
> Here's a primitive patch showing the direction I am thinking of. I do not
> have any problem with a spinning lock, but I coded this against 2.2.10,
> where insert_vm_struct could go to sleep, hence I had to use sleeping
> locks to protect the vma chain.

I found a few places where I don't know how to change them.

1) arch/mips/mm/r4xx0.c: their flush_cache_range() function internally calls find_vma(). flush_cache_range() is called by proc/mem.c, and it seems that this function cannot get the mmap semaphore. Currently, every caller of flush_cache_range() either owns the kernel lock or the mmap_sem. OTOH, this function contains a race anyway [src_vma can go away if handle_mm_fault() sleeps; src_vma is used at the end of the function].

2) arch/sparc/mm/fault.c:
> 	/* This conditional is 'interesting'. */
> 	if (pgd_val(*pgdp) && !(write && !(pte_val(*ptep) & _SUN4C_PAGE_WRITE))
> 	    && (pte_val(*ptep) & _SUN4C_PAGE_VALID))
> 		/* Note: It is safe to not grab the MMAP semaphore here because
> 		 * we know that update_mmu_cache() will not sleep for
> 		 * any reason (at least not in the current implementation)
> 		 * and therefore there is no danger of another thread getting
> 		 * on the CPU and doing a shrink_mmap() on this vma.
> 		 */
> 		sun4c_update_mmu_cache (find_vma(current->mm, address), address,
> 					*ptep);
> 	else
> 		do_sparc_fault(regs, text_fault, write, address);
> }

Could this be safe because sun4c is UP-only? 
3) include/asm-ppc/pgtable.h:
> extern __inline__ pte_t *find_pte(struct mm_struct *mm, unsigned long va)
> {
> 	pgd_t *dir;
> 	pmd_t *pmd;
> 	pte_t *pte;
>
> 	va &= PAGE_MASK;
>
> 	dir = pgd_offset( mm, va );
> 	if (dir)
> 	{
> 		pmd = pmd_offset(dir, va & PAGE_MASK);
> 		if (pmd && pmd_present(*pmd))
> 		{
> 			pte = pte_offset(pmd, va);
> 			if (pte && pte_present(*pte))
> 			{
> 				pte_uncache(*pte);
> 				flush_tlb_page(find_vma(mm,va),va);
> 			}
> 		}
> 	}
> 	return pte;
> }

Could this be safe because it is only called for "init_mm"?

I've not yet looked at swap_out [mm/swapfile.c and arch/m68k/atari/stram.c] and proc/array.c.

--
Manfred
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://humbolt.geo.uu.nl/Linux-MM/
^ permalink raw reply	[flat|nested] 51+ messages in thread
* Re: locking question: do_mmap(), do_munmap() 1999-10-13 7:32 ` Manfred Spraul @ 1999-10-15 9:58 ` Ralf Baechle 1999-10-15 17:50 ` Kanoj Sarcar 0 siblings, 1 reply; 51+ messages in thread From: Ralf Baechle @ 1999-10-15 9:58 UTC (permalink / raw) To: Manfred Spraul Cc: Kanoj Sarcar, Stephen C. Tweedie, viro, andrea, linux-kernel, mingo, linux-mm, linux, linux-mips, linux-mips

On Wed, Oct 13, 1999 at 09:32:54AM +0200, Manfred Spraul wrote:

> Kanoj Sarcar wrote:
> > Here's a primitive patch showing the direction I am thinking of. I do not
> > have any problem with a spinning lock, but I coded this against 2.2.10,
> > where insert_vm_struct could go to sleep, hence I had to use sleeping
> > locks to protect the vma chain.
>
> I found a few places where I don't know how to change them.
>
> 1) arch/mips/mm/r4xx0.c:
> their flush_cache_range() function internally calls find_vma().
> flush_cache_range() is called by proc/mem.c, and it seems that this
> function cannot get the mmap semaphore.
> Currently, every caller of flush_cache_range() either owns the kernel
> lock or the mmap_sem.
> OTOH, this function contains a race anyway [src_vma can go away if
> handle_mm_fault() sleeps; src_vma is used at the end of the function.]

The sole reason for fiddling with the VMA is that we try to optimize icache flushing for non-VM_EXEC vmas. This optimization is broken as the MIPS hardware doesn't distinguish between read and execute in page permissions, so the icache might be dirty even though the vma has no exec permission. So I'll have to re-implement this whole thing anyway.

The other problem is an efficiency problem. A call like flush_cache_range(some_mm_ptr, 0, TASK_SIZE) would take a minor eternity, and for MIPS64 a full eternity ...

Ralf
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://humbolt.geo.uu.nl/Linux-MM/
^ permalink raw reply	[flat|nested] 51+ messages in thread
* Re: locking question: do_mmap(), do_munmap() 1999-10-15 9:58 ` Ralf Baechle @ 1999-10-15 17:50 ` Kanoj Sarcar 0 siblings, 0 replies; 51+ messages in thread From: Kanoj Sarcar @ 1999-10-15 17:50 UTC (permalink / raw) To: Ralf Baechle Cc: manfreds, sct, viro, andrea, linux-kernel, mingo, linux-mm, linux, linux-mips, linux-mips

>
> On Wed, Oct 13, 1999 at 09:32:54AM +0200, Manfred Spraul wrote:
>
> > Kanoj Sarcar wrote:
> > > Here's a primitive patch showing the direction I am thinking of. I do not
> > > have any problem with a spinning lock, but I coded this against 2.2.10,
> > > where insert_vm_struct could go to sleep, hence I had to use sleeping
> > > locks to protect the vma chain.
> >
> > I found a few places where I don't know how to change them.
> >
> > 1) arch/mips/mm/r4xx0.c:
> > their flush_cache_range() function internally calls find_vma().
> > flush_cache_range() is called by proc/mem.c, and it seems that this
> > function cannot get the mmap semaphore.
> > Currently, every caller of flush_cache_range() either owns the kernel
> > lock or the mmap_sem.
> > OTOH, this function contains a race anyway [src_vma can go away if
> > handle_mm_fault() sleeps; src_vma is used at the end of the function.]
>
> The sole reason for fiddling with the VMA is that we try to optimize
> icache flushing for non-VM_EXEC vmas. This optimization is broken
> as the MIPS hardware doesn't distinguish between read and execute
> in page permissions, so the icache might be dirty even though the vma
> has no exec permission. So I'll have to re-implement this whole thing
> anyway. The other problem is an efficiency problem. A call like
> flush_cache_range(some_mm_ptr, 0, TASK_SIZE) would take a minor eternity
> and for MIPS64 a full eternity ...
>
> Ralf

Ralf,

Looking in 2.3.21, all the find_vma's in arch/mips/mm/r4xx0.c are used to set a flag called "text" which is not used at all. Also, if the find_vma returns null, the code basically does nothing. 
So the optimized icache flushing is probably not implemented yet? Then, the only reason to do the find_vma currently is to check whether the lower-level flush routine should be called. Without holding some locks, this is always tricky to do on a third-party mm.

Btw, this probably belongs on linux-mips, but what do you mean by saying the icache might be dirty? It's been a while since I worked on the older MIPS chips, but as far as I remember, the icache cannot hold dirty lines.

Kanoj
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://humbolt.geo.uu.nl/Linux-MM/
^ permalink raw reply	[flat|nested] 51+ messages in thread
* Re: locking question: do_mmap(), do_munmap() 1999-10-13 1:25 ` Kanoj Sarcar 1999-10-13 7:32 ` Manfred Spraul @ 1999-10-13 10:45 ` Stephen C. Tweedie 1 sibling, 0 replies; 51+ messages in thread From: Stephen C. Tweedie @ 1999-10-13 10:45 UTC (permalink / raw) To: Kanoj Sarcar Cc: Stephen C. Tweedie, manfreds, viro, andrea, linux-kernel, mingo, linux-mm Hi, On Tue, 12 Oct 1999 18:25:42 -0700 (PDT), kanoj@google.engr.sgi.com (Kanoj Sarcar) said: > This is a skeleton of the solution that prevents kswapd from walking > down a vma chain without protections. I am trying to get comments on > this approach before I try a full blown implementation. > The rules: > 1. To modify the vmlist (add/delete), you must hold mmap_sem to > guard against clones doing mmap/munmap/faults, (ie all vm system > calls and faults), and from ptrace, swapin due to swap deletion > etc. > 2. To modify the vmlist (add/delete), you must also hold > vmlist_modify_lock, to guard against page stealers scanning the > list. > 3. To scan the vmlist, you must either > a. grab mmap_sem, which should be all cases except page stealer. > or > b. grab vmlist_access_lock, only done by page stealer. > 4. While holding the vmlist_modify_lock, you must be able to guarantee > that no code path will lead to page stealing. > 5. You must be able to guarantee that while holding vmlist_modify_lock > or vmlist_access_lock of mm A, you will not try to get either lock > for mm B. This looks like the same mechanism and set of rules that Al Viro proposed, and it seems watertight. I'd like the locking written down in the source somewhere if anyone implements this, btw, as otherwise we'll just be fixing it ourselves every time in the future when somebody who doesn't understand them touches anything in the mmap paths... --Stephen -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. 
For more info on Linux MM, see: http://humbolt.geo.uu.nl/Linux-MM/ ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: locking question: do_mmap(), do_munmap() 1999-10-11 18:02 ` Manfred Spraul 1999-10-11 19:07 ` Kanoj Sarcar @ 1999-10-11 20:15 ` Stephen C. Tweedie 1999-10-11 21:14 ` Manfred Spraul 1999-10-11 21:37 ` Alexander Viro 1999-10-11 22:22 ` Stephen C. Tweedie 3 siblings, 1 reply; 51+ messages in thread From: Stephen C. Tweedie @ 1999-10-11 20:15 UTC (permalink / raw) To: Manfred Spraul Cc: Alexander Viro, Stephen C. Tweedie, Andrea Arcangeli, linux-kernel, Ingo Molnar, linux-mm Hi, On Mon, 11 Oct 1999 20:02:40 +0200, Manfred Spraul <manfreds@colorfullife.com> said: > What about something like a rw-semaphore which protects the vma list: > vma-list modifiers [ie merge_segments(), insert_vm_struct() and > do_munmap()] grab it exclusive, swapper grabs it "shared, starve > exclusive". > All other vma-list readers are protected by mm->mmap_sem. > This should not dead-lock, and no changes are required in > vm_ops-> swapout(). The swapout method will need to drop the spinlock. We need to preserve the vma over the call into the swapout method, and the method will need to be able to block. --Stephen -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://humbolt.geo.uu.nl/Linux-MM/ ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: locking question: do_mmap(), do_munmap() 1999-10-11 20:15 ` Stephen C. Tweedie @ 1999-10-11 21:14 ` Manfred Spraul 0 siblings, 0 replies; 51+ messages in thread From: Manfred Spraul @ 1999-10-11 21:14 UTC (permalink / raw) To: Stephen C. Tweedie Cc: Alexander Viro, Andrea Arcangeli, linux-kernel, Ingo Molnar, linux-mm

"Stephen C. Tweedie" wrote:
> The swapout method will need to drop the spinlock. We need to preserve
> the vma over the call into the swapout method, and the method will need
> to be able to block.

No spinlock: a rw-semaphore, ie a multiple-reader, single-writer sync object which calls schedule() when the resource is busy.

IIRC, the vma-list is only modified by
* insert_vm_struct(): never sleeps, doesn't allocate memory. No problems with swap-out.
* merge_segments(): ditto.
* do_munmap(): the area which modifies the vma-list makes no memory allocations, should make no problems under low memory.
--> everyone who needs exclusive access is OOM-safe.

Additionally, the swap-out should use a "starve writer" policy, ie there will be no deadlocks with multiple concurrent swap-outs in the same "struct mm" [concurrent means overlapped io, still serialized by lock_kernel()]. I think the result should be OOM-safe without touching vm_ops->swapout().

--
Manfred
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://humbolt.geo.uu.nl/Linux-MM/
^ permalink raw reply	[flat|nested] 51+ messages in thread
* Re: locking question: do_mmap(), do_munmap() 1999-10-11 18:02 ` Manfred Spraul 1999-10-11 19:07 ` Kanoj Sarcar 1999-10-11 20:15 ` Stephen C. Tweedie @ 1999-10-11 21:37 ` Alexander Viro 1999-10-11 22:13 ` Manfred Spraul 1999-10-11 22:22 ` Stephen C. Tweedie 3 siblings, 1 reply; 51+ messages in thread From: Alexander Viro @ 1999-10-11 21:37 UTC (permalink / raw) To: Manfred Spraul Cc: Stephen C. Tweedie, Andrea Arcangeli, linux-kernel, Ingo Molnar, linux-mm On Mon, 11 Oct 1999, Manfred Spraul wrote: > Alexander Viro wrote: > > > > On Mon, 11 Oct 1999, Stephen C. Tweedie wrote: > > > > > Hi, > > > > > > On Sun, 10 Oct 1999 15:03:45 -0400 (EDT), Alexander Viro > > > <viro@math.psu.edu> said: > > > > > > > Hold on. In swap_out_mm() you have to protect find_vma() (OK, it doesn't > > > > block, but we'll have to take care of mm->mmap_cache) _and_ you'll have to > > > > protect vma from destruction all way down to try_to_swap_out(). And to > > > > vma->swapout(). Which can sleep, so spinlocks are out of question > > > > here. > > > > > > No, spinlocks would be ideal. The vma swapout codes _have_ to be > > > prepared for the vma to be destroyed as soon as we sleep. In fact, the > > > entire mm may disappear if the process happens to exit. Once we know > > > which page to write where, the swapout operation becomes a per-page > > > operation, not per-vma. > > > > Aha, so you propose to drop it in ->swapout(), right? (after get_file() in > > filemap_write_page()... Ouch. Probably we'ld better lambda-expand the call > > in filemap_swapout() - the thing is called from other places too)... > > What about something like a rw-semaphore which protects the vma list: > vma-list modifiers [ie merge_segments(), insert_vm_struct() and > do_munmap()] grab it exclusive, swapper grabs it "shared, starve > exclusive". > All other vma-list readers are protected by mm->mmap_sem. > > This should not dead-lock, and no changes are required in > vm_ops->swapout(). 
What does it buy you over the simple semaphore here? Do you really see a contention scenario? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://humbolt.geo.uu.nl/Linux-MM/ ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: locking question: do_mmap(), do_munmap() 1999-10-11 21:37 ` Alexander Viro @ 1999-10-11 22:13 ` Manfred Spraul 0 siblings, 0 replies; 51+ messages in thread From: Manfred Spraul @ 1999-10-11 22:13 UTC (permalink / raw) To: Alexander Viro Cc: Stephen C. Tweedie, Andrea Arcangeli, linux-kernel, Ingo Molnar, linux-mm

Alexander Viro wrote:
> What does it buy you over the simple semaphore here? Do you really see a
> contention scenario?

I think you are right; I see no case where a normal semaphore would lock up and the rw-semaphore would not.

We win something _if_ vm_ops->swapout() is extremely slow:
* with lock_kernel() [ie currently], multiple threads can sleep within vm_ops->swapout() of the same "struct mm"
* an rw-semaphore would mimic that behaviour.
* a normal semaphore would prevent that.

I'm not sure it is worth implementing a rw-semaphore, especially since we win something only in a very obscure case, but we lose CPU cycles for every down_rw()/up_rw() [there is no 2-asm-instruction rw-semaphore].

--
Manfred
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://humbolt.geo.uu.nl/Linux-MM/
^ permalink raw reply	[flat|nested] 51+ messages in thread
* Re: locking question: do_mmap(), do_munmap() 1999-10-11 18:02 ` Manfred Spraul ` (2 preceding siblings ...) 1999-10-11 21:37 ` Alexander Viro @ 1999-10-11 22:22 ` Stephen C. Tweedie 1999-10-11 23:01 ` Alexander Viro 3 siblings, 1 reply; 51+ messages in thread From: Stephen C. Tweedie @ 1999-10-11 22:22 UTC (permalink / raw) To: Manfred Spraul Cc: Alexander Viro, Stephen C. Tweedie, Andrea Arcangeli, linux-kernel, Ingo Molnar, linux-mm Hi, On Mon, 11 Oct 1999 20:02:40 +0200, Manfred Spraul <manfreds@colorfullife.com> said: > What about something like a rw-semaphore which protects the vma list: > vma-list modifiers [ie merge_segments(), insert_vm_struct() and > do_munmap()] grab it exclusive, swapper grabs it "shared, starve > exclusive". Deadlock. Process A tries to do an mmap on mm A, gets the exclusive lock, tries to swap out from process B, and grabs mm B's shared lock. Process B in the mean time is doing the same thing and has an exclusive lock on mm B, and is trying to share-lock A. Whoops. > This should not dead-lock, Are you sure?! --Stephen -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://humbolt.geo.uu.nl/Linux-MM/ ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: locking question: do_mmap(), do_munmap() 1999-10-11 22:22 ` Stephen C. Tweedie @ 1999-10-11 23:01 ` Alexander Viro 1999-10-12 14:06 ` [more fun] " Alexander Viro 1999-10-13 10:16 ` Stephen C. Tweedie 0 siblings, 2 replies; 51+ messages in thread From: Alexander Viro @ 1999-10-11 23:01 UTC (permalink / raw) To: Stephen C. Tweedie Cc: Manfred Spraul, Andrea Arcangeli, linux-kernel, Ingo Molnar, linux-mm On Mon, 11 Oct 1999, Stephen C. Tweedie wrote: > Hi, > > On Mon, 11 Oct 1999 20:02:40 +0200, Manfred Spraul > <manfreds@colorfullife.com> said: > > > What about something like a rw-semaphore which protects the vma list: > > vma-list modifiers [ie merge_segments(), insert_vm_struct() and > > do_munmap()] grab it exclusive, swapper grabs it "shared, starve > > exclusive". > > Deadlock. Process A tries to do an mmap on mm A, gets the exclusive > lock, tries to swap out from process B, and grabs mm B's shared lock. > Process B in the mean time is doing the same thing and has an exclusive > lock on mm B, and is trying to share-lock A. Whoops. <looking at the places in question> insert_vm_struct doesn't allocate anything. Ditto for merge_segments In do_munmap() the area that should be protected (ripping the vmas from the list) doesn't allocate anything too. In the swapper we are protected from recursion, aren't we? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://humbolt.geo.uu.nl/Linux-MM/ ^ permalink raw reply [flat|nested] 51+ messages in thread
* [more fun] Re: locking question: do_mmap(), do_munmap() 1999-10-11 23:01 ` Alexander Viro @ 1999-10-12 14:06 ` Alexander Viro 1999-10-13 7:35 ` Manfred Spraul 1999-10-13 10:16 ` Stephen C. Tweedie 1 sibling, 1 reply; 51+ messages in thread From: Alexander Viro @ 1999-10-12 14:06 UTC (permalink / raw) To: Stephen C. Tweedie Cc: Manfred Spraul, Andrea Arcangeli, linux-kernel, Ingo Molnar, Linus Torvalds, linux-mm

Funny path #1: on 386 (sucky WP) copy_to_user() -> access_ok() -> __verify_write() -> handle_mm_fault(), and no mmap_sem in sight. Ditto for __verify_write() on sh.

Another one: ptrace_readdata() -> access_process_vm() -> find_extend_vma() -> make_pages_present(). Again, no mmap_sem in sight.

irix_brk(): calls do_brk() without mmap_sem.

sys_cacheflush() (on m68k): plays with vma without mmap_sem.

Patch follows (2.3.20, but these files didn't change in .21).

diff -urN linux-2.3.20/arch/i386/mm/fault.c linux-bird.mm/arch/i386/mm/fault.c
--- linux-2.3.20/arch/i386/mm/fault.c	Sun Sep 12 11:01:01 1999
+++ linux-bird.mm/arch/i386/mm/fault.c	Tue Oct 12 07:44:55 1999
@@ -35,6 +35,7 @@
 	if (!size)
 		return 1;
 
+	down(&current->mm->mmap_sem);
 	vma = find_vma(current->mm, start);
 	if (!vma)
 		goto bad_area;
@@ -64,6 +65,7 @@
 		if (!(vma->vm_flags & VM_WRITE))
 			goto bad_area;;
 	}
+	up(&current->mm->mmap_sem);
 	return 1;
 
 check_stack:
@@ -73,6 +75,7 @@
 		goto good_area;
 
 bad_area:
+	up(&current->mm->mmap_sem);
 	return 0;
 }
diff -urN linux-2.3.20/arch/m68k/kernel/sys_m68k.c linux-bird.mm/arch/m68k/kernel/sys_m68k.c
--- linux-2.3.20/arch/m68k/kernel/sys_m68k.c	Mon Jun 21 12:35:55 1999
+++ linux-bird.mm/arch/m68k/kernel/sys_m68k.c	Tue Oct 12 09:56:24 1999
@@ -535,6 +535,7 @@
 	int ret = -EINVAL;
 
 	lock_kernel();
+	down(&current->mm->mmap_sem);
 	if (scope < FLUSH_SCOPE_LINE || scope > FLUSH_SCOPE_ALL ||
 	    cache & ~FLUSH_CACHE_BOTH)
 		goto out;
@@ -591,6 +592,7 @@
 			ret = cache_flush_060 (addr, scope, cache, len);
 	}
 out:
+	up(&current->mm->mmap_sem);
 	unlock_kernel();
 	return ret;
 }
diff -urN linux-2.3.20/arch/mips/kernel/sysirix.c linux-bird.mm/arch/mips/kernel/sysirix.c
--- linux-2.3.20/arch/mips/kernel/sysirix.c	Sun Sep 12 05:54:08 1999
+++ linux-bird.mm/arch/mips/kernel/sysirix.c	Tue Oct 12 09:46:09 1999
@@ -534,6 +534,7 @@
 	int ret;
 
 	lock_kernel();
+	down(&current->mm->mmap_sem);
 	if (brk < current->mm->end_code) {
 		ret = -ENOMEM;
 		goto out;
@@ -591,6 +592,7 @@
 	ret = 0;
 out:
+	up(&current->mm->mmap_sem);
 	unlock_kernel();
 	return ret;
 }
diff -urN linux-2.3.20/arch/sh/mm/fault.c linux-bird.mm/arch/sh/mm/fault.c
--- linux-2.3.20/arch/sh/mm/fault.c	Sun Sep 12 13:29:49 1999
+++ linux-bird.mm/arch/sh/mm/fault.c	Tue Oct 12 09:57:03 1999
@@ -38,6 +38,7 @@
 	if (!size)
 		return 1;
 
+	down(&current->mm->mmap_sem);
 	vma = find_vma(current->mm, start);
 	if (!vma)
 		goto bad_area;
@@ -67,6 +68,7 @@
 		if (!(vma->vm_flags & VM_WRITE))
 			goto bad_area;;
 	}
+	up(&current->mm->mmap_sem);
 	return 1;
 
 check_stack:
@@ -76,6 +78,7 @@
 		goto good_area;
 
 bad_area:
+	up(&current->mm->mmap_sem);
 	return 0;
 }
diff -urN linux-2.3.20/kernel/ptrace.c linux-bird.mm/kernel/ptrace.c
--- linux-2.3.20/kernel/ptrace.c	Sun Sep 12 13:03:24 1999
+++ linux-bird.mm/kernel/ptrace.c	Tue Oct 12 09:15:27 1999
@@ -79,14 +79,15 @@
 int access_process_vm(struct task_struct *tsk, unsigned long addr, void *buf, int len, int write)
 {
-	int copied;
-	struct vm_area_struct * vma = find_extend_vma(tsk, addr);
+	int copied = 0;
+	struct vm_area_struct * vma;
+
+	down(&tsk->mm->mmap_sem);
+	vma = find_extend_vma(tsk, addr);
 	if (!vma)
-		return 0;
+		goto out;
 
-	down(&tsk->mm->mmap_sem);
-	copied = 0;
 	for (;;) {
 		unsigned long offset = addr & ~PAGE_MASK;
 		int this_len = PAGE_SIZE - offset;
@@ -115,6 +116,7 @@
 		vma = vma->vm_next;
 	}
+out:
 	up(&tsk->mm->mmap_sem);
 	return copied;
 }
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://humbolt.geo.uu.nl/Linux-MM/
^ permalink raw reply	[flat|nested] 51+ messages in thread
* Re: [more fun] Re: locking question: do_mmap(), do_munmap() 1999-10-12 14:06 ` [more fun] " Alexander Viro @ 1999-10-13 7:35 ` Manfred Spraul 1999-10-13 18:34 ` Kanoj Sarcar 0 siblings, 1 reply; 51+ messages in thread From: Manfred Spraul @ 1999-10-13 7:35 UTC (permalink / raw) To: Alexander Viro Cc: Stephen C. Tweedie, Andrea Arcangeli, linux-kernel, Ingo Molnar, Linus Torvalds, linux-mm

Alexander Viro wrote:
> Another one: ptrace_readdata() -> access_process_vm() -> find_extend_vma()
> -> make_pages_present(). Again, no mmap_sem in sight.

Can 2 processes ptrace() each other? If yes, then this will lock up if you acquire the mmap_sem.

--
Manfred
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://humbolt.geo.uu.nl/Linux-MM/
^ permalink raw reply	[flat|nested] 51+ messages in thread
* Re: [more fun] Re: locking question: do_mmap(), do_munmap() 1999-10-13 7:35 ` Manfred Spraul @ 1999-10-13 18:34 ` Kanoj Sarcar 0 siblings, 0 replies; 51+ messages in thread From: Kanoj Sarcar @ 1999-10-13 18:34 UTC (permalink / raw) To: Manfred Spraul; +Cc: viro, sct, andrea, linux-kernel, mingo, torvalds, linux-mm > > Alexander Viro wrote: > > Another one: ptrace_readdata() -> access_process_vm() -> find_extend_vma() > > -> make_pages_present(). Again, no mmap_sem in sight. Actually, I think access_process_vm() does get mmap_sem, although after doing the find_extend_vma, which should be fixed. I will take care of it in my patch. > > Can 2 processes ptrace() each other? If yes, then this will lock-up if > you acquire the mmap_sem. > Theoretically, there is no problem with 2 processes ptrace()ing each other. IMO, no deadlock is possible in the approach I posted. Kanoj > -- > Manfred > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://humbolt.geo.uu.nl/Linux-MM/ > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://humbolt.geo.uu.nl/Linux-MM/ ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: locking question: do_mmap(), do_munmap() 1999-10-11 23:01 ` Alexander Viro 1999-10-12 14:06 ` [more fun] " Alexander Viro @ 1999-10-13 10:16 ` Stephen C. Tweedie 1 sibling, 0 replies; 51+ messages in thread From: Stephen C. Tweedie @ 1999-10-13 10:16 UTC (permalink / raw) To: Alexander Viro Cc: Stephen C. Tweedie, Manfred Spraul, Andrea Arcangeli, linux-kernel, Ingo Molnar, linux-mm

Hi,

On Mon, 11 Oct 1999 19:01:12 -0400 (EDT), Alexander Viro <viro@math.psu.edu> said:

>> Deadlock. Process A tries to do an mmap on mm A, gets the exclusive
>> lock, tries to swap out from process B, and grabs mm B's shared lock.
>> Process B in the mean time is doing the same thing and has an exclusive
>> lock on mm B, and is trying to share-lock A. Whoops.

> <looking at the places in question>
> insert_vm_struct doesn't allocate anything.
> Ditto for merge_segments
> In do_munmap() the area that should be protected (ripping the vmas from
> the list) doesn't allocate anything too.

No such luck. We don't take the mm semaphore in any of those places: we already hold it. The mmap() operation itself takes the semaphore, and it _certainly_ allocates memory.

You could take it shared and promote it later once the memory has been allocated, but that will just open up a new class of promotion deadlocks. Or you could drop and retake after the allocations, but then you lose the protection of the semaphore and you have to go back and revalidate your vma once you have got the exclusive lock (which _is_ possible, especially with a generation number in the vma, but it's hideously ugly). That's exactly why taking the semaphore in the swapper is so dangerous.

> In the swapper we are protected from recursion, aren't we?

Yes: the PF_MEMALLOC flag sees to that.

--Stephen
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://humbolt.geo.uu.nl/Linux-MM/
^ permalink raw reply	[flat|nested] 51+ messages in thread
* Re: locking question: do_mmap(), do_munmap() 1999-10-11 16:05 ` Alexander Viro 1999-10-11 18:02 ` Manfred Spraul @ 1999-10-11 20:13 ` Stephen C. Tweedie 1999-10-11 21:40 ` Alexander Viro 1 sibling, 1 reply; 51+ messages in thread From: Stephen C. Tweedie @ 1999-10-11 20:13 UTC (permalink / raw) To: Alexander Viro Cc: Stephen C. Tweedie, Manfred Spraul, Andrea Arcangeli, linux-kernel, Ingo Molnar, linux-mm Hi, On Mon, 11 Oct 1999 12:05:23 -0400 (EDT), Alexander Viro <viro@math.psu.edu> said: > On Mon, 11 Oct 1999, Stephen C. Tweedie wrote: >> No, spinlocks would be ideal. The vma swapout codes _have_ to be >> prepared for the vma to be destroyed as soon as we sleep. In fact, the >> entire mm may disappear if the process happens to exit. Once we know >> which page to write where, the swapout operation becomes a per-page >> operation, not per-vma. > Aha, so you propose to drop it in ->swapout(), right? (after get_file() in > filemap_write_page()... Ouch. Probably we'ld better lambda-expand the call > in filemap_swapout() - the thing is called from other places too)... Right now it is the big kernel lock which is used for this, and the scheduler drops it anyway for us. If anyone wants to replace that lock with another spinlock, then yes, the swapout method would have to drop it before doing anything which could block. And that is ugly: having spinlocks unbalanced over function calls is a maintenance nightmare. --Stephen -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://humbolt.geo.uu.nl/Linux-MM/ ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: locking question: do_mmap(), do_munmap() 1999-10-11 20:13 ` Stephen C. Tweedie @ 1999-10-11 21:40 ` Alexander Viro 1999-10-11 22:20 ` Stephen C. Tweedie 0 siblings, 1 reply; 51+ messages in thread From: Alexander Viro @ 1999-10-11 21:40 UTC (permalink / raw) To: Stephen C. Tweedie Cc: Manfred Spraul, Andrea Arcangeli, linux-kernel, Ingo Molnar, linux-mm On Mon, 11 Oct 1999, Stephen C. Tweedie wrote: > Hi, > > On Mon, 11 Oct 1999 12:05:23 -0400 (EDT), Alexander Viro > <viro@math.psu.edu> said: > > > On Mon, 11 Oct 1999, Stephen C. Tweedie wrote: > >> No, spinlocks would be ideal. The vma swapout codes _have_ to be > >> prepared for the vma to be destroyed as soon as we sleep. In fact, the > >> entire mm may disappear if the process happens to exit. Once we know > >> which page to write where, the swapout operation becomes a per-page > >> operation, not per-vma. > > > Aha, so you propose to drop it in ->swapout(), right? (after get_file() in > > filemap_write_page()... Ouch. Probably we'ld better lambda-expand the call > > in filemap_swapout() - the thing is called from other places too)... > > Right now it is the big kernel lock which is used for this, and the > scheduler drops it anyway for us. If anyone wants to replace that lock > with another spinlock, then yes, the swapout method would have to drop > it before doing anything which could block. And that is ugly: having > spinlocks unbalanced over function calls is a maintenance nightmare. Agreed, but the big lock does not (and IMHO should not) cover the vma list modifications. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://humbolt.geo.uu.nl/Linux-MM/ ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: locking question: do_mmap(), do_munmap() 1999-10-11 21:40 ` Alexander Viro @ 1999-10-11 22:20 ` Stephen C. Tweedie 1999-10-11 22:31 ` Alexander Viro 0 siblings, 1 reply; 51+ messages in thread From: Stephen C. Tweedie @ 1999-10-11 22:20 UTC (permalink / raw) To: Alexander Viro Cc: Stephen C. Tweedie, Manfred Spraul, Andrea Arcangeli, linux-kernel, Ingo Molnar, linux-mm Hi, On Mon, 11 Oct 1999 17:40:52 -0400 (EDT), Alexander Viro <viro@math.psu.edu> said: > Agreed, but the big lock does not (and IMHO should not) cover the vma list > modifications. Fine, but as I've said you need _something_. It doesn't matter what, but the fact that the kernel lock is no longer being held for vma updates has introduced swapper races. We can't fix those without either restoring or replacing the big lock. --Stephen -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://humbolt.geo.uu.nl/Linux-MM/ ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: locking question: do_mmap(), do_munmap()
  1999-10-11 22:20               ` Stephen C. Tweedie
@ 1999-10-11 22:31                 ` Alexander Viro
  1999-10-13 10:25                   ` Stephen C. Tweedie
  0 siblings, 1 reply; 51+ messages in thread
From: Alexander Viro @ 1999-10-11 22:31 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Manfred Spraul, Andrea Arcangeli, linux-kernel, Ingo Molnar, linux-mm

On Mon, 11 Oct 1999, Stephen C. Tweedie wrote:

> Hi,
>
> On Mon, 11 Oct 1999 17:40:52 -0400 (EDT), Alexander Viro
> <viro@math.psu.edu> said:
>
> > Agreed, but the big lock does not (and IMHO should not) cover the vma list
> > modifications.
>
> Fine, but as I've said you need _something_. It doesn't matter what,
> but the fact that the kernel lock is no longer being held for vma
> updates has introduced swapper races. We can't fix those without either
> restoring or replacing the big lock.

And a spinlock being released in ->swapout() is outright ugly. OK, so
we are adding a new semaphore (vma_sem) to mm_struct and taking it
around the places where the list is modified + in the swapper (for
scanning). In the normal situation it will never give us contention -
everyone except the swapper uses it with mmap_sem already held. Are
there any objections against it? If it's OK I'll go ahead and do it.
Comments?
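The vma_sem scheme proposed here can be modeled in user space with two mutexes: vma-list writers already hold mmap_sem and additionally take vma_sem, while the swapper's scan takes only vma_sem. A minimal sketch, assuming ordinary POSIX threads as a stand-in for the kernel primitives (the names here are illustrative, not the real 2.3 kernel API):

```c
#include <pthread.h>

/* User-space model of the vma_sem proposal: writers take mmap_sem
 * first and then vma_sem; the swapper takes only vma_sem.  A writer
 * and the swapper therefore always exclude each other on vma_sem,
 * without the swapper ever touching mmap_sem. */

static pthread_mutex_t mmap_sem = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t vma_sem  = PTHREAD_MUTEX_INITIALIZER;
static int vma_count;                    /* stand-in for the vma list */

static void modify_vma_list(void)
{
    pthread_mutex_lock(&mmap_sem);       /* held by every mmap/munmap path */
    pthread_mutex_lock(&vma_sem);        /* extra lock: excludes the swapper */
    vma_count++;                         /* list modification goes here */
    pthread_mutex_unlock(&vma_sem);
    pthread_mutex_unlock(&mmap_sem);
}

static int swapper_scan(void)
{
    int n;
    pthread_mutex_lock(&vma_sem);        /* the swapper needs only vma_sem */
    n = vma_count;                       /* list is stable while it is held */
    pthread_mutex_unlock(&vma_sem);
    return n;
}
```

Since every writer already owns mmap_sem, two writers never contend on vma_sem, which matches the "no contention in the normal situation" claim.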
* Re: locking question: do_mmap(), do_munmap()
  1999-10-11 22:31                 ` Alexander Viro
@ 1999-10-13 10:25                   ` Stephen C. Tweedie
  0 siblings, 0 replies; 51+ messages in thread
From: Stephen C. Tweedie @ 1999-10-13 10:25 UTC (permalink / raw)
  To: Alexander Viro
  Cc: Stephen C. Tweedie, Manfred Spraul, Andrea Arcangeli, linux-kernel, Ingo Molnar, linux-mm

Hi,

On Mon, 11 Oct 1999 18:31:37 -0400 (EDT), Alexander Viro
<viro@math.psu.edu> said:

> And spinlock being released in the ->swapout() is outright ugly. OK, so
> we are adding to mm_struct a new semaphore (vma_sem) and getting it around
> the places where the list is modified + in the swapper (for scanning). In
> normal situation it will never give us contention - everyone except
> swapper uses it with mmap_sem already held. Are there any objections
> against it? If it's OK I'll go ahead and do it. Comments?

Looks OK as long as the swapper remains non-recursive and we never,
ever allocate memory outside the swapper with vma_sem held.

--Stephen
* Re: locking question: do_mmap(), do_munmap()
  1999-10-10 18:42         ` Manfred Spraul
  1999-10-10 19:03           ` Alexander Viro
@ 1999-10-11 15:47           ` Stephen C. Tweedie
  1 sibling, 0 replies; 51+ messages in thread
From: Stephen C. Tweedie @ 1999-10-11 15:47 UTC (permalink / raw)
  To: Manfred Spraul
  Cc: Alexander Viro, Andrea Arcangeli, linux-kernel, Ingo Molnar, linux-mm, Stephen Tweedie

Hi,

On Sun, 10 Oct 1999 20:42:31 +0200, Manfred Spraul
<manfreds@colorfullife.com> said:

>> I'm not sure that it will work (we scan the thing in many places and
>> quite a few may be blocking ;-/), unless you propose to protect individual
>> steps of the scan, which will give you lots of overhead.

> The overhead should be low, we could keep the "double synchronization",
> ie
> * either down(&mm->mmap_sem) or spin_lock(&mm->vma_list_lock) for read
> * both locks for write.

Looks good.

> That would be a good idea:
> For multi-threaded applications, swap-in is currently single-threaded,
> ie we do not overlap the io operations if 2 threads of the same process
> cause page faults. Everything is fully serialized.

That can be fixed another way --- it is relatively simple to drop the
semaphore while we read the page from disk and revalidate the page
fault context when we unblock again. A sequence number against vma
list changes plus a double-check against the contents of the pte would
be sufficient here. If the context has changed, we can just return:
we'll get another page fault from scratch, and the page will already
be in cache.

We already have to do part of this work anyway because every time we
block while processing a write fault, we have to double-check to see
if the swapper removed the pte while we were asleep.

--Stephen
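The revalidation Tweedie describes - sample a sequence number before dropping the semaphore for I/O, re-check it afterwards, and retry the fault if it moved - can be sketched as a small user-space model. This is not kernel code; the names and the I/O callback are illustrative assumptions:

```c
#include <assert.h>
#include <stdbool.h>

/* Sequence number bumped on every vma-list change (sketch). */
static unsigned long vma_list_seq;

static void modify_vma_list(void) { vma_list_seq++; }

/* Disk read during which no racing unmap happens. */
static void quiet_io(void) { }

/* Returns true if the swapped-in page may be installed, false if the
 * context changed while we slept and the fault must be retried from
 * scratch (the page will then already be in cache). */
static bool swap_in(void (*do_io)(void))
{
    unsigned long seq = vma_list_seq;   /* sampled while the lock is held */
    do_io();                            /* lock dropped across the disk read */
    if (vma_list_seq != seq)
        return false;                   /* vma list changed: just return */
    /* a real implementation would also double-check the pte itself here */
    return true;
}
```

Passing `modify_vma_list` as the I/O callback simulates an unmap racing with the disk read, which is exactly the case the sequence check catches.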
* Re: locking question: do_mmap(), do_munmap()
  1999-10-10 17:12     ` Andrea Arcangeli
  1999-10-10 17:48       ` Alexander Viro
@ 1999-10-11 15:43       ` Stephen C. Tweedie
  1 sibling, 0 replies; 51+ messages in thread
From: Stephen C. Tweedie @ 1999-10-11 15:43 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Alexander Viro, Manfred Spraul, linux-kernel, Ingo Molnar, linux-mm, Stephen Tweedie

Hi,

On Sun, 10 Oct 1999 19:12:58 +0200 (CEST), Andrea Arcangeli
<andrea@suse.de> said:

> The other option is to make the mmap semaphore recursive, checking that
> GFP is not called in the middle of a vma change. I don't like this one;
> it sounds less robust than the spinlock way to me (see below).

Doesn't work, because you can still have a process which takes one
mmap semaphore and then attempts to take a different one inside the
swapper as the result of a memory allocation. As soon as you have two
processes doing that to each other's semaphores, you still have a
deadlock.

--Stephen
* Re: locking question: do_mmap(), do_munmap()
  1999-10-10 16:07   ` Alexander Viro
  1999-10-10 16:25     ` Alexander Viro
@ 1999-10-10 16:56     ` Andrea Arcangeli
  1999-10-11 15:41     ` Stephen C. Tweedie
  2 siblings, 0 replies; 51+ messages in thread
From: Andrea Arcangeli @ 1999-10-10 16:56 UTC (permalink / raw)
  To: Alexander Viro
  Cc: Manfred Spraul, linux-kernel, Ingo Molnar, linux-mm

On Sun, 10 Oct 1999, Alexander Viro wrote:

> Manfred, Andrea - please stop it. Yes, it does and yes, it should.

I don't want to use the big kernel lock, of course.

Andrea
* Re: locking question: do_mmap(), do_munmap()
  1999-10-10 16:07   ` Alexander Viro
  1999-10-10 16:25     ` Alexander Viro
  1999-10-10 16:56     ` Andrea Arcangeli
@ 1999-10-11 15:41     ` Stephen C. Tweedie
  1999-10-11 15:52       ` Alexander Viro
  2 siblings, 1 reply; 51+ messages in thread
From: Stephen C. Tweedie @ 1999-10-11 15:41 UTC (permalink / raw)
  To: Alexander Viro
  Cc: Manfred Spraul, Andrea Arcangeli, linux-kernel, Ingo Molnar, linux-mm, Stephen Tweedie

Hi,

On Sun, 10 Oct 1999 12:07:38 -0400 (EDT), Alexander Viro
<viro@math.psu.edu> said:

>> eg. sys_mprotect calls merge_segments without lock_kernel().

> Manfred, Andrea - please stop it. Yes, it does and yes, it should.
> Plonking the big lock around every access to VM is _not_ a solution. If
> swapper doesn't use mmap_sem - _swapper_ should be fixed. How the hell
> does lock_kernel() have smaller deadlock potential than
> down(&mm->mmap_sem)?

The swapout code cannot claim the mmap semaphore. There are just too
many deadlock possibilities. For example, the whole VM assumes that it
is safe to try to allocate memory while holding the mmap semaphore.
How are you going to make that work if we are short of immediately
free pages and the allocation request recurses into the swapper?

The swapper has very strict requirements: to avoid blocking it
requires the big lock and the page table spinlocks, so that it can
survive without the mm semaphore. Adding the mm semaphore to the
swapout loop is not really an option. That means that you need the
kernel lock when modifying vma lists.

We can, however, improve things by using a per-mm spinlock instead of
using the kernel lock to provide that guarantee.

--Stephen
* Re: locking question: do_mmap(), do_munmap()
  1999-10-11 15:41     ` Stephen C. Tweedie
@ 1999-10-11 15:52       ` Alexander Viro
  0 siblings, 0 replies; 51+ messages in thread
From: Alexander Viro @ 1999-10-11 15:52 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Manfred Spraul, Andrea Arcangeli, linux-kernel, Ingo Molnar, linux-mm

On Mon, 11 Oct 1999, Stephen C. Tweedie wrote:

> The swapper has very strict requirements: to avoid blocking it requires
> the big lock and the page table spinlocks, so that it can survive
> without the mm semaphore. Adding the mm semaphore to the swapout loop
> is not really an option. That means that you need the kernel lock when
> modifying vma lists.

Ouch...

> We can, however, improve things by using a per-mm spinlock instead of
> using the kernel lock to provide that guarantee.

->swapout() may block. We have three areas here:
	1. vma accesses in swapper.
	2. vma list reads outside of swapper.
	3. vma modifications/destruction.
Looks like we need exclusion between 1 and 3 (on a per-mm basis, that
is).
* locking question: do_mmap(), do_munmap()
@ 1999-10-09 12:48 Manfred Spraul
  1999-10-09 13:12 ` Alexander Viro
  0 siblings, 1 reply; 51+ messages in thread
From: Manfred Spraul @ 1999-10-09 12:48 UTC (permalink / raw)
  To: linux-kernel, linux-mm

Which semaphores/spinlocks protect do_mmap() and do_munmap()?

do_mmap():
I wrote a test patch, and I found out that some (all?) callers call
lock_kernel(), but that mm->mmap_sem is _NOT ACQUIRED_. Eg
sys_uselib() -> do_load_elf_??() -> do_mmap().

do_munmap(): ????
I think here is a race: sys_munmap() doesn't call lock_kernel(), it
only acquires mm->mmap_sem. do_mmap() internally calls do_munmap(),
ie with the kernel lock, but without mm->mmap_sem.

What about adding debugging runtime checks to these functions? Ie
#defines which call down_trylock() and spin_trylock() and oops on
missing locks? We could define them to "(void)0" before 2.4.

--
	Manfred
* Re: locking question: do_mmap(), do_munmap()
  1999-10-09 12:48 Manfred Spraul
@ 1999-10-09 13:12 ` Alexander Viro
  1999-10-09 13:17   ` Manfred Spraul
  1999-10-09 16:01   ` Andrea Arcangeli
  0 siblings, 2 replies; 51+ messages in thread
From: Alexander Viro @ 1999-10-09 13:12 UTC (permalink / raw)
  To: Manfred Spraul
  Cc: linux-kernel, linux-mm

On Sat, 9 Oct 1999, Manfred Spraul wrote:

> which semaphores/spinlocks protect do_mmap() and do_munmap()?
>
> do_mmap():
> I wrote a test patch, and I found out that some (all?) callers call
> lock_kernel(), but that mm->mmap_sem is _NOT ACQUIRED_. Eg
> sys_uselib()-> do_load_elf_??() -> do_mmap().
>
> do_munmap():
> ???? I think here is a race:
> sys_munmap() doesn't call lock_kernel(), it only acquires mm->mmap_sem.
> do_mmap() internally calls do_munmap(), ie with the kernel lock, but
> without mm->mmap_sem.

do_munmap() doesn't need the big lock. do_mmap() callers should grab
the semaphore, unless they are sure that nobody else owns the same
context. The big lock on do_mmap() is still needed. do_mmap() in
do_load_elf_binary() is OK (we are the sole owners of the context),
but in do_load_elf_library() it is _not_. Moreover, sys_uselib() may
do interesting things to cloned processes. IMO the right thing would
be to check the number of mm users.
* Re: locking question: do_mmap(), do_munmap()
  1999-10-09 13:12 ` Alexander Viro
@ 1999-10-09 13:17   ` Manfred Spraul
  1999-10-09 13:38     ` Alexander Viro
  1 sibling, 1 reply; 51+ messages in thread
From: Manfred Spraul @ 1999-10-09 13:17 UTC (permalink / raw)
  To: Alexander Viro
  Cc: linux-kernel, linux-mm

Alexander Viro wrote:
> Moreover, sys_uselib() may do
> interesting things to cloned processes. IMO the right thing would be to
> check for the number of mm users.

I don't know the details of the mm implementation, but if there is
only one user, then down(&mm->mmap_sem) will never sleep, and you lose
nothing by getting the semaphore.

I would prefer a clean implementation, ie always down() before
do_mmap(), and ASSERT_DOWN() macros to enforce this.

--
	Manfred
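The ASSERT_DOWN() idea can be illustrated with a tiny user-space model. In the 2.3 kernel a held semaphore has a count of zero or below, so a debug build could verify that the caller really holds mmap_sem before touching the vma list; the toy `struct semaphore` and all function names below are assumptions for illustration, not the real kernel API:

```c
#include <assert.h>

/* Toy stand-in for the kernel semaphore: count starts at 1 (free),
 * down() takes it, up() releases it.  A held semaphore therefore has
 * count <= 0, which is what the debugging macro checks. */
struct semaphore { int count; };

static void down(struct semaphore *s) { s->count--; }
static void up(struct semaphore *s)   { s->count++; }

/* Debug check: oops (here, assert) if the caller forgot to down()
 * the semaphore.  Before 2.4 this could be defined to "(void)0". */
#define ASSERT_DOWN(sem) assert((sem)->count <= 0)

static int vma_list_len;

/* Hypothetical vma-list modifier that enforces the locking rule. */
static void insert_vma(struct semaphore *mmap_sem)
{
    ASSERT_DOWN(mmap_sem);   /* fires if the caller skipped down() */
    vma_list_len++;
}
```

The point of the macro is exactly Manfred's: make "always down() before do_mmap()" a rule that the code itself checks, instead of one enforced only by grep.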
* Re: locking question: do_mmap(), do_munmap()
  1999-10-09 13:17   ` Manfred Spraul
@ 1999-10-09 13:38     ` Alexander Viro
  0 siblings, 0 replies; 51+ messages in thread
From: Alexander Viro @ 1999-10-09 13:38 UTC (permalink / raw)
  To: Manfred Spraul
  Cc: linux-kernel, linux-mm

On Sat, 9 Oct 1999, Manfred Spraul wrote:

> Alexander Viro wrote:
> > Moreover, sys_uselib() may do
> > interesting things to cloned processes. IMO the right thing would be to
> > check for the number of mm users.
>
> I don't know the details of the mm implementation, but if there is only
> one user, then down(&mm->mmap_sem) will never sleep, and you loose
> nothing by getting the semaphore.

Yes (especially since sys_uselib() is not _too_ time-critical), but
consider what will happen if thread A does sys_uselib() while thread B
runs in the affected area. It's a different problem. As for the
details - check where the new users may come from.

I don't think that it calls for ASSERT-style macros, though. IMO it's
a matter of one big grep. As of 2.3.20-pre2 the affected places are
	arch/mips/kernel/sysirix.c
	arch/sparc64/kernel/binfmt_aout32.c
	fs/binfmt_aout.c
	fs/binfmt_elf.c
	drivers/char/drm/bufs.c (WTF is that?)
	drivers/sgi/char/{graphics.c,shmiq.c}
So there... I'm not sure that we need to protect calls in
->load_binary(), though - it's really unnecessary.
* Re: locking question: do_mmap(), do_munmap()
  1999-10-09 13:12 ` Alexander Viro
  1999-10-09 13:17   ` Manfred Spraul
@ 1999-10-09 16:01   ` Andrea Arcangeli
  1999-10-10 13:05     ` Manfred Spraul
  1999-10-11 15:05     ` Stephen C. Tweedie
  1 sibling, 2 replies; 51+ messages in thread
From: Andrea Arcangeli @ 1999-10-09 16:01 UTC (permalink / raw)
  To: Alexander Viro
  Cc: Manfred Spraul, linux-kernel, linux-mm

On Sat, 9 Oct 1999, Alexander Viro wrote:

> do_munmap() doesn't need the big lock. do_mmap() callers should grab

Look at the swapout path. Without the big kernel lock you'll free vmas
under swap_out().

ftp://ftp.suse.com/pub/people/andrea/kernel-patches/pending-2.3.x/munmap-lock-1

Andrea
* Re: locking question: do_mmap(), do_munmap()
  1999-10-09 16:01   ` Andrea Arcangeli
@ 1999-10-10 13:05     ` Manfred Spraul
  1999-10-11 15:09       ` Stephen C. Tweedie
  1 sibling, 1 reply; 51+ messages in thread
From: Manfred Spraul @ 1999-10-10 13:05 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm

Andrea Arcangeli wrote:
> Look the swapout path. Without the big kernel lock you'll free vmas under
> swap_out().

I checked the code in mm/*.c, and it seems that reading the vma list
is protected by either lock_kernel() [eg: swapper] or
down(&mm->mmap_sem) [eg: do_mlock]. But this means that both locks are
required if you modify the vma list. Single reader, multiple writer
synchronization. Unusual, but interesting :-)

Unfortunately, it seems that this is often ignored, eg.
	sys_mlock()->do_mlock()->merge_segments().
	sys_brk()
	sys_munmap()	<<<<<< fixed by your patch.

Is that correct? Should I write a patch, or is someone working on
these problems?

How should we fix it?
a) the swapper calls down(&mm->mmap_sem), but I guess that would
   lock up.
b) everyone who changes the vma list calls lock_kernel(). I think it
   would be a bad thing to call lock_kernel() immediately in the
   sys_??() function; I think we should hide the lock_kernel() call
   somewhere inside the vma-list code [add functions which modify the
   vma list, and they call lock_kernel()].

--
	Manfred
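The invariant behind this "either lock for read, both locks for write" scheme can be checked exhaustively: because a writer holds both locks, any reader holding at least one of them necessarily shares a lock with the writer, so they exclude each other. A minimal sketch of that argument in standalone C (a logical model, not kernel code):

```c
#include <assert.h>
#include <stdbool.h>

/* Which of the two locks a context holds: mmap_sem or the big
 * kernel lock (modeled as plain booleans for the argument). */
struct held { bool mmap_sem; bool big_lock; };

/* Reading the vma list requires at least one of the two locks. */
static bool may_read(struct held h)  { return h.mmap_sem || h.big_lock; }

/* Modifying the vma list requires both locks. */
static bool may_write(struct held h) { return h.mmap_sem && h.big_lock; }

/* Two contexts exclude each other iff they contend on some lock. */
static bool excludes(struct held a, struct held b)
{
    return (a.mmap_sem && b.mmap_sem) || (a.big_lock && b.big_lock);
}
```

Enumerating all four lock-holding states for a writer and a reader shows that every legal writer/reader pair contends on at least one lock, which is what makes the scheme safe despite being "single reader, multiple writer" in shape.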
* Re: locking question: do_mmap(), do_munmap()
  1999-10-10 13:05     ` Manfred Spraul
@ 1999-10-11 15:09       ` Stephen C. Tweedie
  0 siblings, 0 replies; 51+ messages in thread
From: Stephen C. Tweedie @ 1999-10-11 15:09 UTC (permalink / raw)
  To: Manfred Spraul
  Cc: Andrea Arcangeli, linux-kernel, linux-mm, Stephen Tweedie

Hi,

On Sun, 10 Oct 1999 15:05:44 +0200, Manfred Spraul
<manfreds@colorfullife.com> said:

> Andrea Arcangeli wrote:
>> Look the swapout path. Without the big kernel lock you'll free vmas under
>> swap_out().

> I checked to code in mm/*.c, and it seems that reading the vma-list is
> protected by either lock_kernel() [eg: swapper] or down(&mm->mmap_sem)
> [eg: do_mlock].

The swapper relies on it being protected by the big lock. The mm
semaphore is required when you need additional protection:
specifically, if you need to sleep while manipulating the vma lists
(eg. in page faults).

> But this means that both locks are required if you modify the vma list.
> Single reader, multiple writer synchronization. Unusual, but interesting
> :-)

Correct, but you only need the one lock --- the big lock --- to read
the vma list, which is what the swapper does. The swapper only needs
write access to the page tables, not to the vma list.

> How should we fix it?
> a) the swapper calls down(&mm->mmap_sem), but I guess that would
> lock-up.

Massive deadlock, indeed. We've looked at this but it is soooo painful.

> b) everyone who changes the vma list calls lock_kernel().

... or an equivalent lock. The big lock itself isn't needed if we have
a per-mm spinlock, but we do need something lighter weight than the
mmap semaphore to let the swapper read-protect the vma lists.

--Stephen
* Re: locking question: do_mmap(), do_munmap()
  1999-10-09 16:01   ` Andrea Arcangeli
  1999-10-10 13:05     ` Manfred Spraul
@ 1999-10-11 15:05     ` Stephen C. Tweedie
  1 sibling, 0 replies; 51+ messages in thread
From: Stephen C. Tweedie @ 1999-10-11 15:05 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Alexander Viro, Manfred Spraul, linux-kernel, linux-mm, Stephen Tweedie

Hi,

On Sat, 9 Oct 1999 18:01:27 +0200 (CEST), Andrea Arcangeli
<andrea@suse.de> said:

> On Sat, 9 Oct 1999, Alexander Viro wrote:
>> do_munmap() doesn't need the big lock. do_mmap() callers should grab

> Look the swapout path. Without the big kernel lock you'll free vmas under
> swap_out().

Yes. The swapout code relies on the big lock to freeze the vma, and on
the page_table_lock to protect the ptes, so that it can avoid worrying
about the mm_sem at all. If munmap ever drops vmas without the big
lock, the swapper _will_ break.

Making this into a per-mm lock would not be hard, btw.

--Stephen
end of thread (newest message: ~1999-10-15 17:50 UTC)
Thread overview: 51+ messages
[not found] <Pine.LNX.4.10.9910101713010.364-100000@alpha.random>
1999-10-10 15:52 ` locking question: do_mmap(), do_munmap() Manfred Spraul
1999-10-10 16:07 ` Alexander Viro
1999-10-10 16:25 ` Alexander Viro
1999-10-10 16:45 ` Manfred Spraul
1999-10-10 17:25 ` Alexander Viro
1999-10-10 17:12 ` Andrea Arcangeli
1999-10-10 17:48 ` Alexander Viro
1999-10-10 18:42 ` Manfred Spraul
1999-10-10 19:03 ` Alexander Viro
1999-10-10 21:31 ` Manfred Spraul
1999-10-10 21:53 ` Andrea Arcangeli
1999-10-10 22:34 ` Alexander Viro
1999-10-10 23:28 ` Andrea Arcangeli
1999-10-11 15:50 ` Stephen C. Tweedie
1999-10-11 16:05 ` Alexander Viro
1999-10-11 18:02 ` Manfred Spraul
1999-10-11 19:07 ` Kanoj Sarcar
1999-10-11 22:23 ` Stephen C. Tweedie
1999-10-13 1:25 ` Kanoj Sarcar
1999-10-13 7:32 ` Manfred Spraul
1999-10-15 9:58 ` Ralf Baechle
1999-10-15 17:50 ` Kanoj Sarcar
1999-10-13 10:45 ` Stephen C. Tweedie
1999-10-11 20:15 ` Stephen C. Tweedie
1999-10-11 21:14 ` Manfred Spraul
1999-10-11 21:37 ` Alexander Viro
1999-10-11 22:13 ` Manfred Spraul
1999-10-11 22:22 ` Stephen C. Tweedie
1999-10-11 23:01 ` Alexander Viro
1999-10-12 14:06 ` [more fun] " Alexander Viro
1999-10-13 7:35 ` Manfred Spraul
1999-10-13 18:34 ` Kanoj Sarcar
1999-10-13 10:16 ` Stephen C. Tweedie
1999-10-11 20:13 ` Stephen C. Tweedie
1999-10-11 21:40 ` Alexander Viro
1999-10-11 22:20 ` Stephen C. Tweedie
1999-10-11 22:31 ` Alexander Viro
1999-10-13 10:25 ` Stephen C. Tweedie
1999-10-11 15:47 ` Stephen C. Tweedie
1999-10-11 15:43 ` Stephen C. Tweedie
1999-10-10 16:56 ` Andrea Arcangeli
1999-10-11 15:41 ` Stephen C. Tweedie
1999-10-11 15:52 ` Alexander Viro
1999-10-09 12:48 Manfred Spraul
1999-10-09 13:12 ` Alexander Viro
1999-10-09 13:17 ` Manfred Spraul
1999-10-09 13:38 ` Alexander Viro
1999-10-09 16:01 ` Andrea Arcangeli
1999-10-10 13:05 ` Manfred Spraul
1999-10-11 15:09 ` Stephen C. Tweedie
1999-10-11 15:05 ` Stephen C. Tweedie